All posts by Hajer Bouafif

Hybrid Search with Amazon OpenSearch Service

Post Syndicated from Hajer Bouafif original https://aws.amazon.com/blogs/big-data/hybrid-search-with-amazon-opensearch-service/

Amazon OpenSearch Service has long supported both lexical and semantic search through the k-nearest neighbors (k-NN) plugin. By using OpenSearch Service as a vector database, you can seamlessly combine the advantages of both lexical and vector search. The introduction of the neural search feature in OpenSearch Service 2.9 further simplifies integration with artificial intelligence (AI) and machine learning (ML) models, facilitating the implementation of semantic search.

Lexical search using TF/IDF or BM25 has been the workhorse of search systems for decades. These traditional lexical search algorithms match user queries with exact words or phrases in your documents. Lexical search is well suited for exact matches, provides low latency, offers good interpretability of results, and generalizes well across domains. However, this approach does not consider the context or meaning of the words, which can lead to irrelevant results.

In the past few years, semantic search methods based on vector embeddings have become increasingly popular to enhance search. Semantic search enables a more context-aware search, understanding the natural language questions of user queries. However, semantic search powered by vector embeddings requires fine-tuning of the ML model for the associated domain (such as healthcare or retail) and more memory resources compared to basic lexical search.

Both lexical search and semantic search have their own strengths and weaknesses. Combining lexical and vector search improves the quality of search results by using their best features in a hybrid model. OpenSearch Service 2.11 now supports out-of-the-box hybrid query capabilities that make it straightforward for you to implement a hybrid search model combining lexical search and semantic search.

This post explains the internals of hybrid search and how to build a hybrid search solution using OpenSearch Service. We experiment with sample queries to explore and compare lexical, semantic, and hybrid search. All the code used in this post is publicly available in the GitHub repository.

Hybrid search with OpenSearch Service

In general, hybrid search to combine lexical and semantic search involves the following steps:

  1. Run a semantic and lexical search using a compound search query clause.
  2. Each query type provides scores on different scales. For example, a Lucene lexical search query will return a score between 1 and infinity. On the other hand, a semantic query using the Faiss engine returns scores between 0 and 1. Therefore, you need to normalize the scores coming from each type of query to put them on the same scale before combining the scores. In a distributed search engine, this normalization needs to happen at the global level rather than shard or node level.
  3. After the scores are all on the same scale, they’re combined for every document.
  4. Reorder the documents based on the new combined score and render the documents as a response to the query.
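The steps above can be sketched in a few lines of Python. This is a simplified illustration of min_max normalization and arithmetic_mean combination, not the actual processor implementation (which also handles shard-level merging and other edge cases):

```python
def min_max_normalize(scores):
    """Rescale one subquery's score map to the [0, 1] range (global min-max)."""
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {doc: 1.0 for doc in scores}
    return {doc: (s - lo) / (hi - lo) for doc, s in scores.items()}

def combine_arithmetic_mean(norm_maps, weights):
    """Weighted arithmetic mean over subquery scores; a document missing
    from a subquery contributes 0 for that subquery."""
    docs = set().union(*(m.keys() for m in norm_maps))
    return {
        doc: sum(w * m.get(doc, 0.0) for m, w in zip(norm_maps, weights))
        for doc in docs
    }

# Illustrative scores: BM25 is unbounded, k-NN similarity is already 0..1
bm25 = {"doc1": 12.4, "doc2": 7.1, "doc3": 2.3}
knn = {"doc2": 0.91, "doc3": 0.87, "doc4": 0.62}

combined = combine_arithmetic_mean(
    [min_max_normalize(bm25), min_max_normalize(knn)],
    weights=[0.5, 0.5],
)
ranked = sorted(combined, key=combined.get, reverse=True)
```

Here doc2 ranks first because it scores well on both subqueries, even though doc1 has the highest raw BM25 score; that is exactly the blending effect hybrid search is after.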

Prior to OpenSearch Service 2.11, search practitioners would need to use compound query types to combine lexical and semantic search queries. However, this approach does not address the challenge of global normalization of scores as mentioned in Step 2.

OpenSearch Service 2.11 added support for hybrid queries by introducing the score normalization processor in search pipelines. Search pipelines take away the heavy lifting of building score normalization and combination outside your OpenSearch Service domain. Search pipelines run inside the OpenSearch Service domain and support three types of processors: search request processor, search response processor, and search phase results processor.

In a hybrid search, the search phase results processor runs between the query phase and fetch phase at the coordinator node (global) level. The following diagram illustrates this workflow.

Search Pipeline

The hybrid search workflow in OpenSearch Service contains the following phases:

  • Query phase – The first phase of a search request is the query phase, where each shard in your index runs the search query locally and returns the document ID matching the search request with relevance scores for each document.
  • Score normalization and combination – The search phase results processor runs between the query phase and fetch phase. It uses the normalization processor to normalize scores from the BM25 and k-NN subqueries, supporting the min_max and L2-Euclidean normalization techniques. The processor then combines all scores using arithmetic_mean, geometric_mean, or harmonic_mean, compiles the final ranked list of document IDs, and passes it to the fetch phase.
  • Fetch phase – The final phase is the fetch phase, where the coordinator node retrieves the documents that match the final ranked list and returns the search query result.
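Concretely, the search phase results processor is configured through a search pipeline. A minimal request body might look like the following sketch; the pipeline name and the 0.3/0.7 weights are illustrative, not taken from this solution:

```python
# Body for: PUT /_search/pipeline/hybrid-search-pipeline
# (pipeline name and weights are illustrative placeholders)
hybrid_pipeline = {
    "description": "Post-processor for hybrid search",
    "phase_results_processors": [
        {
            "normalization-processor": {
                "normalization": {"technique": "min_max"},
                "combination": {
                    "technique": "arithmetic_mean",
                    # One weight per subquery, in query order; they should sum to 1.0
                    "parameters": {"weights": [0.3, 0.7]},
                },
            }
        }
    ],
}
```

Any hybrid query that references this pipeline then has its subquery scores normalized and combined with these settings at the coordinator node.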

Solution overview

In this post, you build a web application where you can search through a sample image dataset in the retail space, using a hybrid search system powered by OpenSearch Service. Let’s assume that the web application is a retail shop and you as a consumer need to run queries to search for women’s shoes.

For a hybrid search, you combine a lexical and semantic search query against the text captions of images in the dataset. The end-to-end search application high-level architecture is shown in the following figure.

Solution Architecture

The workflow contains the following steps:

  1. You use an Amazon SageMaker notebook to index image captions and image URLs from the Amazon Berkeley Objects Dataset stored in Amazon Simple Storage Service (Amazon S3) into OpenSearch Service using the OpenSearch ingest pipeline. This dataset is a collection of 147,702 product listings with multilingual metadata and 398,212 unique catalog images. You only use the item images and item names in US English. For demo purposes, you use approximately 1,600 products.
  2. OpenSearch Service calls the embedding model hosted in SageMaker to generate vector embeddings for the image caption. You use the GPT-J-6B variant embedding model, which generates 4,096-dimensional vectors.
  3. Now you can enter your search query in the web application hosted on an Amazon Elastic Compute Cloud (Amazon EC2) instance (c5.large). The application client triggers the hybrid query in OpenSearch Service.
  4. OpenSearch Service calls the SageMaker embedding model to generate vector embeddings for the search query.
  5. OpenSearch Service runs the hybrid query, combines the semantic search and lexical search scores for the documents, and sends back the search results to the EC2 application client.

Let’s look at Steps 1, 2, 4, and 5 in more detail.

Step 1: Ingest the data into OpenSearch

In Step 1, you create an ingest pipeline in OpenSearch Service using the text_embedding processor to generate vector embeddings for the image captions.

After you define a k-NN index with the ingest pipeline, you run a bulk index operation to store your data into the k-NN index. In this solution, you only index the image URLs, text captions, and caption embeddings where the field type for the caption embeddings is k-NN vector.
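As a sketch, the ingest pipeline and k-NN index definitions might look like the following. The pipeline name, model ID, and field names are illustrative placeholders; the 4,096 dimension matches the GPT-J-6B embedding model used in this solution:

```python
# Body for: PUT /_ingest/pipeline/caption-embedding-pipeline
# (pipeline name, model ID, and field names are placeholders)
ingest_pipeline = {
    "description": "Generate vector embeddings for image captions",
    "processors": [
        {
            "text_embedding": {
                "model_id": "<your-model-id>",  # the embedding model registered in OpenSearch
                "field_map": {"caption": "caption_embedding"},
            }
        }
    ],
}

# Body for: PUT /images-index (a k-NN index wired to the ingest pipeline)
knn_index = {
    "settings": {
        "index.knn": True,
        "default_pipeline": "caption-embedding-pipeline",
    },
    "mappings": {
        "properties": {
            "image_url": {"type": "text"},
            "caption": {"type": "text"},
            "caption_embedding": {"type": "knn_vector", "dimension": 4096},
        }
    },
}
```

With the default pipeline in place, a bulk index request only needs to carry the image URL and caption; the embedding field is populated automatically at ingest time.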

Step 2 and Step 4: OpenSearch Service calls the SageMaker embedding model

In these steps, OpenSearch Service uses the SageMaker ML connector to generate the embeddings for the image captions and query. The blue box in the preceding architecture diagram refers to the integration of OpenSearch Service with SageMaker using the ML connector feature of OpenSearch. This feature is available in OpenSearch Service starting from version 2.9. It enables you to create integrations with other ML services, such as SageMaker.

Step 5: OpenSearch Service runs the hybrid search query

OpenSearch Service uses the search phase results processor to perform a hybrid search. For hybrid scoring, OpenSearch Service uses the normalization, combination, and weights configuration settings that are set in the normalization processor of the search pipeline.
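A hybrid query sent through the search pipeline might look like the following sketch. The index name, pipeline name, and model ID are placeholders; the two subqueries run against the caption text and its embedding, respectively:

```python
# Body for: GET /images-index/_search?search_pipeline=<your-pipeline-name>
# (index name, pipeline name, and model ID are placeholders)
hybrid_query = {
    "_source": {"excludes": ["caption_embedding"]},  # skip bulky vectors in the response
    "query": {
        "hybrid": {
            "queries": [
                # Subquery 1: lexical (BM25) match on the caption text
                {"match": {"caption": {"query": "women shoes"}}},
                # Subquery 2: semantic k-NN search on the caption embedding
                {
                    "neural": {
                        "caption_embedding": {
                            "query_text": "women shoes",
                            "model_id": "<your-model-id>",
                            "k": 5,
                        }
                    }
                },
            ]
        }
    },
}
```

The subquery order matters: the weights configured in the normalization processor are applied to the subqueries in this same order.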

Prerequisites

Before you deploy the solution, make sure you have the following prerequisites:

  • An AWS account
  • Familiarity with the Python programming language
  • Familiarity with AWS Identity and Access Management (IAM), Amazon EC2, OpenSearch Service, and SageMaker

Deploy the hybrid search application to your AWS account

To deploy your resources, use the provided AWS CloudFormation template. Supported AWS Regions are us-east-1, us-west-2, and eu-west-1. Complete the following steps to launch the stack:

  1. On the AWS CloudFormation console, create a new stack.
  2. For Template source, select Amazon S3 URL.
  3. For Amazon S3 URL, enter the path for the template for deploying hybrid search.
  4. Choose Next.
  5. Name the stack hybridsearch.
  6. Keep the remaining settings as default and choose Submit.
  7. The template stack should take 15 minutes to deploy. When it’s done, the stack status will show as CREATE_COMPLETE.
  8. When the stack is complete, navigate to the stack Outputs tab.
  9. Choose the SagemakerNotebookURL link to open the SageMaker notebook in a separate tab.
  10. In the SageMaker notebook, navigate to the AI-search-with-amazon-opensearch-service/opensearch-hybridsearch directory and open HybridSearch.ipynb.
  11. If the notebook prompts you to set the kernel, choose the conda_pytorch_p310 kernel from the drop-down menu, then choose Set Kernel.
  12. The notebook should look like the following screenshot.

Now that the notebook is ready to use, follow the step-by-step instructions in the notebook. With these steps, you create an OpenSearch SageMaker ML connector and a k-NN index, ingest the dataset into an OpenSearch Service domain, and host the web search application on Amazon EC2.

Run a hybrid search using the web application

The web application is now deployed in your account and you can access the application using the URL generated at the end of the SageMaker notebook.

Open the Application using the URL

Copy the generated URL and enter it in your browser to launch the application.

Application UI components

Complete the following steps to run a hybrid search:

  1. Use the search bar to enter your search query.
  2. Use the drop-down menu to select the search type. The available options are Keyword Search, Vector Search, and Hybrid Search.
  3. Choose GO to render results for your query or regenerate results based on your new settings.
  4. Use the left pane to tune your hybrid search configuration:
    • Under Weight for Semantic Search, adjust the slider to choose the weight for the semantic subquery. The weights for the lexical and semantic subqueries must add up to 1.0: the closer this setting is to 1.0, the more weight is given to the semantic subquery, and the remainder (1.0 minus this setting) goes to the lexical subquery.
    • For Select the normalization type, choose the normalization technique (min_max or L2).
    • For Select the Score Combination type, choose the score combination techniques: arithmetic_mean, geometric_mean, or harmonic_mean.

Experiment with Hybrid Search

In this post, you run four experiments to understand the differences between the outputs of each search type.

As a customer of this retail shop, you are looking for women’s shoes, and you don’t know yet what style of shoes you would like to purchase. You expect that the retail shop should be able to help you decide according to the following parameters:

  • Not to deviate from the primary attributes of what you search for.
  • Provide versatile options and styles to help you understand your preference of style and then choose one.

As your first step, enter the search query “women shoes” and choose 5 as the number of documents to output.

Next, run the following experiments and review the observations for each search type.

Experiment 1: Lexical search

For a lexical search, choose Keyword Search as your search type, then choose GO.

The keyword search runs a lexical query, looking for same words between the query and image captions. In the first four results, two are women’s boat-style shoes identified by common words like “women” and “shoes.” The other two are men’s shoes, linked by the common term “shoes.” The last result is of style “sandals,” and it’s identified based on the common term “shoes.”

In this experiment, the keyword search provided three relevant results out of five—it doesn’t completely capture the user’s intention to have shoes only for women.

Experiment 2: Semantic search

For a semantic search, choose Vector Search as the search type, then choose GO.

The semantic search provided results that all belong to one particular style of shoes, “boots.” Even though the term “boots” was not part of the search query, the semantic search understands that terms “shoes” and “boots” are similar because they are found to be nearest neighbors in the vector space.

In this experiment, when the user didn’t mention any specific shoe styles like boots, the results limited the user’s choices to a single style. This hindered the user’s ability to explore a variety of styles and make a more informed decision on their preferred style of shoes to purchase.

Let’s see how hybrid search can help in this use case.

Experiment 3: Hybrid search

Choose Hybrid Search as the search type, then choose GO.

In this example, the hybrid search uses both lexical and semantic search queries. The results show two “boat shoes” and three “boots,” reflecting a blend of both lexical and semantic search outcomes.

In the top two results, “boat shoes” directly matched the user’s query and were obtained through lexical search. In the lower-ranked items, “boots” was identified through semantic search.

In this experiment, the hybrid search gave equal weights to both lexical and semantic search, which allowed users to quickly find what they were looking for (shoes) while also presenting additional styles (boots) for them to consider.

Experiment 4: Fine-tune the hybrid search configuration

In this experiment, set the weight of the vector subquery to 0.8, which means the keyword search subquery has a weight of 0.2. Keep the normalization and score combination settings set to default. Then choose GO to generate new results for the preceding query.

Giving more weight to the semantic subquery produces higher scores for the semantic search results. The outcome is similar to the semantic search results from the second experiment, with five images of boots for women.

You can further fine-tune the hybrid search results by adjusting the combination and normalization techniques.
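The effect of the weight setting can be seen in a toy calculation (illustrative scores only, assuming both subqueries have already been normalized to 0..1):

```python
def combine(bm25_n, knn_n, w_sem):
    """Weighted arithmetic mean of already-normalized subquery scores."""
    docs = set(bm25_n) | set(knn_n)
    return {d: (1 - w_sem) * bm25_n.get(d, 0.0) + w_sem * knn_n.get(d, 0.0)
            for d in docs}

# Hypothetical normalized scores for two products from the experiments
bm25_n = {"boat_shoe": 1.0, "boot": 0.2}   # lexical: exact word overlap wins
knn_n = {"boat_shoe": 0.3, "boot": 0.95}   # semantic: "boot" is a close neighbor

balanced = combine(bm25_n, knn_n, w_sem=0.5)        # boat shoe ranks first
semantic_heavy = combine(bm25_n, knn_n, w_sem=0.8)  # boot ranks first
```

With equal weights, the exact lexical match ranks first; shifting the semantic weight to 0.8 flips the ranking toward boots, mirroring the behavior observed in the fourth experiment.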

In a benchmark conducted by the OpenSearch team using publicly available datasets such as BEIR and Amazon ESCI, they concluded that the min_max normalization technique combined with the arithmetic_mean score combination technique provides the best results in a hybrid search.

Thoroughly test the different fine-tuning options to choose what is most relevant to your business requirements.

Overall observations

From all the previous experiments, we can conclude that the hybrid search in the third experiment delivered a combination of results that is relevant to the user, providing exact matches along with additional styles to choose from. Hybrid search matches the expectations of the retail shop customer.

Clean up

To avoid incurring continued AWS usage charges, make sure you delete all the resources you created as part of this post.

To clean up your resources, make sure you delete the S3 bucket you created within the application before you delete the CloudFormation stack.

OpenSearch Service integrations

In this post, you deployed a CloudFormation template to host the ML model in a SageMaker endpoint and spun up a new OpenSearch Service domain, then you used a SageMaker notebook to run steps to create the SageMaker-ML connector and deploy the ML model in OpenSearch Service.

You can achieve the same setup for an existing OpenSearch Service domain by using the ready-made CloudFormation templates from the OpenSearch Service console integrations. These templates automate the steps of SageMaker model deployment and SageMaker ML connector creation in OpenSearch Service.

Conclusion

In this post, we provided a complete solution to run a hybrid search with OpenSearch Service using a web application. The experiments in the post provided an example of how you can combine the power of lexical and semantic search in a hybrid search to improve the search experience for your end-users for a retail use case.

We also explained the new features available in version 2.9 and 2.11 in OpenSearch Service that make it effortless for you to build semantic search use cases such as remote ML connectors, ingest pipelines, and search pipelines. In addition, we showed you how the new score normalization processor in the search pipeline makes it straightforward to establish the global normalization of scores within your OpenSearch Service domain before combining multiple search scores.

Learn more about ML-powered search with OpenSearch and set up hybrid search in your own environment using the guidelines in this post. The solution code is also available on the GitHub repo.


About the Authors

Hajer Bouafif, Analytics Specialist Solutions ArchitectHajer Bouafif is an Analytics Specialist Solutions Architect at Amazon Web Services. She focuses on Amazon OpenSearch Service and helps customers design and build well-architected analytics workloads in diverse industries. Hajer enjoys spending time outdoors and discovering new cultures.

Praveen Mohan Prasad, Analytics Specialist Technical Account ManagerPraveen Mohan Prasad is an Analytics Specialist Technical Account Manager at Amazon Web Services and helps customers with pro-active operational reviews on analytics workloads. Praveen actively researches on applying machine learning to improve search relevance.

Amazon OpenSearch Service H1 2023 in review

Post Syndicated from Hajer Bouafif original https://aws.amazon.com/blogs/big-data/amazon-opensearch-service-h1-2023-in-review/

Since its release in January 2021, the OpenSearch project has released 14 versions through June 2023. Amazon OpenSearch Service supports the latest versions of OpenSearch up to version 2.7.

OpenSearch Service provides two configuration options to deploy and operate OpenSearch at scale in the cloud. With OpenSearch Service managed domains, you specify a hardware configuration and OpenSearch Service provisions the required hardware and takes care of software patching, failure recovery, backups, and monitoring. With managed domains, you can use advanced capabilities at no extra cost such as cross-cluster search, cross-cluster replication, anomaly detection, semantic search, security analytics, and more. You don’t need a large team to maintain and operate your OpenSearch Service domain at scale. Your team should be familiar with sharding concepts and OpenSearch best practices to use the OpenSearch managed offering.

Amazon OpenSearch Serverless provides a straightforward and fully auto scaled deployment option. When you use OpenSearch Serverless, you create a collection (a set of indexes that work together on one workload) and use OpenSearch’s APIs, and OpenSearch Serverless does the rest. You don’t need to worry about sizing, capacity planning, or tuning your OpenSearch cluster.

In this post, we provide a review of all the exciting features releases in OpenSearch Service in the first half of 2023.

Build powerful search solutions

In this section, we discuss some of the features in OpenSearch Service that enable you to build powerful search solutions.

OpenSearch Serverless and the serverless vector engine

Earlier this year, we announced the general availability of OpenSearch Serverless. OpenSearch Serverless separates storage and compute components, and indexing and query compute, so they can be managed and scaled independently. It uses Amazon Simple Storage Service (Amazon S3) as the primary data storage for indexes, adding durability for your data. Collections are able to take advantage of the S3 storage layer to reduce the need for hot storage, and reduce cost, by bringing data into local store when it’s accessed.

When you create a serverless collection, you set a collection type. OpenSearch Serverless optimizes resource use depending on the type you set. At release, you could create search and time series collections for full-text search and log analytics use cases, respectively. In July 2023, we previewed support for a third collection type: vector search. The vector engine for OpenSearch Serverless is a simple, scalable, and high-performing vector store and query engine that enables generative AI, semantic search, image search, and more. Built on OpenSearch Serverless, the vector engine inherits and benefits from its robust architecture. With the vector engine, you don’t have to worry about sizing, tuning, and scaling the backend infrastructure. The vector engine automatically adjusts resources by adapting to changing workload patterns and demand to provide consistently fast performance and scale. The vector engine uses approximate nearest neighbor (ANN) algorithms from the Non-Metric Space Library (NMSLIB) and FAISS libraries to power k-NN search.

You can start using the new vector engine capabilities by selecting Vector search when creating your collection on the OpenSearch Service console. Refer to Introducing the vector engine for Amazon OpenSearch Serverless, now in preview for more information about the new vector search option with OpenSearch Serverless.

Configure collection settings

Point in Time

Point in Time (PIT) search, released in version 2.4 of OpenSearch Project and supported in OpenSearch 2.5 in OpenSearch Service, provides consistency in search pagination even when new documents are ingested or deleted within a specific index. For example, let’s say your website user searched for “blue couch” and spent a few minutes looking at the results. During those few minutes, the application added some additional couches to the index, shifting the order of the first 20 documents. If the user then navigates from page 1 to page 2, they may see results that were already on page 1 but have shifted down in the result order. The pagination is not stable over the addition of new data to the index. If you use PIT search, the result order is guaranteed to remain the same across pages, regardless of changes to the index. To learn more about PIT capabilities, refer to Launch highlight: Paginate with Point in Time.
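As a sketch of how an application might use PIT for stable pagination (the index name and PIT ID below are illustrative placeholders):

```python
# Step 1: create the PIT for the index (placeholder index name):
#   POST /products/_search/point_in_time?keep_alive=10m
# The response contains a "pit_id" to reference in subsequent searches.
pit_id = "<pit-id-from-create-response>"  # placeholder, not a real ID

# Step 2: search against the frozen view of the index
page_request = {
    "size": 20,
    "query": {"match": {"title": "blue couch"}},
    "pit": {"id": pit_id, "keep_alive": "10m"},
    # A deterministic sort with a tiebreaker keeps pagination stable
    "sort": [{"_score": {"order": "desc"}}, {"_id": {"order": "asc"}}],
}
# For later pages, re-send the same body with "search_after" set to the
# sort values of the last hit on the previous page.
```

Because every page is served from the same point-in-time view, newly indexed couches cannot shift results between page 1 and page 2.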

Search relevance plugin

Ever wondered what would happen if you adjusted your relevance function—would the results be better, or worse? With the search relevance plugin, you can now view a side-by-side comparison of results in OpenSearch Dashboards. A UI view makes it simple to see how the results have changed and dial in your relevance to perfection.

Additional field types

OpenSearch 2.7 (available in OpenSearch Service) supports the following new object mapping types:

  • Cartesian field type – OpenSearch 2.7 in OpenSearch Service adds deeper support for geospatial data. If you are building a virtual reality application, computer-aided design (CAD) software, or sporting venue mapping, you can benefit from the xy point and xy shape Cartesian field types.
  • Flat object type – When you set your field’s mapping to flat_object, OpenSearch indexes any JSON objects in the field to let you search for leaf values, even if you don’t know the field name, and lets you search via dotted-path notation. Refer to Use flat object in OpenSearch to learn more about how the flat object mapping type simplifies index mappings and the search experience in OpenSearch.
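A minimal flat_object mapping and a dotted-path query might look like the following sketch (index and field names are illustrative):

```python
# Body for: PUT /issues-index (index and field names are illustrative)
flat_mapping = {
    "mappings": {
        "properties": {
            "issue": {"type": "flat_object"},  # holds arbitrary nested JSON
        }
    }
}

# Any leaf under "issue" is searchable without a predefined sub-mapping,
# either by dotted-path notation or by leaf value alone.
dotted_path_query = {"query": {"match": {"issue.labels.priority": "high"}}}
```

This keeps the index mapping small even when documents carry deeply nested, heterogeneous JSON payloads.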

Geographical analysis

Starting from OpenSearch 2.7 in OpenSearch Service, you can run GeoHex grid aggregation queries on datasets built with the Hexagonal Hierarchical Geospatial Indexing System (H3) open-source library. H3 provides precision down to the square meter or less, making it useful for cases that require a high degree of precision. Because high-precision requests are compute heavy, you should be sure to limit the geographic area using filters.
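A GeoHex grid aggregation request might look like the following sketch (the field name and bounding box are illustrative). Note the bounding-box filter that limits the area before aggregating at high precision:

```python
# Body for: GET /venues-index/_search (field name and area are illustrative)
geohex_request = {
    "size": 0,  # only the aggregation buckets are needed, not the hits
    "query": {
        # Constrain the geographic area first: high-precision requests
        # are compute heavy, so filter before aggregating
        "geo_bounding_box": {
            "location": {
                "top_left": {"lat": 48.92, "lon": 2.25},
                "bottom_right": {"lat": 48.80, "lon": 2.45},
            }
        }
    },
    "aggs": {
        "cells": {
            # H3 precision ranges from 0 (coarsest) to 15 (finest)
            "geohex_grid": {"field": "location", "precision": 9}
        }
    },
}
```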

Take Observability to the next level

Observability in OpenSearch is a collection of plugins and features that let you explore, query and visualize telemetry data stored in OpenSearch. In this section, we discuss how OpenSearch Service enables you to take Observability to the next level.

Simple schema for observability

With version 2.6, the OpenSearch Project released a new unified schema for observability named Simple Schema for Observability (SS4O), supported in OpenSearch 2.7 in OpenSearch Service. SS4O is inspired by both OpenTelemetry and the Elastic Common Schema (ECS) and uses ECS event logs and OpenTelemetry (OTel) metadata. SS4O specifies the index structure (mapping), index naming conventions, an integration feature for adding preconfigured dashboards and visualizations, and a JSON schema for enforcing and validating the structure. SS4O complies with the OTel schema for logs, traces, and metrics.

Jaeger traces support

With the release of OpenSearch 2.5, you can now integrate Jaeger trace data in OpenSearch and use the Observability plugin to analyze your trace data in Jaeger format.

Observability provides you with visibility on the health of your system and microservice applications. OpenSearch Dashboards comes with an Observability plugin, which provides a unified experience for collecting and monitoring metrics, logs, and traces from common data sources. With the Observability plugin, you can monitor and alert on your logs, metrics, and traces to ensure that your application is available, performant, and error-free.

In the first half of 2023, we added the capability to create Observability dashboards and standard dashboards from the OpenSearch Dashboards main menu. Before that, you needed to navigate to the Observability plugin to create event analytics visualizations using Piped Processing Language (PPL). With this release, we made this feature more accessible by integrating a new type of visualization named “PPL” within the list of visualization types on the Dashboards main menu. This helps you correlate both business insights and observability analytics in a single place.

“PPL” visualization type

Build serverless ingestion pipelines

In April 2023, OpenSearch Service released Amazon OpenSearch Ingestion, a fully managed and auto scaled ingestion pipeline for OpenSearch Service domains and OpenSearch Serverless collections. OpenSearch Ingestion is powered by Data Prepper, with source and sink plugins to process, sample, filter, enrich, and deliver data for downstream analysis. Refer to Supported plugins and options for Amazon OpenSearch Ingestion pipelines to learn more.

The service automatically accommodates your workload demands by scaling up and down the OpenSearch Compute units (OCUs). Each OCU provides an estimated 8 GB per hour of throughput (your workload will determine the actual throughput) and is a combination of 8 GiB of memory and 2 vCPUs. You can scale up to 96 OCUs.

OpenSearch Ingestion provides out-of-the-box pipeline blueprints that provide configuration templates for the most common ingestion pipelines. For more information, refer to Build a serverless log analytics pipeline using Amazon OpenSearch Ingestion with managed Amazon OpenSearch Service.

Log Aggregation with conditional routing blueprint in OpenSearch Ingestion

Enable your business with security features

In this section, we discuss how you can use OpenSearch Service to enable your business with security features.

Enable SAML during domain creation

SAML authentication for OpenSearch Dashboards was introduced in OpenSearch Service domains with Elasticsearch version 6.7 or higher and OpenSearch version 1.0 or higher, but you had to wait for the domain to be created to enable SAML. In February 2023, we enabled you to specify SAML support during domain creation. Support is available when you create domains on the AWS Management Console, AWS SDK, or AWS CloudFormation templates. SAML authentication for OpenSearch Dashboards enables you to integrate directly with identity providers (IdPs) such as Okta, Ping Identity, OneLogin, Auth0, Active Directory Federation Services (ADFS), and Azure Active Directory.

Security analytics with OpenSearch

OpenSearch 2.5 in OpenSearch Service launched support for OpenSearch’s security analytics plugin. In the past, identifying actionable security alerts and gaining valuable insights required significant expertise and familiarity with various security products. However, with security analytics, you can now benefit from simplified workflows that facilitate correlating multiple security logs and investigating security incidents, all within the OpenSearch environment, even without prior security experience. The security analytics plugin is bundled with an extensive collection of over 2,200 open-source Sigma security rules. These rules play a crucial role in detecting potential security threats in real time from your event logs. With the security analytics plugin, you can also design custom rules, tailor security alerts based on threat severity, and receive automated notifications at your preferred destination, such as email or a Slack channel. For more information about creating detectors and configuring rules, refer to Identify and remediate security threats to your business using security analytics with Amazon OpenSearch Service.

Security Analytics plugin - Alerts and findings

Ingest events from Amazon Security Lake

In June 2023, OpenSearch Ingestion added support for real-time ingestion of events from Amazon Security Lake, reducing indexing time for security data in OpenSearch Service. With Amazon Security Lake centralizing security data from various sources, you can take advantage of the extensive security analytics capabilities and rich dashboard visualizations of OpenSearch Service to gain valuable insights quickly. Using the Open Cybersecurity Schema Framework (OCSF), Amazon Security Lake normalizes and combines data from diverse enterprise security sources in Apache Parquet format. OpenSearch Ingestion now enables ingestion in Parquet format, with built-in processors to convert data into JSON documents before indexing. Additionally, there’s a specialized blueprint for ingesting data from Amazon Security Lake and support for Data Prepper 2.3.0, offering new features like S3 sink, Avro codec, obfuscation processor, event tagging, advanced expressions, and tail sampling.

Amazon Security Lake blueprint in OpenSearch Ingestion

Simplify cluster operations

In this section, we discuss how you can use OpenSearch Service to simplify cluster operations.

Enhanced dry run for configuration changes

OpenSearch Service has introduced an enhanced dry run option that allows you to validate configuration changes before applying them to your clusters. This feature ensures that any potential validation errors that might occur during the deployment of configuration changes are checked and summarized for your review. Additionally, the dry run will indicate whether a blue/green deployment is necessary to apply a change, enabling you to plan accordingly.

Ensure high availability and consistent performance

OpenSearch Service now offers 99.99% availability with Multi-AZ with Standby deployment. This new capability makes your business-critical workloads more resilient to potential infrastructure failures such as an Availability Zone failure. Prior to this launch, OpenSearch Service automatically recovered from Availability Zone outages by allocating more capacity in the impacted Availability Zone and automatically redistributing shards. However, this reactive approach to infrastructure and network failures usually led to high latency and increased resource utilization across the nodes. The Multi-AZ with Standby feature deploys infrastructure in three Availability Zones, keeping two zones active and one zone on standby. It requires a minimum of two replicas to maintain data redundancy across Availability Zones, enabling recovery in less than a minute.

Multi AZ with stand-by feature
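As a rough sketch of opting a domain in programmatically, the boto3 UpdateDomainConfig call exposes a MultiAZWithStandbyEnabled flag; the domain name and instance count below are illustrative, and your replica and node settings will differ:

```python
def standby_cluster_config(domain_name, instance_count):
    # Multi-AZ with Standby spreads the domain across three Availability
    # Zones (two active, one standby); data node count should be a multiple
    # of 3, and indexes need at least two replicas for cross-AZ redundancy.
    return {
        "DomainName": domain_name,
        "ClusterConfig": {
            "InstanceCount": instance_count,
            "ZoneAwarenessEnabled": True,
            "ZoneAwarenessConfig": {"AvailabilityZoneCount": 3},
            "MultiAZWithStandbyEnabled": True,
        },
    }


def enable_standby(domain_name):
    # Requires AWS credentials; not invoked here.
    import boto3

    return boto3.client("opensearch").update_domain_config(
        **standby_cluster_config(domain_name, instance_count=6)
    )
```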

Skip unavailable clusters in cross-cluster search

With the release of the Skip unavailable clusters option for cross-cluster search in June 2023, your cross-cluster search queries will return results even if you have unavailable shards or indexes on one of the remote clusters. The feature is enabled by default when you request connection to a remote cluster on the OpenSearch Service console.

Cross-cluster search feature

Enhance your experience with OpenSearch Dashboards

The release of OpenSearch 2.5 and OpenSearch 2.7 in OpenSearch Service has brought new features to manage data streams and indexes on the OpenSearch Dashboards UI.

Snapshot management

By default, OpenSearch Service takes hourly snapshots of your data with a retention time of 14 days. The automatic snapshots are incremental in nature and help you recover from data loss or cluster failure. In addition to the default hourly snapshots, OpenSearch Service provides the capability to run manual snapshots and store them in an S3 bucket. You can use snapshot management to create manual snapshots, define a snapshot retention policy, and set up the frequency and timing of snapshot creation. Snapshot management is available under the index management plugin in OpenSearch Dashboards.

Snapshot management plugin
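Manual snapshots require an S3 repository registered on the domain first. The helpers below sketch the request body and path for the snapshot API; the bucket and role names are hypothetical, and the requests themselves must be SigV4-signed when sent to the domain endpoint:

```python
def s3_repository_body(bucket, region, role_arn):
    # Body for PUT _snapshot/<repository-name> on the domain endpoint.
    # The role must grant OpenSearch Service read/write access to the bucket.
    return {
        "type": "s3",
        "settings": {"bucket": bucket, "region": region, "role_arn": role_arn},
    }


def manual_snapshot_path(repository, snapshot_name):
    # An empty-body PUT to this path starts a manual snapshot of all indexes.
    return f"/_snapshot/{repository}/{snapshot_name}"
```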

Index and data streams management

With the support of OpenSearch 2.5 and OpenSearch 2.7 in OpenSearch Service, you can now use the index management plugin in OpenSearch Dashboards to manage data streams, index templates, and index aliases.

The index management UI provides expanded capabilities, including running manual rollover and force merge actions for data streams. You can also visually manage multiple index templates and define index mappings, the number of primary shards, the number of replicas, and the refresh interval for your indexes.

index management UI

Conclusion

It’s been a busy first half of the year! OpenSearch Project and OpenSearch Service have launched OpenSearch Serverless, so you can use OpenSearch without worrying about infrastructure, indexes, or shards; OpenSearch Ingestion to ingest your data; the vector engine for OpenSearch Serverless; security analytics to analyze data from Amazon Security Lake; operational improvements to bring 99.99% availability; and improvements to the Observability plugin. OpenSearch Service provides a full suite of capabilities, including a vector database, semantic search, and a log analytics engine. We invite you to check out the features described in this post, and we appreciate your valuable feedback.

You can get started by getting hands-on experience with the publicly available workshops for semantic search, microservice observability, and OpenSearch Serverless. You can also learn more about the service features and use cases by checking out more OpenSearch Service blog posts.


About the Authors

Hajer Bouafif is an Analytics Specialist Solutions Architect at Amazon Web Services. She focuses on Amazon OpenSearch Service and helps customers design and build well-architected analytics workloads in diverse industries. Hajer enjoys spending time outdoors and discovering new cultures.


Aish Gunasekar is a Specialist Solutions Architect with a focus on Amazon OpenSearch Service. Her passion at AWS is to help customers design highly scalable architectures and help them in their cloud adoption journey. Outside of work, she enjoys hiking and baking.

Jon Handler is a Senior Principal Solutions Architect at Amazon Web Services based in Palo Alto, CA. Jon works closely with OpenSearch and Amazon OpenSearch Service, providing help and guidance to a broad range of customers who have search and log analytics workloads that they want to move to the AWS Cloud. Prior to joining AWS, Jon’s career as a software developer included 4 years of coding a large-scale, ecommerce search engine. Jon holds a Bachelor of the Arts from the University of Pennsylvania, and a Master of Science and a PhD in Computer Science and Artificial Intelligence from Northwestern University.

Build a serverless log analytics pipeline using Amazon OpenSearch Ingestion with managed Amazon OpenSearch Service

Post Syndicated from Hajer Bouafif original https://aws.amazon.com/blogs/big-data/build-a-serverless-log-analytics-pipeline-using-amazon-opensearch-ingestion-with-managed-amazon-opensearch-service/

In this post, we show how to build a log ingestion pipeline using the new Amazon OpenSearch Ingestion, a fully managed data collector that delivers real-time log and trace data to Amazon OpenSearch Service domains. OpenSearch Ingestion is powered by the open-source data collector Data Prepper. Data Prepper is part of the open-source OpenSearch project. With OpenSearch Ingestion, you can filter, enrich, transform, and deliver your data for downstream analysis and visualization. OpenSearch Ingestion is serverless, so you don’t need to worry about scaling your infrastructure, operating your ingestion fleet, and patching or updating the software.

For a comprehensive overview of OpenSearch Ingestion, visit Amazon OpenSearch Ingestion, and for more information about the Data Prepper open-source project, visit Data Prepper.

In this post, we explore the logging infrastructure for a fictitious company, AnyCompany. We explore the components of the end-to-end solution and then show how to configure OpenSearch Ingestion’s main parameters and how the logs come in and out of OpenSearch Ingestion.

Solution overview

Consider a scenario in which AnyCompany collects Apache web logs. They use OpenSearch Service to monitor web access and identify possible root causes to error logs of type 4xx and 5xx. The following architecture diagram outlines the use of every component used in the log analytics pipeline: Fluent Bit collects and forwards logs; OpenSearch Ingestion processes, routes, and ingests logs; and OpenSearch Service analyzes the logs.

The workflow contains the following stages:

  1. Generate and collect – Fluent Bit collects the generated logs and forwards them to OpenSearch Ingestion. In this post, you create fake logs that Fluent Bit forwards to OpenSearch Ingestion. Check the list of supported clients to review the required configuration for each client supported by OpenSearch Ingestion.
  2. Process and ingest – OpenSearch Ingestion filters the logs based on response value, processes the logs using a grok processor, and applies conditional routing to ingest the error logs to an OpenSearch Service index.
  3. Store and analyze – We can analyze the Apache httpd error logs using OpenSearch Dashboards.

Prerequisites

To implement this solution, make sure you have the following prerequisites:

Configure OpenSearch Ingestion

First, you define the appropriate AWS Identity and Access Management (IAM) permissions to write to and from OpenSearch Ingestion. Then you set up the pipeline configuration in the OpenSearch Ingestion. Let’s explore each step in more detail.

Configure IAM permissions

OpenSearch Ingestion works with IAM to secure communications into and out of OpenSearch Ingestion. You need two roles, and requests are authenticated using AWS Signature Version 4 (SigV4) signing. The originating entity requires permissions to write to OpenSearch Ingestion, and OpenSearch Ingestion requires permissions to write to your OpenSearch Service domain. Finally, you must create an access policy using OpenSearch Service’s fine-grained access control, which allows OpenSearch Ingestion to create indexes and write to them in your domain.

The following diagram illustrates the IAM permissions to allow OpenSearch Ingestion to write to an OpenSearch Service domain. Refer to Setting up roles and users in Amazon OpenSearch Ingestion to get more details on roles and permissions required to use OpenSearch Ingestion.

In the demo, you use the AWS Cloud9 EC2 instance profile’s credentials to sign requests sent to OpenSearch Ingestion. Fluent Bit fetches these credentials and assumes the role that you pass in the aws_role_arn parameter, which you configure later.

  1. Create an ingestion role (called IngestionRole) to allow Fluent Bit to ingest the logs into your pipeline.

Create a trust relationship to allow Fluent Bit to assume the ingestion role, as shown in the following code. In the access policy for this role, you grant permission for the osis:Ingest action.

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Principal": {
                "AWS": "{your-account-id}"
            },
            "Action": "sts:AssumeRole"
        }
    ]
}
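For the role's access policy, a minimal sketch built as a Python helper might look like the following; the helper itself is hypothetical, but it assumes the standard pipeline ARN format (arn:aws:osis:region:account:pipeline/name) and the osis:Ingest action described above:

```python
def osis_ingest_policy(account_id, region, pipeline_name):
    # Inline access policy for IngestionRole: allows the role to sign
    # ingest requests to one specific OpenSearch Ingestion pipeline.
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": "osis:Ingest",
                "Resource": f"arn:aws:osis:{region}:{account_id}:pipeline/{pipeline_name}",
            }
        ],
    }
```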
  2. Create a pipeline role (called PipelineRole) with a trust relationship for OpenSearch Ingestion to assume that role. The domain-level access policy of the OpenSearch domain grants the pipeline role access to the domain.
  3. Finally, configure your domain’s security plugin to enable OpenSearch Ingestion’s assumed role to create indexes and write data to the domain.

In this demo, the OpenSearch Service domain uses fine-grained access control for authentication, so you need to map the OpenSearch Ingestion pipeline role to the OpenSearch backend role all_access. For instructions, refer to Step 2: Include the pipeline role in the domain access policy.

Create the pipeline in OpenSearch Ingestion

To create an OpenSearch Ingestion pipeline, complete the following steps:

  1. On the OpenSearch Service console, choose Pipelines in the navigation pane.
  2. Choose Create pipeline.
  3. For Pipeline name, enter a name.

  4. Enter the minimum and maximum Ingestion OpenSearch Compute Units (Ingestion OCUs). In this example, we use the default pipeline capacity settings of a minimum of 1 Ingestion OCU and a maximum of 4 Ingestion OCUs.

Each OCU is a combination of approximately 8 GB of memory and 2 vCPUs that can handle an estimated 8 GiB per hour. OpenSearch Ingestion supports up to 96 OCUs, and it automatically scales up and down based on your ingest workload demand.
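As a back-of-the-envelope sizing sketch based on those figures (the helper and its headroom factor are illustrative, not an official sizing formula):

```python
import math


def suggested_max_ocus(peak_gib_per_hour, headroom=1.2):
    # One Ingestion OCU handles an estimated 8 GiB per hour, and a
    # pipeline can scale to at most 96 OCUs; add headroom for spikes.
    needed = math.ceil(peak_gib_per_hour * headroom / 8)
    return max(1, min(needed, 96))
```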

  5. In the Pipeline configuration section, configure Data Prepper to process your data by choosing the appropriate blueprint configuration template on the Configuration blueprints menu. For this post, we choose AWS-LogAggregationWithConditionalRouting.

The OpenSearch Ingestion pipeline configuration consists of four sections:

  • Source – This is the input component of a pipeline. It defines the mechanism through which a pipeline consumes records. In this post, you use the http_source plugin and provide the Fluent Bit output URI value within the path attribute.
  • Processors – This represents intermediate processing to filter, transform, and enrich your input data. Refer to Supported plugins for more details on the list of operations supported in OpenSearch Ingestion. In this post, we use the grok processor COMMONAPACHELOG, which matches input logs against the common Apache log pattern and makes them easy to query in OpenSearch Service.
  • Sink – This is the output component of a pipeline. It defines one or more destinations to which a pipeline publishes records. In this post, you define an OpenSearch Service domain and index as sink.
  • Route – This is the part of a processor that allows the pipeline to route the data into different sinks based on specific conditions. In this example, you create four routes based on the response field value of the log. If the response field value of the log line matches 2xx or 3xx, the log is sent to the OpenSearch Service index aggregated_2xx_3xx. If the response field value matches 4xx, the log is sent to the index aggregated_4xx. If the response field value matches 5xx, the log is sent to the index aggregated_5xx.
  6. Update the blueprint based on your use case. The following code shows an example of the pipeline configuration YAML file:
version: "2"
log-aggregate-pipeline:
  source:
    http:
      # Provide the FluentBit output URI value.
      path: "/log/ingest"
  processor:
    - date:
        from_time_received: true
        destination: "@timestamp"
    - grok:
        match:
          log: [ "%{COMMONAPACHELOG_DATATYPED}" ]
  route:
    - 2xx_status: "/response >= 200 and /response < 300"
    - 3xx_status: "/response >= 300 and /response < 400"
    - 4xx_status: "/response >= 400 and /response < 500"
    - 5xx_status: "/response >= 500 and /response < 600"
  sink:
    - opensearch:
        # Provide an AWS OpenSearch Service domain endpoint
        hosts: [ "{your-domain-endpoint}" ]
        aws:
          # Provide a Role ARN with access to the domain. This role should have a trust relationship with osis-pipelines.amazonaws.com
          sts_role_arn: "arn:aws:iam::{your-account-id}:role/PipelineRole"
          # Provide the region of the domain.
          region: "{AWS_Region}"
        index: "aggregated_2xx_3xx"
        routes:
          - 2xx_status
          - 3xx_status
    - opensearch:
        # Provide an AWS OpenSearch Service domain endpoint
        hosts: [ "{your-domain-endpoint}"  ]
        aws:
          # Provide a Role ARN with access to the domain. This role should have a trust relationship with osis-pipelines.amazonaws.com
          sts_role_arn: "arn:aws:iam::{your-account-id}:role/PipelineRole"
          # Provide the region of the domain.
          region: "{AWS_Region}"
        index: "aggregated_4xx"
        routes:
          - 4xx_status
    - opensearch:
        # Provide an AWS OpenSearch Service domain endpoint
        hosts: [ "{your-domain-endpoint}"  ]
        aws:
          # Provide a Role ARN with access to the domain. This role should have a trust relationship with osis-pipelines.amazonaws.com
          sts_role_arn: "arn:aws:iam::{your-account-id}:role/PipelineRole"
          # Provide the region of the domain.
          region: "{AWS_Region}"
        index: "aggregated_5xx"
        routes:
          - 5xx_status

Provide the relevant values for your domain endpoint, account ID, and Region related to your configuration.
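If you prefer automation over the console, the same pipeline can be created through the CreatePipeline API. The following boto3 sketch assumes the configuration YAML above is available as a string; names and unit counts are illustrative:

```python
def create_pipeline_request(name, config_yaml, min_units=1, max_units=4):
    # Parameters for the OpenSearch Ingestion CreatePipeline API;
    # config_yaml is the pipeline configuration body as a string.
    return {
        "PipelineName": name,
        "MinUnits": min_units,
        "MaxUnits": max_units,
        "PipelineConfigurationBody": config_yaml,
    }


def create_pipeline(name, config_yaml):
    # Requires AWS credentials; not invoked here.
    import boto3

    return boto3.client("osis").create_pipeline(
        **create_pipeline_request(name, config_yaml)
    )
```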

  7. Check the health of your configuration setup by choosing Validate pipeline when you finish the update.

When designing a production workload, deploy your pipeline within a VPC. For instructions, refer to Securing Amazon OpenSearch Ingestion pipelines within a VPC.

  8. For this post, select Public access under Network.

  9. In the Log publishing options section, select Publish to CloudWatch logs and Create new group.

OpenSearch Ingestion uses the log levels of INFO, WARN, ERROR, and FATAL. Enabling log publishing helps you monitor your pipelines in production.

  10. Choose Next, then choose Create pipeline.
  11. Select the pipeline and choose View details to see the progress of the pipeline creation.

Wait until the status changes to Active to start using the pipeline.
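If you are scripting the deployment, you can poll the GetPipeline API until that happens. A sketch (the timeout and poll interval are arbitrary choices):

```python
import time


def wait_until_active(client, pipeline_name, poll_seconds=30, timeout=1800):
    # Poll GetPipeline until the status reaches ACTIVE, or fail fast if
    # creation fails. client is a boto3 "osis" client (or any object with
    # a compatible get_pipeline method).
    deadline = time.time() + timeout
    while time.time() <= deadline:
        status = client.get_pipeline(PipelineName=pipeline_name)["Pipeline"]["Status"]
        if status == "ACTIVE":
            return status
        if status.endswith("FAILED"):
            raise RuntimeError(f"pipeline {pipeline_name} entered status {status}")
        time.sleep(poll_seconds)
    raise TimeoutError(f"pipeline {pipeline_name} not ACTIVE after {timeout}s")
```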

Send logs to the OpenSearch Ingestion pipeline

To start sending logs to the OpenSearch Ingestion pipeline, complete the following steps:

  1. In the AWS Cloud9 environment, create a Fluent Bit configuration file and update the following attributes:
    • Host – Enter the ingestion URL of your OpenSearch Ingestion pipeline.
    • aws_service – Enter osis.
    • aws_role_arn – Enter the ARN of the IAM role IngestionRole.

The following code shows an example of the fluent-bit.conf file:

[SERVICE]
    parsers_file          ./parsers.conf
    
[INPUT]
    name                  tail
    refresh_interval      5
    path                  /var/log/*.log
    read_from_head        true
[FILTER]
    Name parser
    Key_Name log
    Parser apache
[OUTPUT]
    Name http
    Match *
    Host {Ingestion URL}
    Port 443
    URI /log/ingest
    format json
    aws_auth true
    aws_region {AWS_region}
    aws_role_arn arn:aws:iam::{your-account-id}:role/IngestionRole
    aws_service osis
    Log_Level trace
    tls On
  2. In the AWS Cloud9 environment, create a docker-compose YAML file to deploy Fluent Bit and Flog containers:
version: '3'
services:
  fluent-bit:
    container_name: fluent-bit
    image: docker.io/amazon/aws-for-fluent-bit
    volumes:
      - ./fluent-bit.conf:/fluent-bit/etc/fluent-bit.conf
      - ./apache-logs:/var/log
  flog:
    container_name: flog
    image: mingrammer/flog
    command: flog -t log -f apache_common -o web/log/test.log -w -n 100000 -d 1ms -p 1000
    volumes:
      - ./apache-logs:/web/log

Before you start the Docker containers, you need to update the IAM EC2 instance role in AWS Cloud9 so it can sign the requests sent to OpenSearch Ingestion.

  3. For demo purposes, create an IAM role and choose EC2 under Use case to allow the AWS Cloud9 EC2 instance to call OpenSearch Ingestion on your behalf.
  4. Add the OpenSearch Ingestion policy, which is the same policy you used with IngestionRole.
  5. Add the AdministratorAccess permission policy to the role as well.

Your role definition should look like the following screenshot.

  6. After you create the role, go back to AWS Cloud9, select your demo environment, and choose View details.
  7. On the EC2 instance tab, choose Manage EC2 instance to view the details of the EC2 instance attached to your AWS Cloud9 environment.

  8. On the Amazon EC2 console, replace the IAM role of your AWS Cloud9 EC2 instance with the new role.
  9. Open a terminal in AWS Cloud9 and run the command docker-compose up.

Check the output in the terminal—if everything is working correctly, you get status 200.

Fluent Bit collects logs from the /var/log directory in the container and pushes the data to the OpenSearch Ingestion pipeline.
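To sanity-check the endpoint without Fluent Bit, you can also post a single batch directly with a SigV4-signed request. The following sketch assumes the default credential chain, the requests package, and your pipeline's ingestion URL and Region in place of the placeholders:

```python
import json


def sample_batch():
    # A fake Apache common-log line, similar to what flog generates;
    # the pipeline's grok processor parses the "log" field.
    return [
        {"log": '127.0.0.1 - - [10/Oct/2023:13:55:36 +0000] '
                '"GET /index.html HTTP/1.1" 200 2326'}
    ]


def post_batch(ingestion_url, region):
    # Requires AWS credentials and the requests package; not invoked here.
    import boto3
    import requests
    from botocore.auth import SigV4Auth
    from botocore.awsrequest import AWSRequest

    url = f"https://{ingestion_url}/log/ingest"
    body = json.dumps(sample_batch())
    request = AWSRequest(method="POST", url=url, data=body,
                         headers={"Content-Type": "application/json"})
    # Sign the request for the "osis" service, then send it.
    SigV4Auth(boto3.Session().get_credentials(), "osis", region).add_auth(request)
    response = requests.post(url, data=body, headers=dict(request.headers))
    return response.status_code  # 200 means the pipeline accepted the batch
```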

  10. Open OpenSearch Dashboards, navigate to Dev Tools, and run the command GET _cat/indices to validate that the data has been delivered by OpenSearch Ingestion to your OpenSearch Service domain.

You should see the three indexes created: aggregated_2xx_3xx, aggregated_4xx, and aggregated_5xx.

Now you can focus on analyzing your log data and reinventing your business without worrying about the operational overhead of your ingestion pipeline.

Best practices for monitoring

You can monitor the Amazon CloudWatch metrics available for your pipeline to maintain its performance and availability. Check the list of available pipeline metrics related to the source, buffer, processor, and sink plugins.

Navigate to the Metrics tab for your specific OpenSearch Ingestion pipeline to explore the graphs available to each metric, as shown in the following screenshot.

In your production workloads, configure CloudWatch alarms to notify you when a pipeline metric breaches a specific threshold so you can promptly remediate the issue.
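The alarm itself can be created with PutMetricAlarm. In the sketch below, the metric name, dimension, and SNS topic are assumptions for illustration; copy the exact metric and namespace names from the Metrics tab of your pipeline rather than trusting the ones shown here:

```python
def ocu_alarm_request(pipeline_name, threshold_ocus, sns_topic_arn):
    # Sketch of a PutMetricAlarm request that fires when the pipeline's
    # OCU usage stays at or above a threshold for 15 minutes. The metric
    # and dimension names here are assumptions; verify them in CloudWatch.
    return {
        "AlarmName": f"{pipeline_name}-ocu-near-max",
        "Namespace": "AWS/OSIS",
        "MetricName": "computeUnits",
        "Dimensions": [{"Name": "PipelineName", "Value": pipeline_name}],
        "Statistic": "Average",
        "Period": 300,
        "EvaluationPeriods": 3,
        "Threshold": threshold_ocus,
        "ComparisonOperator": "GreaterThanOrEqualToThreshold",
        "TreatMissingData": "notBreaching",
        "AlarmActions": [sns_topic_arn],
    }


def create_alarm(pipeline_name, threshold_ocus, sns_topic_arn):
    # Requires AWS credentials; not invoked here.
    import boto3

    boto3.client("cloudwatch").put_metric_alarm(
        **ocu_alarm_request(pipeline_name, threshold_ocus, sns_topic_arn)
    )
```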

Managing cost

OpenSearch Ingestion automatically provisions and scales OCUs for spiky workloads, and you pay only for the compute resources your pipeline actively uses to ingest, process, and route data. Setting a maximum capacity of Ingestion OCUs therefore allows you to handle peak demand while controlling cost.

For production workloads, make sure to configure a minimum of 2 Ingestion OCUs to ensure 99.9% availability for the ingestion pipeline. Check the sizing recommendations and learn how OpenSearch Ingestion responds to workload spikes.

Clean up

Make sure you clean up unwanted AWS resources created during this post in order to prevent additional billing for these resources. Follow these steps to clean up your AWS account:

  1. On the AWS Cloud9 console, choose Environments in the navigation pane.
  2. Select the environment you want to delete and choose Delete.
  3. On the OpenSearch Service console, choose Domains under Managed clusters in the navigation pane.
  4. Select the domain you want to delete and choose Delete.
  5. Choose Pipelines under Ingestion in the navigation pane.
  6. Select the pipeline you want to delete and on the Actions menu, choose Delete.

Conclusion

In this post, you learned how to create a serverless ingestion pipeline to deliver Apache access logs to an OpenSearch Service domain using OpenSearch Ingestion. You learned the IAM permissions required to start using OpenSearch Ingestion and how to use a pipeline blueprint instead of creating a pipeline configuration from scratch.

You used Fluent Bit to collect and forward Apache logs, and used OpenSearch Ingestion to process and conditionally route the log data to different indexes in OpenSearch Service. For more examples about writing to OpenSearch Ingestion pipelines, refer to Sending data to Amazon OpenSearch Ingestion pipelines.

Finally, the post provided you with recommendations and best practices to deploy OpenSearch Ingestion pipelines in a production environment while controlling cost.

Follow this post to build your serverless log analytics pipeline, and refer to Top strategies for high volume tracing with Amazon OpenSearch Ingestion to learn more about high volume tracing with OpenSearch Ingestion.


About the authors

Hajer Bouafif is an Analytics Specialist Solutions Architect at Amazon Web Services. She focuses on OpenSearch Service and helps customers design and build well-architected analytics workloads in diverse industries. Hajer enjoys spending time outdoors and discovering new cultures.

Francisco Losada is an Analytics Specialist Solutions Architect based out of Madrid, Spain. He works with customers across EMEA to architect, implement, and evolve analytics solutions at AWS. He advocates for OpenSearch, the open-source search and analytics suite, and supports the community by sharing code samples, writing content, and speaking at conferences. In his spare time, Francisco enjoys playing tennis and running.

Muthu Pitchaimani is a Search Specialist with Amazon OpenSearch Service. He builds large-scale search applications and solutions. Muthu is interested in the topics of networking and security, and is based out of Austin, Texas.