All posts by Jon Handler

Amazon OpenSearch H2 2023 in review

Post Syndicated from Jon Handler original https://aws.amazon.com/blogs/big-data/amazon-opensearch-h2-2023-in-review/

2023 was a busy year for Amazon OpenSearch Service! Learn more about the releases that OpenSearch Service launched in the first half of 2023.

In the second half of 2023, OpenSearch Service added support for two new OpenSearch versions: 2.9 and 2.11. These two versions introduce new features in search, machine learning (ML)-powered search, migrations, and the operational side of the service.

With the release of zero-ETL integration with Amazon Simple Storage Service (Amazon S3), you can analyze your data sitting in your data lake using OpenSearch Service to build dashboards and query the data without the need to move your data from Amazon S3.

OpenSearch Service also announced a new zero-ETL integration with Amazon DynamoDB through the DynamoDB plugin for Amazon OpenSearch Ingestion. OpenSearch Ingestion takes care of bootstrapping and continuously streams data from your DynamoDB source.

OpenSearch Serverless announced the general availability of the Vector Engine for Amazon OpenSearch Serverless, along with other features that enhance your experience with time series collections, help you manage costs for development environments, and let you quickly scale resources to match your workload demands.

In this post, we discuss the new releases in OpenSearch Service to empower your business with search, observability, security analytics, and migrations.

Build cost-effective solutions with OpenSearch Service

With the zero-ETL integration for Amazon S3, OpenSearch Service now lets you query your data in place, saving cost on storage. Data movement is an expensive operation because you need to replicate data across different data stores. This increases your data footprint and drives cost. Moving data also adds the overhead of managing pipelines to migrate the data from one source to a new destination.

OpenSearch Service also added new instance types for data nodes—Im4gn and OR1—to help you further optimize your infrastructure cost. With up to 30 TB of Non-Volatile Memory Express (NVMe) solid state drive (SSD) storage, Im4gn instances provide dense storage and better performance. OR1 instances use segment replication and remote-backed storage to greatly increase throughput for indexing-heavy workloads.

Zero-ETL from DynamoDB to OpenSearch Service

In November 2023, DynamoDB and OpenSearch Ingestion introduced a zero-ETL integration for OpenSearch Service. OpenSearch Service domains and OpenSearch Serverless collections provide advanced search capabilities, such as full-text and vector search, on your DynamoDB data. With a few clicks on the AWS Management Console, you can now seamlessly load and synchronize your data from DynamoDB to OpenSearch Service, eliminating the need to write custom code to extract, transform, and load the data.

Direct query (zero-ETL for Amazon S3 data, in preview)

OpenSearch Service announced a new way for you to query operational logs in Amazon S3 and S3-based data lakes without needing to switch between tools to analyze operational data. Previously, you had to copy data from Amazon S3 into OpenSearch Service to take advantage of OpenSearch’s rich analytics and visualization features to understand your data, identify anomalies, and detect potential threats.

However, continuously replicating data between services can be expensive and requires operational work. With the OpenSearch Service direct query feature, you can access operational log data stored in Amazon S3, without needing to move the data itself. Now you can perform complex queries and visualizations on your data without any data movement.

Support of Im4gn with OpenSearch Service

Im4gn instances are optimized for workloads that manage large datasets and need high storage density per vCPU. Im4gn instances come in sizes large through 16xlarge, with up to 30 TB in NVMe SSD disk size. Im4gn instances are built on AWS Nitro System SSDs, which offer high-throughput, low-latency disk access for best performance. OpenSearch Service Im4gn instances support all OpenSearch versions and Elasticsearch versions 7.9 and above. For more details, refer to Supported instance types in Amazon OpenSearch Service.

Introducing OR1, an OpenSearch Optimized Instance family for indexing heavy workloads

In November 2023, OpenSearch Service launched OR1, the OpenSearch Optimized Instance family, which delivers up to 30% price-performance improvement over existing instances in internal benchmarks and uses Amazon S3 to provide 11 9s of durability. A domain with OR1 instances uses Amazon Elastic Block Store (Amazon EBS) volumes for primary storage, with data copied synchronously to Amazon S3 as it arrives. OR1 instances use OpenSearch’s segment replication feature to enable replica shards to read data directly from Amazon S3, avoiding the resource cost of indexing in both primary and replica shards. The OR1 instance family also supports automatic data recovery in the event of failure. For more information about OR1 instance type options, refer to Current generation instance types in OpenSearch Service.

Enable your business with security analytics features

The Security Analytics plugin in OpenSearch Service supports out-of-the-box prepackaged log types and provides security detection rules (SIGMA rules) to detect potential security incidents.

In OpenSearch 2.9, the Security Analytics plugin added support for custom log types and native support for the Open Cybersecurity Schema Framework (OCSF) data format. With this new support, you can build detectors with OCSF data stored in Amazon Security Lake to analyze security findings and mitigate potential incidents. You can also define your own custom log types and create custom detection rules.

Build ML-powered search solutions

In 2023, OpenSearch Service invested in eliminating the heavy lifting required to build next-generation search applications. With features such as search pipelines, search processors, and AI/ML connectors, OpenSearch Service enabled rapid development of search applications powered by neural search, hybrid search, and personalized results. Additionally, enhancements to the kNN plugin improved storage and retrieval of vector data. Newly launched optional plugins for OpenSearch Service enable seamless integration with additional language analyzers and Amazon Personalize.

Search pipelines

Search pipelines provide new ways to enhance search queries and improve search results. You define a search pipeline and then send your queries to it. When you define the search pipeline, you specify processors that transform and augment your queries, and re-rank your results. The prebuilt query processors include date conversion, aggregation, string manipulation, and data type conversion. The response processors in the search pipeline intercept and adapt results on the fly before they are rendered to the next phase. Both request and response processing for the pipeline are performed on the coordinator node, so there is no shard-level processing.
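As an illustration, the following is a minimal sketch of defining and invoking a search pipeline in the Dev Tools console. The pipeline, index, and field names are hypothetical; filter_query (a request processor) and rename_field (a response processor) are prebuilt processors:

# Define a pipeline with one request processor and one response processor
PUT /_search/pipeline/my_pipeline
{
  "request_processors": [
    {
      "filter_query": {
        "description": "Only return in-stock items",
        "query": { "term": { "in_stock": true } }
      }
    }
  ],
  "response_processors": [
    {
      "rename_field": {
        "field": "item_name",
        "target_field": "title"
      }
    }
  ]
}

# Route a query through the pipeline
GET /my_index/_search?search_pipeline=my_pipeline
{
  "query": { "match": { "description": "cozy couch" } }
}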

Optional plugins

OpenSearch Service lets you associate preinstalled optional OpenSearch plugins to use with your domain. An optional plugin package is compatible with a specific OpenSearch version, and can only be associated with domains running that version. Available plugins are listed on the Packages page on the OpenSearch Service console. Optional plugins include the Amazon Personalize plugin, which integrates OpenSearch Service with Amazon Personalize, and new language analyzers such as Nori, Sudachi, STConvert, and Pinyin.

Support for new language analyzers

OpenSearch Service added support for four new language analyzer plugins: Nori (Korean), Sudachi (Japanese), Pinyin (Chinese), and STConvert Analysis (Chinese). These are available in all AWS Regions as optional plugins that you can associate with domains running any OpenSearch version. You can use the Packages page on the OpenSearch Service console to associate these plugins to your domain, or use the Associate Package API.

Neural search feature

Neural search is generally available with OpenSearch Service version 2.9 and later. Neural search allows you to integrate with ML models that are hosted remotely using the model serving framework. When you use a neural query during search, neural search converts the query text into vector embeddings, uses vector search to compare the query and document embeddings, and returns the closest results. During ingestion, neural search transforms document text into vector embeddings and indexes both the text and its vector embeddings in a vector index.
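The following sketch shows the two halves of neural search in Dev Tools form. The model ID, pipeline, index, and field names are placeholders you would substitute with your own:

# Ingest pipeline: convert document text into vector embeddings at index time
PUT /_ingest/pipeline/nlp-pipeline
{
  "processors": [
    {
      "text_embedding": {
        "model_id": "<your-model-id>",
        "field_map": { "passage_text": "passage_embedding" }
      }
    }
  ]
}

# Neural query: the query text is encoded with the same model, then matched by vector search
GET /my-nlp-index/_search
{
  "query": {
    "neural": {
      "passage_embedding": {
        "query_text": "a cozy place to sit by the fire",
        "model_id": "<your-model-id>",
        "k": 5
      }
    }
  }
}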

Integration with Amazon Personalize

OpenSearch Service introduced an optional plugin to integrate with Amazon Personalize in OpenSearch versions 2.9 or later. The OpenSearch Service plugin for Amazon Personalize Search Ranking allows you to improve end-user engagement and conversion from your website and application search by taking advantage of the deep learning capabilities offered by Amazon Personalize. As an optional plugin, the package is compatible with OpenSearch version 2.9 or later, and can only be associated with domains running that version.

Efficient query filtering with OpenSearch’s k-NN FAISS

OpenSearch Service introduced efficient query filtering with OpenSearch’s k-NN FAISS in version 2.9 and later. OpenSearch’s efficient vector query filters capability intelligently evaluates optimal filtering strategies—pre-filtering with approximate nearest neighbor (ANN) or filtering with exact k-nearest neighbor (k-NN)—to determine the best strategy to deliver accurate and low-latency vector search queries. In earlier OpenSearch versions, vector queries on the FAISS engine used post-filtering techniques, which enabled filtered queries at scale but could potentially return fewer than the requested “k” number of results. Efficient vector query filters deliver low latency and accurate results, enabling you to employ hybrid search across vector and lexical techniques.
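The following is a sketch of a filtered vector query on a FAISS-backed index; the index, field names, and vector values are hypothetical. Placing the filter inside the knn clause lets the engine pick the efficient filtering strategy:

GET /products/_search
{
  "size": 5,
  "query": {
    "knn": {
      "item_vector": {
        "vector": [0.1, 0.2, 0.3],
        "k": 5,
        "filter": {
          "term": { "category": "couch" }
        }
      }
    }
  }
}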

Byte-quantized vectors in OpenSearch Service

With byte-quantized vectors, introduced in 2.9, you can reduce memory requirements by a factor of 4 and significantly reduce search latency, with minimal loss in quality (recall). With this feature, the usual 32-bit floats that are used for vectors are quantized, or converted, to 8-bit signed integers. For many applications, existing float vector data can be quantized with little loss in quality. Comparing benchmarks, you will find that using byte vectors rather than 32-bit floats results in a significant reduction in storage and memory usage, while also improving indexing throughput and reducing query latency. An internal benchmark showed that storage usage was reduced by up to 78% and RAM usage by up to 59% (for the glove-200-angular dataset). Recall values for angular datasets were lower than those of Euclidean datasets.
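The following mapping sketch (index and field names hypothetical, toy dimension) shows how byte quantization is declared; in version 2.9 the byte data type is supported with the Lucene engine, and ingested vector values must be integers in the range -128 to 127:

PUT /byte-vector-index
{
  "settings": { "index.knn": true },
  "mappings": {
    "properties": {
      "my_vector": {
        "type": "knn_vector",
        "dimension": 8,
        "data_type": "byte",
        "method": { "name": "hnsw", "engine": "lucene", "space_type": "l2" }
      }
    }
  }
}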

AI/ML connectors

OpenSearch 2.9 and later supports integrations with ML models hosted on AWS services or third-party platforms. This allows system administrators and data scientists to run ML workloads outside of their OpenSearch Service domain. The ML connectors come with a supported set of ML blueprints—templates that define the set of parameters you need to provide when sending API requests to a specific connector. OpenSearch Service provides connectors for several platforms, such as Amazon SageMaker, Amazon Bedrock, OpenAI ChatGPT, and Cohere.
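The following is a sketch of creating a connector, loosely following the SageMaker blueprint pattern; the endpoint URL, role ARN, and request body are placeholder assumptions you would take from the blueprint for your platform:

POST /_plugins/_ml/connectors/_create
{
  "name": "sagemaker-embedding-connector",
  "description": "Connector to a hypothetical SageMaker embedding endpoint",
  "version": 1,
  "protocol": "aws_sigv4",
  "parameters": { "region": "us-east-1", "service_name": "sagemaker" },
  "credential": { "roleArn": "arn:aws:iam::<account-id>:role/<connector-role>" },
  "actions": [
    {
      "action_type": "predict",
      "method": "POST",
      "url": "https://runtime.sagemaker.us-east-1.amazonaws.com/endpoints/<endpoint>/invocations",
      "headers": { "content-type": "application/json" },
      "request_body": "{ \"inputs\": \"${parameters.input}\" }"
    }
  ]
}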

OpenSearch Service console integrations

OpenSearch 2.9 and later added a new integrations feature on the console. Integrations provides you with an AWS CloudFormation template to build your semantic search use case by connecting to your ML models hosted on SageMaker or Amazon Bedrock. The CloudFormation template generates the model endpoint and registers the model ID with the OpenSearch Service domain you provide as input to the template.

Hybrid search and range normalization

The normalization processor and hybrid query build on top of two features released earlier in 2023—neural search and search pipelines. Because lexical and semantic queries return relevance scores on different scales, fine-tuning hybrid search queries was difficult.

OpenSearch Service 2.11 now supports a combination and normalization processor for hybrid search. You can now run hybrid search queries that combine a lexical search query and a natural language-based k-NN vector search query. OpenSearch Service also enables you to tune your hybrid search results for maximum relevance using multiple score combination and normalization techniques.
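As a sketch (pipeline, index, field, and model names are hypothetical), you first create a search pipeline with the normalization processor, then send a hybrid query through it:

PUT /_search/pipeline/hybrid-pipeline
{
  "phase_results_processors": [
    {
      "normalization-processor": {
        "normalization": { "technique": "min_max" },
        "combination": {
          "technique": "arithmetic_mean",
          "parameters": { "weights": [0.3, 0.7] }
        }
      }
    }
  ]
}

# Hybrid query combining lexical (match) and semantic (neural) sub-queries
GET /my-index/_search?search_pipeline=hybrid-pipeline
{
  "query": {
    "hybrid": {
      "queries": [
        { "match": { "text": "cozy fireplace couch" } },
        {
          "neural": {
            "passage_embedding": {
              "query_text": "cozy fireplace couch",
              "model_id": "<your-model-id>",
              "k": 5
            }
          }
        }
      ]
    }
  }
}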

Multimodal search with Amazon Bedrock

OpenSearch Service 2.11 introduces support for multimodal search, which allows you to search text and image data using multimodal embedding models. To generate vector embeddings, you need to create an ingest pipeline that contains a text_image_embedding processor, which converts the text or image binaries in a document field to vector embeddings. You can use the neural query clause, either in the k-NN plugin API or Query DSL queries, to run combined text and image searches. You can use the new OpenSearch Service integration features to quickly get started with multimodal search.
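The following sketch shows a text_image_embedding ingest pipeline; the model ID and field names are placeholders. At query time, a neural clause can supply query_text, query_image (Base64-encoded), or both:

PUT /_ingest/pipeline/multimodal-pipeline
{
  "processors": [
    {
      "text_image_embedding": {
        "model_id": "<multimodal-model-id>",
        "embedding": "vector_embedding",
        "field_map": {
          "text": "product_description",
          "image": "product_image"
        }
      }
    }
  ]
}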

Neural sparse retrieval

Neural sparse search, a new efficient method of semantic retrieval, is available in OpenSearch Service 2.11. Neural sparse search operates in two modes: bi-encoder and document-only. With the bi-encoder mode, both documents and search queries are passed through deep encoders. In document-only mode, only documents are passed through deep encoders, while search queries are tokenized. A document-only sparse encoder generates an index that is 10.4% of the size of a dense encoding index. For a bi-encoder, the index size is 7.2% of the size of a dense encoding index. Neural sparse search is enabled by sparse encoding models that create sparse vector embeddings: a set of <token: weight> pairs representing the text entry and its corresponding weight in the sparse vector. To learn more about the pre-trained models for sparse neural search, refer to Sparse encoding models.

Neural sparse search reduces costs, improves search relevance, and has lower latency. You can use the new OpenSearch Service integration features to quickly get started with neural sparse search.
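As a sketch, a neural sparse query looks like the following; the index, field, and model ID are hypothetical. In document-only mode the query text is simply tokenized, so no model inference happens at query time:

GET /sparse-index/_search
{
  "query": {
    "neural_sparse": {
      "passage_sparse": {
        "query_text": "a cozy place to sit by the fire",
        "model_id": "<sparse-encoding-model-id>"
      }
    }
  }
}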

OpenSearch Ingestion updates

OpenSearch Ingestion is a fully managed, automatically scaled ingestion pipeline that delivers your data to OpenSearch Service domains and OpenSearch Serverless collections. Since its release in 2023, OpenSearch Ingestion continues to add new features to make it straightforward to transform and move your data from supported sources to downstream destinations like OpenSearch Service, OpenSearch Serverless, and Amazon S3.

New migration features in OpenSearch Ingestion

In November 2023, OpenSearch Ingestion announced the release of new features to support data migration from self-managed Elasticsearch version 7.x domains to the latest versions of OpenSearch Service.

OpenSearch Ingestion also supports the migration of data from OpenSearch Service managed domains running OpenSearch version 2.x to OpenSearch Serverless collections.

Learn how you can use OpenSearch Ingestion to migrate your data to OpenSearch Service.

Improve data durability with OpenSearch Ingestion

In November 2023, OpenSearch Ingestion introduced persistent buffering for push-based sources like HTTP (Fluentd, Fluent Bit) and OpenTelemetry collectors.

By default, OpenSearch Ingestion uses in-memory buffering. With persistent buffering, OpenSearch Ingestion stores your data in a disk-based store that is more resilient. If you have existing ingestion pipelines, you can enable persistent buffering for these pipelines from the pipeline configuration.

Support of new plugins

In early 2023, OpenSearch Ingestion added support for Amazon Managed Streaming for Apache Kafka (Amazon MSK). OpenSearch Ingestion uses the Kafka plugin to stream data from Amazon MSK to OpenSearch Service managed domains or OpenSearch Serverless collections. To learn more about setting up Amazon MSK as a data source, see Using an OpenSearch Ingestion pipeline with Amazon Managed Streaming for Apache Kafka.

OpenSearch Serverless updates

OpenSearch Serverless continued to enhance your serverless experience with OpenSearch by introducing support for a new vector search collection type to store embeddings and run similarity search. OpenSearch Serverless now supports shard replica scaling to handle spikes in query throughput. And if you are using a time series collection, you can now set up a custom data retention policy to match your data retention requirements.

Vector Engine for OpenSearch Serverless

In November 2023, we launched the vector engine for Amazon OpenSearch Serverless. The vector engine makes it straightforward to build modern ML-augmented search experiences and generative artificial intelligence (generative AI) applications without needing to manage the underlying vector database infrastructure. It also enables you to run hybrid search, combining vector search and full-text search in the same query, removing the need to manage and maintain separate data stores or a complex application stack.

OpenSearch Serverless lower-cost dev and test environments

OpenSearch Serverless now supports development and test workloads by allowing you to avoid running a replica. Removing replicas eliminates the need to have redundant OpenSearch Compute Units (OCUs) in another Availability Zone solely for availability purposes. If you are using OpenSearch Serverless for development and testing, where availability is not a concern, you can drop your minimum OCUs from 4 to 2.

OpenSearch Serverless supports automated time-based data deletion using data lifecycle policies

In December 2023, OpenSearch Serverless announced support for managing data retention of time series collections and indexes. With the new automated time-based data deletion feature, you can specify how long you want to retain data. OpenSearch Serverless automatically manages the lifecycle of the data based on this configuration. To learn more, refer to Amazon OpenSearch Serverless now supports automated time-based data deletion.
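For example, a retention-type data lifecycle policy might look like the following sketch (the collection and index pattern are hypothetical); it retains matching indexes for at least 30 days and lets OpenSearch Serverless delete older data:

{
  "Rules": [
    {
      "ResourceType": "index",
      "Resource": ["index/vpc-flow-logs/*"],
      "MinIndexRetention": "30d"
    }
  ]
}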

OpenSearch Serverless announced support for scaling up replicas at shard level

At launch, OpenSearch Serverless supported increasing capacity automatically in response to growing data sizes. With the new shard replica scaling feature, OpenSearch Serverless automatically detects shards under duress due to sudden spikes in query rates and dynamically adds new shard replicas to handle the increased query throughput while maintaining fast response times. This approach proves to be more cost-efficient than simply adding new index replicas.

AWS user notifications to monitor your OCU usage

With the new AWS User Notifications integration, you can configure the system to send notifications whenever OCU utilization is approaching or has reached the maximum configured limits for search or ingestion, eliminating the need to monitor the service constantly. For more information, see Monitoring Amazon OpenSearch Serverless using AWS User Notifications.

Enhance your experience with OpenSearch Dashboards

OpenSearch 2.9 in OpenSearch Service introduced new features to make it straightforward to quickly analyze your data in OpenSearch Dashboards. These new features include out-of-the-box, preconfigured dashboards with OpenSearch Integrations, and the ability to create alerting and anomaly detection from an existing visualization in your dashboards.

OpenSearch Dashboard integrations

OpenSearch 2.9 added support for OpenSearch integrations in OpenSearch Dashboards. OpenSearch integrations include preconfigured dashboards so you can quickly start analyzing your data coming from popular sources such as Amazon CloudFront, AWS WAF, AWS CloudTrail, and Amazon Virtual Private Cloud (Amazon VPC) flow logs.

Alerting and anomalies in OpenSearch Dashboards

In OpenSearch Service 2.9, you can create a new alerting monitor directly from your line chart visualization in OpenSearch Dashboards. You can also associate the existing monitors or detectors previously created in OpenSearch to the dashboard visualization.

This new feature helps reduce context switching between dashboards and the Alerting and Anomaly Detection plugins. For example, you can add an alerting monitor to a dashboard visualization to detect drops in average data volume in your services.

OpenSearch expands geospatial aggregations support

With OpenSearch version 2.9, OpenSearch Service added support for three types of geoshape data aggregation through the API: geo_bounds, geo_hash, and geo_tile.

The geoshape field type lets you index location data in different geographic formats, such as a point, a polygon, or a linestring. With the new aggregation types, you have more flexibility to aggregate documents from an index using metric and multi-bucket geospatial aggregations.

OpenSearch Service operational updates

OpenSearch Service removed the need to run a blue/green deployment when changing a domain’s cluster manager nodes. Additionally, the service improved Auto-Tune events with support for new Auto-Tune metrics to track changes within your OpenSearch Service domain.

OpenSearch Service now lets you update domain manager nodes without blue/green deployment

As of early H2 of 2023, OpenSearch Service allowed you to modify the instance type or instance count of dedicated cluster manager nodes without the need for blue/green deployment. This enhancement allows quicker updates with minimal disruption to your domain operations, all while avoiding any data movement.

Previously, updating your dedicated cluster manager nodes on OpenSearch Service meant using a blue/green deployment to make the change. Although blue/green deployments are meant to avoid any disruption to your domains, because the deployment utilizes additional resources on the domain, it is recommended that you perform them during low-traffic periods. Now you can update cluster manager instance types or instance counts without requiring a blue/green deployment, so these updates can complete faster while avoiding any potential disruption to your domain operations. In cases where you modify both the cluster manager instance type and count, OpenSearch Service will still use a blue/green deployment to make the change. You can use the dry-run option to check whether your change requires a blue/green deployment.

Enhanced Auto-Tune experience

In September 2023, OpenSearch Service added new Auto-Tune metrics and improved Auto-Tune events that give you better visibility into the domain performance optimizations made by Auto-Tune.

Auto-Tune is an adaptive resource management system that automatically updates OpenSearch Service domain resources to improve efficiency and performance. For example, Auto-Tune optimizes memory-related configuration such as queue sizes, cache sizes, and Java virtual machine (JVM) settings on your nodes.

With this launch, you can now audit the history of the changes, as well as track them in real time from the Amazon CloudWatch console.

Additionally, OpenSearch Service now publishes details of the changes to Amazon EventBridge when Auto-Tune settings are recommended or applied to an OpenSearch Service domain. These Auto-Tune events will also be visible on the Notifications page on the OpenSearch Service console.

Accelerate your migration to OpenSearch Service with the new Migration Assistant solution

In November 2023, the OpenSearch team launched a new open-source solution—Migration Assistant for Amazon OpenSearch Service. The solution supports data migration from self-managed Elasticsearch and OpenSearch domains to OpenSearch Service, with Elasticsearch 7.x (<=7.10), OpenSearch 1.x, and OpenSearch 2.x as migration sources. The solution facilitates the migration of existing and live data between source and destination.

Conclusion

In this post, we covered the new releases in OpenSearch Service to help you innovate your business with search, observability, security analytics, and migrations. We provided you with information about when to use each new feature in OpenSearch Service, OpenSearch Ingestion, and OpenSearch Serverless.

Learn more about OpenSearch Dashboards and OpenSearch plugins, and try the exciting new OpenSearch Assistant using the OpenSearch playground.

Check out the features described in this post, and share your valuable feedback with us.


About the Authors

Jon Handler is a Senior Principal Solutions Architect at Amazon Web Services based in Palo Alto, CA. Jon works closely with OpenSearch and Amazon OpenSearch Service, providing help and guidance to a broad range of customers who have search and log analytics workloads that they want to move to the AWS Cloud. Prior to joining AWS, Jon’s career as a software developer included 4 years of coding a large-scale, ecommerce search engine. Jon holds a Bachelor of the Arts from the University of Pennsylvania, and a Master of Science and a PhD in Computer Science and Artificial Intelligence from Northwestern University.

Hajer Bouafif is an Analytics Specialist Solutions Architect at Amazon Web Services. She focuses on Amazon OpenSearch Service and helps customers design and build well-architected analytics workloads in diverse industries. Hajer enjoys spending time outdoors and discovering new cultures.

Aruna Govindaraju is an Amazon OpenSearch Specialist Solutions Architect and has worked with many commercial and open source search engines. She is passionate about search, relevancy, and user experience. Her expertise with correlating end-user signals with search engine behavior has helped many customers improve their search experience.

Prashant Agrawal is a Sr. Search Specialist Solutions Architect with Amazon OpenSearch Service. He works closely with customers to help them migrate their workloads to the cloud and helps existing customers fine-tune their clusters to achieve better performance and save on cost. Before joining AWS, he helped various customers use OpenSearch and Elasticsearch for their search and log analytics use cases. When not working, you can find him traveling and exploring new places. In short, he likes doing Eat → Travel → Repeat.

Muslim Abu Taha is a Sr. OpenSearch Specialist Solutions Architect dedicated to guiding clients through seamless search workload migrations, fine-tuning clusters for peak performance, and ensuring cost-effectiveness. With a background as a Technical Account Manager (TAM), Muslim brings a wealth of experience in assisting enterprise customers with cloud adoption and optimizing their workloads. Muslim enjoys spending time with his family, traveling, and exploring new places.

Amazon OpenSearch Service’s vector database capabilities explained

Post Syndicated from Jon Handler original https://aws.amazon.com/blogs/big-data/amazon-opensearch-services-vector-database-capabilities-explained/

OpenSearch is a scalable, flexible, and extensible open-source software suite for search, analytics, security monitoring, and observability applications, licensed under the Apache 2.0 license. It comprises a search engine, OpenSearch, which delivers low-latency search and aggregations, OpenSearch Dashboards, a visualization and dashboarding tool, and a suite of plugins that provide advanced capabilities like alerting, fine-grained access control, observability, security monitoring, and vector storage and processing. Amazon OpenSearch Service is a fully managed service that makes it simple to deploy, scale, and operate OpenSearch in the AWS Cloud.

As an end-user, when you use OpenSearch’s search capabilities, you generally have a goal in mind—something you want to accomplish. Along the way, you use OpenSearch to gather information in support of achieving that goal (or maybe the information is the original goal). We’ve all become used to the “search box” interface, where you type some words, and the search engine brings back results based on word-to-word matching. Let’s say you want to buy a couch in order to spend cozy evenings with your family around the fire. You go to Amazon.com, and you type “a cozy place to sit by the fire.” Unfortunately, if you run that search on Amazon.com, you get items like fire pits, heating fans, and home decorations—not what you intended. The problem is that couch manufacturers probably didn’t use the words “cozy,” “place,” “sit,” and “fire” in their product titles or descriptions.

In recent years, machine learning (ML) techniques have become increasingly popular to enhance search. Among them is the use of embedding models, a type of model that can encode a large body of data into an n-dimensional space where each entity is encoded into a vector, a data point in that space, and organized such that similar entities are closer together. An embedding model, for instance, could encode the semantics of a corpus. By searching for the vectors nearest to an encoded document — k-nearest neighbor (k-NN) search — you can find the most semantically similar documents. Sophisticated embedding models can support multiple modalities, for instance, encoding the image and text of a product catalog and enabling similarity matching on both modalities.

A vector database provides efficient vector similarity search by providing specialized indexes like k-NN indexes. It also provides other database functionality like managing vector data alongside other data types, workload management, access control, and more. OpenSearch’s k-NN plugin provides core vector database functionality for OpenSearch, so when your customer searches for “a cozy place to sit by the fire” in your catalog, you can encode that prompt and use OpenSearch to perform a nearest neighbor query to surface that 8-foot, blue couch with designer-arranged photographs in front of fireplaces.
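To make that concrete, here is a toy sketch of the core vector database operations in Dev Tools form: create an index with a knn_vector field, index a document with its embedding, and run a nearest neighbor query. The index, field names, and 3-dimensional vectors are purely illustrative; real embedding models typically produce hundreds of dimensions:

# Create an index with a knn_vector field
PUT /products
{
  "settings": { "index.knn": true },
  "mappings": {
    "properties": {
      "description_vector": {
        "type": "knn_vector",
        "dimension": 3,
        "method": { "name": "hnsw", "engine": "nmslib", "space_type": "cosinesimil" }
      }
    }
  }
}

# Index a document together with its (toy) embedding
PUT /products/_doc/1
{ "title": "8-foot blue couch", "description_vector": [0.12, 0.74, 0.33] }

# Query with the embedding of the search prompt
GET /products/_search
{
  "size": 3,
  "query": {
    "knn": {
      "description_vector": { "vector": [0.1, 0.7, 0.3], "k": 3 }
    }
  }
}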

Using OpenSearch Service as a vector database

With OpenSearch Service’s vector database capabilities, you can implement semantic search, Retrieval Augmented Generation (RAG) with large language models (LLMs), recommendation engines, and rich media search.

Semantic search

With semantic search, you improve the relevance of retrieved results using language-based embeddings on search documents. You enable your search customers to use natural language queries, like “a cozy place to sit by the fire” to find their 8-foot-long blue couch. For more information, refer to Building a semantic search engine in OpenSearch to learn how semantic search can deliver a 15% relevance improvement, as measured by normalized discounted cumulative gain (nDCG) metrics compared with keyword search. For a concrete example, our Improve search relevance with ML in Amazon OpenSearch Service workshop explores the difference between keyword and semantic search, based on a Bidirectional Encoder Representations from Transformers (BERT) model, hosted by Amazon SageMaker to generate vectors and store them in OpenSearch. The workshop uses product question answers as an example to show how keyword search using the keywords/phrases of the query leads to some irrelevant results. Semantic search is able to retrieve more relevant documents by matching the context and semantics of the query. The following diagram shows an example architecture for a semantic search application with OpenSearch Service as the vector database.

Architecture diagram showing how to use Amazon OpenSearch Service to perform semantic search to improve relevance

Retrieval Augmented Generation with LLMs

RAG is a method for building trustworthy generative AI chatbots using generative LLMs like OpenAI’s ChatGPT or Amazon Titan Text. With the rise of generative LLMs, application developers are looking for ways to take advantage of this innovative technology. One popular use case involves delivering conversational experiences through intelligent agents. Perhaps you’re a software provider with knowledge bases for product information, customer self-service, or industry domain knowledge like tax reporting rules or medical information about diseases and treatments. A conversational search experience provides an intuitive interface for users to sift through information through dialog and Q&A. Generative LLMs on their own are prone to hallucinations—a situation where the model generates a believable but factually incorrect response. RAG solves this problem by complementing generative LLMs with an external knowledge base that is typically built using a vector database hydrated with vector-encoded knowledge articles.

As illustrated in the following diagram, the query workflow starts with a question that is encoded and used to retrieve relevant knowledge articles from the vector database. Those results are sent to the generative LLM whose job is to augment those results, typically by summarizing the results as a conversational response. By complementing the generative model with a knowledge base, RAG grounds the model on facts to minimize hallucinations. You can learn more about building a RAG solution in the Retrieval Augmented Generation module of our semantic search workshop.

Architecture diagram showing how to use Amazon OpenSearch Service to perform retrieval-augmented generation

Recommendation engine

Recommendations are a common component in the search experience, especially for ecommerce applications. Adding a user experience feature like “more like this” or “customers who bought this also bought that” can drive additional revenue through getting customers what they want. Search architects employ many techniques and technologies to build recommendations, including Deep Neural Network (DNN) based recommendation algorithms such as the two-tower neural net model, YoutubeDNN. A trained embedding model encodes products, for example, into an embedding space where products that are frequently bought together are considered more similar, and therefore are represented as data points that are closer together in the embedding space. Another possibility is that product embeddings are based on co-rating similarity instead of purchase activity. You can employ this affinity data by calculating the vector similarity between a particular user’s embedding and vectors in the database to return recommended items. The following diagram shows an example architecture of building a recommendation engine with OpenSearch as a vector store.

Architecture diagram showing how to use Amazon OpenSearch Service as a recommendation engine

Media search

Media search enables users to query the search engine with rich media like images, audio, and video. Its implementation is similar to semantic search—you create vector embeddings for your search documents and then query OpenSearch Service with a vector. The difference is that you use a computer vision deep neural network (for example, a convolutional neural network (CNN) such as ResNet) to convert images into vectors. The following diagram shows an example architecture of building an image search with OpenSearch as the vector store.

Architecture diagram showing how to use Amazon OpenSearch Service to search rich media like images, videos, and audio files

Understanding the technology

OpenSearch uses approximate nearest neighbor (ANN) algorithms from the NMSLIB, FAISS, and Lucene libraries to power k-NN search. These search methods employ ANN to improve search latency for large datasets. Of the three search methods the k-NN plugin provides, approximate k-NN offers the best search scalability for large datasets. The engine details are as follows:

  • Non-Metric Space Library (NMSLIB) – NMSLIB implements the HNSW ANN algorithm
  • Facebook AI Similarity Search (FAISS) – FAISS implements both HNSW and IVF ANN algorithms
  • Lucene – Lucene implements the HNSW algorithm

Each of the three engines used for approximate k-NN search has its own attributes that make one more sensible to use than the others in a given situation. You can follow the general information in this section to help determine which engine will best meet your requirements.

In general, NMSLIB and FAISS should be selected for large-scale use cases. Lucene is a good option for smaller deployments and offers benefits like smart filtering, where the optimal filtering strategy—pre-filtering, post-filtering, or exact k-NN—is automatically applied depending on the situation. The following table summarizes the differences between each option.

|                         | NMSLIB-HNSW                             | FAISS-HNSW                  | FAISS-IVF                   | Lucene-HNSW                  |
|-------------------------|-----------------------------------------|-----------------------------|-----------------------------|------------------------------|
| Max Dimension           | 16,000                                  | 16,000                      | 16,000                      | 1,024                        |
| Filter                  | Post filter                             | Post filter                 | Post filter                 | Filter while search          |
| Training Required       | No                                      | No                          | Yes                         | No                           |
| Similarity Metrics      | l2, innerproduct, cosinesimil, l1, linf | l2, innerproduct            | l2, innerproduct            | l2, cosinesimil              |
| Vector Volume           | Tens of billions                        | Tens of billions            | Tens of billions            | < Ten million                |
| Indexing Latency        | Low                                     | Low                         | Lowest                      | Low                          |
| Query Latency & Quality | Low latency & high quality              | Low latency & high quality  | Low latency & low quality   | High latency & high quality  |
| Vector Compression      | Flat                                    | Flat, Product Quantization  | Flat, Product Quantization  | Flat                         |
| Memory Consumption      | High                                    | High; Low with PQ           | Medium; Low with PQ         | High                         |

Approximate and exact nearest-neighbor search

The OpenSearch Service k-NN plugin supports three different methods for obtaining the k-nearest neighbors from an index of vectors: approximate k-NN, score script (exact k-NN), and painless extensions (exact k-NN).

Approximate k-NN

The first method takes an approximate nearest neighbor approach—it uses one of several algorithms to return the approximate k-nearest neighbors to a query vector. Usually, these algorithms sacrifice indexing speed and search accuracy in return for performance benefits such as lower latency, smaller memory footprints, and more scalable search. Approximate k-NN is the best choice for searches over large indexes (that is, hundreds of thousands of vectors or more) that require low latency. You should not use approximate k-NN if you want to apply a filter on the index before the k-NN search, which greatly reduces the number of vectors to be searched. In this case, you should use either the score script method or painless extensions.

Score script

The second method extends the OpenSearch Service score script functionality to run a brute force, exact k-NN search over knn_vector fields or fields that can represent binary objects. With this approach, you can run k-NN search on a subset of vectors in your index (sometimes referred to as a pre-filter search). This approach is preferred for searches over smaller bodies of documents or when a pre-filter is needed. Using this approach on large indexes may lead to high latencies.
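The following sketch shows an exact k-NN score script query with a pre-filter (index and field names are hypothetical); only documents that match the inner term query are scored:

GET /products/_search
{
  "size": 3,
  "query": {
    "script_score": {
      "query": { "term": { "category": "couch" } },
      "script": {
        "source": "knn_score",
        "lang": "knn",
        "params": {
          "field": "description_vector",
          "query_value": [0.1, 0.7, 0.3],
          "space_type": "cosinesimil"
        }
      }
    }
  }
}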

Painless extensions

The third method adds the distance functions as painless extensions that you can use in more complex combinations. Similar to the k-NN score script, you can use this method to perform a brute force, exact k-NN search across an index, which also supports pre-filtering. This approach has slightly slower query performance compared to the k-NN score script. If your use case requires more customization over the final score, you should use this approach over score script k-NN.
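As a sketch (field name hypothetical), the painless extensions expose distance functions that you can embed in your own scoring expression; here, adding 1.0 keeps the score non-negative:

GET /products/_search
{
  "size": 3,
  "query": {
    "script_score": {
      "query": { "match_all": {} },
      "script": {
        "source": "1.0 + cosineSimilarity(params.query_value, doc['description_vector'])",
        "params": { "query_value": [0.1, 0.7, 0.3] }
      }
    }
  }
}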

Vector search algorithms

The simple way to find similar vectors is to use k-nearest neighbors (k-NN) algorithms, which compute the distance between a query vector and the other vectors in the vector database. As we mentioned earlier, the score script k-NN and painless extensions search methods use the exact k-NN algorithms under the hood. However, in the case of extremely large datasets with high dimensionality, this creates a scaling problem that reduces the efficiency of the search. Approximate nearest neighbor (ANN) search methods can overcome this by employing tools that restructure indexes more efficiently and reduce the dimensionality of searchable vectors. There are different ANN search algorithms, such as locality sensitive hashing, tree-based, cluster-based, and graph-based. OpenSearch implements two ANN algorithms: Hierarchical Navigable Small Worlds (HNSW) and Inverted File (IVF). For a more detailed explanation of how the HNSW and IVF algorithms work in OpenSearch, see the blog post “Choose the k-NN algorithm for your billion-scale use case with OpenSearch”.

Hierarchical Navigable Small Worlds

The HNSW algorithm is one of the most popular algorithms out there for ANN search. The core idea of the algorithm is to build a graph with edges connecting index vectors that are close to each other. Then, on search, this graph is partially traversed to find the approximate nearest neighbors to the query vector. To steer the traversal towards the query’s nearest neighbors, the algorithm always visits the closest candidate to the query vector next.

Inverted File

The IVF algorithm separates your index vectors into a set of buckets, then, to reduce your search time, only searches through a subset of these buckets. However, if the algorithm just randomly split up your vectors into different buckets, and only searched a subset of them, it would yield a poor approximation. The IVF algorithm uses a more elegant approach. First, before indexing begins, it assigns each bucket a representative vector. When a vector is indexed, it gets added to the bucket that has the closest representative vector. This way, vectors that are closer to each other are placed roughly in the same or nearby buckets.

Vector similarity metrics

All search engines use a similarity metric to rank and sort results and bring the most relevant results to the top. When you use a plain text query, the similarity metric is lexical, such as TF-IDF or its refinement, Okapi BM25, which measures the importance of the terms in the query and generates a score based on the number of textual matches. When your query includes a vector, the similarity metrics are spatial in nature, taking advantage of proximity in the vector space. OpenSearch supports several similarity or distance measures:

  • Euclidean distance – The straight-line distance between points.
  • L1 (Manhattan) distance – The sum of the absolute differences of all of the vector components. L1 distance measures how many orthogonal city blocks you need to traverse from point A to point B.
  • L-infinity (chessboard) distance – The number of moves a King would make on an n-dimensional chessboard. It’s different than Euclidean distance on the diagonals—a diagonal step on a 2-dimensional chessboard is 1.41 Euclidean units away, but only 1 L-infinity unit away.
  • Inner product – The product of the magnitudes of two vectors and the cosine of the angle between them. Usually used for natural language processing (NLP) vector similarity.
  • Cosine similarity – The cosine of the angle between two vectors in a vector space.
  • Hamming distance – For binary-coded vectors, the number of bits that differ between the two vectors.

Advantage of OpenSearch as a vector database

When you use OpenSearch Service as a vector database, you can take advantage of the service’s features like usability, scalability, availability, interoperability, and security. More importantly, you can use OpenSearch’s search features to enhance the search experience. For example, you can use Learning to Rank in OpenSearch to integrate user clickthrough behavior data into your search application and improve search relevance. You can also combine OpenSearch text search and vector search capabilities to search documents with keyword and semantic similarity. You can also use other fields in the index to filter documents to improve relevance. For advanced users, you can use a hybrid scoring model to combine OpenSearch’s text-based relevance score, computed with the Okapi BM25 function, and its vector search score to improve the ranking of your search results.

Scale and limits

As a vector database, OpenSearch supports billions of vector records. Keep in mind the following guidance on the number of vectors and dimensions when you size your cluster.

Number of vectors

OpenSearch VectorDB takes advantage of the sharding capabilities of OpenSearch and can scale to billions of vectors at single-digit millisecond latencies by sharding vectors and scaling horizontally by adding more nodes. The number of vectors that can fit in a single machine is a function of the off-heap memory availability on the machine. The number of nodes required will depend on the amount of memory that can be used for the algorithm per node and the total amount of memory required by the algorithm. The more nodes, the more memory and the better the performance. The amount of memory available per node is computed as memory_available = (node_memory - jvm_size) * circuit_breaker_limit, with the following parameters:

  • node_memory – The total memory of the instance.
  • jvm_size – The OpenSearch JVM heap size. This is set to half of the instance’s RAM, capped at approximately 32 GB.
  • circuit_breaker_limit – The native memory usage threshold for the circuit breaker. This is set to 0.5.

Total cluster memory estimation depends on the total number of vector records and the algorithm used. HNSW and IVF have different memory requirements. You can refer to Memory Estimation for more details.
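As a rough worked example of the formula above: on a node with 128 GiB of RAM, jvm_size is capped at 32 GiB, so memory_available = (128 - 32) * 0.5 = 48 GiB. Using the documented HNSW estimate of roughly 1.1 * (4 * dimension + 8 * m) bytes per vector, a 384-dimension vector with m = 16 needs about 1.1 * (4 * 384 + 8 * 16) ≈ 1.8 KB, so such a node can hold on the order of 28 million vectors before accounting for replicas. Treat these numbers as a sizing sketch, not a guarantee.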

Number of dimensions

OpenSearch’s current dimension limit for the vector field knn_vector is 16,000 dimensions. Each dimension is represented as a 32-bit float. The more dimensions, the more memory you’ll need to index and search. The number of dimensions is usually determined by the embedding models that translate the entity to a vector. There are a lot of options to choose from when building your knn_vector field. To determine the correct methods and parameters to choose, refer to Choosing the right method.

Customer stories

Amazon Music

Amazon Music is always innovating to provide customers with unique and personalized experiences. One of Amazon Music’s approaches to music recommendations is a remix of a classic Amazon innovation, item-to-item collaborative filtering, and vector databases. Using data aggregated based on user listening behavior, Amazon Music has created an embedding model that encodes music tracks and customer representations into a vector space where neighboring vectors represent tracks that are similar. 100 million songs are encoded into vectors, indexed into OpenSearch, and served across multiple geographies to power real-time recommendations. OpenSearch currently manages 1.05 billion vectors and supports a peak load of 7,100 vector queries per second to power Amazon Music recommendations.

The item-to-item collaborative filter continues to be among the most popular methods for online product recommendations because of its effectiveness at scaling to large customer bases and product catalogs. OpenSearch makes it easier to operationalize and further the scalability of the recommender by providing scale-out infrastructure and k-NN indexes that grow linearly with the number of tracks while supporting similarity search in logarithmic time.

The following figure visualizes the high-dimensional space created by the vector embedding.

A visualization of the vector encoding of Amazon Music entries in the large vector space

Brand protection at Amazon

Amazon strives to deliver the world’s most trustworthy shopping experience, offering customers the widest possible selection of authentic products. To earn and maintain our customers’ trust, we strictly prohibit the sale of counterfeit products, and we continue to invest in innovations that ensure only authentic products reach our customers. Amazon’s brand protection programs build trust with brands by accurately representing and completely protecting their brand. We strive to ensure that public perception mirrors the trustworthy experience we deliver. Our brand protection strategy focuses on four pillars: (1) Proactive Controls, (2) Powerful Tools to Protect Brands, (3) Holding Bad Actors Accountable, and (4) Protecting and Educating Customers. Amazon OpenSearch Service is a key part of Amazon’s Proactive Controls.

In 2022, Amazon’s automated technology scanned more than 8 billion attempted changes daily to product detail pages for signs of potential abuse. Our proactive controls found more than 99% of blocked or removed listings before a brand ever had to find and report them. These listings were suspected of being fraudulent, infringing, counterfeit, or at risk of other forms of abuse. To perform these scans, Amazon created tooling that uses advanced and innovative techniques, including the use of advanced machine learning models, to automate the detection of intellectual property infringements in listings across Amazon’s stores globally. A key technical challenge in implementing such an automated system is the ability to search for protected intellectual property within a vast billion-vector corpus in a fast, scalable, and cost-effective manner. Leveraging Amazon OpenSearch Service’s scalable vector database capabilities and distributed architecture, we successfully developed an ingestion pipeline that has indexed a total of 68 billion 128- and 1,024-dimension vectors into OpenSearch Service to enable brands and automated systems to conduct infringement detection, in real time, through a highly available and fast (sub-second) search API.

Conclusion

Whether you’re building a generative AI solution, searching rich media and audio, or bringing more semantic search to your existing search-based application, OpenSearch is a capable vector database. OpenSearch supports a variety of engines, algorithms, and distance measures that you can employ to build the right solution. OpenSearch provides a scalable engine that can support vector search at low latency and up to billions of vectors. With OpenSearch and its vector DB capabilities, your users can find that 8-foot-blue couch easily, and relax by a cozy fire.


About the Authors

Jon Handler is a Senior Principal Solutions Architect at Amazon Web Services based in Palo Alto, CA. Jon works closely with OpenSearch and Amazon OpenSearch Service, providing help and guidance to a broad range of customers who have search and log analytics workloads that they want to move to the AWS Cloud. Prior to joining AWS, Jon’s career as a software developer included four years of coding a large-scale, ecommerce search engine. Jon holds a Bachelor of the Arts from the University of Pennsylvania, and a Master of Science and a PhD in Computer Science and Artificial Intelligence from Northwestern University.

Jianwei Li is a Principal Analytics Specialist TAM at Amazon Web Services. Jianwei provides consulting services to customers, helping them design and build modern data platforms. He has worked in the big data domain as a software developer, consultant, and tech leader.

Dylan Tong is a Senior Product Manager at AWS. He works with customers to help drive their success on the AWS platform through thought leadership and guidance on designing well architected solutions. He has spent most of his career building on his expertise in data management and analytics by working for leaders and innovators in the space.

Vamshi Vijay Nakkirtha is a Software Engineering Manager working on the OpenSearch Project and Amazon OpenSearch Service. His primary interests include distributed systems. He is an active contributor to various plugins, like k-NN, GeoSpatial, and dashboard-maps.

Serverless logging with Amazon OpenSearch Service and Amazon Kinesis Data Firehose

Post Syndicated from Jon Handler original https://aws.amazon.com/blogs/big-data/serverless-logging-with-amazon-opensearch-service-and-amazon-kinesis-data-firehose/

In this post, you will learn how you can use Amazon Kinesis Data Firehose to build a log ingestion pipeline to send VPC flow logs to Amazon OpenSearch Serverless. First, you create the OpenSearch Serverless collection you use to store VPC flow logs, then you create a Kinesis Data Firehose delivery pipeline that forwards the flow logs to OpenSearch Serverless. Finally, you enable delivery of VPC flow logs to your Firehose delivery stream. The following diagram illustrates the solution workflow.

OpenSearch Serverless is a new serverless option offered by Amazon OpenSearch Service. OpenSearch Serverless makes it simple to run petabyte-scale search and analytics workloads without having to configure, manage, or scale OpenSearch clusters. OpenSearch Serverless automatically provisions and scales the underlying resources to deliver fast data ingestion and query responses for even the most demanding and unpredictable workloads.

Kinesis Data Firehose is a popular service that delivers streaming data from over 20 AWS services to over 15 analytical and observability tools such as OpenSearch Serverless. Kinesis Data Firehose is great for those looking for a fast and easy way to send your VPC flow logs data to your OpenSearch Serverless collection in minutes without a single line of code and without building or managing your own data ingestion and delivery infrastructure.

VPC flow logs capture the traffic information going to and from the network interfaces in your VPC. With the launch of Kinesis Data Firehose support for OpenSearch Serverless, you can analyze your VPC flow logs with just a few clicks. Kinesis Data Firehose provides a true end-to-end serverless mechanism to deliver your flow logs to OpenSearch Serverless, where you can use OpenSearch Dashboards to search through those logs, create dashboards, detect anomalies, and send alerts. VPC flow logs help you answer questions like:

  • What percentage of your traffic is getting dropped?
  • How much traffic is getting generated for specific sources and destinations?

Create your OpenSearch Serverless collection

To get started, you first create a collection. An OpenSearch Serverless collection is a logical grouping of one or more indexes that represent an analytics workload. Complete the following steps:

  1. On the OpenSearch Service console, choose Collections under Serverless in the navigation pane.
  2. Choose Create a collection.
  3. For Collection name, enter a name (for example, vpc-flow-logs).
  4. For Collection type, choose Time series.
  5. For Encryption, choose your preferred encryption setting:
    1. Choose Use AWS owned key to use an AWS managed key.
    2. Choose a different AWS KMS key to use your own AWS Key Management Service (AWS KMS) key.
  6. For Network access settings, choose your preferred setting:
    1. Choose VPC to use a VPC endpoint.
    2. Choose Public to use a public endpoint.

AWS recommends that you use a VPC endpoint for all production workloads. For this walkthrough, select Public.

  7. Choose Create.

It should take a couple of minutes to create the collection.

The following graphic gives a quick demonstration of creating the OpenSearch Serverless collection via the preceding steps.

At this point, you have successfully created a collection for OpenSearch Serverless. Next, you create a Kinesis Data Firehose delivery stream.

Create a Kinesis Data Firehose delivery stream

To set up a delivery stream for Kinesis Data Firehose, complete the following steps:

  1. On the Kinesis Data Firehose console, choose Create delivery stream.
  2. For Source, specify Direct PUT.

Check out Source, Destination, and Name to learn more about different sources supported by Kinesis Data Firehose.

  3. For Destination, choose Amazon OpenSearch Serverless.
  4. For Delivery stream name, enter a name (for example, vpc-flow-logs).
  5. Under Destination settings, in the OpenSearch Serverless collection settings, choose Browse.
  6. Select vpc-flow-logs.
  7. Choose Choose.

If your collection is still creating, wait a few minutes and try again.

  8. For Index, specify vpc-flow-logs.
  9. In the Backup settings section, select Failed data only for the Source record backup in Amazon S3.

Kinesis Data Firehose uses Amazon Simple Storage Service (Amazon S3) to back up failed data that it attempts to deliver to your chosen destination. If you want to keep all data, select All data.

  10. For S3 Backup Bucket, choose Browse to select an existing S3 bucket, or choose Create to create a new bucket.
  11. Choose Create delivery stream.

The following graphic gives a quick demonstration of creating the Kinesis Data Firehose delivery stream via the preceding steps.

At this point, you have successfully created a delivery stream for Kinesis Data Firehose, which you will use to stream data from your VPC flow logs and send it to your OpenSearch Serverless collection.

Set up the data access policy for your OpenSearch Serverless collection

Before you send any logs to OpenSearch Serverless, you need to create a data access policy within OpenSearch Serverless that allows Kinesis Data Firehose to write to the vpc-flow-logs index in your collection. Complete the following steps:

  1. On the Kinesis Data Firehose console, choose the Configuration tab on the details page for the vpc-flow-logs delivery stream you just created.
  2. In the Permissions section, note down the AWS Identity and Access Management (IAM) role.
  3. Navigate to the vpc-flow-logs collection details page on the OpenSearch Serverless dashboard.
  4. Under Data access, choose Manage data access.
  5. Choose Create access policy.
  6. In the Name and description section, specify an access policy name, add a description, and select JSON as the policy definition method.
  7. Add the following policy in the JSON editor. Provide the collection name and index you specified during the delivery stream creation in the policy. Provide the IAM role name that you got from the permissions page of the Firehose delivery stream, and the account ID for your AWS account.
    [
      {
        "Rules": [
          {
            "ResourceType": "index",
            "Resource": [
              "index/<collection-name>/<index-name>"
            ],
            "Permission": [
              "aoss:WriteDocument",
              "aoss:CreateIndex",
              "aoss:UpdateIndex"
            ]
          }
        ],
        "Principal": [
          "arn:aws:sts::<aws-account-id>:assumed-role/<IAM-role-name>/*"
        ]
      }
    ]

  8. Choose Create.

The following graphic gives a quick demonstration of creating the data access policy via the preceding steps.
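If you script this step instead, save the policy above (with the placeholders filled in) to a file and create it with the AWS CLI; the policy name below is illustrative:

# data-access-policy.json contains the JSON shown above
$ aws opensearchserverless create-access-policy \
    --name firehose-ingest \
    --type data \
    --description "Allow the Firehose delivery role to write to vpc-flow-logs" \
    --policy file://data-access-policy.json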

Set up VPC flow logs

In the final step of this post, you enable flow logs for your VPC with the destination as Kinesis Data Firehose, which sends the data to OpenSearch Serverless.

  1. Navigate to the AWS Management Console.
  2. Search for “VPC” and then choose Your VPCs in the search result (hover over the VPC rectangle to reveal the link).
  3. Choose the VPC ID link for one of your VPCs.
  4. On the Flow Logs tab, choose Create flow log.
  5. For Name, enter a name.
  6. Leave the Filter set to All. You can limit the traffic by selecting Accept or Reject.
  7. Under Destination, select Send to Kinesis Firehose in the same account.
  8. For Kinesis Firehose delivery stream name, choose vpc-flow-logs.
  9. Choose Create flow log.

The following graphic gives a quick demonstration of creating a flow log for your VPC following the preceding steps.
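The equivalent AWS CLI call is a short sketch like the following (the VPC ID, Region, and account ID are placeholders); note that the flow log destination is the delivery stream's ARN:

$ aws ec2 create-flow-logs \
    --resource-type VPC \
    --resource-ids vpc-0123456789abcdef0 \
    --traffic-type ALL \
    --log-destination-type kinesis-data-firehose \
    --log-destination arn:aws:firehose:<region>:<aws-account-id>:deliverystream/vpc-flow-logs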

Examine the VPC flow logs data in your collection using OpenSearch Dashboards

You won’t be able to access your collection data until you configure data access. Data access policies allow users to access the actual data within a collection.

To create a data access policy for OpenSearch Dashboards, complete the following steps:

  1. Navigate to the vpc-flow-logs collection details page on the OpenSearch Serverless dashboard.
  2. Under Data access, choose Manage data access.
  3. Choose Create access policy.
  4. In the Name and description section, specify an access policy name, add a description, and select JSON as the policy definition method.
  5. Add the following policy in the JSON editor. Provide the collection name and index you specified during the delivery stream creation in the policy. Additionally, provide the IAM user and the account ID for your AWS account. Make sure that you have the AWS access key and secret key for the IAM user that you specify as the principal.
    [
      {
        "Rules": [
          {
            "Resource": [
              "index/<collection-name>/<index-name>"
            ],
            "Permission": [
              "aoss:ReadDocument"
            ],
            "ResourceType": "index"
          }
        ],
        "Principal": [
          "arn:aws:iam::<aws-account-id>:user/<IAM-user-name>"
        ]
      }
    ]

  6. Choose Create.
  7. Navigate to OpenSearch Serverless and choose the collection you created (vpc-flow-logs).
  8. Choose the OpenSearch Dashboards URL and log in with your IAM access key and secret key for the user you specified under Principal.
  9. Navigate to dev tools within OpenSearch Dashboards and run the following query to retrieve the VPC flow logs for your VPC:
    GET <index-name>/_search
    {
      "query": {
        "match_all": {}
      }
    }

The query returns the data as shown in the following screenshot, which contains information such as account ID, interface ID, source IP address, destination IP address, and more.
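To go beyond match_all, you can aggregate on the flow log fields to answer the earlier question about how much traffic is getting dropped. The following sketch assumes the default flow log record format, in which the action field holds ACCEPT or REJECT; depending on your index mapping, you may need to aggregate on action.keyword instead:

    GET <index-name>/_search
    {
      "size": 0,
      "aggs": {
        "traffic_by_action": {
          "terms": {
            "field": "action"
          }
        }
      }
    }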

Create dashboards

After the data is flowing into OpenSearch Serverless, you can easily create dashboards to monitor the activity in your VPC. The following example dashboard shows overall traffic, accepted and rejected traffic, bytes transmitted, and some charts with the top sources and destinations.

Clean up

If you don’t want to continue using the solution, be sure to delete the resources you created:

  1. Return to the VPC console and delete the flow log you created for your VPC.
  2. In the OpenSearch Serverless dashboard, delete your vpc-flow-logs collection.
  3. On the Kinesis Data Firehose console, delete your vpc-flow-logs delivery stream.

Conclusion

In this post, you created an end-to-end serverless pipeline to deliver your VPC flow logs to OpenSearch Serverless using Kinesis Data Firehose. In this example, you built a delivery pipeline for your VPC flow logs, but you can also use Kinesis Data Firehose to deliver logs from Amazon Kinesis Data Streams and Amazon CloudWatch to OpenSearch Serverless collections and run analytics on those logs. With serverless solutions on AWS, you can focus on your application development rather than worrying about the ingestion pipeline and tools to visualize your logs.

Get hands-on with OpenSearch Serverless by taking the Getting Started with Amazon OpenSearch Serverless workshop and check out other pipelines for analyzing your logs.

If you have feedback about this post, share it in the comments section. If you have questions about this post, start a new thread on the Amazon OpenSearch Service forum or contact AWS Support.


About the authors

Jon Handler (@_searchgeek) is a Principal Solutions Architect at Amazon Web Services based in Palo Alto, CA. Jon works closely with the CloudSearch and Elasticsearch teams, providing help and guidance to a broad range of customers who have search workloads that they want to move to the AWS Cloud. Prior to joining AWS, Jon’s career as a software developer included four years of coding a large-scale, eCommerce search engine.

Prashant Agrawal is a Sr. Search Specialist Solutions Architect with Amazon OpenSearch Service. He works closely with customers to help them migrate their workloads to the cloud and helps existing customers fine-tune their clusters to achieve better performance and save on cost. Before joining AWS, he helped various customers use OpenSearch and Elasticsearch for their search and log analytics use cases. When not working, you can find him traveling and exploring new places. In short, he likes doing Eat → Travel → Repeat.

Monitor your Amazon ES domains with Amazon Elasticsearch Service Monitor

Post Syndicated from Jon Handler original https://aws.amazon.com/blogs/big-data/monitor-your-amazon-es-domains-with-amazon-elasticsearch-service-monitor/

Amazon Elasticsearch Service (Amazon ES) is a fully managed service that you can use to deploy, secure, and run Elasticsearch cost-effectively at scale. The service provides support for open-source Elasticsearch APIs, managed Kibana, and integration with Logstash and other AWS services.

Amazon ES provides a wealth of information about your domain, surfaced through Amazon CloudWatch metrics (for more information, see Instance metrics). Your domain’s dashboard on the AWS Management Console collects key metrics and provides a view of what’s going on with that domain. This view is limited to that single domain, and for a subset of the available metrics. What if you’re running many domains? How can you see all their metrics in one place? You can set CloudWatch alarms at the single domain level, but what about anomaly detection and centralized alerting?

In this post, we detail Amazon Elasticsearch Service Monitor, an open-source monitoring solution for all the domains in your account, across all Regions, backed by a set of AWS CloudFormation templates delivered through the AWS Cloud Development Kit (AWS CDK). The templates deploy an Amazon ES domain in a VPC, an Nginx proxy for Kibana access, and an AWS Lambda function. The function is invoked by CloudWatch Events to pull metrics from all your Amazon ES domains and send them to the previously created monitoring domain for your review.

Your Amazon ES monitoring domain is an ideal way to monitor your Amazon ES infrastructure. We provide dashboards at the account and individual domain level. We also provide basic alerts that you can use as a template to build your own alerting solution.

Prerequisites

To bootstrap the solution, you need a few tools in your development environment:

  • An AWS account, with credentials configured for the AWS Command Line Interface (AWS CLI)
  • Git, to clone the solution repository
  • Python 3 with pip and venv, which the bootstrap script uses to create its virtual environment
  • Node.js and npm, to install and run the AWS CDK Toolkit

Create and deploy the AWS CDK monitoring tool

Complete the following steps to set up the AWS CDK monitoring tool in your environment. Depending on your operating system, the commands may differ. This walkthrough uses Linux and bash.

Clone the code from the GitHub repo:

# clone the repo
$ git clone https://github.com/aws-samples/amazon-elasticsearch-service-monitor.git
# move to directory
$ cd amazon-elasticsearch-service-monitor

We provide a bash bootstrap script to prepare your environment for running the AWS CDK and deploying the architecture. The bootstrap.sh script is in the amazon-elasticsearch-service-monitor directory. The script creates a Python virtual environment and downloads some further dependencies. It creates an Amazon Elastic Compute Cloud (Amazon EC2) key pair to facilitate accessing Kibana, then adds that key pair to your local SSH setup. Finally, it prompts for an email address where the stack sends alerts. You can edit email_default in the script or enter it at the command line when you run the script. See the following code:

$ bash bootstrap.sh
Collecting astroid==2.4.2
  Using cached astroid-2.4.2-py3-none-any.whl (213 kB)
Collecting attrs==20.3.0
  Using cached attrs-20.3.0-py2.py3-none-any.whl (49 kB)

After the script is complete, enter the Python virtual environment:

$ source .env/bin/activate
(.env) $

Bootstrap the AWS CDK

The AWS CDK creates resources in your AWS account to enable it to track your deployments. You bootstrap the AWS CDK with the bootstrap command:

# bootstrap the cdk
(.env) $ cdk bootstrap aws://yourAccountID/yourRegion

Deploy the architecture

The monitoring_cdk directory collects all the components that enable the AWS CDK to deploy the following architecture.

You can review amazon-elasticsearch-service-monitor/monitoring_cdk/monitoring_cdk_stack.py for further details.

The architecture has the following components:

  • An Amazon Virtual Private Cloud (Amazon VPC) spanning two Amazon EC2 Availability Zones.
  • An Amazon ES cluster with two t3.medium data nodes, one in each Availability Zone, with 100 GB of EBS storage.
  • An Amazon DynamoDB table for tracking the timestamp for the last pull from CloudWatch.
  • A Lambda function to fetch CloudWatch metrics across all Regions and all domains; see the example call after this list. By default, it fetches the data every 5 minutes, which you can change if needed.
  • An EC2 instance that acts as an SSH tunnel to access Kibana, because our setup is secured and in a VPC.
  • A default Kibana dashboard to visualize metrics across all domains.
  • Default email alerts to the newly launched Amazon ES cluster.
  • An index template and Index State Management (ISM) policy to delete indexes older than 366 days. (You can change this to a different retention period if needed.)
  • A monitoring stack with the option to enable UltraWarm (UW), which is disabled by default. You can change the settings in the monitoring_cdk_stack.py file to enable UW.
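As an illustration of the kind of request the Lambda function issues, the following AWS CLI call pulls 5-minute maximums for one metric of one domain (the domain name, account ID, time range, and Region are placeholders):

$ aws cloudwatch get-metric-statistics \
    --namespace AWS/ES \
    --metric-name ClusterStatus.red \
    --dimensions Name=DomainName,Value=<domain-name> Name=ClientId,Value=<aws-account-id> \
    --start-time 2021-05-01T00:00:00Z \
    --end-time 2021-05-01T01:00:00Z \
    --period 300 \
    --statistics Maximum \
    --region us-east-1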

The monitoring_cdk_stack.py file contains several constants at the top that let you control the domain configuration, its sizing, and the Regions to monitor. It also specifies the username and password for the admin user of your domain. You should edit and replace those constants with your own values.

For example, the following code indicates which Regions to monitor:

REGIONS_TO_MONITOR='["us-east-1", "us-east-2", "us-west-1", "us-west-2", "af-south-1", "ap-east-1", "ap-south-1", "ap-northeast-1", "ap-northeast-2", "ap-southeast-1", "ap-southeast-2", "ca-central-1", "eu-central-1", "eu-west-1", "eu-west-2", "eu-west-3", "eu-north-1", "eu-south-1", "me-south-1", "sa-east-1"]'

Run the following command:

(.env)$ cdk deploy

The AWS CDK prompts you to apply security changes; enter y for yes.

After the app is deployed, you get the Kibana URL, user, and password to access Kibana. After you log in, use the following sections to navigate around dashboards and alerts.

After the stack is deployed, you receive an email to confirm the subscription; make sure to confirm the email to start getting the alerts.

Pre-built monitoring dashboards

The monitoring tool comes with pre-built dashboards. To access them, complete the following steps:

  1. Navigate to the IP obtained after deployment.
  2. Log in to Kibana.
    Be sure to use the endpoint you received, provided as an output from the cdk deploy command.
  3. In the navigation pane, choose Dashboard.

The Dashboards page displays the default dashboards.

The Domain Metrics At A glance dashboard gives a 360-degree view of all Amazon ES domains across Regions.

The Domain Overview dashboard gives more detailed metrics for a particular domain, to help you deep dive into issues in a specific domain.

Pre-built alerts

The monitoring framework comes with pre-built alerts, as summarized in the following table. These alerts notify you on key resources like CPU, disk space, and JVM. We also provide alerts for cluster status, snapshot failures, and more. You can use the following alerts as a template to create your own alerts and monitoring for search and indexing latencies and volumes, for example.

Alert Type                      Frequency
Cluster Health – Red            5 Min
Cluster Index Writes Blocked    5 Min
Automated Snapshot Failure      5 Min
JVM Memory Pressure > 80%       5 Min
CPU Utilization > 80%           15 Min
No Kibana Healthy Nodes         15 Min
Invalid Host Header Requests    15 Min
Cluster Health – Yellow         30 Min

Clean up

To clean up the stacks, destroy the monitoring-cdk stack; all other stacks are torn down due to dependencies:

# Enter into python virtual environment
$ source .env/bin/activate
(.env)$ cdk destroy

CloudWatch logs need to be removed separately.
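For example, you can find and remove the log groups that the stack's Lambda function created with the AWS CLI (the log group name below is a placeholder; list your log groups first to confirm the actual names):

# List candidate log groups created by Lambda, then delete the stack's
$ aws logs describe-log-groups --log-group-name-prefix /aws/lambda/
$ aws logs delete-log-group --log-group-name /aws/lambda/<function-name>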

Pricing

Running this solution incurs charges of less than $10 per day for one domain, with an additional $2 per day for each additional domain.

Conclusion

In this post, we discussed Amazon Elasticsearch Service Monitor, an open-source monitoring solution for all the domains in your account, across all Regions. Amazon ES monitoring domains are an ideal way to monitor your Amazon ES infrastructure. Try it out and leave your thoughts in the comments.


About the Authors

Jon Handler (@_searchgeek) is a Principal Solutions Architect at Amazon Web Services based in Palo Alto, CA. Jon works closely with the CloudSearch and Elasticsearch teams, providing help and guidance to a broad range of customers who have search workloads that they want to move to the AWS Cloud. Prior to joining AWS, Jon’s career as a software developer included four years of coding a large-scale, eCommerce search engine.

Prashant Agrawal is a Specialist Solutions Architect at Amazon Web Services based in Seattle, WA. Prashant works closely with the Amazon Elasticsearch team, helping customers migrate their workloads to the AWS Cloud. Before joining AWS, Prashant helped various customers use Elasticsearch for their search and analytics use cases.

Get started with fine-grained access control in Amazon Elasticsearch Service

Post Syndicated from Jon Handler original https://aws.amazon.com/blogs/security/get-started-with-fine-grained-access-control-in-amazon-elasticsearch-service/

Amazon Elasticsearch Service (Amazon ES) provides fine-grained access control, powered by the Open Distro for Elasticsearch security plugin. The security plugin adds Kibana authentication and access control at the cluster, index, document, and field levels that can help you secure your data. You now have many different ways to configure your Amazon ES domain to provide access control. In this post, I offer basic configuration information to get you started.

Figure 1: A high-level view of data flow and security

Figure 1 details the authentication and access control provided in Amazon ES. The left half of the diagram details the different methods of authenticating. Looking horizontally, requests originate either from Kibana or from direct access to the REST API. When using Kibana, you can use a login screen powered by the Open Distro security plugin, your SAML identity provider, or Amazon Cognito. Each of these methods results in an authenticated identity: SAML providers via the SAML response, Amazon Cognito via an AWS Identity and Access Management (IAM) identity, and Open Distro via an internal user identity. When you use the REST API, you can use AWS Signature V4 request signing (SigV4 signing), or user name and password authentication. You can also send unauthenticated traffic, but your domain should be configured to reject all such traffic.

The right side of the diagram details the access control points. To better understand the handling of access control, consider it in two phases: authentication at the edge by IAM, and authentication and access control within the Amazon ES domain by the Open Distro security plugin.

First, requests from Kibana or direct API calls have to reach your domain endpoint. If you follow best practices and the domain is in an Amazon Virtual Private Cloud (VPC), you can use Amazon Elastic Compute Cloud (Amazon EC2) security groups to allow or deny traffic based on the originating IP address or security group of the Amazon EC2 instances. Best practice includes least privilege based on subnet ACLs and security group ingress and egress restrictions. In this post, we assume that your requests are legitimate, meet your access control criteria, and can reach your domain.
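As a sketch of the least-privilege security group ingress mentioned above, the following AWS CLI call allows HTTPS traffic to a domain's security group only from a single application subnet (the group ID and CIDR are placeholders):

$ aws ec2 authorize-security-group-ingress \
    --group-id sg-0123456789abcdef0 \
    --protocol tcp \
    --port 443 \
    --cidr 10.0.1.0/24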

When a request reaches the domain endpoint—the edge of your domain—it can be anonymous or it can carry identity and authentication information as described previously. Each Amazon ES domain carries a resource-based IAM policy. With this policy, you can allow or deny traffic based on an IAM identity attached to the request. When your policy specifies an IAM principal, Amazon ES evaluates the request against the allowed Actions in the policy and allows or denies the request. If you don’t have an IAM identity attached to the request (SAML assertion, or user name and password), you should leave the domain policy open and pass traffic through to fine-grained access control in Amazon ES without any checks. You should employ IAM security best practices and add additional IAM restrictions for direct-to-API access control once your domain is set up.

The Open Distro for Elasticsearch security plugin has its own internal user database for user name and password authentication and handles access control for all users. When traffic reaches the Elasticsearch cluster, the plugin validates any user name and password authentication information against this internal database to identify the user and grant a set of permissions. If a request comes with identity information from either SAML or an IAM role, you map that backend role onto the roles or users that you have created in Open Distro security.
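For example, after you log in to Kibana as the master user, you can perform this mapping through the Open Distro security plugin's REST API. The following Dev Tools sketch maps a hypothetical IAM role onto the plugin's built-in all_access role (the role ARN is a placeholder):

    PUT _opendistro/_security/api/rolesmapping/all_access
    {
      "backend_roles": [
        "arn:aws:iam::<aws-account-id>:role/<your-admin-role>"
      ]
    }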

Amazon ES documentation and the Open Distro for Elasticsearch documentation give more information on all of these points. For this post, I walk through a basic console setup for a new domain.

Console set up

The Amazon ES console provides a guided wizard that lets you configure—and reconfigure—your Amazon ES domain. Step 1 offers you the opportunity to select some predefined configurations that carry through the wizard. In step 2, you choose the instances to deploy in your domain. In Step 3, you configure the security. This post focuses on step 3. See also these tutorials that explain using an IAM master user and using an HTTP-authenticated master user.

Note: At the time of writing, you cannot enable fine-grained access control on existing domains; you must create a new domain and enable the feature at domain creation time. You can use fine-grained access control with Elasticsearch versions 6.8 and later.

Set your endpoint

Amazon ES gives you a DNS name that resolves to an IP address that you use to send traffic to the Elasticsearch cluster in the domain. The IP address can be in the IP space of the public internet, or it can resolve to an IP address in your VPC. While—with fine-grained access control—you have the means of securing your cluster even when the endpoint is a public IP address, we recommend using VPC access as the more secure option, as shown in Figure 2.

Figure 2: Select VPC access

With the endpoint in your VPC, you use security groups to control which ports accept traffic and limit access to the endpoints of your Amazon ES domain to IP addresses in your VPC. Make sure to use least privilege when setting up security group access.

Enable fine-grained access control

You should enable fine-grained access control, as shown in Figure 3.

Figure 3: Enabled fine-grained access control

Set up the master user

The master user is the administrator identity for your Amazon ES domain. This user can set up additional users in the Amazon ES security plugin, assign roles to them, and assign permissions for those roles. You can choose user name and password authentication for the master user, or use an IAM identity. User name and password authentication, shown in Figure 4, is simpler to set up and—with a strong password—may provide sufficient security depending on your use case. We recommend you follow your organization’s policy for password length and complexity. If you lose this password, you can return to the domain’s dashboard in the AWS Management Console and reset it. You’ll use these credentials to log in to Kibana. Following best practices on choosing your master user, you should move to an IAM master user once setup is complete.

Note: Password strength is a function of length, complexity of characters (e.g., upper and lower case letters, numbers, and special characters), and unpredictability to decrease the likelihood the password could be guessed or cracked over a period of time.

 

Figure 4: Setting up the master username and password

Do not enable Amazon Cognito authentication

When you use Kibana, Amazon ES includes a login experience. You currently have three choices for the source of the login screen:

  1. The Open Distro security plugin
  2. Amazon Cognito
  3. Your SAML-compliant system

You can apply fine-grained access control regardless of how you log in. However, setting up fine-grained access control for the master user and additional users is most straightforward if you use the login experience provided by the Open Distro security plugin. After your first login, and when you have set up additional users, you should migrate to either Cognito or SAML for login, taking advantage of the additional security they offer. To use the Open Distro login experience, disable Amazon Cognito authentication, as shown in Figure 5.

Figure 5: Amazon Cognito authentication is not enabled

If you plan to integrate with your SAML identity provider, check the Prepare SAML authentication box, as shown in Figure 6. You will complete the setup when the domain is active.

Figure 6: Choose Prepare SAML authentication if you plan to use it

Use an open access policy

When you create your domain, you attach an IAM policy to it that controls whether your traffic must be signed with AWS SigV4 request signing for authentication. Policies that specify an IAM principal require that you use AWS SigV4 signing to authenticate those requests. The domain sends your traffic to IAM, which authenticates signed requests to resolve the user or role that sent the traffic. The domain and IAM apply the policy’s access controls and either accept or reject the traffic based on the Actions it allows. This is done down to the index level for single-index API calls.

When you use fine-grained access control, your traffic is also authenticated by the Amazon ES security plugin, which makes the IAM authentication redundant. Create an open access policy, as shown in Figure 7, which doesn’t specify a principal and so doesn’t require request signing. This may be acceptable, since you can choose to require an authenticated identity on all traffic. The security plugin authenticates the traffic as above, providing access control based on the internal database.
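An open access policy looks like the following (the Region, account ID, and domain name are placeholders). Because it doesn't name an IAM principal, requests don't need to be signed:

    {
      "Version": "2012-10-17",
      "Statement": [
        {
          "Effect": "Allow",
          "Principal": {
            "AWS": "*"
          },
          "Action": "es:ESHttp*",
          "Resource": "arn:aws:es:<region>:<aws-account-id>:domain/<domain-name>/*"
        }
      ]
    }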

Figure 7: Selected open access policy

Encrypted data

Amazon ES provides an option to encrypt data in transit and at rest for any domain. When you enable fine-grained access control, encryption is required: the corresponding checkboxes are automatically selected and can't be changed. These include Transport Layer Security (TLS) for requests to the domain and for traffic between nodes in the domain, and encryption of data at rest through AWS Key Management Service (AWS KMS), as shown in Figure 8.

Figure 8: Enabled encryption

Accessing Kibana

When you complete the domain creation wizard, it takes about 10 minutes for your domain to activate. Return to the console and the Overview tab of your Amazon ES dashboard. When the Domain Status is Active, select the Kibana URL. Since you created your domain in your VPC, you must be able to access the Kibana endpoint via proxy, VPN, SSH tunnel, or similar. Use the master user name and password that you configured earlier to log in to Kibana, as shown in Figure 9. As detailed above, you should only ever log in as the master user to set up additional users—administrators, users with read-only access, and others.
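One common pattern is an SSH tunnel through a bastion host in the same VPC. The following sketch forwards a local port to the domain's Kibana endpoint (the key file, domain endpoint, and bastion host are placeholders):

# Forward local port 9200 to the VPC domain endpoint through a bastion host
$ ssh -i ~/.ssh/<bastion-key>.pem -N \
    -L 9200:vpc-<domain-endpoint>.<region>.es.amazonaws.com:443 \
    ec2-user@<bastion-public-dns>

# Then browse to https://localhost:9200/_plugin/kibana/ and log in;
# expect a certificate warning because of the host name mismatch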

Figure 9: Kibana login page

Conclusion

Congratulations, you now know the basic steps to set up the minimum configuration to access your Amazon ES domain with a master user. You can examine the settings for fine-grained access control in the Kibana console Security tab. Here, you can add additional users, assign permissions, map IAM users to security roles, and set up your Kibana tenancy. We’ll cover those topics in future posts.

If you have feedback about this post, submit comments in the Comments section below. If you have questions about this post, start a new thread on the Amazon Elasticsearch Service forum or contact AWS Support.

Want more AWS Security how-to content, news, and feature announcements? Follow us on Twitter.

Author

Jon Handler

Jon is a Principal Solutions Architect at AWS. He works closely with the CloudSearch and Elasticsearch teams, providing help and guidance to a broad range of customers who have search workloads that they want to move to the AWS Cloud. Prior to joining AWS, Jon’s career as a software developer included four years of coding a large-scale, eCommerce search engine. Jon holds a Bachelor of the Arts from the University of Pennsylvania, and a Master of Science and a Ph.D. in Computer Science and Artificial Intelligence from Northwestern University.

Author

Sajeev Attiyil Bhaskaran

Sajeev is a Senior Cloud Engineer focused on big data and analytics. He works with AWS customers to provide architectural and engineering assistance and guidance. He dives deep into big data technologies and streaming solutions. He also does onsite and online sessions for customers to design best solutions for their use cases. In his free time, he enjoys spending time with his wife and daughter.