
Amazon OpenSearch Service’s vector database capabilities explained

2023-06-22 Jon Handler

Post Syndicated from Jon Handler original https://aws.amazon.com/blogs/big-data/amazon-opensearch-services-vector-database-capabilities-explained/

OpenSearch is a scalable, flexible, and extensible open-source software suite for search, analytics, security monitoring, and observability applications, licensed under the Apache 2.0 license. It comprises a search engine, OpenSearch, which delivers low-latency search and aggregations, OpenSearch Dashboards, a visualization and dashboarding tool, and a suite of plugins that provide advanced capabilities like alerting, fine-grained access control, observability, security monitoring, and vector storage and processing. Amazon OpenSearch Service is a fully managed service that makes it simple to deploy, scale, and operate OpenSearch in the AWS Cloud.

As an end-user, when you use OpenSearch’s search capabilities, you generally have a goal in mind—something you want to accomplish. Along the way, you use OpenSearch to gather information in support of achieving that goal (or maybe the information is the original goal). We’ve all become used to the “search box” interface, where you type some words, and the search engine brings back results based on word-to-word matching. Let’s say you want to buy a couch in order to spend cozy evenings with your family around the fire. You go to Amazon.com, and you type “a cozy place to sit by the fire.” Unfortunately, if you run that search on Amazon.com, you get items like fire pits, heating fans, and home decorations—not what you intended. The problem is that couch manufacturers probably didn’t use the words “cozy,” “place,” “sit,” and “fire” in their product titles or descriptions.

In recent years, machine learning (ML) techniques have become increasingly popular to enhance search. Among them is the use of embedding models, a type of model that can encode a large body of data into an n-dimensional space where each entity is encoded into a vector, a data point in that space, and organized such that similar entities are closer together. An embedding model, for instance, could encode the semantics of a corpus. By searching for the vectors nearest to an encoded document, a technique known as k-nearest neighbor (k-NN) search, you can find the most semantically similar documents. Sophisticated embedding models can support multiple modalities, for instance, encoding the image and text of a product catalog and enabling similarity matching on both modalities.
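To make the idea concrete, here is a minimal sketch of an exact nearest-neighbor lookup over a handful of vectors. The three-dimensional embeddings and the product names are made up for illustration; a real embedding model would produce vectors with hundreds of dimensions.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def nearest_neighbors(query, corpus, k):
    """Return the k ids whose vectors are most similar to the query vector."""
    ranked = sorted(
        corpus,
        key=lambda doc_id: cosine_similarity(query, corpus[doc_id]),
        reverse=True,
    )
    return ranked[:k]

# Hypothetical embeddings, invented for this example.
catalog = {
    "blue couch": [0.9, 0.8, 0.1],
    "armchair": [0.8, 0.6, 0.2],
    "fire pit": [0.1, 0.9, 0.7],
    "heating fan": [0.0, 0.3, 0.9],
}
query_vector = [0.85, 0.75, 0.1]  # stand-in for an encoded search phrase

print(nearest_neighbors(query_vector, catalog, k=2))
```

Because the query vector points in nearly the same direction as the two seating products, they rank ahead of the fire pit and the fan even though no words are compared.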

A vector database delivers efficient vector similarity search through specialized indexes like k-NN indexes. It also provides other database functionality like managing vector data alongside other data types, workload management, access control, and more. OpenSearch’s k-NN plugin provides core vector database functionality for OpenSearch, so when your customer searches for “a cozy place to sit by the fire” in your catalog, you can encode that prompt and use OpenSearch to perform a nearest neighbor query to surface that 8-foot, blue couch with designer-arranged photographs in front of fireplaces.

Using OpenSearch Service as a vector database

With OpenSearch Service’s vector database capabilities, you can implement semantic search, Retrieval Augmented Generation (RAG) with LLMs, recommendation engines, and search rich media.

Semantic search

With semantic search, you improve the relevance of retrieved results using language-based embeddings on search documents. You enable your search customers to use natural language queries, like “a cozy place to sit by the fire” to find their 8-foot-long blue couch. For more information, refer to Building a semantic search engine in OpenSearch to learn how semantic search can deliver a 15% relevance improvement, as measured by normalized discounted cumulative gain (nDCG) metrics compared with keyword search. For a concrete example, our Improve search relevance with ML in Amazon OpenSearch Service workshop explores the difference between keyword and semantic search, based on a Bidirectional Encoder Representations from Transformers (BERT) model, hosted by Amazon SageMaker to generate vectors and store them in OpenSearch. The workshop uses product question answers as an example to show how keyword search using the keywords/phrases of the query leads to some irrelevant results. Semantic search is able to retrieve more relevant documents by matching the context and semantics of the query. The following diagram shows an example architecture for a semantic search application with OpenSearch Service as the vector database.

Architecture diagram showing how to use Amazon OpenSearch Service to perform semantic search to improve relevance

Retrieval Augmented Generation with LLMs

RAG is a method for building trustworthy generative AI chatbots using generative LLMs like OpenAI’s ChatGPT or Amazon Titan Text. With the rise of generative LLMs, application developers are looking for ways to take advantage of this innovative technology. One popular use case involves delivering conversational experiences through intelligent agents. Perhaps you’re a software provider with knowledge bases for product information, customer self-service, or industry domain knowledge like tax reporting rules or medical information about diseases and treatments. A conversational search experience provides an intuitive interface for users to sift through information through dialog and Q&A. Generative LLMs on their own are prone to hallucinations, in which the model generates a believable but factually incorrect response. RAG mitigates this problem by complementing generative LLMs with an external knowledge base that is typically built using a vector database hydrated with vector-encoded knowledge articles.

As illustrated in the following diagram, the query workflow starts with a question that is encoded and used to retrieve relevant knowledge articles from the vector database. Those results are sent to the generative LLM whose job is to augment those results, typically by summarizing the results as a conversational response. By complementing the generative model with a knowledge base, RAG grounds the model on facts to minimize hallucinations. You can learn more about building a RAG solution in the Retrieval Augmented Generation module of our semantic search workshop.
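The retrieval half of that workflow can be sketched in a few lines. This is an illustration under stated assumptions, not the workshop’s implementation: the knowledge articles, their two-dimensional vectors, and the prompt wording are all invented, and the call to the generative LLM is left out because it depends on the model you choose.

```python
def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def retrieve(question_vector, knowledge_base, k=2):
    """Return the texts of the k articles that score highest against the question."""
    ranked = sorted(
        knowledge_base,
        key=lambda article: dot(question_vector, article["vector"]),
        reverse=True,
    )
    return [article["text"] for article in ranked[:k]]

def build_prompt(question, passages):
    """Assemble a grounded prompt; the wording here is only an example."""
    context = "\n".join(f"- {passage}" for passage in passages)
    return (
        "Answer using only the context below.\n"
        f"Context:\n{context}\n"
        f"Question: {question}"
    )

# Hypothetical knowledge base with made-up vectors.
knowledge_base = [
    {"text": "Returns are accepted within 30 days.", "vector": [0.9, 0.1]},
    {"text": "Couches ship in two boxes.", "vector": [0.2, 0.9]},
    {"text": "Refunds are issued to the original payment method.", "vector": [0.8, 0.3]},
]
question = "How do I return my couch?"
question_vector = [1.0, 0.2]  # stand-in for the encoded question

prompt = build_prompt(question, retrieve(question_vector, knowledge_base))
print(prompt)
```

The resulting prompt is what you would hand to the generative model, so its answer is drawn from retrieved articles rather than from the model’s memory alone.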

Architecture diagram showing how to use Amazon OpenSearch Service to perform retrieval-augmented generation

Recommendation engine

Recommendations are a common component in the search experience, especially for ecommerce applications. Adding a user experience feature like “more like this” or “customers who bought this also bought that” can drive additional revenue through getting customers what they want. Search architects employ many techniques and technologies to build recommendations, including Deep Neural Network (DNN) based recommendation algorithms such as the two-tower neural net model, YoutubeDNN. A trained embedding model encodes products, for example, into an embedding space where products that are frequently bought together are considered more similar, and therefore are represented as data points that are closer together in the embedding space. Another possibility is that product embeddings are based on co-rating similarity instead of purchase activity. You can employ this affinity data through calculating the vector similarity between a particular user’s embedding and vectors in the database to return recommended items. The following diagram shows an example architecture of building a recommendation engine with OpenSearch as a vector store.

Architecture diagram showing how to use Amazon OpenSearch Service as a recommendation engine

Media search

Media search enables users to query the search engine with rich media like images, audio, and video. Its implementation is similar to semantic search: you create vector embeddings for your search documents and then query OpenSearch Service with a vector. The difference is that you use a computer vision deep neural network, for example a convolutional neural network (CNN) such as ResNet, to convert images into vectors. The following diagram shows an example architecture of building an image search with OpenSearch as the vector store.

Architecture diagram showing how to use Amazon OpenSearch Service to search rich media like images, videos, and audio files

Understanding the technology

OpenSearch uses approximate nearest neighbor (ANN) algorithms from the NMSLIB, FAISS, and Lucene libraries to power k-NN search. These search methods employ ANN to improve search latency for large datasets. Of the three search methods the k-NN plugin provides (approximate k-NN, score script, and painless extensions), approximate k-NN offers the best search scalability for large datasets. The engine details are as follows:

Non-Metric Space Library (NMSLIB) – NMSLIB implements the HNSW ANN algorithm
Facebook AI Similarity Search (FAISS) – FAISS implements both HNSW and IVF ANN algorithms
Lucene – Lucene implements the HNSW algorithm

Each of the three engines used for approximate k-NN search has its own attributes that make one more sensible to use than the others in a given situation. You can follow the general information in this section to help determine which engine will best meet your requirements.

In general, NMSLIB and FAISS should be selected for large-scale use cases. Lucene is a good option for smaller deployments and offers benefits like smart filtering, where the optimal filtering strategy (pre-filtering, post-filtering, or exact k-NN) is automatically applied depending on the situation. The following table summarizes the differences between each option.

Attribute	NMSLIB-HNSW	FAISS-HNSW	FAISS-IVF	Lucene-HNSW
Max Dimension	16,000	16,000	16,000	1,024
Filter	Post filter	Post filter	Post filter	Filter while search
Training Required	No	No	Yes	No
Similarity Metrics	l2, innerproduct, cosinesimil, l1, linf	l2, innerproduct	l2, innerproduct	l2, cosinesimil
Vector Volume	Tens of billions	Tens of billions	Tens of billions	< Ten million
Indexing Latency	Low	Low	Lowest	Low
Query Latency & Quality	Low latency & high quality	Low latency & high quality	Low latency & low quality	High latency & high quality
Vector Compression	Flat	Flat, Product Quantization	Flat, Product Quantization	Flat
Memory Consumption	High	High (low with PQ)	Medium (low with PQ)	High
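The engine is chosen per field when you create the index. The following sketch builds the request body for an index with one knn_vector field that uses HNSW. The field name, the dimension, and the parameter values are placeholders chosen for illustration, and sending the body to your domain (for example, with an HTTP PUT to the index name) is left out.

```python
import json

def knn_index_body(field, dimension, engine="nmslib", space_type="l2",
                   m=16, ef_construction=128):
    """Build a create-index body with a single HNSW-backed knn_vector field."""
    return {
        "settings": {"index": {"knn": True}},
        "mappings": {
            "properties": {
                field: {
                    "type": "knn_vector",
                    "dimension": dimension,
                    "method": {
                        "name": "hnsw",
                        "engine": engine,          # "nmslib", "faiss", or "lucene"
                        "space_type": space_type,  # must be a metric the engine supports
                        "parameters": {"m": m, "ef_construction": ef_construction},
                    },
                }
            }
        },
    }

# "product_vector" and the 384 dimensions are hypothetical values.
body = knn_index_body("product_vector", dimension=384, engine="lucene",
                      space_type="cosinesimil")
print(json.dumps(body, indent=2))
```

Switching engines is a one-word change in this body, but the space type has to be one the chosen engine supports, as the Similarity Metrics row of the table shows.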

Approximate and exact nearest-neighbor search

The OpenSearch Service k-NN plugin supports three different methods for obtaining the k-nearest neighbors from an index of vectors: approximate k-NN, score script (exact k-NN), and painless extensions (exact k-NN).

Approximate k-NN

The first method takes an approximate nearest neighbor approach—it uses one of several algorithms to return the approximate k-nearest neighbors to a query vector. Usually, these algorithms sacrifice indexing speed and search accuracy in return for performance benefits such as lower latency, smaller memory footprints, and more scalable search. Approximate k-NN is the best choice for searches over large indexes (that is, hundreds of thousands of vectors or more) that require low latency. You should not use approximate k-NN if you want to apply a filter on the index before the k-NN search, which greatly reduces the number of vectors to be searched. In this case, you should use either the score script method or painless extensions.
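An approximate k-NN search is expressed with the knn query clause. The sketch below only builds the request body; the field name and the vector are placeholders, and the body would be sent to the index’s _search endpoint.

```python
import json

def approximate_knn_query(field, vector, k):
    """Build a search body that asks for the k approximate nearest neighbors."""
    return {
        "size": k,
        "query": {
            "knn": {
                field: {"vector": vector, "k": k}
            }
        },
    }

# Hypothetical field name and query vector.
query = approximate_knn_query("product_vector", [0.85, 0.75, 0.1], k=5)
print(json.dumps(query))
```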

Score script

The second method extends the OpenSearch Service score script functionality to run a brute force, exact k-NN search over knn_vector fields or fields that can represent binary objects. With this approach, you can run k-NN search on a subset of vectors in your index (sometimes referred to as a pre-filter search). This approach is preferred for searches over smaller bodies of documents or when a pre-filter is needed. Using this approach on large indexes may lead to high latencies.
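A score script query with a pre-filter might look like the following sketch. The knn_score script source and the knn script language come from the k-NN plugin; the field names, the term filter on a color field, and the vector are assumptions made up for this example.

```python
import json

def score_script_query(field, vector, space_type, pre_filter, size):
    """Build an exact k-NN search body that scores only documents matching pre_filter."""
    return {
        "size": size,
        "query": {
            "script_score": {
                # Only documents matching this query are scored (the pre-filter).
                "query": {"bool": {"filter": pre_filter}},
                "script": {
                    "source": "knn_score",
                    "lang": "knn",
                    "params": {
                        "field": field,
                        "query_value": vector,
                        "space_type": space_type,
                    },
                },
            }
        },
    }

# Hypothetical filter: restrict the exact search to blue products.
query = score_script_query(
    field="product_vector",
    vector=[0.85, 0.75, 0.1],
    space_type="cosinesimil",
    pre_filter={"term": {"color": "blue"}},
    size=3,
)
print(json.dumps(query, indent=2))
```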

Painless extensions

The third method adds the distance functions as painless extensions that you can use in more complex combinations. Similar to the k-NN score script, you can use this method to perform a brute force, exact k-NN search across an index, which also supports pre-filtering. This approach has slightly slower query performance compared to the k-NN score script. If your use case requires more customization over the final score, you should use this approach over score script k-NN.
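With painless extensions, the distance function becomes one term in a script you write yourself. The sketch below builds such a body using the plugin’s cosineSimilarity function. Adding 1.0 keeps the score non-negative; the match_all query, the field name, and the vector are placeholders.

```python
import json

def painless_knn_query(field, vector, size):
    """Build an exact k-NN search body that scores with a painless expression."""
    return {
        "size": size,
        "query": {
            "script_score": {
                "query": {"match_all": {}},
                "script": {
                    # The expression can be extended, for example by adding a
                    # boost derived from another document field.
                    "source": "1.0 + cosineSimilarity(params.query_value, doc[params.field])",
                    "params": {"field": field, "query_value": vector},
                },
            }
        },
    }

query = painless_knn_query("product_vector", [0.85, 0.75, 0.1], size=3)
print(json.dumps(query, indent=2))
```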

Vector search algorithms

The simplest way to find similar vectors is to use k-nearest neighbors (k-NN) algorithms, which compute the distance between a query vector and the other vectors in the vector database. As we mentioned earlier, the score script k-NN and painless extensions search methods use the exact k-NN algorithms under the hood. However, in the case of extremely large datasets with high dimensionality, this creates a scaling problem that reduces the efficiency of the search. Approximate nearest neighbor (ANN) search methods can overcome this by employing tools that restructure indexes more efficiently and reduce the dimensionality of searchable vectors. There are different ANN search algorithms; for example, locality sensitive hashing, tree-based, cluster-based, and graph-based. OpenSearch implements two ANN algorithms: Hierarchical Navigable Small Worlds (HNSW) and Inverted File (IVF). For a more detailed explanation of how the HNSW and IVF algorithms work in OpenSearch, see the blog post “Choose the k-NN algorithm for your billion-scale use case with OpenSearch”.

Hierarchical Navigable Small Worlds

The HNSW algorithm is one of the most popular algorithms out there for ANN search. The core idea of the algorithm is to build a graph with edges connecting index vectors that are close to each other. Then, on search, this graph is partially traversed to find the approximate nearest neighbors to the query vector. To steer the traversal towards the query’s nearest neighbors, the algorithm always visits the closest candidate to the query vector next.
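The traversal can be sketched as a best-first search over a proximity graph. This is a deliberately simplified, single-layer illustration: real HNSW builds the graph itself and searches a hierarchy of layers, whereas here the graph is supplied by hand and the parameter name ef is borrowed only to cap how many candidates are kept.

```python
import heapq
import math

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def greedy_search(vectors, graph, query, entry, k, ef=4):
    """Best-first traversal of one graph layer, keeping the ef closest nodes seen."""
    visited = {entry}
    start = euclidean(query, vectors[entry])
    candidates = [(start, entry)]  # min-heap: closest unexplored node first
    best = [(-start, entry)]       # max-heap via negation: worst kept node on top
    while candidates:
        dist, node = heapq.heappop(candidates)
        if len(best) >= ef and dist > -best[0][0]:
            break  # the closest candidate is farther than every kept result
        for neighbor in graph[node]:
            if neighbor in visited:
                continue
            visited.add(neighbor)
            d = euclidean(query, vectors[neighbor])
            if len(best) < ef or d < -best[0][0]:
                heapq.heappush(candidates, (d, neighbor))
                heapq.heappush(best, (-d, neighbor))
                if len(best) > ef:
                    heapq.heappop(best)  # drop the farthest kept node
    ordered = sorted((-neg, node) for neg, node in best)
    return [node for _, node in ordered][:k]

# Hand-built toy graph: six points on a line, each linked to its neighbors.
vectors = {i: (float(i), 0.0) for i in range(6)}
graph = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3, 5], 5: [4]}

print(greedy_search(vectors, graph, query=(4.2, 0.0), entry=0, k=2, ef=3))
```

Starting from node 0, the search walks along the edges toward the query and never computes a distance for a node it has no path to, which is where the savings over an exhaustive scan come from on large graphs.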

Inverted File

The IVF algorithm separates your index vectors into a set of buckets, then, to reduce your search time, only searches through a subset of these buckets. However, if the algorithm just randomly split up your vectors into different buckets, and only searched a subset of them, it would yield a poor approximation. The IVF algorithm uses a more elegant approach. First, before indexing begins, it assigns each bucket a representative vector. When a vector is indexed, it gets added to the bucket that has the closest representative vector. This way, vectors that are closer to each other are placed roughly in the same or nearby buckets.
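The bucketing idea can be sketched as follows. The representative vectors are fixed by hand here purely for illustration; in practice they are learned in a training step, which is why the table earlier lists training as required for FAISS-IVF. The nprobes argument, which controls how many buckets are searched, and the toy vectors are assumptions of this sketch.

```python
import math

def l2(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def build_ivf(vectors, centroids):
    """Assign each vector id to the bucket with the closest representative vector."""
    buckets = {i: [] for i in range(len(centroids))}
    for vector_id, vector in vectors.items():
        nearest = min(range(len(centroids)), key=lambda i: l2(vector, centroids[i]))
        buckets[nearest].append(vector_id)
    return buckets

def ivf_search(query, vectors, centroids, buckets, k, nprobes=1):
    """Search only the nprobes buckets whose representatives are closest to the query."""
    probe = sorted(range(len(centroids)), key=lambda i: l2(query, centroids[i]))[:nprobes]
    candidates = [vector_id for i in probe for vector_id in buckets[i]]
    return sorted(candidates, key=lambda vector_id: l2(query, vectors[vector_id]))[:k]

# Hand-picked representatives and toy vectors.
centroids = [(0.0, 0.0), (10.0, 10.0)]
vectors = {"a": (1.0, 1.0), "b": (0.0, 2.0), "c": (9.0, 9.0), "d": (11.0, 10.0)}
buckets = build_ivf(vectors, centroids)

print(buckets)
print(ivf_search((8.0, 9.0), vectors, centroids, buckets, k=2, nprobes=1))
```

Raising nprobes trades latency for recall: with every bucket probed, the search degenerates into an exact scan.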

Vector similarity metrics

All search engines use a similarity metric to rank and sort results and bring the most relevant results to the top. When you use a plain text query, the similarity metric is a term-based score such as TF-IDF (OpenSearch’s default is the related Okapi BM25 function), which measures the importance of the terms in the query and generates a score based on the number of textual matches. When your query includes a vector, the similarity metrics are spatial in nature, taking advantage of proximity in the vector space. OpenSearch supports several similarity or distance measures:

Euclidean distance – The straight-line distance between points.
L1 (Manhattan) distance – The sum of the absolute differences of all of the vector components. L1 distance measures how many orthogonal city blocks you need to traverse from point A to point B.
L-infinity (chessboard) distance – The number of moves a King would make on an n-dimensional chessboard. It’s different from Euclidean distance on the diagonals: a diagonal step on a 2-dimensional chessboard is 1.41 Euclidean units away, but only 1 L-infinity unit away.
Inner product – The product of the magnitudes of two vectors and the cosine of the angle between them. Usually used for natural language processing (NLP) vector similarity.
Cosine similarity – The cosine of the angle between two vectors in a vector space.
Hamming distance – For binary-coded vectors, the number of bits that differ between the two vectors.
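For reference, each of these measures fits in a line or two of code. The following plain-Python sketch is for illustration only and says nothing about how the engines implement them; the Hamming example treats each binary vector as an integer bit string.

```python
import math

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def manhattan(a, b):
    return sum(abs(x - y) for x, y in zip(a, b))

def chebyshev(a, b):
    """L-infinity distance: the largest difference along any single dimension."""
    return max(abs(x - y) for x, y in zip(a, b))

def inner_product(a, b):
    return sum(x * y for x, y in zip(a, b))

def cosine_similarity(a, b):
    return inner_product(a, b) / (math.sqrt(inner_product(a, a)) * math.sqrt(inner_product(b, b)))

def hamming(a_bits, b_bits):
    """Number of differing bits between two integers used as bit strings."""
    return bin(a_bits ^ b_bits).count("1")

# One diagonal step on a 2-dimensional board.
origin, diagonal = (0, 0), (1, 1)
print(euclidean(origin, diagonal))   # about 1.41
print(manhattan(origin, diagonal))   # 2
print(chebyshev(origin, diagonal))   # 1
```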

Advantage of OpenSearch as a vector database

When you use OpenSearch Service as a vector database, you can take advantage of the service’s features like usability, scalability, availability, interoperability, and security. More importantly, you can use OpenSearch’s search features to enhance the search experience. For example, you can use Learning to Rank in OpenSearch to integrate user clickthrough behavior data into your search application and improve search relevance. You can also combine OpenSearch text search and vector search capabilities to search documents with keyword and semantic similarity, and use other fields in the index to filter documents to improve relevance. If you are an advanced user, you can use a hybrid scoring model that combines OpenSearch’s text-based relevance score, computed with the Okapi BM25 function, with its vector search score to improve the ranking of your search results.
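One simple way to picture a hybrid scoring model is to normalize both score lists to a common range and take a weighted sum. The sketch below does this on the client side with made-up scores. It is only one possible approach under those assumptions, not a description of how OpenSearch combines scores internally, and the weight is a tuning knob you would set by experiment.

```python
def min_max(scores):
    """Rescale a {doc_id: score} mapping to the range 0..1."""
    low, high = min(scores.values()), max(scores.values())
    if high == low:
        return {doc_id: 1.0 for doc_id in scores}
    return {doc_id: (score - low) / (high - low) for doc_id, score in scores.items()}

def hybrid_rank(bm25_scores, vector_scores, vector_weight=0.5):
    """Rank documents by a weighted sum of normalized text and vector scores."""
    text, vector = min_max(bm25_scores), min_max(vector_scores)
    combined = {
        doc_id: (1 - vector_weight) * text.get(doc_id, 0.0)
        + vector_weight * vector.get(doc_id, 0.0)
        for doc_id in set(text) | set(vector)
    }
    return sorted(combined, key=combined.get, reverse=True)

# Hypothetical scores for the query "a cozy place to sit by the fire".
bm25_scores = {"fire pit": 12.0, "couch": 3.0, "heating fan": 6.0}
vector_scores = {"couch": 0.95, "fire pit": 0.40, "heating fan": 0.30}

print(hybrid_rank(bm25_scores, vector_scores, vector_weight=0.7))
```

With most of the weight on the vector score, the couch overtakes the fire pit even though the fire pit wins on keyword matches.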

Scale and limits

OpenSearch as a vector database supports billions of vector records. Keep in mind the following calculations regarding the number of vectors and dimensions when you size your cluster.

Number of vectors

OpenSearch’s vector database takes advantage of the sharding capabilities of OpenSearch and can scale to billions of vectors at single-digit millisecond latencies by sharding vectors and scaling horizontally as you add nodes. The number of vectors that can fit in a single machine is a function of the off-heap memory available on the machine. The number of nodes required depends on the amount of memory that can be used for the algorithm per node and the total amount of memory required by the algorithm. The more nodes, the more memory and better performance. The amount of memory available per node is computed as memory_available = (node_memory - jvm_size) * circuit_breaker_limit, with the following parameters:

node_memory – The total memory of the instance.
jvm_size – The OpenSearch JVM heap size. This is set to half of the instance’s RAM, capped at approximately 32 GB.
circuit_breaker_limit – The native memory usage threshold for the circuit breaker. This is set to 0.5.
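As a quick worked example, the formula can be evaluated for a few instance sizes. The sketch takes the heap cap as exactly 32 GB, although the text above says approximately, and the instance sizes are arbitrary.

```python
def jvm_size_gb(node_memory_gb, heap_cap_gb=32.0):
    """Half of the instance's RAM, capped (assumed here to be exactly 32 GB)."""
    return min(node_memory_gb / 2, heap_cap_gb)

def memory_available_gb(node_memory_gb, circuit_breaker_limit=0.5):
    """memory_available = (node_memory - jvm_size) * circuit_breaker_limit"""
    return (node_memory_gb - jvm_size_gb(node_memory_gb)) * circuit_breaker_limit

for node_memory in (16, 64, 128):
    print(node_memory, "GB node ->", memory_available_gb(node_memory), "GB for vectors")
```

A 64 GB node leaves (64 - 32) * 0.5 = 16 GB for the k-NN structures, while a 128 GB node leaves (128 - 32) * 0.5 = 48 GB because the heap stops growing once it reaches the cap.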

The total cluster memory estimate depends on the total number of vector records and on the algorithm you choose, because HNSW and IVF have different memory requirements. You can refer to Memory Estimation for more details.
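For HNSW, the estimate given in the OpenSearch k-NN documentation is, to the best of our knowledge, 1.1 * (4 * dimension + 8 * M) * num_vectors bytes, where M is the HNSW parameter for the number of links per node. Treat the formula in this sketch as an assumption and check it against the Memory Estimation page before sizing a production cluster.

```python
def hnsw_memory_bytes(num_vectors, dimension, m=16):
    """Estimated native memory for an HNSW graph: 1.1 * (4 * d + 8 * M) * n bytes."""
    return 1.1 * (4 * dimension + 8 * m) * num_vectors

# One million 256-dimensional vectors with M = 16.
estimate = hnsw_memory_bytes(num_vectors=1_000_000, dimension=256, m=16)
print(f"{estimate / 1e9:.3f} GB")
```

Dividing an estimate like this, multiplied by the number of copies you keep (primaries plus replicas), by the per-node memory_available value gives a first approximation of the number of data nodes you need.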

Number of dimensions

OpenSearch’s current dimension limit for the vector field knn_vector is 16,000 dimensions. Each dimension is represented as a 32-bit float. The more dimensions, the more memory you’ll need to index and search. The number of dimensions is usually determined by the embedding models that translate the entity to a vector. There are a lot of options to choose from when building your knn_vector field. To determine the correct methods and parameters to choose, refer to Choosing the right method.

Customer stories

Amazon Music

Amazon Music is always innovating to provide customers with unique and personalized experiences. One of Amazon Music’s approaches to music recommendations is a remix of a classic Amazon innovation, item-to-item collaborative filtering, and vector databases. Using data aggregated based on user listening behavior, Amazon Music has created an embedding model that encodes music tracks and customer representations into a vector space where neighboring vectors represent tracks that are similar. 100 million songs are encoded into vectors, indexed into OpenSearch, and served across multiple geographies to power real-time recommendations. OpenSearch currently manages 1.05 billion vectors and supports a peak load of 7,100 vector queries per second to power Amazon Music recommendations.

The item-to-item collaborative filter continues to be among the most popular methods for online product recommendations because of its effectiveness at scaling to large customer bases and product catalogs. OpenSearch makes it easier to operationalize and further the scalability of the recommender by providing scale-out infrastructure and k-NN indexes that grow linearly with respect to the number of tracks and similarity search in logarithmic time.

The following figure visualizes the high-dimensional space created by the vector embedding.

A visualization of the vector encoding of Amazon Music entries in the large vector space

Brand protection at Amazon

Amazon strives to deliver the world’s most trustworthy shopping experience, offering customers the widest possible selection of authentic products. To earn and maintain our customers’ trust, we strictly prohibit the sale of counterfeit products, and we continue to invest in innovations that ensure only authentic products reach our customers. Amazon’s brand protection programs build trust with brands by accurately representing and completely protecting their brand. We strive to ensure that public perception mirrors the trustworthy experience we deliver. Our brand protection strategy focuses on four pillars: (1) Proactive Controls (2) Powerful Tools to Protect Brands (3) Holding Bad Actors Accountable (4) Protecting and Educating Customers. Amazon OpenSearch Service is a key part of Amazon’s Proactive Controls.

In 2022, Amazon’s automated technology scanned more than 8 billion attempted changes daily to product detail pages for signs of potential abuse. Our proactive controls found more than 99% of blocked or removed listings before a brand ever had to find and report them. These listings were suspected of being fraudulent, infringing, counterfeit, or at risk of other forms of abuse. To perform these scans, Amazon created tooling that uses advanced and innovative techniques, including advanced machine learning models, to automate the detection of intellectual property infringements in listings across Amazon’s stores globally. A key technical challenge in implementing such an automated system is the ability to search for protected intellectual property within a vast billion-vector corpus in a fast, scalable, and cost-effective manner. Leveraging Amazon OpenSearch Service’s scalable vector database capabilities and distributed architecture, we successfully developed an ingestion pipeline that has indexed a total of 68 billion 128- and 1024-dimension vectors into OpenSearch Service to enable brands and automated systems to conduct infringement detection in real time through a highly available and fast (sub-second) search API.

Conclusion

Whether you’re building a generative AI solution, searching rich media and audio, or bringing more semantic search to your existing search-based application, OpenSearch is a capable vector database. OpenSearch supports a variety of engines, algorithms, and distance measures that you can employ to build the right solution. OpenSearch provides a scalable engine that can support vector search at low latency and up to billions of vectors. With OpenSearch and its vector DB capabilities, your users can find that 8-foot blue couch easily, and relax by a cozy fire.

About the Authors

Jon Handler is a Senior Principal Solutions Architect at Amazon Web Services based in Palo Alto, CA. Jon works closely with OpenSearch and Amazon OpenSearch Service, providing help and guidance to a broad range of customers who have search and log analytics workloads that they want to move to the AWS Cloud. Prior to joining AWS, Jon’s career as a software developer included four years of coding a large-scale, eCommerce search engine. Jon holds a Bachelor of Arts from the University of Pennsylvania, and a Master of Science and a Ph.D. in Computer Science and Artificial Intelligence from Northwestern University.

Jianwei Li is a Principal Analytics Specialist TAM at Amazon Web Services. Jianwei provides consulting services that help customers design and build modern data platforms. Jianwei has worked in the big data domain as a software developer, consultant, and tech leader.

Dylan Tong is a Senior Product Manager at AWS. He works with customers to help drive their success on the AWS platform through thought leadership and guidance on designing well architected solutions. He has spent most of his career building on his expertise in data management and analytics by working for leaders and innovators in the space.

Vamshi Vijay Nakkirtha is a Software Engineering Manager working on the OpenSearch Project and Amazon OpenSearch Service. His primary interests include distributed systems. He is an active contributor to various plugins, like k-NN, GeoSpatial, and dashboard-maps.

Ingest, transform, and deliver events published by Amazon Security Lake to Amazon OpenSearch Service

2023-06-20 Kevin Fallis

Post Syndicated from Kevin Fallis original https://aws.amazon.com/blogs/big-data/ingest-transform-and-deliver-events-published-by-amazon-security-lake-to-amazon-opensearch-service/

With the recent introduction of Amazon Security Lake, it has never been simpler to access all your security-related data in one place. Whether it’s findings from AWS Security Hub, DNS query data from Amazon Route 53, network events such as VPC Flow Logs, or third-party integrations provided by partners such as Barracuda Email Protection, Cisco Firepower Management Center, or Okta identity logs, you now have a centralized environment in which you can correlate events and findings using a broad range of tools in the AWS and partner ecosystem.

Security Lake automatically centralizes security data from cloud, on-premises, and custom sources into a purpose-built data lake stored in your account. With Security Lake, you can get a more complete understanding of your security data across your entire organization. You can also improve the protection of your workloads, applications, and data. Security Lake has adopted the Open Cybersecurity Schema Framework (OCSF), an open standard. With OCSF support, the service can normalize and combine security data from AWS and a broad range of enterprise security data sources.

When it comes to near-real-time analysis of data as it arrives in Security Lake and responding to security events your company cares about, Amazon OpenSearch Service provides the necessary tooling to help you make sense of the data found in Security Lake.

OpenSearch Service is a fully managed and scalable log analytics framework that is used by customers to ingest, store, and visualize data. Customers use OpenSearch Service for a diverse set of data workloads, including healthcare data, financial transactions information, application performance data, observability data, and much more. Additionally, customers use the managed service for its ingest performance, scalability, low query latency, and ability to analyze large datasets.

This post shows you how to ingest, transform, and deliver Security Lake data to OpenSearch Service for use by your SecOps teams. We also walk you through how to use a series of prebuilt visualizations to view events across multiple AWS data sources provided by Security Lake.

Understanding the event data found in Security Lake

Security Lake stores the normalized OCSF security events in Apache Parquet format—an optimized columnar data storage format with efficient data compression and enhanced performance to handle complex data in bulk. Parquet format is a foundational format in the Apache Hadoop ecosystem and is integrated into AWS services such as Amazon Redshift Spectrum, AWS Glue, Amazon Athena, and Amazon EMR. It’s a portable columnar format, future proofed to support additional encodings as technology develops, and it has library support across a broad set of languages like Python, Java, and Go. And the best part is that Apache Parquet is open source!

The intent of OCSF is to provide a common language for data scientists and analysts that work with threat detection and investigation. With a diverse set of sources, you can build a complete view of your security posture on AWS using Security Lake and OpenSearch Service.

Understanding the event architecture for Security Lake

Security Lake provides a subscriber framework for access to the data stored in Amazon S3. Services such as Amazon Athena and Amazon SageMaker use query access. The solution in this post uses data access to respond to events generated by Security Lake.

When you subscribe for data access, events arrive via Amazon Simple Queue Service (Amazon SQS). Each SQS event contains a notification object that acts as a “pointer”: it carries the data needed to build a URL to the Parquet object on Amazon S3. Your subscriber processes the event, parses the data found in the object, and transforms it to whatever format makes sense for your implementation.
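The pointer-following step might look like the sketch below. The solution in this post is written in Java; Python is used here only for brevity. The message layout, with the bucket name under detail.bucket.name and the object key under detail.object.key, is an assumption for illustration rather than the documented notification schema, so inspect a real message from your queue before relying on these field names.

```python
import json

def s3_pointer_from_sqs_body(body):
    """Extract the bucket and key of the Parquet object from a notification body.

    The field names read here are assumed for illustration, not taken from
    the Security Lake documentation.
    """
    event = json.loads(body)
    detail = event["detail"]
    return detail["bucket"]["name"], detail["object"]["key"]

# Hypothetical message body with placeholder bucket and key values.
sample_body = json.dumps({
    "detail": {
        "bucket": {"name": "example-security-lake-bucket"},
        "object": {"key": "example/prefix/events.parquet"},
    }
})

bucket, key = s3_pointer_from_sqs_body(sample_body)
print(f"s3://{bucket}/{key}")
```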

The solution we provide in this post uses a subscriber for data access. Let’s drill down into what the implementation looks like so that you understand how it works.

Solution overview

The high-level architecture for integrating Security Lake with OpenSearch Service is as follows.

The workflow contains the following steps:

1. Security Lake persists Parquet formatted data into an S3 bucket as determined by the administrator of Security Lake.
2. A notification is placed in Amazon SQS that describes the key to get access to the object.
3. Java code in an AWS Lambda function reads the SQS notification and prepares to read the object described in the notification.
4. Java code uses Hadoop, Parquet, and Avro libraries to retrieve the object from Amazon S3 and transform the records in the Parquet object into JSON documents for indexing in your OpenSearch Service domain.
5. The documents are gathered and then sent to your OpenSearch Service domain, where index templates map the structure into a schema optimized for Security Lake logs in OCSF format.

Steps 1–2 are managed by Security Lake; steps 3–5 are managed by the customer. The shaded components are your responsibility. The subscriber implementation for this solution uses Lambda and OpenSearch Service, and these resources are managed by you.
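The final delivery step, gathering documents and sending them to the domain, typically uses the OpenSearch bulk API, which expects newline-delimited JSON with an action line before each document. The following sketch shows only that payload format, in Python rather than the Java used by the solution. The index name and the record fields are invented placeholders, and the HTTP call to the _bulk endpoint is left out.

```python
import json

def to_bulk_payload(records, index_name):
    """Serialize records as NDJSON: one action line, then one document line, per record."""
    lines = []
    for record in records:
        lines.append(json.dumps({"index": {"_index": index_name}}))
        lines.append(json.dumps(record))
    # The bulk API requires the body to end with a newline.
    return "\n".join(lines) + "\n"

# Hypothetical records standing in for rows read from a Parquet object.
records = [
    {"event_time": "2023-06-20T12:00:00Z", "activity": "example-a"},
    {"event_time": "2023-06-20T12:00:05Z", "activity": "example-b"},
]

payload = to_bulk_payload(records, "example-security-lake-index")
print(payload)
```

Batching many documents into one request like this keeps the number of round trips low, which matters when a single object holds a large number of events.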

If you are evaluating this as a solution for your business, remember that Lambda has a 15-minute maximum execution time at the time of this writing. Security Lake can produce objects of up to 256 MB, so this solution may not be effective for your company’s needs at large scale. Various levers in Lambda affect the cost of the solution for log delivery, so make cost-conscious decisions when evaluating sample solutions. This Lambda-based implementation is best suited to smaller companies, where the volume of CloudTrail logs and VPC Flow Logs is low enough that the cost to transform and deliver logs to Amazon OpenSearch Service remains budget friendly.

Now that you have some context, let’s start building the implementation for OpenSearch Service!

Prerequisites

Creation of Security Lake for your AWS accounts is a prerequisite for building this solution. Security Lake integrates with an AWS Organizations account to enable the offering for selected accounts in the organization. For a single AWS account that doesn’t use Organizations, you can enable Security Lake without the need for Organizations. You must have administrative access to perform these operations. For multiple accounts, it’s suggested that you delegate the Security Lake activities to another account in your organization. For more information about enabling Security Lake in your accounts, review Getting started.

Additionally, you may need to take the provided template and adjust it to your specific environment. The sample solution relies on access to a public S3 bucket hosted for this blog so egress rules and permissions modifications may be required if you use S3 endpoints.

This solution assumes that you’re using a domain deployed in a VPC. Additionally, it assumes that you have fine-grained access controls enabled on the domain to prevent unauthorized access to data you store as part of the integration with Security Lake. VPC-deployed domains are privately routable and have no access to the public internet by design. If you want to access your domain in a more public setting, you need to create an NGINX proxy to broker requests between public and private settings.

The remaining sections in this post are focused on how to create the integration with OpenSearch Service.

Create the subscriber

To create your subscriber, complete the following steps:

On the Security Lake console, choose Subscribers in the navigation pane.
Choose Create subscriber.
Under Subscriber details, enter a meaningful name and description.
Under Log and event sources, specify what the subscriber is authorized to ingest. For this post, we select All log and event sources.
For Data access method, select S3.
Under Subscriber credentials, provide the account ID of the AWS account you want to give access to, along with an external ID.
For Notification details, select SQS queue.
Choose Create when you are finished filling in the form.

It will take a minute or so to initialize the subscriber framework, such as the SQS integration and the permission generated so that you can access the data from another AWS account. When the status changes from Creating to Created, you have access to the subscriber endpoint on Amazon SQS.

Save the following values found in the subscriber Details section:
1. AWS role ID
2. External ID
3. Subscription endpoint

Use AWS CloudFormation to provision Lambda integration between the two services

An AWS CloudFormation template takes care of a large portion of the setup for the integration. It creates the necessary components to read the data from Security Lake, transform it into JSON, and then index it into your OpenSearch Service domain. The template also provides the necessary AWS Identity and Access Management (IAM) roles for integration, the tooling to create an S3 bucket for the Java JAR file used in the solution by Lambda, and a small Amazon Elastic Compute Cloud (Amazon EC2) instance to facilitate the provisioning of templates in your OpenSearch Service domain.

To deploy your resources, complete the following steps:

On the AWS CloudFormation console, create a new stack.
For Prepare template, select Template is ready.
Specify your template source as Amazon S3 URL.

You can either save the template to your local drive and upload it from your device, or copy the link for use on the AWS CloudFormation console. In this example, we use the template URL that points to a template stored on Amazon S3.

Choose Next.
Enter a name for your stack. For this post, we name the stack blog-lambda.
Populate the parameters with the values you saved or copied from Security Lake and OpenSearch Service, then choose Next. Ensure that the endpoint for the OpenSearch domain has a forward slash / at the end of the URL that you copy from OpenSearch Service.
Select Preserve successfully provisioned resources to preserve the resources in case the stack rolls back, so you can debug the issues.
Scroll to the bottom of the page and choose Next.
On the summary page, select the check box that acknowledges IAM resources will be created and used in this template.
Choose Submit.

The stack will take a few minutes to deploy.

After the stack has deployed, navigate to the Outputs tab for the stack you created.
Save the CommandProxyInstanceID for executing scripts and save the two role ARNs to use in the role mappings step.

You need to associate the IAM roles for the tooling instance and the Lambda function with OpenSearch Service security roles so that the processes can work with the cluster and the resources within.

Provision role mappings for integrations with OpenSearch Service

With the template-generated IAM roles, you need to map the roles using role mapping to the predefined all_access role in your OpenSearch Service cluster. You should evaluate your specific use of any roles and ensure they are aligned with your company’s requirements.

In OpenSearch Dashboards, choose Security in the navigation pane.
Choose Roles in the navigation pane and look up the all_access role.
On the role details page, on the Mapped users tab, choose Manage mapping.
Add the two IAM roles found in the outputs of the CloudFormation template, then choose Map.

Provision the index templates used for OCSF format in OpenSearch Service

Index templates have been provided as part of the initial setup. These templates are crucial to the format of the data so that ingestion is efficient and tuned for aggregations and visualizations. Data that comes from Security Lake is transformed into a JSON format, and this format is based directly on the OCSF standard.

For example, each OCSF category has a common Base Event class that contains multiple objects that represent details like the cloud provider in a Cloud object, enrichment data using an Enrichment object that has a common structure across events but can have different values based on the event, and even more complex structures that have inner objects, which themselves have more inner objects such as the Metadata object, still part of the Base Event class. The Base Event class is the foundation for all categories in OCSF and helps you with the effort of correlating events written into Security Lake and analyzed in OpenSearch.

OpenSearch is technically schema-less. You don’t have to define a schema up front. The OpenSearch engine will try to guess the data types and the mappings found in the data coming from Security Lake. This is known as dynamic mapping. The OpenSearch engine also provides you with the option to predefine the data you are indexing. This is known as explicit mapping. Using explicit mappings to identify your data source types and how they are stored at the time of ingestion is key to getting high-volume ingest performance for time-centric data indexed at heavy load.

In summary, the mapping templates use composable templates. In this construct, the solution establishes an efficient schema for the OCSF standard and gives you the capability to correlate events and specialize on specific categories in the OCSF standard.
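To make the composable-template idea more concrete, here is a minimal sketch of a shared component template plus an index template that composes it. The template name `ocsf_base_component`, the index pattern `ocsf-dns-*`, and all field names are hypothetical and are not the templates shipped with this solution; the dictionaries only follow the general body shape used by OpenSearch component and index templates.

```python
import json

def base_component_template():
    """Explicit mappings for fields shared by every event (hypothetical field names)."""
    return {
        "template": {
            "mappings": {
                "properties": {
                    "time": {"type": "date"},
                    "severity": {"type": "keyword"},
                    "cloud": {
                        "properties": {
                            "provider": {"type": "keyword"},
                            "region": {"type": "keyword"},
                        }
                    },
                }
            }
        }
    }

def category_index_template(pattern, components):
    """Index template for one category that reuses the shared component templates."""
    return {
        "index_patterns": [pattern],
        "composed_of": list(components),
        "template": {
            "mappings": {
                "properties": {
                    # Category-specific field (hypothetical).
                    "query": {"properties": {"hostname": {"type": "keyword"}}}
                }
            }
        },
    }

if __name__ == "__main__":
    template = category_index_template("ocsf-dns-*", ["ocsf_base_component"])
    print(json.dumps(template, indent=2))
```

The shared fields are defined once and reused by each category template, which is what lets events from different categories be correlated on the same field names and types.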

You load the templates using the tools proxy created by your CloudFormation template.

On the stack’s Outputs tab, find the parameter CommandProxyInstanceID.

We use that value to find the instance in AWS Systems Manager.

On the Systems Manager console, choose Fleet manager in the navigation pane.
Locate and select your managed node.
On the Node actions menu, choose Start terminal session.

When you’re connected to the instance, run the following commands:

cd;pwd
. /usr/share/es-scripts/es-commands.sh | grep -o '{\"acknowledged\":true}' | wc -l

You should see a final result of 42 occurrences of {"acknowledged":true}, which demonstrates the commands being sent were successful. Ignore the warnings you see for migration. The warnings don’t affect the scripts and as of this writing can’t be muted.
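If you capture the script output to a file or string instead, the same check can be sketched in Python. This is only an illustration of what the `grep -o ... | wc -l` pipeline counts; the sample output below is made up.

```python
def count_acknowledged(output: str) -> int:
    """Count acknowledgement markers, mirroring `grep -o ... | wc -l`."""
    return output.count('{"acknowledged":true}')

if __name__ == "__main__":
    # Made-up sample output with three acknowledgements and one warning line.
    sample = (
        "warning: migration notice\n"
        '{"acknowledged":true}\n'
        '{"acknowledged":true}{"acknowledged":true}\n'
    )
    print(count_acknowledged(sample))  # 3
```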

Navigate to Dev Tools in OpenSearch Dashboards and run the following command:
```
GET _cat/templates
```

This confirms that the scripts were successful.

Install index patterns, visualizations, and dashboards for the solution

For this solution, we prepackaged a few visualizations so that you can make sense of your data. Download the visualizations to your local desktop, then complete the following steps:

In OpenSearch Dashboards, navigate to Stack Management and Saved Objects.
Choose Import.
Choose the file from your local device, select your import options, and choose Import.

You will see numerous objects that you imported. You can use the visualizations after you start importing data.

Enable the Lambda function to start processing events into OpenSearch Service

The final step is to go into the configuration of the Lambda function and enable the triggers so that the data can be read from the subscriber framework in Security Lake. The trigger is currently disabled; you need to enable it and save the config. You will notice the function is throttled, which is by design. You need to have templates in the OpenSearch cluster so that the data indexes in the desired format.

On the Lambda console, navigate to your function.
On the Configurations tab, in the Triggers section, select your SQS trigger and choose Edit.
Select Activate trigger and save the setting.
Choose Edit concurrency.
Configure your concurrency and choose Save.

Enable the function by setting the concurrency setting to 1. You can adjust the setting as needed for your environment.

You can review the Amazon CloudWatch logs on the CloudWatch console to confirm the function is working.

You should see startup messages and other event information that indicates logs are being processed. The provided JAR file is set for info-level logging. If you need to debug any concerns, you can use a verbose debug version of the JAR file instead. Your JAR file options are:

Info-level settings for the Lambda code (default deployment) – lambda-s3-objecthandler-0.2.8.jar
Debug-level settings for the Lambda code – lambda-s3-objecthandler-0.2.8-debug.jar

If you choose to deploy the debug version, the verbosity of the code will show some error-level details in the Hadoop libraries. To be clear, Hadoop code will display lots of exceptions in debug mode because it tests environment settings and looks for things that aren’t provisioned in your Lambda environment, like a Hadoop metrics collector. Most of these startup errors are not fatal and can be ignored.

Visualize the data

Now that you have data flowing into OpenSearch Service from Security Lake via Lambda, it’s time to put those imported visualizations to work. In OpenSearch Dashboards, navigate to the Dashboards page.

You will see four primary dashboards, each aligned with the OCSF category it supports. The four supported visualization categories are DNS activity, security findings, network activity, and AWS CloudTrail using the Cloud API.

Security findings

The findings dashboard is a series of high-level summaries that you use for visual inspection of AWS Security Hub findings in a time window that you specify in the dashboard filters. Many of the encapsulated visualizations give “filter on click” capabilities so you can narrow your discoveries. The following screenshot shows an example.

The Finding Velocity visualization shows findings over time based on severity. The Finding Severity visualization shows which “findings” have passed or failed, and the Findings table visualization is a tabular view with actual counts. Your goal is to be near zero in all the categories except informational findings.

Network activity

The network traffic dashboard provides an overview for all your accounts in the organization that are enabled for Security Lake. The following example is monitoring 260 AWS accounts, and this dashboard summarizes the top accounts with network activities. Aggregate traffic, top accounts generating traffic and top accounts with the most activity are found in the first section of the visualizations.

Additionally, the top accounts are summarized by allow and deny actions for connections. In the visualization below, there are fields that you can use to drill down into other visualizations. Some of these visualizations have links to third-party websites that may or may not be allowed in your company. You can edit the links in the Saved objects in the Stack Management plugin.

To drill down, choose the account ID to get a summary by account. The list of egress and ingress traffic within a single AWS account is sorted by the volume of bytes transferred between any given two IP addresses.

Finally, if you choose the IP addresses, you’ll be redirected to Project Honey Pot, where you can see if the IP address is a threat or not.

DNS activity

The DNS activity dashboard shows you the requestors for DNS queries in your AWS accounts. Again, this is a summary view of all the events in a time window.

The first visualization in the dashboard shows DNS activity in aggregate across the top five active accounts. Of the 260 accounts in this example, four are active. The next visualization breaks the resolves down by the requesting service or host, and the final visualization breaks out the requestors by account, VPC ID, and instance ID for those queries run by your solutions.

API Activity

The final dashboard gives an overview of API activity via CloudTrail across all your accounts. It summarizes things like API call velocity, operations by service, top operations, and other summary information.

The first visualization in the dashboard gives you an idea of which services are receiving the most requests. You sometimes need to understand where to focus the majority of your threat discovery efforts based on which services may be consumed differently over time. Next, there are heat maps that break down API activity by Region and service, which give you an idea of what type of API calls are most prevalent in the accounts you are monitoring.

As you scroll down on the form, more details present themselves such as top five services with API activity and the top API operations for the organization you are monitoring.

Conclusion

Security Lake integration with OpenSearch Service is easy to achieve by following the steps outlined in this post. Security Lake data is transformed from Parquet to JSON, making it readable and simple to query. Enable your SecOps teams to identify and investigate potential security threats by analyzing Security Lake data in OpenSearch Service. The provided visualizations and dashboards can help to navigate the data, identify trends and rapidly detect any potential security issues in your organization.

As next steps, we recommend using the above framework and associated templates, which provide you with easy steps to visualize your Security Lake data using OpenSearch Service.

In a series of follow-up posts, we will review the source code and walk through published examples of the Lambda ingestion framework in the AWS Samples GitHub repo. The framework can be modified for use in containers to help address companies that have longer processing times for large files published in Security Lake. Additionally, we will discuss how to detect and respond to security events using example implementations that use OpenSearch plugins such as Security Analytics, Alerting, and Anomaly Detection available in Amazon OpenSearch Service.

About the authors

Kevin Fallis (@AWSCodeWarrior) is a Principal AWS Specialist Search Solutions Architect. His passion at AWS is to help customers leverage the correct mix of AWS services to achieve success for their business goals. His after-work activities include family, DIY projects, carpentry, playing drums, and all things music.

Jimish Shah is a Senior Product Manager at AWS with 15+ years of experience bringing products to market in log analytics, cybersecurity, and IP video streaming. He’s passionate about launching products that offer delightful customer experiences and solve complex customer problems. In his free time, he enjoys exploring cafes, hiking, and taking long walks.

Ross Warren is a Senior Product SA at AWS for Amazon Security Lake based in Northern Virginia. Prior to his work at AWS, Ross’ areas of focus included cyber threat hunting and security operations. When he is not talking about AWS he likes to spend time with his family, bake bread, make sawdust and enjoy time outside.

Improved resiliency with backpressure and admission control for Amazon OpenSearch Service

2023-06-14 Ketan Verma

Post Syndicated from Ketan Verma original https://aws.amazon.com/blogs/big-data/improved-resiliency-with-backpressure-and-admission-control-for-amazon-opensearch-service/

Amazon OpenSearch Service is a managed service that makes it simple to secure, deploy, and operate OpenSearch clusters at scale in the AWS Cloud. Last year, we introduced Shard Indexing Backpressure and admission control, which monitor cluster resources and incoming traffic to selectively reject requests that would otherwise pose stability risks, such as out-of-memory errors, and degrade cluster performance through memory contention, CPU saturation, GC overhead, and more.

We are now excited to introduce Search Backpressure and CPU-based admission control for OpenSearch Service, which further enhances the resiliency of clusters. These improvements are available for all OpenSearch versions 1.3 or higher.

Search Backpressure

Backpressure prevents a system from being overwhelmed with work. It does so by controlling the traffic rate or by shedding excessive load in order to prevent crashes and data loss, improve performance, and avoid total failure of the system.

Search Backpressure is a mechanism to identify and cancel in-flight resource-intensive search requests when a node is under duress. It’s effective against search workloads with anomalously high resource usage (such as complex queries, slow queries, many hits, or heavy aggregations), which could otherwise cause node crashes and impact the cluster’s health.

Search Backpressure is built on top of the task resource tracking framework, which provides an easy-to-use API to monitor each task’s resource usage. Search Backpressure uses a background thread that periodically measures the node’s resource usage and assigns a cancellation score to each in-flight search task based on factors like CPU time, heap allocations, and elapsed time. A higher cancellation score corresponds to a more resource-intensive search request. Search requests are cancelled in descending order of their cancellation score to recover nodes quickly, but the number of cancellations is rate-limited to avoid wasteful work.
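The scoring and rate-limited cancellation described above can be sketched as follows. This is an illustrative model only, not the OpenSearch implementation: the thresholds, the scoring formula (sum of usage-to-threshold ratios for breached thresholds), and the names `SearchTask`, `cancellation_score`, and `pick_tasks_to_cancel` are all assumptions made for the example.

```python
from dataclasses import dataclass

@dataclass
class SearchTask:
    task_id: str
    cpu_seconds: float
    heap_mb: float
    elapsed_seconds: float

# Hypothetical thresholds, chosen only for illustration.
THRESHOLDS = {"cpu_seconds": 15.0, "heap_mb": 50.0, "elapsed_seconds": 30.0}

def cancellation_score(task: SearchTask) -> float:
    """Sum usage/threshold for every breached threshold; 0 means not eligible."""
    score = 0.0
    for name, limit in THRESHOLDS.items():
        usage = getattr(task, name)
        if usage >= limit:
            score += usage / limit
    return score

def pick_tasks_to_cancel(tasks, max_cancellations):
    """Return IDs of the highest-scoring tasks, capped to rate-limit cancellations."""
    scored = [(cancellation_score(t), t) for t in tasks]
    eligible = [(s, t) for s, t in scored if s > 0]
    eligible.sort(key=lambda pair: pair[0], reverse=True)
    return [t.task_id for _, t in eligible[:max_cancellations]]

if __name__ == "__main__":
    tasks = [
        SearchTask("a", cpu_seconds=1, heap_mb=10, elapsed_seconds=5),
        SearchTask("b", cpu_seconds=30, heap_mb=100, elapsed_seconds=60),
        SearchTask("c", cpu_seconds=16, heap_mb=10, elapsed_seconds=5),
        SearchTask("d", cpu_seconds=1, heap_mb=200, elapsed_seconds=1),
    ]
    print(pick_tasks_to_cancel(tasks, max_cancellations=2))  # ['b', 'd']
```

Task "c" breaches a threshold but is left running because the cap of two cancellations is already used by the more expensive tasks, which is the point of rate limiting: the node recovers quickly without discarding more work than necessary.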

The following diagram illustrates the Search Backpressure workflow.

Search requests return an HTTP 429 “Too Many Requests” status code upon cancellation. OpenSearch returns partial results if only some shards fail and partial results are allowed. See the following code:

{
    "error": {
        "root_cause": [
            {
                "type": "task_cancelled_exception",
                "reason": "cancelled task with reason: heap usage exceeded [403mb >= 77.6mb], elapsed time exceeded [1.7m >= 45s]"
            }
        ],
        "type": "search_phase_execution_exception",
        "reason": "SearchTask was cancelled",
        "phase": "fetch",
        "grouped": true,
        "failed_shards": [
            {
                "shard": 0,
                "index": "nyc_taxis",
                "node": "9gB3PDp6Speu61KvOheDXA",
                "reason": {
                    "type": "task_cancelled_exception",
                    "reason": "cancelled task with reason: heap usage exceeded [403mb >= 77.6mb], elapsed time exceeded [1.7m >= 45s]"
                }
            }
        ],
        "caused_by": {
            "type": "task_cancelled_exception",
            "reason": "cancelled task with reason: heap usage exceeded [403mb >= 77.6mb], elapsed time exceeded [1.7m >= 45s]"
        }
    },
    "status": 429
}
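On the client side, a response like the one above can be inspected to find out why a search was cancelled. The following is a minimal sketch that assumes the response body has the shape shown; the function name `cancellation_reasons` is hypothetical.

```python
import json

def cancellation_reasons(response_body: str):
    """Return the reasons reported by task_cancelled_exception root causes."""
    doc = json.loads(response_body)
    if doc.get("status") != 429:
        return []
    error = doc.get("error", {})
    return [
        cause.get("reason", "")
        for cause in error.get("root_cause", [])
        if cause.get("type") == "task_cancelled_exception"
    ]

if __name__ == "__main__":
    body = json.dumps({
        "error": {
            "root_cause": [{
                "type": "task_cancelled_exception",
                "reason": "cancelled task with reason: heap usage exceeded [403mb >= 77.6mb]",
            }],
            "type": "search_phase_execution_exception",
        },
        "status": 429,
    })
    for reason in cancellation_reasons(body):
        print(reason)
```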

Monitoring Search Backpressure

You can monitor the detailed Search Backpressure state using the node stats API:

curl -X GET "https://{endpoint}/_nodes/stats/search_backpressure"

You can also view the cluster-wide summary of cancellations using Amazon CloudWatch. The following metrics are now available in the ES/OpenSearchService namespace:

SearchTaskCancelled – The number of coordinator node cancellations
SearchShardTaskCancelled – The number of data node cancellations

The following screenshot shows an example of tracking these metrics on the CloudWatch console.

CPU-based admission control

Admission control is a gatekeeping mechanism that proactively limits the number of requests to a node based on its current capacity, both for organic increases and spikes in traffic.

In addition to the JVM memory pressure and request size thresholds, it now also monitors each node’s rolling average CPU usage to reject incoming _search and _bulk requests. It prevents nodes from being overwhelmed with too many requests leading to hot spots, performance problems, request timeouts, and other cascading failures. Excessive requests return an HTTP 429 “Too Many Requests” status code upon rejection.
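As a rough mental model of rolling-average gating, consider the sketch below. It is not how OpenSearch Service implements admission control; the window size, the CPU limit, and the class name `CpuAdmissionController` are assumptions chosen for illustration.

```python
from collections import deque

class CpuAdmissionController:
    """Rejects new requests while the rolling average CPU usage is at or above a limit."""

    def __init__(self, window_size=5, cpu_limit_percent=90.0):
        # Only the most recent `window_size` samples are kept.
        self.samples = deque(maxlen=window_size)
        self.cpu_limit_percent = cpu_limit_percent

    def record_cpu(self, percent):
        self.samples.append(percent)

    def rolling_average(self):
        return sum(self.samples) / len(self.samples) if self.samples else 0.0

    def admit(self):
        """Return an HTTP-style status: 200 to admit, 429 to reject."""
        return 429 if self.rolling_average() >= self.cpu_limit_percent else 200

if __name__ == "__main__":
    controller = CpuAdmissionController(window_size=3, cpu_limit_percent=80.0)
    for sample in (50, 60, 70):
        controller.record_cpu(sample)
    print(controller.admit())  # 200, the average is 60
    for sample in (100, 100):
        controller.record_cpu(sample)
    print(controller.admit())  # 429, the window is now [70, 100, 100]
```

Using an average over a window rather than the latest sample means a single short CPU spike does not immediately cause rejections.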

Handling HTTP 429 errors

You’ll receive HTTP 429 errors if you send excessive traffic to a node. It indicates either insufficient cluster resources, resource-intensive search requests, or an unintended spike in the workload.

Search Backpressure provides the reason for rejection, which can help fine-tune resource-intensive search requests. For traffic spikes, we recommend client-side retries with exponential backoff and jitter.
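A client-side retry loop with exponential backoff and jitter might look like the following sketch. The `TooManyRequests` exception and the `send` callable are hypothetical stand-ins for whatever your HTTP client raises and calls on a 429 response, and the base delay and cap are arbitrary example values.

```python
import random
import time

class TooManyRequests(Exception):
    """Hypothetical exception raised by the caller's client on an HTTP 429."""

def call_with_retries(send, max_retries=5, base=0.5, cap=30.0, sleep=time.sleep):
    """Call send(); on 429, wait a random time up to an exponentially growing bound."""
    for attempt in range(max_retries + 1):
        try:
            return send()
        except TooManyRequests:
            if attempt == max_retries:
                raise  # Retries exhausted; surface the error to the caller.
            # Full jitter: pick uniformly between 0 and the capped exponential bound.
            sleep(random.uniform(0, min(cap, base * (2 ** attempt))))

if __name__ == "__main__":
    attempts = {"count": 0}

    def flaky_request():
        attempts["count"] += 1
        if attempts["count"] < 3:
            raise TooManyRequests()
        return "ok"

    print(call_with_retries(flaky_request, base=0.01))  # ok
```

The jitter spreads retries from many clients over time, so they don’t all hit the node again at the same instant.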

You can also follow these troubleshooting guides to debug excessive rejections:

Conclusion

Search Backpressure is a reactive mechanism to shed excessive load, while admission control is a proactive mechanism to limit the number of requests to a node beyond its capacity. Both work in tandem to improve the overall resiliency of an OpenSearch cluster.

Search Backpressure is available in OpenSearch, and we are always looking for external contributions. You can refer to the RFC to get started.

About the authors

Ketan Verma is a Senior SDE working on Amazon OpenSearch Service. He is passionate about building large-scale distributed systems, improving performance, and simplifying complex ideas with simple abstractions. Outside work, he likes to read and improve his home barista skills.

Suresh N S is a Senior SDE working on Amazon OpenSearch Service. He is passionate about solving problems in large-scale distributed systems.

Pritkumar Ladani is an SDE-2 working on Amazon OpenSearch Service. He likes to contribute to open source software development, and is passionate about distributed systems. He is an amateur badminton player and enjoys trekking.

Bukhtawar Khan is a Principal Engineer working on Amazon OpenSearch Service. He is interested in building distributed and autonomous systems. He is a maintainer and an active contributor to OpenSearch.

Implement alerts in Amazon OpenSearch Service with PagerDuty

2023-06-08 Manikanta Gona

Post Syndicated from Manikanta Gona original https://aws.amazon.com/blogs/big-data/implement-alerts-in-amazon-opensearch-service-with-pagerduty/

In today’s fast-paced digital world, businesses rely heavily on their data to make informed decisions. This data is often stored and analyzed using various tools, such as Amazon OpenSearch Service, a powerful search and analytics service offered by AWS. OpenSearch Service provides real-time insights into your data to support use cases like interactive log analytics, real-time application monitoring, website search, and more. Analyzing logs can help businesses quickly identify and troubleshoot issues.

However, with the increasing amount of data, it can be challenging to monitor everything manually. Manual monitoring consumes a lot of resources and is hard to maintain as the application landscape changes. We need a sustainable and automated approach to monitor critical applications and infrastructure.

By combining automated alerting through a third-party service like PagerDuty, an incident management platform, with the alerting plugin provided by OpenSearch Service, businesses can proactively manage and respond to critical events. You can use this proactive alerting to monitor data patterns for existing data, monitor clusters, detect patterns, and more.

OpenSearch Dashboards provides an alerting plugin that you can use to set up various types of monitors and alerts. You can use the plugin to set up different monitors, including cluster health, an individual document, a custom query, or aggregated data. These monitors can be used to send alerts to users.

In this post, we demonstrate how to implement PagerDuty as the notification mechanism to get notified based on cluster health status. These notifications can be delivered via various channels, including email, SMS, or custom webhooks (like PagerDuty). The OpenSearch Service alerting plugin supports complex alert rules and provides a user interface to manage them.

Solution overview

PagerDuty is a cloud-based incident management platform that helps businesses handle their alerts and incidents in real time. PagerDuty works by consolidating alerts from various monitoring tools and routing them to the right team member, ensuring that issues are addressed promptly. Many businesses are using PagerDuty for real-time incident notifications via multiple channels, ensuring that the right team members are alerted quickly.

In this post, we describe how to set up PagerDuty and integrate it with an OpenSearch Service custom webhook for alert notifications when a threshold is met.

The following diagram illustrates OpenSearch Service running within an Amazon VPC, using monitors and triggers to send a notification to the PagerDuty service through an Events API custom webhook.

We need to set up a service and integration on PagerDuty to begin receiving incident notifications from OpenSearch Service. A service in PagerDuty represents an application, component, or team that we can trigger the notification against.

Prerequisites

Before you get started, create the following resources, if not already available:

An AWS account where you can create an OpenSearch Service domain with dashboard access to create monitors and notification channels. For instructions, refer to Creating and managing Amazon OpenSearch Service domains.
A PagerDuty account with access to create a service and integration.

Create a service on PagerDuty

To create a service on PagerDuty, complete the following steps:

Log in to PagerDuty using your personal or enterprise account that is being used to enable the integration with OpenSearch Service.
On the Services tab, choose New Service.
Enter a name and optional description, then choose Next.

In the next step, we create or assign an escalation policy for the service. An escalation policy represents the order of responsibility for reacting to the issues detected on a service.

If you already have an escalation policy defined within the organization or team, select Select an existing Escalation Policy and specify your policy. Otherwise, select Generate a new Escalation Policy, then choose Next.

In the next step, we can group the alerts based on time or content:

- To group alerts together based on the alert content, select Content-Based grouping.
- To group them based on a specific time duration, select Time-Based grouping.
- Selecting the Intelligent grouping option will group the alerts intelligently based on content or time.

Leave the defaults and choose Next.
On the Integrations page, select the Events API V2 integration (this will be used for integration with OpenSearch Service) and choose Create Service.

If you don’t select the integration during this step, you can add it later.

Take note of the integration key on the Integrations tab.

Create a notification channel on OpenSearch Service with a custom webhook

Custom webhooks provide the ability to send these notifications to third-party services like PagerDuty using a REST API. After we configure the notification channel, we can use it for other monitors beyond this use case and to detect data patterns that are stored within the cluster.

Complete the following steps to configure the notification channel:

On the OpenSearch Dashboards page, choose Notifications under Amazon OpenSearch Plugins in the navigation pane.
On the Channels tab, choose Create channel.
Enter a name for the channel and an optional description.
For Channel type, choose Custom webhook.
For Method, choose POST.
For Define endpoints by, select Custom attributes URL.

For Host, enter events.pagerduty.com.
For Path, enter v2/enqueue.
Under Webhook headers, choose Add header.
Enter X-Routing-Key as the key and the integration key you obtained earlier as the value.
Choose Create and ensure the channel is successfully created.
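The channel settings above amount to an HTTP POST with a routing header. As a sketch of the equivalent request, the function below assembles the URL, headers, and body without sending anything. The `https` scheme and the `Content-Type` header are assumptions, and `YOUR_INTEGRATION_KEY` is a placeholder for the integration key you noted earlier.

```python
import json

def build_pagerduty_request(integration_key, event):
    """Assemble the pieces of the POST that the custom webhook channel sends."""
    url = "https://events.pagerduty.com/v2/enqueue"  # Host + Path from the channel
    headers = {
        "Content-Type": "application/json",   # assumed for a JSON body
        "X-Routing-Key": integration_key,     # webhook header from the channel
    }
    body = json.dumps(event).encode("utf-8")
    return url, headers, body

if __name__ == "__main__":
    url, headers, body = build_pagerduty_request(
        "YOUR_INTEGRATION_KEY", {"event_action": "trigger"}
    )
    print(url)
    print(sorted(headers))
```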

Configure OpenSearch Service alerts to send notifications to PagerDuty

We can monitor OpenSearch cluster health in two different ways:

Using the OpenSearch Dashboards alerting plugin by setting up a per cluster metrics monitor. This provides a query to retrieve metrics related to the cluster health.
Integrating with Amazon CloudWatch, a monitoring and observability service.

In this use case, we use the alerting plugin. Complete the following steps:

On the OpenSearch Dashboards page, choose Alerting under Amazon OpenSearch Plugins in the navigation pane.
On the Monitors tab, choose Create monitor.
For Monitor name, enter a name (for example, Monitor Cluster Health).
For Monitor type, select Per cluster metrics monitor.
Under Schedule, configure the monitor to run every minute.
In the Query section, for Request type, choose Cluster health.
Choose Preview query.
Create a trigger by choosing Add trigger.
For Trigger name, enter a name (for example, Cluster Health Status is Red).
Leave Severity level at 1 (Highest).
Under Trigger condition, delete the default code and enter the following:

ctx.results[0].status == "red"

Choose Preview condition response to confirm that Trigger condition response shows as false, indicating that the cluster is healthy.
Under Actions, choose Add action.
For Action name, enter a name (for example, Send a PagerDuty notification).
For Channels, choose the channel you created earlier.
For Message, enter the following code:

{ "event_action": "trigger",
"payload" :
	{	"summary": "{{ctx.trigger.name}}",
		"source": " {{ctx.monitor.name}}",
		"severity": "critical",
		"custom_details":
			{ 
				"-Severity" : "{{ctx.trigger.severity}}",
				"-Period start" : "{{ctx.periodStart}}",
				"-Period end": "{{ctx.periodEnd}}"
			}
	}
}

Note that apart from the custom_details section in the code, the rest of the fields are mandatory for PagerDuty.

Choose Send test message and test to make sure you receive an alert on the PagerDuty service.
Choose Create and ensure the monitor was created successfully.

A notification will be sent to the PagerDuty service as part of the test, which will trigger a notification via a phone call or text message for the person who is available based on the escalation policy defined earlier. This notification can be safely acknowledged and resolved from PagerDuty because it was only a test.
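To see what the message template from the action configuration turns into, you can render it locally with sample values and check that the mandatory fields are present. In this sketch, the regex-based `render` function is a simplified stand-in for the plugin’s template rendering, the sample values are made up, and the `missing_required_fields` helper is hypothetical.

```python
import json
import re

MESSAGE_TEMPLATE = """{ "event_action": "trigger",
  "payload": {
    "summary": "{{ctx.trigger.name}}",
    "source": "{{ctx.monitor.name}}",
    "severity": "critical",
    "custom_details": {
      "-Severity": "{{ctx.trigger.severity}}",
      "-Period start": "{{ctx.periodStart}}",
      "-Period end": "{{ctx.periodEnd}}"
    }
  }
}"""

def render(template, values):
    """Replace each {{name}} placeholder with its value (simplified stand-in)."""
    return re.sub(r"\{\{\s*([\w.]+)\s*\}\}", lambda m: str(values[m.group(1)]), template)

def missing_required_fields(event):
    """List mandatory fields that are absent or empty; custom_details is optional."""
    payload = event.get("payload", {})
    missing = [] if event.get("event_action") else ["event_action"]
    missing += [f"payload.{key}" for key in ("summary", "source", "severity")
                if not payload.get(key)]
    return missing

if __name__ == "__main__":
    # Made-up sample values for the template variables.
    values = {
        "ctx.trigger.name": "Cluster Health Status is Red",
        "ctx.monitor.name": "Monitor Cluster Health",
        "ctx.trigger.severity": "1",
        "ctx.periodStart": "2023-06-08T10:00:00Z",
        "ctx.periodEnd": "2023-06-08T10:01:00Z",
    }
    event = json.loads(render(MESSAGE_TEMPLATE, values))
    print(missing_required_fields(event))  # []
```

Note that this simple renderer does not escape quotes, so it only works for values that are safe to place inside a JSON string.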

Clean up

To clean up the infrastructure and avoid additional charges, complete the following steps:

Delete the PagerDuty service.
Delete the OpenSearch Service domain that was created as part of the prerequisites.

Conclusion

The integration of OpenSearch Service alerts with PagerDuty provides a powerful and efficient solution for managing and responding to critical events in real time. With this integration, you can easily set up alerts and notifications to stay informed about potential issues within your OpenSearch Service clusters or issues related to data and documents stored within the cluster, and proactively take action to resolve any problems that arise. Additionally, the integration allows for seamless collaboration between teams, enabling them to work together to identify and troubleshoot issues as they occur.

For more information about anomaly detection and alerts in OpenSearch Service, refer to Anomaly Detection in Amazon OpenSearch and Configuring Alerts in Amazon OpenSearch.

About the Authors

Manikanta Gona is a Data and ML Engineer at AWS Professional Services. He joined AWS in 2021 with 6+ years of experience in IT. At AWS, he is focused on data lake implementations and on search and analytical workloads using Amazon OpenSearch Service. In his spare time, he loves to garden and go hiking and biking with his husband.

Vivek Shrivastava is a Principal Data Architect, Data Lake in AWS Professional Services. He is a big data enthusiast and holds 14 AWS Certifications. He is passionate about helping customers build scalable and high-performance data analytics solutions in the cloud. In his spare time, he loves reading and finding areas for home automation.

Ravikiran Rao is a Data Architect at AWS and is passionate about solving complex data challenges for various customers. Outside of work, he is a theatre enthusiast and an amateur tennis player.

Hari Krishna KC is a Data Architect with the AWS Professional Services Team. He specializes in AWS data lakes and Amazon OpenSearch Service and has helped numerous clients migrate their workloads to data lakes and search data stores.

Configure Amazon OpenSearch Service for high availability

2023-05-31 Rohin Bhargava

Post Syndicated from Rohin Bhargava original https://aws.amazon.com/blogs/big-data/configure-amazon-opensearch-service-for-high-availability/

Amazon OpenSearch Service is a fully open-source search and analytics engine that securely unlocks real-time search, monitoring, and analysis of business and operational data for use cases like recommendation engines, ecommerce sites, and catalog search. To be successful in your business, you need your systems to be highly available and performant, minimizing downtime and avoiding failure. When you use OpenSearch Service as your primary means of monitoring your infrastructure, you need to ensure its availability as well. Downtime for OpenSearch Service can have a significant effect on your business outcomes, such as loss of revenue, loss in productivity, loss in brand value, and more.

The industry standard for measuring availability is the class of nines. When you follow best practices, OpenSearch Service provides three 9s (99.9%) of availability, which means it guarantees less than 43.83 minutes of downtime a month. In this post, you will learn how to configure your OpenSearch Service domain for high availability and performance by following best practices and recommendations while setting up your domain.
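To make the class-of-nines arithmetic concrete, the following minimal sketch converts a number of nines into a monthly downtime budget. It assumes an average month derived from a 365.25-day year; the function name is hypothetical and used only for illustration.

```python
# Average month length in minutes, assuming a 365.25-day year.
MINUTES_PER_MONTH = 365.25 * 24 * 60 / 12


def max_downtime_minutes(nines: int) -> float:
    """Monthly downtime budget for an availability expressed as a number of nines."""
    unavailability = 10 ** (-nines)  # for example, 3 nines -> 0.001
    return MINUTES_PER_MONTH * unavailability


for n in (3, 4):
    print(f"{n} nines: {max_downtime_minutes(n):.2f} minutes per month")
```

Under these assumptions, three 9s works out to roughly 43.83 minutes a month, which matches the figure quoted above.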

There are two essential elements that influence your domain’s availability: the resource utilization of your domain, which is mostly driven by your workload, and external events such as infrastructure failures. Although the former can be controlled through continuous monitoring of the domain’s performance and health and scaling the domain accordingly, the latter cannot. To mitigate the impact of external events such as an Availability Zone outage, instance or disk failure, or networking issues on your domain, you must provision additional capacity, distributed over multiple Availability Zones, and keep multiple copies of data. Failure to do so may result in degraded performance, unavailability, and, in the worst-case situation, data loss.

Let’s look at the options available to you to ensure that your domain is available and performant.

Cluster configuration

In this section, we discuss the configuration options for setting up your cluster properly, including the number of Availability Zones for the deployment, the cluster manager and data nodes, and indexes and shards.

Multi-AZ deployment

Data nodes are responsible for processing indexing and search requests in your domain. Deploying your data nodes across multiple Availability Zones improves the availability of your domain by adding redundant, per-zone data storage and processing. With a Multi-AZ deployment, your domain can remain available even when a full Availability Zone becomes unavailable. For production workloads, AWS recommends using three Availability Zones for your domain. In Regions that support only two Availability Zones, use both for improved availability. This ensures that your domain remains available if a single Availability Zone fails.

Dedicated cluster manager (master nodes)

AWS recommends using three dedicated cluster manager (CM) nodes for all production workloads. CM nodes track the cluster’s health, the state and location of its indexes and shards, the mappings for all the indexes, and the availability of its data nodes, and they maintain a list of cluster-level tasks in process. Without dedicated CM nodes, the cluster uses data nodes for these tasks, which makes the cluster vulnerable to workload demands. You should size CM nodes based on the size of the task: primarily the data node count, the index count, and the shard count. OpenSearch Service always deploys CM nodes across three Availability Zones when supported by the Region (in Regions with only two Availability Zones, two CM nodes are placed in one zone and one in the other). For a running domain, only one of the three CM nodes works as the elected leader. The other two CM nodes participate in an election if the elected CM node fails.

The following table shows AWS’s recommendations for CM sizing. CM nodes do work based on the number of nodes, indexes, shards, and mapping. The more work, the more compute and memory you need to hold and work with the cluster state.

Instance Count	Cluster Manager Node RAM Size	Maximum Supported Shard Count	Recommended Minimum Dedicated Cluster Manager Instance Type
1–10	8 GiB	10,000	m5.large.search or m6g.large.search
11–30	16 GiB	30,000	c5.2xlarge.search or c6g.2xlarge.search
31–75	32 GiB	40,000	c5.4xlarge.search or c6g.4xlarge.search
76–125	64 GiB	75,000	r5.2xlarge.search or r6g.2xlarge.search
126–200	128 GiB	75,000	r5.4xlarge.search or r6g.4xlarge.search
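The table can be read as a simple lookup: pick the first row whose instance count and shard count limits both cover your domain. The following sketch encodes the rows above; the function name and the first-matching-row rule are assumptions for illustration, not an official sizing tool.

```python
# Rows copied from the sizing table above:
# (max instance count, CM RAM in GiB, max shard count, minimum instance types)
CM_SIZING = [
    (10, 8, 10_000, "m5.large.search or m6g.large.search"),
    (30, 16, 30_000, "c5.2xlarge.search or c6g.2xlarge.search"),
    (75, 32, 40_000, "c5.4xlarge.search or c6g.4xlarge.search"),
    (125, 64, 75_000, "r5.2xlarge.search or r6g.2xlarge.search"),
    (200, 128, 75_000, "r5.4xlarge.search or r6g.4xlarge.search"),
]


def recommend_cm(instance_count: int, shard_count: int) -> str:
    """Return the first table row that covers both the node and shard counts."""
    for max_nodes, _ram_gib, max_shards, instance_types in CM_SIZING:
        if instance_count <= max_nodes and shard_count <= max_shards:
            return instance_types
    raise ValueError("domain is larger than any row in the table")


print(recommend_cm(instance_count=5, shard_count=20_000))
```

Here a 5-node domain with 20,000 shards lands on the second row, because the shard count exceeds the first row’s limit even though the node count does not.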

Indexes and shards

An index is a logical construct that houses a collection of documents. You partition your index for parallel processing by specifying a primary shard count, where shards represent a physical unit for storing and processing data. In OpenSearch Service, a shard can be either a primary shard or a replica shard. You use replicas for durability (if the primary shard is lost, OpenSearch Service promotes one of the replicas to primary) and for improving search throughput. OpenSearch Service ensures that the primary and replica shards are placed on different nodes and, if the domain is deployed in more than one Availability Zone, across different Availability Zones. For high availability, AWS recommends configuring at least two replicas for each index in a three-zone setup to avoid disruption in performance and availability. In a Multi-AZ setup, if a node fails or, in the rare worst case, an Availability Zone fails, you will still have a copy of the data.
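As a rough illustration of the two-replica recommendation, the following sketch builds a request body that could be sent when creating an index. The number_of_shards and number_of_replicas settings are standard OpenSearch index settings; the helper name and the shard count of 6 are placeholders chosen for this example.

```python
import json


def index_settings(primary_shards: int, replicas: int = 2) -> dict:
    """Build an index-creation body with the given shard and replica counts."""
    return {
        "settings": {
            "index": {
                "number_of_shards": primary_shards,
                "number_of_replicas": replicas,  # two replicas -> three copies
            }
        }
    }


body = index_settings(primary_shards=6)
print(json.dumps(body, indent=2))
```

With three Availability Zones and two replicas, each zone can hold one copy of every shard.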

Cluster monitoring and management

As discussed earlier, selecting your configuration based on best practices is only half the job. We also need to continuously monitor the resource utilization and performance to determine if the domain needs to be scaled. An under-provisioned or over-utilized domain can result in performance degradation and eventually unavailability.

CPU utilization

You use the CPU in your domain to run your workload. As a general rule, you should target 60% average CPU utilization for any data node, with peaks at 80%, and tolerate small spikes to 100%. When you consider availability, and especially the unavailability of a full zone, there are two scenarios. If you have two Availability Zones, each zone handles 50% of the traffic. If a zone becomes unavailable, the other zone takes all of that traffic, doubling its CPU utilization. In that case, you need to be at around 30–40% average CPU utilization in each zone to maintain availability. If you are running three Availability Zones, each zone takes about 33% of the traffic. If a zone becomes unavailable, each remaining zone gains approximately 17 percentage points of traffic, rising from 33% to 50%. In this case, you should target 50–60% average CPU utilization.
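The headroom reasoning can be sketched as a small calculation. It assumes that CPU utilization scales linearly with traffic and that the surviving zones should stay at or below the 80% peak mentioned above; both are simplifying assumptions, and the function name is hypothetical.

```python
def target_cpu(zones: int, ceiling: float = 80.0) -> float:
    """Highest steady-state average CPU (%) that keeps the surviving zones
    at or below `ceiling` after one full zone is lost."""
    if zones < 2:
        raise ValueError("need at least two zones to survive a zone failure")
    # After one zone fails, the remaining (zones - 1) zones carry all traffic,
    # so per-zone load grows by a factor of zones / (zones - 1).
    return ceiling * (zones - 1) / zones


for z in (2, 3):
    print(f"{z} zones: target about {target_cpu(z):.0f}% average CPU")
```

Under these assumptions, the result is 40% for two zones and about 53% for three, which falls within the ranges given above.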

Memory utilization

OpenSearch Service supports two types of garbage collection. The first is G1 garbage collection (G1GC), which is used by OpenSearch Service nodes powered by AWS Graviton2. The second is Concurrent Mark Sweep (CMS), which is used by all nodes powered by other processors. Out of all the memory allocated to a node, half of the memory (up to 32 GB) is assigned to the Java heap, and the rest is used by other operating system tasks, the file system cache, and so on. To maintain availability for a domain, we recommend keeping the maximum JVM utilization at around 80% with CMS and 95% with G1GC. Anything beyond that would impact the availability of your domain and make your cluster unhealthy. We also recommend enabling Auto-Tune, which actively monitors the memory utilization and triggers the garbage collector.
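The heap rule and the two utilization thresholds can be expressed as a short sketch. The function names are hypothetical, and the thresholds are simply the 80% and 95% figures quoted above.

```python
# Recommended maximum JVM utilization per garbage collector, from the text above.
MAX_JVM_UTILIZATION = {"CMS": 0.80, "G1GC": 0.95}


def jvm_heap_gb(instance_ram_gb: float) -> float:
    """Half of the instance memory goes to the Java heap, capped at 32 GB."""
    return min(instance_ram_gb / 2, 32.0)


def heap_is_healthy(used_gb: float, instance_ram_gb: float, gc: str) -> bool:
    """True if heap usage is at or below the recommended maximum for this GC."""
    return used_gb <= jvm_heap_gb(instance_ram_gb) * MAX_JVM_UTILIZATION[gc]


print(jvm_heap_gb(16))                    # heap for a 16 GB instance
print(heap_is_healthy(7.0, 16, "CMS"))    # 7 GB used of an 8 GB heap under CMS
print(heap_is_healthy(7.0, 16, "G1GC"))   # same usage under G1GC
```

The same 7 GB of heap usage on an 8 GB heap is above the CMS threshold but within the G1GC threshold.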

Storage utilization

OpenSearch Service publishes several guidelines for sizing domains. We provide an empirical formula so that you can determine the right amount of storage for your needs. However, it’s important to watch for storage depletion over time and for changes in workload characteristics. To ensure the domain doesn’t run out of storage and can continue to index data, you should configure Amazon CloudWatch alarms and monitor your free storage space.

AWS also recommends choosing a primary shard count so that each shard is within an optimal size band. You can determine the optimal shard size through proof-of-concept testing with your data and traffic. We use 10–30 GB primary shard sizes for search use cases and 45–50 GB primary shard sizes for log analytics use cases as a guideline. Because shards are the workers in your domain, they are directly responsible for the distribution of the workload across the data nodes. If your shards are too large, you may see stress in your Java heap from large aggregations, worse query performance, and worse performance on cluster-level tasks like shard rebalancing, snapshots, and hot-to-warm migrations. If your shards are too small, they can overwhelm the domain’s Java heap space, worsen query performance through excessive internal networking, and make cluster-level tasks slow. We also recommend keeping the number of shards per node proportional to the heap available (half of the instance’s RAM up to 32 GB)—25 shards per GB of Java heap. This makes a practical limit of 1,000 shards on any data node in your domain.
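The shard sizing guidance reduces to two small calculations, sketched below. The function names and the example sizes are assumptions for illustration. Note that 25 shards per GB of a 32 GB heap gives 800; the 1,000-shard figure quoted above is treated here as a separate upper bound and is not modeled.

```python
import math


def primary_shard_count(index_size_gb: float, target_shard_gb: float) -> int:
    """Number of primary shards needed to keep each shard near the target size."""
    return max(1, math.ceil(index_size_gb / target_shard_gb))


def max_shards_per_node(instance_ram_gb: float) -> int:
    """25 shards per GB of Java heap, where heap is half of RAM up to 32 GB."""
    heap_gb = min(instance_ram_gb / 2, 32)
    return int(25 * heap_gb)


# A 300 GB search index with a 30 GB target shard size.
print(primary_shard_count(300, 30))
# Shard budget for a data node with 64 GB of RAM.
print(max_shards_per_node(64))
```

Treat the target shard size as a starting point and confirm it through proof-of-concept testing with your own data and traffic.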

Conclusion

In this post, you learned various tips and tricks to set up a highly available domain using OpenSearch Service, which helps you to keep OpenSearch Service performant and available by running it across three Availability Zones.

Stay tuned for a series of posts focusing on the various features and functionalities with OpenSearch Service. If you have feedback about this post, submit it in the comments section. If you have questions about this post, start a new thread on the OpenSearch Service forum or contact AWS Support.

About the authors

Rohin Bhargava is a Sr. Product Manager with the Amazon OpenSearch Service team. His passion at AWS is to help customers find the correct mix of AWS services to achieve success for their business goals.

Prashant Agrawal is a Sr. Search Specialist Solutions Architect with Amazon OpenSearch Service. He works closely with customers to help them migrate their workloads to the cloud and helps existing customers fine-tune their clusters to achieve better performance and save on cost. Before joining AWS, he helped various customers use OpenSearch and Elasticsearch for their search and log analytics use cases. When not working, you can find him traveling and exploring new places. In short, he likes doing Eat → Travel → Repeat.

Amazon OpenSearch Service Under the Hood: Multi-AZ with standby

2023-05-10 Rohin Bhargava

Post Syndicated from Rohin Bhargava original https://aws.amazon.com/blogs/big-data/amazon-opensearch-service-under-the-hood-multi-az-with-standby/

Amazon OpenSearch Service recently announced Multi-AZ with standby, a new deployment option for managed clusters that enables 99.99% availability and consistent performance for business-critical workloads. With Multi-AZ with standby, clusters are resilient to infrastructure failures like hardware or networking failure. This option provides improved reliability and the added benefit of simplifying cluster configuration and management by enforcing best practices and reducing complexity.

In this post, we share how Multi-AZ with standby works under the hood to achieve high resiliency and consistent performance to meet the four 9s.

Background

One of the principles in designing highly available systems is that they need to be ready for impairments before they happen. OpenSearch is a distributed system, which runs on a cluster of instances that have different roles. In OpenSearch Service, you can deploy data nodes to store your data and respond to indexing and search requests, and you can also deploy dedicated cluster manager nodes to manage and orchestrate the cluster. To provide high availability, one common approach for the cloud is to deploy infrastructure across multiple AWS Availability Zones. Even in the rare case that a full zone becomes unavailable, the available zones continue to serve traffic with replicas.

When you use OpenSearch Service, you create indexes to hold your data and specify partitioning and replication for those indexes. Each index is comprised of a set of primary shards and zero to many replicas of those shards. When you additionally use the Multi-AZ feature, OpenSearch Service ensures that primary shards and replica shards are distributed so that they’re in different Availability Zones.

Previously, when there was an impairment in an Availability Zone, the service would scale up in other Availability Zones and redistribute shards to spread out the load evenly. This approach was reactive at best. Additionally, shard redistribution during failure events causes increased resource utilization, leading to increased latencies and overloaded nodes, further impacting availability and effectively defeating the purpose of fault-tolerant, multi-AZ clusters. A more effective, statically stable cluster configuration requires provisioning infrastructure to the point where it can continue operating correctly without having to launch any new capacity or redistribute any shards even if an Availability Zone becomes impaired.

Designing for high availability

OpenSearch Service manages tens of thousands of OpenSearch clusters. We’ve gained insights into which cluster configurations like hardware (data or cluster-manager instance types) or storage (EBS volume types), shard sizes, and so on are more resilient to failures and can meet the demands of common customer workloads. Some of these configurations have been included in Multi-AZ with standby to simplify configuring the clusters. However, this alone is not enough. A key ingredient in achieving high availability is maintaining data redundancy.

When you configure a single replica (two copies) for your indexes, the cluster can tolerate the loss of one shard (primary or replica) and still recover by copying the remaining shard. A two-replica (three copies) configuration can tolerate failure of two copies. In the case of a single replica with two copies, you can still sustain data loss. For example, you could lose data if there is a catastrophic failure in one Availability Zone for a prolonged duration, and at the same time, a node in a second zone fails. To ensure data redundancy at all times, the cluster enforces a minimum of two replicas (three copies) across all its indexes. The following diagram illustrates this architecture.
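The relationship between replicas, copies, and tolerated failures can be sketched as follows. This is a deliberately simplified model in which data survives as long as at least one copy of a shard remains; the function names are hypothetical.

```python
def copies(replicas: int) -> int:
    """Total copies of each shard: one primary plus its replicas."""
    return replicas + 1


def data_survives(replicas: int, lost_copies: int) -> bool:
    """Data survives as long as at least one copy of the shard remains."""
    return lost_copies < copies(replicas)


# Single replica: survives one lost copy, but not a zone failure plus a node failure.
print(data_survives(replicas=1, lost_copies=1))
print(data_survives(replicas=1, lost_copies=2))
# Two replicas: still has a copy after two simultaneous failures.
print(data_survives(replicas=2, lost_copies=2))
```

This is why the double-fault scenario described above loses data with a single replica but not with two.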

The Multi-AZ with standby feature deploys infrastructure in three Availability Zones, while keeping two zones as active and one zone as standby. The standby zone offers consistent performance even during zonal failures by ensuring the same capacity at all times and by using a statically stable design without any capacity provisioning or data movement during failure. During normal operations, the active zones serve coordinator traffic for read and write requests and shard query traffic, and only replication traffic goes to the standby zone. OpenSearch uses a synchronous replication protocol for write requests, which by design has zero replication lag, enabling the service to instantaneously promote a standby zone to active in the event of any failure in an active zone. This event is referred to as a zonal failover. The previously active zone is demoted to standby mode, and recovery operations to bring the state back to healthy begin.

Why zonal failover is critical but hard to do right

One or more nodes in an Availability Zone can fail for a wide variety of reasons, like hardware failures, infrastructure failures like fiber cuts, power or thermal issues, or inter-zone or intra-zone networking problems. Read requests can be served by any of the active zones, whereas write requests need to be synchronously replicated to all copies across multiple Availability Zones. OpenSearch Service orchestrates two modes of failover: read failovers and write failovers.

The primary goals of read failovers are high availability and consistent performance. This requires the system to constantly monitor for faults and shift traffic away from the unhealthy nodes in the impacted zone. The system handles the failovers gracefully, allowing all in-flight requests to finish while simultaneously shifting new incoming traffic to a healthy zone. However, it’s also possible for multiple shard copies across both active zones to be unavailable in cases of two node failures or one zone plus one node failure (often referred to as double faults), which poses a risk to availability. To solve this challenge, the system uses a fail-open mechanism to serve traffic off the third zone, even while it is still in standby mode, to ensure the system remains highly available. The following diagram illustrates this architecture.

An impaired network device impacting inter-zone communication can cause write requests to slow down significantly, owing to the synchronous nature of replication. In such an event, the system orchestrates a write failover to isolate the impaired zone, cutting off all ingress and egress traffic. Although with write failovers the recovery is immediate, it results in all nodes in the zone, along with their shards, being taken offline. However, after the impacted zone is brought back following network recovery, shard recovery should still be able to use unchanged data from its local disk, avoiding a full segment copy. Because a write failover makes the shard copies in that zone unavailable, we exercise write failovers with extreme caution, neither too frequently nor during transient failures.

The following graph shows that during a zonal failure, automatic read failover prevents any impact to availability.

The following graph shows that during a networking slowdown in a zone, write failover helps recover availability.

To ensure that the zonal failover mechanism is predictable (able to seamlessly shift traffic during an actual failure event), we regularly exercise failovers and keep rotating active and standby zones even during steady state. This not only verifies all network paths, ensuring we don’t hit surprises like clock skews, stale credentials, or networking issues during failover, but it also keeps gradually shifting caches to avoid cold starts on failovers, ensuring we deliver consistent performance at all times.

Improving the resiliency of the service

OpenSearch Service uses several principles and best practices to increase reliability, like automatic detection and faster recovery from failure, throttling excess requests, fail fast strategies, limiting queue sizes, quickly adapting to meet workload demands, implementing loosely coupled dependencies, continuously testing for failures, and more. We discuss a few of these methods in this section.

Automatic failure detection and recovery

All faults are monitored at one-minute granularity, across multiple sub-minute metric data points. Once a fault is detected, the system automatically triggers a recovery action on the impacted node. Although most classes of failures discussed so far in this post refer to binary failures where the failure is definitive, there is another kind of failure: non-binary failures, termed gray failures, whose manifestations are subtle and usually defy quick detection. Slow disk I/O is one example, which adversely impacts performance. The monitoring system detects anomalies in I/O wait times, latencies, and throughput in order to detect and replace a node with slow I/O. Fast, effective detection and quick recovery are our best bet for a wide variety of infrastructure failures beyond our control.

Effective workload management in a dynamic environment

We’ve studied workload patterns that cause the system to be overloaded: either too many requests max out CPU and memory, or a few rogue queries allocate huge chunks of memory or run away and exhaust multiple cores. Both degrade the latencies of other critical requests or cause multiple nodes to fail as the system’s resources run low. Some of the improvements in this direction are being done as part of the search backpressure initiatives, starting with tracking the request footprint at various checkpoints, which prevents the system from accommodating more requests and cancels the ones already running if they breach the resource limits for a sustained duration. To supplement backpressure in traffic shaping, we use admission control, which rejects a request at the entry point to avoid doing non-productive work (requests that either time out or get cancelled) when the system is already running high on CPU and memory. Most of the workload management mechanisms have configurable knobs. No one size fits all workloads, so we use Auto-Tune to control them more granularly.

The cluster manager performs critical coordination tasks like metadata management and cluster formation, and orchestrates a few background operations like snapshot and shard placement. We added a task throttler to control the rate of dynamic mapping updates, snapshot tasks, and so on to prevent overwhelming it and to let critical operations run deterministically all the time. But what happens when there is no cluster manager in the cluster? The next section covers how we solved this.

Decoupling critical dependencies

In the event of cluster manager failure, searches continue as usual, but all write requests start to fail. We concluded that allowing writes in this state should still be safe as long as it doesn’t need to update the cluster metadata. This change further improves the write availability without compromising data consistency. Other service dependencies were evaluated to ensure downstream dependencies can scale as the cluster grows.

Failure mode testing

Although it’s hard to mimic all kinds of failures, we rely on AWS Fault Injection Simulator (AWS FIS) to inject common faults in the system like node failures, disk impairment, or network impairment. Testing with AWS FIS regularly in our pipelines helps us improve our detection, monitoring, and recovery times.

Contributing to open source

OpenSearch is open-source, community-driven software. Most of the changes, including the high availability design to support active and standby zones, have been contributed to open source; in fact, we follow an open-source first development model. The fundamental primitive that enables zonal traffic shift and failover is based on a weighted traffic routing policy (active zones are assigned a weight of 1 and standby zones are assigned a weight of 0). Write failovers use the zonal decommission action, which evacuates all traffic from a given zone. Resiliency improvements for search backpressure and cluster manager task throttling are some of the ongoing efforts. If you’re excited to contribute to OpenSearch, open up a GitHub issue and let us know your thoughts.
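The weighted routing policy can be illustrated with a short sketch that assigns a weight of 1 to active zones and 0 to the standby zone. The zone names are placeholders. The payload shape mirrors our understanding of the weighted routing API in open-source OpenSearch (a PUT to _cluster/routing/awareness/<attribute>/weights); treat the exact endpoint and any additional required fields, such as a version field, as assumptions to verify against the OpenSearch documentation.

```python
import json


def zone_weights(zones: list, standby: str) -> dict:
    """Assign weight 1 to active zones and weight 0 to the standby zone."""
    if standby not in zones:
        raise ValueError(f"unknown standby zone: {standby}")
    return {"weights": {z: ("0" if z == standby else "1") for z in zones}}


# Hypothetical zone names; rotating the standby zone only changes which zone gets 0.
payload = zone_weights(["zone_1", "zone_2", "zone_3"], standby="zone_3")
print(json.dumps(payload, indent=2))
```

Rotating active and standby zones, as described earlier, amounts to recomputing this mapping with a different standby zone.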

Summary

Efforts to improve reliability are a never-ending cycle as we continue to learn and improve. With the Multi-AZ with standby feature, OpenSearch Service has integrated best practices for cluster configuration, improved workload management, and achieved four 9s of availability and consistent performance. OpenSearch Service also raised the bar by continuously verifying availability with zonal traffic rotations and automated tests via AWS FIS.

We are excited to continue our efforts into improving the reliability and fault tolerance even further and to see what new and existing solutions builders can create using OpenSearch Service. We hope this leads to a deeper understanding of the right level of availability based on the needs of your business and how this offering achieves the availability SLA. We would love to hear from you, especially about your success stories achieving high levels of availability on AWS. If you have other questions, please leave a comment.

About the authors

Gaurav Bafna is a Senior Software Engineer working on OpenSearch at Amazon Web Services. He is fascinated about solving problems in distributed systems. He is a maintainer and an active contributor to OpenSearch.

Murali Krishna is a Senior Principal Engineer at AWS OpenSearch Service. He has built AWS OpenSearch Service and AWS CloudSearch. His areas of expertise include Information Retrieval, Large scale distributed computing, low latency real time serving systems etc. He has vast experience in designing and building web scale systems for crawling, processing, indexing and serving text and multimedia content. Prior to Amazon, he was part of Yahoo!, building crawling and indexing systems for their search products.

Ranjith Ramachandra is a Senior Engineering Manager working on Amazon OpenSearch Service. He is passionate about highly scalable distributed systems, high performance and resilient systems.

Use SAML Identities for programmatic access to Amazon OpenSearch Service

2023-05-09 Muthu Pitchaimani

Post Syndicated from Muthu Pitchaimani original https://aws.amazon.com/blogs/big-data/use-saml-identities-for-programmatic-access-to-amazon-opensearch-service/

Customers of Amazon OpenSearch Service can already use Security Assertion Markup Language (SAML) to access OpenSearch Dashboards.

This post outlines two methods by which programmatic users can now access OpenSearch using SAML identities. This applies to all identity providers (IdPs) that support SAML 2.0, including prevalent ones like Active Directory Federation Service (ADFS), Okta, AWS IAM Identity Center (Successor to AWS Single Sign-On), KeyCloak, and others. Although we outline the methods as they pertain to OpenSearch Service and AWS Identity and Access Management (IAM), programmatic access to each of these individual providers is outside the scope of this post. Most of these providers do provide such a facility.

Single sign-on methods

When you use single sign-on (SSO), there are two different authentication methods:

Identity provider initiated – This is when a user or a user-agent first authenticates with an IdP and gets a SAML assertion that establishes the identity of the user. This assertion is then passed to a service provider (SP) that provides access to a protected resource.
Service provider initiated – Although the IdP-initiated exchange is straightforward, a more typical sign-on experience is when the protected resource is accessed directly. The SP then redirects the user to the IdP for authentication along with a SAML authentication request. The IdP responds with an authentication assertion inside a SAML response. After that, the SSO experience is the same as that of an IdP-initiated flow.

For programmatic access to OpenSearch Service, an external IdP is the IdP, and OpenSearch Service and IAM both serve as SPs. To configure your IdP of choice as the SAML IdP for IAM, refer to Creating IAM SAML identity providers. To configure OpenSearch Service, refer to SAML authentication for OpenSearch Dashboards.

In the following sections, we outline two methods to access the OpenSearch Service API:

Using AWS Security Token Service (AWS STS)
Using the OpenSearch Dashboards’ console proxy

Method 1: Use AWS STS

The following figure shows the sequence of calls to access OpenSearch Service API using AWS STS.

Let’s explore each step in more detail.

Steps 1 and 2

Steps 1 and 2 vary depending on your chosen IdP. IdPs typically provide an authentication API, a session API, or a similar API to authenticate and retrieve the SAML authentication assertion response. We use this SAML assertion in the next step.

Steps 3 and 4

Call the AssumeRoleWithSAML AWS STS API to exchange the SAML assertion for temporary credentials associated with your SAML identity. See the following code:

curl --location --get 'https://sts.amazonaws.com/' \
--data-urlencode 'Version=2011-06-15' \
--data-urlencode 'Action=AssumeRoleWithSAML' \
--data-urlencode 'RoleArn=<ARN of the role being assumed>' \
--data-urlencode 'PrincipalArn=<ARN of the IdP integrated with IAM>' \
--data-urlencode 'SAMLAssertion=<Base-64 encoded SAML assertion>'

The response contains the temporary AWS STS credentials with AccessKeyId, SecretAccessKey, and a SessionToken.
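If you prefer to build the request in code, the following sketch assembles the same query string with the Python standard library. It only constructs the URL and doesn’t send the request. The ARNs and the assertion value are placeholders. The main point is that the Base64-encoded assertion must be URL-encoded, because it can contain characters such as +, /, and =.

```python
from urllib.parse import urlencode

STS_ENDPOINT = "https://sts.amazonaws.com/"


def assume_role_with_saml_url(role_arn: str, principal_arn: str,
                              saml_assertion_b64: str) -> str:
    """Build the AssumeRoleWithSAML query URL with URL-encoded parameters."""
    params = {
        "Version": "2011-06-15",
        "Action": "AssumeRoleWithSAML",
        "RoleArn": role_arn,
        "PrincipalArn": principal_arn,
        "SAMLAssertion": saml_assertion_b64,
    }
    return STS_ENDPOINT + "?" + urlencode(params)


# Placeholder values for illustration only.
url = assume_role_with_saml_url(
    "arn:aws:iam::123456789012:role/example-role",
    "arn:aws:iam::123456789012:saml-provider/example-idp",
    "a+b/c==",
)
print(url)
```

In practice, an AWS SDK can make this call for you and parse the credentials out of the response.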

Step 5

Use the temporary credentials from the last step to sign all API requests to OpenSearch Service. Also ensure the role that you assumed with the AssumeRoleWithSAML call has sufficient permission to access the requisite data in OpenSearch Service. Refer to Mapping roles to users for more information about mapping this role as a backend role. As an additional step to ensure consistency, this AWS STS role and any SAML group the user is part of can be mapped to the same role in OpenSearch Service. The following code shows a model to make this call:

curl --location '<OpenSearch Service domain URL>/_search' \
--header 'X-Amz-Security-Token: Fwo...==(truncated)' \
--header 'X-Amz-Date: 20230327T134710Z' \
--header 'Authorization: AWS4-HMAC-SHA256 Credential=ASI..(truncated)/20230327/us-east-1/es/aws4_request, SignedHeaders=host;x-amz-date;x-amz-security-token, Signature=95eb…(truncated)'
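The Authorization header in the preceding command is an AWS Signature Version 4 signature. The following sketch shows how such headers could be derived for a GET request with an empty body and no query string, using only the Python standard library. The function name, host, and credential values are placeholders, and the path is assumed to be already URI-encoded. In practice, an AWS SDK or a signing library usually handles this for you.

```python
import hashlib
import hmac
from datetime import datetime, timezone


def _hmac(key: bytes, msg: str) -> bytes:
    return hmac.new(key, msg.encode("utf-8"), hashlib.sha256).digest()


def sigv4_headers(host, path, access_key, secret_key, session_token,
                  region="us-east-1", service="es", now=None):
    """Signature Version 4 headers for a GET request with an empty body."""
    now = now or datetime.now(timezone.utc)
    amz_date = now.strftime("%Y%m%dT%H%M%SZ")
    date_stamp = now.strftime("%Y%m%d")

    # Canonical request: method, URI, query string, headers, signed headers, payload hash.
    signed_headers = "host;x-amz-date;x-amz-security-token"
    canonical_headers = (f"host:{host}\n"
                         f"x-amz-date:{amz_date}\n"
                         f"x-amz-security-token:{session_token}\n")
    payload_hash = hashlib.sha256(b"").hexdigest()
    canonical_request = "\n".join(
        ["GET", path, "", canonical_headers, signed_headers, payload_hash])

    # String to sign, scoped to the date, Region, and service.
    scope = f"{date_stamp}/{region}/{service}/aws4_request"
    string_to_sign = "\n".join([
        "AWS4-HMAC-SHA256", amz_date, scope,
        hashlib.sha256(canonical_request.encode("utf-8")).hexdigest()])

    # Derive the signing key from the secret key, then sign.
    key = _hmac(("AWS4" + secret_key).encode("utf-8"), date_stamp)
    for part in (region, service, "aws4_request"):
        key = _hmac(key, part)
    signature = hmac.new(key, string_to_sign.encode("utf-8"),
                         hashlib.sha256).hexdigest()

    return {
        "X-Amz-Date": amz_date,
        "X-Amz-Security-Token": session_token,
        "Authorization": (f"AWS4-HMAC-SHA256 Credential={access_key}/{scope}, "
                          f"SignedHeaders={signed_headers}, Signature={signature}"),
    }


# Placeholder host and credentials for illustration only.
headers = sigv4_headers(
    "search-example.us-east-1.es.amazonaws.com", "/_search",
    "ASIAEXAMPLE", "example-secret", "example-session-token",
    now=datetime(2023, 3, 27, 13, 47, 10, tzinfo=timezone.utc))
print(headers["Authorization"])
```

The temporary credentials returned by AssumeRoleWithSAML supply the access key, secret key, and session token arguments.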

Method 2: Use OpenSearch Dashboards’ console proxy

OpenSearch Dashboards has a component called a console proxy that can proxy requests to OpenSearch. This allows OpenSearch clients to make the same API calls in Domain Specific Language (DSL) to this console proxy instead of directly calling OpenSearch. The console proxy forwards these calls to OpenSearch and responds back to the clients in the same format as OpenSearch.

The following figure shows the sequence of calls you can make to the console proxy to gain programmatic access to OpenSearch Service.

Steps 1 and 2

The first two steps are similar to method 1, and they will vary depending on what IdP is chosen. Essentially, you need to obtain a SAML authentication assertion response from the IdP.

Steps 3 and 4

Use the SAML assertion from the previous steps and POST it to the Assertion Consumer Service (ACS) URL, _opendistro/_security/saml/acs/idpinitiated, to exchange the assertion for the security_authentication token. The following code shows the command line for these steps:

curl --location '<dashboards URL>/_opendistro/_security/saml/acs/idpinitiated' \
--header 'content-type: application/x-www-form-urlencoded' \
--data-urlencode 'SAMLResponse=<Base-64 encoded SAML assertion>' \
--data-urlencode 'RelayState='

If you’re using the OpenSearch engine, the dashboard URL is <domain URL>/_dashboards. If you’re using the Elasticsearch engine, the dashboard URL is <domain URL>/_plugin/kibana. OpenSearch Dashboards processes this and responds with a redirect response with code 302 and an empty body. The response headers now also contain a cookie named security_authentication, which is the token you must use in all subsequent calls.
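To pull the token out of the response programmatically, you can parse the Set-Cookie header. The following sketch uses the Python standard library; the header value shown is a shortened placeholder, and the function name is hypothetical.

```python
from http.cookies import SimpleCookie


def extract_auth_cookie(set_cookie_header: str) -> str:
    """Return the security_authentication token from a Set-Cookie header value."""
    cookie = SimpleCookie()
    cookie.load(set_cookie_header)
    if "security_authentication" not in cookie:
        raise KeyError("security_authentication cookie not found")
    return cookie["security_authentication"].value


# Shortened placeholder value; a real token is much longer.
header_value = "security_authentication=Fe26.2**abc123; Path=/; Secure; HttpOnly"
token = extract_auth_cookie(header_value)
print(token)
```

Because the response is a 302 redirect, make sure your HTTP client doesn’t follow the redirect and discard the headers before you read the cookie.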

Steps 5–8

Use the security_authentication cookie in the API calls to the console proxy to perform programmatic API calls. The following code shows a command line for these steps:

curl --location '<dashboards URL>/api/console/proxy?path=_search&method=GET' \
--header 'content-type: application/json' \
--header 'cookie: security_authentication=Fe26.2**1...(truncated)' \
--header 'osd-xsrf: true' \
--data '{
  "query": {
    "match_all": {}
  }
}'

Make sure to include the header osd-xsrf: true for programmatic access to dashboards. The console proxy path is /api/console/proxy for Elasticsearch engine versions 6.x and 7.x and OpenSearch engine versions 1.x and 2.x.
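The console proxy request can also be assembled in code. The following sketch builds the URL and headers with the Python standard library without sending anything. The dashboards URL and token are placeholders, and the function names are hypothetical.

```python
from urllib.parse import urlencode


def console_proxy_url(dashboards_url: str, path: str, method: str = "GET") -> str:
    """Build the console proxy URL, URL-encoding the target path and method."""
    query = urlencode({"path": path, "method": method})
    return f"{dashboards_url.rstrip('/')}/api/console/proxy?{query}"


def console_proxy_headers(token: str) -> dict:
    """Headers required by the console proxy, including the osd-xsrf header."""
    return {
        "content-type": "application/json",
        "cookie": f"security_authentication={token}",
        "osd-xsrf": "true",
    }


# Placeholder dashboards URL for illustration only.
proxy_url = console_proxy_url("https://example.com/_dashboards/", "_search")
print(proxy_url)
print(console_proxy_headers("<token>"))
```

URL-encoding the path matters when it contains a slash, such as an index name followed by _search.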

Similar to method 1, make sure to map roles and groups associated with a particular SAML identity as the correct backend role with requisite permissions.

Comparing these methods

You can use method 1 in any domain regardless of the engine as long as fine-grained access control is enabled. Method 2 only works for domains with Elasticsearch engine versions greater than 6.7 and all OpenSearch engine versions.

The OpenSearch Dashboards process is generally meant for human interactions, which have a lower API call rate and volume than programmatic calls. OpenSearch itself can handle considerably higher API call rates and volumes, so take care not to send high-volume API calls using method 2. As a best practice for programmatic access with SAML identities, we recommend method 1 wherever possible to avoid performance bottlenecks.

Conclusion

Both of the methods outlined in this post provide a similar flow to access OpenSearch Service programmatically using SAML identities (exchanging a SAML assertion for an authentication token). AssumeRoleWithSAML is a key and fairly straightforward-to-use API that enables this access and is our recommended method. Try one of the OpenSearch Service labs and launch an OpenSearch Service domain to experiment with these methods. Good luck!

About the author

Muthu Pitchaimani is a Search Specialist with Amazon OpenSearch Service. He builds large-scale search applications and solutions. Muthu is interested in the topics of networking and security, and is based out of Austin, Texas.

Amazon OpenSearch Service now supports 99.99% availability using Multi-AZ with Standby

2023-05-03 Prashant Agrawal

Post Syndicated from Prashant Agrawal original https://aws.amazon.com/blogs/big-data/amazon-opensearch-service-now-supports-99-99-availability-using-multi-az-with-standby/

Customers use Amazon OpenSearch Service for mission-critical applications and monitoring. But what happens when OpenSearch Service itself is unavailable? If your ecommerce search is down, for example, you’re losing revenue. If you’re monitoring your application with OpenSearch Service, and it becomes unavailable, your ability to detect, diagnose, and repair issues with your application is diminished. In these cases, you may suffer lost revenue, customer dissatisfaction, reduced productivity, or even damage to your organization’s reputation.

OpenSearch Service offers an SLA of three 9s (99.9%) availability when following best practices. However, following those practices is complicated, and can require knowledge of and experience with OpenSearch’s data deployment and management, along with an understanding of how OpenSearch Service interacts with AWS Availability Zones and networking, distributed systems, OpenSearch’s self-healing capabilities, and its recovery methods. Furthermore, when an issue arises, such as a node becoming unresponsive, OpenSearch Service recovers by recreating the missing shards (data), causing a potentially large movement of data in the domain. This data movement increases resource usage on the cluster, which can impact performance. If the cluster is not sized properly, it can experience degraded availability, which defeats the purpose of provisioning the cluster across three Availability Zones.

Today, AWS is announcing the new deployment option Multi-AZ with Standby for OpenSearch Service, which helps you offload some of that heavy lifting in terms of high frequency monitoring, fast failure detection, and quick recovery from failure, and keeps your domains available and performant even in the event of an infrastructure failure. With Multi-AZ with Standby, you get 99.99% availability with consistent performance for a domain.

In this post, we discuss the benefits of this new option and how to configure your OpenSearch cluster with Multi-AZ with Standby.

Solution overview

The OpenSearch Service team has incorporated years of experience running tens of thousands of domains for our customers into the Multi-AZ with Standby feature. When you adopt Multi-AZ with Standby, OpenSearch Service creates a cluster across three Availability Zones, with each Availability Zone containing a complete copy of data in the cluster. OpenSearch Service then puts one Availability Zone into standby mode, routing all queries to the other two Availability Zones. When it detects a hardware-related failure, OpenSearch Service promotes nodes from the standby pool to become active in less than a minute. When you use Multi-AZ with Standby, OpenSearch Service doesn’t need to redistribute or recreate data from missing nodes. As a result, cluster performance is unaffected, removing the risk of degraded availability.

Prerequisites

Multi-AZ with Standby requires the following prerequisites:

The domain needs to run on OpenSearch 1.3 or above
The domain is deployed across three Availability Zones
The domain has three (or a multiple of three) data nodes
You must use three dedicated cluster manager (master) nodes
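
As a quick illustration, the prerequisites above can be expressed as a small pre-flight check. This is a sketch under stated assumptions: standby_prerequisite_problems is a hypothetical helper, it takes the OpenSearch version as a plain "major.minor" string, and it does not call any AWS API.

```python
def standby_prerequisite_problems(
    engine_version: str, az_count: int, data_nodes: int, dedicated_managers: int
) -> list[str]:
    """Return a list of unmet Multi-AZ with Standby prerequisites (empty if none)."""
    problems = []
    major, minor = (int(part) for part in engine_version.split(".")[:2])
    if (major, minor) < (1, 3):
        problems.append("engine must be OpenSearch 1.3 or above")
    if az_count != 3:
        problems.append("domain must be deployed across three Availability Zones")
    if data_nodes < 3 or data_nodes % 3 != 0:
        problems.append("data node count must be three or a multiple of three")
    if dedicated_managers != 3:
        problems.append("domain must use three dedicated cluster manager nodes")
    return problems


# Example: a 2.5 domain with six data nodes passes; a four-node domain does not.
print(standby_prerequisite_problems("2.5", 3, 6, 3))
print(standby_prerequisite_problems("2.5", 3, 4, 3))
```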

Refer to Sizing Amazon OpenSearch Service domains for guidance on sizing your domain and dedicated cluster manager nodes.

Configure your OpenSearch cluster using Multi-AZ with Standby

You can use Multi-AZ with Standby when you create a new domain, or you can add it to an existing domain. If you’re creating a new domain using the AWS Management Console, you can create it with Multi-AZ with Standby by either selecting the new Easy create option or the traditional Standard create option. You can update existing domains to use Multi-AZ with Standby by editing their domain configuration.

The Easy create option, as the name suggests, makes creating a domain easier by defaulting to best practice choices for most of the configuration (the majority of which can be altered later). The domain will be set up for high availability from the start and deployed as Multi-AZ with Standby.

While choosing the data nodes, you should choose three (or a multiple of three) data nodes so that they are equally distributed across each of the Availability Zones. The Data nodes table on the OpenSearch Service console provides a visual representation of the data nodes, showing that one of the Availability Zones will be put on standby.

Similarly, while selecting the cluster manager (master) node, consider the number of data nodes, indexes, and shards that you plan to have before deciding the instance size.

After the domain is created, you can check its deployment type on the OpenSearch Service console under Cluster configuration, as shown in the following screenshot.

While creating an index, make sure that the number of copies (primary plus replicas) is a multiple of three. If you don’t specify the number of replicas, the service will default to two. This is important so that there is at least one copy of the data in each Availability Zone. We recommend using an index template or similar for logs workloads.
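
To make the arithmetic concrete: one primary plus two replicas gives three copies of each shard, one per Availability Zone. The following sketch builds the body of an index template that sets this explicitly. The helper names and the logs-* index pattern are hypothetical; the body follows the shape used by the _index_template API.

```python
def copies_per_shard(number_of_replicas: int) -> int:
    """Each shard has one primary copy plus its replicas."""
    return 1 + number_of_replicas


def logs_index_template(index_pattern: str, number_of_replicas: int = 2) -> dict:
    """Build an index template body whose copy count is a multiple of three."""
    if copies_per_shard(number_of_replicas) % 3 != 0:
        raise ValueError("primary + replicas must be a multiple of three")
    return {
        "index_patterns": [index_pattern],
        "template": {
            "settings": {"index": {"number_of_replicas": number_of_replicas}}
        },
    }


# Hypothetical pattern for a logs workload; two replicas means three copies.
template_body = logs_index_template("logs-*")
```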

OpenSearch Service distributes the nodes and data copies equally across the three Availability Zones. During normal operations, the standby nodes don’t receive any search requests. The two active Availability Zones respond to all the search requests. However, data is replicated to these standby nodes to ensure you have a full copy of the data in each Availability Zone at all times.

Response to infrastructure failure events

OpenSearch Service continuously monitors the domain for events like node failure, disk failure, or Availability Zone failure. In the event of an infrastructure failure like an Availability Zone failure, OpenSearch Service promotes the standby nodes to active while the impacted Availability Zone recovers. Impact (if any) is limited to the in-flight requests as traffic is weighted away from the impacted Availability Zone in less than a minute.

You can check the status of the domain, data node metrics for both active and standby, and Availability Zone rotation metrics on the Cluster health tab. The following screenshots show the cluster health and metrics for data nodes such as CPU utilization, JVM memory pressure, and storage.

The following screenshot of the AZ Rotation Metrics section (you can find this under the Cluster health tab) shows the read and write status of the Availability Zones. OpenSearch Service rotates the standby Availability Zone every 30 minutes to ensure the system is running and ready to respond to events. Availability Zones responding to traffic have a read value of 1, and the standby Availability Zone has a value of 0.

Considerations

Several improvements and guardrails have been made for this feature that offer higher availability and maintain performance. Some static limits have been applied that are specifically related to the number of shards per node, number of shards for a domain, and the size of a shard. OpenSearch Service also enables Auto-Tune by default. Multi-AZ with Standby restricts the storage to GP3- or SSD-backed instances for the most cost-effective and performant storage options. Additionally, we’re introducing an advanced traffic shaping mechanism that will detect rogue queries, which further enhances the reliability of the domain.

We recommend evaluating your domain infrastructure needs based on your workload to achieve high availability and performance.

Conclusion

Multi-AZ with Standby is now available on OpenSearch Service in all AWS Regions globally where OpenSearch Service is available, except US West (N. California) and AWS GovCloud (US-Gov-East, US-Gov-West). Try it out and send your feedback to AWS re:Post for Amazon OpenSearch Service or through your usual AWS support contacts.

Top strategies for high volume tracing with Amazon OpenSearch Ingestion

2023-04-27 Muthu Pitchaimani

Post Syndicated from Muthu Pitchaimani original https://aws.amazon.com/blogs/big-data/top-strategies-for-high-volume-tracing-with-amazon-opensearch-ingestion/

Amazon OpenSearch Ingestion is a serverless, auto-scaled, managed data collector that receives, transforms, and delivers data to Amazon OpenSearch Service domains or Amazon OpenSearch Serverless collections. OpenSearch Ingestion is powered by Data Prepper, an open-source, streaming ETL (extract, transform, and load) solution that’s part of the OpenSearch project. When you use OpenSearch Ingestion, you don’t need to maintain self-managed data pipelines to ingest logs, traces, metrics, and other data with OpenSearch Service. Amazon OpenSearch Ingestion responds to changing volumes of data, automatically scaling your ingest pipeline.

Distributed tracing is the leading way to locate, alert on, and remediate problems with your application and infrastructure. Distributed tracing is part of a broader observability solution, often combined with metrics and log data. OpenSearch Service gives you a native toolset to store and analyze large volumes of log, metric, and trace data. However, moving these large volumes of data is non-trivial to set up, monitor, and maintain.

In this post, we outline steps to set up a trace pipeline and strategies to deal with high volume tracing with Amazon OpenSearch Ingestion.

Solution overview

There is now a new option on the OpenSearch Service console called Pipelines under Ingestion in the navigation pane. We use this feature to create a trace pipeline.

You can also use the AWS Command Line Interface (AWS CLI), AWS CloudFormation, or AWS APIs to create a trace pipeline.

Prerequisites

Refer to Security in OpenSearch Ingestion to set up the permissions you need to create a pipeline and write to a pipeline, and the permissions the pipeline needs to write to a sink.

Create a trace pipeline

To create a trace pipeline, complete the following steps:

On the OpenSearch Service console, choose Pipelines under Ingestion in the navigation pane.
Choose Create pipeline.

Amazon OpenSearch Ingestion, powered by Data Prepper, uses pipelines as a mechanism to move the data from a source to a sink, with optional processors to mutate, route, sample, and detect anomalies for the data in the pipe. For more information, refer to Data Prepper. When you use Data Prepper, you build a YAML configuration file. When you use OpenSearch Ingestion, you upload your YAML configuration to the service. If you’re using the OpenSearch Service console, you can use one of the configuration blueprints that we provide. For distributed tracing, you will use an otel_trace_source and an OpenSearch Service domain as the sink.

On the Configuration blueprints menu, choose AWS-TraceAnalyticsPipeline.

Choosing this blueprint will create a sample pipeline with otel_trace_source, an OpenSearch sink, along with span-pipeline and service-map-pipeline.

Enter a name for this pipeline along with a minimum (1) and maximum (96) capacity value for Ingestion-OCUs.

Amazon OpenSearch Ingestion will scale automatically between these values to suit the volume of data you are ingesting.

Edit the configuration’s hosts, aws.sts_role_arn, and region fields of the OpenSearch Service sink.
Follow the rest of the steps to complete the trace pipeline creation.

Sample trace pipeline

The following code shows the components of a sample trace pipeline:

version: "2"
entry-pipeline: 
  source:
    otel_trace_source:
      path: "/${pipelineName}/v1/traces"
  processor:
    - trace_peer_forwarder:
  sink:
    - pipeline:
        name: "span-pipeline"
    - pipeline:
        name: "service-map-pipeline"
span-pipeline:
  source:
    pipeline:
      name: "entry-pipeline"
  processor:
    - otel_trace_raw:
  sink:
    - opensearch:
        hosts: [ "https://search-mydomain-1a2a3a4a5a6a7a8a9a0a9a8a7a.us-east-1.es.amazonaws.com" ]
        aws:
          sts_role_arn: "arn:aws:iam::123456789012:role/Example-Role"
          region: "us-east-1"
        index_type: "trace-analytics-raw"
service-map-pipeline:
  source:
    pipeline:
      name: "entry-pipeline"
  processor:
    - service_map_stateful:
  sink:
    - opensearch:
        hosts: [ "https://search-mydomain-1a2a3a4a5a6a7a8a9a0a9a8a7a.us-east-1.es.amazonaws.com" ]
        aws:
          sts_role_arn: "arn:aws:iam::123456789012:role/Example-Role"
          region: "us-east-1"
        index_type: "trace-analytics-service-map"

The sample trace pipeline has three sub-pipelines in its configuration. These are entry-pipeline, span-pipeline, and service-map-pipeline. The following diagram illustrates the workflow.

entry-pipeline specifies the source of data as otel_trace_source, which creates an HTTP listener for receiving OpenTelemetry traces at the ingestion URL for the pipeline. You use a trace_peer_forwarder processor to eliminate duplicate HTTP requests and forward the data to span-pipeline and service-map-pipeline. span-pipeline gets the raw trace data from entry-pipeline and uses the otel_trace_raw processor to complete trace group-related fields for the incoming span records. service-map-pipeline uses the service_map_stateful processor to have Data Prepper create the distributed service map for visualization in OpenSearch Dashboards. After the sample trace pipeline is created, it’s ready to receive OpenTelemetry traces at its ingestion URL!

Reduce your storage footprint and optimize for cost

The volume of traces collected from instrumenting a modern production enterprise application can reach tens or hundreds of terabytes very quickly, especially when you store every trace from every request. The problem of managing the storage footprint becomes important. In this section, we discuss strategies for reducing your storage footprint and optimizing for cost.

Use storage tiering

OpenSearch Service has three storage tiers: hot, UltraWarm, and cold. You use the hot tier to store frequently accessed data for quick reading and writing, the UltraWarm tier for infrequently used, read-only data backed by Amazon Simple Storage Service (Amazon S3) for lower cost, and the cold tier to maintain re-attachable data at near-Amazon S3 cost. By adjusting relative retention periods between these tiers, you can store a high volume of traces. For example, instead of storing 1 week’s worth of traces in the hot tier, you can store 2 days of traces in the hot tier and 15 days in the UltraWarm tier.
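
To see what that trade looks like in numbers, here is a small worked example. The 500 GB per day trace volume is a hypothetical figure chosen for illustration, and the calculation ignores replicas, index overhead, and compression; it only shows how retention days translate into data held per tier.

```python
def tier_storage_gb(daily_gb: float, hot_days: int, ultrawarm_days: int = 0) -> dict:
    """Raw data volume held in each tier for a given retention split."""
    return {"hot": daily_gb * hot_days, "ultrawarm": daily_gb * ultrawarm_days}


DAILY_GB = 500  # hypothetical daily trace volume

one_week_hot = tier_storage_gb(DAILY_GB, hot_days=7)
tiered = tier_storage_gb(DAILY_GB, hot_days=2, ultrawarm_days=15)

print(one_week_hot)  # {'hot': 3500, 'ultrawarm': 0}
print(tiered)        # {'hot': 1000, 'ultrawarm': 7500}
```

Under these assumptions, the tiered layout keeps 17 days of traces instead of 7 while holding less than a third of the data in the hot tier.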

Extract metrics without storing traces

You can also use Data Prepper’s aggregation process to extract metrics in the pipeline to avoid delivering all of your data to OpenSearch Service. For example, you may want to analyze request, error, and duration (RED) metrics of your traces to know the current state of your services. OpenSearch Ingestion can calculate these metrics in the pipeline, aggregating them and storing them in separate indexes for analysis, reducing the ingestion and storage footprint of your traces. The following pipeline configuration snippet shows how to use the aggregate processor to calculate a histogram of the duration metric:

...
  processor:
    - aggregate:
        identification_keys: ["serviceName", "traceId"]
        action:
          histogram:
            key: "durationInNanos"
            record_minmax: true
            units: "nanoseconds"
            buckets: [1000000000, 1500000000, 2000000000]
        group_duration: "20s"
  sink:
    - opensearch:
        hosts: ...
        aws_sts_role_arn: ...
        aws_region: ...
        aws_sigv4: true
        index: "red_metrics_from_traces"
  ...
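
To build intuition for what the histogram action produces, the following Python sketch performs a comparable aggregation in memory. It is a simplified illustration, not Data Prepper's implementation: the function name is hypothetical, and the handling of values that fall exactly on a bucket boundary (counted in the lower bucket here) is an assumption.

```python
from bisect import bisect_left
from collections import defaultdict

# Same bucket boundaries as the configuration above, in nanoseconds.
BUCKETS = [1_000_000_000, 1_500_000_000, 2_000_000_000]


def aggregate_histograms(spans, keys=("serviceName", "traceId"), buckets=BUCKETS):
    """Group spans by the identification keys and count durations per bucket."""
    groups = defaultdict(
        lambda: {"counts": [0] * (len(buckets) + 1), "min": None, "max": None}
    )
    for span in spans:
        group = groups[tuple(span[key] for key in keys)]
        duration = span["durationInNanos"]
        # Index 0 counts values up to the first boundary; the last index
        # counts values above the final boundary.
        group["counts"][bisect_left(buckets, duration)] += 1
        group["min"] = duration if group["min"] is None else min(group["min"], duration)
        group["max"] = duration if group["max"] is None else max(group["max"], duration)
    return dict(groups)
```

Only one small summary record per group needs to be stored, rather than every span, which is where the storage savings come from.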

Use sampling

When your application is running without issues, the proportion of error traces is just a small percentage of your overall trace volume. Storing all of the traces for successful requests increases the cost substantially, while offering low value. To reduce cost, you can sample your trace data, reducing the number of traces you store in OpenSearch Service. There are generally two techniques for sampling:

Head sampling – When you do head sampling, you ask OpenSearch Ingestion to make a sampling decision without looking at the whole trace. Head sampling is easy to configure and is efficient, but has a downside of possibly missing important traces.
Tail sampling – Tail sampling is where you analyze the entirety of the trace and then decide whether to sample the trace or not. This accurately captures all the needed traces at the cost of complexity in configuring and implementing.

The following configuration snippet shows an example of the percent_sampler, from the aggregate processor. In this example, you send only 25% of your traces to OpenSearch Service, based on head sampling:

  ...
  processor:
    - aggregate:
        identification_keys: ["serviceName"]
        action:
          percent_sampler:
            percent: 25
        group_duration: "30s"
  sink:
    - opensearch:
        hosts: ...
        aws_sts_role_arn: ...
        aws_region: ...
        aws_sigv4: true
        index: "sampled-traces"
  ...
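
The difference between the two sampling techniques can be sketched in a few lines of Python. This is a conceptual illustration only and does not reflect how percent_sampler is implemented: head_sample uses a hash of the trace ID (a common way to make a consistent keep-or-drop decision without seeing the whole trace), and tail_sample assumes a hypothetical statusCode field on each span, with 2 meaning error.

```python
import hashlib


def head_sample(trace_id: str, percent: int) -> bool:
    """Decide from the trace ID alone, before the rest of the trace is seen."""
    digest = hashlib.sha256(trace_id.encode("utf-8")).digest()
    # Map the ID to a stable number in [0, 100) and keep the lowest `percent`.
    return int.from_bytes(digest[:8], "big") % 100 < percent


def tail_sample(spans: list[dict]) -> bool:
    """Decide after the whole trace is available: keep any trace with an error."""
    return any(span.get("statusCode") == 2 for span in spans)
```

Head sampling needs no buffering, but it may drop a trace that later turns out to contain an error. Tail sampling never misses such a trace, at the cost of holding all spans of a trace until the decision can be made.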

Use conditional routing with sampling

Head sampling using the percent_sampler is simple and straightforward, but it is a blunt tool. A better way to sample would be to gather, for example, 10% of successful responses and 100% of failed responses or high-duration traces. To do this, use conditional routing. Routes define conditions that can be used within processors and sinks to direct the data flowing through different parts of the pipeline. For example, the following configuration snippet routes traces whose status code indicates a failure to trace-error-pipeline, where you forward 100% of the data. It routes traces whose duration is 1 second or more to trace-high-latency-metrics-pipeline, where you sample them at 80%. Other, normal traces go to trace-normal-pipeline, where they are sampled at only 20%.

  processor:
    - otel_trace_raw:
  route:
    - error_traces: "/traceGroupFields/statusCode == 2"
    - high_latency_traces: '/durationInNanos >= 1000000000'
    - normal_traces: '/traceGroupFields/statusCode != 2 and /durationInNanos < 1000000000'
  sink:
    - pipeline:
        name: "trace-error-pipeline"
        routes:
          - error_traces
    - pipeline: 
        name: "trace-high-latency-metrics-pipeline"
        routes: 
          - high_latency_traces
    - pipeline: 
        name: "trace-normal-pipeline"
        routes: 
          - normal_traces
  ...
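
To check which routes a given span would match, you can mirror the three conditions in plain Python. This sketch is only a way to reason about the conditions, not a reimplementation of Data Prepper's expression language; the field names follow the configuration above and the helper name is hypothetical.

```python
ONE_SECOND_NANOS = 1_000_000_000

# Each route name maps to a predicate equivalent to its condition above.
ROUTES = {
    "error_traces": lambda span: span["traceGroupFields"]["statusCode"] == 2,
    "high_latency_traces": lambda span: span["durationInNanos"] >= ONE_SECOND_NANOS,
    "normal_traces": lambda span: (
        span["traceGroupFields"]["statusCode"] != 2
        and span["durationInNanos"] < ONE_SECOND_NANOS
    ),
}


def matching_routes(span: dict) -> list[str]:
    """Return the names of every route whose condition the span satisfies."""
    return [name for name, condition in ROUTES.items() if condition(span)]


slow_error = {"traceGroupFields": {"statusCode": 2}, "durationInNanos": 2_500_000_000}
print(matching_routes(slow_error))  # ['error_traces', 'high_latency_traces']
```

Note that the first two conditions are not mutually exclusive: a slow, failed request satisfies both, so it would be a candidate for both downstream pipelines. If you want each trace to follow exactly one route, tighten the conditions accordingly.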

Conclusion

In this post, you learned how to configure an OpenSearch Ingestion pipeline and several strategies that help minimize cost while supporting a large-scale production system for distributed tracing. As a next step, refer to the Amazon OpenSearch Developer Guide to explore logs and metric pipelines that you can use to build a scalable observability solution for your enterprise applications.

Working with percolators in Amazon OpenSearch Service

2023-04-27 Arun Lakshmanan

Post Syndicated from Arun Lakshmanan original https://aws.amazon.com/blogs/big-data/working-with-percolators-in-amazon-opensearch-service/

Amazon OpenSearch Service is a managed service that makes it easy to secure, deploy, and operate OpenSearch and legacy Elasticsearch clusters at scale in the AWS Cloud. Amazon OpenSearch Service provisions all the resources for your cluster, launches it, and automatically detects and replaces failed nodes, reducing the overhead of self-managed infrastructures. The service makes it easy for you to perform interactive log analytics, real-time application monitoring, website searches, and more by offering the latest versions of OpenSearch, support for 19 versions of Elasticsearch (1.5 to 7.10 versions), and visualization capabilities powered by OpenSearch Dashboards and Kibana (1.5 to 7.10 versions). Amazon OpenSearch Service now offers a serverless deployment option (public preview) that makes it even easier to use OpenSearch in the AWS cloud.

A typical workflow for OpenSearch is to store documents (as JSON data) in an index, and execute searches (also JSON) to find those documents. Percolation reverses that. You store searches and query with documents. Let’s say I’m searching for a house in Chicago that costs < 500K. I could go to the website every day and run my query. A clever website would be able to store my requirements (a query) and notify me when something new (a document) comes up that matches my requirements. Percolation is an OpenSearch feature that enables the website to store these queries and run documents against them to find new matches.

In this post, we will explore how to use percolators to find matching homes from new listings.

Before getting into the details of percolators, let’s explore how search works. When you insert a document, OpenSearch maintains an internal data structure called the “inverted index” which speeds up the search.

Indexing and Searching:

Let’s take the above example of a real estate application with a simple schema: the type of house, the city, and the price.

First, let’s create an index with the following mappings:

PUT realestate
{
     "mappings": {
        "properties": {
           "house_type": { "type": "keyword"},
           "city": { "type": "keyword" },
           "price": { "type": "long" }
         }
    }
}

Let’s insert some documents into the index.

ID	House_type	City	Price
1	townhouse	Chicago	650000
2	house	Washington	420000
3	condo	Chicago	580000

POST realestate/_bulk 
{ "index" : { "_id": "1" } } 
{ "house_type" : "townhouse", "city" : "Chicago", "price": 650000 }
{ "index" : { "_id": "2" } }
{ "house_type" : "house", "city" : "Washington", "price": 420000 }
{ "index" : { "_id": "3"} }
{ "house_type" : "condo", "city" : "Chicago", "price": 580000 }

As we don’t have any townhouses listed in Chicago for less than 500K, the below query returns no results.

GET realestate/_search
{
  "query": {
    "bool": {
      "filter": [ 
        { "term": { "city": "Chicago" } },
        { "term": { "house_type": "townhouse" } },
        { "range": { "price": { "lte": 500000 } } }
      ]
    }
  }
}

If you’re curious to know how search works under the hood at high level, you can refer to this article.

Percolation:

If one of your customers wants to get notified when a townhouse in Chicago is available, and listed at less than $500,000, you can store their requirements as a query in the percolator index. When a new listing becomes available, you can run that listing against the percolator index with a _percolate query. The query will return all matches (each match is a single set of requirements from one user) for that new listing. You can then notify each user that a new listing is available that fits their requirements. This process is called percolation in OpenSearch.
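
The idea can be illustrated with a toy, in-memory version in Python. This sketch is not how OpenSearch implements percolation; it only shows the reversal of roles, with stored queries on one side and an incoming document on the other. It handles just term clauses and range clauses with lte or gte, and the function names are hypothetical.

```python
# One stored query per alert, keyed by a query ID.
STORED_QUERIES = {
    "chicago-house-alert-500k": [
        {"term": {"city": "Chicago"}},
        {"term": {"house_type": "townhouse"}},
        {"range": {"price": {"lte": 500000}}},
    ]
}


def clause_matches(clause: dict, doc: dict) -> bool:
    """Evaluate a single term or range clause against a document."""
    kind = "term" if "term" in clause else "range"
    field, condition = next(iter(clause[kind].items()))
    if field not in doc:
        return False
    if kind == "term":
        return doc[field] == condition
    low = condition.get("gte", float("-inf"))
    high = condition.get("lte", float("inf"))
    return low <= doc[field] <= high


def percolate(stored_queries: dict, doc: dict) -> list[str]:
    """Return the IDs of all stored queries that the document satisfies."""
    return [
        query_id
        for query_id, clauses in stored_queries.items()
        if all(clause_matches(clause, doc) for clause in clauses)
    ]
```

For example, percolating a Chicago townhouse listed at 350000 returns the stored alert, while a house in Dallas returns no matches.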

OpenSearch has a dedicated data type called “percolator” that allows you to store queries.

Let’s create a percolator index with the same mapping, with additional fields for query and optional metadata. Make sure you include all the necessary fields that are part of a stored query. In our case, along with the actual fields and query, we capture the customer ID (user.id) and priority (user.priority) to send notifications.

PUT realestate-percolator-queries
{
  "mappings": {
    "properties": {
      "user": {
         "properties": {
            "query": { "type": "percolator" },
            "id": { "type": "keyword" },
            "priority":{ "type": "keyword" }
         }
      },
      "house_type": {"type": "keyword"},
      "city": {"type": "keyword"},
      "price": {"type": "long"}
    }
  }
}

After creating the index, insert a query as follows:

POST realestate-percolator-queries/_doc/chicago-house-alert-500k
{
  "user" : {
     "id": "CUST101",
     "priority": "high",
     "query": {
        "bool": {
           "filter": [ 
                { "term": { "city": "Chicago" } },
                { "term": { "house_type": "townhouse" } },
                { "range": { "price": { "lte": 500000 } } }
            ]
        }
      }
   }
}

The percolation begins when a new document gets run against the stored queries.

{"city": "Chicago", "house_type": "townhouse", "price": 350000}
{"city": "Dallas", "house_type": "house", "price": 500000}

Run the percolate query with both documents. Only the first document (the Chicago townhouse) matches the stored query:

GET realestate-percolator-queries/_search
{
  "query": {
     "percolate": {
        "field": "user.query",
        "documents": [ 
           {"city": "Chicago", "house_type": "townhouse", "price": 350000 },
           {"city": "Dallas", "house_type": "house", "price": 500000}
        ]
      }
   }
}

The above query returns the matching stored queries, along with the metadata we stored with them (user.id and user.priority in our case):

{
    "took" : 11,
    "timed_out" : false,
    "_shards" : {
        "total" : 5,
        "successful" : 5,
        "skipped" : 0,
        "failed" : 0
     },
     "hits" : {
        "total" : {
           "value" : 1,
           "relation" : "eq"
         },
         "max_score" : 0.0,
         "hits" : [ 
         {
              "_index" : "realestate-percolator-queries",
              "_id" : "chicago-house-alert-500k",
              "_score" : 0.0,
              "_source" : {
                   "user" : {
                       "id" : "CUST101",
                       "priority" : "high",
                       "query" : {
                            "bool" : {
                                 "filter" : [ 
                                      { "term" : { "city" : "Chicago" } },
                                      { "term" : { "house_type" : "townhouse" } },
                                      { "range" : { "price" : { "lte" : 500000 } } }
                                 ]
                              }
                        }
                  }
            },
            "fields" : {
                "_percolator_document_slot" : [0]
            }
        }
     ]
   }
}
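
When you percolate several documents at once, the _percolator_document_slot field tells you which of the submitted documents (by position, starting at 0) each stored query matched. The following Python sketch groups the hits by slot so that you can decide whom to notify for each listing. The helper name is hypothetical, and the response is assumed to have the shape shown above.

```python
def matches_by_slot(response: dict) -> dict:
    """Map each submitted document's position to the (query ID, user ID) pairs it matched."""
    result = {}
    for hit in response["hits"]["hits"]:
        user = hit["_source"]["user"]
        for slot in hit["fields"]["_percolator_document_slot"]:
            result.setdefault(slot, []).append((hit["_id"], user["id"]))
    return result


# Trimmed-down version of the response shown above.
sample_response = {
    "hits": {
        "hits": [
            {
                "_id": "chicago-house-alert-500k",
                "_source": {"user": {"id": "CUST101", "priority": "high"}},
                "fields": {"_percolator_document_slot": [0]},
            }
        ]
    }
}

print(matches_by_slot(sample_response))
# {0: [('chicago-house-alert-500k', 'CUST101')]}
```

Here, slot 0 is the Chicago townhouse, so customer CUST101 should be notified about it; slot 1 (the Dallas house) has no entry because no stored query matched it.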

Percolation at scale

When you have a high volume of queries stored in the percolator index, evaluating every stored query for each document might be inefficient. Consider segmenting your queries and using those segments as filters to handle the high volume effectively. Because we already capture priority, you can run percolation with a filter on priority, which reduces the number of stored queries that have to be evaluated.

GET realestate-percolator-queries/_search
{
    "query": {
        "bool": {
            "must": [ 
             {
                  "percolate": {
                      "field": "user.query",
                      "documents": [ 
                          { "city": "Chicago", "house_type": "townhouse", "price": 350000 },
                          { "city": "Dallas", "house_type": "house", "price": 500000 }
                       ]
                  }
              }
          ],
          "filter": [ 
                  { "term": { "user.priority": "high" } }
            ]
       }
    }
}

Best practices

Keep the percolation index separate from the document index. You can then tune index settings, such as the number of shards on the percolation index, independently for performance.
Use query filters to reduce the number of stored queries that each percolation has to evaluate.
Consider using a buffer in your ingestion pipeline for the following reasons:
1. You can batch the ingestion and percolation independently to suit your workload and SLA.
2. You can prioritize the ingest and search traffic by running the percolation at off hours. Make sure that you have enough storage in the buffering layer.

Consider using an independent cluster for percolation for the following reasons:
1. The percolation process relies on memory and compute, so running it on its own cluster keeps it from impacting your primary search.
2. You have the flexibility of scaling the clusters independently.

Conclusion

In this post, we walked through how percolation in OpenSearch works and how to use it effectively at scale. Percolation works in both managed and serverless versions of OpenSearch. You can follow the best practices to analyze and arrange data in an index, which is important for snappy search performance.

If you have feedback about this post, submit your comments in the comments section.

About the author

Arun Lakshmanan is a Search Specialist with Amazon OpenSearch Service based out of Chicago, IL. He has over 20 years of experience working with enterprise customers and startups. He loves to travel and spend quality time with his family.

Configure SAML federation for Amazon OpenSearch Serverless with AWS IAM Identity Center

2023-04-18 Utkarsh Agarwal

Post Syndicated from Utkarsh Agarwal original https://aws.amazon.com/blogs/big-data/configure-saml-federation-for-amazon-opensearch-serverless-with-aws-iam-identity-center/

Amazon OpenSearch Serverless is a serverless option of Amazon OpenSearch Service that makes it easy for you to run large-scale search and analytics workloads without having to configure, manage, or scale OpenSearch clusters. It automatically provisions and scales the underlying resources to deliver fast data ingestion and query responses for even the most demanding and unpredictable workloads. With OpenSearch Serverless, you can configure SAML to enable users to access data through OpenSearch Dashboards using an external SAML identity provider (IdP).

AWS IAM Identity Center (Successor to AWS Single Sign-On) helps you securely create or connect your workforce identities and manage their access centrally across AWS accounts and applications, OpenSearch Dashboards being one of them.

In this post, we show you how to configure SAML authentication for OpenSearch Dashboards using IAM Identity Center as its IdP.

Solution overview

The following diagram illustrates how the solution allows users or groups to authenticate into OpenSearch Dashboards using single sign-on (SSO) with IAM Identity Center using its built-in directory as the identity source.

The workflow steps are as follows:

A user accesses the OpenSearch Dashboards URL in their browser and chooses the SAML provider.
OpenSearch Serverless redirects the login to the specified IdP.
The IdP provides a login form for the user to specify the credentials for authentication.
After the user is authenticated successfully, a SAML assertion is sent back to OpenSearch Serverless.

OpenSearch Serverless validates the SAML assertion, and the user logs in to OpenSearch Dashboards.

Prerequisites

To get started, you must have an active OpenSearch Serverless collection. Refer to Creating and managing Amazon OpenSearch Serverless collections to learn more about creating a collection. Furthermore, you must have the correct AWS Identity and Access Management (IAM) permissions for configuring SAML authentication along with relevant IAM permissions for configuring the data access policy.

IAM Identity Center should be enabled, and you should have the relevant IAM permissions to create an application in IAM Identity Center and create and manage users and groups.

Create and configure the application in IAM Identity Center

To set up your application in IAM Identity Center, complete the following steps:

On the IAM Identity Center dashboard, choose Applications in the navigation pane.
Choose Add application.
For Custom application, select Add custom SAML 2.0 application.
Choose Next.
Under Configure application, enter a name and description for the application.
Under IAM Identity Center metadata, choose Download under IAM Identity Center SAML metadata file.

We use this metadata file to create a SAML provider under OpenSearch Serverless. It contains the public certificate used to verify the signature of the IAM Identity Center SAML assertions.

Under Application properties, leave Application start URL and Relay state blank.
For Session duration, choose 1 hour (the default value).

Note that the session duration you configure in this step takes precedence over the OpenSearch Dashboards timeout setting specified in the configuration of the SAML provider details on the OpenSearch Serverless end.

Under Application metadata, select Manually type your metadata values.
For Application ACS URL, enter your URL using the format https://collection.<REGION>.aoss.amazonaws.com/_saml/acs. For example, we enter https://collection.us-east-1.aoss.amazonaws.com/_saml/acs for this post.
For Application SAML audience, enter your service provider in the format aws:opensearch:<aws account id>.
Choose Submit.

Now you modify the attribute settings. The attribute mappings you configure here become part of the SAML assertion that is sent to the application.

On the Actions menu, choose Edit attribute mappings.
Configure Subject to map to ${user:email}, with the format unspecified.

Using ${user:email} here ensures that the email address for the user in IAM Identity Center is passed in the <NameID> tag of the SAML response.

Choose Save changes.

Now we assign a user to the application.

Create a user in IAM Identity Center to use to log in to OpenSearch Dashboards.

Alternatively, you can use an existing user.

On the IAM Identity Center console, navigate to your application and choose Assign Users and select the user(s) you would like to assign.

You have now created a custom SAML application. Next, you will configure the SAML provider in OpenSearch Serverless.

Create a SAML provider

The SAML provider you create in this step can be assigned to any collection in the same Region. Complete the following steps:

On the OpenSearch Service console, under Serverless in the navigation pane, choose SAML authentication under Security.
Choose Create SAML provider.
Enter a name and description for your SAML provider.
Enter the metadata from your IdP that you downloaded earlier.
Under Additional settings, you can optionally add custom user ID and group attributes. We leave these settings blank for now.
Choose Create a SAML provider.

You have now configured a SAML provider for OpenSearch Serverless. Next, we walk you through configuring the data access policy for accessing collections.

Create the data access policy

In this section, you set up data access policies for OpenSearch Serverless and allow access to the users. Complete the following steps:

On the OpenSearch Service console, under Serverless in the navigation pane, choose Data access policies under Security.
Choose Create access policy.
Enter a name and description for your access policy.
For Policy definition method, select Visual Editor.
In the Rules section, enter a rule name.
Under Select principals, for Add principals, choose SAML users and groups.
For SAML provider name, choose the SAML provider you created earlier.
Specify the user in the format user/<email> (for example, user/[email protected]).

The value of the email address should match the email address in IAM Identity Center.

Choose Save.
Choose Grant and specify the permissions.

You can configure what access you want to provide for the specific user at the collection level and specific indexes at the index pattern level.

You should select the access the user needs based on the least privilege model. Refer to Supported policy permissions and Supported OpenSearch API operations and permissions to set up more granular access for your users.

Choose Save and configure any additional rules, if required.

You can now review and edit your configuration if needed.

Choose Create to create the data access policy.

Now you have the data access policy that will allow the users to perform the allowed actions on OpenSearch Dashboards.
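If you choose JSON instead of the visual editor as the policy definition method, the same rule can be written as a policy document. The following is a minimal sketch, not the exact policy used in this post: the account ID (123456789012), SAML provider name (my-saml-provider), collection name (my-collection), email address, and the read-only permissions are all placeholder assumptions to replace with your own values.

```json
[
  {
    "Description": "Read-only Dashboards access for one SAML user",
    "Rules": [
      {
        "ResourceType": "collection",
        "Resource": ["collection/my-collection"],
        "Permission": ["aoss:DescribeCollectionItems"]
      },
      {
        "ResourceType": "index",
        "Resource": ["index/my-collection/*"],
        "Permission": ["aoss:DescribeIndex", "aoss:ReadDocument"]
      }
    ],
    "Principal": [
      "saml/123456789012/my-saml-provider/user/jane.doe@example.com"
    ]
  }
]
```

The principal follows the pattern saml/<aws account id>/<SAML provider name>/user/<email>, so the last segment must match the email address that IAM Identity Center sends as the subject.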

Access OpenSearch Dashboards

To sign in to OpenSearch Dashboards, complete the following steps:

On the OpenSearch Service dashboard, under Serverless in the navigation pane, choose Dashboard.
Locate your dashboard and copy the OpenSearch Dashboards URL (in the format <collection-endpoint>/_dashboards).
Enter this URL into a new browser tab.
On the OpenSearch login page, choose your IdP and specify your SSO credentials.
Choose Login.

Configure SAML authentication using groups in IAM Identity Center

Groups can help you organize your users and permissions in a coherent way. With groups, you can add multiple users from the IdP, and then use groupid as the identifier in the data access policy. For more information, refer to Add groups and Add users to groups.

To configure group access to OpenSearch Dashboards, complete the following steps:

On the IAM Identity Center console, navigate to your application.
In the Attribute mappings section, add an additional attribute named group and map it to ${user:groups}, with the format unspecified.
Choose Save changes.
For the SAML provider in OpenSearch Serverless, under Additional settings, for Group attribute, enter group.
For the data access policy, create a new rule or add an additional principal in the previous rule.
Choose the SAML provider name and enter group/<GroupId>.

You can fetch the value for the group ID by navigating to the Group section on the IAM Identity Center console.

Clean up

If you don’t want to continue using the solution, be sure to delete the resources you created:

On the IAM Identity Center console, remove the application.
On the OpenSearch Service console, delete the SAML provider and the data access policy you created.

Conclusion

In this post, you learned how to set up IAM Identity Center as an IdP to access OpenSearch Dashboards using SAML as SSO. You also learned how to set up users and groups within IAM Identity Center and control the access of users and groups for OpenSearch Dashboards. For more details, refer to SAML authentication for Amazon OpenSearch Serverless.

Stay tuned for a series of posts focusing on the various options available for you to build effective log analytics and search solutions using OpenSearch Serverless. You can also refer to the Getting started with Amazon OpenSearch Serverless workshop to learn more about OpenSearch Serverless.

If you have feedback about this post, submit it in the comments section. If you have questions about this post, start a new thread on the OpenSearch Service forum or contact AWS Support.

About the Authors

Utkarsh Agarwal is a Cloud Support Engineer in the Support Engineering team at Amazon Web Services. He specializes in Amazon OpenSearch Service. He provides guidance and technical assistance to customers, thus enabling them to build scalable, highly available, and secure solutions in AWS Cloud. In his free time, he enjoys watching movies, TV series, and of course cricket! Lately, he is also attempting to master the art of cooking in his free time – the taste buds are excited, but the kitchen might disagree.

Ravi Bhatane is a software engineer with Amazon OpenSearch Serverless Service. He is passionate about security, distributed systems, and building scalable services. When he’s not coding, Ravi enjoys photography and exploring new hiking trails with his friends.

Patterns for updating Amazon OpenSearch Service index settings and mappings

2023-04-06 Mikhail Vaynshteyn

Post Syndicated from Mikhail Vaynshteyn original https://aws.amazon.com/blogs/big-data/patterns-for-updating-amazon-opensearch-service-index-settings-and-mappings/

Amazon OpenSearch Service is used for a broad set of use cases like real-time application monitoring, log analytics, and website search at scale. As your domain ages and you add additional consumers, you need to reevaluate and change the domain’s configuration to handle additional storage and compute needs. You want to minimize downtime and performance impact as you make these changes.

Customers have been seeking guidance on best practices and patterns for changing index settings without an index maintenance window or affecting overall performance of the OpenSearch Service domain. This is part one of a two-part series, in which we show how to make settings changes to OpenSearch Service indexes with little to no downtime while supporting active producers and consumers of the data.

Indexes in OpenSearch Service

In OpenSearch Service, data must be indexed before it can be queried. Indexing is the method by which search engines organize data for fast retrieval. The resulting structure is called, fittingly, an index. All operations performed on an index are done via index APIs. Also, each index contains index mappings, which define field names and data types in the index. Data producers can add new fields with data types to an index, but the mapping of an existing field can’t be changed throughout the index lifecycle.

OpenSearch Service indexes have two types of settings that periodically need adjustments as the profile of your workload changes:

Dynamic – Settings that can be changed on the index at any time
Static – Settings that can only be defined at the index creation time and can’t be changed throughout the index lifecycle

Dynamic index settings can be changed at any time using the update settings API. The index doesn’t require downtime while the OpenSearch Service domain applies changes to dynamic index settings. Changes to most dynamic index settings won’t trigger background tasks that affect the overall utilization of domain resources. However, depending on the domain’s configuration, some changes, such as increasing the number of replicas via index.number_of_replicas or index.auto_expand_replicas, can cause a temporary increase in resource utilization while the domain adds replicas. We recommend maintaining at least one replica for redundancy reasons, and multiple replicas for high query throughput use cases.
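As a minimal sketch of the update settings API, the following request changes two dynamic settings on an existing index. The index name my_index and the values shown are placeholders for illustration, not recommendations:

```json
PUT /my_index/_settings
{
  "index": {
    "number_of_replicas": 2,
    "refresh_interval": "30s"
  }
}
```

The request returns as soon as the setting is accepted. Allocating the additional replica shards then happens in the background, which is where the temporary increase in resource utilization comes from.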

Static index settings such as mapping and shard count are defined at index creation time and can’t be changed throughout the index lifecycle. In this post, we focus on patterns and best practices for working with static index settings, such as changing shard count and patterns for updating index mappings.

All operations and procedures that we cover in this post are issued directly to the OpenSearch REST API or via the Dev Tools in OpenSearch Dashboards.

As with any use case, there is a spectrum of solutions and constraints to be considered. We start with a few simple foundational patterns and build on them, accounting for use cases with more operational constraints and working with large datasets.

Solution overview

OpenSearch Service has a default sharding strategy of 5:1, where each index is divided into five primary shards. Within each index, each primary shard also has its own replica. OpenSearch Service automatically assigns primary shards and replica shards to separate data nodes.

It’s not possible to increase the primary shard number of an existing index, meaning an index must be recreated if you want to increase the primary shard count.

The _reindex operation is ideal for creating destination indexes with updated shards and mapping settings. The _reindex operation is resource intensive. We recommend disabling replicas in your destination index by setting number_of_replicas to 0 and re-enabling replicas when the reindex process is complete. If you have your data in a second, durable store, the simplest thing to do is pause updates and reindex from the source. But that’s not always possible. In this post, we share several patterns that enable you to update even static index settings like shard count.

One of the major advantages of using the _reindex operation is that it doesn’t require placing the source index in a read-only mode (data producers may continue to write the data while reindexing is in progress). Also, the _reindex operation enables reprocessing, transformation, and reindexing a subset of documents and even selectively combining documents from multiple indexes. With the _reindex operation, you can copy all or a subset of documents that you select through a query to another index. In its most basic form, the _reindex operation requires you to specify a source and a destination index and configuration parameters.

The following are some of the use cases supported by the reindex API:

Reindexing all documents
Reindexing from a remote cluster when transferring data between clusters
Reindexing a subset of documents that match a search query
Combining one or more indexes
Transforming documents during reindexing
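For example, two of these use cases can be combined in a single request. The following sketch copies only the documents whose ratings value is at least 4 from two source indexes into one destination index. The second index name (source_index_name_2) and the threshold are hypothetical, and ratings is the sample field used in the walkthrough in this post:

```json
POST _reindex?wait_for_completion=false
{
  "source": {
    "index": ["source_index_name", "source_index_name_2"],
    "query": {
      "range": {
        "ratings": { "gte": 4 }
      }
    }
  },
  "dest": {
    "index": "destination_index_name"
  }
}
```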

To increase the shard count, you create a new index, set number_of_shards to your desired primary shard count, set number_of_replicas to 0, update the new index mapping based on your requirement, and run the reindex API operation. After the _reindex operation is complete, we recommend updating number_of_replicas in the destination index settings to achieve your desired level of replica shards.

In the following sections, we provide a walkthrough of the reindex API operation. Note that the patterns and procedures presented in this post have been validated on Amazon OpenSearch Service version 1.3.

Prerequisites

The source of the documents must be stored in the index (the "_source" setting at the index mappings level must be set to "enabled": true, which is the default). The _reindex operation can’t be used without source documents.

Create the destination index with your desired mapping (field or data type). For demonstration purposes, our source index has a field ratings defined as long, and we want the same field to use the float data type in the destination index:

GET /source_index_name/_mapping
{
  "source_index_name": {
    "mappings" : {
      "properties" : {
        "ratings" : {
          "type" : "long"
        },
…
      }
    }
  }
}

PUT /destination_index_name
{
  "settings": {
    "index": {
      "number_of_shards": <DESIRED_NUMBER_OF_PRIMARY_SHARDS>,
      "number_of_replicas": 0
    }
  },
  "mappings": {
    "properties" : {
      "ratings" : {
          "type" : "float"
        },
…
    }
  }
}

Ensure that you have sufficient disk space on each hot tier data node to house the new index primary shards and, depending on your configuration, replica shards. If disk space is insufficient, perform an update operation on the OpenSearch Service domain to add the required storage capacity. Depending on storage requirements, you may need to migrate the OpenSearch Service domain to a different instance type, because nodes have constraints on the EBS volume size that can be mounted to each instance type. Issue the following operation to validate available disk space:

GET _cat/allocation?v

The following screenshot shows the output.

Check the disk.avail metric for hot storage tier nodes to validate your available disk space.
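To estimate how much space the new index needs, it can help to check the size of the source index first. The following is a minimal sketch using the _cat indices API, with the column list chosen for illustration:

```json
GET _cat/indices/source_index_name?v&h=index,pri,rep,docs.count,pri.store.size,store.size
```

In the output, pri.store.size is the size of the primary shards only, and store.size includes replicas. Because the destination index is created with number_of_replicas set to 0, pri.store.size is the closer estimate for the reindex itself, and the replica space is needed once replicas are re-enabled.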

Use the reindex API operation

The _reindex operation snapshots the index at the beginning of its run and performs processing on a snapshot to minimize impact on the source index. The source index can still be used for querying and processing the data. Although the _reindex operation can run both synchronously and asynchronously, we recommend using an asynchronous run. You can monitor the progress of the _reindex operation, cancel its run, or throttle its run using the _tasks, _cancel, and _rethrottle operations, respectively.

Because the _reindex operation doesn’t require the source index to be placed in a read-only mode, query and index update operations are free to continue.

Use the reindex API with the following command:

POST _reindex?wait_for_completion=false
{
  "source": {
    "index": "source_index_name"
  },
  "dest": {
    "index": "destination_index_name",
    "op_type": "index"
  }
}

You can supplement the source section of the _reindex API operation with a query to reindex only a subset of documents into the destination index. You can monitor the progress of the reindexing operation via the tasks API operation:

GET _tasks

Note that the _reindex operation can be throttled via a _rethrottle API or settings passed as a parameter. You can cancel the task with the _cancel operation:

POST _tasks/TASK_ID/_cancel
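On a busy domain, GET _tasks returns every running task, so it can help to narrow the output. The following sketch filters the task list to reindex tasks, fetches a single task, and then changes the throttle on a running reindex. TASK_ID stands for the value returned in the task field of the asynchronous _reindex response, and the rate of 500 requests per second is an arbitrary example:

```json
GET _tasks?detailed=true&actions=*reindex

GET _tasks/TASK_ID

POST _reindex/TASK_ID/_rethrottle?requests_per_second=500
```

Setting requests_per_second to -1 removes the throttle. The same parameter can also be passed on the initial POST _reindex request to start the operation already throttled.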

The following screenshot shows the output of the _reindex operation for reindexing from source_index_name to destination_index_name.

When the operation is complete, both consumers and producers of the source indexes or aliases need to re-point to the destination index, and the same _reindex operation needs to run again to catch up on any create, update, or delete operations performed on the source indexes while the initial _reindex operation was running. This step is required because the _reindex operation is running on a snapshot of the index. At this time, the _reindex operation needs to run with "op_type": "create" to realign missing and out-of-version documents. See the following API command:

POST _reindex?wait_for_completion=false
{
  "conflicts": "proceed",
  "source": {
    "index": "source_index_name"
  },
  "dest": {
    "index": "destination_index_name",
    "op_type": "create"
  }
}

After the operation is complete and data integrity in the destination index is confirmed, you can delete the source index to reclaim disk space.
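One coarse way to check the result before deleting anything is to compare document counts. This is only a sketch of a sanity check, not a full integrity validation, and the counts are expected to differ if you reindexed a subset of documents:

```json
GET _cat/count/source_index_name?v

GET _cat/count/destination_index_name?v

DELETE /source_index_name
```

The counts reflect only documents that are already searchable, so recently written documents may not appear until the next refresh. Run the DELETE request only after you are satisfied with the comparison, because deleting an index can’t be undone without a snapshot.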

Increase index shard count using the split index API

The split index API and shrink index API cover a large array of use cases and use few resources in the domain. However, these APIs require closing the index for write operations and don’t address use cases that require changes to the mapping settings.

In OpenSearch Service, the number_of_shards index setting is immutable and defined at the time when the index is created. However, although this setting is immutable, there are several patterns to increase or decrease index shard count without needing to explicitly reindex the data. You can alternatively use the split index API to increase index shard count in the environments that can suspend write operations. The split index API provides a simplified way of creating a new index with a different shard setting and without reindexing your data. The split index API operation creates a new index based off of a read-only index with a desired number of primary shards.
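The following is a minimal sketch of a split, assuming the source index has five primary shards and that writes can be suspended. The target shard count of 10 is an example value, and the target count must be a multiple of the source count:

```json
PUT /source_index_name/_settings
{
  "index.blocks.write": true
}

POST /source_index_name/_split/destination_index_name
{
  "settings": {
    "index": {
      "number_of_shards": 10
    }
  }
}
```

The first request blocks writes on the source index, which the split requires. The new index may carry over the write block from the source. If it does, setting index.blocks.write to null on the new index with the update settings API clears it.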

In OpenSearch Service, an index alias is a virtual index name that can point to one or more indexes. Referencing indexes using aliases in your applications allows you to avoid index name changes. Index aliases are used to point consumers and producers to a new index after the split index API operation is complete.
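As a sketch of that switch, the following request moves an alias from the old index to the new one. The alias name my_alias is hypothetical, and the request assumes your consumers and producers already address the index through that alias:

```json
POST _aliases
{
  "actions": [
    { "remove": { "index": "source_index_name", "alias": "my_alias" } },
    { "add": { "index": "destination_index_name", "alias": "my_alias" } }
  ]
}
```

Both actions are applied atomically, so there is no moment at which the alias points to neither index.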

Although the majority of use cases focus on increasing the number of shards on an existing index due to data growth, there are also instances where you need to reduce the number of shards on an existing index. Such cases occasionally happen when the actual index size is less than what was anticipated when the index was created, and you want to align with a shard strategy for operational best practices for OpenSearch Service. In cases where you need to reduce the number of shards on an index, you can use the shrink index API to achieve this task.

Conclusion

In this post, we reviewed best practices when reindexing data for making changes in OpenSearch Service static index settings and mappings that require little or no index downtime. We also covered use of the split index and shrink index APIs for changing the primary index shard count for use cases where the index can be placed in a read-only state.

In our next post, we’ll explore patterns for remote indexing to alleviate load and resource utilization on the source OpenSearch Service domain.

About the Authors

Mikhail Vaynshteyn is a Solutions Architect with Amazon Web Services. Mikhail works with healthcare and life sciences customers to build solutions that help improve patients’ outcomes. Mikhail specializes in data analytics services.

Sukhomoy Basak is a Solutions Architect at Amazon Web Services, with a passion for data and analytics solutions. Sukhomoy works with enterprise customers to help them architect, build, and scale applications to achieve their business outcomes.

Perform accent-insensitive search using OpenSearch

2023-03-30 Aruna Govindaraju

Post Syndicated from Aruna Govindaraju original https://aws.amazon.com/blogs/big-data/perform-accent-insensitive-search-using-opensearch/

We often need our text search to be agnostic of accent marks. Accent-insensitive search, also called diacritics-agnostic search, is where search results are the same for queries with or without accented Latin characters such as à, è, Ê, ñ, and ç. Diacritics are marks added to letters to indicate a difference in pronunciation. In recent years, words with diacritics have trickled into the mainstream English language, such as café or protégé. Well, touché! OpenSearch has the answer!

OpenSearch is a scalable, flexible, and extensible open-source software suite for your search workload. OpenSearch can be deployed in three different modes: the self-managed open-source OpenSearch, the managed Amazon OpenSearch Service, and Amazon OpenSearch Serverless. All three deployment modes are powered by Apache Lucene, and offer text analytics using the Lucene analyzers.

In this post, we demonstrate how to perform accent-insensitive search using OpenSearch to handle diacritics.

Solution overview

Lucene Analyzers are Java libraries that are used to analyze text while indexing and searching documents. These analyzers consist of tokenizers and filters. The tokenizers split the incoming text into one or more tokens, and the filters are used to transform the tokens by modifying or removing the unnecessary characters.

OpenSearch supports custom analyzers, which enable you to configure different combinations of tokenizers and filters. A custom analyzer can consist of character filters, a tokenizer, and token filters. In order to enable our diacritic-insensitive search, we configure custom analyzers that use the ASCII folding token filter.

ASCIIFolding is a method used to convert alphabetic, numeric, and symbolic Unicode characters that aren’t in the first 127 ASCII characters (the Basic Latin Unicode block) into their ASCII equivalents, if they exist. For example, the filter changes “à” to “a”. This allows search engines to return results agnostic of the accent.

In this post, we configure accent-insensitive search using the ASCIIFolding filter supported in OpenSearch Service. We ingest a set of European movie names with diacritics and verify search results with and without the diacritics.

Create an index with a custom analyzer

We first create the index asciifold_movies with custom analyzer custom_asciifolding:

PUT /asciifold_movies
{
  "settings": {
    "analysis": {
      "analyzer": {
        "custom_asciifolding": {
          "tokenizer": "standard",
          "filter": [
            "my_ascii_folding"
          ]
        }
      },
      "filter": {
        "my_ascii_folding": {
          "type": "asciifolding",
          "preserve_original": true
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "title": {
        "type": "text",
        "analyzer": "custom_asciifolding",
        "fields": {
          "keyword": {
            "type": "keyword",
            "ignore_above": 256
          }
        }
      }
    }
  }
}
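To see what the analyzer emits before ingesting any data, you can call the analyze API against the new index. This is a minimal sketch, and the sample text is one of the titles ingested in the next step:

```json
GET /asciifold_movies/_analyze
{
  "analyzer": "custom_asciifolding",
  "text": "Jour de fête"
}
```

Because preserve_original is set to true, the response should list both fete and fête for the last word, which is why queries with and without the accent can match the same document.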

Ingest sample data

Next, we ingest sample data with Latin characters into the index asciifold_movies:

POST _bulk
{ "index" : { "_index" : "asciifold_movies", "_id":"1"} }
{  "title" : "Jour de fête"}
{ "index" : { "_index" : "asciifold_movies", "_id":"2"} }
{  "title" : "La gloire de mon père" }
{ "index" : { "_index" : "asciifold_movies", "_id":"3"} }
{  "title" : "Le roi et l’oiseau" }
{ "index" : { "_index" : "asciifold_movies", "_id":"4"} }
{  "title" : "Être et avoir" }
{ "index" : { "_index" : "asciifold_movies", "_id":"5"} }
{  "title" : "Kirikou et la sorcière"}
{ "index" : { "_index" : "asciifold_movies", "_id":"6"} }
{  "title" : "Señora Acero"}
{ "index" : { "_index" : "asciifold_movies", "_id":"7"} }
{  "title" : "Señora garçon"}
{ "index" : { "_index" : "asciifold_movies", "_id":"8"} }
{  "title" : "Jour de fete"}

Query the index

Now we query the asciifold_movies index for words with and without Latin characters.

Our first query uses an accented character:

GET asciifold_movies/_search
{
  "query": {
    "match": {
      "title": "fête"
    }
  }
}

Our second query uses a spelling of the same word without the accent mark:

GET asciifold_movies/_search
{
  "query": {
    "match": {
      "title": "fete"
    }
  }
}

In the preceding queries, the search terms “fête” and “fete” return the same results:

{
  "took": 10,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": {
      "value": 2,
      "relation": "eq"
    },
    "max_score": 0.7361701,
    "hits": [
      {
        "_index": "asciifold_movies",
        "_id": "8",
        "_score": 0.7361701,
        "_source": {
          "title": "Jour de fete"
        }
      },
      {
        "_index": "asciifold_movies",
        "_id": "1",
        "_score": 0.42547938,
        "_source": {
          "title": "Jour de fête"
        }
      }
    ]
  }
}

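Similarly, try comparing results for “Señora” and “Senora” or “sorcière” and “sorciere.” Because this custom analyzer has no lowercase filter, the search terms must match the capitalization of the indexed titles. The accent-insensitive results are due to the ASCIIFolding filter used with the custom analyzers.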

Enable aggregations for fields with accents

Now that we have enabled accent-insensitive search, let’s look at how we can make aggregations work with accents.

Try the following query on the index:

GET asciifold_movies/_search
{
  "size": 0,
  "aggs": {
    "test": {
      "terms": {
        "field": "title.keyword"
      }
    }
  }
}

We get the following response:

"aggregations" : {
    "test" : {
      "doc_count_error_upper_bound" : 0,
      "sum_other_doc_count" : 0,
      "buckets" : [
        {
          "key" : "Jour de fete",
          "doc_count" : 1
        },
        {
          "key" : "Jour de fête",
          "doc_count" : 1
        },
        {
          "key" : "Kirikou et la sorcière",
          "doc_count" : 1
        },
        {
          "key" : "La gloire de mon père",
          "doc_count" : 1
        },
        {
          "key" : "Le roi et l’oiseau",
          "doc_count" : 1
        },
        {
          "key" : "Señora Acero",
          "doc_count" : 1
        },
        {
          "key" : "Señora garçon",
          "doc_count" : 1
        },
        {
          "key" : "Être et avoir",
          "doc_count" : 1
        }
      ]
    }
  }

Create accent-insensitive aggregations using a normalizer

In the previous example, the aggregation returns two different buckets, one for “Jour de fête” and one for “Jour de fete.” We can enable aggregations to create one bucket for the field, regardless of the diacritics. This is achieved using a normalizer.

A normalizer supports a subset of character and token filters and, unlike an analyzer, has no tokenizer, so it always emits a single token. It is a simple way to standardize Unicode text in a language-independent way, thereby standardizing different forms of the same character and allowing diacritic-agnostic aggregations.

Let’s modify the index mapping to include the normalizer. Delete the previous index, then create a new index with the following mapping and ingest the same dataset:

PUT /asciifold_movies
{
  "settings": {
    "analysis": {
      "analyzer": {
        "custom_asciifolding": {
          "tokenizer": "standard",
          "filter": [
            "my_ascii_folding"
          ]
        }
      },
      "filter": {
        "my_ascii_folding": {
          "type": "asciifolding",
          "preserve_original": true
        }
      },
      "normalizer": {
        "custom_normalizer": {
          "type": "custom",
          "filter": "asciifolding"
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "title": {
        "type": "text",
        "analyzer": "custom_asciifolding",
        "fields": {
          "keyword": {
            "type": "keyword",
            "ignore_above": 256,
            "normalizer": "custom_normalizer"
          }
        }
      }
    }
  }
}

After you ingest the same dataset, try the following query:

GET asciifold_movies/_search
{
  "size": 0,
  "aggs": {
    "test": {
      "terms": {
        "field": "title.keyword"
      }
    }
  }
}

We get the following results:

"aggregations" : {
    "test" : {
      "doc_count_error_upper_bound" : 0,
      "sum_other_doc_count" : 0,
      "buckets" : [
        {
          "key" : "Jour de fete",
          "doc_count" : 2
        },
        {
          "key" : "Etre et avoir",
          "doc_count" : 1
        },
        {
          "key" : "Kirikou et la sorciere",
          "doc_count" : 1
        },
        {
          "key" : "La gloire de mon pere",
          "doc_count" : 1
        },
        {
          "key" : "Le roi et l'oiseau",
          "doc_count" : 1
        },
        {
          "key" : "Senora Acero",
          "doc_count" : 1
        },
        {
          "key" : "Senora garcon",
          "doc_count" : 1
        }
      ]
    }
  }

Now we compare the results, and we can see the aggregations with term “Jour de fête” and “Jour de fete” are rolled up into one bucket with doc_count=2.
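The analyze API also accepts a normalizer, which gives a quick way to check how a keyword value is stored for aggregations. The following is a minimal sketch using one of the sample titles:

```json
GET /asciifold_movies/_analyze
{
  "normalizer": "custom_normalizer",
  "text": "Señora garçon"
}
```

The response should contain a single token with the accents folded, matching the Senora garcon bucket key in the aggregation results.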

Summary

In this post, we showed how to enable accent-insensitive search and aggregations by designing the index mapping to do ASCII folding for search tokens and normalize the keyword field for aggregations. You can use the OpenSearch query DSL to implement a range of search features, providing a flexible foundation for structured and unstructured search applications. The open-source OpenSearch community has also extended the product to enable support for natural language processing, machine learning algorithms, custom dictionaries, and a wide variety of other plugins.

If you have feedback about this post, submit it in the comments section. If you have questions about this post, start a new thread on the Amazon OpenSearch Service forum or contact AWS Support.

About the Author

Aruna Govindaraju is an Amazon OpenSearch Specialist Solutions Architect and has worked with many commercial and open-source search engines. She is passionate about search, relevancy, and user experience. Her expertise with correlating end-user signals with search engine behavior has helped many customers improve their search experience. Her favorite pastime is hiking the New England trails and mountains.

AWS Week in Review – March 20, 2023

2023-03-21 Danilo Poccia

Post Syndicated from Danilo Poccia original https://aws.amazon.com/blogs/aws/aws-week-in-review-march-20-2023/

This post is part of our Week in Review series. Check back each week for a quick roundup of interesting news and announcements from AWS!

A new week starts, and Spring is almost here! If you’re curious about AWS news from the previous seven days, I’ve got you covered.

Last Week’s Launches
Here are the launches that got my attention last week:

Amazon S3 – Last week there was AWS Pi Day 2023 celebrating 17 years of innovation since Amazon S3 was introduced on March 14, 2006. For the occasion, the team released many new capabilities:

S3 Object Lambda now provides aliases that are interchangeable with bucket names and can be used with Amazon CloudFront to tailor content for end users.
S3 now supports datasets that are replicated across multiple AWS accounts with cross-account support for S3 Multi-Region Access Points.
You can now create and configure replication rules to automatically replicate S3 objects from one AWS Outpost to another.
Amazon S3 has also simplified private connectivity from on-premises networks: with private DNS for S3, on-premises applications can use AWS PrivateLink to access S3 over an interface endpoint, while requests from your in-VPC applications access S3 using gateway endpoints.
We released Mountpoint for Amazon S3, a high performance open source file client. Read more in the blog. Note that Mountpoint isn’t a general-purpose networked file system, and comes with some restrictions on file operations.

Amazon Linux 2023 – Our new Linux-based operating system is now generally available. Sébastien’s post is full of tips and info.

Application Auto Scaling – You can now use arithmetic operations and mathematical functions to customize the metrics used with Target Tracking policies. You can use this feature to scale based on your own application-specific metrics. Read how it works with Amazon ECS services.

AWS Data Exchange for Amazon S3 is now generally available – You can now share and find data files directly from S3 buckets, without the need to create or manage copies of the data.

Amazon Neptune – Now offers a graph summary API to help understand important metadata about property graphs (PG) and resource description framework (RDF) graphs. Neptune added support for Slow Query Logs to help identify queries that need performance tuning.

Amazon OpenSearch Service – The team introduced security analytics that provides new threat monitoring, detection, and alerting features. The service now supports OpenSearch version 2.5 that adds several new features such as support for Point in Time Search and improvements to observability and geospatial functionality.

AWS Lake Formation and Apache Hive on Amazon EMR – Introduced fine-grained access controls that allow data administrators to define and enforce fine-grained table and column level security for customers accessing data via Apache Hive running on Amazon EMR.

Amazon EC2 M1 Mac Instances – You can now update guest environments to a specific or the latest macOS version without having to tear down and recreate the existing macOS environments.

AWS Chatbot – Now integrates with Microsoft Teams to simplify the way you troubleshoot and operate your AWS resources.

Amazon GuardDuty RDS Protection for Amazon Aurora – Now generally available to help profile and monitor access activity to Aurora databases in your AWS account without impacting database performance.

AWS Database Migration Service – Now supports validation to ensure that data is migrated accurately to S3 and can now generate an AWS Glue Data Catalog when migrating to S3.

AWS Backup – You can now back up and restore virtual machines running on VMware vSphere 8 and with multiple vNICs.

Amazon Kendra – There are new connectors to index documents and search for information across these new content sources: Confluence Server, Confluence Cloud, Microsoft SharePoint OnPrem, and Microsoft SharePoint Cloud. This post shows how to use the Amazon Kendra connector for Microsoft Teams.

For a full list of AWS announcements, be sure to keep an eye on the What’s New at AWS page.

Other AWS News
A few more blog posts you might have missed:

Women founders Q&A – We’re talking to six women founders and leaders about how they’re making impacts in their communities, industries, and beyond.

What you missed at the 2023 IMAGINE: Nonprofit conference – Where hundreds of nonprofit leaders, technologists, and innovators gathered to learn and share how AWS can drive a positive impact for people and the planet.

Monitoring load balancers using Amazon CloudWatch anomaly detection alarms – The metrics emitted by load balancers provide crucial and unique insight into service health, service performance, and end-to-end network performance.

Extend geospatial queries in Amazon Athena with user-defined functions (UDFs) and AWS Lambda – Using a solution based on Uber’s Hexagonal Hierarchical Spatial Index (H3) to divide the globe into equally-sized hexagons.

How cities can use transport data to reduce pollution and increase safety – A guest post by Rikesh Shah, outgoing head of open innovation at Transport for London.

For AWS open-source news and updates, here’s the latest newsletter curated by Ricardo to bring you the most recent updates on open-source projects, posts, events, and more.

Upcoming AWS Events
Here are some opportunities to meet:

AWS Public Sector Day 2023 (March 21, London, UK) – An event dedicated to helping public sector organizations use technology to achieve more with less through the current challenging conditions.

Women in Tech at Skills Center Arlington (March 23, VA, USA) – Let’s celebrate the history and legacy of women in tech.

The AWS Summits season is warming up! You can sign up here to know when registration opens in your area.

That’s all from me for this week. Come back next Monday for another Week in Review!

— Danilo

Role-based access control in Amazon OpenSearch Service via SAML integration with AWS IAM Identity Center

2023-03-14 Scott Chang

Post Syndicated from Scott Chang original https://aws.amazon.com/blogs/big-data/role-based-access-control-in-amazon-opensearch-service-via-saml-integration-with-aws-iam-identity-center/

Amazon OpenSearch Service is a managed service that makes it simple to secure, deploy, and operate OpenSearch clusters at scale in the AWS Cloud. AWS IAM Identity Center (successor to AWS Single Sign-On) helps you securely create or connect your workforce identities and manage their access centrally across AWS accounts and applications. To build a strong least-privilege security posture, customers also want fine-grained access control to manage dashboard permissions by user role. In this post, we demonstrate a step-by-step procedure to integrate IAM Identity Center with OpenSearch Service via native SAML integration, and configure role-based access control in OpenSearch Dashboards by using group attributes in IAM Identity Center. You can follow the steps in this post to achieve both authentication and authorization for OpenSearch Service based on the groups configured in IAM Identity Center.

Solution overview

Let’s review how to map users and groups in IAM Identity Center to OpenSearch Service security roles. Backend roles in OpenSearch Service are used to map external identities or attributes of workgroups to pre-defined OpenSearch Service security roles.

The following diagram shows the solution architecture. Create two groups, assign a user to each group and edit attribute mappings in IAM Identity Center. If you have integrated IAM Identity Center with your Identity Provider (IdP), you can use existing users and groups mapped to your IdP for this test. The solution uses two roles: all_access for administrators, and alerting_full_access for developers who are only allowed to manage OpenSearch Service alerts. You can set up backend role mapping in OpenSearch Dashboards by group ID. Based on the following diagram, you can map the role all_access to the group Admin, and alerting_full_access to Developer. User janedoe is in the group Admin, and user johnstiles is in the group Developer.
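The mapping described above can be sketched in a few lines of Python. This is only a conceptual model of how a group ID, used as a backend role, resolves to a security role. The group ID strings are hypothetical placeholders (real IAM Identity Center group IDs are opaque identifiers), and resolve_roles is an illustrative helper, not part of any OpenSearch API.

```python
from typing import Dict, List, Set

# Hypothetical group IDs standing in for the real IAM Identity Center values.
ADMIN_GROUP_ID = "admin-group-id-example"
DEVELOPER_GROUP_ID = "developer-group-id-example"

# Backend role (group ID) -> OpenSearch Service security role.
ROLE_MAPPING: Dict[str, str] = {
    ADMIN_GROUP_ID: "all_access",
    DEVELOPER_GROUP_ID: "alerting_full_access",
}

# Group membership as described in the solution overview.
USER_GROUPS: Dict[str, List[str]] = {
    "janedoe": [ADMIN_GROUP_ID],
    "johnstiles": [DEVELOPER_GROUP_ID],
}


def resolve_roles(username: str) -> Set[str]:
    """Return the security roles a user receives through group membership."""
    groups = USER_GROUPS.get(username, [])
    return {ROLE_MAPPING[group] for group in groups if group in ROLE_MAPPING}


print(resolve_roles("janedoe"))
print(resolve_roles("johnstiles"))
```

Because the mapping is keyed by group rather than by user, adding a new user to an existing group requires no change to the role mapping.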

Then you will log in as each user to verify the access control by looking at the different dashboard views.

Let’s get started!

Prerequisites

Complete the following prerequisite steps:

Have an AWS account.
Have an Amazon OpenSearch Service domain.
Enable IAM Identity Center in the same Region as the OpenSearch Service domain.
Test your users in IAM Identity Center (to create users, refer to Add users).

Enable SAML in Amazon OpenSearch Service and copy SAML parameters

To configure SAML in OpenSearch Service, complete the following steps:

On the OpenSearch Service console, choose Domains in the navigation pane.
Choose your domain.
On the Security configuration tab, confirm that Fine-grained access control is enabled.
On the Actions menu, choose Edit security configuration.
Select Enable SAML authentication.

You can also configure SAML during domain creation if you are creating a new OpenSearch domain. For more information, refer to SAML authentication for OpenSearch Dashboards.

Copy the values for Service provider entity ID and IdP-Initiated SSO URL.

Create a SAML application in IAM Identity Center

To create a SAML application in IAM Identity Center, complete the following steps:

On the IAM Identity Center console, choose Applications in the navigation pane.
Choose Add application.
Select Add custom SAML 2.0 application, then choose Next.
Enter your application name for Display name.
Under IAM Identity Center metadata, choose Download to download the SAML metadata file.
Under Application metadata, select Manually type your metadata values.
For Application ACS URL, enter the IdP-initiated URL you copied earlier.
For Application SAML audience, enter the service provider entity ID you copied earlier.
Choose Submit.
On the Actions menu, choose Edit attribute mappings.
Create attributes and map the following values:
1. Map Subject to ${user:email}, with the format emailAddress.
2. Map Role to ${user:groups}, with the format unspecified.
Choose Save changes.
On the IAM Identity Center console, choose Groups in the navigation pane.
Create two groups: Developer and Admin.
Assign user janedoe to the group Admin.
Assign user johnstiles to the group Developer.
Open the Admin group and copy the group ID.

Finish SAML configuration and map the SAML primary backend role

To complete your SAML configuration and map the SAML primary backend role, complete the following steps:

On the OpenSearch Service console, choose Domains in the navigation pane.
Open your domain and choose Edit security configuration.
Under SAML authentication for OpenSearch Dashboards/Kibana, for Import IdP metadata, choose Import from XML file.
Upload the IdP metadata downloaded from the IAM Identity Center metadata file.

The IdP entity ID will be auto populated.

Under SAML master backend role, enter the group ID of the Admin group you copied earlier.
For Roles key, enter Role for the SAML assertion.

This is because we defined and mapped Role to ${user:groups} as a SAML attribute in IAM Identity Center.

Choose Save changes.
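To make the role of the Roles key more concrete, the following Python sketch reads the Role attribute from a SAML assertion. The assertion is a hand-written, heavily simplified sample (not captured from IAM Identity Center), the group ID and email address are hypothetical, and extract_backend_roles is an illustrative helper rather than anything OpenSearch Service exposes.

```python
import xml.etree.ElementTree as ET

# Standard SAML 2.0 assertion namespace.
SAML_NS = {"saml": "urn:oasis:names:tc:SAML:2.0:assertion"}

# Simplified, hand-written sample assertion with a hypothetical group ID.
SAMPLE_ASSERTION = """\
<saml:Assertion xmlns:saml="urn:oasis:names:tc:SAML:2.0:assertion">
  <saml:Subject>
    <saml:NameID Format="urn:oasis:names:tc:SAML:1.1:nameid-format:emailAddress">janedoe@example.com</saml:NameID>
  </saml:Subject>
  <saml:AttributeStatement>
    <saml:Attribute Name="Role">
      <saml:AttributeValue>admin-group-id-example</saml:AttributeValue>
    </saml:Attribute>
  </saml:AttributeStatement>
</saml:Assertion>
"""


def extract_backend_roles(assertion_xml, roles_key="Role"):
    """Collect the values of the attribute whose Name equals the roles key."""
    root = ET.fromstring(assertion_xml)
    values = []
    for attribute in root.iterfind(".//saml:Attribute", SAML_NS):
        if attribute.get("Name") == roles_key:
            for value in attribute.iterfind("saml:AttributeValue", SAML_NS):
                values.append((value.text or "").strip())
    return values


print(extract_backend_roles(SAMPLE_ASSERTION))
```

If the Roles key did not match the attribute name sent by the identity provider, the list would be empty and no backend role could be matched.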

Configure backend role mapping for the Developer group

You have completely integrated IAM Identity Center with OpenSearch Service and mapped the Admin group as the primary role (all_access) in OpenSearch Service. Now you will log in to OpenSearch Dashboards as Admin and configure mapping for the Developer group.

There are two ways to log in to OpenSearch Dashboards:

OpenSearch Dashboards URL – On the OpenSearch Service console, navigate to your domain and choose the Dashboards URL under General Information. (For example, https://opensearch-domain-name-random-keys.us-west-2.es.amazonaws.com/_dashboards)
AWS access portal URL – On the IAM Identity Center console, choose Dashboard in the navigation pane and choose the access portal URL under Settings summary. (For example, https://d-1234567abc.awsapps.com/start)

Complete the following steps:

Log in as the user in the Admin group (janedoe).
Choose the tile for your OpenSearch Service application to be redirected to OpenSearch Dashboards.
Choose the menu icon, then choose Security, Roles.
Choose the alerting_full_access role and on the Mapped users tab, choose Manage mapping.
For Backend roles, enter the group ID of Developer.
Choose Map to apply the change.

Now you have successfully mapped the Developer group to the alerting_full_access role in OpenSearch Service.
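The same mapping can also be expressed as a request to the OpenSearch security plugin's role mapping REST API. The sketch below only builds the request and does not send it. The group ID is a hypothetical placeholder, build_role_mapping_request is an illustrative helper, and the endpoint path should be checked against the security plugin documentation for your OpenSearch version.

```python
import json


def build_role_mapping_request(role_name, backend_roles):
    """Return (method, path, body) for mapping backend roles to a security role."""
    path = f"_plugins/_security/api/rolesmapping/{role_name}"
    body = {"backend_roles": list(backend_roles)}
    return "PUT", path, json.dumps(body)


# Hypothetical Developer group ID copied from IAM Identity Center.
method, path, body = build_role_mapping_request(
    "alerting_full_access", ["developer-group-id-example"]
)
print(method, path)
print(body)
```

Note that a PUT of this kind is expected to replace the existing mapping for the role, so any backend roles, users, or hosts already mapped would need to be included in the body.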

Verify permissions

To verify permissions, complete the following steps:

Log out of the Admin account in OpenSearch Service and log in as a Developer user.
Choose the OpenSearch Service application tile to be redirected to OpenSearch Dashboards.

You can see that only alerting-related features are available on the drop-down menu. This Developer user can’t see Admin features such as Security.

Clean up

After you test the solution, remember to delete all of the resources you created to avoid incurring future charges:

Delete your Amazon OpenSearch Service domain.
Delete the SAML application, users, and groups in IAM Identity Center.

Conclusion

In this post, we walked through how to map roles in Amazon OpenSearch Service to groups in IAM Identity Center by using SAML attributes to achieve role-based access control for OpenSearch Dashboards. We connected IAM Identity Center users to OpenSearch Dashboards, and also mapped predefined OpenSearch Service security roles to IAM Identity Center groups based on group attributes. This makes it easier to manage permissions without updating the mapping when new users belonging to the same workgroup want to log in to OpenSearch Dashboards. You can follow the same procedure to provide fine-grained access to workgroups based on team functions or compliance requirements.

About the Authors

Scott Chang is a Solutions Architect at AWS based in San Francisco. He has over 14 years of hands-on experience in networking and is also familiar with security and site reliability engineering. He works with one of the major strategic customers in the West region to design highly scalable, innovative, and secure cloud solutions.

Muthu Pitchaimani is a Search Specialist with Amazon OpenSearch Service. He builds large-scale search applications and solutions. Muthu is interested in the topics of networking and security and is based out of Austin, Texas.

Identify and remediate security threats to your business using security analytics with Amazon OpenSearch Service

2023-03-14 Kevin Fallis

Post Syndicated from Kevin Fallis original https://aws.amazon.com/blogs/big-data/identify-and-remediate-security-threats-to-your-business-using-security-analytics-with-amazon-opensearch-service/

Recently, one of the largest wireless carriers in North America revealed that hackers compromised a database of its customer information through unauthorized use of an API and acquired the personal details of millions of individuals, including names, addresses, phone numbers, and account numbers. Once identified, the company halted the malicious activity. However, investigations indicated that the data breach likely occurred months prior to being detected.

With the ever-increasing volume of data that organizations store in the cloud, malicious threats to their business sensitive data and resources will continue to grow alongside their online activity. Adversaries, also known as attackers, continue to use a variety of techniques to breach an organization’s security and compromise their systems, which can cause significant financial losses or damage to reputation or have legal consequences for the affected organization. To mitigate the damage caused, it is critically important for organizations to protect themselves by implementing various security measures and deploying tools to detect and respond to security threats. By being proactive, organizations can significantly reduce the risk of being victimized by cyber adversaries.

Amazon OpenSearch Service is a fully managed and scalable log analytics framework that you can use to ingest, store, and visualize data. You can use OpenSearch Service for a diverse set of data workloads including healthcare data, financial transactions information, application performance data, observability data, and much more. This managed service is valued for its ingest performance, scalability, low query latency, and its ability to analyze large datasets.

Security analytics with OpenSearch Service

Today, OpenSearch Service announces OpenSearch-powered security analytics, which includes features to monitor, analyze, and respond to potential security threats to your critical infrastructure. In this post, we discuss these new features and how to identify and remediate security threats.

Security analytics provide real-time visibility into potential threats across your infrastructure, enabling you to respond to security incidents quickly, thereby reducing the impact of any security breaches. It can also help you meet regulatory compliance requirements and improve your overall security posture.

Security analytics with OpenSearch is designed to provide visibility into a company’s infrastructure, monitor for anomalous activity, help detect potential security threats in real time, and trigger alerts to preconfigured destinations. You can monitor for malicious activity from your security event logs by continuously evaluating out-of-the-box security rules, and review auto-generated security findings to aid your investigation. In addition, security analytics can generate automated alerts and send them to a preconfigured destination of your choice, such as Slack or email.

Security analytics is powered by the open-source OpenSearch project and deployed on OpenSearch Service with OpenSearch version 2.5 or higher. It includes the following key features:

Out-of-the-box support for over 2,200 open-source Sigma security rules
Support for log sources such as Windows, NetFlow, AWS CloudTrail, DNS, AD/LDAP, and more
Detectors that auto generate findings based on the Sigma rules
Automated alerts sent to preconfigured destinations
A rules editor to create new custom rules or modify existing rules
Visualizations to summarize findings and alerts trends

Sigma rules

Sigma is a generic signature format, expressed in YAML (YAML Ain’t Markup Language), that describes significant events in your logs in a simple and straightforward way. The format is portable across different SIEM implementations and fosters a community of threat hunters, so that you don’t have to reinvent the wheel if you change your SIEM implementation.

An example of a simple rule to detect the run of C:\Windows\System32\rundll32.exe, one of the most commonly used methods for launching malicious code on a Windows platform, could be the following YAML configuration:

title: Rundll32 execution
description: Detects a rundll32.exe execution
references:
    - https://attack.mitre.org/techniques/T1218/011/
author: Captain Threathunter
date: 2023/03/05
logsource:
    category: process_creation
    product: windows
detection:
    selection:
        winlog-event_data-ProcessName: 'C:\Windows\System32\rundll32.exe'
    condition: selection
level: high
status: test

After you import this rule into the security analytics rules repository and enable it with your detector, it auto generates a security finding when the preceding condition matches an incoming event log.
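To show what “the preceding condition matches” means in practice, here is a minimal Python sketch of evaluating the rule's selection against a log event. It is a simplified stand-in, not the security analytics engine: it handles only the single condition named selection, compares field values for case-insensitive equality, and uses hand-written sample events. The matches helper is hypothetical.

```python
# The detection part of the rule above, expressed as a Python dictionary.
RULE = {
    "title": "Rundll32 execution",
    "detection": {
        "selection": {
            "winlog-event_data-ProcessName": r"C:\Windows\System32\rundll32.exe",
        },
        "condition": "selection",
    },
    "level": "high",
}


def matches(rule, event):
    """Return True when every field in the selection equals the event's value."""
    detection = rule["detection"]
    if detection["condition"] != "selection":
        raise NotImplementedError("this sketch supports only 'condition: selection'")
    selection = detection["selection"]
    return all(
        str(event.get(field, "")).lower() == str(expected).lower()
        for field, expected in selection.items()
    )


# Hand-written sample events for illustration.
suspicious = {"winlog-event_data-ProcessName": r"c:\windows\system32\rundll32.exe"}
benign = {"winlog-event_data-ProcessName": r"C:\Windows\System32\notepad.exe"}

print(matches(RULE, suspicious))
print(matches(RULE, benign))
```

Only the first event satisfies the selection, so only it would produce a finding in this simplified model.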

Security analytics components and concepts

The security analytics offering includes a number of tools and features elemental to its operation. The major components that comprise the plugin are summarized in the following sections.

Log types

OpenSearch supports several types of logs and provides out-of-the-box mappings for each. The log type is specified during the creation of a detector and includes the ability to customize field mappings for that detector. For a log type selected in a detector, security analytics automatically enables a relevant set of rules that run at the configured interval.

Detectors

Detectors are core components that you configure to identify a range of cybersecurity threats for a log type, across your data indexes. Detectors use custom rules and pre-packaged Sigma rules to evaluate events occurring in the system, automatically generating security findings from these events.

Rules

Rules, or threat detection rules, define the conditions applied to ingested log data to identify a security event. Security analytics provides prepackaged, open-source Sigma rules to detect common threats from your logs. Security analytics also supports importing, creating, and customizing rules to meet your requirements. Many rules are also mapped to an ever-growing knowledge base of adversary tactics and techniques maintained by the MITRE ATT&CK organization. You can take advantage of these options using either OpenSearch Dashboards or the APIs.

Findings

Findings are generated every time a detector matches a rule with a log event. Findings don’t necessarily point to imminent threats within the system, but they isolate an event of interest. Because they represent the result of a specific matched condition in a detector rule, findings include a unique combination of select rules, a log type, and a rule severity.

Alerts

When defining a detector, you can specify one or more conditions that trigger an alert. When an event triggers an alert, the system sends a notification to a preferred channel, such as Slack or email. The alert can be triggered when the detector matches one or multiple rules. You can also create a notification message with a customized subject line and message body.
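The relationship between findings and alerts can be sketched as follows. This is a conceptual model only: the Finding and AlertTrigger classes, the tag value, and the channel name are hypothetical and do not mirror the plugin's internal data model. An empty severity or tag set is treated here as “match anything.”

```python
from dataclasses import dataclass, field
from typing import List, Set


@dataclass
class Finding:
    rule_name: str
    severity: str
    tags: Set[str] = field(default_factory=set)


@dataclass
class AlertTrigger:
    name: str
    severities: Set[str]
    tags: Set[str]
    channel: str

    def fires_for(self, finding: Finding) -> bool:
        """A trigger fires when the finding satisfies every configured criterion."""
        severity_ok = not self.severities or finding.severity in self.severities
        tags_ok = not self.tags or bool(self.tags & finding.tags)
        return severity_ok and tags_ok


def notifications(trigger: AlertTrigger, findings: List[Finding]) -> List[str]:
    """Build one notification message per finding that fires the trigger."""
    return [
        f"[{trigger.channel}] {trigger.name}: {finding.rule_name}"
        for finding in findings
        if trigger.fires_for(finding)
    ]


findings = [
    Finding("QuarksPwDumps Dump File", "critical", {"attack.credential_access"}),
    Finding("Rundll32 execution", "high"),
]
trigger = AlertTrigger("password-dump", {"critical"}, set(), "slack-security")
print(notifications(trigger, findings))
```

Both events produce findings, but only the critical one produces an alert, which reflects the point that findings isolate events of interest while alerts are reserved for the conditions you choose.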

Taking the tool for a test drive

With an understanding of these fundamental concepts, let’s navigate to the security analytics interface in OpenSearch Dashboards. Security analytics also provides a robust set of configuration APIs.

Overview page

After you log in to OpenSearch Dashboards and navigate to the security analytics overview page, you’re presented with the current state of the detectors you are monitoring. You can see a summary view composed of multiple visualizations. The following chart, for example, shows the findings and alerts trend for various log types over a given period of time.

As you scroll down on the summary page, you can review your most recent findings and alerts.

Additionally, you can see a distribution of the most frequently triggered rules across all the active detectors. This can help you detect and investigate different types of malicious activities across log types.

Finally, you can view the status of configured detectors. From this panel, you can also navigate to the create detector workflow.

Creating a detector

In the previous section, we reviewed the overview page. Now, let’s walk through the create detector workflow. One of the best things about security analytics is the set of prepackaged rules: you don’t have to write your own, and you can use them to get up and running quickly. In the following example, we show you how to create a detector with prepackaged rules for your Windows logs.

In the Dashboards navigation pane, under Security Analytics, choose Detectors.
Choose Create Detector to create a new detector.
1. First, give it a name and a data source to query. The data source can be a pattern or specific index.
2. When you select a Log type, all matching rules are automatically loaded and enabled by default. In this example, we select Windows logs to help narrow the set of rules applied to this detector. As an optional step, you can choose to selectively enable or disable one or more rules. See an example rules selection panel below.
3. Specify a schedule to run the rules and select Next.
4. Configure any necessary field mappings per your rule.
  You have two field mapping sections to optionally review. Default mapped fields provide pre-configured field mappings for the specific log type and enabled rules; you can skip this section unless you need to change the mappings. Additional mappings can be configured in the Pending field mappings section.
Configure the alerts.
The final step of setting up a detector is to configure the alerts and review your configuration. Note that each detector can generate multiple findings or alerts, and you have the option to customize the alert destination based on rule match criteria such as severity and tags. In this example, we show you how to match a single rule that monitors a password dump to a host file system (QuarksPwDumps Dump File) and send the alert to a destination of your choice.
1. First, define the name of the alert.
2. Set up the criticality based on configurations in the rule and select the tags.
3. Give the alert a severity and select a channel.
  If you need to create a new channel, a link takes you to the Notifications feature, where you can create additional channels as needed.
4. Review the configuration and Create the detector. Once the detector is active, any time a rule is matched for your incoming logs, it will automatically generate a security finding and alert (if configured).
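The choices made in the workflow above can be summarized as a small configuration object. The following Python sketch is a hypothetical model for illustration, not the detector API schema. The DetectorConfig class and the index names are assumptions, and fnmatchcase only approximates how an index pattern with a * wildcard selects indexes.

```python
from dataclasses import dataclass, field
from fnmatch import fnmatchcase
from typing import List


@dataclass
class DetectorConfig:
    name: str
    data_source: str          # a specific index name or a wildcard pattern
    log_type: str             # selects the prepackaged rules to enable
    interval_minutes: int     # schedule on which the rules run
    disabled_rules: List[str] = field(default_factory=list)

    def validate(self) -> List[str]:
        """Return a list of problems; an empty list means the config looks usable."""
        problems = []
        if not self.name.strip():
            problems.append("name is required")
        if not self.data_source.strip():
            problems.append("data source is required")
        if self.interval_minutes < 1:
            problems.append("schedule interval must be at least 1 minute")
        return problems

    def covered_indexes(self, existing: List[str]) -> List[str]:
        """List which existing indexes the data source would select."""
        return [index for index in existing if fnmatchcase(index, self.data_source)]


config = DetectorConfig("windows-detector", "windows-logs-*", "windows", 1)
print(config.validate())
print(config.covered_indexes(["windows-logs-2023.03", "dns-logs-2023.03"]))
```

Checking which indexes a pattern covers before creating the detector helps confirm that the selected log type matches the data being evaluated.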

Configuring custom rules

One of the key capabilities of security analytics is defining custom rules and importing rules created by others, such as a community of threat hunters. As mentioned before, security analytics includes over 2,200 rules out of the box. In some cases, you may want to create your own rules. If you navigate to the Rules page, you have the option to create your own rule.

The rules editor allows you to provide a custom rule that it will automatically validate. Once created, the rule is included in the rules library, helping you to customize your threat hunting needs.

Conclusion

Many organizations struggle with the high cost of commercial alternatives and are required to duplicate their data across multiple systems that generate specific insights. OpenSearch Service security analytics provides an open-source alternative to businesses that seek to reduce the cost of their security products. There is no additional charge for security analytics, and you can customize it to meet the security requirements of your organization. With simple workflows and prepackaged security content, security analytics enables your security teams to detect potential threats quickly while providing the tools to help with security investigations.

To get started, create an Amazon OpenSearch Service domain running OpenSearch version 2.5 or higher, or upgrade your existing domain to that version. To learn more about security analytics, see the documentation.

Disaster Recovery Solutions with AWS-Managed Services, Part 3: Multi-Site Active/Passive

2023-03-10 Brent Kim

Post Syndicated from Brent Kim original https://aws.amazon.com/blogs/architecture/disaster-recovery-solutions-with-aws-managed-services-part-3-multi-site-active-passive/

Welcome to the third post of a multi-part series that addresses disaster recovery (DR) strategies with the use of AWS-managed services to align with customer requirements of performance, cost, and compliance. In part two of this series, we introduced a DR concept that utilizes managed services through a backup and restore strategy with multiple Regions. The post also introduces a multi-site active/passive approach.

The multi-site active/passive approach is best for customers who have business-critical workloads with higher availability requirements over other active/passive environments. A warm-standby strategy (as in Figure 1) is more costly than other active/passive strategies, but provides good protection from downtime and data loss outside of an active/active (A/A) environment.

Figure 1. Warm standby

Implementing the multi-site active/passive strategy

By replicating across multiple Availability Zones in the same Region, your workloads become resilient to the failure of an entire data center. Using multiple Regions is the most resilient option for deploying workloads, and safeguards against the risk of failure of multiple data centers.

Let’s explore an application that processes payment transactions and is modernized to utilize managed services in the AWS Cloud, as in Figure 2.

Figure 2. Warm standby with managed services

Let’s cover each of the components of this application, as well as how managed services behave in a multisite environment.

1. Amazon Route 53 – Active/Passive Failover: This configuration keeps primary resources available and secondary resources on standby in case the primary environment fails. You only need to create the records and specify failover for the routing policy. When responding to queries, Amazon Route 53 includes only the healthy primary resources. If the Route 53 health check for the primary record reports it as unhealthy, Route 53 responds to DNS queries using the secondary record.
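As a rough illustration of the failover record pair, the following Python sketch builds a change batch in the shape that the Route 53 ChangeResourceRecordSets API accepts (for example, as the ChangeBatch argument of the boto3 change_resource_record_sets call). Nothing is sent to AWS. The domain name, IP addresses, set identifiers, and health check ID are hypothetical placeholders, and failover_record is an illustrative helper.

```python
from typing import Any, Dict, Optional


def failover_record(
    name: str,
    ip_address: str,
    role: str,
    set_identifier: str,
    health_check_id: Optional[str] = None,
) -> Dict[str, Any]:
    """Build one UPSERT change for a failover-routed A record."""
    if role not in ("PRIMARY", "SECONDARY"):
        raise ValueError("role must be PRIMARY or SECONDARY")
    record: Dict[str, Any] = {
        "Name": name,
        "Type": "A",
        "SetIdentifier": set_identifier,
        "Failover": role,
        "TTL": 60,
        "ResourceRecords": [{"Value": ip_address}],
    }
    if health_check_id:
        # The health check decides whether this record is served.
        record["HealthCheckId"] = health_check_id
    return {"Action": "UPSERT", "ResourceRecordSet": record}


# Placeholder values: documentation IP ranges and a made-up health check ID.
change_batch = {
    "Changes": [
        failover_record("app.example.com", "192.0.2.10", "PRIMARY",
                        "primary-region", "hypothetical-health-check-id"),
        failover_record("app.example.com", "198.51.100.10", "SECONDARY",
                        "secondary-region"),
    ]
}
print(change_batch)
```

Both records share the same name and type; the Failover value and the health check attached to the primary record are what let Route 53 switch to the secondary.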

2. Amazon EKS control plane: Amazon Elastic Kubernetes Service (Amazon EKS) control plane nodes run in an account managed by AWS. Each EKS cluster control plane is single-tenant and unique, and runs on its own set of Amazon Elastic Compute Cloud (Amazon EC2) instances. Amazon EKS is also a Regional service, so each cluster is confined to the Region where it is deployed, with each cluster being a standalone entity.

3. Amazon EKS data plane: Operating highly available and resilient applications requires a highly available and resilient data plane. It’s best practice to create worker nodes using Amazon EC2 Auto Scaling groups instead of creating individual Amazon EC2 instances and joining them to the cluster.

Figure 2 shows three nodes in the primary Region while there will only be a single node in the secondary. In case of failover, the data plane scales up to meet the workload requirements. This strategy deploys a functional stack to the secondary Region to test Region readiness before failover. You can use Velero with Portworx to manage snapshots of persistent volumes. These snapshots can be stored in an Amazon Simple Storage Service (Amazon S3) bucket in the primary Region, which is replicated to an Amazon S3 bucket in another Region using Amazon S3 cross-Region replication.

During an outage in the primary Region, Velero restores volumes from the latest snapshots in the standby cluster.

4. Amazon OpenSearch Service: With cross-cluster replication in Amazon OpenSearch Service, you can replicate indexes, mappings, and metadata from one OpenSearch Service domain to another. The domain follows an active-passive replication model where the follower index (where the data is replicated) pulls data from the leader index. Using cross-cluster replication helps to ensure recovery from disaster events and allows you to replicate data across geographically distant data centers to reduce latency.

Cross-cluster replication is available on domains running Elasticsearch 7.10 or OpenSearch 1.1 or later. Full documentation for cross-cluster replication is available in the OpenSearch documentation.

If you are using any versions prior to Elasticsearch 7.10 or OpenSearch 1.1, refer to part two of our blog series for guidance on using APIs for cross-Region replication.
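For domains that support cross-cluster replication, starting replication for an index is a single request to the follower domain. The Python sketch below only constructs that request. The index names and connection alias are hypothetical, build_start_replication_request is an illustrative helper, and the path and body fields should be verified against the cross-cluster replication documentation for your version. It also assumes a cross-cluster connection between the two domains already exists.

```python
import json


def build_start_replication_request(
    follower_index,
    leader_alias,
    leader_index,
    leader_role="all_access",
    follower_role="all_access",
):
    """Return (method, path, body) for starting replication on the follower domain."""
    path = f"_plugins/_replication/{follower_index}/_start"
    body = {
        "leader_alias": leader_alias,      # name of the cross-cluster connection
        "leader_index": leader_index,      # index to replicate from
        "use_roles": {
            "leader_cluster_role": leader_role,
            "follower_cluster_role": follower_role,
        },
    }
    return "PUT", path, json.dumps(body, indent=2)


# Hypothetical names for illustration.
method, path, body = build_start_replication_request(
    "payments-follower", "primary-region-connection", "payments"
)
print(method, path)
print(body)
```

The follower pulls changes from the leader, which is why the request targets the follower domain in the standby Region.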

5. Amazon RDS for PostgreSQL: One of the managed service offerings of Amazon Relational Database Service (Amazon RDS) for PostgreSQL is cross-Region read replicas. Cross-Region read replicas enable you to implement a DR solution, scale read database workloads, and perform cross-Region migration.

Amazon RDS for PostgreSQL supports the ability to create read replicas of a source database (DB). Amazon RDS uses an asynchronous replication method of the DB engine to update the read replica whenever there is a change made on the source DB instance. Although read replicas operate as a DB instance that allows only read-only connections, they can be used to implement a DR solution for your production DB environment. If the source DB instance fails, you can promote your Read Replica to a standalone source server.

Using a cross-Region read replica helps ensure that you get back up and running if you experience a Regional availability issue. For more information on PostgreSQL cross-Region read replicas, visit the Best Practices for Amazon RDS for PostgreSQL Cross-Region Read Replicas blog post.

6. Amazon ElastiCache: AWS provides a native solution called Global Datastore that enables cross-Region replication. By using the Global Datastore for Redis feature, you can work with fully managed, fast, reliable, and secure replication across AWS Regions. This feature helps create cross-Region read replica clusters for ElastiCache for Redis to enable low-latency reads and DR across AWS Regions. Each global datastore is a collection of one or more clusters that replicate to one another. When you create a global datastore in Amazon ElastiCache, ElastiCache for Redis automatically replicates your data from the primary cluster to the secondary cluster. ElastiCache then sets up and manages automatic, asynchronous replication of data between the two clusters.

7. Amazon Redshift: With Amazon Redshift, there are only two ways of deploying a true DR approach: backup and restore, and an active/active (A/A) solution. We use the A/A solution because it provides a better recovery time objective (RTO) for the overall approach. The recovery point objective (RPO) depends on the configured schedule of the AWS Lambda functions. The application within the primary Region sends data to both Amazon Simple Notification Service (Amazon SNS) and Amazon S3, and the data is distributed to the Redshift clusters in both Regions through Lambda functions.

Amazon EKS uploads data to an Amazon S3 bucket and publishes a message to an Amazon SNS topic with a reference to the stored S3 object. S3 acts as an intermediate data store for messages beyond the maximum output limit of Amazon SNS. Amazon SNS is configured with primary and secondary Region Amazon Simple Queue Service (Amazon SQS) endpoint subscriptions. Amazon SNS supports the cross-Region delivery of notifications to Amazon SQS queues. Lambda functions deployed in the primary and secondary Region are used to poll the Amazon SQS queue in respective Regions to read the message. The Lambda functions then use the Amazon SQS Extended Client Library for Java to retrieve the Amazon S3 object referenced in the message. Once the Amazon S3 object is retrieved, the Lambda functions upload the data into Amazon Redshift.
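The following is a minimal sketch of the pointer pattern described above, not the format used by the Amazon SQS Extended Client Library. It assumes a hypothetical helper, buildSnsMessage, and a made-up bucket and key; it sends a payload inline when it fits within the 256 KB Amazon SNS message size limit and otherwise sends only a reference to the S3 object.

```javascript
// Hypothetical helper: decide whether a payload fits in an SNS message or
// must be passed by reference to an S3 object. All names are illustrative.
const SNS_MAX_BYTES = 256 * 1024; // Amazon SNS message size limit (256 KB)

function buildSnsMessage(payload, s3Location) {
  const body = JSON.stringify(payload);
  if (Buffer.byteLength(body, "utf8") <= SNS_MAX_BYTES) {
    return { type: "inline", body };
  }
  // Too large: publish only a pointer; the consumer fetches the object from S3.
  return { type: "s3-pointer", body: JSON.stringify(s3Location) };
}

const small = buildSnsMessage({ orderId: 1 }, { bucket: "example-bucket", key: "msg/1.json" });
const large = buildSnsMessage(
  { blob: "x".repeat(300 * 1024) },
  { bucket: "example-bucket", key: "msg/2.json" }
);
console.log(small.type, large.type); // inline s3-pointer
```

In the architecture above, the application always stores the data in Amazon S3 and publishes a reference, which keeps the consumer logic uniform regardless of payload size.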

For more on how to coordinate large messages across accounts and Regions with Amazon SNS and Amazon SQS, explore the Coordinating Large Messages Across Accounts and Regions with Amazon SNS and SQS blog post.

Conclusion

This active/passive approach covers how you can build a creative DR solution using a mix of native and non-native cross-Region replication methods. By using managed services, this strategy becomes simpler through automation of service updates, deployment using infrastructure as code (IaC), and general management of the two environments.


Boosting Resiliency with an ML-based Telemetry Analytics Architecture

2023-03-03 Shibu Nair

Post Syndicated from Shibu Nair original https://aws.amazon.com/blogs/architecture/boosting-resiliency-with-an-ml-based-telemetry-analytics-architecture/

Data proliferation has become the norm, and as organizations become more data driven, automating data pipelines that enable data ingestion, curation, and processing is vital. Because many organizations have thousands of time-bound, automated, complex pipelines, monitoring their telemetry information is critical. Keeping track of telemetry data helps businesses monitor and recover their pipelines faster, which results in better customer experiences.

In this blog post, we explain how you can collect telemetry from your data pipeline jobs and use machine learning (ML) to build a lower- and upper-bound threshold to help operators identify anomalies in near-real time.

The applications of anomaly detection on telemetry data from job pipelines are wide-ranging, including these and more:

Detecting abnormal runtimes
Detecting jobs running slower than expected
Proactive monitoring
Notifications

Key tenets of telemetry analytics

There are five key tenets of telemetry analytics, as in Figure 1.

Figure 1. Key tenets of telemetry analytics

The key tenets for near real-time telemetry analytics for data pipelines are:

Collecting the metrics
Aggregating the metrics
Identifying anomalies
Notifying and resolving issues
Persisting the data for compliance reasons, historical trend analysis, and visualization

This blog post describes how customers can easily implement these steps by using AWS native no-code, low-code (AWS LCNC) solutions.

ML-based telemetry analytics solution architecture

The architecture defined here helps customers incrementally enable features with AWS LCNC solutions by leveraging AWS managed services to avoid the overhead of infrastructure provisioning. Most of the steps are configurations of the features provided by AWS services. This enables customers to make their applications resilient by tracking and resolving anomalies in near real time, as in Figure 2.

Figure 2. ML-based telemetry analytics solution architecture

Let’s explore each of the architecture steps in detail.

1. Indicative AWS data analytics services: Choose from a broad range of AWS analytics services, including data movement, data storage, data lakes, big data analytics, log analytics, and streaming analytics to business intelligence, ML, and beyond. This diagram shows a subset of these data analytics services. You may use one or a combination of many, depending on your use case.

2. Amazon CloudWatch metrics for telemetry analytics: Collecting and visualizing real-time logs, metrics, and event data is a key step in any process. CloudWatch helps you accomplish these tasks without any infrastructure provisioning. Almost every AWS data analytics service is integrated with CloudWatch to enable automatic capturing of the detailed metrics needed for telemetry analytics.

3. Near real-time use case examples: Step three presents practical, near real-time use cases that represent a range of real-world applications, one or more of which may apply to your own business needs.

Use case 1: Anomaly detection

CloudWatch provides the functionality to apply anomaly detection for a metric. The key business use case of this feature is to apply statistical and ML algorithms on a per-metrics basis of business critical applications to proactively identify issues and raise alarms.

The focus is on a single set of metrics that will be important for the application’s functioning—for example, AWS Lambda metrics of a 24/7 credit card company’s fraud monitoring application.
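To make the idea of a lower- and upper-bound threshold concrete, here is a deliberately simple sketch that builds a band of the mean plus or minus k standard deviations from past values. This is an illustration only: CloudWatch anomaly detection uses its own statistical and ML models, and the function names and sample values here are assumptions.

```javascript
// Illustrative only: a mean +/- k standard deviations band.
// CloudWatch anomaly detection uses its own models, not this formula.
function expectedBand(history, k = 2) {
  const mean = history.reduce((sum, v) => sum + v, 0) / history.length;
  const variance =
    history.reduce((sum, v) => sum + (v - mean) ** 2, 0) / history.length;
  const spread = k * Math.sqrt(variance);
  return { lower: mean - spread, upper: mean + spread };
}

// A value is anomalous when it falls outside the expected band.
function isAnomaly(value, band) {
  return value < band.lower || value > band.upper;
}

const band = expectedBand([100, 102, 98, 101, 99]);
console.log(isAnomaly(100, band), isAnomaly(150, band)); // false true
```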

Use case 2: Unified metrics using Amazon Managed Grafana

For proper insights into telemetry data, it is important to unify metrics and collaboratively identify and troubleshoot issues in analytical systems. Amazon Managed Grafana helps to visualize, query, and correlate metrics from CloudWatch in near real-time.

For example, Amazon Managed Grafana can be used to monitor container metrics for Amazon EMR running on Amazon Elastic Kubernetes Service (Amazon EKS), which supports processing high-volume data from business critical Internet of Things (IoT) applications like connected factories, offsite refineries, wind farms, and more.

Use case 3: Combined business and metrics data using Amazon OpenSearch Service

Amazon OpenSearch Service provides the capability to perform near real-time, ML-based interactive log analytics, application monitoring, and search by combining business and telemetry data.

As an example, customers can combine AWS CloudTrail logs for AWS logins, Amazon Athena, and Amazon Redshift query access times with employee reference data to detect insider threats.

This log analytics use case architecture integrates into OpenSearch, as in Figure 3.

Figure 3. Log analytics use case architecture overview with OpenSearch

Use case 4: ML-based advanced analytics

Using Amazon Simple Storage Service (Amazon S3) as data storage, data lake customers can tap into AWS analytics services such as the AWS Glue Catalog, AWS Glue DataBrew, and Athena for preparing and transforming data, as well as build trend analysis using ML models in Amazon SageMaker. This mechanism helps with performing ML-based advanced analytics to identify and resolve recurring issues.

4. Anomaly resolution: When an alert is generated either by CloudWatch alarm, OpenSearch, or Amazon Managed Grafana, you have the option to act on the alert in near-real time. Amazon Simple Notification Service (Amazon SNS) and Lambda can help build workflows. Lambda also helps integrate with ServiceNow ticket creation, Slack channel notifications, or other ticketing systems.
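As a rough sketch of the resolution step, the following code turns CloudWatch alarm notifications delivered through Amazon SNS into ticket payloads. The function name alarmsToTickets and the ticket fields are hypothetical; the event shape (Records, Sns.Message) and the alarm fields (AlarmName, NewStateValue, NewStateReason, StateChangeTime) follow the standard SNS-to-Lambda and CloudWatch alarm notification formats.

```javascript
// Hypothetical resolution step: convert CloudWatch alarm notifications that
// arrive through Amazon SNS into ticket payloads. Ticket fields are assumptions.
function alarmsToTickets(snsEvent) {
  return snsEvent.Records
    .map((record) => JSON.parse(record.Sns.Message))
    .filter((alarm) => alarm.NewStateValue === "ALARM")
    .map((alarm) => ({
      title: `Anomaly: ${alarm.AlarmName}`,
      detail: alarm.NewStateReason,
      occurredAt: alarm.StateChangeTime,
    }));
}

// Sample event with a made-up alarm so the sketch runs on its own.
const sampleEvent = {
  Records: [
    {
      Sns: {
        Message: JSON.stringify({
          AlarmName: "glue-records-read-anomaly",
          NewStateValue: "ALARM",
          NewStateReason: "Threshold crossed",
          StateChangeTime: "2023-03-03T00:00:00.000+0000",
        }),
      },
    },
  ],
};
const tickets = alarmsToTickets(sampleEvent);
console.log(tickets[0].title); // Anomaly: glue-records-read-anomaly
```

A Lambda handler would call a function like this and then forward each ticket to ServiceNow, Slack, or another system through that system's own API.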

Simple data pipeline example

Let’s explore another practical example using an architecture that demonstrates how AWS Step Functions orchestrates Lambda, AWS Glue jobs, and crawlers.

To report an anomaly on AWS Glue jobs based on total number of records processed, you can leverage the glue.driver.aggregate.recordsRead CloudWatch metric and set up a CloudWatch alarm based on anomaly detection, Amazon SNS topic for notifications, and Lambda for resolution, as in Figure 4.

Figure 4. AWS Step Functions orchestrating Lambda, AWS Glue jobs, and crawlers

Here are the steps involved in the architecture proposed:

CloudWatch automatically captures the metric glue.driver.aggregate.recordsRead from AWS Glue jobs.
Customers set a CloudWatch alarm based on the anomaly detection of the glue.driver.aggregate.recordsRead metric and set a notification to an Amazon SNS topic.
CloudWatch applies an ML algorithm to the metric’s past data and creates a model of the metric’s expected values.
When the number of records increases significantly and the metric leaves the range expected by the CloudWatch anomaly detection model, the alarm notifies the Amazon SNS topic.
Customers can notify an email group and trigger a Lambda function to resolve the issue, or create tickets in their operational monitoring system.
Customers can also unify all the AWS Glue metrics using Amazon Managed Grafana. Using Amazon S3, data lake customers can crawl and catalog the data in the AWS Glue catalog and make it available for ad-hoc querying. Amazon SageMaker can be used for custom model training and inferencing.
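The alarm from the steps above can be sketched as the parameters of a CloudWatch PutMetricAlarm request. The alarm name, job name, dimension values, period, and band width of 2 are assumptions for illustration; check the dimensions against the metrics your AWS Glue jobs actually emit before using them.

```javascript
// Sketch of PutMetricAlarm parameters for an anomaly detection alarm.
// Alarm name, dimensions, period, and band width are assumptions.
function buildAnomalyAlarmParams(jobName, snsTopicArn) {
  return {
    AlarmName: `${jobName}-recordsRead-anomaly`,
    ComparisonOperator: "GreaterThanUpperThreshold",
    EvaluationPeriods: 1,
    ThresholdMetricId: "band", // must match the Id of the band expression below
    AlarmActions: [snsTopicArn],
    Metrics: [
      {
        Id: "records",
        ReturnData: true,
        MetricStat: {
          Metric: {
            Namespace: "Glue",
            MetricName: "glue.driver.aggregate.recordsRead",
            Dimensions: [
              { Name: "JobName", Value: jobName },
              { Name: "JobRunId", Value: "ALL" },
              { Name: "Type", Value: "count" },
            ],
          },
          Period: 300,
          Stat: "Sum",
        },
      },
      // Expected-value band computed by the CloudWatch anomaly detection model.
      { Id: "band", Expression: "ANOMALY_DETECTION_BAND(records, 2)", ReturnData: true },
    ],
  };
}

const alarmParams = buildAnomalyAlarmParams(
  "example-glue-job",
  "arn:aws:sns:us-east-1:123456789012:example-topic" // placeholder ARN
);
console.log(alarmParams.Metrics[1].Expression);
```

The resulting object could be passed to the PutMetricAlarm operation through the AWS SDK or translated into an infrastructure-as-code template.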

Conclusion

In this blog post, we covered a recommended architecture to enable near-real time telemetry analytics for data pipelines, anomaly detection, notification, and resolution. This provides resiliency to the customer applications by proactively identifying and resolving issues.

Build a semantic search engine for tabular columns with Transformers and Amazon OpenSearch Service

2023-03-01 Kachi Odoemene

Post Syndicated from Kachi Odoemene original https://aws.amazon.com/blogs/big-data/build-a-semantic-search-engine-for-tabular-columns-with-transformers-and-amazon-opensearch-service/

Finding similar columns in a data lake has important applications in data cleaning and annotation, schema matching, data discovery, and analytics across multiple data sources. The inability to accurately find and analyze data from disparate sources represents a potential efficiency killer for everyone from data scientists, medical researchers, and academics to financial and government analysts.

Conventional solutions involve lexical keyword search or regular expression matching, which are susceptible to data quality issues such as absent column names or different column naming conventions across diverse datasets (for example, zip_code, zcode, postalcode).
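A small sketch shows why lexical matching falls short for the naming variants mentioned above. The column list and the keywordSearch helper are made up for illustration.

```javascript
// Illustrative only: lexical matching finds a column only when its name
// contains the query text, so naming variants are easy to miss.
const columnNames = ["zip_code", "zcode", "postalcode", "customer_name"];

function keywordSearch(query, names) {
  const needle = query.toLowerCase();
  return names.filter((name) => name.toLowerCase().includes(needle));
}

const matches = keywordSearch("zip", columnNames);
console.log(matches); // [ 'zip_code' ]
```

The search for zip misses zcode and postalcode even though all three columns hold the same kind of data.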

In this post, we demonstrate a solution for searching for similar columns based on column name, column content, or both. The solution uses approximate nearest neighbors algorithms available in Amazon OpenSearch Service to search for semantically similar columns. To facilitate the search, we create feature representations (embeddings) for individual columns in the data lake using pre-trained Transformer models from the sentence-transformers library in Amazon SageMaker. Finally, to interact with and visualize results from our solution, we build an interactive Streamlit web application running on AWS Fargate.

We include a code tutorial for you to deploy the resources to run the solution on sample data or your own data.

Solution overview

The following architecture diagram illustrates the two-stage workflow for finding semantically similar columns. The first stage runs an AWS Step Functions workflow that creates embeddings from tabular columns and builds the OpenSearch Service search index. The second stage, or the online inference stage, runs a Streamlit application through Fargate. The web application collects input search queries and retrieves from the OpenSearch Service index the approximate k-most-similar columns to the query.

Figure 1. Solution architecture

The automated workflow proceeds in the following steps:

The user uploads tabular datasets into an Amazon Simple Storage Service (Amazon S3) bucket, which invokes an AWS Lambda function that initiates the Step Functions workflow.
The workflow begins with an AWS Glue job that converts the CSV files into Apache Parquet data format.
A SageMaker Processing job creates embeddings for each column using pre-trained models or custom column embedding models. The SageMaker Processing job saves the column embeddings for each table in Amazon S3.
A Lambda function creates the OpenSearch Service domain and cluster to index the column embeddings produced in the previous step.
Finally, an interactive Streamlit web application is deployed with Fargate. The web application provides an interface for the user to input queries to search the OpenSearch Service domain for similar columns.

You can download the code tutorial from GitHub to try this solution on sample data or your own data. Instructions on how to deploy the required resources for this tutorial are available on GitHub.

Prerequisites

To implement this solution, you need the following:

An AWS account.
Basic familiarity with AWS services such as the AWS Cloud Development Kit (AWS CDK), Lambda, OpenSearch Service, and SageMaker Processing.
A tabular dataset to create the search index. You can bring your own tabular data or download the sample datasets on GitHub.

Build a search index

The first stage builds the column search engine index. The following figure illustrates the Step Functions workflow that runs this stage.

Figure 2. Step Functions workflow with multiple embedding models

Datasets

In this post, we build a search index to include over 400 columns from over 25 tabular datasets. The datasets originate from the following public sources:

s3://sagemaker-sample-files/datasets/tabular/
NYC Open Data
Chicago Data Portal

For the full list of the tables included in the index, see the code tutorial on GitHub.

You can bring your own tabular dataset to augment the sample data or build your own search index. We include two Lambda functions that initiate the Step Functions workflow to build the search index for individual CSV files or a batch of CSV files, respectively.

Transform CSV to Parquet

Raw CSV files are converted to Parquet data format with AWS Glue. Parquet is a column-oriented file format preferred in big data analytics that provides efficient compression and encoding. In our experiments, the Parquet data format offered a significant reduction in storage size compared to raw CSV files. We also used Parquet as a common data format to convert other data formats (for example, JSON and NDJSON) because it supports advanced nested data structures.

Create tabular column embeddings

To extract embeddings for individual table columns in the sample tabular datasets in this post, we use the following pre-trained models from the sentence-transformers library. For additional models, see Pretrained Models.

Model name	Dimension	Size (MB)
all-MiniLM-L6-v2	384	80
all-distilroberta-v1	768	290
average_word_embeddings_glove.6B.300d	300	420

The SageMaker Processing job runs create_embeddings.py (code) for a single model. For extracting embeddings from multiple models, the workflow runs parallel SageMaker Processing jobs as shown in the Step Functions workflow. We use the model to create two sets of embeddings:

column_name_embeddings – Embeddings of column names (headers)
column_content_embeddings – Average embedding of all the rows in the column

For more information about the column embedding process, see the code tutorial on GitHub.
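As a sketch of how a column content embedding can be formed, the following code averages row embeddings element by element (mean pooling). It is not the code from create_embeddings.py; the averageEmbedding helper and the tiny 3-dimensional vectors are made up for illustration, whereas a model such as all-MiniLM-L6-v2 produces 384-dimensional vectors.

```javascript
// Sketch of mean pooling: average the row embeddings of one column into a
// single vector. The input vectors are made-up 3-dimensional examples.
function averageEmbedding(rowEmbeddings) {
  const dimension = rowEmbeddings[0].length;
  const sum = new Array(dimension).fill(0);
  for (const embedding of rowEmbeddings) {
    embedding.forEach((value, i) => {
      sum[i] += value;
    });
  }
  return sum.map((total) => total / rowEmbeddings.length);
}

const columnContentEmbedding = averageEmbedding([
  [1, 0, 2],
  [3, 2, 0],
]);
console.log(columnContentEmbedding); // [ 2, 1, 1 ]
```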

An alternative to the SageMaker Processing step is to create a SageMaker batch transform to get column embeddings on large datasets. This would require deploying the model to a SageMaker endpoint. For more information, see Use Batch Transform.

Index embeddings with OpenSearch Service

In the final step of this stage, a Lambda function adds the column embeddings to an OpenSearch Service approximate k-Nearest-Neighbor (kNN) search index. Each model is assigned its own search index. For more information about the approximate kNN search index parameters, see k-NN.
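An index for approximate kNN search might be defined along the following lines. This is a sketch under assumptions, not the index definition used in the tutorial: the field names and the HNSW, cosine similarity, and nmslib settings are illustrative choices, and the dimension must match the embedding model (384 for all-MiniLM-L6-v2).

```javascript
// Sketch of an index body for approximate k-NN search in OpenSearch.
// Field names and method settings are assumptions for illustration.
function buildKnnIndexBody(dimension) {
  const vectorField = {
    type: "knn_vector",
    dimension,
    method: { name: "hnsw", space_type: "cosinesimil", engine: "nmslib" },
  };
  return {
    settings: { index: { knn: true } },
    mappings: {
      properties: {
        table_name: { type: "keyword" },
        column_name: { type: "keyword" },
        column_name_embedding: vectorField,
        column_content_embedding: vectorField,
      },
    },
  };
}

const indexBody = buildKnnIndexBody(384);
console.log(JSON.stringify(indexBody, null, 2));
```

Because each model produces vectors of a different dimension, giving each model its own index keeps the mapping consistent.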

Online inference and semantic search with a web app

The second stage of the workflow runs a Streamlit web application where you can provide inputs and search for semantically similar columns indexed in OpenSearch Service. The application layer uses an Application Load Balancer, Fargate, and Lambda. The application infrastructure is automatically deployed as part of the solution.

The application allows you to provide an input and search for semantically similar column names, column content, or both. Additionally, you can select the embedding model and number of nearest neighbors to return from the search. The application receives inputs, embeds the input with the specified model, and uses kNN search in OpenSearch Service to search indexed column embeddings and find the most similar columns to the given input. The search results displayed include the table names, column names, and similarity scores for the columns identified, as well as the locations of the data in Amazon S3 for further exploration.
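The search step can be sketched as an OpenSearch k-NN query body. The field names (column_name_embedding, table_name, column_name) are hypothetical, and the short query vector stands in for the embedding that the selected model would produce for the input text.

```javascript
// Sketch of an approximate k-NN query body: return the k indexed vectors
// nearest to the query vector. Field names are assumptions.
function buildKnnQuery(field, queryVector, k) {
  return {
    size: k,
    _source: ["table_name", "column_name"],
    query: {
      knn: {
        [field]: { vector: queryVector, k },
      },
    },
  };
}

// A real query vector would come from embedding the user's input text.
const knnQuery = buildKnnQuery("column_name_embedding", [0.1, 0.2, 0.3], 10);
console.log(JSON.stringify(knnQuery));
```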

The following figure shows an example of the web application. In this example, we searched for columns in our data lake that have similar Column Names (payload type) to district (payload). The application used all-MiniLM-L6-v2 as the embedding model and returned 10 (k) nearest neighbors from our OpenSearch Service index.

The application returned transit_district, city, borough, and location as the four most similar columns based on the data indexed in OpenSearch Service. This example demonstrates the ability of the search approach to identify semantically similar columns across datasets.

Figure 3. Web application user interface

Clean up

To delete the resources created by the AWS CDK in this tutorial, run the following command:

cdk destroy --all

Conclusion

In this post, we presented an end-to-end workflow for building a semantic search engine for tabular columns.

Get started today on your own data with our code tutorial available on GitHub. If you’d like help accelerating your use of ML in your products and processes, please contact the Amazon Machine Learning Solutions Lab.

About the Authors

Kachi Odoemene is an Applied Scientist at AWS AI. He builds AI/ML solutions to solve business problems for AWS customers.

Taylor McNally is a Deep Learning Architect at Amazon Machine Learning Solutions Lab. He helps customers from various industries build solutions leveraging AI/ML on AWS. He enjoys a good cup of coffee, the outdoors, and time with his family and energetic dog.

Austin Welch is a Data Scientist in the Amazon ML Solutions Lab. He develops custom deep learning models to help AWS public sector customers accelerate their AI and cloud adoption. In his spare time, he enjoys reading, traveling, and jiu-jitsu.

Control access to Amazon OpenSearch Service Dashboards with attribute-based role mappings

2023-02-23 Stefan Appel

Post Syndicated from Stefan Appel original https://aws.amazon.com/blogs/big-data/control-access-to-amazon-opensearch-service-dashboards-with-attribute-based-role-mappings/

Federated users of Amazon OpenSearch Service often need access to OpenSearch Dashboards with roles based on their user profiles. OpenSearch Service fine-grained access control maps authenticated users to OpenSearch Service roles and then evaluates permissions to determine how to handle the user’s actions. However, when an enterprise-wide identity provider (IdP) manages the users, the mapping of users to OpenSearch Service roles often needs to happen dynamically based on IdP user attributes. One option to map users is to use OpenSearch Service SAML integration and pass user group information to OpenSearch Service. Another option is Amazon Cognito role-based access control, which supports rule-based or token-based mappings. But neither approach supports arbitrary role mapping logic, for example, when you need to interpret multivalued user attributes to identify a target role.

This post shows how you can implement custom role mappings with an Amazon Cognito pre-token generation AWS Lambda trigger. For our example, we use a multivalued attribute provided over OpenID Connect (OIDC) to Amazon Cognito. We show how you are in full control of the mapping logic and process of such a multivalued attribute for AWS Identity and Access Management (IAM) role lookups. Our approach is generic for OIDC-compatible IdPs. To make this post self-contained, we use the Okta IdP as an example to walk through the setup.

Overview of solution

The provided solution intercepts the OIDC-based login process to OpenSearch Dashboards with a pre-token generation Lambda function. The login to OpenSearch Dashboards with a third-party IdP and Amazon Cognito as an intermediary consists of several steps:

First, the initial user request to OpenSearch Dashboards is redirected to Amazon Cognito.
Amazon Cognito redirects the request to the IdP for authentication.
After the user authenticates, the IdP sends the identity token (ID token) back to Amazon Cognito.
Amazon Cognito invokes a Lambda function that modifies the obtained token. We use an Amazon DynamoDB table to perform role mapping lookups. The modified token now contains the IAM role mapping information.
Amazon Cognito uses this role mapping information to map the user to the specified IAM role and provides the role credentials.
OpenSearch Service maps the IAM role credentials to OpenSearch roles and applies fine-grained permission checks.

The following architecture outlines the login flow from a user’s perspective.

On the backend, OpenSearch Dashboards integrates with an Amazon Cognito user pool and an Amazon Cognito identity pool during the authentication flow. The steps are as follows:

Authenticate and get tokens.
Look up the token attribute and IAM role mapping and overwrite the Amazon Cognito attribute.
Exchange tokens for AWS credentials used by OpenSearch Dashboards.

The following architecture shows this backend perspective to the authentication process.

In the remainder of this post, we walk through the configurations necessary for an authentication flow in which a Lambda function implements custom role mapping logic. We provide sample Lambda code for the mapping of multivalued OIDC attributes to IAM roles based on a DynamoDB lookup table with the following structure.

OIDC Attribute Value	IAM Role
`["attribute_a","attribute_b"]`	`arn:aws:iam::<aws-account-id>:role/<role-name-01>`
`["attribute_a","attribute_x"]`	`arn:aws:iam::<aws-account-id>:role/<role-name-02>`

The high-level steps of the solution presented in this post are as follows:

Configure Amazon Cognito authentication for OpenSearch Dashboards.
Add IAM roles for mappings to OpenSearch Service roles.
Configure the Okta IdP.
Add a third-party OIDC IdP to the Amazon Cognito user pool.
Map IAM roles to OpenSearch Service roles.
Create the DynamoDB attribute-role mapping table.
Deploy and configure the pre-token generation Lambda function.
Configure the pre-token generation Lambda trigger.
Test the login to OpenSearch Dashboards.

Prerequisites

For this walkthrough, you should have the following prerequisites:

An AWS account with an OpenSearch Service domain.
A third-party IdP that supports OpenID Connect and adds a multivalued attribute in the authorization token. For this post, we use attributes_array as this attribute’s name and Okta as the IdP. You can create an Okta Developer Edition free account to test the setup.

Configure Amazon Cognito authentication for OpenSearch Dashboards

The modification of authentication tokens requires you to configure the OpenSearch Service domain to use Amazon Cognito for authentication. For instructions, refer to Configuring Amazon Cognito authentication for OpenSearch Dashboards.

The Lambda function implements custom role mappings by setting the cognito:preferred_role claim (for more information, refer to Role-based access control). For the correct interpretation of this claim, set the Amazon Cognito identity pool to Choose role from token. The Amazon Cognito identity pool then uses the value of the cognito:preferred_role claim to select the correct IAM role. The following screenshot shows the required settings in the Amazon Cognito identity pool that is created during the configuration of Amazon Cognito authentication for OpenSearch Service.

Add IAM roles for mappings to OpenSearch roles

IAM roles used for mappings to OpenSearch roles require a trust policy so that authenticated users can assume them. The trust policy needs to reference the Amazon Cognito identity pool created during the configuration of Amazon Cognito authentication for OpenSearch Service. Create at least one IAM role with a custom trust policy. For instructions, refer to Creating a role using custom trust policies. The IAM role doesn’t require the attachment of a permission policy. For a sample trust policy, refer to Role-based access control.
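A trust policy for such a role typically looks like the following sketch, which allows identities authenticated through a given Amazon Cognito identity pool to assume the role. The buildTrustPolicy helper and the identity pool ID are placeholders; refer to Role-based access control for the authoritative sample.

```javascript
// Sketch of a trust policy that lets authenticated identities from one
// Amazon Cognito identity pool assume the role. The pool ID is a placeholder.
function buildTrustPolicy(identityPoolId) {
  return {
    Version: "2012-10-17",
    Statement: [
      {
        Effect: "Allow",
        Principal: { Federated: "cognito-identity.amazonaws.com" },
        Action: "sts:AssumeRoleWithWebIdentity",
        Condition: {
          StringEquals: {
            "cognito-identity.amazonaws.com:aud": identityPoolId,
          },
          "ForAnyValue:StringLike": {
            "cognito-identity.amazonaws.com:amr": "authenticated",
          },
        },
      },
    ],
  };
}

const trustPolicy = buildTrustPolicy("us-east-1:00000000-0000-0000-0000-000000000000");
console.log(JSON.stringify(trustPolicy, null, 2));
```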

Configure the Okta IdP

In this section, we describe the configuration steps to include a multivalued attributes_array attribute in the token provided by Okta. For more information, refer to Customize tokens returned from Okta with custom claims. We use the Okta UI to perform the configurations. Okta also provides an API that you can use to script and automate the setup.

The first step is adding the attributes_array attribute to the Okta user profile.

Use Okta’s Profile Editor under Directory, Profile Editor.
Select User (default) and then choose Add Attribute.
Add an attribute with a display name and variable name attributes_array of type string array.

The following screenshot shows the Okta default user profile after the custom attribute has been added.

Next, add attributes_array attribute values to users using Okta’s user management interface under Directory, People.
Select a user and choose Profile.
Choose Edit and enter attribute values.

The following screenshot shows an example of attributes_array attribute values within a user profile.

The next step is adding the attributes_array attribute to the ID token that is generated during the authentication process.

On the Okta console, choose Security, API and select the default authorization server.
Choose Claims and choose Add Claim to add the attributes_array attribute as part of the ID token.
As the scope, enter openid and as the attribute value, enter user.attributes_array.

This references the previously created attribute in a user’s profile.

Next, create an application for the federation with Amazon Cognito. For instructions, refer to How do I set up Okta as an OpenID Connect identity provider in an Amazon Cognito user pool.

The last step assigns the Okta application to Okta users.

Navigate to Directory, People, select a user, and choose Assign Applications.
Select the application you created in the previous step.

Add a third-party OIDC IdP to the Amazon Cognito user pool

We are implementing the role mapping based on the information provided in a multivalued OIDC attribute. The authentication token needs to include this attribute. If you followed the previously described Okta configuration, the attribute is automatically added to the ID token of a user. If you used another IdP, you might have to request the attribute explicitly. For this, add the attribute name to the Authorized scopes list of the IdP in Amazon Cognito.

For instructions on how to set up the federation between a third-party IdP and an Amazon Cognito user pool and how to request additional attributes, refer to Adding OIDC identity providers to a user pool. For a detailed walkthrough for Okta, refer to How do I set up Okta as an OpenID Connect identity provider in an Amazon Cognito user pool.

After requesting the token via OIDC, you need to map the attribute to an Amazon Cognito user pool attribute. For instructions, refer to Specifying identity provider attribute mappings for your user pool. The following screenshot shows the resulting configuration on the Amazon Cognito console.

Map IAM roles to OpenSearch Service roles

Upon login, OpenSearch Service maps users to an OpenSearch Service role based on the IAM role ARN set in the cognito:preferred_role claim by the pre-token generation Lambda trigger. This requires a role mapping in OpenSearch Service. To add such role mappings to IAM backend roles, refer to Mapping roles to users. The following screenshot shows a role mapping on the OpenSearch Dashboards console.

Create the attribute-role mapping table

For this solution, we use DynamoDB to store mappings of attribute value combinations to IAM roles. For instructions, refer to Create a table and define a partition key named Key of type String. You need the table name in the subsequent step to configure the Lambda function.

The next step is writing the mapping information into the table. A mapping entry consists of the following attributes:

Key – A string that contains attribute values in comma-separated alphabetical order
RoleArn – A string with the IAM role ARN to which the attribute value combination should be mapped

For details on how to add data to a DynamoDB table, refer to Write data to a table using the console or AWS CLI.

For example, if the previously configured OIDC attribute attributes_array contains three values, attribute_a, attribute_b, and attribute_c, the entry in the mapping table looks like table line 1 in the following screenshot.
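The key format can be sketched as follows. The buildMappingItem helper and the role ARN are placeholders; the sorting and comma joining mirror what the Lambda function in the next section does when it builds its lookup key, and the result uses the DynamoDB attribute-value format.

```javascript
// Sketch: build a mapping table item from attribute values and a role ARN.
// Values are sorted and comma-joined so the key is independent of input order.
function buildMappingItem(attributeValues, roleArn) {
  const key = [...attributeValues].sort().join(",");
  return { Key: { S: key }, RoleArn: { S: roleArn } };
}

const mappingItem = buildMappingItem(
  ["attribute_c", "attribute_a", "attribute_b"],
  "arn:aws:iam::123456789012:role/example-role" // placeholder ARN
);
console.log(mappingItem.Key.S); // attribute_a,attribute_b,attribute_c
```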

Deploy and configure the pre-token generation Lambda function

A Lambda function implements the custom role mapping logic. The Lambda function receives an Amazon Cognito event as input and extracts attribute information out of it. It uses the attribute information for a lookup in a DynamoDB table and retrieves the value for cognito:preferred_role. Follow the steps in Getting started with Lambda to create a Node.js Lambda function and insert the following source code:

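// AWS SDK for JavaScript v2
const AWS = require("aws-sdk");

// Configuration is read from the Lambda environment variables.
const tableName = process.env.TABLE_NAME;
const unauthorizedRoleArn = process.env.UNAUTHORIZED_ROLE;
const userAttributeArrayName = process.env.USER_POOL_ATTRIBUTE;
const dynamodbClient = new AWS.DynamoDB({apiVersion: "2012-08-10"});

exports.lambdaHandler = handlePreTokenGenerationEvent;

// Entry point: map the user's attribute values to an IAM role and
// write that role into the token as the preferred role.
async function handlePreTokenGenerationEvent(event, context) {
    const sortedAttributeList = getSortedAttributeList(event);
    const lookupKey = sortedAttributeList.join(',');
    const roleArn = await lookupIAMRoleArn(lookupKey);
    appendResponseWithPreferredRole(event, roleArn);
    return event;
}

// Parse the multivalued attribute (a JSON array string) and sort it so that
// the lookup key does not depend on the order of the values.
function getSortedAttributeList(event) {
    return JSON.parse(event['request']['userAttributes'][userAttributeArrayName]).sort();
}

// Look up the IAM role ARN for the key. Fall back to the unauthorized role
// when no mapping exists or the lookup fails.
async function lookupIAMRoleArn(key) {
    const params = {
        TableName: tableName,
        Key: {
            'Key': {S: key}
        },
        ProjectionExpression: 'RoleArn'
    };
    try {
        const result = await dynamodbClient.getItem(params).promise();
        if (!result.Item) {
            return unauthorizedRoleArn;
        }
        return result.Item.RoleArn.S;
    } catch (e) {
        console.log(e);
        return unauthorizedRoleArn;
    }
}

// Set the cognito:preferred_role claim through the group override details.
function appendResponseWithPreferredRole(event, roleArn) {
    event.response = {
        'claimsOverrideDetails': {
            'groupOverrideDetails': {
                'preferredRole': roleArn
            }
        }
    };
}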

The Lambda function expects three environment variables. Refer to Using AWS Lambda environment variables for instructions to add the following entries:

TABLE_NAME – The name of the previously created DynamoDB table. This table is used for the lookups.
UNAUTHORIZED_ROLE – The ARN of the IAM role that is used when no mapping is found in the lookup table.
USER_POOL_ATTRIBUTE – The Amazon Cognito user pool attribute used for the IAM role lookup. In our example, this attribute is named custom:attributes_array.

The following screenshot shows the final configuration.

The Lambda function needs permissions to access the DynamoDB lookup table. Set permissions as follows: attach the following policy to the Lambda execution role (for instructions, refer to Lambda execution role) and provide the Region, AWS account number, and DynamoDB table name:

{
    "Statement": [
        {
            "Action": [
                "dynamodb:GetItem",
                "dynamodb:Scan",
                "dynamodb:Query",
                "dynamodb:BatchGetItem",
                "dynamodb:DescribeTable"
            ],
            "Resource": [
                "arn:aws:dynamodb:<region>:<accountid>:table/<table>",
                "arn:aws:dynamodb:<region>:<accountid>:table/<table>/index/*"
            ],
            "Effect": "Allow"
        }
    ]
}

The configuration of the Lambda function is now complete.

Configure the pre-token generation Lambda trigger

As a final step, add a pre-token generation trigger to the Amazon Cognito user pool and reference the newly created Lambda function. For details, refer to Customizing user pool workflows with Lambda triggers. The following screenshot shows the configuration.

This step completes the setup; Amazon Cognito now maps users to OpenSearch Service roles based on the values provided in an OIDC attribute.

Test the login to OpenSearch Dashboards

The following diagram shows an example login flow and the corresponding screenshots for an Okta user, user1, whose user profile attribute attribute_array has the value ["attribute_a", "attribute_b", "attribute_c"].
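
For this example user, the Lambda function sorts the attribute values and joins them with commas to form the lookup key. The sketch below derives that key and shows the shape of a matching lookup table item in the low-level attribute-value format that the function's getItem call reads. The helper buildLookupItem is illustrative only, and the role ARN is a hypothetical placeholder.

```javascript
// Illustrative helper (not part of the solution): derive the lookup key the
// Lambda function would compute and wrap it in a DynamoDB-style item.
function buildLookupItem(attributeArrayJson, roleArn) {
    const key = JSON.parse(attributeArrayJson).sort().join(',');
    return { Key: { S: key }, RoleArn: { S: roleArn } };
}

// Hypothetical role ARN; replace it with the role this user should assume.
const exampleRoleArn = 'arn:aws:iam::111122223333:role/example-role-abc';

const item = buildLookupItem('["attribute_a", "attribute_b", "attribute_c"]', exampleRoleArn);
// The same key results when the identity provider sends the values in another order.
const reordered = buildLookupItem('["attribute_c", "attribute_a", "attribute_b"]', exampleRoleArn);

console.log(item.Key.S);
console.log(item.Key.S === reordered.Key.S);
```

If no item with this key exists in the table, the function returns the role configured in UNAUTHORIZED_ROLE instead.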

Clean up

To avoid incurring future charges, delete the OpenSearch Service domain, Amazon Cognito user pool and identity pool, Lambda function, and DynamoDB table created as part of this post.

Conclusion

In this post, we demonstrated how to set up a custom mapping to OpenSearch Service roles using values provided via an OIDC attribute. We dynamically set the cognito:preferred_role claim using an Amazon Cognito pre-token generation Lambda trigger and a DynamoDB table for lookup. The solution can handle dynamic multivalued user attributes, and you can extend it with application logic that goes beyond a simple lookup. The steps in this post are a proof of concept. If you plan to develop this into a production solution, we recommend implementing Okta and AWS security best practices.

This post highlights just one way to use Amazon Cognito support for Lambda triggers to meet custom authentication needs. If you’re interested in further details, refer to How to Use Cognito Pre-Token Generation trigger to Customize Claims In ID Tokens.

About the Authors

Stefan Appel is a Senior Solutions Architect at AWS. For more than 10 years, he has supported enterprise customers in adopting cloud technologies. Before joining AWS, Stefan held positions in software architecture, product management, and IT operations departments. He began his career in research on event-based systems. In his spare time, he enjoys hiking and has walked the length of New Zealand following Te Araroa.

Modood Alvi is a Senior Solutions Architect at Amazon Web Services (AWS). Modood is passionate about digital transformation and is committed to helping large enterprise customers across the globe accelerate their adoption of and migration to the cloud. Modood brings more than a decade of experience in software development, having held various technical roles within companies like SAP and Porsche Digital. Modood earned his Diploma in Computer Science from the University of Stuttgart.