Tag Archives: Amazon OpenSearch Service

Use the reverse token filter to enable suffix matching queries in OpenSearch

Post Syndicated from Bharav Patel original https://aws.amazon.com/blogs/big-data/use-the-reverse-token-filter-to-enable-suffix-matching-queries-in-opensearch/

OpenSearch is an open-source RESTful search engine built on top of the Apache Lucene library. OpenSearch full-text search is fast and can return results for complex queries within a fraction of a second. With OpenSearch, you can convert unstructured text into structured text using different text analyzers, tokenizers, and filters to improve search. OpenSearch uses a default analyzer, called the standard analyzer, which works well for most use cases out of the box. But for some use cases, it may not be the best fit, and you need to use a specific analyzer.

In this post, we show how you can implement a suffix-based search. To find a document with the movie name “saving private ryan,” for example, you can use the prefix “saving” with a prefix-based query. Occasionally, you want to match suffixes as well, such as matching “Harry Potter Goblet of Fire” with the suffix “Fire.” To do that, first reverse the string to “eriF telboG rettoP yrraH” with the reverse token filter, then query for the prefix “eriF”.

Solution overview

Text analysis involves transforming unstructured text, such as the content of an email or a product description, into a structured format that is finely tuned for effective searching. An analyzer enables full-text search through tokenization: breaking text down into smaller fragments known as tokens, which commonly represent individual words. To implement a reversed field search, the analyzer processes text in the following order:

  1. A character filter replaces - with _. For example, from “My Driving License Number Is 123-456-789” to “My Driving License Number Is 123_456_789.”
  2. The standard tokenizer splits texts into tokens. For example, from “My Driving License Number Is 123_456_789” to “[ my, driving, license, number, is, 123, 456, 789 ].”
  3. The reverse token filter reverses each token in a stream. For example, from [ my, driving, license, number, is, 123, 456, 789 ] to [ ym, gnivird, esnecil, rebmun, si, 321, 654, 987 ].

The standard analyzer (the default analyzer) breaks down input strings into tokens based on word boundaries and removes most punctuation marks. For additional information about analyzers, refer to Built-in analyzers.
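You can try out an analyzer chain before attaching it to any index by using the _analyze API. The following request is a minimal sketch that runs the standard tokenizer and the reverse token filter over the sample string from step 2; the response lists each token reversed, matching step 3:

GET _analyze
{
  "tokenizer": "standard",
  "filter": ["reverse"],
  "text": "My Driving License Number Is 123_456_789"
}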

Indexing and searching

Every document is a collection of fields, each with its own specific data type. When you create a mapping for your data, you create a mapping definition, which contains a list of fields that are pertinent to the document. To learn more about index mappings, refer to index mapping.

Let’s take the example of an analyzer with the reverse token filter applied on the text field.

  1. First, create an index with mappings as shown in the following code. The new reverse_title field is derived from the title field for suffix search, and the original title field is used for normal search.
PUT movies
{
  "settings" : {
    "analysis" : {
      "analyzer" : {
        "whitespace_reverse" : {
          "tokenizer" : "whitespace",
          "filter" : ["reverse"]
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "title": { 
        "type": "text",
        "analyzer": "standard",
        "copy_to": "reverse_title"
      },
      "reverse_title": {
        "type": "text",
        "analyzer": "whitespace_reverse"
      }
    }
  }
}

  2. Insert some documents into the index:
POST _bulk
{ "index" : { "_index" : "movies", "_id" : "1" } }
{ "title": "Harry Potter Goblet of Fire" }
{ "index" : { "_index" : "movies", "_id" : "2" } }
{ "title": "Lord of the rings" }
{ "index" : { "_index" : "movies", "_id" : "3" } }
{ "title": "Saving Private Ryan" }
  3. Run the following query to perform a suffix search for “Fire” on the derived field reverse_title:
GET movies/_search
{
  "query": {
    "prefix": {
      "reverse_title": {
        "value": "eriF"
      }
    }
  }
}

The following code shows our results:

   {
        "_index": "movies",
        "_id": "1",
        "_score": 1,
        "_source": {
          "title": "Harry Potter Goblet of Fire"
        }
      }
  4. For a non-reverse search, use the original title field:
GET movies/_search
{
  "query": {
    "match": {
      "title": "Fire"
    }
  }
}

The following code shows our result:

{
        "_index": "movies",
        "_id": "1",
        "_score": 0.2876821,
        "_source": {
          "title": "Harry Potter Goblet of Fire"
        }
}

The query returns a document with the movie name “Harry Potter Goblet of Fire”.
If you’re curious about how search works at a high level, refer to A query, or There and Back Again.

Conclusion

In this post, you walked through how text analysis works in OpenSearch and how to implement suffix-based search effectively using the reverse token filter.

If you have feedback about this post, submit your comments in the comments section.


About the Authors

Bharav Patel is a Specialist Solution Architect, Analytics at Amazon Web Services. He primarily works on Amazon OpenSearch Service and helps customers with key concepts and design principles of running OpenSearch workloads on the cloud. Bharav likes to explore new places and try out different cuisines.

Introducing Amazon MSK as a source for Amazon OpenSearch Ingestion

Post Syndicated from Muthu Pitchaimani original https://aws.amazon.com/blogs/big-data/introducing-amazon-msk-as-a-source-for-amazon-opensearch-ingestion/

Ingesting a high volume of streaming data has been a defining characteristic of operational analytics workloads with Amazon OpenSearch Service. Many of these workloads involve either self-managed Apache Kafka or Amazon Managed Streaming for Apache Kafka (Amazon MSK) to satisfy their data streaming needs. Consuming data from Amazon MSK and writing to OpenSearch Service has been a challenge for customers. AWS Lambda, custom code, Kafka Connect, and Logstash have been used for ingesting this data. These methods involve tools that must be built and maintained. In this post, we introduce Amazon MSK as a source to Amazon OpenSearch Ingestion, a serverless, fully managed, real-time data collector for OpenSearch Service that makes this ingestion even easier.

Solution overview

The following diagram shows the flow from data sources to Amazon OpenSearch Service.

The flow contains the following steps:

  1. Data sources produce data and send it to Amazon MSK.
  2. OpenSearch Ingestion consumes the data from Amazon MSK.
  3. OpenSearch Ingestion transforms, enriches, and writes the data into OpenSearch Service.
  4. Users search, explore, and analyze the data with OpenSearch Dashboards.

Prerequisites

You will need a provisioned MSK cluster created with appropriate data sources. The sources, as producers, write data into Amazon MSK. The cluster should be created with the appropriate Availability Zone, storage, compute, security, and other configurations to suit your workload needs. To provision your MSK cluster and have your sources producing data, see Getting started using Amazon MSK.

As of this writing, OpenSearch Ingestion supports Amazon MSK provisioned clusters, but not Amazon MSK Serverless. The OpenSearch Ingestion pipeline can reside in the same AWS account as your MSK cluster or in a different account. OpenSearch Ingestion uses AWS PrivateLink to read data, so you must turn on multi-VPC connectivity on your MSK cluster. For more information, see Amazon MSK multi-VPC private connectivity in a single Region. OpenSearch Ingestion can write data to Amazon Simple Storage Service (Amazon S3), provisioned OpenSearch Service domains, and Amazon OpenSearch Serverless collections. In this solution, we use a provisioned OpenSearch Service domain as the sink. Refer to Getting started with Amazon OpenSearch Service to create a provisioned OpenSearch Service domain. You will need appropriate permissions to read data from Amazon MSK and write data to OpenSearch Service. The following sections outline the required permissions.

Permissions required

To read from Amazon MSK and write to Amazon OpenSearch Service, you need to create an AWS Identity and Access Management (IAM) role that Amazon OpenSearch Ingestion uses. In this post, we use a role called PipelineRole for this purpose. To create this role, see Creating IAM roles.

Reading from Amazon MSK

OpenSearch Ingestion needs permission to create a PrivateLink connection, among other actions it performs on your MSK cluster. Edit your MSK cluster policy to include the following snippet with the appropriate permissions. If your OpenSearch Ingestion pipeline resides in a different account than your MSK cluster, you will need the second statement to allow the pipeline’s role. Use the correct ARN formats for the cluster, topic, and group permissions, and remove the comments from the policy before using it.

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": {
        "Service": "osis-pipelines.aws.internal"
      },
      "Action": [
        "kafka:CreateVpcConnection",
        "kafka:GetBootstrapBrokers",
        "kafka:DescribeCluster"
      ],
      # Change this to your msk arn
      "Resource": "arn:aws:kafka:us-east-1:XXXXXXXXXXXX:cluster/test-cluster/xxxxxxxx-xxxx-xx"
    },    
    ### The following statement is required if the MSK cluster is in a different account than the OSI pipeline
    {
      "Effect": "Allow",
      "Principal": {
        # Change this to the sts role arn used in the pipeline
        "AWS": "arn:aws:iam::XXXXXXXXXXXX:role/PipelineRole"
      },
      "Action": [
        "kafka-cluster:*",
        "kafka:*"
      ],
      "Resource": [
        # Change this to your msk arn
        "arn:aws:kafka:us-east-1:XXXXXXXXXXXX:cluster/test-cluster/xxxxxxxx-xxxx-xx",
        # Change this as per your cluster name & kafka topic name
        "arn:aws:kafka:us-east-1:XXXXXXXXXXXX:topic/test-cluster/xxxxxxxx-xxxx-xx/*",
        # Change this as per your cluster name
        "arn:aws:kafka:us-east-1:XXXXXXXXXXXX:group/test-cluster/*"
      ]
    }
  ]
}

Edit the pipeline role’s inline policy to include the following permissions. Ensure that you have removed the comments before using the policy.

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
                "kafka-cluster:Connect",
                "kafka-cluster:AlterCluster",
                "kafka-cluster:DescribeCluster",
                "kafka:DescribeClusterV2",
                "kafka:GetBootstrapBrokers"
            ],
            "Resource": [
                # Change this to your msk arn
                "arn:aws:kafka:us-east-1:XXXXXXXXXXXX:cluster/test-cluster/xxxxxxxx-xxxx-xx"
            ]
        },
        {
            "Effect": "Allow",
            "Action": [
                "kafka-cluster:*Topic*",
                "kafka-cluster:ReadData"
            ],
            "Resource": [
                # Change this to your kafka topic and cluster name
                "arn:aws:kafka:us-east-1: XXXXXXXXXXXX:topic/test-cluster/xxxxxxxx-xxxx-xx/topic-to-consume"
            ]
        },
        {
            "Effect": "Allow",
            "Action": [
                "kafka-cluster:AlterGroup",
                "kafka-cluster:DescribeGroup"
            ],
            "Resource": [
                # Change this as per your cluster name
                "arn:aws:kafka:us-east-1:XXXXXXXXXXXX:group/test-cluster/*"
            ]
        }
    ]
}

Writing to OpenSearch Service

In this section, you give the pipeline role the necessary permissions to write to OpenSearch Service. As a best practice, we recommend using fine-grained access control in OpenSearch Service. Use OpenSearch Dashboards to map the pipeline role to an appropriate backend role. For more information on mapping roles to users, see Managing permissions. For example, all_access is a built-in role that grants administrative permission for all OpenSearch functions. When deploying to a production environment, ensure that you use a role with just enough permissions to write to your OpenSearch domain.
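For example, if fine-grained access control is enabled, the following sketch maps the pipeline role to the all_access backend role through the Security plugin REST API. The account ID is a placeholder, and note that PUT replaces any existing mapping for the role:

PUT _plugins/_security/api/rolesmapping/all_access
{
  "backend_roles": [
    "arn:aws:iam::XXXXXXXXXXXX:role/PipelineRole"
  ]
}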

Creating OpenSearch Ingestion pipelines

The pipeline role now has the correct set of permissions to read from Amazon MSK and write to OpenSearch Service. Navigate to the OpenSearch Service console, choose Pipelines, then choose Create pipeline.

Choose a suitable name for the pipeline and set the pipeline capacity with appropriate minimum and maximum OpenSearch Compute Units (OCUs). Then choose ‘AWS-MSKPipeline’ from the dropdown menu, as shown below.

Use the provided template to fill in all the required fields. The snippet in the following section shows the fields that you need to fill in.

Configuring Amazon MSK source

The following sample configuration snippet shows every setting you need to get the pipeline running:

msk-pipeline: 
  source: 
    kafka: 
      acknowledgments: true                     # Default is false  
      topics: 
         - name: "<topic name>" 
           group_id: "<consumer group id>" 
           serde_format: json                   # Remove, if Schema Registry is used. (Other option is plaintext)  
 
           # Below defaults can be tuned as needed 
           # fetch_max_bytes: 52428800          Optional 
           # fetch_max_wait: 500                Optional (in msecs) 
           # fetch_min_bytes: 1                 Optional (in bytes) 
           # max_partition_fetch_bytes: 1048576 Optional 
           # consumer_max_poll_records: 500     Optional                                
           # auto_offset_reset: "earliest"      Optional (other option is "earliest") 
           # key_mode: include_as_field         Optional (other options are include_as_field, discard)  
 
       
           serde_format: json                   # Remove, if Schema Registry is used. (Other option is plaintext)   
 
      # Enable this configuration if Glue schema registry is used            
      # schema:                                 
      #   type: aws_glue 
 
      aws: 
        # Provide the Role ARN with access to MSK. This role should have a trust relationship with osis-pipelines.amazonaws.com 
        # sts_role_arn: "arn:aws:iam::XXXXXXXXXXXX:role/Example-Role" 
        # Provide the region of the domain. 
        # region: "us-west-2" 
        msk: 
          # Provide the MSK ARN.  
          arn: "arn:aws:kafka:us-west-2:XXXXXXXXXXXX:cluster/msk-prov-1/id" 
 
  sink: 
      - opensearch: 
          # Provide an AWS OpenSearch Service domain endpoint 
          # hosts: [ "https://search-mydomain-1a2a3a4a5a6a7a8a9a0a9a8a7a.us-east-1.es.amazonaws.com" ] 
          aws: 
            # Provide a Role ARN with access to the domain. This role should have a trust relationship with osis-pipelines.amazonaws.com 
            # sts_role_arn: "arn:aws:iam::XXXXXXXXXXXX:role/Example-Role" 
            # Provide the region of the domain. 
            # region: "us-east-1" 
            # Enable the 'serverless' flag if the sink is an Amazon OpenSearch Serverless collection 
            # serverless: true 
          # index name can be auto-generated from topic name 
          index: "index_${getMetadata(\"kafka_topic\")}-%{yyyy.MM.dd}" 
          # Enable 'distribution_version' setting if the AWS OpenSearch Service domain is of version Elasticsearch 6.x 
          # distribution_version: "es6" 
          # Enable the S3 DLQ to capture any failed requests in an S3 bucket 
          # dlq: 
            # s3: 
            # Provide an S3 bucket 

We use the following parameters:

  • acknowledgments – Set to true so that OpenSearch Ingestion ensures data is delivered to the sinks before committing the offsets in Amazon MSK. The default value is false.
  • name – The topic OpenSearch Ingestion reads from. You can read a maximum of four topics per pipeline.
  • group_id – The consumer group the pipeline belongs to. With this setting, a single consumer group can be scaled across as many pipelines as needed for very high throughput.
  • serde_format – Specifies the deserialization method for the data read from Amazon MSK. The options are json and plaintext.
  • AWS sts_role_arn and OpenSearch sts_role_arn – Specify the role OpenSearch Ingestion uses for reading and writing. Specify the ARN of the role you created in the previous section. OpenSearch Ingestion currently uses the same role for both.
  • MSK arn – Specifies the MSK cluster to consume data from.
  • OpenSearch host and index – Specify the OpenSearch domain URL and the index to write to.
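Putting these parameters together, a trimmed, filled-in version of the template might look like the following sketch. The topic name, consumer group ID, account IDs, and endpoint are placeholders to replace with your own values:

msk-pipeline:
  source:
    kafka:
      acknowledgments: true
      topics:
        - name: "application-logs"
          group_id: "osi-consumer-group"
          serde_format: json
      aws:
        sts_role_arn: "arn:aws:iam::XXXXXXXXXXXX:role/PipelineRole"
        region: "us-east-1"
        msk:
          arn: "arn:aws:kafka:us-east-1:XXXXXXXXXXXX:cluster/test-cluster/xxxxxxxx-xxxx-xx"
  sink:
    - opensearch:
        hosts: ["https://search-mydomain-1a2a3a4a5a6a7a8a9a0a.us-east-1.es.amazonaws.com"]
        index: "index_${getMetadata(\"kafka_topic\")}-%{yyyy.MM.dd}"
        aws:
          sts_role_arn: "arn:aws:iam::XXXXXXXXXXXX:role/PipelineRole"
          region: "us-east-1"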

When you have configured the Kafka source, choose the network access type and log publishing options. Public pipelines do not use PrivateLink and do not incur PrivateLink-associated costs. Choose Next and review all configurations. When you are satisfied, choose Create pipeline.

Log in to OpenSearch Dashboards to see your indexes and search the data.

Recommended compute units (OCUs) for the MSK pipeline

Each compute unit has one consumer per topic. Brokers balance partitions among these consumers for a given topic. However, when the number of partitions is greater than the number of consumers, Amazon MSK hosts multiple partitions on every consumer. OpenSearch Ingestion has built-in auto scaling to scale up or down based on CPU usage or the number of pending records in the pipeline. For optimal performance, partitions should be distributed across many compute units for parallel processing. If topics have a large number of partitions (for example, more than 96, the maximum OCUs per pipeline), we recommend configuring a pipeline with 1–96 OCUs, because it will auto scale as needed. If a topic has a low number of partitions (for example, less than 96), keep the maximum compute units equal to the number of partitions. When a pipeline has more than one topic, pick the topic with the highest number of partitions as the reference for configuring maximum compute units. By adding another pipeline with a new set of OCUs to the same topic and consumer group, you can scale throughput almost linearly.
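As a quick worked illustration of this guidance (the partition counts are hypothetical):

topic with 24 partitions: set maximum OCUs to 24, so each consumer owns one partition
topic with 96 partitions: set maximum OCUs to 96
topic with 200 partitions: configure 1–96 OCUs; at full scale, each consumer handles 2–3 partitions
two topics of 12 and 48 partitions in one pipeline: size by the larger topic, so maximum OCUs = 48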

Clean up

To avoid future charges, clean up any unused resources from your AWS account.

Conclusion

In this post, you saw how to use Amazon MSK as a source for OpenSearch Ingestion. This not only simplifies data consumption from Amazon MSK, but also relieves you of the burden of self-managing and manually scaling consumers for varying and unpredictable high-speed streaming operational analytics data. Refer to the sources list under the supported plugins section for an exhaustive list of sources from which you can ingest data.


About the authors

Muthu Pitchaimani is a Search Specialist with Amazon OpenSearch Service. He builds large-scale search applications and solutions. Muthu is interested in the topics of networking and security, and is based out of Austin, Texas.

Arjun Nambiar is a Product Manager with Amazon OpenSearch Service. He focuses on ingestion technologies that enable ingesting data from a wide variety of sources into Amazon OpenSearch Service at scale. Arjun is interested in large-scale distributed systems and cloud-native technologies and is based out of Seattle, Washington.

Raj Sharma is a Sr. SDM with Amazon OpenSearch Service. He builds large-scale distributed applications and solutions. Raj is interested in the topics of analytics, databases, networking, and security, and is based out of Palo Alto, California.

Deploy Amazon OpenSearch Serverless with Terraform

Post Syndicated from Joshua Luo original https://aws.amazon.com/blogs/big-data/deploy-amazon-opensearch-serverless-with-terraform/

Amazon OpenSearch Serverless provides the search and analytical functionality of OpenSearch without the manual overhead of configuring, managing, and scaling OpenSearch clusters. It automatically scales the resources based on your workload, and you only pay for the resources consumed. Managing OpenSearch Serverless is simple, but with infrastructure as code (IaC) software like Terraform, you can simplify your resource management even more.

This post demonstrates how to use Terraform to create, deploy, and clean up OpenSearch Serverless infrastructure.

Solution overview

To create and deploy an OpenSearch Serverless collection with security and access policies using Terraform, you need to follow these steps:

  1. Initialize the Terraform configuration.
  2. Create an encryption policy.
  3. Create an OpenSearch Serverless collection.
  4. Create a network policy.
  5. Create a virtual private cloud (VPC) endpoint.
  6. Create a data access policy.
  7. Deploy using Terraform.

Prerequisites

This post assumes that you’re familiar with GitHub and Git commands.

For this walkthrough, you need the Terraform CLI installed and an AWS account with credentials configured for the AWS provider.

Initialize the Terraform configuration

The sample code is available in the Terraform GitHub repo in the terraform-provider-aws/examples/opensearchserverless directory. This configuration will get you started with OpenSearch Serverless. First, clone the repository to your workstation and navigate to the directory:

$ git clone https://github.com/hashicorp/terraform-provider-aws.git && \ 
cd ./terraform-provider-aws/examples/opensearchserverless

Initialize the configuration to install the aws provider by running the following command:

$ terraform init

The Terraform configuration first defines the version of Terraform required and configures the AWS provider to launch resources in the Region defined by the aws_region variable:

# Require Terraform 0.12 or greater
terraform {
  required_version = ">= 0.12"
}

# Set AWS provider region
provider "aws" {
  region = var.aws_region
}

The variables used in this Terraform configuration are defined in the variables.tf file. This post assumes the default values are used:

variable "aws_region" {
  description = "The AWS region to create things in."
  default     = "us-east-1"
}

variable "collection_name" {
  description = "Name of the OpenSearch Serverless collection."
  default     = "example-collection"
}
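If you want different values, you don’t have to edit variables.tf; you can override the defaults on the command line instead. For example (the values here are illustrative):

$ terraform apply -var="aws_region=us-west-2" -var="collection_name=my-collection"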

Create an encryption policy

Now that the provider is installed and configured, the Terraform configuration moves on to defining OpenSearch Serverless policies for security. OpenSearch Serverless uses AWS Key Management Service (AWS KMS) to encrypt your data. The encryption is managed by an encryption policy. To create an encryption policy, use the aws_opensearchserverless_security_policy resource, which has a name parameter, a type of encryption, a JSON string that defines the policy, and an optional description:

# Creates an encryption security policy
resource "aws_opensearchserverless_security_policy" "encryption_policy" {
  name        = "example-encryption-policy"
  type        = "encryption"
  description = "encryption policy for ${var.collection_name}"
  policy = jsonencode({
    Rules = [
      {
        Resource = [
          "collection/${var.collection_name}"
        ],
        ResourceType = "collection"
      }
    ],
    AWSOwnedKey = true
  })
}

This encryption policy is named example-encryption-policy, applies to a collection named example-collection, and uses an AWS owned key to encrypt the data.

Create an OpenSearch Serverless collection

You can organize your OpenSearch indexes into a logical grouping called a collection. Create a collection using the aws_opensearchserverless_collection resource, which has a name parameter, and optionally, description, tags, and type:

# Creates a collection
resource "aws_opensearchserverless_collection" "collection" {
  name = var.collection_name

  depends_on = [aws_opensearchserverless_security_policy.encryption_policy]
}

This collection is named example-collection. If type is not specified, a time series collection is created. Supported collection types can be found in the Terraform documentation for the aws_opensearchserverless_collection resource. OpenSearch Serverless requires encryption at rest, so an applicable encryption policy is required before a collection can be created. The Terraform configuration explicitly defines this dependency using the depends_on meta-argument. Errors can arise if this dependency is not defined.
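The example relies on the default collection type. If your workload is full-text search instead, a sketch of an explicitly typed collection might look like the following; the name is illustrative, and the collection needs an encryption policy that covers its own name before it can be created:

# Hypothetical search-type collection
resource "aws_opensearchserverless_collection" "search_collection" {
  name = "example-search"
  type = "SEARCH"

  # An encryption policy whose rules cover "collection/example-search"
  # must exist before this collection is created
  depends_on = [aws_opensearchserverless_security_policy.encryption_policy]
}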

Now that a collection has been created with an AWS owned KMS key, the Terraform configuration goes on to define the network and data access policy to configure access to the collection.

Create a network policy

A network policy allows access to your collection either over the public internet or through OpenSearch Serverless-managed VPC endpoints. Similar to the encryption policy, to create a network policy, use the aws_opensearchserverless_security_policy resource, which has a name parameter, a type of network, a JSON string that defines the policy, and an optional description:

# Creates a network security policy
resource "aws_opensearchserverless_security_policy" "network_policy" {
  name        = "example-network-policy"
  type        = "network"
  description = "public access for dashboard, VPC access for collection endpoint"
  policy = jsonencode([
    {
      Description = "VPC access for collection endpoint",
      Rules = [
        {
          ResourceType = "collection",
          Resource = [
            "collection/${var.collection_name}"
          ]
        }
      ],
      AllowFromPublic = false,
      SourceVPCEs = [
        aws_opensearchserverless_vpc_endpoint.vpc_endpoint.id
      ]
    },
    {
      Description = "Public access for dashboards",
      Rules = [
        {
          ResourceType = "dashboard"
          Resource = [
            "collection/${var.collection_name}"
          ]
        }
      ],
      AllowFromPublic = true
    }
  ])
}

This network policy is named example-network-policy and applies to the collection named example-collection. This policy only allows access to the collection’s OpenSearch endpoint through a VPC endpoint, but allows public access to the OpenSearch Dashboards endpoint.

You’ll notice the VPC endpoint has not been defined yet, but it is referenced in the network policy. Terraform determines this dependency automatically and will not create the network policy until the VPC endpoint has been created.

Create a VPC endpoint

A VPC endpoint enables you to privately access your OpenSearch Serverless collection using AWS PrivateLink (for more information, refer to Access AWS services through AWS PrivateLink). Create a VPC endpoint using the aws_opensearchserverless_vpc_endpoint resource, where you define name, vpc_id, subnet_ids, and optionally, security_group_ids:

# Creates a VPC endpoint
resource "aws_opensearchserverless_vpc_endpoint" "vpc_endpoint" {
  name               = "example-vpc-endpoint"
  vpc_id             = aws_vpc.vpc.id
  subnet_ids         = [aws_subnet.subnet.id]
  security_group_ids = [aws_security_group.security_group.id]
}

Creating a VPC and all the required networking resources is out of scope for this post, but the minimum required VPC resources are created here in a separate file to demonstrate the VPC endpoint functionality. Refer to Getting Started with Amazon VPC to learn more.
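For reference, a minimal sketch of those three referenced resources (aws_vpc.vpc, aws_subnet.subnet, and aws_security_group.security_group) might look like the following; the CIDR blocks are illustrative, and the ingress rule should be narrowed for real workloads:

# Minimal VPC, subnet, and security group for the VPC endpoint (illustrative)
resource "aws_vpc" "vpc" {
  cidr_block = "10.0.0.0/16"
}

resource "aws_subnet" "subnet" {
  vpc_id     = aws_vpc.vpc.id
  cidr_block = "10.0.1.0/24"
}

resource "aws_security_group" "security_group" {
  vpc_id = aws_vpc.vpc.id

  # Allow HTTPS from inside the VPC to reach the collection endpoint
  ingress {
    from_port   = 443
    to_port     = 443
    protocol    = "tcp"
    cidr_blocks = [aws_vpc.vpc.cidr_block]
  }
}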

Create a data access policy

The configuration defines a data source that looks up information about the context Terraform is currently running in. This data source is used when defining the data access policy. More information can be found in the Terraform documentation for aws_caller_identity.

# Gets access to the effective Account ID in which Terraform is authorized
data "aws_caller_identity" "current" {}

A data access policy allows you to define who has access to collections and indexes. The data access policy is defined using the aws_opensearchserverless_access_policy resource, which has a name parameter, a type parameter set to data, a JSON string that defines the policy, and an optional description:

# Creates a data access policy
resource "aws_opensearchserverless_access_policy" "data_access_policy" {
  name        = "example-data-access-policy"
  type        = "data"
  description = "allow index and collection access"
  policy = jsonencode([
    {
      Rules = [
        {
          ResourceType = "index",
          Resource = [
            "index/${var.collection_name}/*"
          ],
          Permission = [
            "aoss:*"
          ]
        },
        {
          ResourceType = "collection",
          Resource = [
            "collection/${var.collection_name}"
          ],
          Permission = [
            "aoss:*"
          ]
        }
      ],
      Principal = [
        data.aws_caller_identity.current.arn
      ]
    }
  ])
}

This data access policy allows the current AWS role or user to perform collection-related actions on the collection named example-collection and index-related actions on the indexes in the collection.

Deploy using Terraform

Now that you have configured the necessary resources, apply the configuration using terraform apply. Before creating the resources, Terraform will describe all the resources that will be created so you can verify your configuration:

$ terraform apply

...

Terraform used the selected providers to generate the following execution plan. Resource actions are indicated with the following symbols:
  + create

Terraform will perform the following actions:

...

Plan: 13 to add, 0 to change, 0 to destroy.

Changes to Outputs:
  + collection_enpdoint = (known after apply)
  + dashboard_endpoint  = (known after apply)

Do you want to perform these actions?
  Terraform will perform the actions described above.
  Only 'yes' will be accepted to approve.

  Enter a value:

Answer yes to proceed.

If this is the first OpenSearch Serverless collection in your account, applying the configuration may take over 10 minutes because Terraform waits for the collection to become active.

Apply complete! Resources: 13 added, 0 changed, 0 destroyed.

Outputs:

collection_enpdoint = "..."
dashboard_endpoint = "..."

You have now deployed an OpenSearch Serverless time series collection with policies to configure encryption and access to the collection!

Clean up

The resources will incur costs as long as they are running, so clean up the resources when you are done using them. Use the terraform destroy command to do this:

$ terraform destroy

...

Terraform used the selected providers to generate the following execution plan. Resource actions are indicated with the following symbols:
  - destroy

Terraform will perform the following actions:

  ...

Plan: 0 to add, 0 to change, 13 to destroy.

Changes to Outputs:

...

Do you really want to destroy all resources?
  Terraform will destroy all your managed infrastructure, as shown above.
  There is no undo. Only 'yes' will be accepted to confirm.

  Enter a value:

Answer yes to run this plan and destroy the infrastructure.

Destroy complete! Resources: 13 destroyed.

All resources created during this walkthrough have now been deleted.

Conclusion

In this post, you created an OpenSearch Serverless collection. Using IaC software like Terraform can make it simple to manage your resources like OpenSearch Serverless collections, encryption, network, and data access policies, and VPC endpoints.

Thank you to all the open-source contributors who help maintain OpenSearch and Terraform.

Try using OpenSearch Serverless with Terraform to simplify your resource management. Check out the Getting started with Amazon OpenSearch Serverless workshop and the Amazon OpenSearch Serverless Developer Guide to learn more about OpenSearch Serverless.


About the authors

Joshua Luo is a Software Development Engineer for Amazon OpenSearch Serverless. He works on the systems that enable customers to manage and monitor their OpenSearch Serverless resources. He enjoys bouldering, photography, and videography in his free time.

Satish Nandi is a Senior Technical Product Manager for Amazon OpenSearch Service.

Monitoring Amazon OpenSearch Serverless using AWS User Notifications

Post Syndicated from Raj Ramasubbu original https://aws.amazon.com/blogs/big-data/monitoring-amazon-opensearch-serverless-using-aws-user-notifications/

Amazon OpenSearch Serverless is a serverless deployment option for Amazon OpenSearch Service that makes it simple for you to run search and analytics workloads without having to think about infrastructure management. The compute capacity used for data ingestion, search, and query in OpenSearch Serverless is measured in OpenSearch Compute Units (OCUs). Customers can configure maximum OCU limits in their AWS account to control costs. In the past, customers had to monitor resource usage metrics to make sure that collections did not deplete their configured storage and compute capacity. With the new AWS User Notifications integration, you can configure the system to send notifications whenever a capacity threshold is breached, eliminating the need to monitor the service constantly. AWS User Notifications enables you to centrally set up and view notifications from various AWS services across accounts and Regions in a human-friendly format. You can view notifications in a console Notifications Center and also configure various delivery channels.

You can now use AWS User Notifications to set up delivery channels to get notified when resource usage nears or exceeds the capacity threshold. You receive a notification when an event matches a rule that you specify. There are multiple channels for receiving notifications, including email, AWS Chatbot chat notifications, and AWS Console Mobile Application push notifications. In this post, you will see how you can use AWS User Notifications to receive notifications for OCU threshold breaches across all your OpenSearch Serverless collections.

Solution overview

OpenSearch Serverless allows you to receive OCU utilization notifications for both search and indexing in the two scenarios listed below. OCU utilization percent is calculated based on your configured maximum capacity limit and the current OCU consumption.

  • OCU Utilization Approaching Max Limit – OpenSearch Serverless sends this event through AWS User Notifications when current OCU usage percent reaches greater than or equal to 75 percent of the configured maximum OCU capacity.
  • OCU Utilization Reached Max Limit – OpenSearch Serverless sends this event when OCU usage percent reaches 100 percent of the configured maximum OCU capacity.
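To make these thresholds concrete, suppose a collection’s maximum search capacity is set to 40 OCUs (a hypothetical value):

75% of 40 OCUs = 30 OCUs: the OCU Utilization Approaching Max Limit event fires
100% of 40 OCUs = 40 OCUs: the OCU Utilization Reached Max Limit event fires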

If you receive an OCU utilization notification, you can adjust the maximum capacity limit using either the console or the AWS CLI.
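For example, the following AWS CLI sketch raises the account-level OCU limits that these notifications measure against; the values are illustrative, and the command assumes the OpenSearch Serverless update-account-settings API:

aws opensearchserverless update-account-settings \
  --capacity-limits '{"maxIndexingCapacityInOCU": 40, "maxSearchCapacityInOCU": 40}'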

The following sections detail the steps you can take to receive both these OCU utilization notifications for all collections.

Prerequisites

As a prerequisite to receiving OCU utilization notifications, set up a notification configuration in notification hubs to store the notification data. This has to be done only once, and at least one AWS Region must be selected. The following screenshot shows a sample notification hub configuration for the US East (Ohio) Region.

Also, configure the maximum OCU capacity for all collections according to your requirements.

Set up OCU utilization notifications

To set up the notifications, complete the following steps:

  1. On the AWS User Notifications console, create a notification configuration to receive notifications about OCU utilization.
  2. Choose Amazon OpenSearch Serverless for the service name and choose OCU Utilization Approaching Max Limit for the event type.
  3. Choose Add another event rule and choose OCU Utilization Reached Max Limit for the event type.
  4. With AWS User Notifications, you can receive notifications as soon as they occur, or aggregated within 5 minutes to avoid receiving too many at once. We recommend the aggregated option: choose Receive within 5 minutes (recommended).
  5. You can opt to receive notifications through many delivery channels like the AWS Console Mobile Application and chat channels like Slack. For this blog post, to keep it simple, you can add your email address as the delivery channel, as shown in the following image.
  6. After you complete the configuration, your notification configurations page should look like the following screenshot.
  7. Once the OCU consumption for any of your collections reaches greater than or equal to 75 percent of the configured maximum OCU capacity, you will get a notification to your configured email address within 5 minutes, as shown in the following image.
  8. Once the OCU consumption reaches 100 percent of allocated OCU capacity for any of your collections, you will get a notification to your configured email address within 5 minutes, as shown in the following image.
  9. You can also see notifications in the AWS User Notifications console, as shown in the following screenshots.

Summary

You can use AWS User Notifications to get notifications from various AWS services in one place. Now with Amazon OpenSearch Serverless integration, you can receive OCU utilization notifications as well. In this post, we explored how to enable notifications for all Amazon OpenSearch Serverless collections and receive notifications using an email delivery channel. If you have any questions or suggestions, please write to us in the comments section.


About the Author

Raj Ramasubbu is a Senior Analytics Specialist Solutions Architect focused on big data and analytics and AI/ML with Amazon Web Services. He helps customers architect and build highly scalable, performant, and secure cloud-based solutions on AWS. Raj provided technical expertise and leadership in building data engineering, big data analytics, business intelligence, and data science solutions for over 18 years prior to joining AWS. He helped customers in various industry verticals like healthcare, medical devices, life science, retail, asset management, car insurance, residential REIT, agriculture, title insurance, supply chain, document management, and real estate.

Generate security insights from Amazon Security Lake data using Amazon OpenSearch Ingestion

Post Syndicated from Muthu Pitchaimani original https://aws.amazon.com/blogs/big-data/generate-security-insights-from-amazon-security-lake-data-using-amazon-opensearch-ingestion/

Amazon Security Lake centralizes access and management of your security data by aggregating security event logs from AWS environments, other cloud providers, on-premises infrastructure, and other software as a service (SaaS) solutions. By converting logs and events using the Open Cybersecurity Schema Framework (OCSF), an open standard for storing security events in a common and shareable format, Security Lake optimizes and normalizes your security data for analysis using your preferred analytics tool.

Amazon OpenSearch Service continues to be a tool of choice for many enterprises searching and analyzing large volumes of security data. In this post, we show you how to ingest and query Amazon Security Lake data with Amazon OpenSearch Ingestion, a serverless, fully managed data collector with configurable ingestion pipelines. By using OpenSearch Ingestion to ingest data into your OpenSearch Service cluster, you can derive insights more quickly for time-sensitive security investigations and respond swiftly to security incidents, helping you protect your business-critical data and systems.

Solution overview

The following architecture outlines the flow of data from Security Lake to OpenSearch Service.

The workflow contains the following steps:

  1. Security Lake persists OCSF schema normalized data in an Amazon Simple Storage Service (Amazon S3) bucket determined by the administrator.
  2. Security Lake notifies subscribers through the chosen subscription method, in this case Amazon Simple Queue Service (Amazon SQS).
  3. OpenSearch Ingestion registers as a subscriber to get the necessary context information.
  4. OpenSearch Ingestion reads Parquet formatted security data from the Security Lake managed Amazon S3 bucket and transforms the security logs into JSON documents.
  5. OpenSearch Ingestion ingests this OCSF compliant data into OpenSearch Service.
  6. You download and import the provided dashboards to analyze and gain quick insights into the security data.

OpenSearch Ingestion provides a serverless ingestion framework to easily ingest Security Lake data into OpenSearch Service with just a few clicks.

Prerequisites

Complete the following prerequisite steps:

  1. Create an Amazon OpenSearch Service domain. For instructions, refer to Creating and managing Amazon OpenSearch Service domains.
  2. You must have access to the AWS account in which you wish to set up this solution.

Set up Amazon Security Lake

In this section, we present the steps to set up Amazon Security Lake, which includes enabling the service and creating a subscriber.

Enable Amazon Security Lake

Identify the account in which you want to activate Amazon Security Lake. Note that for accounts that are part of an organization, you must designate a delegated Security Lake administrator from your management account. For instructions, refer to Managing multiple accounts with AWS Organizations.

  1. Sign in to the AWS Management Console using the credentials of the delegated account.
  2. On the Amazon Security Lake console, choose your preferred Region, then choose Get started.

Amazon Security Lake collects log and event data from a variety of sources and across your AWS accounts and Regions.

Now you’re ready to enable Amazon Security Lake.

  1. You can either select All log and event sources or choose specific logs by selecting Specific log and event sources.
  2. Data is ingested from all Regions. The recommendation is to select All supported Regions so activities are logged even for accounts that you might not use frequently. However, you also have the option to select Specific Regions.
  3. For Select accounts, you can select the accounts in which you want Amazon Security Lake enabled. For this post, we select All accounts.

  4. You’re prompted to either create a new AWS Identity and Access Management (IAM) role or use an existing IAM role. This gives Amazon Security Lake the required permissions to collect the logs and events. Choose the option appropriate for your situation.
  5. Choose Next.
  6. Optionally, specify the Amazon S3 storage class for the data in Amazon Security Lake. For more information, refer to Lifecycle management in Security Lake.
  7. Choose Next.
  8. Review the details and create the data lake.

Create an Amazon Security Lake subscriber

To access and consume data in your Security Lake managed Amazon S3 buckets, you must set up a subscriber.

Complete the following steps to create your subscriber:

  1. On the Amazon Security Lake console, choose Summary in the navigation pane.

Here, you can see the number of Regions selected.

  2. Choose Create subscriber.

A subscriber consumes logs and events from Amazon Security Lake. In this case, the subscriber is OpenSearch Ingestion, which consumes security data and ingests it into OpenSearch Service.

  3. For Subscriber name, enter OpenSearchIngestion.
  4. Enter a description.
  5. Region is automatically populated based on the currently selected Region.
  6. For Log and event sources, select whether the subscriber is authorized to consume all log and event sources or specific log and event sources.
  7. For Data access method, select S3.
  8. For Subscriber credentials, enter the subscriber’s <AWS account ID> and OpenSearchIngestion-<AWS account ID>.
  9. For Notification details, select SQS queue.

This prompts Amazon Security Lake to create an SQS queue that the subscriber can poll for object notifications.

  10. Choose Create.

Install templates and dashboards for Amazon Security Lake data

Your subscriber for OpenSearch Ingestion is now ready. Before you configure OpenSearch Ingestion to process the security data, let’s configure an OpenSearch sink (destination to write data) with index templates and dashboards.

Index templates are predefined mappings for security data that select the correct OpenSearch field types for the corresponding OCSF schema definitions. In addition, index templates contain index-specific settings for particular index patterns. OCSF classifies security data into different categories such as system activity, findings, identity and access management, network activity, application activity, and discovery.
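As a simplified illustration of the shape of these templates (the complete versions are available for download in the steps below), an index template for the VPC Flow Logs pattern might look like the following sketch; the template name, shard count, and fields shown are illustrative:

PUT _index_template/ocsf-4001-example
{
  "index_patterns": ["ocsf-4001*"],
  "template": {
    "settings": {
      "number_of_shards": 2
    },
    "mappings": {
      "properties": {
        "class_uid": { "type": "integer" },
        "time": { "type": "date" }
      }
    }
  }
}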

Amazon Security Lake publishes events from four different AWS sources: AWS CloudTrail (with subsets for AWS Lambda and Amazon S3), Amazon Virtual Private Cloud (Amazon VPC) Flow Logs, Amazon Route 53, and AWS Security Hub. The following table details the event sources and their corresponding OCSF categories and OpenSearch index templates.

Amazon Security Lake Source OCSF Category ID OpenSearch Index Pattern
CloudTrail (Lambda and Amazon S3 API subsets) 3005 ocsf-3005*
VPC Flow Logs 4001 ocsf-4001*
Route 53 4003 ocsf-4003*
Security Hub 2001 ocsf-2001*

To easily identify OpenSearch indexes containing Security Lake data, we recommend following a structured index naming pattern that includes the log category and its OCSF-defined class in the name of the index. An example is provided below.

ocsf-cuid-${/class_uid}-${/metadata/product/name}-${/class_name}-%{yyyy.MM.dd}

Complete the following steps to install the index templates and dashboards for your data:

  1. Download the component_templates.zip and index_templates.zip files and unzip them on your local device.

Component templates are composable modules with settings, mappings, and aliases that can be shared and used by index templates.

  2. Upload the component templates before the index templates. For example, the following Linux command line shows how to use the OpenSearch _component_template API to upload to your OpenSearch Service domain (change the domain URL and the credentials to appropriate values for your environment):
    ls component_templates | awk -F'_body' '{print $1}' | xargs -I{} curl  -u adminuser:password -X PUT -H 'Content-Type: application/json' -d @component_templates/{}_body.json https://my-opensearch-domain.es.amazonaws.com/_component_template/{}

  3. Once the component templates are successfully uploaded, proceed to upload the index templates:
    ls index_templates | awk -F'_body' '{print $1}' | xargs -I{} curl  -uadminuser:password -X PUT -H 'Content-Type: application/json' -d @index_templates/{}_body.json https://my-opensearch-domain.es.amazonaws.com/_index_template/{}

  4. Verify that the index templates and component templates were uploaded successfully by navigating to OpenSearch Dashboards: choose the hamburger menu, then choose Index Management.

  5. In the navigation pane, choose Templates to see all the OCSF index templates.

  6. Choose Component templates to verify the OCSF component templates.

  7. After successfully uploading the templates, download the pre-built dashboards and other components required to visualize the Security Lake data in OpenSearch indexes.
  8. To upload these to OpenSearch Dashboards, choose the hamburger menu, and under Management, choose Stack Management.
  9. In the navigation pane, choose Saved Objects.

  10. Choose Import.

  11. Choose Import, navigate to the downloaded file, then choose Import.

  12. Confirm the dashboard objects are imported correctly, then choose Done.

All the necessary index and component templates, index patterns, visualizations, and dashboards are now successfully installed.

Configure OpenSearch Ingestion

Each OpenSearch Ingestion pipeline has a single data source with one or more sub-pipelines, processors, and a sink. In our solution, the Security Lake managed Amazon S3 bucket is the source and your OpenSearch Service cluster is the sink. Before setting up OpenSearch Ingestion, you need to create the following IAM roles and set up the required permissions:

  • Pipeline role – Defines permissions to read from Amazon Security Lake and write to the OpenSearch Service domain
  • Management role – Defines permissions that allow the user to create, update, delete, and validate the pipeline, and perform other management operations

The following figure shows the permissions and roles you need and how they interact with the solution services.

Before you create an OpenSearch Ingestion pipeline, the principal or the user creating the pipeline must have permissions to perform management actions on a pipeline (create, update, list, and validate). Additionally, the principal must have permission to pass the pipeline role to OpenSearch Ingestion. If you are performing these operations as a non-administrator, add the following permissions to the user creating the pipelines:

{
	"Version": "2012-10-17",
	"Statement": [
		{
			"Effect": "Allow",
			"Resource": "*",
			"Action": [
				"osis:CreatePipeline",
				"osis:ListPipelineBlueprints",
				"osis:ValidatePipeline",
				"osis:UpdatePipeline"
			]
		},
		{
			"_comment": "Replace {your-account-id} with your AWS account ID",
			"Resource": [
				"arn:aws:iam::{your-account-id}:role/pipeline-role"
			],
			"Effect": "Allow",
			"Action": [
				"iam:PassRole"
			]
		}
	]
}

Configure a read policy for the pipeline role

Security Lake subscribers only have access to the source data in the Region you selected when you created the subscriber. To give a subscriber access to data from multiple Regions, refer to Managing multiple Regions. To create a policy for read permissions, you need the name of the Amazon S3 bucket and the Amazon SQS queue created by Security Lake.

Complete the following steps to configure a read policy for the pipeline role:

  1. On the Security Lake console, choose Regions in the navigation pane.
  2. Choose the S3 location corresponding to the Region of the subscriber you created.

  3. Make a note of this Amazon S3 bucket name.

  4. Choose Subscribers in the navigation pane.
  5. Choose the subscriber OpenSearchIngestion that you created earlier.

  6. Take note of the Amazon SQS queue ARN under Subscription endpoint.

  7. On the IAM console, choose Policies in the navigation pane.
  8. Choose Create policy.
  9. In the Specify permissions section, choose JSON to open the policy editor.
  10. Remove the default policy and enter the following code (replace the S3 bucket and SQS queue ARN with the corresponding values):
    {
    	"Version": "2012-10-17",
    	"Statement": [
    		{
    			"Sid": "ReadFromS3",
    			"Effect": "Allow",
    			"Action": "s3:GetObject",
    			"Resource": "arn:aws:s3:::{bucket-name}/*"
    		},
    		{
    			"Sid": "ReceiveAndDeleteSqsMessages",
    			"Effect": "Allow",
    			"Action": [
    				"sqs:DeleteMessage",
    				"sqs:ReceiveMessage"
    			],
    			"_comment": "Replace {your-account-id} with your AWS account ID",
    			"Resource": "arn:aws:sqs:{region}:{your-account-id}:{sqs-queue-name}"
    		}
    	]
    }

  11. Choose Next.
  12. For policy name, enter read-from-securitylake.
  13. Choose Create policy.

You have successfully created the policy to read data from Security Lake and receive and delete messages from the Amazon SQS queue.


Configure a write policy for the pipeline role

We recommend using fine-grained access control (FGAC) with OpenSearch Service. When you use FGAC, you don’t have to use a domain access policy; you can skip the rest of this section and proceed to creating your pipeline role with the necessary permissions. If you use a domain access policy, you need to create a second policy (for this post, we call it write-to-opensearch) as an added step to the steps in the previous section. Use the following policy code:

{
	"Version": "2012-10-17",
	"Statement": [
		{
			"Effect": "Allow",
			"Action": "es:DescribeDomain",
			"Resource": "arn:aws:es:*:{your-account-id}:domain/*"
		},
		{
			"Effect": "Allow",
			"Action": "es:ESHttp*",
			"Resource": "arn:aws:es:*:{your-account-id}:domain/{domain-name}/*"
		}
	]
}

If the configured role has permissions to access Amazon S3 and Amazon SQS across accounts, OpenSearch Ingestion can ingest data across accounts.

Create the pipeline role with necessary permissions

Now that you have created the policies, you can create the pipeline role. Complete the following steps:

  1. On the IAM console, choose Roles in the navigation pane.
  2. Choose Create role.
  3. For Use cases for other AWS services, select OpenSearch Ingestion pipelines.
  4. Choose Next.
  5. Search for and select the policy read-from-securitylake.
  6. Search for and select the policy write-to-opensearch (if you’re using a domain access policy).
  7. Choose Next.
  8. For Role Name, enter pipeline-role.
  9. Choose Create.

Keep note of the role name; you will use it when configuring the OpenSearch Ingestion pipeline.

Now you can map the pipeline role to an OpenSearch backend role if you’re using FGAC. You can map the ingestion role to one of the predefined roles or create your own with the necessary permissions. For example, all_access is a built-in role that grants administrative permission for all OpenSearch functions. When deploying to a production environment, make sure to use a role with just enough permissions to write to your Amazon OpenSearch Service domain.

Create the OpenSearch Ingestion pipeline

In this section, you use the pipeline role you created to create an OpenSearch Ingestion pipeline. Complete the following steps:

  1. On the OpenSearch Service console, choose OpenSearch Ingestion in the navigation pane.
  2. Choose Create pipeline.
  3. For Pipeline name, enter a name, such as security-lake-osi.
  4. In the Pipeline configuration section, choose Configuration blueprints and choose AWS-SecurityLakeS3ParquetOCSFPipeline.

  5. Under source, update the following information:
    1. Update the queue_url in the sqs section. (This is the SQS queue that Amazon Security Lake created when you created a subscriber. To get the URL, navigate to the Amazon SQS console and look for the queue ARN created with the format AmazonSecurityLake-abcde-Main-Queue.)
    2. Enter the Region to use for aws credentials.

  6. Under sink, update the following information:
    1. Replace the hosts value in the OpenSearch section with the Amazon OpenSearch Service domain endpoint.
    2. For sts_role_arn, enter the ARN of pipeline-role.
    3. Set region as us-east-1.
    4. For index, enter the index name that was defined in the template created in the previous section ("ocsf-cuid-${/class_uid}-${/metadata/product/name}-${/class_name}-%{yyyy.MM.dd}").
  7. Choose Validate pipeline to verify the pipeline configuration.

If the configuration is valid, a successful validation message appears; you can now proceed to the next steps.

  8. Under Network, select Public for this post. Our recommendation is to select VPC access for an inherent layer of security.
  9. Choose Next.
  10. Review the details and create the pipeline.

When the pipeline is active, you should see the security data ingested into your Amazon OpenSearch Service domain.

Visualize the security data

After OpenSearch Ingestion starts writing your data into your OpenSearch Service domain, you should be able to visualize the data using the pre-built dashboards you imported earlier. Navigate to dashboards and choose any one of the installed dashboards.

For example, choosing DNS Activity will give you dashboards of all DNS activity published in Amazon Security Lake.

This dashboard shows the top DNS queries by account and hostname. It also shows the number of queries per account. OpenSearch Dashboards are flexible; you can add, delete, or update any of these visualizations to suit your organization and business needs.

Clean up

To avoid unwanted charges, delete the OpenSearch Service domain and OpenSearch Ingestion pipeline, and disable Amazon Security Lake.

Conclusion

In this post, you successfully configured Amazon Security Lake to send security data from different sources to OpenSearch Service through serverless OpenSearch Ingestion. You installed pre-built templates and dashboards to quickly get insights from the security data. Refer to Amazon OpenSearch Ingestion to find additional sources from which you can ingest data. For additional use cases, refer to Use cases for Amazon OpenSearch Ingestion.


About the authors

Muthu Pitchaimani is a Search Specialist with Amazon OpenSearch Service. He builds large-scale search applications and solutions. Muthu is interested in the topics of networking and security, and is based out of Austin, Texas.

Aish Gunasekar is a Specialist Solutions architect with a focus on Amazon OpenSearch Service. Her passion at AWS is to help customers design highly scalable architectures and help them in their cloud adoption journey. Outside of work, she enjoys hiking and baking.

Jimish Shah is a Senior Product Manager at AWS with 15+ years of experience bringing products to market in log analytics, cybersecurity, and IP video streaming. He’s passionate about launching products that offer delightful customer experiences, and solve complex customer problems. In his free time, he enjoys exploring cafes, hiking, and taking long walks.

Amazon OpenSearch Service H1 2023 in review

Post Syndicated from Hajer Bouafif original https://aws.amazon.com/blogs/big-data/amazon-opensearch-service-h1-2023-in-review/

Since its release in January 2021, the OpenSearch project has released 14 versions through June 2023. Amazon OpenSearch Service supports the latest versions of OpenSearch up to version 2.7.

OpenSearch Service provides two configuration options to deploy and operate OpenSearch at scale in the cloud. With OpenSearch Service managed domains, you specify a hardware configuration and OpenSearch Service provisions the required hardware and takes care of software patching, failure recovery, backups, and monitoring. With managed domains, you can use advanced capabilities at no extra cost such as cross-cluster search, cross-cluster replication, anomaly detection, semantic search, security analytics, and more. You don’t need a large team to maintain and operate your OpenSearch Service domain at scale. Your team should be familiar with sharding concepts and OpenSearch best practices to use the OpenSearch managed offering.

Amazon OpenSearch Serverless provides a straightforward and fully auto scaled deployment option. When you use OpenSearch Serverless, you create a collection (a set of indexes that work together on one workload) and use OpenSearch’s APIs, and OpenSearch Serverless does the rest. You don’t need to worry about sizing, capacity planning, or tuning your OpenSearch cluster.

In this post, we provide a review of all the exciting feature releases in OpenSearch Service in the first half of 2023.

Build powerful search solutions

In this section, we discuss some of the features in OpenSearch Service that enable you to build powerful search solutions.

OpenSearch Serverless and the serverless vector engine

Earlier this year, we announced the general availability of OpenSearch Serverless. OpenSearch Serverless separates storage and compute components, and indexing and query compute, so they can be managed and scaled independently. It uses Amazon Simple Storage Service (Amazon S3) as the primary data storage for indexes, adding durability for your data. Collections are able to take advantage of the S3 storage layer to reduce the need for hot storage, and reduce cost, by bringing data into local storage when it's accessed.

When you create a serverless collection, you set a collection type. OpenSearch Serverless optimizes resource use depending on the type you set. At release, you could create search and time series collections for full-text search and log analytics use cases, respectively. In July 2023, we previewed support for a third collection type: vector search. The vector engine for OpenSearch Serverless is a simple, scalable, and high-performing vector store and query engine that enables generative AI, semantic search, image search, and more. Built on OpenSearch Serverless, the vector engine inherits and benefits from its robust architecture. With the vector engine, you don’t have to worry about sizing, tuning, and scaling the backend infrastructure. The vector engine automatically adjusts resources by adapting to changing workload patterns and demand to provide consistently fast performance and scale. The vector engine uses approximate nearest neighbor (ANN) algorithms from the Non-Metric Space Library (NMSLIB) and FAISS libraries to power k-NN search.

You can start using the new vector engine capabilities by selecting Vector search when creating your collection on the OpenSearch Service console. Refer to Introducing the vector engine for Amazon OpenSearch Serverless, now in preview for more information about the new vector search option with OpenSearch Serverless.

Configure collection settings
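
After a vector search collection is active, you create an index with a knn_vector field and start writing embeddings to it. The following is a minimal sketch, assuming Python with opensearch-py and boto3; the collection endpoint, index name, field names, and 1,536-dimension embedding size are placeholder values to adapt to your setup.

import boto3
from opensearchpy import OpenSearch, RequestsHttpConnection, AWSV4SignerAuth

region = "us-east-1"                                 # placeholder Region
credentials = boto3.Session().get_credentials()
auth = AWSV4SignerAuth(credentials, region, "aoss")  # SigV4 signing for OpenSearch Serverless

client = OpenSearch(
    hosts=[{"host": "my-collection-id.us-east-1.aoss.amazonaws.com", "port": 443}],  # placeholder
    http_auth=auth,
    use_ssl=True,
    connection_class=RequestsHttpConnection,
)

# Create an index with a knn_vector field; the dimension must match your embedding model.
client.indices.create(index="products", body={
    "settings": {"index": {"knn": True}},
    "mappings": {"properties": {
        "title": {"type": "text"},
        "embedding": {
            "type": "knn_vector",
            "dimension": 1536,
            "method": {"name": "hnsw", "engine": "nmslib", "space_type": "l2"},
        },
    }},
})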

Point in Time

Point in Time (PIT) search, released in version 2.4 of the OpenSearch Project and supported in OpenSearch 2.5 in OpenSearch Service, provides consistency in search pagination even when new documents are ingested or deleted within a specific index. For example, let’s say your website user searched for “blue couch” and spent a few minutes looking at the results. During those few minutes, the application added some additional couches to the index, shifting the order of the first 20 documents. If the user then navigates from page 1 to page 2, they may see results that were already on page 1 but have shifted down in the result order. The pagination is not stable over the addition of new data to the index. If you use PIT search, the result order is guaranteed to remain the same across pages, regardless of changes to the index. To learn more about PIT capabilities, refer to Launch highlight: Paginate with Point in Time.
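
To make this concrete, the following sketch walks through the PIT flow, assuming Python with opensearch-py against an OpenSearch 2.5+ domain; the endpoint, credentials, index, and query are placeholders, and in production you would add a unique field to the sort as a tiebreaker.

from opensearchpy import OpenSearch

client = OpenSearch(hosts=["https://my-domain-endpoint:443"], http_auth=("user", "pass"))  # placeholders

# 1. Create a Point in Time; the returned pit_id pins a stable view of the index.
pit = client.transport.perform_request(
    "POST", "/products/_search/point_in_time", params={"keep_alive": "10m"}
)
pit_id = pit["pit_id"]

base_query = {
    "size": 20,
    "query": {"match": {"title": "blue couch"}},
    "pit": {"id": pit_id, "keep_alive": "10m"},
    "sort": [{"_score": "desc"}],  # deterministic sort keeps search_after stable
}

# 2. Page 1, then page 2 via search_after; the order stays stable even if new
# documents are indexed in the meantime. PIT searches omit the index from the
# request path, because the PIT already identifies it.
page1 = client.search(body=base_query)
after = page1["hits"]["hits"][-1]["sort"]
page2 = client.search(body={**base_query, "search_after": after})

# 3. Delete the PIT when finished to release its resources.
client.transport.perform_request(
    "DELETE", "/_search/point_in_time", body={"pit_id": [pit_id]}
)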

Search relevance plugin

Ever wondered what would happen if you adjusted your relevance function—would the results be better, or worse? With the search relevance plugin, you can now view a side-by-side comparison of results in OpenSearch Dashboards. A UI view makes it simple to see how the results have changed and dial in your relevance to perfection.

Additional field types

OpenSearch 2.7 (available in OpenSearch Service) supports the following new object mapping types:

  • Cartesian field type – OpenSearch 2.7 in OpenSearch Service adds deeper support for geospatial and Cartesian data. If you are building a virtual reality application, a computer-aided design (CAD) tool, or a sporting venue mapping application, you can benefit from the Cartesian field types xy point field and xy shape field.
  • Flat object type – When you set your field’s mapping to flat_object, OpenSearch indexes any JSON objects in the field to let you search for leaf values, even if you don’t know the field name, and lets you search via dotted-path notation. Refer to Use flat object in OpenSearch to learn more about how the flat object mapping type simplifies index mappings and the search experience in OpenSearch. A minimal mapping sketch follows this list.
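
The following is a minimal flat_object sketch, assuming Python with opensearch-py; the endpoint, index name, field name, and sample document are illustrative.

from opensearchpy import OpenSearch

client = OpenSearch(hosts=["https://my-domain-endpoint:443"])  # placeholder endpoint

# Map one field as flat_object; nested JSON under it becomes searchable without
# declaring every subfield in the mapping.
client.indices.create(index="events", body={
    "mappings": {"properties": {"detail": {"type": "flat_object"}}}
})

client.index(index="events", id="1", refresh=True, body={
    "detail": {"issue": {"labels": {"priority": "high"}}}
})

# Search the leaf value without knowing its path...
client.search(index="events", body={"query": {"match": {"detail": "high"}}})

# ...or address the leaf directly with dotted-path notation.
client.search(index="events", body={
    "query": {"match": {"detail.issue.labels.priority": "high"}}
})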

Geographical analysis

Starting from OpenSearch 2.7 in OpenSearch Service, you can run GeoHex grid aggregation queries on datasets built with the Hexagonal Hierarchical Geospatial Indexing System (H3) open-source library. H3 provides precision down to the square meter or less, making it useful for cases that require a high degree of precision. Because high-precision requests are compute heavy, you should be sure to limit the geographic area using filters.
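
The following is a minimal sketch of such a bounded GeoHex aggregation, assuming Python with opensearch-py; the index name, geo_point field, bounding box, and precision are illustrative.

from opensearchpy import OpenSearch

client = OpenSearch(hosts=["https://my-domain-endpoint:443"])  # placeholder endpoint

# Bound the area with a geo_bounding_box filter first; high-precision hex
# aggregations over a wide area are compute heavy.
resp = client.search(index="rides", body={  # hypothetical index with a geo_point field
    "size": 0,
    "query": {"geo_bounding_box": {"pickup_location": {
        "top_left": {"lat": 40.8, "lon": -74.1},
        "bottom_right": {"lat": 40.6, "lon": -73.7},
    }}},
    "aggs": {"cells": {"geohex_grid": {"field": "pickup_location", "precision": 9}}},
})

for bucket in resp["aggregations"]["cells"]["buckets"]:
    print(bucket["key"], bucket["doc_count"])  # H3 cell ID and document count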

Take Observability to the next level

Observability in OpenSearch is a collection of plugins and features that let you explore, query, and visualize telemetry data stored in OpenSearch. In this section, we discuss how OpenSearch Service enables you to take Observability to the next level.

Simple schema for observability

With version 2.6, the OpenSearch Project released a new unified schema for Observability named Simple Schema for Observability (SS4O), supported in OpenSearch 2.7 in OpenSearch Service. SS4O is inspired by both OpenTelemetry (OTel) and the Elastic Common Schema (ECS), and uses ECS event logs and OTel metadata. SS4O specifies the index structure (mapping), index naming conventions, an integration feature for adding preconfigured dashboards and visualizations, and a JSON schema for enforcing and validating the structure. SS4O complies with the OTel schema for logs, traces, and metrics.

Jaeger traces support

With the release of OpenSearch 2.5, you can now integrate Jaeger trace data in OpenSearch and use the Observability plugin to analyze your trace data in Jaeger format.

Observability provides you with visibility on the health of your system and microservice applications. OpenSearch Dashboards comes with an Observability plugin, which provides a unified experience for collecting and monitoring metrics, logs, and traces from common data sources. With the Observability plugin, you can monitor and alert on your logs, metrics, and traces to ensure that your application is available, performant, and error-free.

In the first half of 2023, we added the capability to create Observability dashboards and standard dashboards from the OpenSearch Dashboards main menu. Before that, you needed to navigate to the Observability plugin to create event analytics visualizations using Piped Processing Language (PPL). With this release, we made this feature more accessible by integrating a new type of visualization named “PPL” within the list of visualization types on the Dashboards main menu. This helps you correlate both business insights and observability analytics in a single place.

“PPL” visualization type
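
Under the hood, a PPL visualization is driven by a PPL query. The following is a minimal sketch of running one directly against the PPL endpoint, assuming Python with requests; the endpoint, credentials, index, and field names are placeholders.

import json
import requests

resp = requests.post(
    "https://my-domain-endpoint/_plugins/_ppl",  # placeholder endpoint
    auth=("user", "pass"),                       # placeholder credentials
    headers={"Content-Type": "application/json"},
    json={"query": "source=web_logs | where response >= 500 | stats count() by host"},
)
print(json.dumps(resp.json(), indent=2))         # schema plus data rows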

Build serverless ingestion pipelines

In April of 2023, OpenSearch Service released Amazon OpenSearch Ingestion, a fully managed and auto scaled ingestion pipeline for OpenSearch Service domains and OpenSearch Serverless collections. OpenSearch Ingestion is powered by Data Prepper, with source and sink plugins to process, sample, filter, enrich, and deliver data for downstream analysis. Refer to Supported plugins and options for Amazon OpenSearch Ingestion pipelines to learn more.

The service automatically accommodates your workload demands by scaling up and down the OpenSearch Compute units (OCUs). Each OCU provides an estimated 8 GB per hour of throughput (your workload will determine the actual throughput) and is a combination of 8 GiB of memory and 2 vCPUs. You can scale up to 96 OCUs.

OpenSearch Ingestion offers out-of-the-box pipeline blueprints that provide configuration templates for the most common ingestion pipelines. For more information, refer to Build a serverless log analytics pipeline using Amazon OpenSearch Ingestion with managed Amazon OpenSearch Service.

Log Aggregation with conditional routing blueprint in OpenSearch Ingestion

Enable your business with security features

In this section, we discuss how you can use OpenSearch Service to enable your business with security features.

Enable SAML during domain creation

SAML authentication for OpenSearch Dashboards was introduced in OpenSearch Service domains with Elasticsearch version 6.7 or higher and OpenSearch version 1.0 or higher, but you had to wait for the domain to be created before you could enable SAML. In February 2023, we enabled you to specify SAML support during domain creation. Support is available when you create domains using the AWS Management Console, AWS SDKs, or AWS CloudFormation templates. SAML authentication for OpenSearch Dashboards enables you to integrate directly with identity providers (IdPs) such as Okta, Ping Identity, OneLogin, Auth0, Active Directory Federation Services (ADFS), and Azure Active Directory.
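
You can also enable SAML at creation time through the AWS SDK. The following is a minimal sketch, assuming Python with boto3; the domain name, instance settings, IAM role ARN, entity ID, and IdP metadata file are placeholders, and the fine-grained access control prerequisites (encryption at rest, node-to-node encryption, and enforced HTTPS) are included.

import boto3

opensearch = boto3.client("opensearch")

idp_metadata = open("idp-metadata.xml").read()  # metadata XML exported from your IdP

opensearch.create_domain(
    DomainName="my-domain",  # placeholder
    EngineVersion="OpenSearch_2.7",
    ClusterConfig={"InstanceType": "r6g.large.search", "InstanceCount": 3},
    EBSOptions={"EBSEnabled": True, "VolumeType": "gp3", "VolumeSize": 100},
    NodeToNodeEncryptionOptions={"Enabled": True},
    EncryptionAtRestOptions={"Enabled": True},
    DomainEndpointOptions={"EnforceHTTPS": True},
    AdvancedSecurityOptions={
        "Enabled": True,
        "InternalUserDatabaseEnabled": False,
        "MasterUserOptions": {"MasterUserARN": "arn:aws:iam::111122223333:role/admin-role"},  # placeholder
        "SAMLOptions": {
            "Enabled": True,
            "Idp": {"MetadataContent": idp_metadata, "EntityId": "urn:example:idp"},  # placeholders
            "RolesKey": "Role",          # SAML attribute carrying backend roles
            "SessionTimeoutMinutes": 60,
        },
    },
)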

Security analytics with OpenSearch

OpenSearch 2.5 in OpenSearch Service launched support for OpenSearch’s security analytics plugin. In the past, identifying actionable security alerts and gaining valuable insights required significant expertise and familiarity with various security products. However, with security analytics, you can now benefit from simplified workflows that facilitate correlating multiple security logs and investigating security incidents, all within the OpenSearch environment, even without prior security experience. The security analytics plugin is bundled with an extensive collection of over 2,200 open-source Sigma security rules. These rules play a crucial role in detecting potential security threats in real time from your event logs. With the security analytics plugin, you can also design custom rules, tailor security alerts based on threat severity, and receive automated notifications at your preferred destination, such as email or a Slack channel. For more information about creating detectors and configuring rules, refer to Identify and remediate security threats to your business using security analytics with Amazon OpenSearch Service.

Security Analytics plugin - Alerts and findings

Ingest events from Amazon Security Lake

In June 2023, OpenSearch Ingestion added support for real-time ingestion of events from Amazon Security Lake, reducing indexing time for security data in OpenSearch Service. With Amazon Security Lake centralizing security data from various sources, you can take advantage of the extensive security analytics capabilities and rich dashboard visualizations of OpenSearch Service to gain valuable insights quickly. Using the Open Cybersecurity Schema Framework (OCSF), Amazon Security Lake normalizes and combines data from diverse enterprise security sources in Apache Parquet format. OpenSearch Ingestion now enables ingestion in Parquet format, with built-in processors to convert data into JSON documents before indexing. Additionally, there’s a specialized blueprint for ingesting data from Amazon Security Lake and support for Data Prepper 2.3.0, offering new features like S3 sink, Avro codec, obfuscation processor, event tagging, advanced expressions, and tail sampling.

Amazon Security Lake blueprint in OpenSearch Ingestion

Simplify cluster operations

In this section, we discuss how you can use OpenSearch Service to simplify cluster operations.

Enhanced dry run for configuration changes

OpenSearch Service has introduced an enhanced dry run option that allows you to validate configuration changes before applying them to your clusters. This feature ensures that any potential validation errors that might occur during the deployment of configuration changes are checked and summarized for your review. Additionally, the dry run will indicate whether a blue/green deployment is necessary to apply a change, enabling you to plan accordingly.
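
You can also request a dry run through the AWS SDK. The following is a minimal sketch, assuming Python with boto3 and the dry run options of the update_domain_config API; the domain name and instance change are placeholders.

import boto3

opensearch = boto3.client("opensearch")

# Validate the change without applying it; Verbose mode also reports whether a
# blue/green deployment would be required.
resp = opensearch.update_domain_config(
    DomainName="my-domain",  # placeholder
    ClusterConfig={"InstanceType": "r6g.xlarge.search", "InstanceCount": 3},
    DryRun=True,
    DryRunMode="Verbose",
)
print(resp.get("DryRunResults"))         # deployment type and validation messages
print(resp.get("DryRunProgressStatus"))  # validation progress in Verbose mode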

Ensure high availability and consistent performance

OpenSearch Service now offers 99.99% availability with the Multi-AZ with Standby deployment option. This new capability makes your business-critical workloads more resilient to potential infrastructure failures such as an Availability Zone outage. Prior to this launch, OpenSearch Service automatically recovered from Availability Zone outages by allocating more capacity in the impacted Availability Zone and automatically redistributing shards. However, this was a reactive approach to infrastructure and network failures, and it usually led to higher latency and increased resource utilization across the nodes. The Multi-AZ with Standby feature deploys infrastructure in three Availability Zones, keeping two zones active and one zone as standby. It requires a minimum of two replicas to maintain data redundancy across Availability Zones, enabling a recovery time of less than a minute.

Multi AZ with stand-by feature

Skip unavailable clusters in cross-cluster search

With the release of the Skip unavailable clusters option for cross-cluster search in June 2023, your cross-cluster search queries will return results even if you have unavailable shards or indexes on one of the remote clusters. The feature is enabled by default when you request connection to a remote cluster on the OpenSearch Service console.

Cross-cluster search feature

Enhance your experience with OpenSearch Dashboards

The release of OpenSearch 2.5 and OpenSearch 2.7 in OpenSearch Service has brought new features to manage data streams and indexes on the OpenSearch Dashboards UI.

Snapshot management

By default, OpenSearch Service takes hourly snapshots of your data with a retention time of 14 days. The automatic snapshots are incremental in nature and help you recover from data loss or cluster failure. In addition to the default hourly snapshots, OpenSearch Service provides the capability to run manual snapshots and store them in an S3 bucket. You can use snapshot management to create manual snapshots, define a snapshot retention policy, and set up the frequency and timing of snapshot creation. Snapshot management is available under the index management plugin in OpenSearch Dashboards.

Snapshot management plugin

Index and data streams management

With the support of OpenSearch 2.5 and OpenSearch 2.7 in OpenSearch Service, you can now use the index management plugin in OpenSearch Dashboards to manage data streams, index templates, and index aliases.

The index management UI provides expanded capabilities, including running manual rollover and force merge actions for data streams. You can also visually manage multiple index templates and define index mappings, the number of primary shards, the number of replicas, and the refresh interval for your indexes.

index management UI

Conclusion

It’s been a busy first half of the year! OpenSearch Project and OpenSearch Service have launched OpenSearch Serverless to use OpenSearch without worrying about infrastructure, indexes, or shards; OpenSearch Ingestion to ingest your data; the vector engine for OpenSearch Serverless; security analytics to analyze data from Amazon Security Lake; operational improvements to bring 99.99% availability; and improvements to the Observability plugin. OpenSearch Service provides a full suite of capabilities, including a vector database, semantic search, and a log analytics engine. We invite you to check out the features described in this post, and we appreciate your valuable feedback.

You can get started by having hands-on experience with the publicly available workshops for semantic search, microservice observability, and OpenSearch Serverless. You can also learn more about the service features and use cases by checking out more OpenSearch Service blog posts.


About the Authors

Hajer Bouafif is an Analytics Specialist Solutions Architect at Amazon Web Services. She focuses on Amazon OpenSearch Service and helps customers design and build well-architected analytics workloads in diverse industries. Hajer enjoys spending time outdoors and discovering new cultures.


Aish Gunasekar is a Specialist Solutions Architect with a focus on Amazon OpenSearch Service. Her passion at AWS is to help customers design highly scalable architectures and help them in their cloud adoption journey. Outside of work, she enjoys hiking and baking.

Jon Handler is a Senior Principal Solutions Architect at Amazon Web Services based in Palo Alto, CA. Jon works closely with OpenSearch and Amazon OpenSearch Service, providing help and guidance to a broad range of customers who have search and log analytics workloads that they want to move to the AWS Cloud. Prior to joining AWS, Jon’s career as a software developer included 4 years of coding a large-scale, ecommerce search engine. Jon holds a Bachelor of the Arts from the University of Pennsylvania, and a Master of Science and a PhD in Computer Science and Artificial Intelligence from Northwestern University.

Amazon CloudWatch metrics for Amazon OpenSearch Service storage and shard skew health

Post Syndicated from Nikhil Agarwal original https://aws.amazon.com/blogs/big-data/amazon-cloudwatch-metrics-for-amazon-opensearch-service-storage-and-shard-skew-health/

Amazon OpenSearch Service is a managed service that makes it easy to deploy, operate, and scale OpenSearch clusters in AWS to perform interactive log analytics, real-time application monitoring, website search, and more. OpenSearch is an open source, distributed search and analytics suite.

When working with OpenSearch Service, shard strategy is key. Shards distribute your workload across the data nodes of your cluster. When creating an index, you tell OpenSearch Service how many primary shards to create and how many replicas to create of each shard. The primary shards are independent partitions of the full dataset. OpenSearch Service automatically distributes your data across the primary shards in an index. Our recommendation is to use two replicas for your index. For example, if you set your index’s shard count to three primary shards and two replicas, you will have a total of nine shards. Properly configured indexes can help boost overall domain performance, whereas a misconfigured index will lead to storage and performance skew.

OpenSearch Service distributes the shards in your indexes to the data nodes in your domain, ensuring that no primary shard and its replicas are placed on the same node. The data for the shards are stored in the node’s storage. If your indexes (and therefore their shards) are very different sizes, the storage used on the data nodes in the domain will be unequal, or skewed. Storage skew leads to uneven memory and CPU utilization, intermittent and uneven latency, and uneven queueing and rejecting of requests. Therefore, it’s important to configure and maintain indexes such that shards can be distributed evenly across the data nodes of your cluster.

In this post, we explore how to deploy Amazon CloudWatch metrics using an AWS CloudFormation template to monitor an OpenSearch Service domain’s storage and shard skew. This solution uses an AWS Lambda function to extract storage and shard distribution metadata from your OpenSearch Service domain, calculates the level of skew, and then pushes this information to CloudWatch metrics so that you can easily monitor, alert, and respond.

Solution overview

The solution and associated resources are available for you to deploy into your own AWS account as a CloudFormation template. The template deploys the following resources:

  • An AWS Identity and Access Management (IAM) role for the Lambda function called OpensearchSkewMetricsLambdaRole. This allows write access to CloudWatch metrics and access to the CloudWatch log group and OpenSearch APIs.
  • An AWS Lambda function called Opensearch-SkewMetricsPublisher-py.
  • An Amazon CloudWatch log group for the Lambda function called /aws/lambda/Opensearch-skewmetrics-publisher-py.
  • An Amazon EventBridge rule for the Lambda function called EventRuleForOSSkew.
  • The following CloudWatch metrics for the Lambda function:
    • aws_/<region-name>/<MetricIdentifier>/_storagemetric
    • aws_/<region-name>/<MetricIdentifier>/_shardmetric

Prerequisites

For this walkthrough, you should have the following prerequisites:

  • An AWS account.
  • An OpenSearch Service domain.
  • This post requires you to add a Lambda role to the OpenSearch Service domain’s security configuration access policy. If your domain is using fine-grained access control, then you need to follow the steps as described in the section Mapping roles to users to enable access for the newly deployed Lambda execution role to the domain after deploying the CloudFormation template.

Deploy the CloudFormation template

To deploy the CloudFormation template, complete the following steps:

  1. Log in to your AWS account.
  2. Select the Region where you’re running your OpenSearch Service domain.
  3. To launch your CloudFormation stack, choose Launch Stack.
  4. For Stack name, enter a name for the stack (maximum length 30 characters).
  5. For MetricIdentifier, enter a unique identifier that will help you identify the custom CloudWatch metrics for your domain.
  6. For OpensearchDomainURL, enter the domain endpoint that you are monitoring.
  7. Choose Next.
  8. Select I acknowledge that AWS CloudFormation might create IAM resources, then choose Create stack.
  9. Wait for the stack creation to complete.
  10. On the Lambda console, choose Functions in the navigation pane.
  11. Choose the Lambda function called Opensearch-SkewMetricsPublisher-py-<stackname>.
  12. In the Code section, choose Test.
  13. Keep the default values for the test event and run a quick test.

Make sure to add the Lambda execution role to the OpenSearch Service domain’s resource-based access policy, if you are using one. If fine-grained access control is enabled on the domain, then follow the steps in Mapping roles to users (as mentioned in the prerequisites) to grant the Lambda function read-only access to the domain.

The Lambda function that sends OpenSearch domain metrics to CloudWatch is set to a default frequency of 1 day. You can change this configuration to monitor the domain at the required granularity by updating the event schedule for the rule deployed by the CloudFormation stack on the EventBridge console. Note that if the frequency is set to 1 minute, this will trigger the Lambda function every minute and will increase the Lambda cost.

This solution uses the cat/allocation API, which provides the number of data nodes in the domain along with each data node’s shard count and storage usage attributes. For further details on domain storage and shard skew, refer to Node shard and storage skew. The Lambda function processes and sorts each data node’s storage and shard skew from the average value. Any data node whose usage deviates more than 10% above the average is generally considered significantly skewed. This will start to impact CPU, network, and disk bandwidth usage, because the nodes with the highest storage utilization tend to be the resource-strained ones, whereas nodes well below the average represent underutilized capacity.
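
The following sketch mirrors that calculation outside of Lambda, assuming Python with opensearch-py and a placeholder domain endpoint; it illustrates the approach rather than reproducing the deployed function.

from opensearchpy import OpenSearch

client = OpenSearch(hosts=["https://my-domain-endpoint:443"])  # placeholder endpoint

# _cat/allocation reports, per data node, its shard count and disk usage.
rows = client.cat.allocation(format="json", bytes="gb")
nodes = [r for r in rows if r.get("node") and r["node"] != "UNASSIGNED"]

avg_shards = sum(int(r["shards"]) for r in nodes) / len(nodes)
avg_disk = sum(float(r["disk.used"]) for r in nodes) / len(nodes)

for r in nodes:
    shard_skew = (int(r["shards"]) - avg_shards) / avg_shards * 100
    disk_skew = (float(r["disk.used"]) - avg_disk) / avg_disk * 100
    flag = "  <-- skewed" if abs(disk_skew) > 10 or abs(shard_skew) > 10 else ""
    print(f'{r["node"]}: shards {shard_skew:+.1f}%, storage {disk_skew:+.1f}%{flag}')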

Refer to Demystifying Elasticsearch shard allocation for details related to shard size and shard count strategy. In general, we recommend keeping shard sizes between 10–30 GB for workloads where search latency is a key performance objective and 30–50 GB for write-heavy workloads. For shard count, we recommend maintaining index shard counts that are divisible by the data node count. For additional details, refer to Sizing Amazon OpenSearch Service domains and Shard strategy.

View skew metrics in CloudWatch

After you run this solution in your account, it will create two CloudWatch metrics for monitoring. To access these CloudWatch metrics, use the following steps:

  1. On the CloudWatch console, under Metrics in the navigation pane, choose All metrics.
  2. Choose Browse and select Custom namespaces. You should see two custom metrics ending with _storageworkspace and _shardworkspace, respectively.
  3. Choose either of the custom metrics and then select NodeID.
  4. On the list of node IDs, select all the nodes displayed in the list, and the graph will be plotted automatically.

You can hover the mouse over the plotted lines to see the node skew information.

The following screenshots show examples of how the CloudWatch metrics will appear on the console.

The storage skew metrics will look similar to the following screenshot, which shows the domain’s storage skew. If you hover over the graph, it shows the list of available nodes in the domain, sorted by storage size (largest to smallest). The Lambda function periodically posts the latest storage skew results.

The shard skew metrics will look similar to the following screenshot, which shows the domain’s shard skew. If you hover over the graph, it shows the list of available nodes in the domain, sorted by shard count (largest to smallest). The Lambda function periodically posts the latest shard skew results.

Storage skew occurs when one or more nodes within the domain has significantly more storage than other nodes. The CloudWatch metric will show a higher deviation of storage usage for these nodes vs. other nodes. Similarly, shard skew occurs when one or more nodes has significantly more shards than other nodes. The CloudWatch metric will show a higher deviation for these nodes vs. other nodes in the domain. When domain storage or shard skew is detected, you can raise a support case to work with the AWS team on remediation actions. See How do I rebalance the uneven shard distribution in my Amazon OpenSearch Service cluster for information on how to take remediation actions to configure your domain shard strategy for optimal performance.

Costs

The cost associated with using this solution is minimal, around a few cents per month, because it mainly generates CloudWatch metrics. The solution also runs Lambda code, and in this case the Lambda function makes API calls. For pricing details, refer to Amazon CloudWatch Pricing and AWS Lambda Pricing.

Clean up

If you decide that you no longer want to keep the Lambda function and associated resources, you can navigate to the AWS CloudFormation console, choose the stack, and choose Delete.

If you want to add the CloudWatch skew monitor metrics mechanism back in at any point, you can create the stack again from the CloudFormation template.

Conclusion

You can use this solution to get a better understanding of your OpenSearch Service domain’s storage and shard skew to improve its performance and possibly lower the cost of operating your domain. See Use Elasticsearch’s _rollover API For efficient storage distribution for more details related to shard allocation and efficient storage distribution strategy.


About the authors

Nikhil Agarwal is Sr. Technical Manager with Amazon Web Services. He is passionate about helping customers achieve operational excellence in their cloud journey and working actively on technical solutions. He is also an AI/ML enthusiast and dives deep into customers’ ML-specific use cases. Outside of work, he enjoys traveling with family and exploring different gadgets.

Karthik Chemudupati is a Principal Technical Account Manager (TAM) with AWS, focused on helping customers achieve cost optimization and operational excellence. He has more than 19 years of IT experience in software engineering, cloud operations and automations. Karthik joined AWS in 2016 as a TAM and has worked with more than a dozen enterprise customers across US-West. Outside of work, he enjoys spending time with his family.

Gene Alpert is a Senior Analytics Specialist with AWS Enterprise Support. He has been focused on our Amazon OpenSearch Service customers and ecosystem for the past three years. Gene joined AWS in 2017. Outside of work he enjoys mountain biking, traveling, and playing Population:One in VR.

Try semantic search with the Amazon OpenSearch Service vector engine

Post Syndicated from Stavros Macrakis original https://aws.amazon.com/blogs/big-data/try-semantic-search-with-the-amazon-opensearch-service-vector-engine/

Amazon OpenSearch Service has long supported both lexical and vector search, since the introduction of its kNN plugin in 2020. With recent developments in generative AI, including AWS’s launch of Amazon Bedrock earlier in 2023, you can now use Amazon Bedrock-hosted models in conjunction with the vector database capabilities of OpenSearch Service, allowing you to implement semantic search, retrieval augmented generation (RAG), recommendation engines, and rich media search based on high-quality vector search. The recent launch of the vector engine for Amazon OpenSearch Serverless makes it even easier to deploy such solutions.

OpenSearch Service supports a variety of search and relevance ranking techniques. Lexical search looks for words in the documents that appear in the queries. Semantic search, supported by vector embeddings, embeds documents and queries into a semantic high-dimension vector space where texts with related meanings are nearby in the vector space and therefore semantically similar, so that it returns similar items even if they don’t share any words with the query.

We’ve put together two demos on the public OpenSearch Playground to show you the strengths and weaknesses of the different techniques: one comparing textual vector search to lexical search, the other comparing cross-modal textual and image search to textual vector search. With OpenSearch’s Search Comparison Tool, you can compare the different approaches. For the demo, we’re using the Amazon Titan foundation model hosted on Amazon Bedrock for embeddings, with no fine tuning. The dataset consists of a selection of Amazon clothing, jewelry, and outdoor products.

Background

A search engine is a special kind of database, allowing you to store documents and data and then run queries to retrieve the most relevant ones. End-user search queries usually consist of text entered in a search box. Two important techniques for using that text are lexical search and semantic search. In lexical search, the search engine compares the words in the search query to the words in the documents, matching word for word. Only items that have all or most of the words the user typed match the query. In semantic search, the search engine uses a machine learning (ML) model to encode text from the source documents as a dense vector in a high-dimensional vector space; this is also called embedding the text into the vector space. It similarly codes the query as a vector and then uses a distance metric to find nearby vectors in the multi-dimensional space. The algorithm for finding nearby vectors is called kNN (k Nearest Neighbors). Semantic search does not match individual query terms—it finds documents whose vector embedding is near the query’s embedding in the vector space and therefore semantically similar to the query, so the user can retrieve items that don’t have any of the words that were in the query, even though the items are highly relevant.
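
The following sketch shows the two halves of that flow, embedding a query and running a kNN search, assuming Python with boto3 and opensearch-py; the Titan model ID, domain endpoint, index name, and vector field name are assumptions for illustration, not the demo’s actual configuration.

import json

import boto3
from opensearchpy import OpenSearch

bedrock = boto3.client("bedrock-runtime", region_name="us-east-1")  # placeholder Region
client = OpenSearch(hosts=["https://my-domain-endpoint:443"])       # placeholder endpoint

# 1. Embed the query text with a Titan embeddings model.
resp = bedrock.invoke_model(
    modelId="amazon.titan-embed-text-v1",
    body=json.dumps({"inputText": "tennis clothes"}),
)
query_vector = json.loads(resp["body"].read())["embedding"]

# 2. Find the k nearest document embeddings in a knn_vector field.
hits = client.search(index="products", body={
    "size": 10,
    "query": {"knn": {"embedding": {"vector": query_vector, "k": 10}}},
})
for h in hits["hits"]["hits"]:
    print(h["_score"], h["_source"].get("title"))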

Textual vector search

The demo of textual vector search shows how vector embeddings can capture the context of your query beyond just the words that compose it.

In the text box at the top, enter the query tennis clothes. On the left (Query 1), there’s an OpenSearch DSL (Domain Specific Language for queries) semantic query using the amazon_products_text_embedding index, and on the right (Query 2), there’s a simple lexical query using the amazon_products_text index. You’ll see that lexical search doesn’t know that clothes can be tops, shorts, dresses, and so on, but semantic search does.

Search Comparison Tool

Compare semantic and lexical results

Similarly, in a search for warm-weather hat, the semantic results find lots of hats suitable for warm weather, whereas the lexical search returns results mentioning the words “warm” and “hat,” all of which are warm hats suitable for cold weather, not warm-weather hats. Similarly, if you’re looking for long dresses with long sleeves, you might search for long long-sleeved dress. A lexical search ends up finding some short dresses with long sleeves and even a child’s dress shirt because the word “dress” appears in the description, whereas the semantic search finds much more relevant results: mostly long dresses with long sleeves, with a couple of errors.

Cross-modal image search

The demo of cross-modal textual and image search shows searching for images using textual descriptions. This works by finding images that are related to your textual descriptions using a pre-production multi-modal embedding. We’ll compare searching for visual similarity (on the left) and textual similarity (on the right). In some cases, we get very similar results.

Search Comparison Tool

Compare image and textual embeddings

For example, sailboat shoes does a good job with both approaches, but white sailboat shoes does much better using visual similarity. The query canoe finds mostly canoes using visual similarity—which is probably what a user would expect—but a mixture of canoes and canoe accessories such as paddles using textual similarity.

If you are interested in exploring the multi-modal model, please reach out to your AWS specialist.

Building production-quality search experiences with semantic search

These demos give you an idea of the capabilities of vector-based semantic vs. word-based lexical search and what can be accomplished by utilizing the vector engine for OpenSearch Serverless to build your search experiences. Of course, production-quality search experiences use many more techniques to improve results. In particular, our experimentation shows that hybrid search, combining lexical and vector approaches, typically results in a 15% improvement in search result quality over lexical or vector search alone on industry-standard test sets, as measured by the NDCG@10 metric (Normalized Discounted Cumulative Gain in the first 10 results). The improvement is because lexical outperforms vector for very specific names of things, and semantic works better for broader queries. For example, in the semantic vs. lexical comparison, the query saranac 146, a brand of canoe, works very well in lexical search, whereas semantic search doesn’t return relevant results. This demonstrates why the combination of semantic and lexical search provides superior results.
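
The hybrid numbers above come from tuned score combination. As a rough illustration of blending the two signals, a bool query can pair a lexical clause with a kNN clause; this naive score sum is not the normalized hybrid ranking measured in that experiment. The sketch assumes Python with opensearch-py, a pre-computed query embedding, and illustrative index and field names.

from opensearchpy import OpenSearch

client = OpenSearch(hosts=["https://my-domain-endpoint:443"])  # placeholder endpoint

query_vector = [...]  # placeholder: the embedding of the query text

hits = client.search(index="products", body={
    "size": 10,
    "query": {"bool": {"should": [
        {"match": {"title": "saranac 146"}},                        # lexical signal: exact names
        {"knn": {"embedding": {"vector": query_vector, "k": 10}}},  # semantic signal: broad intent
    ]}},
})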

Conclusion

OpenSearch Service includes a vector engine that supports semantic search as well as classic lexical search. The examples shown in the demo pages show the strengths and weaknesses of different techniques. You can use the Search Comparison Tool on your own data in OpenSearch 2.9 or higher.

Further information

For further information about OpenSearch’s semantic search capabilities, see the following:


About the author

Stavros Macrakis is a Senior Technical Product Manager on the OpenSearch project of Amazon Web Services. He is passionate about giving customers the tools to improve the quality of their search results.

Amazon OpenSearch Serverless expands support for larger workloads and collections

Post Syndicated from Prashant Agrawal original https://aws.amazon.com/blogs/big-data/amazon-opensearch-serverless-expands-support-for-larger-workloads-and-collections/

We recently announced new enhancements to Amazon OpenSearch Serverless that can scan and search source data sizes of up to 6 TB. At launch, OpenSearch Serverless supported searching one or more indexes within a collection, with the total combined size of up to 1 TB. With the support for 6 TB source data, you can now scale up your log analytics, machine learning applications, and ecommerce data more effectively. With OpenSearch Serverless, you can enjoy the benefits of these expanded limits without having to worry about sizing, monitoring your usage, or manually scaling an OpenSearch domain. If you are new to OpenSearch Serverless, refer to Log analytics the easy way with Amazon OpenSearch Serverless to get started.

The compute capacity in OpenSearch Serverless used for data ingestion and search and query is measured in OpenSearch Compute Units (OCUs). To support larger datasets, we have raised the OCU limit from 50 to 100 for indexing and search, including redundancy for Availability Zone outages and infrastructure failures. These OCUs are shared among various collections, each containing one or more indexes of varied sizes. You can configure maximum OCU limits on search and indexing independently using the AWS Command Line Interface (AWS CLI), SDK, or AWS Management Console to manage costs. Furthermore, you can have multiple 6 TB collections. If you wish to expand the OCU limits for indexes and collection sizes beyond 6 TB, reach out to us through AWS Support.

Set max OCU to 100

To get started, you must first change the OCU limits for indexing and search to 100. Note that you only pay for the resources consumed and not for the max OCU configuration.
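
You can make the same change programmatically. The following is a minimal sketch, assuming Python with boto3; it raises the account-level OCU maximums and reads them back.

import boto3

aoss = boto3.client("opensearchserverless")

# Raise the account-level capacity limits; you are billed for OCUs consumed,
# not for these maximums.
aoss.update_account_settings(capacityLimits={
    "maxIndexingCapacityInOCU": 100,
    "maxSearchCapacityInOCU": 100,
})

print(aoss.get_account_settings()["accountSettingsDetail"]["capacityLimits"])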

Ingesting the data

You can use the load generation scripts shared in the following workshop, or you can use your own application or data generator to create load. You can run multiple instances of these scripts to generate a burst in indexing requests. As seen in the following screenshot, in this test we created six indexes of approximately 1 TB or more.

Auto scaling resources in OpenSearch Serverless

The highlighted points in the following figures show how OpenSearch Serverless responds to increasing indexing traffic, from 2,000 to 7,000 bulk request operations per second, by auto scaling the OCUs. Each bulk request includes 7,500 documents. OpenSearch Serverless uses various system signals to automatically scale out the OCUs based on your workload demand.

OpenSearch Serverless also scales down indexing OCUs when there is a decrease in your workload’s activity level. The highlighted points in the following figures show a gradual decrease in indexing traffic from 7,000 bulk ingest operations to less than 1,000 operations per second. OpenSearch Serverless reacts to the changes in load by reducing the number of OCUs.

Conclusion

We encourage you to take advantage of the 6 TB index support and put it to the test! Migrate your data, explore the improved throughput, and take advantage of the enhanced scaling capabilities. Our goal is to deliver a seamless and efficient experience that aligns with your requirements.

To get started, refer to Log analytics the easy way with Amazon OpenSearch Serverless. To get hands-on with OpenSearch Serverless, follow the Getting started with Amazon OpenSearch Serverless workshop, which has a step-by-step guide for configuring and setting up an OpenSearch Serverless collection.

If you have feedback about this post, share it in the comments section. If you have questions about this post, start a new thread on the Amazon OpenSearch Service forum or contact AWS Support.


About the author

Prashant Agrawal is a Sr. Search Specialist Solutions Architect with Amazon OpenSearch Service. He works closely with customers to help them migrate their workloads to the cloud and helps existing customers fine-tune their clusters to achieve better performance and save on cost. Before joining AWS, he helped various customers use OpenSearch and Elasticsearch for their search and log analytics use cases. When not working, you can find him traveling and exploring new places. In short, he likes doing Eat → Travel → Repeat.

Configure SAML federation for Amazon OpenSearch Serverless with Okta

Post Syndicated from Aish Gunasekar original https://aws.amazon.com/blogs/big-data/configure-saml-federation-for-amazon-opensearch-serverless-with-okta/

Modern applications apply security controls across many systems and their subsystems. Keeping all of these systems in sync would be a major undertaking if you tried to implement it separately. Centralized identity management is the way to maintain a single identity provider (IdP) that can authenticate actors and manage and distribute their rights.

OpenSearch is an open-source search and analytics suite that enables you to ingest, store, analyze, and visualize full text and log data. Amazon OpenSearch Serverless makes it simple to deploy, scale, and operate OpenSearch in the AWS Cloud, freeing you from the undifferentiated heavy lifting of sizing, scaling, and operating an OpenSearch cluster. When you use OpenSearch Serverless, you can integrate with your existing Security Assertion Markup Language 2.0 (SAML)-compliant IdP to provide granular access control for your OpenSearch Serverless collections. Our customers use a variety of IdPs, including AWS IAM Identity Center (successor to AWS SSO), Okta, Keycloak, Active Directory Federation Services (AD FS), and Auth0.

In this post, you will learn how to use Okta as your IdP and integrate it with OpenSearch Serverless to securely manage your users and groups for secure access to your data.

Solution overview

The flow of access requests is depicted in the following figure.

When you navigate to OpenSearch Dashboards, the workflow steps are as follows:

  1. OpenSearch Serverless generates a SAML authentication request.
  2. OpenSearch Serverless redirects your request back to the browser.
  3. The browser redirects to the Okta URL via the Okta application setup.
  4. Okta parses the SAML request, authenticates the user, and generates a SAML response.
  5. Okta returns the encoded SAML response to the browser.
  6. The browser sends the SAML response back to the OpenSearch Serverless Assertion Consumer Services (ACS) URL.
  7. ACS verifies the SAML response and logs in the user with the permissions defined in the data access policy.

Prerequisites

Complete the following prerequisite steps:

  1. Create an OpenSearch Serverless collection. For instructions, refer to Preview: Amazon OpenSearch Serverless – Run Search and Analytics Workloads without Managing Clusters.
  2. Make a note of your AWS account ID to use while configuring your application in Okta.
  3. Create an Okta account, which you will use as an IdP.
  4. Create users and a group in Okta:
    1. Log in to your Okta account, and in the navigation pane, choose Directory, then choose Groups.
    2. Choose Add Group and name it opensearch-serverless, then choose Save.
    3. Choose Assign People to add users.
    4. You can add users to the opensearch-serverless group by choosing the plus sign next to the user name, or you can choose Add All.
    5. Add your users, then choose Save.
    6. To create new users, choose People in the navigation pane under Directory, then choose Add Person.
    7. Provide your first name, last name, user name (email ID), and primary email address.
    8. For Password, choose Set by admin and First-time password.
    9. To create your user, choose Save.
    10. In the navigation pane, choose Groups, then choose the opensearch-serverless group you created earlier.

The following graphic gives a quick demonstration of setting up a user and group.

Configure an application in Okta

To configure an application in Okta, complete the following steps:

  1. Navigate to the Applications page on the Okta console.
  2. Choose Create App Integration, select SAML 2.0, then choose Next.
  3. For Name, enter a name for the app (for example, myweblogs), then choose Next.
  4. Under Application ACS URL, enter the URL using the format https://collection.<REGION>.aoss.amazonaws.com/_saml/acs (replace <REGION> with the corresponding Region) to generate the IdP metadata.
  5. Select Use this for Recipient URL and Destination URL to use the same ACS URL as the recipient and destination.
  6. Specify aws:opensearch:<AWS-Account-ID> under Audience URI (SP Entity ID). This specifies who the assertion is intended for within the SAML assertion.
  7. Under Group Attribute Statements, enter a name that is relevant to your application, such as mygroup, and select unspecified as the name format. (Don’t forget this name, you’ll need it later.)
  8. Select equals as the filter and enter opensearch-serverless.
  9. Select I’m a software vendor. I’d like to integrate my app with Okta and choose Finish.
  10. After the app is created, choose the Sign On tab, scroll down to the metadata details, and copy the value for Metadata URL.

The following graphic gives a quick demonstration of setting up an application in Okta via the preceding steps.

Next, you associate the users and groups to the application that you created in the previous step.

  1. On the Applications page, choose the app you created earlier.
  2. On the Assignments tab, choose Assign.
  3. Select Assign To Groups and choose the group you wish to assign to (opensearch-serverless in this case).
  4. Choose Done.

The following graphic gives a quick demonstration of assigning groups to the application via the preceding steps.

Set up SAML on OpenSearch Serverless

In this section, you create a SAML provider that you’ll use for your OpenSearch Serverless collection. Complete the following steps:

  1. Open the OpenSearch Serverless console in a new tab.
  2. In the navigation pane, under Serverless, choose SAML authentication.
  3. Select Add SAML provider.
  4. Provide a recognizable name (for example, okta) and a description.
  5. Open a new tab and enter the copied metadata URL into your browser.

You should see the metadata for the Okta application.

  1. Take note of this metadata and copy it to your clipboard.
  2. On the OpenSearch Serverless console tab, enter this metadata in the Provide metadata from your IdP section.
  3. Under Additional settings, enter mygroup or the group attribute provided in the Okta configuration.
  4. Choose Create a SAML provider.

The SAML provider has now been created.

The following graphic gives a quick demonstration of setting up the SAML provider in OpenSearch Serverless via the preceding steps.
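
You can also create the provider with the AWS SDK. The following is a minimal sketch, assuming Python with boto3; the provider name and metadata file are placeholders, and the group attribute matches the one configured in the Okta application.

import boto3

aoss = boto3.client("opensearchserverless")

metadata_xml = open("okta-metadata.xml").read()  # copied from the Okta metadata URL

aoss.create_security_config(
    name="okta",                      # placeholder provider name
    type="saml",
    description="Okta IdP for OpenSearch Serverless",
    samlOptions={
        "metadata": metadata_xml,
        "groupAttribute": "mygroup",  # the Group Attribute Statement name set in Okta
        "sessionTimeout": 60,         # minutes
    },
)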

Update the data access policy

You need to configure the right permissions in the data access policies associated with your OpenSearch collection so your Okta group members can access the OpenSearch Dashboards endpoint.

  1. On the OpenSearch Serverless console, open your collection.
  2. Choose the data access policy associated with the collection in the Data Access section.
  3. Choose Edit.
  4. Choose Principals and Add a SAML principal.
  5. Select the SAML provider you created earlier and enter group/opensearch-serverless next to it. (A scripted version of this policy update follows these steps.)
  6. The OpenSearch Dashboards endpoint can be accessed by all group members. You can grant access to collections, indexes, or both.
  7. Choose Save.
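
If you manage policies as code, the equivalent update looks roughly like the following, assuming Python with boto3; the collection name, policy name, policy version, account ID, and granted permissions are placeholders to adapt. The SAML principal format is saml/<account-id>/<provider-name>/group/<group-name>.

import json

import boto3

aoss = boto3.client("opensearchserverless")

policy = [{
    "Rules": [
        {"ResourceType": "collection",
         "Resource": ["collection/my-collection"],
         "Permission": ["aoss:DescribeCollectionItems"]},
        {"ResourceType": "index",
         "Resource": ["index/my-collection/*"],
         "Permission": ["aoss:ReadDocument", "aoss:DescribeIndex"]},
    ],
    "Principal": ["saml/123456789012/okta/group/opensearch-serverless"],  # placeholder account ID
}]

aoss.update_access_policy(
    name="my-collection-access",  # placeholder policy name
    type="data",
    policyVersion="1",            # current version, from get_access_policy
    policy=json.dumps(policy),
)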

Log in to OpenSearch Dashboards

Now that you have set permissions to access the dashboards, choose the Dashboards URL under the general information for the OpenSearch Serverless collection. This should take you to https://collection-endpoint/_dashboards/.

You will see a list with all the access options. Choose the SAML provider that you created (okta in this case) and log in using your Okta credentials. You will now be logged into OpenSearch Dashboards with the permissions that are part of the data access policy. You can perform searches or create visualizations from the dashboard.

Clean up

To avoid unwanted charges, delete the OpenSearch Serverless collection, data access policy, and SAML provider created as part of this demonstration.

Summary

In this post, you learned how to set up Okta as an IdP to access OpenSearch Dashboards using SAML. You also learned how to set up users and groups within Okta and configure their access to OpenSearch Dashboards. For more details, refer to SAML authentication for Amazon OpenSearch Serverless.

You can also refer to the Getting started with Amazon OpenSearch Serverless workshop to learn more about OpenSearch Serverless.

If you have feedback about this post, submit it in the comments section. If you have questions about this post, start a new thread on the OpenSearch Service forum or contact AWS Support.


About the Authors

Aish Gunasekar is a Specialist Solutions architect with a focus on Amazon OpenSearch Service. Her passion at AWS is to help customers design highly scalable architectures and help them in their cloud adoption journey. Outside of work, she enjoys hiking and baking.

Prashant Agrawal is a Sr. Search Specialist Solutions Architect with Amazon OpenSearch Service. He works closely with customers to help them migrate their workloads to the cloud and helps existing customers fine-tune their clusters to achieve better performance and save on cost. Before joining AWS, he helped various customers use OpenSearch and Elasticsearch for their search and log analytics use cases. When not working, you can find him traveling and exploring new places. In short, he likes doing Eat → Travel → Repeat.

How FIS ingests and searches vector data for quick ticket resolution with Amazon OpenSearch Service

Post Syndicated from Rupesh Tiwari original https://aws.amazon.com/blogs/big-data/how-fis-ingests-and-searches-vector-data-for-quick-ticket-resolution-with-amazon-opensearch-service/

This post was co-written by Sheel Saket, Senior Data Science Manager at FIS, and Rupesh Tiwari, Senior Architect at Amazon Web Services.

Do you ever find yourself grappling with multiple defect logging mechanisms, scattered project management tools, and fragmented software development platforms? Have you experienced the frustration of lacking a unified view, hindering your ability to efficiently manage and identify common trending issues within your enterprise? Are you constantly facing challenges when it comes to addressing defects and their impact, causing disruptions in your production cycles?

If these questions resonate with you, then you’re not alone. FIS, a leading technology and services provider, has encountered these very challenges. In their quest for a solution, they teamed up with AWS to tackle these obstacles head-on. In this post, we take you on a journey through their collaborative project, exploring how they used Amazon OpenSearch Service to transform their operations, enhance efficiency, and gain valuable insights.

This post shares FIS’s journey in overcoming challenges and provides step-by-step instructions for provisioning the solution architecture in your AWS account. You’ll learn how to implement a transformative solution that empowers your organization with near-real-time data indexing and visualization capabilities.

In the following sections, we dive into the details of FIS’s journey and discover how they overcame these challenges, revolutionizing their approach to defect management and software development.

Challenges for near-real-time ticket visualization and search

FIS faced several challenges in achieving near-real-time ticket visualization and search capabilities, including the following:

  • Integrating ticket data from tens of different third-party systems
  • Overcoming API call thresholds and limitations from various systems
  • Implementing an efficient KNN vector search algorithm for resolving issues and performing trend analysis
  • Establishing a robust data ingestion and indexing process for real-time updates from 15,000 tickets per day
  • Ensuring unified access to ticket information across 20 development teams
  • Providing secure and scalable access to ticket data for up to 250 teams

Despite these challenges, FIS successfully enhanced their operational efficiency, enabled quick ticket resolution, and gained valuable insights through the integration of OpenSearch Service.

Let’s delve into the technical walkthrough of the architecture diagram and mechanisms. The following section provides step-by-step instructions for provisioning and implementing the solution on your AWS Management Console, along with a helpful video tutorial.

Solution overview

The architecture diagram of FIS’s near-real-time data indexing and visualization solution incorporates various AWS services for specific functions. The solution uses GitHub as the data source, employs Amazon Simple Storage Service (Amazon S3) for scalable storage, manages APIs with Amazon API Gateway, performs serverless computing using AWS Lambda, and facilitates data streaming and ETL (extract, transform, and load) processes through Amazon Kinesis Data Streams and Amazon Kinesis Data Firehose. OpenSearch Service is employed for analytics and application monitoring. This architecture ensures a robust and scalable solution, enabling FIS to efficiently index and visualize data in near-real time. With these AWS services, FIS effectively manages their data pipeline and gains valuable insights for their business processes.

The following diagram illustrates the solution architecture.

Architecture Diagram

The workflow includes the following steps:

  1. GitHub webhook events stream data to both Amazon S3 and OpenSearch Service, facilitating real-time data analysis.
  2. An API Gateway REST API receives the webhook payloads and invokes a Lambda function, which processes and structures them.
  3. The Lambda function adds the structured data to a Kinesis data stream, enabling immediate data streaming and quick ticket insights.
  4. Kinesis Data Firehose streams the records from the Kinesis data stream to an S3 bucket, simultaneously creating an index in OpenSearch Service.
  5. OpenSearch Service uses the indexed data to provide near-real-time visualization and enable efficient ticket analysis through K-Nearest Neighbor (KNN) search, enhancing productivity and optimizing data operations.

The following sections provide step-by-step instructions for setting up the solution. Additionally, we have created a video guide that demonstrates each step in detail. You are welcome to watch the video and follow along with this post if you prefer.

Prerequisites

You should have the following prerequisites:

Implement the solution

Complete the following steps to implement the solution:

  1. Create an OpenSearch Service domain.
  2. Create an S3 bucket named git-data.
  3. Create a Kinesis data stream named git-data-stream.
  4. Create a Firehose delivery stream named git-data-delivery-stream with git-data-stream as the source and git-data as the destination, and a buffer interval of 60 seconds.
  5. Create a Lambda function named git-webhook-handler with a timeout of 5 minutes. Add code to put the incoming data on the Kinesis data stream (a sketch of this handler follows the list).
  6. Grant the Lambda function’s execution role permission to put_record on the Kinesis data stream.
  7. Create a REST API in API Gateway named git-webhook-handler-api. Create a resource named git-data with a POST method, integrate it with the Lambda function git-webhook-handler created in the previous step, and deploy the REST API.
  8. Create a delivery stream with the Kinesis data stream as the source and OpenSearch Service as the destination. Provide the AWS Identity and Access Management (IAM) role for Kinesis Data Firehose with the necessary permissions to create an index in OpenSearch Service. Finally, add the IAM role as a backend service in OpenSearch Service.
  9. Navigate to your GitHub repository and create a webhook to enable seamless integration with the solution. Copy the REST API URL and enter it in the newly created webhook.
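For step 6, a minimal sketch using boto3 follows. The role name and stream ARN are hypothetical placeholders; substitute the execution role of your Lambda function and the ARN of your data stream.

import json
import boto3

iam = boto3.client("iam")

ROLE_NAME = "git-webhook-handler-role"  # hypothetical execution role name
STREAM_ARN = "arn:aws:kinesis:us-east-1:123456789012:stream/git-data-stream"

policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": ["kinesis:PutRecord"],
            "Resource": STREAM_ARN,
        }
    ],
}

# Attach the permission as an inline policy on the execution role.
iam.put_role_policy(
    RoleName=ROLE_NAME,
    PolicyName="AllowPutRecordOnGitDataStream",
    PolicyDocument=json.dumps(policy),
)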

Test the solution

To test the solution, complete the following steps:

  1. Go to your GitHub repository, choose the Star button, and verify that you receive a response with a status code of 200.
  2. Also, check for the ShardId and SequenceNumber in the recent deliveries to confirm successful event addition to the Kinesis data stream.

Kinesis data stream

  3. On the Kinesis console, use the Data Viewer to confirm the arrival of data records.

kinesis record data

  4. Navigate to OpenSearch Dashboards and choose Dev Tools.
  5. Search for the records and observe that all the Git events are displayed in the result pane.

opensearch devtool

  6. On the Amazon S3 console, open the bucket and view the data records.

s3 bucket records

Security

We adhere to IAM best practices to uphold security:

  1. Craft a Lambda execution role for read/write operations on the Kinesis data stream.
  2. Generate an IAM role for Kinesis Data Firehose to manage Amazon S3 and OpenSearch
    Service access.
  3. Link this IAM role in OpenSearch Service security to confer backend user privileges.

Clean up

To avoid incurring future charges, delete all the resources you created.

Benefits of near-real-time ticket visualization and search

During our demonstration, we showcased the utilization of GitHub as the streaming data source. However, it’s important to note that the solution we presented has the flexibility to scale and incorporate multiple data sources from various services. This allows for the consolidation and visualization of diverse data in near-real time, using the capabilities of OpenSearch Service.

With the implementation of the solution described in this post, FIS effectively overcame all the challenges they faced.

In this section, we delve into the details of the challenges and benefits they achieved:

  • Integrating ticket data from multiple third-party systems – Near-real-time data streaming ensures an up-to-date information flow from third-party providers for timely insights
  • Overcoming API call thresholds and limitations imposed by different systems – Unrestricted data flow with no threshold or rate limiting enables seamless integration and continuous updates
  • Accommodating scalability requirements for up to 250 teams – The asynchronous, serverless architecture scales to more than 250 times the baseline load without infrastructure modifications
  • Efficiently resolving tickets and performing trend analysis – OpenSearch Service semantic KNN search identifies duplicates and defects, and optimizes operations for improved efficiency
  • Gaining valuable insights for business processes – Artificial intelligence (AI) and machine
    learning (ML) analytics use the data stored in the S3 bucket, empowering deeper insights and informed decision-making
  • Ensuring secure access to ticket data and regulatory compliance – Secure data access and compliance with data protection regulations ensure data privacy and regulatory compliance

Conclusion

FIS, in collaboration with AWS, successfully addressed several challenges to achieve near-real-time ticket visualization and search capabilities. With OpenSearch Service, FIS enhanced operational efficiency by efficiently resolving tickets and performing trend analysis. With their data ingestion and indexing process, FIS processed 15,000 tickets per day in real time. The solution provided secure and scalable access to ticket data for more than 250 teams, enabling unified collaboration. FIS experienced a remarkable 30% reduction in ticket resolution time, empowering teams to quickly address issues.

As Sheel Saket, Senior Data Science Manager at FIS, states, “Our near-real-time solution transformed how we identify and resolve tickets, improving our overall productivity.”

Furthermore, organizations can further improve the solution by adopting Amazon OpenSearch Ingestion for data ingestion, which offers cost savings and out-of-the-box data processing capabilities. By embracing this transformative solution, organizations can optimize their ticket management, drive productivity, and deliver exceptional experiences to customers.

Want to know more? You can reach out to FIS through their official FIS contact page, follow FIS on Twitter, and visit the FIS LinkedIn page.


About the Author

Rupesh Tiwari is a Senior Solutions Architect at AWS in New York City, with a focus on Financial Services. He has over 18 years of IT experience in the finance, insurance, and education domains, and specializes in architecting large-scale applications and cloud-native big data workloads. In his spare time, Rupesh enjoys singing karaoke, watching comedy TV series, and creating joyful moments with his family.

Sheel Saket is a Senior Data Science Manager at FIS in Chicago, Illinois. He has over 11 years of IT experience in the finance, insurance, and e-commerce domains, and specializes in architecting large-scale AI solutions and cloud MLOps. In his spare time, Sheel enjoys listening to audiobooks, podcasts, and watching movies with his family.

AWS Week in Review – Agents for Amazon Bedrock, Amazon SageMaker Canvas New Capabilities, and More – July 31, 2023

Post Syndicated from Donnie Prakoso original https://aws.amazon.com/blogs/aws/aws-week-in-review-agents-for-amazon-bedrock-amazon-sagemaker-canvas-new-capabilities-and-more-july-31-2023/

This July, AWS communities in ASEAN made history. First, the AWS User Group Malaysia recently held the first AWS Community Day in Malaysia.

Another significant milestone was achieved by the AWS User Group Philippines, which celebrated its tenth anniversary by running 2 days of AWS Community Day Philippines. At the event, Jeff Barr shared his experiences attending an AWS User Group meetup in Manila, Philippines 10 years ago.

Big congratulations to the AWS Community Heroes, AWS Community Builders, AWS User Group leaders, and all volunteers who organized and delivered AWS Community Days! Also, thank you to everyone who attended and helped support our AWS communities.

Last Week’s Launches
We had interesting launches last week, including from AWS Summit, New York. Here are some of my personal highlights:

(Preview) Agents for Amazon Bedrock – You can now create managed agents for Amazon Bedrock to handle tasks using API calls to company systems, understand user requests, break down complex tasks into steps, hold conversations to gather more information, and take actions to fulfill requests.

(Coming Soon) New LLM Capabilities in Amazon QuickSight Q – We are expanding the innovation in QuickSight Q by introducing new LLM capabilities through Amazon Bedrock. These Generative BI capabilities will allow organizations to easily explore data, uncover insights, and facilitate sharing of insights.

AWS Glue Studio support for Amazon CodeWhisperer – You can now write specific tasks in natural language (English) as comments in the Glue Studio notebook, and Amazon CodeWhisperer provides code recommendations for you.

(Preview) Vector Engine for Amazon OpenSearch Serverless – This capability empowers you to create modern ML-augmented search experiences and generative AI applications without the need to handle the complexities of managing the underlying vector database infrastructure.

Last week, Amazon SageMaker Canvas also released a set of new capabilities:

AWS Open-Source Updates
As always, my colleague Ricardo has curated the latest updates for open-source news at AWS. Here are some of the highlights.

cdk-aws-observability-accelerator is a set of opinionated modules to help you set up observability for your AWS environments with AWS native services and AWS-managed observability services such as Amazon Managed Service for Prometheus, Amazon Managed Grafana, AWS Distro for OpenTelemetry (ADOT) and Amazon CloudWatch.

iac-devtools-cli-for-cdk is a command line interface tool that automates many of the tedious tasks of building, adding to, documenting, and extending AWS CDK applications.

Upcoming AWS Events
There are upcoming events that you can join to learn. Let’s start with AWS events:

And let’s learn from our fellow builders and join AWS Community Days:

Open for Registration for AWS re:Invent
We want to be sure you know that AWS re:Invent registration is now open!


This learning conference hosted by AWS for the global cloud computing community will be held from November 27 to December 1, 2023, in Las Vegas.

Pro-tip: You can use information on the Justify Your Trip page to prove the value of your trip to AWS re:Invent.

Give Us Your Feedback
We’re focused on improving our content to provide a better customer experience, and we need your feedback to do so. Please take this quick survey to share insights on your experience with the AWS Blog. Note that this survey is hosted by an external company, so the link does not lead to our website. AWS handles your information as described in the AWS Privacy Notice.

That’s all for this week. Check back next Monday for another Week in Review.

Happy building!

Donnie

This post is part of our Week in Review series. Check back each week for a quick round-up of interesting news and announcements from AWS!



Introducing the vector engine for Amazon OpenSearch Serverless, now in preview

Post Syndicated from Pavani Baddepudi original https://aws.amazon.com/blogs/big-data/introducing-the-vector-engine-for-amazon-opensearch-serverless-now-in-preview/

We are pleased to announce the preview release of the vector engine for Amazon OpenSearch Serverless. The vector engine provides a simple, scalable, and high-performing similarity search capability in Amazon OpenSearch Serverless that makes it easy for you to build modern machine learning (ML) augmented search experiences and generative artificial intelligence (AI) applications without having to manage the underlying vector database infrastructure. This post summarizes the features and functionalities of our vector engine.

Using augmented ML search and generative AI with vector embeddings

Organizations across all verticals are rapidly adopting generative AI for its ability to handle vast datasets, generate automated content, and provide interactive, human-like responses. Customers are exploring ways to transform the end-user experience and interaction with their digital platform by integrating advanced conversational generative AI applications such as chatbots, question and answer systems, and personalized recommendations. These conversational applications enable you to search and query in natural language and generate responses that closely resemble human-like responses by accounting for the semantic meaning, user intent, and query context.

ML-augmented search applications and generative AI applications use vector embeddings, which are numerical representations of text, image, audio, and video data to generate dynamic and relevant content. The vector embeddings are trained on your private data and represent the semantic and contextual attributes of the information. Ideally, these embeddings can be stored and managed close to your domain-specific datasets, such as within your existing search engine or database. This enables you to process a user’s query to find the closest vectors and combine them with additional metadata without relying on external data sources or additional application code to integrate the results. Customers want a vector database option that is simple to build on and enables them to move quickly from prototyping to production so they can focus on creating differentiated applications. The vector engine for OpenSearch Serverless extends OpenSearch’s search capabilities by enabling you to store, search, and retrieve billions of vector embeddings in real time and perform accurate similarity matching and semantic searches without having to think about the underlying infrastructure.

Exploring the vector engine’s capabilities

Built on OpenSearch Serverless, the vector engine inherits and benefits from its robust architecture. With the vector engine, you don’t have to worry about sizing, tuning, and scaling the backend infrastructure. The vector engine automatically adjusts resources by adapting to changing workload patterns and demand to provide consistently fast performance and scale. As the number of vectors grows from a few thousand during prototyping to hundreds of millions and beyond in production, the vector engine will scale seamlessly, without the need for reindexing or reloading your data to scale your infrastructure. Additionally, the vector engine has separate compute for indexing and search workloads, so you can seamlessly ingest, update, and delete vectors in real time while ensuring that the query performance your users experience remains unaffected. All the data is persisted in Amazon Simple Storage Service (Amazon S3), so you get the same data durability guarantees as Amazon S3 (eleven nines). Even though we are still in preview, the vector engine is designed for production workloads with redundancy for Availability Zone outages and infrastructure failures.

The vector engine for OpenSearch Serverless is powered by the k-nearest neighbor (kNN) search feature in the open-source OpenSearch Project, proven to deliver reliable and precise results. Many customers today are using OpenSearch kNN search in managed clusters for offering semantic search and personalization in their applications. With the vector engine, you can get the same functionality with the simplicity of a serverless environment. The vector engine supports the popular distance metrics such as Euclidean, cosine similarity, and dot product, and can accommodate 16,000 dimensions, making it well-suited to support a wide range of foundational and other AI/ML models. You can also store diverse fields with various data types such as numeric, boolean, date, keyword, geopoint for metadata, and text for descriptive information to add more context to the stored vectors. Colocating the data types reduces the complexity and maintainability and avoids data duplication, version compatibility challenges, and licensing issues, effectively simplifying your application stack. Because the vector engine supports the same OpenSearch open-source suite APIs, you can take advantage of its rich query capabilities, such as full text search, advanced filtering, aggregations, geo-spatial query, nested queries for faster retrieval of data, and enhanced search results. For example, if your use case requires you to find the results within 15 miles of the requestor, the vector engine can do this in a single query, eliminating the need for maintaining two different systems and then combining the results through application logic. With support for integration with LangChain, Amazon Bedrock, and Amazon SageMaker, you can easily integrate your preferred ML and AI system with the vector engine.

The vector engine supports a wide range of use cases across various domains, including image search, document search, music retrieval, product recommendation, video search, location-based search, fraud detection, and anomaly detection. We also anticipate a growing trend for hybrid searches that combine lexical search methods with advanced ML and generative AI capabilities. For example, when a user searches for a “red shirt” on your e-commerce website, semantic search helps expand the scope by retrieving all shades of red, while preserving the tuning and boosting logic implemented on the lexical (BM25) search. With OpenSearch filtering, you can further enhance the relevance of your search results by providing users with options to refine their search based on size, brand, price range, and availability in nearby stores, allowing for a more personalized and precise experience. The hybrid search support in the vector engine enables you to query vector embeddings, metadata, and descriptive information within a single query call, making it easy to provide more accurate and contextually relevant search results without building complex application code.
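As an illustration of that hybrid pattern, the following Python sketch combines an approximate kNN clause with metadata filters in a single query body. The field names (product_vector, brand, price) and the embedding are stand-ins, not part of any published schema.

# Stand-in for the embedding of the query text "red shirt".
query_embedding = [0.1] * 768

hybrid_query = {
    "size": 10,
    "query": {
        "bool": {
            # Metadata refinements preserved alongside the vector match.
            "filter": [
                {"term": {"brand": "anybrand"}},
                {"range": {"price": {"lte": 50}}},
            ],
            "must": [
                {
                    "knn": {
                        "product_vector": {
                            "vector": query_embedding,
                            "k": 10,
                        }
                    }
                }
            ],
        }
    },
}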

You can get started in minutes with the vector engine by creating a specialized vector search collection under OpenSearch Serverless using the AWS Management Console, AWS Command Line Interface (AWS CLI), or the AWS software development kit (AWS SDK). Collections are a logical grouping of indexed data that works together to support a workload, while the physical resources are automatically managed in the backend. You don’t have to declare how much compute or storage is needed or monitor the system to make sure it’s running well. OpenSearch Serverless applies different sharding and indexing strategies for the three available collection types: time series, search, and vector search. The vector engine’s compute capacity for data ingestion, search, and query is measured in OpenSearch Compute Units (OCUs). One OCU can handle 4 million vectors at 128 dimensions, or 500K at 768 dimensions, at a 99% recall rate. The vector engine is built on OpenSearch Serverless, which is a highly available service and requires a minimum of 4 OCUs (two OCUs for ingest, including primary and standby, and two OCUs for search, with two active replicas across Availability Zones) for the first collection in an account. All subsequent collections using the same AWS Key Management Service (AWS KMS) key can share those OCUs.

Get started with vector embeddings

To get started using vector embeddings using the console, complete the following steps:

  1. Create a new collection on the OpenSearch Serverless console.
  2. Provide a name and optional description.
  3. Currently, vector embeddings are supported exclusively by vector search collections; therefore, for Collection type, select Vector search.
  4. Next, you must configure the security policies, which includes encryption, network, and data access policies.

We are introducing the new Easy create option, which streamlines the security configuration for faster onboarding. All the data in the vector engine is encrypted in transit and at rest by default. You can choose to bring your own encryption key or use the one provided by the service that is dedicated to your collection or account. You can choose to host your collection on a public endpoint or within a VPC. The vector engine supports fine-grained AWS Identity and Access Management (IAM) permissions so that you can define who can create, update, and delete encryption, network, collections, and indexes, thereby enabling organizational alignment.

  5. With the security settings in place, you can finish creating the collection.

After the collection is successfully created, you can create the vector index. At this point, you can use the API or the console to create an index. An index is a collection of documents with a common data schema and provides a way for you to store, search, and retrieve your vector embeddings and other fields. The vector index supports up to 1,000 fields.

  6. To create the vector index, you must define the vector field name, dimensions, and the distance metric.

The vector index supports up to 16,000 dimensions and three types of distance metrics: Euclidean, cosine, and dot product.

Once you have successfully created the index, you can use OpenSearch’s powerful query capabilities to get comprehensive search results.

The following example shows how easily you can create a simple property listing index with the title, description, price, and location details as fields using the OpenSearch API. By using the query APIs, this index can efficiently provide accurate results to match your search requests, such as “Find me a two-bedroom apartment in Seattle that is under $3000.”
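As a rough sketch of that flow using the opensearch-py client, the following Python code creates such an index and runs a kNN query. The endpoint, embedding dimension, and HNSW engine settings are assumptions; adapt them to your own model and collection.

import boto3
from opensearchpy import OpenSearch, RequestsHttpConnection, AWSV4SignerAuth

HOST = "abc123xyz.us-east-1.aoss.amazonaws.com"  # hypothetical collection endpoint
REGION = "us-east-1"

credentials = boto3.Session().get_credentials()
auth = AWSV4SignerAuth(credentials, REGION, "aoss")  # sign for OpenSearch Serverless

client = OpenSearch(
    hosts=[{"host": HOST, "port": 443}],
    http_auth=auth,
    use_ssl=True,
    verify_certs=True,
    connection_class=RequestsHttpConnection,
)

# Create a property-listing index: one vector field plus metadata fields.
client.indices.create(
    index="property-listings",
    body={
        "settings": {"index.knn": True},
        "mappings": {
            "properties": {
                "listing_vector": {
                    "type": "knn_vector",
                    "dimension": 768,        # must match your embedding model
                    "method": {
                        "name": "hnsw",
                        "engine": "nmslib",  # engine choice is an assumption
                        "space_type": "cosinesimil",
                    },
                },
                "title": {"type": "text"},
                "description": {"type": "text"},
                "price": {"type": "float"},
                "location": {"type": "geo_point"},
            }
        },
    },
)

# At query time, embed the user's search text with your model, then run
# an approximate kNN search against the vector field.
query_vector = [0.1] * 768  # stand-in for the query embedding
results = client.search(
    index="property-listings",
    body={
        "size": 3,
        "query": {"knn": {"listing_vector": {"vector": query_vector, "k": 3}}},
    },
)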

From preview to GA and beyond

Today, we are excited to announce the preview of the vector engine, making it available for you to begin testing it out immediately. As we noted earlier, OpenSearch Serverless was designed to provide a highly available service to power your enterprise applications, with independent compute resources for index and search and built-in redundancy.

We recognize that many of you are in the experimentation phase and would like a more economical option for dev-test. Prior to GA, we plan to offer two features that will enable us to reduce the cost of your first collection. The first is a new dev-test option that enables you to launch a collection with no active standby or replica, reducing the entry cost by 50%. The vector engine still provides durability guarantees because it persists all the data in Amazon S3. The second is to initially provision a 0.5 OCU footprint, which will scale up as needed to support your workload, further lowering costs if your initial workload is in the tens of thousands to low-hundreds of thousands of vectors (depending on the number of dimensions). Between these two features, we will reduce the minimum OCUs needed to power your first collection from 4 OCUs down to 1 OCU per hour.

We are also working on features that will allow us to achieve workload pause and resume capabilities in the coming months, which is particularly useful for the vector engine because many of these use cases don’t require continuous indexing of the data.

Lastly, we are diligently focused on optimizing the performance and memory usage of the vector graphs, including improvements to caching, merging, and more.

While we work on these cost reductions, we will be offering the first 1400 OCU-hours per month free on vector collections until the dev-test option is made available. This will enable you to test the vector engine preview for up to two weeks every month at no cost, based on your workload.

Summary

The vector engine for OpenSearch Serverless introduces a simple, scalable, and high-performing vector storage and search capability that makes it straightforward for you to quickly store and query billions of vector embeddings generated from a variety of ML models, such as those provided by Amazon Bedrock, with response times in milliseconds.

The preview release of vector engine for OpenSearch Serverless is now available in eight Regions globally: US East (Ohio), US East (N. Virginia), US West (Oregon), Asia Pacific (Singapore), Asia Pacific (Sydney), Asia Pacific (Tokyo), Europe (Frankfurt), and Europe (Ireland).

We are excited about the future ahead, and your feedback will play a vital role in guiding the progress of this product. We encourage you to try out the vector engine for OpenSearch Serverless and share your use cases, questions, and feedback in the comments section.

In the coming weeks, we will be publishing a series of posts to provide you with detailed guidance on how to integrate the vector engine with LangChain, Amazon Bedrock, and SageMaker. To learn more about the vector engine’s capabilities, refer to our Getting Started with Amazon OpenSearch Serverless documentation.


About the authors

Pavani Baddepudi is a Principal Product Manager for Search Services at AWS and the lead PM for OpenSearch Serverless. Her interests include distributed systems, networking, and security. When not working, she enjoys hiking and exploring new cuisines.

Carl Meadows is Director of Product Management at AWS and is responsible for Amazon Elasticsearch Service, OpenSearch, Open Distro for Elasticsearch, and Amazon CloudSearch. Carl has been with Amazon Elasticsearch Service since before it was launched in 2015. He has a long history of working in the enterprise software and cloud services spaces. When not working, Carl enjoys making and recording music.

Content Repository for Unstructured Data with Multilingual Semantic Search: Part 2

Post Syndicated from Patrik Nagel original https://aws.amazon.com/blogs/architecture/content-repository-for-unstructured-data-with-multilingual-semantic-search-part-2/

Leveraging vast unstructured data poses challenges, particularly for global businesses needing cross-language data search. In Part 1 of this blog series, we built the architectural foundation for the content repository. The key component of Part 1 was the dynamic access control-based logic with a web UI to upload documents.

In Part 2, we extend the content repository with multilingual semantic search capabilities while maintaining the access control logic from Part 1. This allows users to ingest documents into the content repository in multiple languages and then run search queries that return references to semantically similar documents.

Solution overview

Building on the architectural foundation from Part 1, we introduce four new building blocks to extend the search functionality.

Optical character recognition (OCR) workflow: To automatically identify, understand, and extract text from ingested documents, we use Amazon Textract and a sample review dataset of .png format documents (Figure 1). We use Amazon Textract synchronous application programming interfaces (APIs) to capture key-value pairs for the reviewid and reviewBody attributes. Based on your specific requirements, you can choose to capture either the complete extracted text or parts of the text.

Sample document for ingestion

Figure 1. Sample document for ingestion
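As a minimal sketch of this extraction step, the following Python code calls the synchronous AnalyzeDocument API with the FORMS feature to surface key-value pairs; the bucket and object names are placeholders.

import boto3

textract = boto3.client("textract")

# Analyze a .png review document stored in S3; FORMS returns the
# key-value pairs (such as reviewid and reviewBody).
response = textract.analyze_document(
    Document={"S3Object": {"Bucket": "source-bucket", "Name": "review-001.png"}},
    FeatureTypes=["FORMS"],
)

# KEY_VALUE_SET blocks carry the form data; resolving a key to its value
# means following the block relationships by ID.
key_value_blocks = [
    b for b in response["Blocks"] if b["BlockType"] == "KEY_VALUE_SET"
]
print(f"Found {len(key_value_blocks)} key/value blocks")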

Embedding generation: To capture the semantic relationship between the text, we use a machine learning (ML) model that maps words and sentences to high-dimensional vector embeddings. You can use Amazon SageMaker, a fully managed ML service, to build, train, and deploy your ML models to production-ready hosted environments. You can also deploy ready-to-use pre-trained models from multiple avenues such as SageMaker JumpStart. For this blog post, we use the open-source pre-trained universal-sentence-encoder-multilingual model from TensorFlow Hub. The model inference endpoint deployed to SageMaker generates embeddings for the document text and the search query. Figure 2 shows an example of the n-dimensional vector generated as output when the reviewBody attribute text is provided to the embeddings model.

Sample embedding representation of the value of reviewBody

Figure 2. Sample embedding representation of the value of reviewBody
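As a minimal sketch of requesting an embedding from the deployed endpoint, the following Python code uses the SageMaker runtime API. The endpoint name is a placeholder, and the request/response shape assumes the TensorFlow Serving convention of {"instances": [...]} in and {"predictions": [...]} out.

import json
import boto3

runtime = boto3.client("sagemaker-runtime")

response = runtime.invoke_endpoint(
    EndpointName="tensorflow-inference-example",  # hypothetical endpoint name
    ContentType="application/json",
    Body=json.dumps({"instances": ["The product works well"]}),
)

result = json.loads(response["Body"].read())
embedding = result["predictions"][0]  # n-dimensional vector for the input text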

Embedding ingestion: To make the embeddings searchable for the content repository users, you can use the k-Nearest Neighbor (k-NN) search feature of Amazon OpenSearch Service. The OpenSearch k-NN plugin provides different methods. For this blog post, we use the Approximate k-NN search approach, based on the Hierarchical Navigable Small World (HNSW) algorithm. HNSW uses a hierarchical set of proximity graphs in multiple layers to improve performance when searching large datasets to find the “nearest neighbors” for the search query text embeddings.

Semantic search: We make the search service accessible as an additional backend logic on Amazon API Gateway. Authenticated content repository users send their search query using the frontend to receive the matching documents. The solution maintains end-to-end access control logic by using the user’s enriched Amazon Cognito provided identity (ID) token claim with the department attribute to compare it with the ingested documents.

Technical architecture

The technical architecture includes two parts:

  1. Implementing multilingual semantic search functionality: Describes the processing workflow for the document that the user uploads; makes the document searchable.
  2. Running input search query: Covers the search workflow for the input query; finds and returns the nearest neighbors of the input text query to the user.

Part 1. Implementing multilingual semantic search functionality

Our previous blog post discussed blocks A through D (Figure 3), including user authentication, ID token enrichment, Amazon Simple Storage Service (Amazon S3) object tags for dynamic access control, and document upload to the source S3 bucket. In the following section, we cover blocks E through H. The overall workflow describes how an unstructured document is ingested into the content repository, run through the backend OCR and embedding generation process, and how the resulting vector embeddings are finally stored in OpenSearch Service.

Technical architecture for implementing multilingual semantic search functionality

Figure 3. Technical architecture for implementing multilingual semantic search functionality

  1. The OCR workflow extracts text from your uploaded documents.
    • The source S3 bucket sends an event notification to Amazon Simple Queue Service (Amazon SQS).
    • The document transformation AWS Lambda function subscribed to the Amazon SQS queue invokes an Amazon Textract API call to extract the text.
  2. The document transformation Lambda function makes an inference request to the encoder model hosted on SageMaker. In this example, the Lambda function submits the reviewBody attribute to the encoder model to generate the embedding.
  3. The document transformation Lambda function writes an output file in the transformed S3 bucket. The text file consists of:
    • The reviewid and reviewBody attributes extracted from Step 1
    • An additional reviewBody_embeddings attribute from Step 2
      Note: The workflow tags the output file with the same S3 object tags as the source document for downstream access control.
  4. The transformed S3 bucket sends an event notification to invoke the indexing Lambda function.
  5. The indexing Lambda function reads the text file content. The indexing Lambda function then makes an OpenSearch index API call, with the source document tag as one of the indexing attributes for access control.

Part 2. Running user-initiated search query

Next, we describe how the user’s request produces query results (Figure 4).

Search query lifecycle

Figure 4. Search query lifecycle

  1. The user enters a search string in the web UI to retrieve relevant documents.
  2. Based on the active sign-in session, the UI passes the user’s ID token to the search endpoint of the API Gateway.
  3. The API Gateway uses Amazon Cognito integration to authorize the search API request.
  4. Once validated, the search API endpoint request invokes the search document Lambda function.
  5. The search document function sends the search query string as the inference request to the encoder model to receive the embedding as the inference response.
  6. The search document function uses the embedding response to build an OpenSearch k-NN search query. The HNSW algorithm is configured with the Lucene engine and its filter option to maintain the access control logic based on the custom department claim from the user’s ID token. The OpenSearch query returns the following for the query embeddings (see the sketch after this list):
    • The top three approximate k-NN results
    • Other attributes, such as reviewid and reviewBody
  7. The workflow sends the relevant query result attributes back to the UI.
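As an illustrative sketch, the search body built in step 6 might look like the following Python dictionary. The vector dimension and the department tag name are assumptions; the filter clause relies on the Lucene engine’s filtering support.

query_embedding = [0.1] * 512  # stand-in for the encoder model's response

search_body = {
    "size": 3,
    "_source": ["reviewid", "reviewBody"],
    "query": {
        "knn": {
            "reviewBody_embeddings": {
                "vector": query_embedding,
                "k": 3,
                # Restrict results to the department claim from the
                # user's ID token to preserve access control.
                "filter": {"term": {"department": "sales"}},
            }
        }
    },
}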

Prerequisites

You must have the following prerequisites for this solution:

Walkthrough

Setup

The following steps deploy two AWS CDK stacks into your AWS account:

  • content-repo-search-stack (blog-content-repo-search-stack.ts) creates the environment detailed in Figure 3, except for the SageMaker endpoint, which you create in a separate step.
  • demo-data-stack (userpool-demo-data-stack.ts) deploys sample users, groups, and role mappings.

To continue setup, use the following commands:

  1. Clone the project Git repository:
    git clone https://github.com/aws-samples/content-repository-with-multilingual-search content-repository
  2. Install the necessary dependencies:
    cd content-repository/backend-cdk 
    npm install
  3. Configure environment variables:
    export CDK_DEFAULT_ACCOUNT=$(aws sts get-caller-identity --query 'Account' --output text)
    export CDK_DEFAULT_REGION=$(aws configure get region)
  4. Bootstrap your account for AWS CDK usage:
    cdk bootstrap aws://$CDK_DEFAULT_ACCOUNT/$CDK_DEFAULT_REGION
  5. Deploy the code to your AWS account:
    cdk deploy --all

The complete stack set-up may take up to 20 minutes.

Creation of SageMaker endpoint

Complete the following steps to create the SageMaker endpoint in the same AWS Region where you deployed the AWS CDK stack.

    1. Sign in to the SageMaker console.
    2. In the navigation menu, select Notebook, then Notebook instances.
    3. Choose Create notebook instance.
    4. Under the Notebook instance settings, enter content-repo-notebook as the notebook instance name, and leave other defaults as-is.
    5. Under the Permissions and encryption section (Figure 5), set the IAM role to the role with the prefix content-repo-search-stack. If the role isn’t automatically populated, select it from the drop-down menu. Leave the rest of the defaults, and choose Create notebook instance.

      Notebook permissions

      Figure 5. Notebook permissions

    6. The notebook instance status shows as Pending; it becomes available for use within 3–4 minutes.
    7. Once the notebook is in the Available status, choose Open Jupyter.
    8. Choose the Upload button and upload the create-sagemaker-endpoint.ipynb file from the backend-cdk folder at the root of the blog repository.
    9. Open the create-sagemaker-endpoint.ipynb notebook. Select the option Run All from the Cell menu (Figure 6). This might take up to 10 minutes.

      Run create-sagemaker-endpoint notebook cells

      Figure 6. Run create-sagemaker-endpoint notebook cells

    10. After all the cells have run successfully, verify that the AWS Systems Manager parameter sagemaker-endpoint is updated with the SageMaker endpoint name. Figure 7 shows an example of the cell output. If you don’t see the output, check that the preceding steps ran correctly.

      SSM parameter updated with SageMaker endpoint

      Figure 7. SSM parameter updated with SageMaker endpoint

    11. Verify in the SageMaker console that the inference endpoint with the prefix tensorflow-inference has been deployed and is set to status InService.
    12. Upload sample data to the content repository:
      • Update the S3_BUCKET_NAME variable in the upload_documents_to_S3.sh script in the root folder of the blog repository with the s3SourceBucketName from the AWS CDK output of the content-repo-search-stack.
      • Run the upload_documents_to_S3.sh script to upload 150 sample documents to the content repository. This takes 5–6 minutes. During this process, the uploaded documents trigger the workflow described in the Implementing multilingual semantic search functionality section.

Using the search service

At this stage, you have deployed all the building blocks for the content repository in your AWS account. Next, as part of uploading the sample data to the content repository, you pushed a limited corpus of 150 sample documents (.png format). Each document is in one of four languages: English, German, Spanish, or French. With the added multilingual search capability, you can query in one language and receive semantically similar results across different languages while maintaining the access control logic.

  1. Access the frontend application:
    • Copy the amplifyHostedAppUrl value of the AWS CDK output from the content-repo-search-stack shown in the terminal.
    • Enter the URL in your web browser to access the frontend application.
    • A temporary page displays until the automated build and deployment of the React application completes after 4-5 minutes.
  2. Sign into the application:
    • The content repository provides two demo users with credentials as part of the demo-data-stack in the AWS CDK output. Copy the password from the terminal associated with the sales-user, which belongs to the sales department.
    • Follow the prompts from the React webpage to sign in with the sales-user and change the temporary password.
  3. Enter search queries and verify results. The search action invokes the workflow described in Running input search query. For example:
    • Enter works well as the search query. Note the multilingual output and the semantically similar results (Figure 8).

      Positive sentiment multilingual search result for the sales-user

        Figure 8. Positive sentiment multilingual search result for the sales-user

    • Enter bad quality as the search query. Note the multilingual output and the semantically similar results (Figure 9).

      Negative sentiment multilingual search result for the sales-user

      Figure 9. Negative sentiment multilingual search result for the sales-user

  4. Sign out as the sales-user with the Log Out button on the webpage.
  5. Sign in using the marketing-user credentials to verify access control:
    • Follow the sign in procedure in step 2 but with the marketing-user.
    • This time, with works well as the search query, you find different output, because access control only allows the marketing-user to search documents that belong to the marketing department (Figure 10).

      Positive sentiment multilingual search result for the marketing-user

      Figure 10. Positive sentiment multilingual search result for the marketing-user

Cleanup

In the backend-cdk subdirectory of the cloned repository, delete the deployed resources: cdk destroy --all.

Additionally, you need to access the Amazon SageMaker console to delete the SageMaker endpoint and notebook instance created as part of the Walkthrough setup section.

Conclusion

In this blog, we enriched the content repository with multilingual semantic search features while maintaining the access control fundamentals that we implemented in Part 1. The building blocks of the semantic search for unstructured documents—Amazon Textract, Amazon SageMaker, and Amazon OpenSearch Service—set a foundation for you to customize and enhance the search capabilities for your specific use case. For example, you can leverage the fast developments in large language models (LLMs) to enhance the semantic search experience. You can replace the encoder model with an LLM capable of generating multilingual embeddings while still using OpenSearch Service to store and index data and perform vector search.

Alcion supports their multi-tenant platform with Amazon OpenSearch Serverless

Post Syndicated from Zack Rossman original https://aws.amazon.com/blogs/big-data/alcion-supports-their-multi-tenant-platform-with-amazon-opensearch-serverless/

This is a guest blog post co-written with Zack Rossman from Alcion.

Alcion, a security-first, AI-driven backup-as-a-service (BaaS) platform, helps Microsoft 365 administrators quickly and intuitively protect data from cyber threats and accidental data loss. In the event of data loss, Alcion customers need to search metadata for the backed-up items (files, emails, contacts, events, and so on) to select specific item versions for restore. Alcion uses Amazon OpenSearch Service to provide their customers with accurate, efficient, and reliable search capability across this backup catalog. The platform is multi-tenant, which means that Alcion requires data isolation and strong security so as to ensure that tenants can only search their own data.

OpenSearch Service is a fully managed service that makes it easy to deploy, scale, and operate OpenSearch in the AWS Cloud. OpenSearch is an Apache-2.0-licensed, open-source search and analytics suite, comprising OpenSearch (a search, analytics engine, and vector database), OpenSearch Dashboards (a visualization and utility user interface), and plugins that provide advanced capabilities like enterprise-grade security, anomaly detection, observability, alerting, and much more. Amazon OpenSearch Serverless is a serverless deployment option that makes it simple to use OpenSearch without configuring, managing, and scaling OpenSearch Service domains.

In this post, we share how adopting OpenSearch Serverless enabled Alcion to meet their scale requirements, reduce their operational overhead, and secure their tenants’ data by enforcing tenant isolation within their multi-tenant environment.

OpenSearch Service managed domains

For the first iteration of their search architecture, Alcion chose the managed domains deployment option in OpenSearch Service and was able to launch their search functionality in production in less than a month. To meet their security, scale, and tenancy requirements, they stored data for each tenant in a dedicated index and used fine-grained access control in OpenSearch Service to prevent cross-tenant data leaks. As their workload evolved, Alcion engineers tracked OpenSearch domain utilization via the provided Amazon CloudWatch metrics, making changes to increase storage and optimize their compute resources.

The team at Alcion used several features of OpenSearch Service managed domains to improve their operational stance. They introduced index aliases, which provide a single alias name to access (read and write) multiple underlying indexes. They also configured Index State Management (ISM) policies to help them control their data lifecycle by rolling indexes over based on index size. Together, the ISM policies and index aliases were necessary to scale indexes for large tenants. Alcion also used index templates to define the shards per index (partitioning) of their data so as to automate their data lifecycle and improve the performance and stability of their domains.
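As a rough sketch of that pattern, the following Python snippet registers an ISM rollover policy with the opensearch-py client. The policy body follows the ISM schema, but the endpoint, index pattern, and size threshold are illustrative, not Alcion’s actual values.

from opensearchpy import OpenSearch

# Connect to the managed domain (endpoint and credentials are placeholders).
client = OpenSearch(
    hosts=[{"host": "search-mydomain.us-east-1.es.amazonaws.com", "port": 443}],
    http_auth=("admin", "admin-password"),
    use_ssl=True,
)

ism_policy = {
    "policy": {
        "description": "Roll tenant indexes over by size",
        "default_state": "hot",
        "states": [
            {
                "name": "hot",
                "actions": [{"rollover": {"min_size": "30gb"}}],
                "transitions": [],
            }
        ],
        "ism_template": [{"index_patterns": ["tenant-index-*"], "priority": 100}],
    }
}

# ISM is exposed under the _plugins/_ism API path.
client.transport.perform_request(
    "PUT", "/_plugins/_ism/policies/tenant-rollover", body=ism_policy
)

# Note: indexes matched by the pattern also need a write alias and the
# plugins.index_state_management.rollover_alias setting for rollover to work.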

The following architecture diagram shows how Alcion configured their OpenSearch managed domains.

The following diagram shows how Microsoft 365 data was indexed to and queried from tenant-specific indexes. Alcion implemented request authentication by providing the OpenSearch primary user credentials with each API request.

OpenSearch Serverless overview and tenancy model options

OpenSearch Service managed domains provided a stable foundation for Alcion’s search functionality, but the team needed to manually provision resources to the domains for their peak workload. This left room for cost optimizations because Alcion’s workload is bursty—there are large variations in the number of search and indexing transactions per second, both for a single customer and taken as a whole. To reduce costs and operational burden, the team turned to OpenSearch Serverless, which offers auto-scaling capability.

To use OpenSearch Serverless, the first step is to create a collection. A collection is a group of OpenSearch indexes that work together to support a specific workload or use case. The compute resources for a collection, called OpenSearch Compute Units (OCUs), are shared across all collections in an account that share an encryption key. The pool of OCUs is automatically scaled up and down to meet the demands of indexing and search traffic.

The level of effort required to migrate from an OpenSearch Service managed domain to OpenSearch Serverless was manageable thanks to the fact that OpenSearch Serverless collections support the same OpenSearch APIs and libraries as OpenSearch Service managed domains. This allowed Alcion to focus on optimizing the tenancy model for the new search architecture. Specifically, the team needed to decide how to partition tenant data within collections and indexes while balancing security and total cost of ownership. Alcion engineers, in collaboration with the OpenSearch Serverless team, considered three tenancy models:

  • Silo model: Create a collection for each tenant
  • Pool model: Create a single collection and use a single index for multiple tenants
  • Bridge model: Create a single collection and use a single index per tenant

All three design choices had benefits and trade-offs that had to be considered for designing the final solution.

Silo model: Create a collection for each tenant

In this model, Alcion would create a new collection whenever a new customer onboarded to their platform. Although tenant data would be cleanly separated between collections, this option was disqualified because the collection creation time meant that customers wouldn’t be able to back up and search data immediately after registration.

Pool model: Create a single collection and use a single index for multiple tenants

In this model, Alcion would create a single collection per AWS account and index tenant-specific data in one of many shared indexes belonging to that collection. Initially, pooling tenant data into shared indexes was attractive from a scale perspective because this led to the most efficient use of index resources. But after further analysis, Alcion found that they would be well within the per-collection index quota even if they allocated one index for each tenant. With that scalability concern resolved, Alcion pursued the third option because siloing tenant data into dedicated indexes results in stronger tenant isolation than the shared index model.

Bridge model: Create a single collection and use a single index per tenant

In this model, Alcion would create a single collection per AWS account and create an index for each of the hundreds of tenants managed by that account. By assigning each tenant to a dedicated index and pooling these indexes in a single collection, Alcion reduced onboarding time for new tenants and siloed tenant data into cleanly separated buckets.

Implementing role-based access control for supporting multi-tenancy

OpenSearch Serverless offers a multi-point, inheritable set of security controls, covering data access, network access, and encryption. Alcion took full advantage of OpenSearch Serverless data access policies to implement role-based access control (RBAC) for each tenant-specific index with the following details:

  • Allocate an index with a common prefix and the tenant ID (for example, index-v1-<tenantID>)
  • Create a dedicated AWS Identity and Access Management (IAM) role that is used to sign requests to the OpenSearch Serverless collection
  • Create an OpenSearch Serverless data access policy that grants document read/write permissions within a dedicated tenant index to the IAM role for that tenant
  • OpenSearch API requests to a tenant index are signed with temporary credentials belonging to the tenant-specific IAM role

The following is an example OpenSearch Serverless data access policy for a mock tenant with ID t-eca0acc1-12345678910. This policy grants the IAM role document read/write access to the dedicated tenant index.

[
    {
        "Rules": [
            {
                "Resource": [
                    "index/collection-searchable-entities/index-v1-t-eca0acc1-12345678910"
                ],
                "Permission": [
                    "aoss:ReadDocument",
                    "aoss:WriteDocument",
                ],
                "ResourceType": "index"
            }
        ],
        "Principal": [
            "arn:aws:iam::12345678910:role/OpenSearchAccess-t-eca0acc1-1b9f-4b3f-95d6-12345678910"
        ],
        "Description": "Allow document read/write access to OpenSearch index belonging to tenant t-eca0acc1-1b9f-4b3f-95d6-12345678910"
    }
] 

The following architecture diagram depicts how Alcion implemented indexing and searching for Microsoft 365 resources using the OpenSearch Serverless shared collection approach.

The following is a sample Go code snippet for sending an API request to an OpenSearch Serverless collection. Notice how the API client is initialized with a signer object that signs requests with the same IAM principal that is linked to the OpenSearch Serverless data access policy from the previous code snippet.

package alcion

import (
	"context"
	"encoding/json"
	"strings"

	"github.com/aws/aws-sdk-go-v2/aws"
	"github.com/aws/aws-sdk-go-v2/config"
	"github.com/aws/aws-sdk-go-v2/credentials/stscreds"
	"github.com/aws/aws-sdk-go-v2/service/sts"
	"github.com/opensearch-project/opensearch-go/v2"
	"github.com/opensearch-project/opensearch-go/v2/opensearchapi"
	"github.com/opensearch-project/opensearch-go/v2/signer"
	awssignerv2 "github.com/opensearch-project/opensearch-go/v2/signer/awsv2"
	"github.com/pkg/errors"
)

const (
	// Scope the API request to the AWS OpenSearch Serverless service
	aossService = "aoss"

	// Mock values
	indexPrefix        = "index-v1-"
	collectionEndpoint = "https://kfbr3928z4y6vot2mbpb.us-east-1.aoss.amazonaws.com"
	tenantID           = "t-eca0acc1-1b9f-4b3f-95d6-b0b96b8c03d0"
	roleARN            = "arn:aws:iam::1234567890:role/OpenSearchAccess-t-eca0acc1-1b9f-4b3f-95d6-b0b96b8c03d0"
)

func SearchTenantIndex(ctx context.Context, tenantID string) (*opensearchapi.Response, error) {

	sig, err := createRequestSigner(ctx)
	if err != nil {
		return nil, errors.Wrapf(err, "error creating new signer for AWS OSS")
	}

	cfg := opensearch.Config{
		Addresses: []string{collectionEndpoint},
		Signer:    sig,
	}

	aossClient, err := opensearch.NewClient(cfg)
	if err != nil {
		return nil, errors.Wrapf(err, "error creating new OpenSearch API client")
	}

	body, err := getSearchBody()
	if err != nil {
		return nil, errors.Wrapf(err, "error getting search body")
	}

	req := opensearchapi.SearchRequest{
		Index: []string{indexPrefix + tenantID},
		Body:  body,
	}

	return req.Do(ctx, aossClient)
}

func createRequestSigner(ctx context.Context) (signer.Signer, error) {

	awsCfg, err := config.LoadDefaultConfig(ctx)
	if err != nil {
		return nil, errors.Wrapf(err, "error loading default config")
	}

	stsClient := sts.NewFromConfig(awsCfg)
	provider := stscreds.NewAssumeRoleProvider(stsClient, roleARN)

	awsCfg.Credentials = aws.NewCredentialsCache(provider)
	return awssignerv2.NewSignerWithService(awsCfg, aossService)
}

func getSearchBody() (*strings.Reader, error) {
	// Match all documents, page size = 10
	query := map[string]interface{}{
		"size": 10,
	}

	queryJson, err := json.Marshal(query)
	if err != nil {
		return nil, err
	}

	return strings.NewReader(string(queryJson)), nil
} 

Conclusion

In May of 2023, Alcion rolled out its search architecture based on the shared collection and dedicated index-per-tenant model in all production and pre-production environments. The team was able to tear out complex code and operational processes that had been dedicated to scaling OpenSearch Service managed domains. Furthermore, thanks to the auto scaling capabilities of OpenSearch Serverless, Alcion has reduced their OpenSearch costs by 30% and expects the cost profile to scale favorably.

In their journey from managed to serverless OpenSearch Service, Alcion benefited from their initial choice of OpenSearch Service managed domains. In migrating forward, they were able to reuse the same OpenSearch APIs and libraries with their OpenSearch Serverless collections that they had used with their OpenSearch Service managed domain. Additionally, they updated their tenancy model to take advantage of OpenSearch Serverless data access policies. With OpenSearch Serverless, they were able to effortlessly adapt to their customers’ scale needs while ensuring tenant isolation.

For more information about Alcion, visit their website.


About the Authors

Zack Rossman is a Member of Technical Staff at Alcion. He is the tech lead for the search and AI platforms. Prior to Alcion, Zack was a Senior Software Engineer at Okta, developing core workforce identity and access management products for the Directories team.

Niraj Jetly is a Software Development Manager for Amazon OpenSearch Serverless. Niraj leads several data plane teams responsible for launching Amazon OpenSearch Serverless. Prior to AWS, Niraj led several product and engineering teams as CTO, VP of Engineering, and Head of Product Management for over 15 years. Niraj is a recipient of over 15 innovation awards, including being named CIO of the year in 2014 and top 100 CIO in 2013 and 2016. A frequent speaker at several conferences, he has been quoted in NPR, WSJ, and The Boston Globe.

Jon Handler is a Senior Principal Solutions Architect at Amazon Web Services based in Palo Alto, CA. Jon works closely with OpenSearch and Amazon OpenSearch Service, providing help and guidance to a broad range of customers who have search and log analytics workloads that they want to move to the AWS Cloud. Prior to joining AWS, Jon’s career as a software developer included 4 years of coding a large-scale, ecommerce search engine. Jon holds a Bachelor of the Arts from the University of Pennsylvania and a Master of Science and a PhD in Computer Science and Artificial Intelligence from Northwestern University.

Build a serverless log analytics pipeline using Amazon OpenSearch Ingestion with managed Amazon OpenSearch Service

Post Syndicated from Hajer Bouafif original https://aws.amazon.com/blogs/big-data/build-a-serverless-log-analytics-pipeline-using-amazon-opensearch-ingestion-with-managed-amazon-opensearch-service/

In this post, we show how to build a log ingestion pipeline using the new Amazon OpenSearch Ingestion, a fully managed data collector that delivers real-time log and trace data to Amazon OpenSearch Service domains. OpenSearch Ingestion is powered by the open-source data collector Data Prepper. Data Prepper is part of the open-source OpenSearch project. With OpenSearch Ingestion, you can filter, enrich, transform, and deliver your data for downstream analysis and visualization. OpenSearch Ingestion is serverless, so you don’t need to worry about scaling your infrastructure, operating your ingestion fleet, and patching or updating the software.

For a comprehensive overview of OpenSearch Ingestion, visit Amazon OpenSearch Ingestion, and for more information about the Data Prepper open-source project, visit Data Prepper.

In this post, we explore the logging infrastructure for a fictitious company, AnyCompany. We explore the components of the end-to-end solution and then show how to configure OpenSearch Ingestion’s main parameters and how the logs come in and out of OpenSearch Ingestion.

Solution overview

Consider a scenario in which AnyCompany collects Apache web logs. They use OpenSearch Service to monitor web access and identify possible root causes to error logs of type 4xx and 5xx. The following architecture diagram outlines the use of every component used in the log analytics pipeline: Fluent Bit collects and forwards logs; OpenSearch Ingestion processes, routes, and ingests logs; and OpenSearch Service analyzes the logs.

The workflow contains the following stages:

  1. Generate and collect – Fluent Bit collects the generated logs and forwards them to OpenSearch Ingestion. In this post, you create fake logs that Fluent Bit forwards to OpenSearch Ingestion. Check the list of supported clients to review the required configuration for each client supported by OpenSearch Ingestion.
  2. Process and ingest – OpenSearch Ingestion filters the logs based on response value, processes the logs using a grok processor, and applies conditional routing to ingest the error logs to an OpenSearch Service index.
  3. Store and analyze – We can analyze the Apache httpd error logs using OpenSearch Dashboards.

Prerequisites

To implement this solution, make sure you have the following prerequisites:

Configure OpenSearch Ingestion

First, you define the appropriate AWS Identity and Access Management (IAM) permissions to write to and from OpenSearch Ingestion. Then you set up the pipeline configuration in the OpenSearch Ingestion. Let’s explore each step in more detail.

Configure IAM permissions

OpenSearch Ingestion works with IAM to secure communications into and out of OpenSearch Ingestion. You need two roles, authenticated using AWS Signature V4 (SigV4) signed requests. The originating entity requires permissions to write to OpenSearch Ingestion. OpenSearch Ingestion requires permissions to write to your OpenSearch Service domain. Finally, you must create an access policy using OpenSearch Service’s fine-grained access control, which allows OpenSearch Ingestion to create indexes and write to them in your domain.

The following diagram illustrates the IAM permissions to allow OpenSearch Ingestion to write to an OpenSearch Service domain. Refer to Setting up roles and users in Amazon OpenSearch Ingestion to get more details on roles and permissions required to use OpenSearch Ingestion.

In the demo, you use the AWS Cloud9 EC2 instance profile's credentials to sign requests sent to OpenSearch Ingestion. You use Fluent Bit to fetch the credentials and assume the role that you pass in the aws_role_arn parameter, which you configure later.

  1. Create an ingestion role (called IngestionRole) to allow Fluent Bit to ingest the logs into your pipeline.

Create a trust relationship to allow Fluent Bit to assume the ingestion role, as shown in the following code. In the access policy for this role, grant permission for the osis:Ingest action.

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Principal": {
                "AWS": "{your-account-id}"
            },
            "Action": "sts:AssumeRole"
        }
    ]
}
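
A minimal sketch of that access policy follows; the pipeline name and the {AWS_Region} and {your-account-id} values are placeholders to replace with your own:

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": "osis:Ingest",
            "Resource": "arn:aws:osis:{AWS_Region}:{your-account-id}:pipeline/log-aggregate-pipeline"
        }
    ]
}
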
  2. Create a pipeline role (called PipelineRole) with a trust relationship for OpenSearch Ingestion to assume that role. The domain-level access policy of the OpenSearch domain grants the pipeline role access to the domain.
  3. Finally, configure your domain's security plugin to enable OpenSearch Ingestion's assumed role to create indexes and write data to the domain.

In this demo, the OpenSearch Service domain uses fine-grained access control for authentication, so you need to map the OpenSearch Ingestion pipeline role to the OpenSearch backend role all_access. For instructions, refer to Step 2: Include the pipeline role in the domain access policy page.

Create the pipeline in OpenSearch Ingestion

To create an OpenSearch Ingestion pipeline, complete the following steps:

  1. On the OpenSearch Service console, choose Pipelines in the navigation pane.
  2. Choose Create pipeline.
  3. For Pipeline name, enter a name.

  4. Input the minimum and maximum Ingestion OpenSearch Compute Units (Ingestion OCUs). In this example, we use the default pipeline capacity settings of minimum 1 Ingestion OCU and maximum 4 Ingestion OCUs.

Each OCU is a combination of approximately 8 GB of memory and 2 vCPUs that can handle an estimated 8 GiB of data per hour. OpenSearch Ingestion supports up to 96 OCUs, and it automatically scales up and down based on your ingest workload demand.

  5. In the Pipeline configuration section, configure Data Prepper to process your data by choosing the appropriate blueprint configuration template on the Configuration blueprints menu. For this post, we choose AWS-LogAggregationWithConditionalRouting.

The OpenSearch Ingestion pipeline configuration consists of four sections:

  • Source – This is the input component of a pipeline. It defines the mechanism through which a pipeline consumes records. In this post, you use the http_source plugin and provide the Fluent Bit output URI value within the path attribute.
  • Processors – This represents an intermediate processing to filter, transform, and enrich your input data. Refer to Supported plugins for more details on the list of operations supported in OpenSearch Ingestion. In this post, we use the grok processor COMMONAPACHELOG, which matches input logs against the common Apache log pattern and makes it easy to query in OpenSearch Service.
  • Sink – This is the output component of a pipeline. It defines one or more destinations to which a pipeline publishes records. In this post, you define an OpenSearch Service domain and index as sink.
  • Route – This is the part of a processor that allows the pipeline to route data into different sinks based on specific conditions. In this example, you create four routes based on the response field value of the log. If the response field value of the log line matches 2xx or 3xx, the log is sent to the OpenSearch Service index aggregated_2xx_3xx. If the response field value matches 4xx, the log is sent to the index aggregated_4xx. If the response field value matches 5xx, the log is sent to the index aggregated_5xx.
  6. Update the blueprint based on your use case. The following code shows an example of the pipeline configuration YAML file:
version: "2"
log-aggregate-pipeline:
  source:
    http:
      # Provide the FluentBit output URI value.
      path: "/log/ingest"
  processor:
    - date:
        from_time_received: true
        destination: "@timestamp"
    - grok:
        match:
          log: [ "%{COMMONAPACHELOG_DATATYPED}" ]
  route:
    - 2xx_status: "/response >= 200 and /response < 300"
    - 3xx_status: "/response >= 300 and /response < 400"
    - 4xx_status: "/response >= 400 and /response < 500"
    - 5xx_status: "/response >= 500 and /response < 600"
  sink:
    - opensearch:
        # Provide an AWS OpenSearch Service domain endpoint
        hosts: [ "{your-domain-endpoint}" ]
        aws:
          # Provide a Role ARN with access to the domain. This role should have a trust relationship with osis-pipelines.amazonaws.com
          sts_role_arn: "arn:aws:iam::{your-account-id}:role/PipelineRole"
          # Provide the region of the domain.
          region: "{AWS_Region}"
        index: "aggregated_2xx_3xx"
        routes:
          - 2xx_status
          - 3xx_status
    - opensearch:
        # Provide an AWS OpenSearch Service domain endpoint
        hosts: [ "{your-domain-endpoint}"  ]
        aws:
          # Provide a Role ARN with access to the domain. This role should have a trust relationship with osis-pipelines.amazonaws.com
          sts_role_arn: "arn:aws:iam::{your-account-id}:role/PipelineRole"
          # Provide the region of the domain.
          region: "{AWS_Region}"
        index: "aggregated_4xx"
        routes:
          - 4xx_status
    - opensearch:
        # Provide an AWS OpenSearch Service domain endpoint
        hosts: [ "{your-domain-endpoint}"  ]
        aws:
          # Provide a Role ARN with access to the domain. This role should have a trust relationship with osis-pipelines.amazonaws.com
          sts_role_arn: "arn:aws:iam::{your-account-id}:role/PipelineRole"
          # Provide the region of the domain.
          region: "{AWS_Region}"
        index: "aggregated_5xx"
        routes:
          - 5xx_status

Provide the relevant values for your domain endpoint, account ID, and Region related to your configuration.

  7. Check the health of your configuration setup by choosing Validate pipeline when you finish the update.

When designing a production workload, deploy your pipeline within a VPC. For instructions, refer to Securing Amazon OpenSearch Ingestion pipelines within a VPC.

  8. For this post, select Public access under Network.

  9. In the Log publishing options section, select Publish to CloudWatch logs and Create new group.

OpenSearch Ingestion uses the log levels of INFO, WARN, ERROR, and FATAL. Enabling log publishing helps you monitor your pipelines in production.

  10. Choose Next and Create pipeline.
  11. Select the pipeline and choose View details to see the progress of the pipeline creation.

Wait until the status changes to Active to start using the pipeline.

Send logs to the OpenSearch Ingestion pipeline

To start sending logs to the OpenSearch Ingestion pipeline, complete the following steps:

  1. On the AWS Cloud9 console, create a Fluent Bit configuration file and update the following attributes:
    • Host – Enter the ingestion URL of your OpenSearch Ingestion pipeline.
    • aws_service – Enter osis.
    • aws_role_arn – Enter the ARN of the IAM role IngestionRole.

The following code shows an example of the fluent-bit.conf file:

[SERVICE]
    parsers_file          ./parsers.conf
    
[INPUT]
    name                  tail
    refresh_interval      5
    path                  /var/log/*.log
    read_from_head        true
[FILTER]
    Name parser
    Key_Name log
    Parser apache
[OUTPUT]
    Name http
    Match *
    Host {Ingestion URL}
    Port 443
    URI /log/ingest
    format json
    aws_auth true
    aws_region {AWS_region}
    aws_role_arn arn:aws:iam::{your-account-id}:role/IngestionRole
    aws_service osis
    Log_Level trace
    tls On
  2. In the AWS Cloud9 environment, create a docker-compose YAML file to deploy Fluent Bit and Flog containers:
version: '3'
services:
  fluent-bit:
    container_name: fluent-bit
    image: docker.io/amazon/aws-for-fluent-bit
    volumes:
      - ./fluent-bit.conf:/fluent-bit/etc/fluent-bit.conf
      - ./apache-logs:/var/log
  flog:
    container_name: flog
    image: mingrammer/flog
    command: flog -t log -f apache_common -o web/log/test.log -w -n 100000 -d 1ms -p 1000
    volumes:
      - ./apache-logs:/web/log

Before you start the Docker containers, you need to update the IAM EC2 instance role in AWS Cloud9 so it can sign the requests sent to OpenSearch Ingestion.

  3. For demo purposes, create an IAM service role and choose EC2 under Use case to allow the AWS Cloud9 EC2 instance to call OpenSearch Ingestion on your behalf.
  4. Add the OpenSearch Ingestion policy, which is the same policy you used with IngestionRole.
  5. Add the AdministratorAccess permission policy to the role as well.

Your role definition should look like the following screenshot.

  6. After you create the role, go back to AWS Cloud9, select your demo environment, and choose View details.
  7. On the EC2 instance tab, choose Manage EC2 instance to view the details of the EC2 instance attached to your AWS Cloud9 environment.

  8. On the Amazon EC2 console, replace the IAM role of your AWS Cloud9 EC2 instance with the new role.
  9. Open a terminal in AWS Cloud9 and run the command docker-compose up.

Check the output in the terminal—if everything is working correctly, you get status 200.

Fluent Bit collects logs from the /var/log directory in the container and pushes the data to the OpenSearch Ingestion pipeline.
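
If you want to test the pipeline without Fluent Bit, the following minimal Python sketch signs a request with SigV4 and posts a sample log line to the ingestion URL. The endpoint and Region values are placeholders, and the sketch assumes the boto3, botocore, and requests packages are installed:

import json

import boto3
import requests
from botocore.auth import SigV4Auth
from botocore.awsrequest import AWSRequest

# Placeholders -- replace with your pipeline's ingestion URL and Region.
endpoint = "https://{your-pipeline-endpoint}.{AWS_Region}.osis.amazonaws.com/log/ingest"
region = "{AWS_Region}"

# One fake Apache common log line, wrapped the way the http source expects.
payload = json.dumps([{"log": '127.0.0.1 - - [15/Jul/2023:10:00:00 +0000] "GET / HTTP/1.1" 200 1024'}])

# Sign the request with SigV4 using the osis service name.
credentials = boto3.Session().get_credentials()
request = AWSRequest(method="POST", url=endpoint, data=payload,
                     headers={"Content-Type": "application/json"})
SigV4Auth(credentials, "osis", region).add_auth(request)

response = requests.post(endpoint, data=payload, headers=dict(request.headers))
print(response.status_code, response.text)  # a 200 status indicates success
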

  10. Open OpenSearch Dashboards, navigate to Dev Tools, and run the command GET _cat/indices to validate that the data has been delivered by OpenSearch Ingestion to your OpenSearch Service domain.

You should see the three indexes created: aggregated_2xx_3xx, aggregated_4xx, and aggregated_5xx.

Now you can focus on analyzing your log data and reinvent your business without having to worry about any operational overhead regarding your ingestion pipeline.

Best practices for monitoring

You can monitor the Amazon CloudWatch metrics made available to you to maintain the right performance and availability of your pipeline. Check the list of available pipeline metrics related to the source, buffer, processor, and sink plugins.

Navigate to the Metrics tab for your specific OpenSearch Ingestion pipeline to explore the graphs available to each metric, as shown in the following screenshot.

In your production workloads, make sure to configure CloudWatch alarms that notify you when a pipeline metric breaches a specific threshold so you can promptly remediate each issue.

Managing cost

While OpenSearch Ingestion automatically provisions and scales the OCUs for your spiky workloads, you only pay for the compute resources actively used by your pipeline to ingest, process, and route data. Therefore, setting up a maximum capacity of Ingestion OCUs allows you to handle your workload peak demand while controlling cost.

For production workloads, make sure to configure a minimum of 2 Ingestion OCUs to ensure 99.9% availability for the ingestion pipeline. Check the sizing recommendations and learn how OpenSearch Ingestion responds to workload spikes.

Clean up

Make sure you clean up unwanted AWS resources created while following this post to prevent additional billing. Follow these steps to clean up your AWS account:

  1. On the AWS Cloud9 console, choose Environments in the navigation pane.
  2. Select the environment you want to delete and choose Delete.
  3. On the OpenSearch Service console, choose Domains under Managed clusters in the navigation pane.
  4. Select the domain you want to delete and choose Delete.
  5. Select Pipelines under Ingestion in the navigation pane.
  6. Select the pipeline you want to delete and on the Actions menu, choose Delete.

Conclusion

In this post, you learned how to create a serverless ingestion pipeline to deliver Apache access logs to an OpenSearch Service domain using OpenSearch Ingestion. You learned the IAM permissions required to start using OpenSearch Ingestion and how to use a pipeline blueprint instead of creating a pipeline configuration from scratch.

You used Fluent Bit to collect and forward Apache logs, and used OpenSearch Ingestion to process and conditionally route the log data to different indexes in OpenSearch Service. For more examples about writing to OpenSearch Ingestion pipelines, refer to Sending data to Amazon OpenSearch Ingestion pipelines.

Finally, the post provided you with recommendations and best practices to deploy OpenSearch Ingestion pipelines in a production environment while controlling cost.

Follow this post to build your serverless log analytics pipeline, and refer to Top strategies for high volume tracing with Amazon OpenSearch Ingestion to learn more about high volume tracing with OpenSearch Ingestion.


About the authors

Hajer Bouafif is an Analytics Specialist Solutions Architect at Amazon Web Services. She focuses on OpenSearch Service and helps customers design and build well-architected analytics workloads in diverse industries. Hajer enjoys spending time outdoors and discovering new cultures.

Francisco Losada is an Analytics Specialist Solutions Architect based out of Madrid, Spain. He works with customers across EMEA to architect, implement, and evolve analytics solutions at AWS. He advocates for OpenSearch, the open-source search and analytics suite, and supports the community by sharing code samples, writing content, and speaking at conferences. In his spare time, Francisco enjoys playing tennis and running.

Muthu Pitchaimani is a Search Specialist with Amazon OpenSearch Service. He builds large-scale search applications and solutions. Muthu is interested in the topics of networking and security, and is based out of Austin, Texas.

AWS Week in Review – Updates on Amazon FSx for NetApp ONTAP, AWS Lambda, eksctl, Karpenter, and More – July 17, 2023

Post Syndicated from Channy Yun original https://aws.amazon.com/blogs/aws/aws-week-in-review-updates-on-amazon-fsx-for-netapp-ontap-aws-lambda-eksctl-karpetner-and-more-july-17-2023/

The Data Centered: Eastern Oregon is a five-part mini-documentary series looking at the real-life impact of the more than $15 billion investment AWS has made in the local community, and how the company supports jobs, generates economic growth, provides skills training and education, and unlocks opportunities for local business suppliers.

Last week, I watched a new episode introducing the Data Center Technician training program offered by AWS to train people with little or no previous technical experience in the skills they need to work in data centers and other information technology (IT) roles. This video reminded me of my first days of cabling and transporting servers in data centers. Remember, there are still people behind cloud computing.

Last Week’s Launches
Here are some launches that got my attention:

Amazon FSx for NetApp ONTAP Updates – Jeff Barr introduced Amazon FSx for NetApp ONTAP support for SnapLock, an ONTAP feature that gives you the power to create volumes that provide write once read many (WORM) functionality for regulatory compliance and ransomware protection. In addition, FSx for NetApp ONTAP now supports IPSec encryption of data in transit and two additional monitoring and troubleshooting capabilities that you can use to monitor file system events and diagnose network connectivity.

AWS Lambda detects and stops recursive loops in Lambda functions – In certain scenarios, due to resource misconfiguration or code defects, a processed event might be sent back to the same service or resource that invoked the Lambda function. This can cause an unintended recursive loop and result in unintended usage and costs for customers. With this launch, Lambda will stop recursive invocations between Amazon SQS, Lambda, and Amazon SNS after 16 recursive calls. For more information, refer to our documentation or the launch blog post.

Email notification

Amazon CloudFront supports 3072-bit RSA certificates – You can now associate your 3072-bit RSA certificates with CloudFront distributions to enhance communication security between clients and CloudFront edge locations. To get started, associate a 3072-bit RSA certificate with your CloudFront distribution using the console or APIs. There are no additional fees associated with this feature. For more information, please refer to the CloudFront Developer Guide.

Running GitHub Actions with AWS CodeBuild – Two weeks ago, AWS CodeBuild started to support GitHub Actions. You can now define GitHub Actions steps directly in the BuildSpec and run them alongside CodeBuild commands. Last week, the AWS DevOps Blog published the blog post about using the Liquibase GitHub Action for deploying changes to an Amazon Aurora database in a private subnet. You can learn how to integrate AWS CodeBuild and nearly 20,000 GitHub Actions developed by the open source community.

CodeBuild configuration showing the GitHub repository URL

Amazon DynamoDB local version 2.0 – You can develop and test applications by running Amazon DynamoDB local in your local development environment without incurring any additional costs. The new 2.0 version allows Java developers to use DynamoDB local to work with Spring Boot 3 and frameworks such as Spring Framework 6 and Micronaut Framework 4 to build modernized, simplified, and lightweight cloud-native applications.

For a full list of AWS announcements, be sure to keep an eye on the What’s New at AWS page.

Open Source Updates
Last week, we introduced new open source projects and significant roadmap contributions to the Jupyter community.

New joint maintainership between Weaveworks and AWS for eksctl – Now the eksctl open source project has been moved from the Weaveworks GitHub organization to a new top level GitHub organization—eksctl-io—that will be jointly maintained by Weaveworks and AWS moving forward. The eksctl project can now be found on GitHub.

Karpenter now supports Windows containers – Karpenter is an open-source, flexible, high-performance Kubernetes node provisioning and management solution that you can use to quickly scale Amazon EKS clusters. With the launch of version 0.29.0, Karpenter extends automated node provisioning support to Windows containers running on EKS. Read this blog post for a step-by-step guide on how to get started with Karpenter for Windows node groups.

Updates in Amazon Aurora and Amazon OpenSearch Service – Following the announcement of updates to the PostgreSQL database in May by the open source community, we’ve updated Amazon Aurora PostgreSQL-Compatible Edition to support PostgreSQL 15.3, 14.8, 13.11, 12.15, and 11.20. These releases contain product improvements and bug fixes made by the PostgreSQL community, along with Aurora-specific improvements. You can also run OpenSearch version 2.7 in Amazon OpenSearch Service. With OpenSearch 2.7 (also released in May), we’ve made several improvements to observability, security analytics, index management, and geospatial capabilities in OpenSearch Service.

To learn about weekly updates for open source at AWS, check out the latest AWS open source newsletter by Ricardo.

Upcoming AWS Events
Check your calendars and sign up for these AWS events:

AWS Storage Day on August 9 – Join a one-day virtual event that will help you to better understand AWS storage services and make the most of your data. Register today.

AWS Global Summits – Sign up for the AWS Summit closest to your city: Hong Kong (July 20), New York City (July 26), Taiwan (August 2-3), São Paulo (August 3), and Mexico City (August 30).

AWS Community Days – Join a community-led conference run by AWS user group leaders in your region: Malaysia (July 22), Philippines (July 29-30), Colombia (August 12), and West Africa (August 19).

AWS re:Invent 2023 – Join us to hear the latest from AWS, learn from experts, and connect with the global cloud community. Registration is now open.

You can browse all upcoming AWS-led in-person and virtual events, and developer-focused events such as AWS DevDay.

Take the AWS Blog Customer Survey
We’re focused on improving our content to provide a better customer experience, and we need your feedback to do so. Take our survey to share insights regarding your experience on the AWS Blog.

This survey is hosted by an external company. AWS handles your information as described in the AWS Privacy Notice. AWS will own the data gathered via this survey and will not share the information collected with survey respondents.

That’s all for this week. Check back next Monday for another Week in Review!

Channy

This post is part of our Week in Review series. Check back each week for a quick roundup of interesting news and announcements from AWS!

Access Amazon OpenSearch Serverless collections using a VPC endpoint

Post Syndicated from Raj Ramasubbu original https://aws.amazon.com/blogs/big-data/access-amazon-opensearch-serverless-collections-using-a-vpc-endpoint/

Amazon OpenSearch Serverless helps you index, analyze, and search your logs and data using OpenSearch APIs and dashboards. An OpenSearch Serverless collection is a group of indexes. API and dashboard clients can access the collections from public networks or one or more VPCs. For VPC access to collections and dashboards, you can create VPC endpoints. In this post, we demonstrate how you can create and use VPC endpoints and OpenSearch Serverless network policies to control access to your collections and OpenSearch dashboards from multiple network locations.

The demo in this post uses an AWS Lambda-based client in a VPC to ingest data into a collection via a VPC endpoint, and a browser on a public network to access the same collection.

Solution overview

To illustrate how you can ingest data into an OpenSearch Serverless collection from within a VPC, we use a Lambda function. We use a VPC-hosted Lambda function to create an index in an OpenSearch Serverless collection and add documents to the index using a VPC endpoint. We then use a publicly accessible OpenSearch Serverless dashboard to see the documents ingested by the Lambda function.

The following sections detail the steps to ingest data into the collection using Lambda and access the OpenSearch Serverless dashboard.

Prerequisites

This setup assumes that you have already created a VPC with private subnets.

Ingest data using Lambda and access the OpenSearch Serverless dashboard

To set up your solution, complete the following steps:

  1. On the OpenSearch Service console, create a private connection between your VPC and OpenSearch Serverless using a VPC endpoint. Use the private subnets and a security group from your VPC.
  2. Create an OpenSearch collection using the VPC endpoint created in the previous step.
  3. Create a network policy to enable VPC access to the OpenSearch endpoint so the Lambda function can ingest documents into the collection. Also enable public access to the OpenSearch Dashboards endpoint so you can see the ingested documents (a sample network policy appears after these steps).
  4. After you create the collection, create a data access policy to grant ingestion access to the Lambda function’s AWS Identity and Access Management (IAM) role.
  5. Additionally, grant read access to the dashboard user’s IAM role.
  6. Add IAM permissions to the Lambda function’s IAM role and the dashboard user’s IAM role for the OpenSearch Serverless collection.
  7. Create a Lambda function in the same VPC and subnet that we used for the OpenSearch endpoint (see the following code). This function creates an index called sitcoms-eighties in the OpenSearch Serverless collection and adds a sample document to the index:
import datetime
import time
from opensearchpy import OpenSearch, RequestsHttpConnection
from requests_aws4auth import AWS4Auth
import boto3

host = '<Insert-OpenSearch-Serverless-Endpoint>'
region = 'us-east-1'
service = 'aoss'
credentials = boto3.Session().get_credentials()
awsauth = AWS4Auth(credentials.access_key, credentials.secret_key, region, service, session_token=credentials.token)

# Build the OpenSearch client
client = OpenSearch(
    hosts=[{'host': host, 'port': 443}],
    http_auth=awsauth,
    use_ssl=True,
    verify_certs=True,
    connection_class=RequestsHttpConnection,
    timeout=300
)

def lambda_handler(event, context):

    # Create index
    response = client.indices.create('sitcoms-eighties')
    print('\nCreating index:')
    print(response)
    time.sleep(5)
    
    dt = datetime.datetime.now()
    # Add a document to the index.
    response = client.index(
        index='sitcoms-eighties',
        body={
            'title': 'Seinfeld',
            'creator': 'Larry David',
            'year': 1989,
            'createtime': dt
        },
        id='1',
    )
    print('\nDocument added:')
    print(response)
  8. Run the Lambda function, and you should see the output as shown in the following screenshot.
  9. You can now see the documents from this index through your publicly accessible OpenSearch Dashboards URL.
  10. Create the index pattern in OpenSearch Dashboards, and then you can see the documents as shown in the following screenshot.
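
As a reference for step 3, a network policy along the following lines grants VPC-only access to the collection endpoint and public access to its dashboard. The collection name and VPC endpoint ID are placeholders:

[
  {
    "Rules": [
      {
        "ResourceType": "collection",
        "Resource": ["collection/{your-collection-name}"]
      }
    ],
    "AllowFromPublic": false,
    "SourceVPCEs": ["{your-vpc-endpoint-id}"]
  },
  {
    "Rules": [
      {
        "ResourceType": "dashboard",
        "Resource": ["collection/{your-collection-name}"]
      }
    ],
    "AllowFromPublic": true
  }
]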

Use a VPC DNS resolver from your network

A client in your VPN-connected network can connect to the collection or dashboards over a VPC endpoint. The client needs to find the VPC endpoint's IP address using an Amazon Route 53 inbound resolver endpoint. To learn more about Route 53 inbound resolver endpoints, refer to Resolving DNS queries between VPCs and your network. The following diagram shows a sample setup.

The flow for this architecture is as follows:

  1. The DNS query for the OpenSearch Serverless client is routed to a locally configured on-premises DNS server.
  2. The on-premises DNS server, as configured, performs conditional forwarding for the zone us-east-1.aoss.amazonaws.com to a Route 53 inbound resolver endpoint IP address. Replace us-east-1 with your own Region in the zone name.
  3. The inbound resolver endpoint performs DNS resolution by forwarding the query to the private hosted zone that was created along with the OpenSearch Serverless VPC endpoint.
  4. The IP addresses returned by the DNS query are the private IP addresses of the interface VPC endpoint, which allow your on-premises host to establish private connectivity over AWS Site-to-Site VPN.
  5. The interface endpoint is a collection of one or more elastic network interfaces with a private IP address in your account that serves as an entry point for traffic going to an OpenSearch Serverless endpoint.

Summary

OpenSearch Serverless allows you to set up and control access to the service using VPC endpoints and network policies. In this post, we explored how to access an OpenSearch Serverless collection API and dashboard from within a VPC, on premises, and public networks. If you have any questions or suggestions, please write to us in the comments section.


About the Authors

Raj Ramasubbu is a Senior Analytics Specialist Solutions Architect focused on big data and analytics and AI/ML with Amazon Web Services. He helps customers architect and build highly scalable, performant, and secure cloud-based solutions on AWS. Raj provided technical expertise and leadership in building data engineering, big data analytics, business intelligence, and data science solutions for over 18 years prior to joining AWS. He helped customers in various industry verticals like healthcare, medical devices, life science, retail, asset management, car insurance, residential REIT, agriculture, title insurance, supply chain, document management, and real estate.

Vivek Kansal works with the Amazon OpenSearch team. In his role as Principal Software Engineer, he uses his experience in the areas of security, policy engines, cloud-native solutions, and networking to help secure customer data in OpenSearch Service and OpenSearch Serverless in an evolving threat landscape.

AWS Week in Review – Generative AI with LLM Hands-on Course, Amazon SageMaker Data Wrangler Updates, and More – July 3, 2023

Post Syndicated from Donnie Prakoso original https://aws.amazon.com/blogs/aws/aws-week-in-review-generative-ai-with-llm-hands-on-course-amazon-sagemaker-data-wrangler-updates-and-more-july-3-2023/

In last week’s AWS Week in Review post, Danilo mentioned that it’s summer in London. Well, I’m based in Singapore, and it’s mostly summer here. But, June is a special month here as it marks the start of durian season.

Starting next week, I’ll be travelling to Thailand, Malaysia, and the Philippines. But before I go, I want to share some interesting updates from last week for you.

Let’s get started.

Last Week’s Launches
Here are some launches that caught my attention:

New Hands-on Course: Generative AI with Large Language Models – Generative AI has been a technology highlight for the past few months. If you are on your journey to learn large language models (LLM), then you can try the new hands-on course Generative AI with LLMs at Coursera. Antje wrote a post to announce this collaboration course between DeepLearning.AI and AWS. This course is designed to prepare data scientists and engineers to become experts in selecting, training, fine-tuning, and deploying LLMs for real-world applications.

Generative AI with large language models

Amazon SageMaker Data Wrangler direct connection to Snowflake – With this announcement, you can now browse databases, tables, schemas, and query data from Snowflake in SageMaker Data Wrangler. This unlocks the possibility for you to join your data with other popular data sources, such as Amazon S3, Amazon Athena, Amazon Redshift, Amazon EMR, and over 50 SaaS applications to create the right data set for machine learning.

Amazon SageMaker Role Manager now provides CDK library to create fine-grained permissions — The CDK support for Amazon SageMaker Role Manager lets you programmatically define permissions with fine-grained access for SageMaker users, jobs, and SageMaker pipelines. This reduces manual effort and ensures consistent permissions management. For example, the following code grants permissions for a set of related machine learning activities specific to a persona.


export class myCDKStack extends cdk.Stack {
  constructor(scope: cdk.App, id: string, props?: cdk.StackProps) {
    super(scope, id, props);

    const persona = new Persona(this, 'example-persona-id', {
        activities: [
            Activity.runStudioAppsV2(this, 'example-id1', {}), 
            Activity.accessS3Buckets(this, 'example-id2', {s3buckets: [s3.S3Bucket.fromBucketName('DOC-EXAMPLE-BUCKET')]}),
            Activity.accessAwsServices(this, 'example-id3', {})
        ]
    });

    const role = persona.createRole(this, 'example-IAM-role-id', 'example-IAM-role-name');
    
    }
}

AWS SDK for SAP ABAP – Great news for SAP ABAP developers! We just recently announced the general availability of the AWS SDK for SAP ABAP. With this, ABAP developers can use simple, secure and configurable connections between ABAP environments and 200+ supported AWS services in all AWS Regions, including AWS GovCloud (US) Regions. This AWS SDK helps ABAP developers to modernize their business processes with AWS services.

Amazon OpenSearch Ingestion now supports ingesting events from Amazon Security Lake – Amazon OpenSearch Ingestion now lets you bring in data in the Apache Parquet format. Because Amazon Security Lake also uses the Open Cybersecurity Schema Framework (OCSF) in Apache Parquet format, you can easily ingest data from Amazon Security Lake.

For a full list of AWS announcements, be sure to keep an eye on the What’s New at AWS page.

AWS Open-Source Updates
As always, my colleague Ricardo has curated the latest updates for open-source news at AWS. Here are some of the highlights.

lightsail-miab-installer – This handy command-line tool developed by my colleague Rio Astamal was designed to simplify the process of setting up Mail-in-a-Box on Amazon Lightsail. With lightsail-miab-installer, you can effortlessly streamline the installation and configuration of Mail-in-a-Box, making it even more accessible and user-friendly.

rdsconn – This amazing tool, created by AWS Hero Aidan Steele, simplifies the process of connecting to an AWS RDS instance within a VPC directly from your laptop. Using the recently launched EC2 Instance Connect, rdsconn eliminates the need for cumbersome SSH tunnels.

cdk-appflow – If you’re using AWS CDK to build your applications and Amazon AppFlow to create bidirectional data transfer integrations between various SaaS applications and AWS, then you’re going to love cdk-appflow, a new AWS CDK construct for Amazon AppFlow. It’s currently in technical preview, but you’re more than welcome to try it and provide us with your feedback.

Upcoming AWS Events
There are also upcoming events that you can join to learn. Let’s start with AWS Summit events:

And, let’s learn from our fellow builders and join AWS Community Days:

Open for Registration for AWS re:Invent
Before I end this post, AWS re:Invent registration is now open!

This learning conference hosted by AWS for the global cloud computing community will be held from Nov 27 to Dec 1, 2023 in Las Vegas.

Pro-tip: You can use information on the Justify Your Trip page to prove the value of your trip to AWS re:Invent.

That’s all for this week. Check back next Monday for another Week in Review.

Happy building.
Donnie

This post is part of our Week in Review series. Check back each week for a quick roundup of interesting news and announcements from AWS!

Amazon OpenSearch Service’s vector database capabilities explained

Post Syndicated from Jon Handler original https://aws.amazon.com/blogs/big-data/amazon-opensearch-services-vector-database-capabilities-explained/

OpenSearch is a scalable, flexible, and extensible open-source software suite for search, analytics, security monitoring, and observability applications, licensed under the Apache 2.0 license. It comprises a search engine, OpenSearch, which delivers low-latency search and aggregations; OpenSearch Dashboards, a visualization and dashboarding tool; and a suite of plugins that provide advanced capabilities like alerting, fine-grained access control, observability, security monitoring, and vector storage and processing. Amazon OpenSearch Service is a fully managed service that makes it simple to deploy, scale, and operate OpenSearch in the AWS Cloud.

As an end-user, when you use OpenSearch’s search capabilities, you generally have a goal in mind—something you want to accomplish. Along the way, you use OpenSearch to gather information in support of achieving that goal (or maybe the information is the original goal). We’ve all become used to the “search box” interface, where you type some words, and the search engine brings back results based on word-to-word matching. Let’s say you want to buy a couch in order to spend cozy evenings with your family around the fire. You go to Amazon.com, and you type “a cozy place to sit by the fire.” Unfortunately, if you run that search on Amazon.com, you get items like fire pits, heating fans, and home decorations—not what you intended. The problem is that couch manufacturers probably didn’t use the words “cozy,” “place,” “sit,” and “fire” in their product titles or descriptions.

In recent years, machine learning (ML) techniques have become increasingly popular to enhance search. Among them are the use of embedding models, a type of model that can encode a large body of data into an n-dimensional space where each entity is encoded into a vector, a data point in that space, and organized such that similar entities are closer together. An embedding model, for instance, could encode the semantics of a corpus. By searching for the vectors nearest to an encoded document — k-nearest neighbor (k-NN) search — you can find the most semantically similar documents. Sophisticated embedding models can support multiple modalities, for instance, encoding the image and text of a product catalog and enabling similarity matching on both modalities.

A vector database provides efficient vector similarity search by providing specialized indexes like k-NN indexes. It also provides other database functionality like managing vector data alongside other data types, workload management, access control and more. OpenSearch’s k-NN plugin provides core vector database functionality for OpenSearch, so when your customer searches for “a cozy place to sit by the fire” in your catalog, you can encode that prompt and use OpenSearch to perform a nearest neighbor query to surface that 8-foot, blue couch with designer arranged photographs in front of fireplaces.
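
To make this concrete, the following minimal sketch uses the opensearch-py client to create a k-NN-enabled index and run an approximate nearest neighbor query. The host, index name, field names, dimension, and vectors are illustrative only; against an OpenSearch Service domain you would also add SigV4 or basic authentication:

from opensearchpy import OpenSearch

client = OpenSearch(hosts=[{"host": "localhost", "port": 9200}])

# Create an index with a knn_vector field backed by an HNSW graph.
client.indices.create(
    index="products",
    body={
        "settings": {"index.knn": True},
        "mappings": {
            "properties": {
                "title": {"type": "text"},
                "item_vector": {
                    "type": "knn_vector",
                    "dimension": 3,
                    "method": {"name": "hnsw", "space_type": "l2", "engine": "nmslib"},
                },
            }
        },
    },
)

# Approximate k-NN query: return the 2 vectors nearest to the query vector.
results = client.search(
    index="products",
    body={
        "size": 2,
        "query": {"knn": {"item_vector": {"vector": [0.1, 0.2, 0.3], "k": 2}}},
    },
)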

Using OpenSearch Service as a vector database

With OpenSearch Service’s vector database capabilities, you can implement semantic search, Retrieval Augmented Generation (RAG) with LLMs, recommendation engines, and rich media search.

Semantic search

With semantic search, you improve the relevance of retrieved results using language-based embeddings on search documents. You enable your search customers to use natural language queries, like “a cozy place to sit by the fire” to find their 8-foot-long blue couch. For more information, refer to Building a semantic search engine in OpenSearch to learn how semantic search can deliver a 15% relevance improvement, as measured by normalized discounted cumulative gain (nDCG) metrics compared with keyword search. For a concrete example, our Improve search relevance with ML in Amazon OpenSearch Service workshop explores the difference between keyword and semantic search, based on a Bidirectional Encoder Representations from Transformers (BERT) model, hosted by Amazon SageMaker to generate vectors and store them in OpenSearch. The workshop uses product question answers as an example to show how keyword search using the keywords/phrases of the query leads to some irrelevant results. Semantic search is able to retrieve more relevant documents by matching the context and semantics of the query. The following diagram shows an example architecture for a semantic search application with OpenSearch Service as the vector database.

Architecture diagram showing how to use Amazon OpenSearch Service to perform semantic search to improve relevance

Retrieval Augmented Generation with LLMs

RAG is a method for building trustworthy generative AI chatbots using generative LLMs like OpenAI’s ChatGPT or Amazon Titan Text. With the rise of generative LLMs, application developers are looking for ways to take advantage of this innovative technology. One popular use case involves delivering conversational experiences through intelligent agents. Perhaps you’re a software provider with knowledge bases for product information, customer self-service, or industry domain knowledge like tax reporting rules or medical information about diseases and treatments. A conversational search experience provides an intuitive interface for users to sift through information through dialog and Q&A. Generative LLMs on their own are prone to hallucinations—a situation where the model generates a believable but factually incorrect response. RAG solves this problem by complementing generative LLMs with an external knowledge base that is typically built using a vector database hydrated with vector-encoded knowledge articles.

As illustrated in the following diagram, the query workflow starts with a question that is encoded and used to retrieve relevant knowledge articles from the vector database. Those results are sent to the generative LLM whose job is to augment those results, typically by summarizing the results as a conversational response. By complementing the generative model with a knowledge base, RAG grounds the model on facts to minimize hallucinations. You can learn more about building a RAG solution in the Retrieval Augmented Generation module of our semantic search workshop.

Architecture diagram showing how to use Amazon OpenSearch Service to perform retrieval-augmented generation

Recommendation engine

Recommendations are a common component in the search experience, especially for ecommerce applications. Adding a user experience feature like “more like this” or “customers who bought this also bought that” can drive additional revenue through getting customers what they want. Search architects employ many techniques and technologies to build recommendations, including Deep Neural Network (DNN) based recommendation algorithms such as the two-tower neural net model, YoutubeDNN. A trained embedding model encodes products, for example, into an embedding space where products that are frequently bought together are considered more similar, and therefore are represented as data points that are closer together in the embedding space. Another possibility is that product embeddings are based on co-rating similarity instead of purchase activity. You can employ this affinity data through calculating the vector similarity between a particular user’s embedding and vectors in the database to return recommended items. The following diagram shows an example architecture of building a recommendation engine with OpenSearch as a vector store.

Architecture diagram showing how to use Amazon OpenSearch Service as a recommendation engine

Media search

Media search enables users to query the search engine with rich media like images, audio, and video. Its implementation is similar to semantic search—you create vector embeddings for your search documents and then query OpenSearch Service with a vector. The difference is that you use a computer vision deep neural network, such as a convolutional neural network (CNN) like ResNet, to convert images into vectors. The following diagram shows an example architecture of building an image search with OpenSearch as the vector store.

Architecture diagram showing how to use Amazon OpenSearch Service to search rich media like images, videos, and audio files

Understanding the technology

OpenSearch uses approximate nearest neighbor (ANN) algorithms from the NMSLIB, FAISS, and Lucene libraries to power k-NN search. These search methods employ ANN to improve search latency for large datasets. Of the three search methods the k-NN plugin provides, approximate k-NN offers the best search scalability for large datasets. The engine details are as follows:

  • Non-Metric Space Library (NMSLIB) – NMSLIB implements the HNSW ANN algorithm
  • Facebook AI Similarity Search (FAISS) – FAISS implements both HNSW and IVF ANN algorithms
  • Lucene – Lucene implements the HNSW algorithm

Each of the three engines used for approximate k-NN search has its own attributes that make one more sensible to use than the others in a given situation. You can follow the general information in this section to help determine which engine will best meet your requirements.

In general, NMSLIB and FAISS should be selected for large-scale use cases. Lucene is a good option for smaller deployments, but offers benefits like smart filtering where the optimal filtering strategy—pre-filtering, post-filtering, or exact k-NN—is automatically applied depending on the situation. The following table summarizes the differences between each option.

|  | NMSLIB-HNSW | FAISS-HNSW | FAISS-IVF | Lucene-HNSW |
| --- | --- | --- | --- | --- |
| Max Dimension | 16,000 | 16,000 | 16,000 | 1,024 |
| Filter | Post filter | Post filter | Post filter | Filter while search |
| Training Required | No | No | Yes | No |
| Similarity Metrics | l2, innerproduct, cosinesimil, l1, linf | l2, innerproduct | l2, innerproduct | l2, cosinesimil |
| Vector Volume | Tens of billions | Tens of billions | Tens of billions | < Ten million |
| Indexing Latency | Low | Low | Lowest | Low |
| Query Latency & Quality | Low latency & high quality | Low latency & high quality | Low latency & low quality | High latency & high quality |
| Vector Compression | Flat | Flat, Product Quantization | Flat, Product Quantization | Flat |
| Memory Consumption | High | High (low with PQ) | Medium (low with PQ) | High |

Approximate and exact nearest-neighbor search

The OpenSearch Service k-NN plugin supports three different methods for obtaining the k-nearest neighbors from an index of vectors: approximate k-NN, score script (exact k-NN), and painless extensions (exact k-NN).

Approximate k-NN

The first method takes an approximate nearest neighbor approach—it uses one of several algorithms to return the approximate k-nearest neighbors to a query vector. Usually, these algorithms sacrifice indexing speed and search accuracy in return for performance benefits such as lower latency, smaller memory footprints, and more scalable search. Approximate k-NN is the best choice for searches over large indexes (that is, hundreds of thousands of vectors or more) that require low latency. You should not use approximate k-NN if you want to apply a filter on the index before the k-NN search, which greatly reduces the number of vectors to be searched. In this case, you should use either the score script method or painless extensions.

Score script

The second method extends the OpenSearch Service score script functionality to run a brute force, exact k-NN search over knn_vector fields or fields that can represent binary objects. With this approach, you can run k-NN search on a subset of vectors in your index (sometimes referred to as a pre-filter search). This approach is preferred for searches over smaller bodies of documents or when a pre-filter is needed. Using this approach on large indexes may lead to high latencies.
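
As a sketch of such a pre-filter search, the following query scores only the documents that match the term filter and computes an exact k-NN score against each; the index, field, and filter values are illustrative:

from opensearchpy import OpenSearch

client = OpenSearch(hosts=[{"host": "localhost", "port": 9200}])

# Exact k-NN over the filtered subset only (a pre-filter search).
results = client.search(
    index="products",
    body={
        "size": 2,
        "query": {
            "script_score": {
                "query": {"bool": {"filter": {"term": {"color": "blue"}}}},
                "script": {
                    "source": "knn_score",
                    "lang": "knn",
                    "params": {
                        "field": "item_vector",
                        "query_value": [0.1, 0.2, 0.3],
                        "space_type": "l2",
                    },
                },
            }
        },
    },
)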

Painless extensions

The third method adds the distance functions as painless extensions that you can use in more complex combinations. Similar to the k-NN score script, you can use this method to perform a brute force, exact k-NN search across an index, which also supports pre-filtering. This approach has slightly slower query performance compared to the k-NN score script. If your use case requires more customization over the final score, you should use this approach over score script k-NN.

Vector search algorithms

The simplest way to find similar vectors is to use k-nearest neighbors (k-NN) algorithms, which compute the distance between a query vector and the other vectors in the vector database. As we mentioned earlier, the score script k-NN and painless extensions search methods use the exact k-NN algorithms under the hood. However, in the case of extremely large datasets with high dimensionality, this creates a scaling problem that reduces the efficiency of the search. Approximate nearest neighbor (ANN) search methods can overcome this by employing tools that restructure indexes more efficiently and reduce the dimensionality of searchable vectors. There are different ANN search algorithms; for example, locality sensitive hashing, tree-based, cluster-based, and graph-based. OpenSearch implements two ANN algorithms: Hierarchical Navigable Small Worlds (HNSW) and Inverted File System (IVF). For a more detailed explanation of how the HNSW and IVF algorithms work in OpenSearch, see the blog post “Choose the k-NN algorithm for your billion-scale use case with OpenSearch”.

Hierarchical Navigable Small Worlds

The HNSW algorithm is one of the most popular algorithms out there for ANN search. The core idea of the algorithm is to build a graph with edges connecting index vectors that are close to each other. Then, on search, this graph is partially traversed to find the approximate nearest neighbors to the query vector. To steer the traversal towards the query’s nearest neighbors, the algorithm always visits the closest candidate to the query vector next.

Inverted File

The IVF algorithm separates your index vectors into a set of buckets, then, to reduce your search time, only searches through a subset of these buckets. However, if the algorithm just randomly split up your vectors into different buckets, and only searched a subset of them, it would yield a poor approximation. The IVF algorithm uses a more elegant approach. First, before indexing begins, it assigns each bucket a representative vector. When a vector is indexed, it gets added to the bucket that has the closest representative vector. This way, vectors that are closer to each other are placed roughly in the same or nearby buckets.
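
Because IVF must learn the bucket representatives before indexing, the k-NN plugin exposes a Train API. The following sketch illustrates a training request; the model ID, training index and field, and parameter values are assumptions for illustration:

from opensearchpy import OpenSearch

client = OpenSearch(hosts=[{"host": "localhost", "port": 9200}])

# Train an IVF model from vectors already stored in a training index.
client.transport.perform_request(
    "POST",
    "/_plugins/_knn/models/my-ivf-model/_train",
    body={
        "training_index": "train-index",
        "training_field": "train_vector",
        "dimension": 3,
        "method": {
            "name": "ivf",
            "engine": "faiss",
            "space_type": "l2",
            # nlist: number of buckets; nprobes: buckets searched per query.
            "parameters": {"nlist": 128, "nprobes": 8},
        },
    },
)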

Vector similarity metrics

All search engines use a similarity metric to rank and sort results and bring the most relevant results to the top. When you use a plain text query, the similarity metric is called TF-IDF, which measures the importance of the terms in the query and generates a score based on the number of textual matches. When your query includes a vector, the similarity metrics are spatial in nature, taking advantage of proximity in the vector space. OpenSearch supports several similarity or distance measures:

  • Euclidean distance – The straight-line distance between points.
  • L1 (Manhattan) distance – The sum of the differences of all of the vector components. L1 distance measures how many orthogonal city blocks you need to traverse from point A to point B.
  • L-infinity (chessboard) distance – The number of moves a King would make on an n-dimensional chessboard. It differs from Euclidean distance on the diagonals—a diagonal step on a 2-dimensional chessboard is 1.41 Euclidean units away, but 1 L-infinity unit away.
  • Inner product – The product of the magnitudes of two vectors and the cosine of the angle between them. Usually used for natural language processing (NLP) vector similarity.
  • Cosine similarity – The cosine of the angle between two vectors in a vector space.
  • Hamming distance – For binary-coded vectors, the number of bits that differ between the two vectors.
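
To make the geometry concrete, here is a small illustrative sketch of these measures in Python (for intuition only, not how OpenSearch computes them internally):

import numpy as np

def euclidean(a, b):        # straight-line (L2) distance
    return float(np.linalg.norm(np.subtract(a, b)))

def manhattan(a, b):        # L1: sum of per-component differences
    return float(np.abs(np.subtract(a, b)).sum())

def chessboard(a, b):       # L-infinity: largest per-component difference
    return float(np.abs(np.subtract(a, b)).max())

def inner_product(a, b):
    return float(np.dot(a, b))

def cosine_similarity(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def hamming(a, b):          # for binary-coded vectors
    return int(np.count_nonzero(np.not_equal(a, b)))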

Advantage of OpenSearch as a vector database

When you use OpenSearch Service as a vector database, you can take advantage of the service’s features like usability, scalability, availability, interoperability, and security. More importantly, you can use OpenSearch’s search features to enhance the search experience. For example, you can use Learning to Rank in OpenSearch to integrate user clickthrough behavior data into your search application and improve search relevance. You can also combine OpenSearch’s text search and vector search capabilities to search documents with keyword and semantic similarity, or use other fields in the index to filter documents to improve relevance. For advanced users, a hybrid scoring model can combine OpenSearch’s text-based relevance score, computed with the Okapi BM25 function, and its vector search score to improve the ranking of your search results.

Scale and limits

OpenSearch as a vector database supports billions of vector records. Keep in mind the following guidance on the number of vectors and dimensions when sizing your cluster.

Number of vectors

OpenSearch’s vector database capability takes advantage of the sharding capabilities of OpenSearch and can scale to billions of vectors at single-digit millisecond latencies by sharding vectors and scaling horizontally by adding more nodes. The number of vectors that can fit on a single machine is a function of the off-heap memory available on the machine. The number of nodes required depends on the amount of memory that can be used for the algorithm per node and the total amount of memory required by the algorithm. The more nodes, the more memory and the better the performance. The amount of memory available per node is computed as memory_available = (node_memory - jvm_size) * circuit_breaker_limit, with the following parameters:

  • node_memory – The total memory of the instance.
  • jvm_size – The OpenSearch JVM heap size. This is set to half of the instance’s RAM, capped at approximately 32 GB.
  • circuit_breaker_limit – The native memory usage threshold for the circuit breaker. This is set to 0.5.

Total cluster memory estimation depends on total number of vector records and algorithms. HNSW and IVF have different memory requirements. You can refer to Memory Estimation for more details.
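
As a back-of-the-envelope sketch, the following applies the formula above together with the HNSW estimate from the k-NN plugin’s memory estimation guidance; the instance size, dimension, M, and vector count are illustrative assumptions, not recommendations:

GIB = 1024 ** 3

node_memory = 64 * GIB            # total RAM of the instance (assumed)
jvm_size = 32 * GIB               # half of RAM, capped at ~32 GB
circuit_breaker_limit = 0.5       # native memory usage threshold

# Off-heap memory usable by the k-NN algorithm on each node.
memory_available = (node_memory - jvm_size) * circuit_breaker_limit

# HNSW needs roughly 1.1 * (4 * dimension + 8 * M) bytes per vector.
dimension, M, num_vectors = 768, 16, 100_000_000
hnsw_bytes = 1.1 * (4 * dimension + 8 * M) * num_vectors

print(f"~{hnsw_bytes / GIB:.0f} GiB total, "
      f"~{hnsw_bytes / memory_available:.0f} nodes of this size")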

Number of dimensions

OpenSearch’s current dimension limit for the vector field knn_vector is 16,000 dimensions. Each dimension is represented as a 32-bit float. The more dimensions, the more memory you’ll need to index and search. The number of dimensions is usually determined by the embedding models that translate the entity to a vector. There are a lot of options to choose from when building your knn_vector field. To determine the correct methods and parameters to choose, refer to Choosing the right method.

Customer stories

Amazon Music

Amazon Music is always innovating to provide customers with unique and personalized experiences. One of Amazon Music’s approaches to music recommendations is a remix of a classic Amazon innovation, item-to-item collaborative filtering, and vector databases. Using data aggregated based on user listening behavior, Amazon Music has created an embedding model that encodes music tracks and customer representations into a vector space where neighboring vectors represent tracks that are similar. 100 million songs are encoded into vectors, indexed into OpenSearch, and served across multiple geographies to power real-time recommendations. OpenSearch currently manages 1.05 billion vectors and supports a peak load of 7,100 vector queries per second to power Amazon Music recommendations.

The item-to-item collaborative filter continues to be among the most popular methods for online product recommendations because of its effectiveness at scaling to large customer bases and product catalogs. OpenSearch makes it easier to operationalize and further the scalability of the recommender by providing scale-out infrastructure and k-NN indexes that grow linearly with respect to the number of tracks and similarity search in logarithmic time.

The following figure visualizes the high-dimensional space created by the vector embedding.

A visualization of the vector encoding of Amazon Music entries in the large vector space

Brand protection at Amazon

Amazon strives to deliver the world’s most trustworthy shopping experience, offering customers the widest possible selection of authentic products. To earn and maintain our customers’ trust, we strictly prohibit the sale of counterfeit products, and we continue to invest in innovations that ensure only authentic products reach our customers. Amazon’s brand protection programs build trust with brands by accurately representing and completely protecting their brand. We strive to ensure that public perception mirrors the trustworthy experience we deliver. Our brand protection strategy focuses on four pillars: (1) Proactive Controls (2) Powerful Tools to Protect Brands (3) Holding Bad Actors Accountable (4) Protecting and Educating Customers. Amazon OpenSearch Service is a key part of Amazon’s Proactive Controls.

In 2022, Amazon’s automated technology scanned more than 8 billion attempted changes daily to product detail pages for signs of potential abuse. Our proactive controls found more than 99% of blocked or removed listings before a brand ever had to find and report it. These listings were suspected of being fraudulent, infringing, counterfeit, or at risk of other forms of abuse. To perform these scans, Amazon created tooling that uses advanced and innovative techniques, including the use of advanced machine learning models to automate the detection of intellectual property infringements in listings across Amazon’s stores globally. A key technical challenge in implementing such an automated system is the ability to search for protected intellectual property within a vast billion-vector corpus in a fast, scalable, and cost-effective manner. Leveraging Amazon OpenSearch Service’s scalable vector database capabilities and distributed architecture, we successfully developed an ingestion pipeline that has indexed a total of 68 billion 128- and 1024-dimension vectors into OpenSearch Service to enable brands and automated systems to conduct infringement detection, in real time, through a highly available and fast (sub-second) search API.

Conclusion

Whether you’re building a generative AI solution, searching rich media and audio, or bringing more semantic search to your existing search-based application, OpenSearch is a capable vector database. OpenSearch supports a variety of engines, algorithms, and distance measures that you can employ to build the right solution. OpenSearch provides a scalable engine that can support vector search at low latency and up to billions of vectors. With OpenSearch and its vector DB capabilities, your users can find that 8-foot-blue couch easily, and relax by a cozy fire.


About the Authors

Jon Handler is a Senior Principal Solutions Architect at Amazon Web Services based in Palo Alto, CA. Jon works closely with OpenSearch and Amazon OpenSearch Service, providing help and guidance to a broad range of customers who have search and log analytics workloads that they want to move to the AWS Cloud. Prior to joining AWS, Jon’s career as a software developer included four years of coding a large-scale, ecommerce search engine. Jon holds a Bachelor of the Arts from the University of Pennsylvania, and a Master of Science and a PhD in Computer Science and Artificial Intelligence from Northwestern University.

Jianwei Li is a Principal Analytics Specialist TAM at Amazon Web Services. Jianwei provides consulting services to help customers design and build modern data platforms. He has worked in the big data domain as a software developer, consultant, and tech lead.

Dylan Tong is a Senior Product Manager at AWS. He works with customers to help drive their success on the AWS platform through thought leadership and guidance on designing well architected solutions. He has spent most of his career building on his expertise in data management and analytics by working for leaders and innovators in the space.

Vamshi Vijay Nakkirtha is a Software Engineering Manager working on the OpenSearch Project and Amazon OpenSearch Service. His primary interests include distributed systems. He is an active contributor to various plugins, like k-NN, GeoSpatial, and dashboard-maps.