Tag Archives: Amazon OpenSearch Service

Single sign-on SSO for Amazon OpenSearch Service using SAML and Keycloak

Post Syndicated from Sajeev Attiyil Bhaskaran original https://aws.amazon.com/blogs/big-data/single-sign-on-sso-for-amazon-opensearch-service-using-saml-and-keycloak/

A standard use case for customers is to integrate existing identity providers (IdPs) with Amazon OpenSearch Service. OpenSearch Service offers built-in support for single sign-on (SSO) authentication for OpenSearch Dashboards using the SAML protocol. SAML authentication for OpenSearch Service lets you integrate your existing third-party IdPs, such as Okta, Ping Identity, OneLogin, Auth0, ADFS, Azure Active Directory, and Keycloak, with OpenSearch Dashboards.

In this post, we walk you through how to configure service provider-initiated authentication for OpenSearch Dashboards by using OpenSearch Service and Keycloak. We also discuss how to set up users, groups, and roles in Keycloak and configure their access to OpenSearch Dashboards.

Solution overview

The following diagram illustrates the SAML authentication flow for this solution.

image1

The sign-in flow consists of the following steps.

  1. The user opens a browser to navigate to the OpenSearch Dashboards endpoint of OpenSearch Service in a virtual private cloud (VPC), for example https://vpc-abc123.us-east-1.es.amazonaws.com/_dashboards.
  2. The service provider (OpenSearch Service) uses the information about the IdP (Keycloak) to generate a SAML authentication request. The service provider redirects the SAML authentication request back to the browser.
  3. The browser relays the SAML authentication request to Keycloak. Keycloak parses the SAML authentication request and prompts the user to enter their user name and password to authenticate.
  4. After a successful authentication, Keycloak generates a SAML authentication response that includes authenticated user details from Keycloak and sends the encoded SAML response to the browser.
  5. The browser relays the SAML response to the OpenSearch Service Assertion Consumer Service (ACS) URL.
  6. OpenSearch Service validates the SAML response. If the validation checks are passed, the user is redirected to the front page of OpenSearch Dashboards. The authorization is performed according to the roles mapped to the user.
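
To make steps 2 and 3 more concrete, the following Python sketch shows how a SAML authentication request can be packed into a URL parameter so the browser can relay it. It assumes the SAML HTTP-Redirect binding (raw DEFLATE, then Base64, then URL encoding); the post does not state which binding OpenSearch Service uses, and the sample request and function names are hypothetical.

```python
import base64
import urllib.parse
import zlib

# Hypothetical, heavily trimmed AuthnRequest; real requests carry more attributes.
SAMPLE_AUTHN_REQUEST = (
    '<samlp:AuthnRequest xmlns:samlp="urn:oasis:names:tc:SAML:2.0:protocol" '
    'ID="_example" Version="2.0" '
    'AssertionConsumerServiceURL="https://vpc-abc123.us-east-1.es.amazonaws.com'
    '/_dashboards/_opendistro/_security/saml/acs"/>'
)


def encode_redirect_binding(authn_request_xml: str) -> str:
    """Raw-DEFLATE, Base64-encode, then URL-encode a SAML message."""
    # Negative wbits selects raw DEFLATE (no zlib header or checksum).
    compressor = zlib.compressobj(level=9, wbits=-15)
    deflated = compressor.compress(authn_request_xml.encode("utf-8"))
    deflated += compressor.flush()
    return urllib.parse.quote_plus(base64.b64encode(deflated).decode("ascii"))


def decode_redirect_binding(saml_request_param: str) -> str:
    """Reverse of encode_redirect_binding, as the IdP would do on receipt."""
    raw = base64.b64decode(urllib.parse.unquote_plus(saml_request_param))
    return zlib.decompress(raw, wbits=-15).decode("utf-8")


if __name__ == "__main__":
    param = encode_redirect_binding(SAMPLE_AUTHN_REQUEST)
    print("SAMLRequest=" + param)
    print(decode_redirect_binding(param))
```

Decoding a captured SAMLRequest value this way can help when troubleshooting a redirect that Keycloak rejects.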

Prerequisites

To complete this walkthrough, you should have the following set up:

  • An OpenSearch Service domain running OpenSearch or Elasticsearch version 6.7 or later with fine-grained access control enabled within a VPC.
  • Keycloak installed and configured. In this post, we created the IdP in the same VPC of the OpenSearch domain. There is no need for a direct connection between the IdP and the service provider, so you can have the IdP in a different network as well.
  • A properly configured security group for OpenSearch Service and Keycloak IdP server to receive inbound traffic from users.
  • A browser with network connectivity to both Keycloak and OpenSearch Dashboards.

Enable SAML authentication for OpenSearch Service

The first step is to enable SAML authentication for OpenSearch Service. Complete the following steps:

  1. On the OpenSearch Service console, open the details page for your OpenSearch Service domain.
  2. On the Security configuration tab, choose Edit.
  3. Select Enable SAML authentication.

image2

Enabling this option automatically populates several service provider URLs, which are required to configure SAML support in the Keycloak IdP. Note the values under Service provider entity ID and SP-initiated SSO URL. The OpenSearch Dashboards login flow can be configured either as service provider-initiated or IdP-initiated. The service provider-initiated login flow is initiated by OpenSearch Service, and the IdP-initiated login flow is initiated by the IdP (for example, Keycloak). In this post, we use a service provider-initiated login flow.

image3

Configure Keycloak as IdP

During the SAML authentication process, when the user is authenticated, the browser receives a SAML assertion token from Keycloak and forwards it to OpenSearch Service. The OpenSearch Service domain authorizes the user with backend roles according to the attributes presented in the token.

To configure Keycloak as IdP, complete the following steps:

  1. Log in to the Keycloak IdP admin console with admin user privileges (for example, https://<Keycloak server>:8081/admin/).
  2. Choose Create Realm.
  3. For Realm name, enter a name (for example, Amazon_OpenSearch) and choose Create.

The realm you created provides a separate logical space for managing the roles, users, and groups that are specific to OpenSearch Service. Next, create a SAML client in that realm.

  1. In the navigation pane, choose your realm, then choose Clients.
  2. Choose Create client.
  3. In the General Settings window, for Client type, choose SAML.
  4. For Client ID, use the service provider entity ID you copied earlier, then choose Next.
    image6
  5. Under Login settings, enter the service provider-initiated SSO URL copied from earlier (for example, https://vpc-abc123.us-east-1.es.amazonaws.com/_dashboards/_opendistro/_security/saml/acs) and choose Save.
    image7
  6. On the client settings tab, under Signature and Encryption, turn on Sign Assertions and keep all other options as default, then choose Save.
    image8
  7. On the Keys tab, under Signing keys config, turn Client signature required off.

image9

Configure Keycloak users, roles, and groups

After you have configured the Keycloak IdP client for OpenSearch Service, you can create roles, groups, and users on the IdP side. For this post, we create two roles, two groups, and two users, as listed in the following table.

Users Groups Roles
super_user_1 super_user_group super_user_role
readonly_user_1 readonly_user_group readonly_user_role

Complete the following steps:

  1. In the navigation pane for your realm, choose Realm roles.
  2. Choose Create role.
    image10
  3. For Role name, enter a name (for this post, super_user_role) and choose Save.
    image11
  4. Repeat these steps to create a second role, readonly_user_role.

Now let’s create groups, assign the roles to the groups, and map the users to the groups.

  1. Under your realm, choose Groups in the navigation pane.
  2. Choose Create group.
  3. For Name, enter a group name (for example, super_user_group) and choose Save.
    image12
  4. Repeat these steps to create a second group, readonly_user_group.

When the new groups are created, they will be listed on the Groups page.

image13

  1. On the details page for each group, on the Role mapping tab, choose Assign role.
    image14
  2. For the group super_user_group, select the role super_user_role and choose Assign.

image15

  1. Repeat these steps to assign the role readonly_user_role to the group readonly_user_group.

The last step is to create users and assign them to groups so they automatically inherit group privileges. For this post, we create two users, super_user_1 and readonly_user_1, with dashboard admin and dashboard read-only privileges, respectively.

  1. Under your realm, choose Users in the navigation pane.
  2. Choose Create new user.
  3. Under General, configure the user details, including user name, first name, last name, and email, then choose Create.

  1. Set a temporary password on the Credentials tab after you create the user.
  2. Choose Add user and repeat these steps to add your second user, readonly_user_1.
  3. To join a user to a specific group, choose Join Group on the Groups tab of the respective user.

image17

  1. Select the group the user is joining and choose Join. For example, the user super_user_1 is joining the group super_user_group.

  1. Repeat these steps for the user readonly_user_1 to join the group readonly_user_group.

Next, you can remove the default role mapping for the users because you already assigned the roles to their respective groups.

  1. On the Role Mapping tab, select the default role.
  2. Unassign the default role for the user by choosing Unassign and then Remove.
  3. Repeat these steps for the other user.

image19

  1. Choose Client scopes in the navigation pane.
  2. In the Name column, choose role_list.

image20

  1. On the Mappers tab, choose role list.

image21

  1. Turn on Single Role Attribute and choose Save.

Download SAML metadata from Keycloak

The configuration of Keycloak is now complete, so you can download the SAML metadata file from Keycloak. The SAML metadata is in XML format and is needed to configure SAML in the OpenSearch Service domain.

  1. Under your realm, choose Realm settings in the navigation pane.
  2. On the General tab, choose SAML 2.0 Identity Provider Metadata under Endpoints.
    image23

This will generate an IdP metadata file in another window. This XML file contains information on the provider, such as a TLS certificate, SSO endpoints, and the IdP entity ID.

  1. Download this XML file locally so you can upload this file on the OpenSearch Service console in later steps.
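
Before uploading the file, it can be useful to check that it contains the fields OpenSearch Service needs. The following Python sketch reads the IdP entity ID, the SSO endpoints, and the presence of a signing certificate from a metadata document. The sample XML, host name, and function name are hypothetical; the real Keycloak file contains more elements.

```python
import xml.etree.ElementTree as ET

MD_NS = {
    "md": "urn:oasis:names:tc:SAML:2.0:metadata",
    "ds": "http://www.w3.org/2000/09/xmldsig#",
}

# Hypothetical, trimmed metadata; the host name and certificate are placeholders.
SAMPLE_METADATA = """<md:EntityDescriptor
    xmlns:md="urn:oasis:names:tc:SAML:2.0:metadata"
    xmlns:ds="http://www.w3.org/2000/09/xmldsig#"
    entityID="https://keycloak.example.com:8081/realms/Amazon_OpenSearch">
  <md:IDPSSODescriptor protocolSupportEnumeration="urn:oasis:names:tc:SAML:2.0:protocol">
    <md:KeyDescriptor use="signing">
      <ds:KeyInfo><ds:X509Data>
        <ds:X509Certificate>MIIC...placeholder...</ds:X509Certificate>
      </ds:X509Data></ds:KeyInfo>
    </md:KeyDescriptor>
    <md:SingleSignOnService
        Binding="urn:oasis:names:tc:SAML:2.0:bindings:HTTP-POST"
        Location="https://keycloak.example.com:8081/realms/Amazon_OpenSearch/protocol/saml"/>
  </md:IDPSSODescriptor>
</md:EntityDescriptor>"""


def summarize_idp_metadata(xml_text: str) -> dict:
    """Return the entity ID, SSO endpoints, and certificate presence."""
    root = ET.fromstring(xml_text)
    endpoints = [
        (elem.get("Binding"), elem.get("Location"))
        for elem in root.iterfind(".//md:SingleSignOnService", MD_NS)
    ]
    has_certificate = root.find(".//ds:X509Certificate", MD_NS) is not None
    return {
        "entity_id": root.get("entityID"),
        "sso_endpoints": endpoints,
        "has_certificate": has_certificate,
    }


if __name__ == "__main__":
    print(summarize_idp_metadata(SAMPLE_METADATA))
```

The entity_id value is what the OpenSearch Service console should show as the IdP entity ID after the import.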

Integrate OpenSearch Service SAML with Keycloak

To integrate OpenSearch Service with the Keycloak IdP, you need to upload the IdP metadata XML file on the OpenSearch Service console.

  1. On the OpenSearch Service console, navigate to your domain.
  2. Choose Security configuration, then choose Edit.
  3. Under Metadata from IdP, choose Import from XML file to import the file and auto-populate the IdP entity ID.

Alternatively, you can copy and paste the contents of the entity ID property from the metadata file.

image24

  1. For SAML master backend role, enter super_user_role.

A user with this role receives full permissions to the cluster, equivalent to those of the manager user, but can use those permissions only within OpenSearch Dashboards.

image25

  1. Expand the Additional settings section.
  2. For Roles key, enter an attribute from the assertion (in our case, Role) and choose Save Changes.
    image26
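
The Roles key names the SAML attribute whose values are treated as backend roles. As a rough illustration, the following Python sketch extracts those values from an assertion. The sample assertion is hypothetical and unsigned, and the second role value is invented for the example; with Single Role Attribute turned on, the assumption here is that Keycloak sends all roles as values of one Role attribute.

```python
import xml.etree.ElementTree as ET

SAML_NS = {"saml": "urn:oasis:names:tc:SAML:2.0:assertion"}

# Hypothetical, unsigned assertion fragment for illustration only.
SAMPLE_ASSERTION = """<saml:Assertion xmlns:saml="urn:oasis:names:tc:SAML:2.0:assertion">
  <saml:Subject><saml:NameID>super_user_1</saml:NameID></saml:Subject>
  <saml:AttributeStatement>
    <saml:Attribute Name="Role">
      <saml:AttributeValue>super_user_role</saml:AttributeValue>
      <saml:AttributeValue>example_other_role</saml:AttributeValue>
    </saml:Attribute>
  </saml:AttributeStatement>
</saml:Assertion>"""


def roles_from_assertion(xml_text: str, roles_key: str = "Role") -> list:
    """Collect every value of the attribute named by roles_key."""
    root = ET.fromstring(xml_text)
    roles = []
    for attribute in root.iterfind(".//saml:Attribute", SAML_NS):
        if attribute.get("Name") != roles_key:
            continue
        for value in attribute.iterfind("saml:AttributeValue", SAML_NS):
            roles.append((value.text or "").strip())
    return roles


if __name__ == "__main__":
    print(roles_from_assertion(SAMPLE_ASSERTION))
```

If the attribute name in the assertion differs from the Roles key configured on the domain, no backend roles are found and the user signs in without the expected permissions.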

Test the OpenSearch Dashboards SAML authentication with Keycloak

You’re now ready to test the SAML integration with Keycloak as an IdP.

  1. Choose the OpenSearch Dashboards URL provided on the OpenSearch Service console.

It will automatically redirect you to the Keycloak sign-in page for authentication.

  1. Enter the admin user name (super_user_1) and password and choose Sign In.

Upon successful authentication, it will redirect you to the home page of OpenSearch Dashboards. If you encounter issues at this step, refer to SAML troubleshooting for common issues.

Internally, the security plugin maps the backend role super_user_role to the reserved security roles all_access and security_manager. Therefore, Keycloak users with the backend role super_user_role are authorized with the privileges of the manager user in the domain. To grant read-only dashboard access to user readonly_user_1, log in to OpenSearch Dashboards as the user super_user_1. Then map the role readonly_user_role as a backend role for the reserved security role opensearch_dashboards_read_only.
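
The mapping for readonly_user_1 can also be expressed as a call to the security plugin REST API instead of the Dashboards UI. The following Python sketch only builds the request (method, path, and JSON body) and sends nothing. It assumes the `_plugins/_security` API prefix used by OpenSearch domains (legacy Elasticsearch domains use `_opendistro/_security`); the function name is hypothetical.

```python
import json


def build_role_mapping_request(security_role, backend_roles,
                               api_prefix="_plugins/_security"):
    """Build a PUT request that maps backend roles to a security role.

    Assumption: a PUT on this path replaces any existing mapping for the
    role, so include every backend role the mapping should keep.
    """
    path = f"/{api_prefix}/api/rolesmapping/{security_role}"
    body = {"backend_roles": sorted(set(backend_roles))}
    return "PUT", path, json.dumps(body)


if __name__ == "__main__":
    method, path, body = build_role_mapping_request(
        "opensearch_dashboards_read_only", ["readonly_user_role"]
    )
    print(method, path)
    print(body)
```

The request would need to be sent by a user with security management permissions, such as super_user_1 in this walkthrough.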

When establishing access control for the cluster, it’s crucial to carefully manage the permissions granted to users, adhering to the principle of least privilege. By having both super_user_role with administrative capabilities and read-only readonly_user_role, you can strike a balance. This approach allows a small number of trusted users to have full administrative access within OpenSearch Dashboards, while also enabling read-only access for other stakeholders who require visibility but don’t need more access.

At the time of writing, if you specify the <SingleLogoutService /> details in the Keycloak metadata XML, when you sign out from OpenSearch Dashboards, it will call Keycloak directly and try to sign the user out. This doesn’t work currently with some of the versions of OpenSearch Service, because Keycloak expects the sign-out request to be signed with a certificate that OpenSearch Service doesn’t currently support. If you remove <SingleLogoutService /> from the metadata XML file, OpenSearch Service will use its own internal sign-out mechanism and sign the user out on the OpenSearch Service side. No calls will be made to Keycloak for signing out.
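
Removing the element by hand works, but it can also be scripted. The following Python sketch strips every SingleLogoutService element from a metadata document and leaves the rest untouched. The sample XML and function name are hypothetical, and the sketch assumes the metadata file is not itself signed (editing a signed document would invalidate its signature).

```python
import xml.etree.ElementTree as ET

MD = "urn:oasis:names:tc:SAML:2.0:metadata"

# Hypothetical, trimmed metadata with placeholder host names.
SAMPLE_METADATA = f"""<md:EntityDescriptor xmlns:md="{MD}"
    entityID="https://keycloak.example.com:8081/realms/Amazon_OpenSearch">
  <md:IDPSSODescriptor protocolSupportEnumeration="urn:oasis:names:tc:SAML:2.0:protocol">
    <md:SingleLogoutService
        Binding="urn:oasis:names:tc:SAML:2.0:bindings:HTTP-POST"
        Location="https://keycloak.example.com:8081/realms/Amazon_OpenSearch/protocol/saml"/>
    <md:SingleSignOnService
        Binding="urn:oasis:names:tc:SAML:2.0:bindings:HTTP-POST"
        Location="https://keycloak.example.com:8081/realms/Amazon_OpenSearch/protocol/saml"/>
  </md:IDPSSODescriptor>
</md:EntityDescriptor>"""


def strip_single_logout(xml_text: str) -> str:
    """Return the metadata with all SingleLogoutService elements removed."""
    # Keep the conventional prefixes when the document is serialized again.
    ET.register_namespace("md", MD)
    ET.register_namespace("ds", "http://www.w3.org/2000/09/xmldsig#")
    root = ET.fromstring(xml_text)
    # Snapshot the element list first so removal does not disturb iteration.
    for parent in list(root.iter()):
        for child in list(parent):
            if child.tag == f"{{{MD}}}SingleLogoutService":
                parent.remove(child)
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    print(strip_single_logout(SAMPLE_METADATA))
```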

Clean up

If you don’t want to continue using the solution, delete the resources you created:

  • OpenSearch Service domain
  • VPN and Keycloak instance

Conclusion

In this post, you learned how to configure Keycloak as an IdP to access OpenSearch Dashboards using SAML. To learn more about OpenSearch Service and SAML integration, refer to SAML authentication for OpenSearch Dashboards. Stay tuned for a series of posts focusing on SAML integrations with OpenSearch Service and Amazon OpenSearch Serverless.


About the Author

image27

Sajeev is a Senior Cloud Engineer (Big Data & Analytics) and a Subject Matter Expert for Amazon OpenSearch Service. He works closely with AWS customers to provide them architectural and engineering assistance and guidance. He dives deep into big data technologies and streaming solutions and leads onsite and online sessions for customers to design the best solutions for their use cases.

Elevate your search and analytics skills with the new Amazon OpenSearch Service YouTube channel

Post Syndicated from Jagadish Kumar original https://aws.amazon.com/blogs/big-data/elevate-your-search-and-analytics-skills-with-the-new-amazon-opensearch-service-youtube-channel/

Attention all developers, architects, and IT professionals! We’re thrilled to announce the launch of the official Amazon OpenSearch Service YouTube channel, a comprehensive resource for anyone looking to master Amazon OpenSearch Service. Whether you’re just getting started with search, vectors, or analytics, or you’re looking to optimize large-scale implementations, our channel can be your go-to resource to help you unlock the full potential of OpenSearch Service.

OpenSearch is a distributed search and analytics suite that is open source, community-driven, Apache License v2 licensed, and governed by the OpenSearch Software Foundation, under the Linux Foundation.

Amazon OpenSearch Service is a managed service that makes it straightforward to deploy, operate, and scale OpenSearch domains in AWS. OpenSearch Service offers a robust set of features that can transform the way you handle log analytics, real-time monitoring, vector search, and advanced search workloads. But to truly unlock its full potential, you need more than just the basics. That’s where our new YouTube channel comes in.

Dive into a world of practical expertise

We’ve carefully curated a collection of videos that are designed to provide you with the tools, techniques, and insights you need to navigate OpenSearch Service with confidence. The channel also offers a direct line of communication between you and the AWS team. By leaving comments on our videos, you can share your feedback, ideas, and pain points. We’ll be closely monitoring these comments and using them to shape the content we create and influence the future roadmap of OpenSearch Service.

Here’s what sets our channel apart:

  • Bite-sized learning – Our videos are short and concise—each one packed with practical information that you can consume in under 15 minutes. Whether you’re looking for a quick tutorial or a deep-dive into advanced features, we make it effortless for you to learn on the go.
  • Curated content – We hand-pick the most important topics around OpenSearch Service and break them down into simple-to-follow, informative videos. From configuring clusters to scaling for petabyte-scale analytics, we cover the most relevant use cases to help you build, manage, and optimize your OpenSearch environment.
  • Organized playlists – The channel is organized into playlists based on workloads and features of OpenSearch Service such as log analytics, observability, lexical search, vector search, generative AI, and more.
  • Influence the AWS team – You can leave comments on our videos, and your feedback will be reviewed by the AWS team. This input allows us to work backward from your feedback to influence the content we create and feed into the product roadmap.

What you’ll learn

On the OpenSearch Service YouTube channel, you can expect new content regularly, including:

  1. Log Analytics and Observability

    Learn how to ingest, search, and visualize logs at scale with OpenSearch, making log analytics efficient and powerful for enterprises of all sizes. Gain deep insights into using OpenSearch Service for observability, including infrastructure monitoring, application performance management (APM), and more.

  2. Lexical and Semantic Search

    Discover the key differences between lexical and semantic search, and learn how to implement both in OpenSearch Service. We’ll cover optimizing search relevancy, handling complex queries, using machine learning models for semantic understanding and much more.

  3. Vector Database & GenAI

    Explore OpenSearch Service’s vector database capabilities to power advanced semantic search and AI-driven applications. Learn how generative AI models can enhance your search solutions.

  4. Operational Best Practices

    Learn the best practices for running OpenSearch Service in production, covering everything from security and scaling to performance tuning and cost management.

  5. And More

    Expect a wide array of content, including deep dives into new features, architecture best practices, how-to demo videos, use case showcases, and interviews with industry experts.

Subscribe to stay ahead

Whether you’re a beginner looking to get started or an experienced professional seeking to optimize your workflows, make sure to subscribe to the OpenSearch Service YouTube channel so you don’t miss out on the latest tutorials, insights, and updates. Get ready to elevate your search and analytics skills and be part of shaping the future of this channel and powerful service.


About the Authors

Jagadish Kumar (Jag) is a Senior Specialist Solutions Architect at AWS focused on Amazon OpenSearch Service. He is deeply passionate about Data Architecture and helps customers build analytics solutions at scale on AWS.

Sohaib Katariwala is a Senior Specialist Solutions Architect at AWS focused on Amazon OpenSearch Service. He has over 14 years of experience helping organizations derive insights from their data.

Wendy Neu is a Senior Manager at AWS focused on leading the NoSQL Specialist Solutions Architecture team worldwide. She is passionate about Data Services and leverages her extensive expertise to help customers optimize their data storage, management, and analytics strategies, enabling them to drive innovation and achieve their business goals.

Take manual snapshots and restore in a different domain spanning across various Regions and accounts in Amazon OpenSearch Service

Post Syndicated from Madhan Kumar Baskaran original https://aws.amazon.com/blogs/big-data/take-manual-snapshots-and-restore-in-a-different-domain-spanning-across-various-regions-and-accounts-in-amazon-opensearch-service/

Snapshots are crucial for data backup and disaster recovery in Amazon OpenSearch Service. These snapshots allow you to generate backups of your domain indexes and cluster state at specific moments and save them in a reliable storage location such as Amazon Simple Storage Service (Amazon S3).

Snapshots play a critical role in providing the availability, integrity and ability to recover data in OpenSearch Service domains. By implementing a robust snapshot strategy, you can mitigate risks associated with data loss, streamline disaster recovery processes and maintain compliance with data management best practices.

This post provides a detailed walkthrough about how to efficiently capture and manage manual snapshots in OpenSearch Service. It covers the essential steps for taking snapshots of your data, implementing safe transfer across different AWS Regions and accounts, and restoring them in a new domain. This guide is designed to help you maintain data integrity and continuity while navigating complex multi-Region and multi-account environments in OpenSearch Service.

Refer to this developer guide to understand more about index snapshots.

Understanding manual snapshots

Manual snapshots are point-in-time backups of your OpenSearch Service domain that are initiated by the user. Unlike automated snapshots, which OpenSearch Service takes on a regular basis in accordance with the specified retention policy, manual snapshots let you take backups whenever required, whether for the full cluster or for individual indexes. This is particularly useful when you want to preserve a specific state of your data for future reference or before implementing significant changes to your domain.

Snapshots are not instantaneous. They take time to complete and don’t represent perfect point-in-time views of the domain. While a snapshot is in progress, you can still index documents and make other requests to the domain, but new documents and updates to existing documents generally aren’t included in the snapshot. The snapshot includes primary shards as they existed when you initiate the snapshot process.

The following are some scenarios where manual snapshots play an important role:

  • Data recovery – The primary purpose of snapshots, whether manual or automated, is to provide a means of data recovery in the event of a failure or data loss. If something goes wrong with your domain, you can restore it to a previous state using a snapshot.
  • Migration – Manual snapshots can be useful when you want to migrate data from one domain to another. You can create a snapshot of the source domain and then restore it on the target domain.
  • Testing and development – You can use snapshots to create copies of your data for testing or development purposes. This allows you to experiment with your data without affecting the production environment.
  • Backup control – Manual snapshots give you more control over your backup process. You can choose exactly when to create a snapshot, which can be useful if you have specific requirements that are not met by automated snapshots.
  • Long-term archiving – Manual snapshots can be kept for as long as you want, which can be useful for long-term archiving of data. Automated snapshots, on the other hand, are often deleted after a certain period of time.

Solution overview

The following sections outline the procedure for taking a manual snapshot and then restoring it in a different domain, spanning across various Regions and accounts. The high-level steps are as follows:

  1. Create an AWS Identity and Access Management (IAM) role and user.
  2. Register a manual snapshot repository.
  3. Take manual snapshots.
  4. Set up S3 bucket replication.
  5. Create an IAM role and user in the target account.
  6. Add a bucket policy.
  7. Register the repository and restore snapshots in the target domain.

Prerequisites

This post assumes you have the following resources set up:

  • An active and running OpenSearch Service domain.
  • An S3 bucket to store the manual snapshots of your OpenSearch Service domain. The bucket has to be in the same Region where the OpenSearch Service domain is hosted.

Create an IAM role and user

Complete the following steps to create your IAM role and user:

  1. Create an IAM role to grant permissions to OpenSearch Service. For this post, we name the role TheSnapshotRole.
  2. Create a new policy using the following code and attach it to the role to allow access to the S3 bucket.
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Action": [
        "s3:ListBucket"
      ],
      "Effect": "Allow",
      "Resource": [
        "arn:aws:s3:::s3-bucket-name"
      ]
    },
    {
      "Action": [
        "s3:GetObject",
        "s3:PutObject",
        "s3:DeleteObject"
      ],
      "Effect": "Allow",
      "Resource": [
        "arn:aws:s3:::s3-bucket-name/*"
      ]
    }
  ]
}
  3. Edit the trust relationship of TheSnapshotRole to specify OpenSearch Service in the Principal statement, as shown in the following example. Under the Condition block, we recommend that you use the aws:SourceAccount and aws:SourceArn condition keys to protect yourself against the confused deputy problem. The source account is the account that owns the domain, and the source ARN is the ARN of the OpenSearch Service domain.
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "",
      "Effect": "Allow",
      "Principal": {
        "Service": "es.amazonaws.com"
      },
      "Action": "sts:AssumeRole",
      "Condition": {
        "StringEquals": {
          "aws:SourceAccount": "account-id"
        },
        "ArnLike": {
          "aws:SourceArn": "arn:aws:es:region:account-id:domain/domain-name"
        }
      }
    }
  ]
}
  4. Create an IAM user to register the snapshot repository. For this post, we name the user TheSnapUser.
  5. To register a snapshot repository, you need to pass TheSnapshotRole to OpenSearch Service. You also need access to the es:ESHttpPut action. To grant both of these permissions, attach the following policy to the IAM user or role whose credentials are used to sign the request (in this post, TheSnapUser).
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": "iam:PassRole",
      "Resource": "arn:aws:iam::123456789012:role/TheSnapshotRole"
    },
    {
      "Effect": "Allow",
      "Action": "es:ESHttpPut",
      "Resource": "arn:aws:es:region:123456789012:domain/domain-name/*"
    }
  ]
}
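
The trust policy and the caller policy in this section differ between environments only in account ID, Region, domain name, and role name. As a convenience, the following Python sketch generates both documents from those values so the placeholders are filled in consistently. The function names are hypothetical; the policy contents mirror the JSON shown in this section.

```python
import json


def build_snapshot_trust_policy(account_id, region, domain_name):
    """Trust policy that lets OpenSearch Service assume the snapshot role,
    restricted to one account and one domain (confused deputy protection)."""
    domain_arn = f"arn:aws:es:{region}:{account_id}:domain/{domain_name}"
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "",
                "Effect": "Allow",
                "Principal": {"Service": "es.amazonaws.com"},
                "Action": "sts:AssumeRole",
                "Condition": {
                    "StringEquals": {"aws:SourceAccount": account_id},
                    "ArnLike": {"aws:SourceArn": domain_arn},
                },
            }
        ],
    }


def build_register_repo_policy(account_id, region, domain_name, role_name):
    """Policy for the identity that signs the repository registration request."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": "iam:PassRole",
                "Resource": f"arn:aws:iam::{account_id}:role/{role_name}",
            },
            {
                "Effect": "Allow",
                "Action": "es:ESHttpPut",
                "Resource": f"arn:aws:es:{region}:{account_id}:domain/{domain_name}/*",
            },
        ],
    }


if __name__ == "__main__":
    # Placeholder values; substitute your own account, Region, and domain.
    print(json.dumps(build_snapshot_trust_policy(
        "123456789012", "us-east-1", "domain-name"), indent=2))
    print(json.dumps(build_register_repo_policy(
        "123456789012", "us-east-1", "domain-name", "TheSnapshotRole"), indent=2))
```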

Register a manual snapshot repository

Complete the following steps to map the snapshot role and the user in OpenSearch Dashboards (if using fine-grained access control):

  1. Navigate to the OpenSearch Dashboards endpoint connected to your OpenSearch Service domain.
  2. Sign in with the admin user or a user with the security_manager role.
  3. From the main menu, choose Security, Roles, and select the manage_snapshots role.
  4. Choose Mapped users, then choose Manage mapping.
  5. Add the ARN of TheSnapshotRole for Backend role and the ARN of TheSnapUser for User:
    1. arn:aws:iam::123456789012:role/TheSnapshotRole
    2. arn:aws:iam::123456789012:user/TheSnapUser
  6. Choose Map and confirm the user and role show up under Mapped users.
  7. To register a snapshot repository, send a PUT request to the OpenSearch Service domain endpoint through an API platform like Postman or Insomnia. For more details, see Registering a manual snapshot repository.

Note: When using Postman or Insomnia to run the API calls mentioned throughout this post, choose AWS IAM v4 as the authentication method and enter your IAM credentials in the Authorization section. Make sure you use the credentials of an OpenSearch user who has the ‘all_access’ OpenSearch role assigned on the domain.

curl -XPUT 'domain-endpoint/_snapshot/my-snapshot-repo-name' \
  -H 'Content-Type: application/json' \
  -d '{
  "type": "s3",
  "settings": {
    "bucket": "s3-bucket-name",
    "region": "region",
    "role_arn": "arn:aws:iam::123456789012:role/TheSnapshotRole"
  }
}'

If your domain resides within a virtual private cloud (VPC), you must be connected to the VPC for the request to successfully register the snapshot repository. Accessing a VPC varies by network configuration, but likely involves connecting to a VPN or corporate network. To check that you can reach the OpenSearch Service domain, navigate to https://<your-vpc-domain.region>.es.amazonaws.com in a web browser and verify that you receive the default JSON response.
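
As an alternative to Postman or Insomnia, the registration request can be signed and sent from a script. The following Python sketch separates building the request from sending it. The sender assumes the third-party packages boto3, requests, and requests-aws4auth are installed and that credentials for TheSnapUser are available to boto3; the function names and the example values are hypothetical.

```python
def build_register_repo_request(domain_endpoint, repo_name, bucket, region, role_arn):
    """Return the URL and JSON payload for registering an S3 repository."""
    url = f"{domain_endpoint.rstrip('/')}/_snapshot/{repo_name}"
    payload = {
        "type": "s3",
        "settings": {"bucket": bucket, "region": region, "role_arn": role_arn},
    }
    return url, payload


def register_repo(domain_endpoint, repo_name, bucket, region, role_arn):
    """Sign the request with SigV4 and send it (needs third-party packages)."""
    # Imported here so the builder above works without these packages.
    import boto3
    import requests
    from requests_aws4auth import AWS4Auth

    credentials = boto3.Session().get_credentials()
    auth = AWS4Auth(
        credentials.access_key,
        credentials.secret_key,
        region,
        "es",  # service name used for signing OpenSearch Service requests
        session_token=credentials.token,
    )
    url, payload = build_register_repo_request(
        domain_endpoint, repo_name, bucket, region, role_arn
    )
    response = requests.put(url, auth=auth, json=payload, timeout=30)
    response.raise_for_status()
    return response.json()


if __name__ == "__main__":
    # Placeholder values; this only prints the request and sends nothing.
    print(build_register_repo_request(
        "https://domain-endpoint",
        "my-snapshot-repo-name",
        "s3-bucket-name",
        "us-east-1",
        "arn:aws:iam::123456789012:role/TheSnapshotRole",
    ))
```

For a VPC domain, the script has to run from a host that can reach the domain endpoint, as described above.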

Take manual snapshots

You can’t take a snapshot while another snapshot is in progress. The UltraWarm storage tier migration process also uses snapshots to move data between hot and warm storage, and it runs in the background. Additionally, the service takes automated snapshots based on the schedule configured for the cluster. For this reason, check that no snapshot is running before you take a manual one. To protect your snapshot data in Amazon S3, see Protecting data with encryption.

  1. To verify that no snapshot is in progress, run the following command.
curl -XGET 'domain-endpoint/_snapshot/_status'
  2. After you confirm no snapshot is running, run the following command to take a manual snapshot.
curl -XPUT 'domain-endpoint/_snapshot/repository-name/snapshot-name'

  3. Run the following command to verify the state of all snapshots of your domain.
curl -XGET 'domain-endpoint/_snapshot/repository-name/_all?pretty'
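
The responses to these commands are JSON documents, so the checks can be automated. The following Python sketch interprets already-parsed responses. It assumes that the _status response lists running snapshots under a snapshots key (empty when none is running) and that each entry in the _all response carries snapshot and state fields; the sample responses are hypothetical and trimmed.

```python
def snapshot_in_progress(status_response: dict) -> bool:
    """True if the _snapshot/_status response reports a running snapshot."""
    return len(status_response.get("snapshots", [])) > 0


def snapshot_states(list_response: dict) -> dict:
    """Map snapshot name to state from a _snapshot/<repo>/_all response."""
    return {
        entry["snapshot"]: entry["state"]
        for entry in list_response.get("snapshots", [])
    }


if __name__ == "__main__":
    # Hypothetical, trimmed responses for illustration.
    idle_status = {"snapshots": []}
    listing = {
        "snapshots": [
            {"snapshot": "snapshot-name", "state": "SUCCESS"},
            {"snapshot": "older-snapshot", "state": "PARTIAL"},
        ]
    }
    print(snapshot_in_progress(idle_status))
    print(snapshot_states(listing))
```

A snapshot whose state is anything other than SUCCESS deserves a closer look before you rely on it for a restore.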

Set up S3 bucket replication

Before you start, have the following in place:

  1. Locate the destination bucket where the data will be replicated. If you don’t have one, create a new S3 bucket in a Region different from that of the source bucket.
  2. To allow access to objects in this bucket by other AWS accounts (because the destination OpenSearch Service domain is in a different account), you need to enable access control lists (ACLs) on the bucket. ACLs will be used to specify and manage access permissions for the bucket and its objects.

Complete the following steps to set up S3 bucket replication. For more information, see Walkthroughs: Examples for configuring replication.

  1. On the Amazon S3 console, choose Buckets in the navigation pane.
  2. Choose the bucket you want to replicate (the source bucket with snapshots).
  3. On the Management tab, choose Create replication rule.
  4. Replication requires versioning to be enabled for the source bucket, so choose Enable bucket versioning and enable versioning.
  5. Specify the following details:
    1. For Rule ID, enter a name for your rule.
    2. For Status, choose Enabled.
    3. For Rule scope, specify the data to be replicated.
    4. For Destination S3 bucket, enter the target bucket name where the data will be replicated.
    5. For IAM role, choose Create new role.
  6. Choose Save.
  7. In the Replicate existing objects pop-up window, select Yes, replicate existing objects to start replication.
  8. Choose Submit.

You will see a new active replication rule in the replication table on the Management tab of the source S3 bucket.
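
The same rule can be created programmatically. The following Python sketch builds a minimal replication configuration and, separately, applies it with boto3. It is a sketch under assumptions: the replication role ARN and bucket names are placeholders, the IAM role must already exist (the console step above creates one for you), and the sketch does not replicate objects that already exist in the source bucket.

```python
def build_replication_config(role_arn, destination_bucket,
                             rule_id="replicate-snapshots", prefix=""):
    """Build a replication configuration with a single enabled rule.

    An empty prefix applies the rule to every object in the source bucket.
    """
    return {
        "Role": role_arn,
        "Rules": [
            {
                "ID": rule_id,
                "Priority": 1,
                "Status": "Enabled",
                "Filter": {"Prefix": prefix},
                "DeleteMarkerReplication": {"Status": "Disabled"},
                "Destination": {"Bucket": f"arn:aws:s3:::{destination_bucket}"},
            }
        ],
    }


def apply_replication(source_bucket, config):
    """Enable versioning on the source bucket and attach the rule (needs boto3)."""
    import boto3  # third-party; imported here so the builder runs without it

    s3 = boto3.client("s3")
    s3.put_bucket_versioning(
        Bucket=source_bucket,
        VersioningConfiguration={"Status": "Enabled"},
    )
    s3.put_bucket_replication(
        Bucket=source_bucket,
        ReplicationConfiguration=config,
    )


if __name__ == "__main__":
    # Placeholder role ARN and bucket name; this only prints the configuration.
    print(build_replication_config(
        "arn:aws:iam::123456789012:role/example-replication-role",
        "destination-bucket-name",
    ))
```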

Create an IAM role and user in the target account

Complete the following steps to create your IAM role and user in the target account.

  1. Create an IAM role to grant permissions to the target OpenSearch Service. For this post, name the role DestinationSnapshotRole.
  2. Create a new policy using the following code and attach it to the role DestinationSnapshotRole to allow access to the target S3 bucket. In both Resource entries, replace s3-bucket-name with the name of the replicated S3 bucket.
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Action": [
        "s3:ListBucket"
      ],
      "Effect": "Allow",
      "Resource": [
        "arn:aws:s3:::s3-bucket-name"
      ]
    },
    {
      "Action": [
        "s3:GetObject",
        "s3:PutObject",
        "s3:DeleteObject"
      ],
      "Effect": "Allow",
      "Resource": [
        "arn:aws:s3:::s3-bucket-name/*"
      ]
    }
  ]
}
  1. Edit the trust relationship of DestinationSnapshotRole to specify OpenSearch Service in the Principal statement as shown in the following example.
{
  "Version":"2012-10-17",
  "Statement":[
    {
      "Sid":"",
      "Effect":"Allow",
      "Principal":{
        "Service":"es.amazonaws.com"
      },
      "Action":"sts:AssumeRole",
      "Condition":{
        "StringEquals":{
          "aws:SourceAccount":"account-id" -> Target Account
        },
        "ArnLike":{
          "aws:SourceArn":"arn:aws:es:region:account-id:domain/domain-name/*" -> Target OpenSearch Domain
        }
      }
    }
  ]
}
  4. Create an IAM user to register the snapshot repository. For this post, name the user DestinationSnapUser.
  5. To register a snapshot repository, you need to pass DestinationSnapshotRole to OpenSearch Service. You also need access to the es:ESHttpPut action. To grant both of these permissions, attach the following policy to the IAM user or role whose credentials are being used to sign the request. Replace region and domain-name with the values for the target OpenSearch Service domain.
{
  "Version":"2012-10-17",
  "Statement":[
    {
      "Effect":"Allow",
      "Action":"iam:PassRole",
      "Resource":"arn:aws:iam::123456789012:role/DestinationSnapshotRole"
    },
    {
      "Effect":"Allow",
      "Action":"es:ESHttpPut",
      "Resource":"arn:aws:es:region:123456789012:domain/domain-name/*"
    }
  ]
}
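The three policy documents above differ only in a handful of placeholders. The following is a minimal sketch, not part of the original procedure, that renders them from parameters so the bucket name, account ID, Region, and domain name stay consistent. The function name `snapshot_policies` and the dictionary keys are hypothetical; the policy contents mirror the documents shown above.

```python
import json


def snapshot_policies(bucket, account_id, region, domain,
                      role_name="DestinationSnapshotRole"):
    """Build the three policy documents from the steps above as plain dicts."""
    role_arn = f"arn:aws:iam::{account_id}:role/{role_name}"
    domain_arn = f"arn:aws:es:{region}:{account_id}:domain/{domain}/*"

    # Permissions policy attached to the snapshot role (replicated bucket access).
    s3_access = {
        "Version": "2012-10-17",
        "Statement": [
            {"Action": ["s3:ListBucket"], "Effect": "Allow",
             "Resource": [f"arn:aws:s3:::{bucket}"]},
            {"Action": ["s3:GetObject", "s3:PutObject",
                        "s3:DeleteObject", "iam:PassRole"],
             "Effect": "Allow",
             "Resource": [f"arn:aws:s3:::{bucket}/*"]},
        ],
    }

    # Trust policy that lets OpenSearch Service assume the snapshot role.
    trust = {
        "Version": "2012-10-17",
        "Statement": [{
            "Sid": "",
            "Effect": "Allow",
            "Principal": {"Service": "es.amazonaws.com"},
            "Action": "sts:AssumeRole",
            "Condition": {
                "StringEquals": {"aws:SourceAccount": account_id},
                "ArnLike": {"aws:SourceArn": domain_arn},
            },
        }],
    }

    # Policy for the identity that signs the repository registration request.
    pass_role = {
        "Version": "2012-10-17",
        "Statement": [
            {"Effect": "Allow", "Action": "iam:PassRole", "Resource": role_arn},
            {"Effect": "Allow", "Action": "es:ESHttpPut", "Resource": domain_arn},
        ],
    }
    return {"s3_access": s3_access, "trust": trust, "pass_role": pass_role}


if __name__ == "__main__":
    docs = snapshot_policies("s3-bucket-name", "123456789012", "us-east-1", "domain-name")
    for name, doc in docs.items():
        print(name)
        print(json.dumps(doc, indent=2))
```

The printed JSON can be pasted into the IAM console; the sketch only formats text and does not call any AWS API.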

Complete the following steps to map the snapshot role and user in the target OpenSearch Dashboards (if using fine-grained access control).

  1. Navigate to the OpenSearch Dashboards endpoint connected with your OpenSearch Service domain.
  2. Sign in with the admin user or a user with the security_manager role.
  3. From the main menu, choose Security, Roles, and choose the manage_snapshots role.
  4. Choose Mapped users, then choose Manage mapping.
  5. Add the ARN of DestinationSnapshotRole for Backend role and the ARN of DestinationSnapUser for User:
    1. arn:aws:iam::123456789012:role/DestinationSnapshotRole
    2. arn:aws:iam::123456789012:user/DestinationSnapUser
  6. Choose Map and confirm that the user and role show up under Mapped users.
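The same mapping can also be expressed as a request to the security plugin's role mapping API instead of clicking through Dashboards. The following sketch only builds the path and body; it assumes the `_plugins/_security/api/rolesmapping/<role>` endpoint is reachable on your domain, and the function name is hypothetical. Note that a PUT to this endpoint replaces the existing mapping for the role, so include any entries you want to keep.

```python
import json


def manage_snapshots_mapping(account_id,
                             role_name="DestinationSnapshotRole",
                             user_name="DestinationSnapUser"):
    """Return (path, body) for mapping the snapshot role and user to manage_snapshots."""
    path = "_plugins/_security/api/rolesmapping/manage_snapshots"
    body = {
        # IAM role ARN goes under backend roles, IAM user ARN under users.
        "backend_roles": [f"arn:aws:iam::{account_id}:role/{role_name}"],
        "users": [f"arn:aws:iam::{account_id}:user/{user_name}"],
    }
    return path, body


if __name__ == "__main__":
    path, body = manage_snapshots_mapping("123456789012")
    print(f"PUT {path}")
    print(json.dumps(body, indent=2))
```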

Add a bucket policy

In the destination S3 bucket details page, on the Permissions tab, choose Edit, then add the following bucket policy. This policy allows the target OpenSearch Service domain from another AWS account to access the snapshot created by a different AWS account.

{
  "Version":"2012-10-17",
  "Id":"Policy1568001010746",
  "Statement":[
    {
      "Sid":"Stmt1568000712531",
      "Effect":"Allow",
      "Principal":{
        "AWS":"arn:aws:iam::target-account-id:role/DestinationSnapshotRole"
      },
      "Action":"s3:*",
      "Resource":"arn:aws:s3:::snapshot"
    },
    {
      "Sid":"Stmt1568001007239",
      "Effect":"Allow",
      "Principal":{
        "AWS":"arn:aws:iam::target-account-id:role/DestinationSnapshotRole"
      },
      "Action":[
        "s3:GetObject",
        "s3:PutObject",
        "s3:DeleteObject"
      ],
      "Resource":"arn:aws:s3:::snapshot/*"
    }
  ]
}

Register the repository and restore snapshots in the target domain

To complete this step, you need an active and running OpenSearch Service domain in the target account.

Identify the snapshot you want to restore. Make sure all settings for this index, such as custom analyzer packages or allocation requirement settings, and data are compatible with the domain. Then complete the following steps:

  1. To register the repository in the target OpenSearch Service domain, run the following command.
curl -XPUT 'domain-endpoint/_snapshot/my-snapshot-repo-name' \
-H 'Content-Type: application/json' \
-d '{
  "type": "s3",
  "settings": {
    "bucket": "s3-bucket-name",
    "region": "region",
    "role_arn": "arn:aws:iam::123456789012:role/DestinationSnapshotRole"
  }
}'
  2. After you register the repository, run the following command to see all snapshots.
curl -XGET 'domain-endpoint/_snapshot/repository-name/_all?pretty'
  3. To restore a snapshot, run the following command.
curl -XPOST 'domain-endpoint/_snapshot/repository-name/snapshot-name/_restore'
  4. Alternatively, you might want to restore all indexes except the dashboards and fine-grained access control indexes.
curl -XPOST 'domain-endpoint/_snapshot/repository-name/snapshot-name/_restore' \
-d '{"indices": "-.kibana*,-.opendistro*"}' \
-H 'Content-Type: application/json'
  5. Sign in to OpenSearch Dashboards connected to the target OpenSearch Service domain and run the following command in Dev Tools to check whether the data is being restored.
GET _cat/indices?v
  6. Run the following recovery command to check the progress of the restore operation.
GET _cat/recovery?v
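The register and restore calls above can also be prepared from a script. The following sketch uses only the Python standard library to build the two requests without sending them. The function names are hypothetical, and the sketch deliberately leaves out request signing: when the domain uses IAM-based access, the requests must be signed with Signature Version 4 using the credentials of the identity that holds the iam:PassRole and es:ESHttpPut permissions.

```python
import json
import urllib.request


def register_repo_request(endpoint, repo, bucket, region, role_arn):
    """Build (but do not send) the PUT request that registers an S3 snapshot repository."""
    body = {
        "type": "s3",
        "settings": {"bucket": bucket, "region": region, "role_arn": role_arn},
    }
    return urllib.request.Request(
        f"{endpoint}/_snapshot/{repo}",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="PUT",
    )


def restore_request(endpoint, repo, snapshot,
                    exclude=(".kibana*", ".opendistro*")):
    """Build the POST request that restores every index except the excluded patterns."""
    # A leading "-" on a pattern excludes matching indexes from the restore.
    body = {"indices": ",".join(f"-{pattern}" for pattern in exclude)}
    return urllib.request.Request(
        f"{endpoint}/_snapshot/{repo}/{snapshot}/_restore",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


if __name__ == "__main__":
    req = register_repo_request(
        "https://domain-endpoint", "my-snapshot-repo-name", "s3-bucket-name",
        "us-east-1", "arn:aws:iam::123456789012:role/DestinationSnapshotRole",
    )
    print(req.get_method(), req.full_url)
    print(req.data.decode("utf-8"))
```

Keeping request construction separate from sending makes it straightforward to plug in whichever signing library you already use.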

Troubleshooting

This re:Post article addresses the majority of common errors that arise when attempting to restore a manual snapshot, along with effective solutions to resolve them.

Conclusion

In this post, we presented a procedure for taking manual snapshots and restoring them in OpenSearch Service. With manual snapshots, you have the power to manage your data backups, preserving key moments in time, confidently experimenting with domain modifications, and protecting against any data loss. Additionally, being able to restore snapshots across various domains, Regions, and accounts enables a new degree of data portability and flexibility, giving you the freedom to better manage and optimize your domains.

With great data protection comes great innovation. Now that you’re equipped with this knowledge, you can explore the endless possibilities that OpenSearch Service offers, confident in your ability to secure, restore, and thrive in the dynamic world of cloud-based data analytics and management.

See this blog post to understand how to use snapshot management policies to manage automated snapshots in OpenSearch Service.

If you have feedback about this post, submit it in the comments section. If you have questions about this post, start a new thread on the Amazon OpenSearch Service forum or contact AWS Support.

Stay tuned for more exciting updates and new features in Amazon OpenSearch Service.


About the authors

Madhan Kumar Baskaran works as a Search Engineer at AWS, specializing in Amazon OpenSearch Service. His primary focus involves assisting customers in constructing scalable search applications and analytics solutions. Based in Bellevue, Washington, Madhan has a keen interest in data engineering and DevOps.

Priyanshi Omer is a Customer Success Engineer at AWS OpenSearch, based in Bengaluru. Her primary focus involves assisting customers in constructing scalable search applications and analytics solutions. She works closely with customers to help them migrate their workloads and aids existing customers in fine-tuning their clusters to achieve better performance and cost savings. Outside of work, she enjoys spending time with her cats and playing video games.

Extract insights in a 30TB time series workload with Amazon OpenSearch Serverless

Post Syndicated from Satish Nandi original https://aws.amazon.com/blogs/big-data/extract-insights-in-a-30tb-time-series-workload-with-amazon-opensearch-serverless/

In today’s data-driven landscape, managing and analyzing vast amounts of data, especially logs, is crucial for organizations to derive insights and make informed decisions. However, handling large data while extracting insights is a significant challenge, prompting organizations to seek scalable solutions without the complexity of infrastructure management.

Amazon OpenSearch Serverless reduces the burden of manual infrastructure provisioning and scaling while still empowering you to ingest, analyze, and visualize your time-series data, simplifying data management and enabling you to derive actionable insights from data.

We recently announced a new capacity level of 30TB for time series data per account per AWS Region. The OpenSearch Serverless compute capacity for data ingestion and search/query is measured in OpenSearch Compute Units (OCUs), which are shared among various collections with the same AWS Key Management Service (AWS KMS) key. To accommodate larger datasets, OpenSearch Serverless now supports up to 500 OCUs per account per Region each for indexing and search, more than double the previous limit of 200. You can configure the maximum OCU limits on search and indexing independently, which helps you manage costs effectively. You can also monitor real-time OCU usage with Amazon CloudWatch metrics to gain a better perspective on your workload’s resource consumption. With the support for 30TB datasets, you can analyze data at the 30TB level to unlock valuable operational insights and make data-driven decisions to troubleshoot application downtime, improve system performance, or identify fraudulent activities.

This post discusses how you can analyze 30TB time series datasets with OpenSearch Serverless.

Innovations and optimizations to support larger data size and faster responses

Sufficient disk, memory, and CPU resources are crucial for handling extensive data effectively and conducting thorough analysis. In time series collections, the OCU disk typically contains older shards that are not frequently accessed, referred to as warm shards. We have introduced a new feature called warm shard recovery prefetch. This feature actively monitors recently queried data blocks for a shard and prioritizes them during shard movements, such as shard balancing, vertical scaling, and deployment activities. More importantly, it accelerates auto-scaling and provides faster readiness for varying search workloads, thereby significantly improving system performance. The results later in this post provide details on the improvements.

A few select customers worked with us on early adoption prior to General Availability. In these trials, we observed up to 66% improvement in warm query performance for some customer workloads. Additionally, we have enhanced the concurrency between coordinator and worker nodes, allowing more requests to be processed as the number of OCUs increases through auto scaling. This enhancement has resulted in up to a 10% improvement in query performance for hot and warm queries.

We have enhanced our system’s stability to handle time-series collections of up to 30 TB effectively. Our team is committed to improving system performance, as demonstrated by our ongoing enhancements to the auto-scaling system. These improvements comprise enhanced shard distribution for optimal placement after rollover, auto-scaling policies based on queue length, and a dynamic sharding strategy that adjusts shard count based on ingestion rate.

In the following section, we share an example test setup of a 30TB workload that we used internally, detailing the data being used and generated, along with our observations and results. Performance may vary depending on the specific workload.

Ingest the data

You can use the load generation scripts shared in the following workshop, or you can use your own application or data generator to create a load. You can run multiple instances of these scripts to generate a burst in indexing requests. As shown in the following screenshot, we tested with an index, sending approximately 30 TB of data over a period of 15 days. We used our load generator script to send the traffic to a single index, retaining data for 15 days using a data life cycle policy.

Test methodology

We set the deployment type to ‘Enable redundancy’ to enable data replication across Availability Zones. This deployment configuration will lead to 12-24 hours of data in hot storage (OCU disk memory) and the rest in Amazon Simple Storage Service (Amazon S3). With a defined set of search performance and the preceding ingestion expectation, we set the max OCUs to be 500 for both indexing and search.

As part of the testing, we observed auto-scaling behavior and graphed it. Indexing took around 8 hours to stabilize at 80 OCUs.

On the search side, it took around 2 days to stabilize at 80 OCUs.

Observations:

Ingestion

The ingestion performance achieved was consistently over 2 TB per day.
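As a quick sanity check on these figures, the arithmetic below relates the 30 TB over 15 days from the test setup to the 2 TB per day ingestion rate, and converts that rate to a sustained per-second throughput. This is a sketch that assumes decimal units (1 TB = 1,000,000 MB); the function names are illustrative.

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400


def required_daily_rate_tb(total_tb, days):
    """Average daily ingestion needed to reach total_tb in the given number of days."""
    return total_tb / days


def tb_per_day_to_mb_per_s(tb_per_day):
    """Convert TB/day to MB/s, assuming decimal units (1 TB = 1,000,000 MB)."""
    return tb_per_day * 1_000_000 / SECONDS_PER_DAY


if __name__ == "__main__":
    rate = required_daily_rate_tb(30, 15)
    print(f"{rate} TB/day is about {tb_per_day_to_mb_per_s(rate):.1f} MB/s sustained")
```

In other words, sustaining 2 TB per day corresponds to roughly 23 MB per second of continuous ingestion under this assumption.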

Search

Queries were of two types, with time ranging from 15 minutes to 15 days.

{"aggs":{"1":{"cardinality":{"field":"carrier.keyword"}}},"size":0,"query":{"bool":{"filter":[{"range":{"@timestamp":{"gte":"now-15m","lte":"now"}}}]}}}

For example

{"aggs":{"1":{"cardinality":{"field":"carrier.keyword"}}},"size":0,"query":{"bool":{"filter":[{"range":{"@timestamp":{"gte":"now-1d","lte":"now"}}}]}}}

The following chart provides the various percentile performance on the aggregation query.

The second query was:

{"query":{"bool":{"filter":[{"range":{"@timestamp":{"gte":"now-15m","lte":"now"}}}],"should":[{"match":{"originState":"State"}}]}}}

For example

{"query":{"bool":{"filter":[{"range":{"@timestamp":{"gte":"now-15m","lte":"now"}}}],"should":[{"match":{"originState":"California"}}]}}}

The following chart provides the various percentile performance on the search query.

The following table summarizes the time range for different queries.

| Time range | Query | P50 (ms) | P90 (ms) | P95 (ms) | P99 (ms) |
|---|---|---|---|---|---|
| 15 minutes | {"aggs":{"1":{"cardinality":{"field":"carrier.keyword"}}},"size":0,"query":{"bool":{"filter":[{"range":{"@timestamp":{"gte":"now-15m","lte":"now"}}}]}}} | 325 | 403.867 | 441.917 | 514.75 |
| 1 day | {"aggs":{"1":{"cardinality":{"field":"carrier.keyword"}}},"size":0,"query":{"bool":{"filter":[{"range":{"@timestamp":{"gte":"now-1d","lte":"now"}}}]}}} | 7,693.06 | 12,294 | 13,411.19 | 17,481.4 |
| 1 hour | {"aggs":{"1":{"cardinality":{"field":"carrier.keyword"}}},"size":0,"query":{"bool":{"filter":[{"range":{"@timestamp":{"gte":"now-1h","lte":"now"}}}]}}} | 1,061.66 | 1,397.27 | 1,482.75 | 1,719.53 |
| 1 year | {"aggs":{"1":{"cardinality":{"field":"carrier.keyword"}}},"size":0,"query":{"bool":{"filter":[{"range":{"@timestamp":{"gte":"now-1y","lte":"now"}}}]}}} | 2,758.66 | 10,758 | 12,028 | 22,871.4 |
| 4 hours | {"aggs":{"1":{"cardinality":{"field":"carrier.keyword"}}},"size":0,"query":{"bool":{"filter":[{"range":{"@timestamp":{"gte":"now-4h","lte":"now"}}}]}}} | 3,870.79 | 5,233.73 | 5,609.9 | 6,506.22 |
| 7 days | {"aggs":{"1":{"cardinality":{"field":"carrier.keyword"}}},"size":0,"query":{"bool":{"filter":[{"range":{"@timestamp":{"gte":"now-7d","lte":"now"}}}]}}} | 5,395.68 | 17,538.12 | 19,159.18 | 22,462.32 |
| 15 minutes | {"query":{"bool":{"filter":[{"range":{"@timestamp":{"gte":"now-15m","lte":"now"}}}],"should":[{"match":{"originState":"California"}}]}}} | 139 | 190 | 234.55 | 6,071.96 |
| 1 day | {"query":{"bool":{"filter":[{"range":{"@timestamp":{"gte":"now-1d","lte":"now"}}}],"should":[{"match":{"originState":"California"}}]}}} | 678.917 | 1,366.63 | 2,423 | 7,893.56 |
| 1 hour | {"query":{"bool":{"filter":[{"range":{"@timestamp":{"gte":"now-1h","lte":"now"}}}],"should":[{"match":{"originState":"Washington"}}]}}} | 259.167 | 305.8 | 343.3 | 1,125.66 |
| 1 year | {"query":{"bool":{"filter":[{"range":{"@timestamp":{"gte":"now-1y","lte":"now"}}}],"should":[{"match":{"originState":"Washington"}}]}}} | 2,166.33 | 2,469.7 | 4,804.9 | 9,440.11 |
| 4 hours | {"query":{"bool":{"filter":[{"range":{"@timestamp":{"gte":"now-4h","lte":"now"}}}],"should":[{"match":{"originState":"Washington"}}]}}} | 462.933 | 653.6 | 725.3 | 1,583.37 |
| 7 days | {"query":{"bool":{"filter":[{"range":{"@timestamp":{"gte":"now-7d","lte":"now"}}}],"should":[{"match":{"originState":"Washington"}}]}}} | 1,353 | 2,745.1 | 4,338.8 | 9,496.36 |
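The queries used in the test differ only in the time window and, for the search query, the state being matched. The following sketch generates both query shapes from those parameters; the helper names are hypothetical, while the field names and structure are taken from the queries shown in this section.

```python
import json


def time_filter(window):
    """Range filter on @timestamp covering the last `window` (for example "15m" or "7d")."""
    return {"range": {"@timestamp": {"gte": f"now-{window}", "lte": "now"}}}


def cardinality_query(window, field="carrier.keyword"):
    """Aggregation query: distinct count of `field` within the time window."""
    return {
        "aggs": {"1": {"cardinality": {"field": field}}},
        "size": 0,
        "query": {"bool": {"filter": [time_filter(window)]}},
    }


def state_match_query(window, state):
    """Search query: documents in the time window, matching originState when possible."""
    return {
        "query": {
            "bool": {
                "filter": [time_filter(window)],
                "should": [{"match": {"originState": state}}],
            }
        }
    }


if __name__ == "__main__":
    for window in ("15m", "1h", "4h", "1d", "7d"):
        print(json.dumps(cardinality_query(window), separators=(",", ":")))
    print(json.dumps(state_match_query("15m", "California"), separators=(",", ":")))
```

Generating the bodies this way keeps a benchmark script from drifting between time ranges.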

Conclusion

OpenSearch Serverless not only supports a larger data size than prior releases but also introduces performance improvements like warm shard pre-fetch and concurrency optimization for better query response. These features reduce the latency of warm queries and improve auto-scaling to handle varied workloads. We encourage you to take advantage of the 30TB index support and put it to the test! Migrate your data, explore the improved throughput, and take advantage of the enhanced scaling capabilities.

To get started, refer to Log analytics the easy way with Amazon OpenSearch Serverless. To get hands-on experience with OpenSearch Serverless, follow the Getting started with Amazon OpenSearch Serverless workshop, which has a step-by-step guide for configuring and setting up an OpenSearch Serverless collection.

If you have feedback about this post, share it in the comments section. If you have questions about this post, start a new thread on the Amazon OpenSearch Service forum or contact AWS Support.


About the authors

Satish Nandi is a Senior Product Manager with Amazon OpenSearch Service. He is focused on OpenSearch Serverless and has years of experience in networking, security and AI/ML. He holds a Bachelor’s degree in Computer Science and an MBA in Entrepreneurship. In his free time, he likes to fly airplanes and hang gliders and ride his motorcycle.

Milav Shah is an Engineering Leader with Amazon OpenSearch Service. He focuses on search experience for OpenSearch customers. He has extensive experience building highly scalable solutions in databases, real-time streaming and distributed computing. He also possesses functional domain expertise in verticals like Internet of Things, fraud protection, gaming and AI/ML. In his free time, he likes to cycle, hike, and play chess.

Qiaoxuan Xue is a Senior Software Engineer at AWS leading the search and benchmarking areas of the Amazon OpenSearch Serverless Project. His passion lies in finding solutions for intricate challenges within large-scale distributed systems. Outside of work, he enjoys woodworking, biking, playing basketball, and spending time with his family and dog.

Prashant Agrawal is a Sr. Search Specialist Solutions Architect with Amazon OpenSearch Service. He works closely with customers to help them migrate their workloads to the cloud and helps existing customers fine-tune their clusters to achieve better performance and save on cost. Before joining AWS, he helped various customers use OpenSearch and Elasticsearch for their search and log analytics use cases. When not working, you can find him traveling and exploring new places. In short, he likes doing Eat → Travel → Repeat.

Achieve cross-Region resilience with Amazon OpenSearch Ingestion

Post Syndicated from Muthu Pitchaimani original https://aws.amazon.com/blogs/big-data/achieve-cross-region-resilience-with-amazon-opensearch-ingestion/

Cross-Region deployments provide increased resilience to maintain business continuity during outages, natural disasters, or other operational interruptions. Many large enterprises design and deploy special plans for readiness during such situations. They rely on solutions built with AWS services and features to improve their confidence and response times. Amazon OpenSearch Service is a managed service for OpenSearch, a search and analytics engine at scale. OpenSearch Service provides high availability within an AWS Region through its Multi-AZ deployment model and provides Regional resiliency with cross-cluster replication. Amazon OpenSearch Serverless is a deployment option that provides on-demand auto scaling, to which we continue to bring in many features.

With the existing cross-cluster replication feature in OpenSearch Service, you designate a domain as a leader and another as a follower, using an active-passive replication model. Although this model offers a way to continue operations during Regional impairment, it requires you to manually configure the follower. Additionally, after recovery, you need to reconfigure the leader-follower relationship between the domains.

In this post, we outline two solutions that provide cross-Region resiliency without needing to reestablish relationships during a failback, using an active-active replication model with Amazon OpenSearch Ingestion (OSI) and Amazon Simple Storage Service (Amazon S3). These solutions apply to both OpenSearch Service managed clusters and OpenSearch Serverless collections. We use OpenSearch Serverless as an example for the configurations in this post.

Solution overview

We outline two solutions in this post. In both options, data sources local to a Region write to an OpenSearch Ingestion (OSI) pipeline configured within the same Region. The solutions are extensible to multiple Regions, but we show two Regions as an example because Regional resiliency across two Regions is a popular deployment pattern for many large-scale enterprises.

You can use these solutions to address cross-Region resiliency needs for OpenSearch Serverless deployments and active-active replication needs for both serverless and provisioned options of OpenSearch Service, especially when the data sources produce disparate data in different Regions.

Prerequisites

Complete the following prerequisite steps:

  1. Deploy OpenSearch Service domains or OpenSearch Serverless collections in all the Regions where resiliency is needed.
  2. Create S3 buckets in each Region.
  3. Configure AWS Identity and Access Management (IAM) permissions needed for OSI. For instructions, refer to Amazon S3 as a source. Choose Amazon Simple Queue Service (Amazon SQS) as the method for processing data.

After you complete these steps, you can create two OSI pipelines, one in each Region, with the configurations detailed in the following sections.

Use OpenSearch Ingestion (OSI) for cross-Region writes

In this solution, OSI takes the data that is local to the Region it’s in and writes it to the other Region. To facilitate cross-Region writes and increase data durability, we use an S3 bucket in each Region. The OSI pipeline in the other Region reads this data and writes to the collection in its local Region. The OSI pipeline in the other Region follows a similar data flow.

While reading data, you have choices: Amazon SQS or Amazon S3 scans. For this post, we use Amazon SQS because it helps provide near real-time data delivery. This solution also facilitates writing directly to these local buckets in the case of pull-based OSI data sources. Refer to Source under Key concepts to understand the different types of sources that OSI uses.

The following diagram shows the flow of data.

The data flow consists of the following steps:

  1. Data sources local to a Region write their data to the OSI pipeline in their Region. (This solution also supports sources directly writing to Amazon S3.)
  2. OSI writes this data to the collection in its own Region and then to the S3 bucket in the other Region.
  3. OSI reads the other Region data from the local S3 bucket and writes it to the local collection.
  4. Collections in both Regions now contain the same data.

The following snippets show the configuration for the two pipelines.

#pipeline config for cross region writes
version: "2"
write-pipeline:
  source:
    http:
      path: "/logs"
  processor:
    - parse_json:
  sink:
    # First sink to same region collection
    - opensearch:
        hosts: [ "https://abcdefghijklmn.us-east-1.aoss.amazonaws.com" ]
        aws:
          sts_role_arn: "arn:aws:iam::1234567890:role/pipeline-role"
          region: "us-east-1"
          serverless: true
        index: "cross-region-index"
    - s3:
        # Second sink to cross region S3 bucket
        aws:
          sts_role_arn: "arn:aws:iam::1234567890:role/pipeline-role"
          region: "us-east-2"
        bucket: "osi-cross-region-bucket"
        object_key:
          path_prefix: "osi-crw/%{yyyy}/%{MM}/%{dd}/%{HH}"
        threshold:
          event_collect_timeout: 60s
        codec:
          ndjson:
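The path_prefix in the S3 sink above partitions objects by hour. The following sketch shows how such a prefix expands for a given timestamp. It assumes the %{...} placeholders behave like Java date-time pattern letters (yyyy, MM, dd, HH) evaluated in UTC, and it handles only those four tokens; the function name is hypothetical.

```python
from datetime import datetime, timezone

# Assumed mapping from the pattern letters used in this post to strftime directives.
_TOKENS = {"yyyy": "%Y", "MM": "%m", "dd": "%d", "HH": "%H"}


def expand_prefix(pattern, when):
    """Replace each %{token} placeholder in the pattern with the formatted value for `when`."""
    expanded = pattern
    for token, directive in _TOKENS.items():
        expanded = expanded.replace("%{" + token + "}", when.strftime(directive))
    return expanded


if __name__ == "__main__":
    ts = datetime(2024, 9, 23, 7, 5, tzinfo=timezone.utc)
    print(expand_prefix("osi-crw/%{yyyy}/%{MM}/%{dd}/%{HH}", ts))
```

Hourly prefixes keep each object listing small and make it easy to locate the data written during a specific window.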

The code for the pipeline that reads from the local S3 bucket is as follows:

#pipeline config to read data from local S3 bucket
version: "2"
read-write-pipeline:
  source:
    s3:
      # S3 source with SQS 
      acknowledgments: true
      notification_type: "sqs"
      compression: "none"
      codec:
        newline:
      sqs:
        queue_url: "https://sqs.us-east-1.amazonaws.com/1234567890/my-osi-cross-region-write-q"
        maximum_messages: 10
        visibility_timeout: "60s"
        visibility_duplication_protection: true
      aws:
        region: "us-east-1"
        sts_role_arn: "arn:aws:iam::1234567890:role/pipeline-role"
  processor:
    - parse_json:
  route:
  # Routing uses the s3 keys to ensure OSI writes data only once to local region 
    - local-region-write: "contains(/s3/key, \"osi-local-region-write\")"
    - cross-region-write: "contains(/s3/key, \"osi-cross-region-write\")"
  sink:
    - pipeline:
        name: "local-region-write-cross-region-write-pipeline"
    - pipeline:
        name: "local-region-write-pipeline"
        routes:
        - local-region-write
local-region-write-cross-region-write-pipeline:
  # Read S3 bucket with cross-region-write
  source:
    pipeline: 
      name: "read-write-pipeline"
  sink:
   # Sink to local-region managed OpenSearch service 
    - opensearch:
        hosts: [ "https://abcdefghijklmn.us-east-1.aoss.amazonaws.com" ]
        aws:
          sts_role_arn: "arn:aws:iam::1234567890:role/pipeline-role"
          region: "us-east-1"
          serverless: true
        index: "cross-region-index"
local-region-write-pipeline:
  # Read local-region write  
  source:
    pipeline: 
      name: "read-write-pipeline"
  processor:
    - delete_entries:
        with_keys: ["s3"]
  sink:
     # Sink to cross-region S3 bucket 
    - s3:
        aws:
          sts_role_arn: "arn:aws:iam::1234567890:role/pipeline-role"
          region: "us-east-2"
        bucket: "osi-cross-region-write-bucket"
        object_key:
          path_prefix: "osi-cross-region-write/%{yyyy}/%{MM}/%{dd}/%{HH}"
        threshold:
          event_collect_timeout: "60s"
        codec:
          ndjson:

To separate management and operations, we use two prefixes, osi-local-region-write and osi-cross-region-write, for buckets in both Regions. OSI uses these prefixes to copy only local Region data to the other Region. OSI also creates the keys s3.bucket and s3.key to decorate documents written to a collection. We remove this decoration while writing across Regions; it will be added back by the pipeline in the other Region.
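The routing behavior described here can be sketched in a few lines. This is an illustrative model of the pipeline configuration above, not OSI code: every object read from the local bucket is indexed into the local collection, and only objects under the osi-local-region-write prefix are copied to the other Region, with the s3 decoration removed first. The destination labels and function name are hypothetical.

```python
LOCAL_PREFIX = "osi-local-region-write"
CROSS_PREFIX = "osi-cross-region-write"


def route(event):
    """Return (destination, document) pairs for one event read from the local bucket."""
    key = event["s3"]["key"]
    # The first sink has no route condition, so every event reaches the local collection.
    deliveries = [("local-collection", event)]
    if LOCAL_PREFIX in key:
        # Only locally produced data is forwarded, without the s3 decoration;
        # the pipeline in the other Region adds its own decoration when it reads the object.
        stripped = {k: v for k, v in event.items() if k != "s3"}
        deliveries.append(("cross-region-bucket", stripped))
    return deliveries


if __name__ == "__main__":
    local = {"message": "a", "s3": {"bucket": "b", "key": f"{LOCAL_PREFIX}/2024/09/23/07/x.ndjson"}}
    remote = {"message": "b", "s3": {"bucket": "b", "key": f"{CROSS_PREFIX}/2024/09/23/07/y.ndjson"}}
    print(route(local))
    print(route(remote))
```

Because forwarded data lands under the osi-cross-region-write prefix in the other Region, that Region's pipeline indexes it but does not send it back, which avoids a replication loop.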

This solution provides near real-time data delivery across Regions, and the same data is available across both Regions. However, although OpenSearch Service contains the same data, the buckets in each Region contain only partial data. The following solution addresses this.

Use Amazon S3 for cross-Region writes

In this solution, we use the Amazon S3 Cross-Region Replication feature. This solution supports all the data sources available with OSI. OSI again uses two pipelines, but the key difference is that OSI writes the data to Amazon S3 first. After you complete the steps that are common to both solutions, refer to Examples for configuring live replication for instructions to configure Amazon S3 cross-Region replication. The following diagram shows the flow of data.

The data flow consists of the following steps:

  1. Data sources local to a Region write their data to OSI. (This solution also supports sources directly writing to Amazon S3.)
  2. This data is first written to the S3 bucket.
  3. OSI reads this data and writes to the collection local to the Region.
  4. Amazon S3 replicates cross-Region data and OSI reads and writes this data to the collection.

The following snippets show the configuration for both pipelines.

version: "2"
s3-write-pipeline:
  source:
    http:
      path: "/logs"
  processor:
    - parse_json:
  sink:
    # Write to S3 bucket that has cross region replication enabled
    - s3:
        aws:
          sts_role_arn: "arn:aws:iam::1234567890:role/pipeline-role"
          region: "us-east-2"
        bucket: "s3-cross-region-bucket"
        object_key:
          path_prefix: "pushedlogs/%{yyyy}/%{MM}/%{dd}/%{HH}"
        threshold:
          event_collect_timeout: 60s
          event_count: 2
        codec:
          ndjson:

The code for the read pipeline is as follows:

version: "2"
s3-read-pipeline:
  source:
    s3:
      acknowledgments: true
      notification_type: "sqs"
      compression: "none"
      codec:
        newline:
      # Configure SQS to notify OSI pipeline
      sqs:
        queue_url: "https://sqs.us-east-2.amazonaws.com/1234567890/my-s3-crr-q"
        maximum_messages: 10
        visibility_timeout: "15s"
        visibility_duplication_protection: true
      aws:
        region: "us-east-2"
        sts_role_arn: "arn:aws:iam::1234567890:role/pipeline-role"
  processor:
    - parse_json:
  # Configure OSI sink to move the files from S3 to OpenSearch Serverless
  sink:
    - opensearch:
        hosts: [ "https://abcdefghijklmn.us-east-1.aoss.amazonaws.com" ]
        aws:
          # Role must have access to S3 OpenSearch Pipeline and OpenSearch Serverless
          sts_role_arn: "arn:aws:iam::1234567890:role/pipeline-role"
          region: "us-east-1"
          serverless: true
        index: "cross-region-index"

The configuration for this solution is simpler and relies on Amazon S3 cross-Region replication. This solution makes sure that the data in the S3 bucket and OpenSearch Serverless collection are the same in both Regions.

For more information about the SLA for this replication and metrics that are available to monitor the replication process, refer to S3 Replication Update: Replication SLA, Metrics, and Events.

Impairment scenarios and additional considerations

Let’s consider a Regional impairment scenario. For this use case, we assume that your application is powered by an OpenSearch Serverless collection as a backend. When a Region is impaired, these applications can fail over to the OpenSearch Serverless collection in the other Region and continue operations without interruption, because the entirety of the data present before the impairment is available in both collections.

When the Region impairment is resolved, you can fail back to the OpenSearch Serverless collection in that Region either immediately or after you allow some time for the missing data to be backfilled in that Region. The operations can then continue without interruption.

You can automate these failover and failback operations to provide a seamless user experience. This automation is not in scope of this post, but will be covered in a future post.
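As a rough illustration of the decision at the heart of such automation, the following sketch picks the first healthy endpoint from a priority-ordered list. It is not a complete failover solution: the function name is hypothetical, the health check is injected by the caller, and real automation would also need to consider backfill progress before failing back.

```python
def pick_endpoint(endpoints, is_healthy):
    """Return the first endpoint, in priority order, that passes the health check."""
    for endpoint in endpoints:
        if is_healthy(endpoint):
            return endpoint
    raise RuntimeError("no healthy endpoint available")


if __name__ == "__main__":
    # Hypothetical collection endpoints; the primary Region is listed first.
    endpoints = [
        "https://primary.us-east-1.aoss.amazonaws.com",
        "https://secondary.us-east-2.aoss.amazonaws.com",
    ]
    impaired = {endpoints[0]}
    print(pick_endpoint(endpoints, lambda e: e not in impaired))
```

Listing the primary Region first means the same function fails back automatically once the primary passes its health check again.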

The existing cross-cluster replication solution requires you to manually reestablish the leader-follower relationship and restart replication from the beginning after you recover from an impairment. In contrast, the solutions discussed here automatically resume replication from the point where it last left off. If only the OpenSearch Service resource (a collection or domain) fails, the data is still available in the local buckets and is backfilled as soon as the collection or domain becomes available.

You can effectively use these solutions in an active-passive replication model as well. In those scenarios, it’s sufficient to have a minimal set of resources in the replication Region, such as a single S3 bucket. You can modify this solution to solve different scenarios using additional services like Amazon Managed Streaming for Apache Kafka (Amazon MSK), which has a built-in replication feature.

When building cross-Region solutions, consider cross-Region data transfer costs for AWS. As a best practice, consider adding a dead-letter queue to all your production pipelines.

Conclusion

In this post, we outlined two solutions that achieve Regional resiliency for OpenSearch Serverless and OpenSearch Service managed clusters. If you need explicit control over writing data across Regions, use solution one. In our experiments with a few KBs of data, the majority of writes between the two chosen Regions completed within a second. Choose solution two if you need the simplicity it offers. In our experiments, replication completed in a few seconds; 99.99% of objects will be replicated within 15 minutes. These solutions also serve as an architecture for an active-active replication model in OpenSearch Service using OpenSearch Ingestion.

You can also use OSI as a mechanism to search for data available within other AWS services, like Amazon S3, Amazon DynamoDB, and Amazon DocumentDB (with MongoDB compatibility). For more details, see Working with Amazon OpenSearch Ingestion pipeline integrations.


About the Authors

Muthu Pitchaimani is a Search Specialist with Amazon OpenSearch Service. He builds large-scale search applications and solutions. Muthu is interested in the topics of networking and security, and is based out of Austin, Texas.

Aruna Govindaraju is an Amazon OpenSearch Specialist Solutions Architect and has worked with many commercial and open source search engines. She is passionate about search, relevancy, and user experience. Her expertise with correlating end-user signals with search engine behavior has helped many customers improve their search experience.

AWS Weekly Roundup: Amazon EC2 X8g Instances, Amazon Q generative SQL for Amazon Redshift, AWS SDK for Swift, and more (Sep 23, 2024)

Post Syndicated from Abhishek Gupta original https://aws.amazon.com/blogs/aws/aws-weekly-roundup-amazon-ec2-x8g-instances-amazon-q-generative-sql-for-amazon-redshift-aws-sdk-for-swift-and-more-sep-23-2024/

AWS Community Days have been in full swing around the world. I am going to put the spotlight on AWS Community Day Argentina where Jeff Barr delivered the keynote, talks and shared his nuggets of wisdom with the community, including a fun story of how he once followed Bill Gates to a McDonald’s!

I encourage you to read about his experience.

Last week’s launches
Here are the launches that got my attention, starting off with the GA releases.

Amazon EC2 X8g Instances are now generally available – X8g instances are powered by AWS Graviton4 processors and deliver up to 60% better performance than AWS Graviton2-based Amazon EC2 X2gd instances. These instances offer larger sizes with up to 3x more vCPU (up to 48xlarge) and memory (up to 3 TiB) than Graviton2-based X2gd instances.

Amazon Q generative SQL for Amazon Redshift is now generally available – Amazon Q generative SQL in Amazon Redshift Query Editor is an out-of-the-box web-based SQL editor for Amazon Redshift. It uses generative AI to analyze user intent, query patterns, and schema metadata to identify common SQL query patterns directly within Amazon Redshift, accelerating the query authoring process for users and reducing the time required to derive actionable data insights.

AWS SDK for Swift is now generally available – AWS SDK for Swift provides a modern, user-friendly, and native Swift interface for accessing Amazon Web Services from Apple platforms, AWS Lambda, and Linux-based Swift on Server applications. Now that it’s GA, customers can use AWS SDK for Swift for production workloads. Learn more in the AWS SDK for Swift Developer Guide.

AWS Amplify now supports long-running tasks with asynchronous server-side function calls – Developers can use AWS Amplify to invoke Lambda functions asynchronously for operations like generative AI model inferences, batch processing jobs, or message queuing without blocking the GraphQL API response. This improves responsiveness and scalability, especially for scenarios where immediate responses are not required or where long-running tasks need to be offloaded.

Amazon Keyspaces (for Apache Cassandra) now supports add-column for multi-Region tables – With this launch, you can modify the schema of your existing multi-Region tables in Amazon Keyspaces (for Apache Cassandra) to add new columns. You only have to modify the schema in one of its replica Regions and Keyspaces will replicate the new schema to the other Regions where the table exists.

Amazon Corretto 23 is now generally available – Amazon Corretto is a no-cost, multi-platform, production-ready distribution of OpenJDK. Corretto 23 is an OpenJDK 23 Feature Release that includes an updated Vector API, expanded pattern matching and switch expressions, and more. It will be supported through April 2025.

Use OR1 instances for existing Amazon OpenSearch Service domains – With OpenSearch 2.15, you can leverage OR1 instances for your existing Amazon OpenSearch Service domains by simply updating your existing domain configuration, and choosing OR1 instances for data nodes. This will seamlessly move domains running OpenSearch 2.15 to OR1 instances using a blue/green deployment.

Amazon S3 Express One Zone now supports AWS KMS with customer managed keys – By default, S3 Express One Zone encrypts all objects with server-side encryption using S3 managed keys (SSE-S3). With S3 Express One Zone support for customer managed keys, you have more options to encrypt and manage the security of your data. S3 Bucket Keys are always enabled when you use SSE-KMS with S3 Express One Zone, at no additional cost.

Use AWS Chatbot to interact with Amazon Bedrock agents from Microsoft Teams and Slack – Before, customers had to develop custom chat applications in Microsoft Teams or Slack and integrate them with Amazon Bedrock agents. Now they can invoke their Amazon Bedrock agents from chat channels by connecting the agent alias with an AWS Chatbot channel configuration.

AWS CodeBuild support for managed GitLab runners – Customers can configure their AWS CodeBuild projects to receive GitLab CI/CD job events and run them on ephemeral hosts. This feature allows GitLab jobs to integrate natively with AWS, providing security and convenience through features such as IAM, AWS Secrets Manager, AWS CloudTrail, and Amazon VPC.

We launched existing services in additional Regions:

Other AWS news
Here are some additional projects, blog posts, and news items that you might find interesting:

Secure Cross-Cluster Communication in EKS – It demonstrates how you can use Amazon VPC Lattice and Pod Identity to secure cross-EKS-cluster application communication, along with an example that you can use as a reference to adapt to your own microservices applications.

Improve RAG performance using Cohere Rerank – This post focuses on improving search efficiency and accuracy in RAG systems using Cohere Rerank.

AWS open source news and updates – My colleague Ricardo Sueiras writes about open source projects, tools, and events from the AWS Community; check out Ricardo’s page for the latest updates.

Upcoming AWS events
Check your calendars and sign up for upcoming AWS events:

AWS Community Days – Join community-led conferences that feature technical discussions, workshops, and hands-on labs led by expert AWS users and industry leaders from around the world. Upcoming AWS Community Days are in Italy (Sep. 27), Taiwan (Sep. 28), Saudi Arabia (Sep. 28), Netherlands (Oct. 3), and Romania (Oct. 5).

Browse all upcoming AWS-led in-person and virtual events and developer-focused events.

That’s all for this week. Check back next Monday for another Weekly Roundup!

— Abhishek

This post is part of our Weekly Roundup series. Check back each week for a quick roundup of interesting news and announcements from AWS!

AWS Weekly Roundup: Oracle Database@AWS, Amazon RDS, AWS PrivateLink, Amazon MSK, Amazon EventBridge, Amazon SageMaker and more

Post Syndicated from Matheus Guimaraes original https://aws.amazon.com/blogs/aws/aws-weekly-roundup-oracle-databaseaws-amazon-rds-aws-privatelink-amazon-msk-amazon-eventbridge-amazon-sagemaker-and-more/

Hello, everyone!

It’s been an interesting week full of AWS news as usual, but also full of vibrant faces filling up the rooms in a variety of events happening this month.

Let’s start by covering some of the releases that have caught my attention this week.

My Top 3 AWS news of the week

Amazon RDS for MySQL zero-ETL integrations is now generally available and it comes with exciting new features. You are now able to configure zero-ETL integrations in your AWS CloudFormation templates, and you also now have the ability to set up multiple integrations from a source Amazon RDS for MySQL database with up to five Amazon Redshift warehouses. Lastly, you can now also apply data filters which determine which database and tables get automatically replicated. Read this blog post where I review aspects of this release and show you how to get started with data filtering if you want to know more. Incidentally, this release pairs well with another release this week: Amazon Redshift now allows you to alter the sort keys of tables replicated via zero-ETL integrations.

Oracle Database@AWS has been announced as part of a strategic partnership between Amazon Web Services (AWS) and Oracle. This offering allows customers to access Oracle Autonomous Database and Oracle Exadata Database Service directly within AWS, simplifying cloud migration for enterprise workloads. Key features include zero-ETL integration between Oracle and AWS services for real-time data analysis, enhanced security, and optimized performance for hybrid cloud environments. This collaboration addresses the growing demand for multi-cloud flexibility and efficiency. It will be available in preview later in the year with broader availability in 2025 as it expands to new Regions.

Amazon OpenSearch Service now supports version 2.15, featuring improvements in search performance, query optimization, and AI-powered application capabilities. Key updates include radial search for vector space queries, optimizations for neural sparse and hybrid search, and the ability to enable vector and hybrid search on existing indexes. Additionally, it also introduces new features like a toxicity detection guardrail and an ML inference processor for enriching ingest pipelines. Read this guide to see how you can upgrade your Amazon OpenSearch Service domain.

So simple yet so good
These releases are simple in nature, but have a big impact.

AWS Resource Access Manager (RAM) now supports AWS PrivateLink – With this release, you can now securely share resources across AWS accounts with private connectivity, without exposing traffic to the public internet. This integration allows for more secure and streamlined access to shared services via VPC endpoints, improving network security and simplifying resource sharing across organizations.

AWS Network Firewall now supports AWS PrivateLink – another security quick-win, you can now securely access and manage Network Firewall resources without exposing traffic to the public internet.

AWS IAM Identity Center now enables users to customize their experience – You can set the language and visual mode preferences, including dark mode for improved readability and reduced eye strain. This update supports 12 different languages and enables users to adjust their settings for a more personalized experience when accessing AWS resources through the portal.

Others
Amazon EventBridge Pipes now supports customer managed KMS keys – Amazon EventBridge Pipes now supports customer managed keys for server-side encryption. This update allows customers to use their own AWS Key Management Service (KMS) keys to encrypt data when transferring between sources and targets, offering more control and security over sensitive event data. The feature enhances security for point-to-point integrations without the need for custom integration code. See instructions on how to configure this in the updated documentation.

AWS Glue Data Catalog now supports enhanced storage optimization for Apache Iceberg tables – This includes automatic removal of unnecessary data files, orphan file management, and snapshot retention. These optimizations help reduce storage costs and improve query performance by continuously monitoring and compacting tables, making it easier to manage large-scale datasets stored in Amazon S3. See this Big Data blog post for a deep dive into this new feature.

Amazon MSK Replicator now supports the replication of Kafka topics across clusters while preserving identical topic names – This simplifies cross-cluster replication processes, allowing users to replicate data across Regions without needing to reconfigure client applications. This reduces setup complexity and enhances support for more seamless failovers in multi-cluster streaming architectures. See this Amazon MSK Replicator developer guide to learn more about it.

Amazon SageMaker introduces sticky session routing for inference – This allows requests from the same client to be directed to the same model instance for the duration of a session, improving consistency and reducing latency, particularly in real-time inference scenarios like chatbots or recommendation systems, where session-based interactions are crucial. Read about how to configure it in this documentation guide.

Events
The AWS GenAI Lofts continue to pop up around the world! This week, developers in San Francisco had the opportunity to attend two very exciting events at the AWS GenAI Loft in San Francisco, including the “Generative AI on AWS” meetup last Tuesday, featuring discussions about extended reality, future AI tools, and more. Then things got playful on Thursday with the demonstration of an Amazon Bedrock-powered Minecraft bot and AI video game battles! If you’re around San Francisco before October 19th, make sure to check out the schedule to see the list of events that you can join.

AWS GenAI Loft San Francisco talk

Make sure to check out the AWS GenAI Loft in Sao Paulo, Brazil, which opened recently, and the AWS GenAI Loft in London, which opens September 30th. You can already start registering for events before they fill up including one called “The future of development” that offers a whole day of targeted learning for developers to help them accelerate their skills.

Our AWS communities have also been very busy throwing incredible events! I was privileged to be a speaker at AWS Community Day Belfast where I got to finally meet all of the organizers of this amazing thriving community in Northern Ireland. If you haven’t been to a community day, I really recommend you check them out! You are sure to leave energized by the dedication and passion from community leaders like Matt Coulter, Kristi Perreault, Matthew Wilson, Chloe McAteer, and their community members – not to mention the smiles all around. 🙂

AWS Community Belfast organizers and codingmatheus

Certifications
If you’ve been postponing taking an AWS certification exam, now is the perfect time! Register free for the AWS Certified: Associate Challenge before December 12, 2024 and get a 50% discount voucher to take any of the following exams: AWS Certified Solutions Architect – Associate, AWS Certified Developer – Associate, AWS Certified SysOps Administrator – Associate, or AWS Certified Data Engineer – Associate. My colleague Jenna Seybold has posted a collection of study material for each exam; check it out if you’re interested.

Also, don’t forget that the brand new AWS Certified AI Practitioner exam is now available. It is in beta stage, but you can already take it. If you pass it before February 15, 2025, you get an Early Adopter badge to add to your collection.

Conclusion
I hope you enjoyed the news this week!

Keep learning!

Amazon OpenSearch Service: Managed and community driven

Post Syndicated from Jon Handler original https://aws.amazon.com/blogs/big-data/amazon-opensearch-service-managed-and-community-driven/

I’ve always loved the problem of search. At its core, search is about receiving a question, understanding that question, and then retrieving the best answer for it. A long time ago, I did an AI robotics project for my PhD that married a library of plan fragments to a real-world situation, through search. I’ve worked on and built a commercial search engine from the ground up in a prior job. And in my career at AWS, I’ve worked as a solutions architect, helping our customers adopt our search services in all their incarnations.

Like many developers, I share a passion for open source. This stems partly from my academic background, where scholars work for the greater good, building upon and benefiting from previous achievements in their fields. I’ve used and contributed to numerous open source technologies, ranging from small projects with a single purpose to large-scale initiatives with passionate, engaged communities. The search community has its own special, academic flavor, because search itself is related to long-standing academic endeavors like information retrieval, psychology, and (symbolic) AI. Open source software has played a prominent role in this community. Search technology has been democratized, especially over the past 10–15 years, through open source projects like Apache Lucene, Apache Solr, the Apache License, 2.0 version of Elasticsearch, and OpenSearch.

It’s that context that makes me so excited that today the Linux Foundation announced the OpenSearch Software Foundation. As part of the creation of the OpenSearch Foundation, AWS has transferred ownership of OpenSearch to the Linux Foundation. At the launch of the project in April of 2021, in introducing OpenSearch, we spoke of our desire to “ensure users continue to have a secure, high-quality, fully open source search and analytics suite with a rich roadmap of new and innovative functionality.” We’ve maintained that desire and commitment, and with this transfer, are deepening that commitment, and bringing in the broader community with open governance to help with that goal.

There are two key points regarding this announcement: first, nothing is changing if you’re a customer of Amazon OpenSearch Service; second, a lot is changing on the open source side, and that’s a net benefit for the service. We’re moving into a future that includes an acceleration in innovation for the OpenSearch Project, driven by deeper collaboration and participation with the community. Ultimately, that’s going to come to the service and benefit our AWS customers.

Amazon OpenSearch Service: How we’ve worked

Amazon’s focus from the beginning was to work on OpenSearch in the open. Our first task was to release a working code base with code import and renaming capabilities. We launched OpenSearch 1.0 in July 2021, followed by renaming our managed service to Amazon OpenSearch Service in September 2021. With the launch of Amazon OpenSearch Service, we announced support for OpenSearch 1.0 as an engine choice.

As our team at Amazon and the community grew and innovated in the OpenSearch Project, we brought those changes to Amazon OpenSearch Service along with support for the corresponding versions. At AWS, we embraced open source by jointly publishing and discussing ideas, RFCs, and feature requests with the community. As time passed and the project progressed, we onboarded community maintainers and accepted contributions from various sources within and outside AWS.

As an Amazon OpenSearch Service customer, you’ll continue to see updates and new versions flowing from open source to our managed service. You’ll also experience ongoing innovation driven by our investment in growing the project, its community, and code base.

Today the OpenSearch project has significant momentum, with more than 700 million software downloads and participation from thousands of contributors and more than 200 project maintainers. The OpenSearch Software Foundation launches with support from premier members AWS, SAP, and Uber and general members Aiven, Aryn, Atlassian, Canonical, Digital Ocean, Eliatra, Graylog, NetApp® Instaclustr, and Portal26.

Amazon OpenSearch Service: Going forward

This announcement doesn’t change anything for Amazon OpenSearch Service. Amazon remains committed to innovating for and contributing to the OpenSearch Project, with a growing number of committers and maintainers. If anything, this innovation will accelerate with broader and deeper participation bringing more diverse ideas from the global community. At the core of this commitment is our founding and continuing desire to “ensure users continue to have a secure, high-quality, fully open source search and analytics suite with a rich roadmap of new and innovative functionality.” We plan to continue closely working with the project, contributing code improvements and bringing those improvements to our managed service.

This announcement doesn’t change how you connect with or use Amazon OpenSearch Service. OpenSearch Service will continue to be a fully managed service, providing OpenSearch and OpenSearch Dashboards at service-provided endpoints, and with the full suite of existing managed-service features. If you’re using Amazon OpenSearch Service, you won’t need to change anything. There won’t be any licensing changes or cost changes driven by the move to a foundation.

Amazon will continue bringing its expertise to the project, funding new innovations where our customers need them the most, such as cloud-native large scale distributed systems, search, analytics, machine learning and AI. The Linux Foundation will also facilitate collaboration with other open source organizations such as Cloud Native Computing Foundation (CNCF), which is instrumental for cloud-native, open source projects. Our goal will remain to solve some of the most challenging customer problems, open source first. Finally, given the open source nature of the product we think there’s a big opportunity and are excited to partner with our customers to solve their problems together, in code.

We’ve always encouraged our customers to participate in the OpenSearch Project. Now, the project has a well-defined structure and management, with a governing board and a technical steering committee, each staffed with members from diverse backgrounds, both in and out of Amazon. The governing board will look after the project’s funding and management, and the technical steering committee will take care of the technical direction of the project. This opens the door wider for you to directly participate in shaping the technology you’re using in our managed service. If you’re an Amazon OpenSearch Service customer, the project welcomes your contributions, big or small, from filing issues and feature requests to commenting on RFCs and contributing code.

Conclusion

This is an exciting time, for the project, for the community, and for Amazon OpenSearch Service. As an AWS customer, you don’t need to make any changes in use, and there aren’t any changes in the Apache License, 2.0 or the pricing. But, moving to the Linux Foundation will help bring the spirit of cooperation from the open source world to the technology and from there to Amazon OpenSearch Service. As search continues to mature, together we’ll continue to get better at understanding questions, and providing relevant results.

You can read more about the OpenSearch Foundation announcement on the AWS Open Source blog.


About the author

Jon Handler is the Director of Solutions Architecture for Search Services at Amazon Web Services, based in Palo Alto, CA. Jon works closely with OpenSearch and Amazon OpenSearch Service, providing help and guidance to a broad range of customers who have search and log analytics workloads for OpenSearch. Prior to joining AWS, Jon’s career as a software developer included four years of coding a large-scale, eCommerce search engine. Jon holds a Bachelor of the Arts from the University of Pennsylvania, and a Master of Science and a Ph.D. in Computer Science and Artificial Intelligence from Northwestern University.

Differentiate generative AI applications with your data using AWS analytics and managed databases

Post Syndicated from Diego Colombatto original https://aws.amazon.com/blogs/big-data/differentiate-generative-ai-applications-with-your-data-using-aws-analytics-and-managed-databases/

While the potential of generative artificial intelligence (AI) is increasingly under evaluation, organizations are at different stages in defining their generative AI vision. In many organizations, the focus is on large language models (LLMs), and foundation models (FMs) more broadly. This is just the tip of the iceberg, because what enables you to obtain differential value from generative AI is your data.

Generative AI applications are still applications, so you need the following:

  • Operational databases to support the user experience for interaction steps outside of invoking generative AI models
  • Data lakes to store your domain-specific data, and analytics to explore them and understand how to use them in generative AI
  • Data integrations and pipelines to manage (sourcing, transforming, enriching, and validating, among others) and render data usable with generative AI
  • Governance to manage aspects such as data quality, privacy and compliance with applicable privacy laws, and security and access controls

LLMs and other FMs are trained on a generally available collective body of knowledge. If you use them as is, they’re going to provide generic answers with no differential value for your company. However, if you use generative AI with your domain-specific data, it can provide a valuable perspective for your business and enable you to build differentiated generative AI applications and products that will stand out from others. In essence, you have to enrich the generative AI models with your differentiated data.

On the importance of company data for generative AI, McKinsey stated that “If your data isn’t ready for generative AI, your business isn’t ready for generative AI.”

In this post, we present a framework to implement generative AI applications enriched and differentiated with your data. We also share a reusable, modular, and extendible asset to quickly get started with adopting the framework and implementing your generative AI application. This asset is designed to augment catalog search engine capabilities with generative AI, improving the end-user experience.

You can extend the solution in directions such as the business intelligence (BI) domain with customer 360 use cases, and the risk and compliance domain with transaction monitoring and fraud detection use cases.

Solution overview

There are three key data elements (or context elements) you can use to differentiate the generative AI responses:

  • Behavioral context – How do you want the LLM to behave? Which persona should the FM impersonate? We call this behavioral context. You can provide these instructions to the model through prompt templates.
  • Situational context – Is the user request part of an ongoing conversation? Do you have any conversation history and states? We call this situational context. Also, who is the user? What do you know about the user and their request? This data is derived from your purpose-built data stores and previous interactions.
  • Semantic context – Is there any meaningfully relevant data that would help the FMs generate the response? We call this semantic context. This is typically obtained from vector stores and searches. For example, if you’re using a search engine to find products in a product catalog, you could store product details, encoded into vectors, into a vector store. This will enable you to run different kinds of searches.

Using these three context elements together is more likely to provide a coherent, accurate answer than relying purely on a generally available FM.
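To make the three context elements concrete, the following is a minimal Python sketch of how an application might assemble them into a single prompt. It is illustrative only: the template wording, the Context class, and the build_prompt function are hypothetical names introduced here, not part of the asset described in this post.

```python
from dataclasses import dataclass, field
from string import Template

# Behavioral context: a hypothetical prompt template; the wording is illustrative.
PROMPT_TEMPLATE = Template(
    "You are $persona.\n"
    "Conversation so far:\n$history\n"
    "Relevant catalog data:\n$documents\n"
    "User request: $request"
)


@dataclass
class Context:
    persona: str                                         # behavioral context
    history: list[str] = field(default_factory=list)     # situational context
    documents: list[str] = field(default_factory=list)   # semantic context


def build_prompt(ctx: Context, request: str) -> str:
    """Merge the three context elements and the user request into one prompt."""
    history = "\n".join(ctx.history) or "(none)"
    documents = "\n".join(f"- {d}" for d in ctx.documents) or "(none)"
    return PROMPT_TEMPLATE.substitute(
        persona=ctx.persona,
        history=history,
        documents=documents,
        request=request,
    )
```

Any of the three elements can be left empty, which mirrors the idea of using all context elements or only a subset of them.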

There are different approaches to design this type of solution; one method is to use generative AI with up-to-date, context-specific data by supplementing the in-context learning pattern using Retrieval Augmented Generation (RAG) derived data, as shown in the following figure. A second approach is to use your fine-tuned or custom-built generative AI model with up-to-date, context-specific data.

The framework used in this post enables you to build a solution with or without fine-tuned FMs and using all three context elements, or a subset of these context elements, using the first approach. The following figure illustrates the functional architecture.

Technical architecture

When implementing an architecture like that illustrated in the previous section, there are some key aspects to consider. The primary aspect is that, when the application receives the user input, it should process it and provide a response to the user as quickly as possible, with minimal response latency. This part of the application should also use data stores that can handle the throughput in terms of concurrent end-users and their activity. This means predominantly using transactional and operational databases.

Depending on the goals of your use case, you might store prompt templates separately in Amazon Simple Storage Service (Amazon S3) or in a database, if you want to apply different prompts for different usage conditions. Alternatively, you might treat them as code and use source code control to manage their evolution over time.

NoSQL databases like Amazon DynamoDB, Amazon DocumentDB (with MongoDB compatibility), and Amazon MemoryDB can provide low read latencies and are well suited to handle your conversation state and history (situational context). The document and key value data models allow you the flexibility to adjust the schema of the conversation state over time.
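As an illustration of a flexible conversation-state schema, the following Python sketch shapes one conversation turn as a key-value item and rebuilds recent history from stored items. The attribute names (session_id, turn_ts, expires_at) and both helper functions are assumptions made for this example; the post does not prescribe a schema. In a real application, an item like this could be written with an SDK call such as the boto3 Table put_item operation.

```python
import time


def make_turn_item(session_id, role, text, ttl_days=30, now=None):
    """Build one conversation turn as a plain dict (hypothetical schema)."""
    ts = time.time() if now is None else now
    return {
        "session_id": session_id,      # assumed partition key
        "turn_ts": int(ts * 1000),     # assumed sort key, in milliseconds
        "role": role,                  # for example "user" or "assistant"
        "text": text,
        # Epoch seconds; an attribute like this could drive a time-to-live setting.
        "expires_at": int(ts) + ttl_days * 86400,
    }


def recent_history(items, limit=5):
    """Return the last `limit` turns, oldest first, as 'role: text' strings."""
    ordered = sorted(items, key=lambda item: item["turn_ts"])
    return [f'{item["role"]}: {item["text"]}' for item in ordered[-limit:]]
```

Because each turn is a self-describing item, new attributes (for example, a detected intent) can be added later without migrating older items.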

User profiles or other user information (situational context) can come from a variety of database sources. You can store that data in relational databases like Amazon Aurora, NoSQL databases, or graph databases like Amazon Neptune.

The semantic context originates from vector data stores or machine learning (ML) search services. Amazon Aurora PostgreSQL-Compatible Edition with pgvector and Amazon OpenSearch Service are great options if you want to interact with vectors directly. Amazon Kendra, our ML-based search engine, is a great fit if you want the benefits of semantic search without explicitly maintaining vectors yourself or tuning the similarity algorithms to be used.

Amazon Bedrock is a fully managed service that makes high-performing FMs from leading AI startups and Amazon available through a unified API. You can choose from a wide range of FMs to find the model that is best suited for your use case. Amazon Bedrock also offers a broad set of capabilities to build generative AI applications with security, privacy, and responsible AI. Amazon Bedrock provides integrations with both Aurora and OpenSearch Service, so you don’t have to explicitly query the vector data store yourself.

The following figure summarizes the AWS services available to support the solution framework described so far.

Catalog search use case

We present a use case showing how to augment the search capabilities of an existing search engine for product catalogs, such as ecommerce portals, using generative AI and customer data.

Each customer will have their own requirements, so we adopt the framework presented in the previous sections and show an implementation of the framework for the catalog search use case. You can use this framework both for catalog search use cases and as a foundation to be extended based on your requirements.

One additional benefit of this catalog search implementation is that it’s pluggable to existing ecommerce portals, search engines, and recommender systems, so you don’t have to redesign or rebuild your processes and tools; this solution will augment what you currently have with limited changes required.

The solution architecture and workflow is shown in the following figure.

The workflow consists of the following steps:

  1. The end-user browses the product catalog and submits a search, in natural language, using the web interface of the frontend catalog application (not shown). The catalog frontend application sends the user search to the generative AI application. Application logic is currently implemented as a container, but it can be deployed with AWS Lambda as required.
  2. The generative AI application connects to Amazon Bedrock to convert the user search into embeddings.
  3. The application connects with OpenSearch Service to search and retrieve relevant search results (using an OpenSearch index containing products). The application also connects to another OpenSearch index to get user reviews for products listed in the search results. In terms of searches, different options are possible, such as k-NN, hybrid search, or neural sparse search. For this post, we use k-NN search. At this stage, before creating the final prompt for the LLM, the application can perform an additional step to retrieve situational context from operational databases, such as customer profiles, user preferences, and other personalization information.
  4. The application gets prompt templates from an S3 data lake and creates the engineered prompt.
  5. The application sends the prompt to Amazon Bedrock and retrieves the LLM output.
  6. The user interaction is stored in a data lake for downstream usage and BI analysis.
  7. The Amazon Bedrock output retrieved in Step 5 is sent to the catalog application frontend, which shows results on the web UI to the end-user.
  8. DynamoDB stores the product list used to display products in the ecommerce product catalog. DynamoDB zero-ETL integration with OpenSearch Service is used to replicate product keys into OpenSearch.
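The k-NN search in Step 3 can be sketched as follows. This Python example only builds the request body in the shape used by the OpenSearch k-NN query type and reads hits from a standard search response; it does not call any service. The field name product_embedding and the helper names are assumptions for illustration, and the embedding vector is expected to come from the model invoked in Step 2.

```python
def knn_query(vector, k=5, field="product_embedding"):
    """Build a k-NN search body; `field` is a hypothetical vector field name."""
    if not vector:
        raise ValueError("vector must not be empty")
    return {
        "size": k,
        "query": {
            "knn": {
                field: {
                    "vector": list(vector),  # query embedding from Step 2
                    "k": k,                  # number of nearest neighbors
                }
            }
        },
    }


def top_products(response):
    """Extract (id, score, source) tuples from a search response dict."""
    hits = response.get("hits", {}).get("hits", [])
    return [(h["_id"], h["_score"], h.get("_source", {})) for h in hits]
```

The product IDs returned by top_products could then be used to look up reviews in the second index before the prompt is assembled in Step 4.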

Security considerations

Security and compliance are key concerns for any business. When adopting the solution described in this post, you should always factor in the Security Pillar best practices from the AWS Well-Architected Framework.

There are different security categories to consider and different AWS Security services you can use in each security category. The following are some examples relevant for the architecture shown in this post:

  • Data protection – You can use AWS Key Management Service (AWS KMS) to manage keys and encrypt data based on the data classification policies defined. You can also use AWS Secrets Manager to manage, retrieve, and rotate database credentials, API keys, and other secrets throughout their lifecycles.
  • Identity and access management – You can use AWS Identity and Access Management (IAM) to specify who or what can access services and resources in AWS, centrally manage fine-grained permissions, and analyze access to refine permissions across AWS.
  • Detection and response – You can use AWS CloudTrail to track and provide detailed audit trails of user and system actions to support audits and demonstrate compliance. Additionally, you can use Amazon CloudWatch to observe and monitor resources and applications.
  • Network security – You can use AWS Firewall Manager to centrally configure and manage firewall rules across your accounts and AWS network security services, such as AWS WAF, AWS Network Firewall, and others.

Conclusion

In this post, we discussed the importance of using customer data to differentiate generative AI usage in applications. We presented a reference framework (including a functional architecture and a technical architecture) to implement a generative AI application using customer data and an in-context learning pattern with RAG-provided data. We then presented an example of how to apply this framework to design a generative AI application using customer data to augment search capabilities and personalize the search results of an ecommerce product catalog.

Contact AWS to get more information on how to implement this framework for your use case. We’re also happy to share the technical asset presented in this post to help you get started building generative AI applications with your data for your specific use case.


About the Authors

Diego Colombatto is a Senior Partner Solutions Architect at AWS. He brings more than 15 years of experience in designing and delivering Digital Transformation projects for enterprises. At AWS, Diego works with partners and customers advising how to leverage AWS technologies to translate business needs into solutions.

Angel Conde Manjon is a Sr. EMEA Data & AI PSA, based in Madrid. He has previously worked on research related to Data Analytics and Artificial Intelligence in diverse European research projects. In his current role, Angel helps partners develop businesses centered on Data and AI.

Tiziano Curci is a Manager, EMEA Data & AI PDS at AWS. He leads a team that works with AWS Partners (G/SI and ISV) to leverage the most comprehensive set of capabilities spanning databases, analytics, and machine learning, to help customers unlock the power of data through an end-to-end data strategy.

How ZS built a clinical knowledge repository for semantic search using Amazon OpenSearch Service and Amazon Neptune

Post Syndicated from Abhishek Pan original https://aws.amazon.com/blogs/big-data/how-zs-built-a-clinical-knowledge-repository-for-semantic-search-using-amazon-opensearch-service-and-amazon-neptune/

In this blog post, we will highlight how ZS Associates used multiple AWS services to build a highly scalable, highly performant, clinical document search platform. This platform is an advanced information retrieval system engineered to assist healthcare professionals and researchers in navigating vast repositories of medical documents, medical literature, research articles, clinical guidelines, protocol documents, activity logs, and more. The goal of this search platform is to locate specific information efficiently and accurately to support clinical decision-making, research, and other healthcare-related activities by combining queries across all the different types of clinical documentation.

ZS is a management consulting and technology firm focused on transforming global healthcare. We use leading-edge analytics, data, and science to help clients make intelligent decisions. We serve clients in a wide range of industries, including pharmaceuticals, healthcare, technology, financial services, and consumer goods. We developed and host several applications for our customers on Amazon Web Services (AWS). ZS is also an AWS Advanced Consulting Partner as well as an Amazon Redshift Service Delivery Partner. As it relates to the use case in the post, ZS is a global leader in integrated evidence and strategy planning (IESP), a set of services that help pharmaceutical companies to deliver a complete and differentiated evidence package for new medicines.

ZS uses several AWS service offerings across the variety of their products, client solutions, and services. AWS services such as Amazon Neptune and Amazon OpenSearch Service form part of their data and analytics pipelines, and AWS Batch is used for long-running data and machine learning (ML) processing tasks.

Clinical data is highly connected in nature, so ZS used Neptune, a fully managed, high performance graph database service built for the cloud, as the database to capture the ontologies and taxonomies associated with the data that formed the supporting knowledge graph. For our search requirements, we used OpenSearch Service, an open source, distributed search and analytics suite.

About the clinical document search platform

Clinical documents comprise a wide variety of digital records, including:

  • Study protocols
  • Evidence gaps
  • Clinical activities
  • Publications

Within global biopharmaceutical companies, there are several key personas who are responsible for generating evidence for new medicines. This evidence supports decisions by payers, health technology assessments (HTAs), physicians, and patients when making treatment decisions. Evidence generation is rife with knowledge management challenges. Over the life of a pharmaceutical asset, hundreds of studies and analyses are completed, and it becomes challenging to maintain a good record of all the evidence to address incoming questions from external healthcare stakeholders such as payers, providers, physicians, and patients. Furthermore, almost none of the information associated with evidence generation activities (such as health economics and outcomes research (HEOR), real-world evidence (RWE), collaboration studies, and investigator sponsored research (ISR)) exists as structured data; instead, the richness of the evidence activities exists in protocol documents (study design) and study reports (outcomes). Therein lies the irony: teams who are in the business of knowledge generation struggle with knowledge management.

ZS unlocked new value from unstructured data for evidence generation leads by applying large language models (LLMs) and generative artificial intelligence (AI) to power advanced semantic search on evidence protocols. Now, evidence generation leads (medical affairs, HEOR, and RWE) can have a natural-language, conversational exchange and return a list of evidence activities with high relevance considering both structured data and the details of the studies from unstructured sources.

Overview of solution

The solution was designed in layers. The document processing layer supports document ingestion and orchestration. The semantic search platform (application) layer supports backend search and the user interface. Multiple different types of data sources, including media, documents, and external taxonomies, were identified as relevant for capture and processing within the semantic search platform.

Document processing solution framework layer

All components and sub-layers are orchestrated using Amazon Managed Workflows for Apache Airflow. The pipeline in Airflow is scaled automatically based on the workload using Batch. We can broadly divide layers here as shown in the following figure:

This diagram represents the document processing solution framework layers. It provides details of the orchestration pipeline, which is hosted in Amazon MWAA and contains components such as data crawling, data ingestion, the NLP layer, and database ingestion.

Document Processing Solution Framework Layers

Data crawling:

In the data crawling layer, documents are retrieved from a specified source SharePoint location and deposited into a designated Amazon Simple Storage Service (Amazon S3) bucket. These documents could be in a variety of formats, such as PDF, Microsoft Word, and Excel, and are processed using format-specific adapters.

Data ingestion:

  • The data ingestion layer is the first step of the proposed framework. At this layer, data from a variety of sources smoothly enters the system’s advanced processing setup. In the pipeline, the data ingestion process takes shape through a thoughtfully structured sequence of steps.
  • These steps include creating a unique run ID each time a pipeline is run, managing natural language processing (NLP) model versions in the versioning table, identifying document formats, and ensuring the health of NLP model services with a service health check.
  • The process then proceeds with the transfer of data from the input layer to the landing layer, creation of dynamic batches, and continuous tracking of document processing status throughout the run. In case of any issues, a failsafe mechanism halts the process, enabling a smooth transition to the NLP phase of the framework.

Database ingestion:

The reporting layer processes the JSON data from the feature extraction layer and converts it into CSV files. Each CSV file contains specific information extracted from dedicated sections of documents. Subsequently, the pipeline generates a triple file using the data from these CSV files, where each set of entities signifies relationships in a subject-predicate-object format. This triple file is intended for ingestion into Neptune and OpenSearch Service. In the full document embedding module, the document content is segmented into chunks, which are then transformed into embeddings using LLMs such as Llama-2 and BGE. These embeddings, along with metadata such as the document ID and page number, are stored in OpenSearch Service. We use various chunking strategies to enhance text comprehension. Semantic chunking divides text into sentences, groups them into sets, and merges similar ones based on embeddings.

Agentic chunking uses LLMs to determine context-driven chunk sizes, focusing on proposition-based division and simplifying complex sentences. Additionally, context and document aware chunking adapts chunking logic to the nature of the content for more effective processing.
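
The semantic chunking strategy can be sketched as follows. This is an illustration under stated assumptions rather than the pipeline's implementation: `toy_embed` is a hypothetical bag-of-words stand-in for a real embedding model such as BGE, and the regular-expression sentence splitter and the 0.3 similarity threshold are arbitrary choices.

```python
import math
import re
from collections import Counter


def toy_embed(text):
    """Hypothetical stand-in for an embedding model: bag-of-words counts."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def semantic_chunks(text, threshold=0.3, embed=toy_embed):
    """Split text into sentences, then merge adjacent sentences that are similar."""
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    chunks = []
    for sentence in sentences:
        # Extend the current chunk while the next sentence stays on topic.
        if chunks and cosine(embed(chunks[-1]), embed(sentence)) >= threshold:
            chunks[-1] = chunks[-1] + " " + sentence
        else:
            chunks.append(sentence)
    return chunks
```

Swapping `toy_embed` for a function that calls a deployed embedding model (and `cosine` for a dense-vector version) would keep the merging logic unchanged.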

NLP:

The NLP layer serves as a crucial component in extracting specific sections or entities from documents. The feature extraction stage proceeds with localization, where sections are identified within the document to narrow down the search space for further tasks like entity extraction. LLMs are used to summarize the text extracted from document sections, enhancing the efficiency of this process. Following localization, the feature extraction step involves extracting features from the identified sections using various procedures. These procedures, prioritized based on their relevance, use models like Llama-2-7b, Mistral-7b, Flan-T5-xl, and Flan-T5-xxl to extract important features and entities from the document text.

The auto-mapping phase ensures consistency by mapping extracted features to standard terms present in the ontology. This is achieved through matching the embeddings of extracted features with those stored in the OpenSearch Service index. Finally, in the Document Layout Cohesion step, the output from the auto-mapping phase is adjusted to aggregate entities at the document level, providing a cohesive representation of the document’s content.

Semantic search platform application layer

This layer, shown in the following figure, uses Neptune as the graph database and OpenSearch Service as the vector engine.

Semantic search platform application layer

Amazon OpenSearch Service:

OpenSearch Service served the dual purpose of facilitating full-text search and embedding-based semantic search. The OpenSearch Service vector engine capability helped to drive Retrieval-Augmented Generation (RAG) workflows using LLMs. This helped to provide a summarized output for search after the retrieval of a relevant document for the input query. The method used for indexing embeddings was FAISS.

OpenSearch Service domain details:

  • Version of OpenSearch Service: 2.9
  • Number of data nodes: 1
  • Instance type (data node): r6g.2xlarge.search
  • Volume size: 500 GB (gp3)
  • Number of Availability Zones (data node): 1
  • Dedicated master node: Enabled
  • Number of Availability Zones (master nodes): 3
  • Number of master nodes: 3
  • Instance type (master node): r6g.large.search

To determine the nearest neighbor, we employ the Hierarchical Navigable Small World (HNSW) algorithm. We used the FAISS approximate k-NN library for indexing and searching and the Euclidean distance (L2 norm) for distance calculation between two vectors.
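
As a rough illustration of this configuration, the following Python sketch builds an index body that requests HNSW with the FAISS engine and L2 distance, and includes a reference L2 computation. The field name `embedding` and the `m` and `ef_construction` values are assumptions, not ZS's actual settings; the dimension of 768 in the usage line corresponds to typical BGE base models and should be adjusted to the embedding model in use.

```python
import math


def build_knn_index_body(dimension, field="embedding", m=16, ef_construction=128):
    """Index settings and mapping for an HNSW index using FAISS and L2 distance.

    The field name and HNSW parameters are hypothetical example values.
    """
    return {
        "settings": {"index": {"knn": True}},
        "mappings": {
            "properties": {
                field: {
                    "type": "knn_vector",
                    "dimension": dimension,
                    "method": {
                        "name": "hnsw",
                        "engine": "faiss",
                        "space_type": "l2",
                        "parameters": {"m": m, "ef_construction": ef_construction},
                    },
                }
            }
        },
    }


def l2_distance(a, b):
    """Euclidean (L2) distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


index_body = build_knn_index_body(768)
```

The resulting dictionary can be passed as the body of an index creation request; a smaller L2 distance means two embeddings are closer and therefore more similar.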

Amazon Neptune:

Neptune enables full-text search (FTS) through the integration with OpenSearch Service. A native streaming service for enabling FTS provided by AWS was established to replicate data from Neptune to OpenSearch Service. Based on the business use case for search, a graph model was defined. Considering the graph model, subject matter experts from the ZS domain team curated a custom taxonomy capturing the hierarchical flow of classes and sub-classes pertaining to clinical data. Open source taxonomies and ontologies that would be part of the knowledge graph were also identified, as were the sections and entities to be extracted from clinical documents. An unstructured document processing pipeline developed by ZS processed the documents in parallel and populated triples in RDF format from documents for Neptune ingestion.

The triples are created in such a way that semantically similar concepts are linked—hence creating a semantic layer for search. After the triples files are created, they’re stored in an S3 bucket. Using the Neptune Bulk Loader, we were able to load millions of triples to the graph.
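
To make the triple format concrete, the following sketch converts subject-predicate-object rows into N-Triples, one of the RDF serializations that the Neptune Bulk Loader accepts. The post does not say which serialization ZS used, so N-Triples is an assumption here, as are the `http://example.org/clinical/` namespace and the example identifiers.

```python
from urllib.parse import quote


def iri(base, local_name):
    """Wrap a percent-encoded local name in an IRI under the given namespace."""
    return "<" + base + quote(local_name, safe="") + ">"


def escape_literal(value):
    """Escape backslashes, double quotes, and newlines for an N-Triples literal."""
    return value.replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")


def to_ntriples(rows, base="http://example.org/clinical/"):
    """Convert (subject, predicate, object, object_is_literal) rows to N-Triples."""
    lines = []
    for subject, predicate, obj, obj_is_literal in rows:
        if obj_is_literal:
            obj_term = '"' + escape_literal(obj) + '"'
        else:
            obj_term = iri(base, obj)
        lines.append(f"{iri(base, subject)} {iri(base, predicate)} {obj_term} .")
    return "\n".join(lines)


# Hypothetical rows, as they might come out of the CSV files described earlier.
rows = [
    ("study_001", "hasIndication", "oncology", False),
    ("study_001", "hasTitle", "A pilot study", True),
]
triples_text = to_ntriples(rows)
```

Because both rows share the `study_001` subject, and other studies can point at the same `oncology` node, linked concepts end up connected in the graph after loading.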

Neptune ingests both structured and unstructured data, simplifying the process to retrieve content across different sources and formats. At this point, we were able to discover previously unknown relationships between the structured and unstructured data, which was then made available to the search platform. We used SPARQL query federation to return results from the enriched knowledge graph in the Neptune graph database and integrated with OpenSearch Service.

Neptune was able to automatically scale storage and compute resources to accommodate growing datasets and concurrent API calls. Presently, the application sustains approximately 3,000 daily active users, and we observe approximately 30–50 users initiating queries simultaneously. The Neptune graph holds approximately 4.87 million triples, and the count is increasing because of our daily and weekly ingestion pipeline routines.

Neptune configuration:

  • Instance Class: db.r5d.4xlarge
  • Engine version: 1.2.0.1

LLMs:

Large language models (LLMs) like Llama-2, Mistral, and Zephyr are used for extraction of sections and entities. Models like Flan-T5 were also used for extraction of other similar entities used in the procedures. These selected segments and entities are crucial for domain-specific searches and therefore receive higher priority in the learning-to-rank algorithm used for search.

Additionally, LLMs are used to generate a comprehensive summary of the top search results.

The LLMs are hosted on Amazon Elastic Kubernetes Service (Amazon EKS) with GPU-enabled node groups to ensure rapid inference processing. We’re using different models for different use cases. For example, to generate embeddings we deployed a BGE base model, while Mistral, Llama-2, Zephyr, and others are used to extract specific medical entities, perform part extraction, and summarize search results. By using different LLMs for distinct tasks, we aim to enhance accuracy within narrow domains, thereby improving the overall relevance of the system.

Fine tuning:

Models that were already fine-tuned on pharma-specific documents were used. The models used were:

  • PharMolix/BioMedGPT-LM-7B (Llama-2 fine-tuned on medical data)
  • emilyalsentzer/Bio_ClinicalBERT
  • stanford-crfm/BioMedLM
  • microsoft/biogpt

Re-ranker, sorter, and filter stage:

Remove any stop words and special characters from the user input query to ensure a clean query. After pre-processing the query, create search terms by combining the query terms into n-grams of varying length. This step enriches the search scope and improves the chances of finding relevant results. For instance, if the input query is “machine learning algorithms,” generating n-grams could result in terms like “machine learning,” “learning algorithms,” and “machine learning algorithms”.

Run the search terms simultaneously using the search API to access both the Neptune graph and the OpenSearch Service indexes. This hybrid approach broadens the search coverage, tapping into the strengths of both data sources.

A specific weight is assigned to each result obtained from the data sources based on the domain’s specifications. This weight reflects the relevance and significance of the result within the context of the search query and the underlying domain. For example, a result from the Neptune graph might be weighted higher if the query pertains to graph-related concepts (that is, the search term is related directly to the subject or object of a triple), whereas a result from OpenSearch Service might be given more weight if it aligns closely with text-based information.

Documents that appear in both the Neptune graph and OpenSearch Service receive the highest priority, because they likely offer comprehensive insights. Next in priority are documents exclusively sourced from the Neptune graph, followed by those solely from OpenSearch Service. This hierarchical arrangement ensures that the most relevant and comprehensive results are presented first. After factoring in these considerations, a final score is calculated for each result. Sorting the results based on their final scores ensures that the most relevant information is presented in the top n results.
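
The n-gram expansion and the priority-based ordering described in this stage can be sketched as follows. This is one possible reading of the description, not ZS's implementation: the stop-word list, the source weights of 0.6 and 0.4, and the assumption that each source returns scores between 0 and 1 are all hypothetical.

```python
import re

# Hypothetical stop-word list for illustration only.
STOP_WORDS = frozenset({"a", "an", "the", "of", "for", "in", "on", "and"})


def ngram_terms(query, stop_words=STOP_WORDS):
    """Clean the query and return all n-grams with two or more tokens."""
    tokens = [t for t in re.findall(r"[a-z0-9]+", query.lower()) if t not in stop_words]
    terms = []
    for n in range(2, len(tokens) + 1):
        for i in range(len(tokens) - n + 1):
            terms.append(" ".join(tokens[i:i + n]))
    # Fall back to the single tokens when the query is too short for n-grams.
    return terms or tokens


def rank_results(graph_hits, text_hits, graph_weight=0.6, text_weight=0.4, top_n=10):
    """Order document IDs by source tier first, then by weighted score.

    graph_hits and text_hits map document IDs to relevance scores.
    Tier 0: found in both sources; tier 1: graph only; tier 2: text index only.
    """
    scored = []
    for doc_id in set(graph_hits) | set(text_hits):
        in_graph, in_text = doc_id in graph_hits, doc_id in text_hits
        tier = 0 if (in_graph and in_text) else (1 if in_graph else 2)
        score = (graph_weight * graph_hits.get(doc_id, 0.0)
                 + text_weight * text_hits.get(doc_id, 0.0))
        scored.append((tier, -score, doc_id))
    return [doc_id for _, _, doc_id in sorted(scored)[:top_n]]


terms = ngram_terms("machine learning algorithms")
ranking = rank_results({"d1": 0.9, "d2": 0.5}, {"d2": 0.4, "d3": 0.99})
```

Sorting on the tier before the score enforces the stated priority order even when a document from a single source has a very high individual score.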

Final UI

An evidence catalogue is aggregated from disparate systems. It provides a comprehensive repository of completed, ongoing and planned evidence generation activities. As evidence leads make forward-looking plans, the existing internal base of evidence is made readily available to inform decision-making.

The following video is a demonstration of an evidence catalog:

Customer impact

When completed, the solution provided the following customer benefits:

  • Searching across multiple data sources (structured and unstructured documents) provides visibility into complex hidden relationships and insights.
  • Clinical documents often contain a mix of structured and unstructured data. Neptune can store structured information in a graph format, while the vector database can handle unstructured data using embeddings. This integration provides a comprehensive approach to querying and analyzing diverse clinical information.
  • By building a knowledge graph using Neptune, you can enrich the clinical data with additional contextual information. This can include relationships between diseases, treatments, medications, and patient records, providing a more holistic view of healthcare data.
  • The search application helped in staying informed about the latest research, clinical developments, and competitive landscape.
  • This has enabled customers to make timely decisions, identify market trends, and position products based on a comprehensive understanding of the industry.
  • The application helped in monitoring adverse events, tracking safety signals, and ensuring that drug-related information is easily accessible and understandable, thereby supporting pharmacovigilance efforts.
  • The search application is currently running in production with 3,000 active users.

Customer success criteria

The following success criteria were used to evaluate the solution:

  • Quick, high accuracy search results: The top three search results were 99% accurate with an overall latency of less than 3 seconds for users.
  • Identified, extracted portions of the protocol: The sections identified had a precision of 0.98 and recall of 0.87.
  • Accurate and relevant search results based on simple human language that answer the user’s question.
  • Clear UI and transparency on which portions of the aligned documents (protocol, clinical study reports, and publications) matched the text extraction.
  • Knowing what evidence is completed or in-process reduces redundancy in newly proposed evidence activities.

Challenges faced and learnings

We faced two main challenges in developing and deploying this solution.

Large data volume

The unstructured documents were required to be embedded completely and OpenSearch Service helped us achieve this with the right configuration. This involved deploying OpenSearch Service with master nodes and allocating sufficient storage capacity for embedding and storing unstructured document embeddings entirely. We stored up to 100 GB of embeddings in OpenSearch Service.

Inference time reduction

In the search application, it was vital that the search results were retrieved with lowest possible latency. With the hybrid graph and embedding search, this was challenging.

We addressed high latency issues by using an interconnected framework of graphs and embeddings. Each search method complemented the other, leading to optimal results. Our streamlined search approach ensures efficient queries of both the graph and the embeddings, eliminating any inefficiencies. The graph model was designed to minimize the number of hops required to navigate from one entity to another, and we improved its performance by avoiding the storage of bulky metadata. Any metadata too large for the graph was stored in OpenSearch, which served as our metadata store for graph and vector store for embeddings. Embeddings were generated using context-aware chunking of content to reduce the total embedding count and retrieval time, resulting in efficient querying with minimal inference time.

The Horizontal Pod Autoscaler (HPA) running on Amazon EKS automatically scales the number of pods based on user demand or query load, optimizing resource utilization and maintaining application performance during peak usage periods.

Conclusion

In this post, we described how to build an advanced information retrieval system designed to assist healthcare professionals and researchers in navigating through a diverse range of medical documents, including study protocols, evidence gaps, clinical activities, and publications. By using Amazon OpenSearch Service as a distributed search and vector database and Amazon Neptune as a knowledge graph, ZS was able to remove the undifferentiated heavy lifting associated with building and maintaining such a complex platform.

If you’re facing similar challenges in managing and searching through vast repositories of medical data, consider exploring the powerful capabilities of OpenSearch Service and Neptune. These services can help you unlock new insights and enhance your organization’s knowledge management capabilities.


About the authors

Abhishek Pan is a Sr. Specialist SA-Data working with AWS India Public sector customers. He engages with customers to define data-driven strategy, provide deep dive sessions on analytics use cases, and design scalable and performant analytical applications. He has 12 years of experience and is passionate about databases, analytics, and AI/ML. He is an avid traveler and tries to capture the world through his lens.

Gourang Harhare is a Senior Solutions Architect at AWS based in Pune, India. With a robust background in large-scale design and implementation of enterprise systems, application modernization, and cloud native architectures, he specializes in AI/ML, serverless, and container technologies. He enjoys solving complex problems and helping customers be successful on AWS. In his free time, he likes to play table tennis, go trekking, or read books.

Kevin Phillips is a Neptune Specialist Solutions Architect working in the UK. He has 20 years of development and solutions architectural experience, which he uses to help support and guide customers. He has been enthusiastic about evangelizing graph databases since joining the Amazon Neptune team, and is happy to talk graph with anyone who will listen.

Sandeep Varma is a principal in ZS’s Pune, India, office with over 25 years of technology consulting experience, which includes architecting and delivering innovative solutions for complex business problems leveraging AI and technology. Sandeep has been critical in driving various large-scale programs at ZS Associates. He was a founding member of the Big Data Analytics Centre of Excellence in ZS and currently leads the Enterprise Service Center of Excellence. Sandeep is a thought leader and has served as chief architect of multiple large-scale enterprise big data platforms. He specializes in rapidly building high-performance teams focused on cutting-edge technologies and high-quality delivery.

Alex Turok has over 16 years of consulting experience focused on global and US biopharmaceutical companies. Alex’s expertise is in solving ambiguous, unstructured problems for commercial and medical leadership. For his clients, he seeks to drive lasting organizational change by defining the problem, identifying the strategic options, informing a decision, and outlining the transformation journey. He has worked extensively in portfolio and brand strategy, pipeline and launch strategy, integrated evidence strategy and planning, organizational design, and customer capabilities. Since joining ZS, Alex has worked across marketing, sales, medical, access, and patient services and has touched over twenty therapeutic categories, with depth in oncology, hematology, immunology and specialty therapeutics.

Integrate sparse and dense vectors to enhance knowledge retrieval in RAG using Amazon OpenSearch Service

Post Syndicated from Yuanbo Li original https://aws.amazon.com/blogs/big-data/integrate-sparse-and-dense-vectors-to-enhance-knowledge-retrieval-in-rag-using-amazon-opensearch-service/

In the context of Retrieval-Augmented Generation (RAG), knowledge retrieval plays a crucial role, because the effectiveness of retrieval directly impacts the maximum potential of large language model (LLM) generation.

Currently, in RAG retrieval, the most common approach is to use semantic search based on dense vectors. However, dense embeddings do not perform well in understanding specialized terms or jargon in vertical domains. A more advanced method is to combine it with traditional inverted-index (BM25) based retrieval, but this approach requires spending a considerable amount of time customizing lexicons, synonym dictionaries, and stop-word dictionaries for optimization.

In this post, instead of using the BM25 algorithm, we introduce sparse vector retrieval. This approach offers improved term expansion while maintaining interpretability. We walk through the steps of integrating sparse and dense vectors for knowledge retrieval using Amazon OpenSearch Service and run experiments on public datasets to show its advantages. The full code is available in the GitHub repo aws-samples/opensearch-dense-spase-retrieval.

What is sparse vector retrieval

Sparse vector retrieval is a recall method based on an inverted index, with an added step of term expansion. It comes in two modes: document-only and bi-encoder. For more details about these two terms, see Improving document retrieval with sparse semantic encoders.

Simply put, in document-only mode, term expansion is performed only during document ingestion. In bi-encoder mode, term expansion is conducted both during ingestion and at the time of query. Bi-encoder mode improves performance but may cause more latency. The following figure demonstrates its effectiveness.

Neural sparse search in OpenSearch achieves 12.7% (document-only) to 20% (bi-encoder) higher NDCG@10, comparable to the TAS-B dense vector model.

With neural sparse search, you don’t need to configure the dictionary yourself; it automatically expands terms for the user. Additionally, in an OpenSearch index with a small and specialized dataset, hit terms are generally few and the calculated term frequencies may lead to unreliable term weights. This may lead to significant bias or distortion in BM25 scoring. Sparse vector retrieval, however, first expands terms, greatly increasing the number of hit terms, which helps produce more reliable scores.
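
The effect of term expansion on scoring can be illustrated with a small sketch. The token weights below are hand-written for illustration and are not produced by a real sparse encoder, and giving every query token a weight of 1.0 in document-only mode is a simplifying assumption. The score is the dot product over tokens shared by the query and the document.

```python
def sparse_score(query_vector, doc_vector):
    """Dot product of two sparse vectors stored as token-to-weight dictionaries."""
    return sum(w * doc_vector[t] for t, w in query_vector.items() if t in doc_vector)


def document_only_query(query_text):
    """Document-only mode: the query is tokenized but not expanded.

    Unit weights are an assumption made for this sketch.
    """
    return {token: 1.0 for token in query_text.lower().split()}


# A document about a "notebook" whose sparse vector was expanded at ingestion
# time with related terms (hypothetical weights).
doc_vector = {"notebook": 1.2, "laptop": 0.9, "computer": 0.7, "battery": 0.4}

# Document-only mode: the query matches through the terms added to the document.
doc_only_score = sparse_score(document_only_query("laptop battery"), doc_vector)

# Bi-encoder mode: the query is also expanded (hypothetical weights), so it
# additionally matches "notebook".
bi_encoder_query = {"laptop": 1.0, "battery": 0.8, "notebook": 0.6}
bi_encoder_score = sparse_score(bi_encoder_query, doc_vector)
```

Even though the literal document text may never contain "laptop", the expanded document vector lets the query hit it, which is how expansion increases the number of hit terms.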

Although the absolute metrics of the sparse vector model can’t surpass those of the best dense vector models, it possesses unique and advantageous characteristics. For instance, in terms of the NDCG@10 metric, as mentioned in Improving document retrieval with sparse semantic encoders, evaluations on some datasets reveal that its performance could be better than state-of-the-art dense vector models, such as in the DBPedia dataset. This indicates a certain level of complementarity between them. Intuitively, for some extremely short user inputs, the vectors generated by dense vector models might have significant semantic uncertainty, where overlaying with a sparse vector model could be beneficial. Additionally, sparse vector retrieval still maintains interpretability, and you can still observe the scoring calculation through the explanation command. To take advantage of both methods, OpenSearch has already introduced a built-in feature called hybrid search.
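
The general idea behind combining the two result sets can be sketched as min-max normalization of each score list followed by a weighted average. This is a simplified illustration, not OpenSearch's exact implementation: the equal weights (assumed to sum to 1), the function names, and the treatment of a document missing from one list as scoring 0 there are all assumptions.

```python
def min_max_normalize(scores):
    """Rescale a mapping of document ID to score into the range [0, 1]."""
    if not scores:
        return {}
    low, high = min(scores.values()), max(scores.values())
    if high == low:
        return {doc_id: 1.0 for doc_id in scores}
    return {doc_id: (s - low) / (high - low) for doc_id, s in scores.items()}


def hybrid_scores(dense_scores, sparse_scores, dense_weight=0.5, sparse_weight=0.5):
    """Combine normalized dense and sparse scores with a weighted average."""
    dense_n = min_max_normalize(dense_scores)
    sparse_n = min_max_normalize(sparse_scores)
    combined = {}
    for doc_id in set(dense_n) | set(sparse_n):
        combined[doc_id] = (dense_weight * dense_n.get(doc_id, 0.0)
                            + sparse_weight * sparse_n.get(doc_id, 0.0))
    # Highest combined score first.
    return sorted(combined.items(), key=lambda item: item[1], reverse=True)


ranked = hybrid_scores({"a": 0.9, "b": 0.5, "c": 0.1}, {"b": 12.0, "c": 4.0})
```

Normalizing first matters because dense similarity scores and sparse scores live on different scales, so averaging the raw values would let one retriever dominate.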

How to combine dense and sparse?

1. Deploy a dense vector model

To get more valuable test results, we selected Cohere-embed-multilingual-v3.0, which is one of several popular models used in production for dense vectors. We can access it through Amazon Bedrock and use the following two functions to create a connector for bedrock-cohere and then register it as a model in OpenSearch. You can get its model ID from the response.

import json

import boto3
import requests
from requests_aws4auth import AWS4Auth


def create_bedrock_cohere_connector(account_id, aos_endpoint, input_type='search_document'):
    # input_type could be search_document | search_query
    service = 'es'
    session = boto3.Session()
    credentials = session.get_credentials()
    region = session.region_name
    awsauth = AWS4Auth(credentials.access_key, credentials.secret_key, region, service, session_token=credentials.token)

    path = '/_plugins/_ml/connectors/_create'
    url = 'https://' + aos_endpoint + path

    role_name = "OpenSearchAndBedrockRole"
    role_arn = "arn:aws:iam::{}:role/{}".format(account_id, role_name)
    model_name = "cohere.embed-multilingual-v3"

    bedrock_url = "https://bedrock-runtime.{}.amazonaws.com/model/{}/invoke".format(region, model_name)

    payload = {
      "name": "Amazon Bedrock Connector: Cohere doc embedding",
      "description": "The connector to the Bedrock Cohere multilingual doc embedding model",
      "version": 1,
      "protocol": "aws_sigv4",
      "parameters": {
        "region": region,
        "service_name": "bedrock"
      },
      "credential": {
        "roleArn": role_arn
      },
      "actions": [
        {
          "action_type": "predict",
          "method": "POST",
          "url": bedrock_url,
          "headers": {
            "content-type": "application/json",
            "x-amz-content-sha256": "required"
          },
          # Use the input_type argument instead of a hardcoded value
          "request_body": "{ \"texts\": ${parameters.texts}, \"input_type\": \"" + input_type + "\" }",
          "pre_process_function": "connector.pre_process.cohere.embedding",
          "post_process_function": "connector.post_process.cohere.embedding"
        }
      ]
    }
    headers = {"Content-Type": "application/json"}

    r = requests.post(url, auth=awsauth, json=payload, headers=headers)
    return json.loads(r.text)["connector_id"]
    
def register_and_deploy_aos_model(aos_client, model_name, model_group_id, description, connecter_id):
    request_body = {
        "name": model_name,
        "function_name": "remote",
        "model_group_id": model_group_id,
        "description": description,
        "connector_id": connecter_id
    }

    response = aos_client.transport.perform_request(
        method="POST",
        url=f"/_plugins/_ml/models/_register?deploy=true",
        body=json.dumps(request_body)
    )

    return response

2. Deploy a sparse vector model

Currently, you can’t deploy the sparse vector model in an OpenSearch Service domain. You must deploy it in Amazon SageMaker first, then integrate it through an OpenSearch Service model connector. For more information, see Amazon OpenSearch Service ML connectors for AWS services.

Complete the following steps:

2.1 On the OpenSearch Service console, choose Integrations in the navigation pane.

2.2 Under Integration with Sparse Encoders through Amazon SageMaker, choose to configure a VPC domain or public domain.

Next, you configure the AWS CloudFormation template.

2.3 Enter the parameters as shown in the following screenshot.

2.4 Get the sparse model ID from the stack output.

3. Set up pipelines for ingestion and search

Use the following code to create pipelines for ingestion and search. With these two pipelines, you don’t need to run model inference yourself; you only ingest the text field, and the pipelines generate the embeddings.

PUT /_ingest/pipeline/neural-sparse-pipeline
{
  "description": "neural sparse encoding pipeline",
  "processors" : [
    {
      "sparse_encoding": {
        "model_id": "<neural_sparse_model_id>",
        "field_map": {
           "content": "sparse_embedding"
        }
      }
    },
    {
      "text_embedding": {
        "model_id": "<cohere_ingest_model_id>",
        "field_map": {
          "content": "dense_embedding"
        }
      }
    }
  ]
}

PUT /_search/pipeline/hybird-search-pipeline
{
  "description": "Post processor for hybrid search",
  "phase_results_processors": [
    {
      "normalization-processor": {
        "normalization": {
          "technique": "l2"
        },
        "combination": {
          "technique": "arithmetic_mean",
          "parameters": {
            "weights": [
              0.5,
              0.5
            ]
          }
        }
      }
    }
  ]
}
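
To make the effect of the normalization-processor more concrete, the following is a minimal Python sketch of L2 normalization followed by a weighted arithmetic mean. It is a simplified illustration rather than the OpenSearch implementation: the function names and sample scores are hypothetical, and it assumes that a document missing from one sub-query contributes a score of 0 for that sub-query.

```python
import math

def l2_normalize(scores):
    """Divide each score by the L2 norm of all scores returned by one sub-query."""
    norm = math.sqrt(sum(s * s for s in scores.values()))
    if norm == 0:
        return {doc_id: 0.0 for doc_id in scores}
    return {doc_id: s / norm for doc_id, s in scores.items()}

def combine_arithmetic_mean(per_query_scores, weights):
    """Weighted arithmetic mean of the normalized scores across sub-queries."""
    normalized = [l2_normalize(scores) for scores in per_query_scores]
    doc_ids = set().union(*(n.keys() for n in normalized))
    total_weight = sum(weights)
    combined = {}
    for doc_id in doc_ids:
        # Assumption: a document absent from a sub-query scores 0 for it
        weighted = sum(w * n.get(doc_id, 0.0) for w, n in zip(weights, normalized))
        combined[doc_id] = weighted / total_weight
    # Highest combined score first
    return dict(sorted(combined.items(), key=lambda kv: kv[1], reverse=True))

# Hypothetical raw scores from the sparse and dense sub-queries
sparse_scores = {"doc1": 3.0, "doc2": 4.0}
dense_scores = {"doc2": 0.6, "doc3": 0.8}
ranked = combine_arithmetic_mean([sparse_scores, dense_scores], weights=[0.5, 0.5])
print(ranked)
```

In this sketch, doc2 ranks first because it is the only document that both sub-queries return, which is the behavior that makes a hybrid query useful when the two models complement each other.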

4. Create an OpenSearch index with dense and sparse vectors

Use the following code to create an OpenSearch index with dense and sparse vectors. You must specify the default_pipeline as the ingestion pipeline created in the previous step.

PUT {index-name}
{
    "settings" : {
        "index":{
            "number_of_shards" : 1,
            "number_of_replicas" : 0,
            "knn": "true",
            "knn.algo_param.ef_search": 32
        },
        "default_pipeline": "neural-sparse-pipeline"
    },
    "mappings": {
        "properties": {
            "content": {"type": "text", "analyzer": "ik_max_word", "search_analyzer": "ik_smart"},
            "dense_embedding": {
                "type": "knn_vector",
                "dimension": 1024,
                "method": {
                    "name": "hnsw",
                    "space_type": "cosinesimil",
                    "engine": "nmslib",
                    "parameters": {
                        "ef_construction": 512,
                        "m": 32
                    }
                }            
            },
            "sparse_embedding": {
                "type": "rank_features"
            }
        }
    }
}

Testing methodology

1. Experimental data selection

For retrieval evaluation, we would normally use the datasets from BeIR, but not all BeIR datasets are suitable for RAG. To mimic the knowledge retrieval scenario, we chose BeIR/fiqa and squad_v2 as our experimental datasets. The schema of their data is shown in the following figures.

The following is a data preview of squad_v2.

The following is a query preview of BeIR/fiqa.

The following is a corpus preview of BeIR/fiqa.

The BeIR/fiqa dataset contains fields equivalent to the question and context fields of squad_v2. This structure is almost the same as knowledge recall in RAG. In subsequent experiments, we ingest the context field into the OpenSearch index as text content, and use the question field as the query for the retrieval test.

2. Test data ingestion

The following script ingests data into the OpenSearch Service domain:

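import json
import time
from tqdm import tqdm
from setup_model_and_pipeline import get_aos_client
from beir.datasets.data_loader import GenericDataLoader
from beir import LoggingHandler, util

# aos_endpoint, index_name, dataset_name, and data_root_dir are assumed to be defined beforehand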

aos_client = get_aos_client(aos_endpoint)

def ingest_dataset(corpus, aos_client, index_name, bulk_size=50):
    i=0
    bulk_body=[]
    for _id , body in tqdm(corpus.items()):
        text=body["title"]+" "+body["text"]
        bulk_body.append({ "index" : { "_index" : index_name, "_id" : _id } })
        bulk_body.append({ "content" : text })
        i+=1
        if i % bulk_size==0:
            response=aos_client.bulk(bulk_body,request_timeout=100)
            if response["errors"]:
                # Retry the batch once after a short pause
                print("there are errors")
                print(response)
                time.sleep(1)
                response = aos_client.bulk(bulk_body,request_timeout=100)
            bulk_body=[]
        
    # Flush the final partial batch, if any
    if bulk_body:
        response=aos_client.bulk(bulk_body,request_timeout=100)
        assert response["errors"]==False
    aos_client.indices.refresh(index=index_name)

url = f"https://public.ukp.informatik.tu-darmstadt.de/thakur/BEIR/datasets/{dataset_name}.zip"
data_path = util.download_and_unzip(url, data_root_dir)
corpus, queries, qrels = GenericDataLoader(data_folder=data_path).load(split="test")
ingest_dataset(corpus, aos_client=aos_client, index_name=index_name)

3. Performance evaluation of retrieval

In RAG knowledge retrieval, we usually focus on the relevance of top results, so our evaluation uses recall@4 as the metric indicator. The whole test will include various retrieval methods to compare, such as bm25_only, sparse_only, dense_only, hybrid_sparse_dense, and hybrid_dense_bm25.

The following script uses hybrid_sparse_dense to demonstrate the evaluation logic:

from tqdm import tqdm
from beir.retrieval.evaluation import EvaluateRetrieval

def search_by_dense_sparse(aos_client, index_name, query, sparse_model_id, dense_model_id, topk=4):
    request_body = {
      "size": topk,
      "query": {
        "hybrid": {
          "queries": [
            {
              "neural_sparse": {
                  "sparse_embedding": {
                    "query_text": query,
                    "model_id": sparse_model_id,
                    "max_token_score": 3.5
                  }
              }
            },
            {
              "neural": {
                  "dense_embedding": {
                      "query_text": query,
                      "model_id": dense_model_id,
                      "k": 10
                    }
                }
            }
          ]
        }
      }
    }

    response = aos_client.transport.perform_request(
        method="GET",
        url=f"/{index_name}/_search?search_pipeline=hybird-search-pipeline",
        body=json.dumps(request_body)
    )

    return response["hits"]["hits"]
    
url = f"https://public.ukp.informatik.tu-darmstadt.de/thakur/BEIR/datasets/{dataset_name}.zip"
data_path = util.download_and_unzip(url, data_root_dir)
corpus, queries, qrels = GenericDataLoader(data_folder=data_path).load(split="test")
run_res={}
for _id, query in tqdm(queries.items()):
    # Fetch 10 hits per query so that recall@10 can be evaluated
    hits = search_by_dense_sparse(aos_client, index_name, query, sparse_model_id, dense_model_id, topk=10)
    run_res[_id]={item["_id"]:item["_score"] for item in hits}
    
for query_id, doc_dict in tqdm(run_res.items()):
    if query_id in doc_dict:
        doc_dict.pop(query_id)
res = EvaluateRetrieval.evaluate(qrels, run_res, [1, 4, 10])
print("search_by_dense_sparse:")
print(res)
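
For readers unfamiliar with the metric, the following is a minimal Python sketch of how recall@k can be computed from the same kind of qrels and run dictionaries. It is an illustration only, not the BEIR implementation: the function name and the sample data are hypothetical, and it assumes a document counts as relevant when its relevance label is greater than 0.

```python
def recall_at_k(qrels, run, k):
    """Mean over queries of (relevant docs found in the top k) / (all relevant docs)."""
    per_query = []
    for query_id, relevance in qrels.items():
        relevant = {doc_id for doc_id, rel in relevance.items() if rel > 0}
        if not relevant:
            continue  # skip queries without any relevant document
        ranked = sorted(run.get(query_id, {}).items(), key=lambda kv: kv[1], reverse=True)
        top_k = {doc_id for doc_id, _ in ranked[:k]}
        per_query.append(len(relevant & top_k) / len(relevant))
    return sum(per_query) / len(per_query) if per_query else 0.0

# Hypothetical relevance judgments and retrieval scores
qrels = {"q1": {"d1": 1, "d2": 1}, "q2": {"d3": 1}}
run = {
    "q1": {"d1": 0.9, "d5": 0.8, "d2": 0.1},
    "q2": {"d4": 0.7, "d3": 0.2},
}

for k in (1, 2, 4):
    print(f"recall@{k} = {recall_at_k(qrels, run, k):.2f}")
```

A larger k can only keep or raise recall, which is why recall@10 is always at least as high as recall@4 in the results that follow.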

Results

In the context of RAG, developers usually don’t pay attention to the NDCG@10 metric, because the LLM picks the relevant context out of the retrieved results automatically. Recall matters more. Based on our experience with RAG, we measured recall@1, recall@4, and recall@10 for your reference.

The dataset BeIR/fiqa is mainly used for evaluation of retrieval, whereas squad_v2 is mainly used for evaluation of reading comprehension. In terms of retrieval, squad_v2 is much less complicated than BeIR/fiqa. In the real RAG context, the difficulty of retrieval may not be as high as with BeIR/fiqa, so we evaluate both datasets.

The hybrid_dense_sparse method achieved the highest recall at every cutoff on both datasets. The following table shows our results.

| Method | BeIR/fiqa Recall@1 | BeIR/fiqa Recall@4 | BeIR/fiqa Recall@10 | squad_v2 Recall@1 | squad_v2 Recall@4 | squad_v2 Recall@10 |
|---|---|---|---|---|---|---|
| bm25 | 0.112 | 0.215 | 0.297 | 0.59 | 0.771 | 0.851 |
| dense | 0.156 | 0.316 | 0.398 | 0.671 | 0.872 | 0.925 |
| sparse | 0.196 | 0.334 | 0.438 | 0.684 | 0.865 | 0.926 |
| hybrid_dense_sparse | 0.203 | 0.362 | 0.456 | 0.704 | 0.885 | 0.942 |
| hybrid_dense_bm25 | 0.156 | 0.316 | 0.394 | 0.671 | 0.871 | 0.925 |

Conclusion

The new neural sparse search feature in OpenSearch Service version 2.11, when combined with dense vector retrieval, can significantly improve the effectiveness of knowledge retrieval in RAG scenarios. Compared to the combination of bm25 and dense vector retrieval, it’s more straightforward to use and more likely to achieve better results.

OpenSearch Service version 2.12 has recently upgraded its Lucene engine, significantly enhancing the throughput and latency performance of neural sparse search. But the current neural sparse search only supports English. In the future, other languages might be supported. As the technology continues to evolve, it stands to become a popular and widely applicable way to enhance retrieval performance.


About the Authors

YuanBo Li is a Specialist Solution Architect in GenAI/AIML at Amazon Web Services. His interests include RAG (Retrieval-Augmented Generation) and Agent technologies within the field of GenAI, and he is dedicated to proposing innovative GenAI technical solutions tailored to meet diverse business needs.

Charlie Yang is an AWS engineering manager with the OpenSearch Project. He focuses on machine learning, search relevance, and performance optimization.

River Xie is a Gen AI Specialist Solution Architect at Amazon Web Services. River is interested in Agent/Multi-Agent workflows and large language model inference optimization, and is passionate about leveraging cutting-edge generative AI technologies to develop modern applications that solve complex business challenges.

Ren Guo is a manager of Generative AI Specialist Solution Architect Team for the domains of AIML and Data at AWS, Greater China Region.

Reducing long-term logging expenses by 4,800% with Amazon OpenSearch Service

Post Syndicated from Jon Handler original https://aws.amazon.com/blogs/big-data/reducing-long-term-logging-expenses-by-4800-with-amazon-opensearch-service/

When you use Amazon OpenSearch Service for time-bound data like server logs, service logs, application logs, clickstreams, or event streams, storage cost is one of the primary drivers for the overall cost of your solution. Over the last year, OpenSearch Service has released features that have opened up new possibilities for storing your log data in various tiers, enabling you to trade off data latency, durability, and availability. In October 2023, OpenSearch Service announced support for im4gn data nodes, with NVMe SSD storage of up to 30 TB. In November 2023, OpenSearch Service introduced or1, the OpenSearch-optimized instance family, which delivers up to 30% price-performance improvement over existing instances in internal benchmarks and uses Amazon Simple Storage Service (Amazon S3) to provide 11 nines of durability. Finally, in May 2024, OpenSearch Service announced general availability for Amazon OpenSearch Service zero-ETL integration with Amazon S3. These new features join OpenSearch’s existing UltraWarm instances, which provide an up to 90% reduction in storage cost per GB, and UltraWarm’s cold storage option, which lets you detach UltraWarm indexes and durably store rarely accessed data in Amazon S3.

This post works through an example to help you understand the trade-offs available in cost, latency, throughput, data durability and availability, retention, and data access, so that you can choose the right deployment to maximize the value of your data and minimize the cost.

Examine your requirements

When designing your logging solution, you need a clear definition of your requirements as a prerequisite to making smart trade-offs. Carefully examine your requirements for latency, durability, availability, and cost. Additionally, consider which data you choose to send to OpenSearch Service, how long you retain data, and how you plan to access that data.

For the purposes of this discussion, we divide OpenSearch instance storage into two classes: ephemeral backed storage and Amazon S3 backed storage. The ephemeral backed storage class includes OpenSearch nodes that use Nonvolatile Memory Express SSDs (NVMe SSDs) and Amazon Elastic Block Store (Amazon EBS) volumes. The Amazon S3 backed storage class includes UltraWarm nodes, UltraWarm cold storage, or1 instances, and Amazon S3 storage you access with the service’s zero-ETL with Amazon S3. When designing your logging solution, consider the following:

  • Latency – If you need results in milliseconds, then you must use ephemeral backed storage. If seconds or minutes are acceptable, you can lower your cost by using Amazon S3 backed storage.
  • Throughput – As a general rule, ephemeral backed storage instances will provide higher throughput. Instances that have NVMe SSDs, like the im4gn, generally provide the best throughput, with EBS volumes providing good throughput. or1 instances take advantage of Amazon EBS storage for primary shards while using Amazon S3 with segment replication to reduce the compute cost of replication, thereby offering indexing throughput that can match or even exceed NVMe-based instances.
  • Data durability – Data stored in the hot tier (you deploy these as data nodes) has the lowest latency, and also the lowest durability. OpenSearch Service provides automated recovery of data in the hot tier through replicas, which provide durability with added cost. Data that OpenSearch stores in Amazon S3 (UltraWarm, UltraWarm cold storage, zero-ETL with Amazon S3, and or1 instances) gets the benefit of 11 nines of durability from Amazon S3.
  • Data availability – Best practices dictate that you use replicas for data in ephemeral backed storage. When you have at least one replica, you can continue to access all of your data, even during a node failure. However, each replica adds a multiple of cost. If you can tolerate temporary unavailability, you can reduce replicas through or1 instances, with Amazon S3 backed storage.
  • Retention – Data in all storage tiers incurs cost. The longer you retain data for analysis, the more cumulative cost you incur for each GB of that data. Identify the maximum amount of time you must retain data before it loses all value. In some cases, compliance requirements may restrict your retention window.
  • Data access – Amazon S3 backed storage instances generally have a much higher storage to compute ratio, providing cost savings but with insufficient compute for high-volume workloads. If you have high query volume or your queries span a large volume of data, ephemeral backed storage is the right choice. Direct query (Amazon S3 backed storage) is perfect for large volume queries for infrequently queried data.

As you consider your requirements along these dimensions, your answers will guide your choices for implementation. To help you make trade-offs, we work through an extended example in the following sections.

OpenSearch Service cost model

To understand how to cost an OpenSearch Service deployment, you need to understand the cost dimensions. OpenSearch Service has two different deployment options: managed clusters and serverless. This post considers managed clusters only, because Amazon OpenSearch Serverless already tiers data and manages storage for you. When you use managed clusters, you configure data nodes, UltraWarm nodes, and cluster manager nodes, selecting Amazon Elastic Compute Cloud (Amazon EC2) instance types for each of these functions. OpenSearch Service deploys and manages these nodes for you, providing OpenSearch and OpenSearch Dashboards through a REST endpoint. You can choose Amazon EBS backed instances or instances with NVMe SSD drives. OpenSearch Service charges an hourly cost for the instances in your managed cluster. If you choose Amazon EBS backed instances, the service will charge you for the storage provisioned, and any provisioned IOPS you configure. If you choose or1 nodes, UltraWarm nodes, or UltraWarm cold storage, OpenSearch Service charges for the Amazon S3 storage consumed. Finally, the service charges for data transferred out.

Example use case

We use an example use case to examine the trade-offs in cost and performance. The cost and sizing of this example are based on best practices, and are directional in nature. Although you can expect to see similar savings, all workloads are unique and your actual costs may vary substantially from what we present in this post.

For our use case, Fizzywig, a fictitious company, is a large soft drink manufacturer. They have many plants for producing their beverages, with copious logging from their manufacturing line. They started out small, with an all-hot deployment and generating 10 GB of logs daily. Today, that has grown to 3 TB of log data daily, and management is mandating a reduction in cost. Fizzywig uses their log data for event debugging and analysis, as well as historical analysis over one year of log data. Let’s compute the cost of storing and using that data in OpenSearch Service.

Ephemeral backed storage deployments

Fizzywig’s current deployment is 189 r6g.12xlarge.search data nodes (no UltraWarm tier), with ephemeral backed storage. When you index data in OpenSearch Service, OpenSearch builds and stores index data structures that are usually about 10% larger than the source data, and you need to leave 25% free storage space for operating overhead. Three TB of daily source data will use 4.125 TB of storage for the first (primary) copy, including overhead. Fizzywig follows best practices, using two replica copies for maximum data durability and availability, with the OpenSearch Service Multi-AZ with Standby option, increasing the storage need to 12.375 TB per day. To store 1 year of data, multiply by 365 days to get 4.5 PB of storage needed.
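
The storage arithmetic in this paragraph can be reproduced with a short Python sketch. The function and variable names are hypothetical, and the sketch assumes the 25% free space is modeled as a 1.25 multiplier on the indexed size, which is the interpretation that reproduces the figures quoted above.

```python
def daily_storage_tb(source_tb, index_overhead=0.10, free_space=0.25, copies=1):
    """Storage needed per day: source data, plus index overhead, plus free-space headroom, per copy."""
    return source_tb * (1 + index_overhead) * (1 + free_space) * copies

primary_per_day = daily_storage_tb(3)                 # primary copy only
three_copies_per_day = daily_storage_tb(3, copies=3)  # primary + 2 replicas
yearly_tb = three_copies_per_day * 365                # 1 year of retention

print(f"Primary per day:      {primary_per_day:.3f} TB")
print(f"Three copies per day: {three_copies_per_day:.3f} TB")
print(f"One year:             {yearly_tb / 1000:.1f} PB")
```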

To provision this much storage, they could also choose im4gn.16xlarge.search instances or or1.16xlarge.search instances. The following table gives the instance counts for each of these instance types with one, two, or three copies of the data.

| Instance type | Max storage per node (GB) | Primary (1 copy) | Primary + replica (2 copies) | Primary + 2 replicas (3 copies) |
|---|---|---|---|---|
| im4gn.16xlarge.search | 30,000 | 52 | 104 | 156 |
| or1.16xlarge.search | 36,000 | 42 | 84 | 126 |
| r6g.12xlarge.search | 24,000 | 63 | 126 | 189 |

The preceding table and the following discussion are strictly based on storage needs. or1 instances and im4gn instances both provide higher throughput than r6g instances, which will reduce cost further. The amount of compute saved varies between 10–40% depending on the workload and the instance type. These savings do not pass straight through to the bottom line; they require scaling and modification of the index and shard strategy to fully realize them. The preceding table and subsequent calculations take the general assumption that these deployments are over-provisioned on compute, and are storage-bound. You would see more savings for or1 and im4gn, compared with r6g, if you had to scale higher for compute.

The following table represents the total cluster costs for the three different instance types across the three different data storage sizes specified. These are based on on-demand US East (N. Virginia) AWS Region costs and include instance hours, Amazon S3 cost for the or1 instances, and Amazon EBS storage costs for the or1 and r6g instances.

| Instance type | Primary (1 copy) | Primary + replica (2 copies) | Primary + 2 replicas (3 copies) |
|---|---|---|---|
| im4gn.16xlarge.search | $3,977,145 | $7,954,290 | $11,931,435 |
| or1.16xlarge.search | $4,691,952 | $9,354,996 | $14,018,041 |
| r6g.12xlarge.search | $4,420,585 | $8,841,170 | $13,261,755 |

This table gives you the one-copy, two-copy, and three-copy costs (including Amazon S3 and Amazon EBS costs, where applicable) for this 4.5 PB workload. For this post, “one copy” refers to the first copy of your data, with the replication factor set to zero. “Two copies” includes a replica copy of all of the data, and “three copies” includes a primary and two replicas. As you can see, each replica adds a multiple of cost to the solution. Of course, each replica adds availability and durability to the data. With one copy (primary only), you would lose data in the case of a single node outage (with an exception for or1 instances). With one replica, you might lose some or all data in a two-node outage. With two replicas, you could lose data only in a three-node outage.

The or1 instances are an exception to this rule. or1 instances can support a one-copy deployment. These instances use Amazon S3 as a backing store, writing all index data to Amazon S3, as a means of replication, and for durability. Because all acknowledged writes are persisted in Amazon S3, you can run with a single copy, but with the risk of losing availability of your data in case of a node outage. If a data node becomes unavailable, any impacted indexes will be unavailable (red) during the recovery window (usually 10–20 minutes). Carefully evaluate whether you can tolerate this unavailability with your customers as well as your system (for example, your ingestion pipeline buffer). If so, you can drop your cost from $14 million to $4.7 million based on the one-copy (primary) column illustrated in the preceding table.

Reserved Instances

OpenSearch Service supports Reserved Instances (RIs), with 1-year and 3-year terms, with no up-front cost (NURI), partial up-front cost (PURI), or all up-front cost (AURI). All Reserved Instance commitments lower cost, with 3-year, all up-front RIs providing the deepest discount. Applying a 3-year AURI discount to Fizzywig’s workload gives the annual costs shown in the following table.

| Instance type | Primary | Primary + replica | Primary + 2 replicas |
|---|---|---|---|
| im4gn.16xlarge.search | $1,909,076 | $3,818,152 | $5,727,228 |
| or1.16xlarge.search | $3,413,371 | $6,826,742 | $10,240,113 |
| r6g.12xlarge.search | $3,268,074 | $6,536,148 | $9,804,222 |

RIs provide a straightforward way to save cost, with no code or architecture changes. Adopting RIs for this workload brings the im4gn cost for three copies down to $5.7 million, and the one-copy cost for or1 instances down to $3.4 million.
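
To see how much the 3-year AURI pricing saves for each instance type, the following Python sketch compares the three-copy on-demand and reserved costs from the two preceding tables. The dollar figures come from this post; the dictionary and function names are hypothetical.

```python
# Three-copy annual costs (USD) taken from the on-demand and 3-year AURI tables
on_demand = {
    "im4gn.16xlarge.search": 11_931_435,
    "or1.16xlarge.search": 14_018_041,
    "r6g.12xlarge.search": 13_261_755,
}
reserved = {
    "im4gn.16xlarge.search": 5_727_228,
    "or1.16xlarge.search": 10_240_113,
    "r6g.12xlarge.search": 9_804_222,
}

def discount_pct(before, after):
    """Percentage by which the cost drops when moving from before to after."""
    return (before - after) / before * 100

discounts = {name: discount_pct(on_demand[name], reserved[name]) for name in on_demand}
for name, pct in discounts.items():
    print(f"{name}: {pct:.1f}% lower with 3-year AURI")
```

Because the or1 and r6g totals include storage charges that the instance reservation does not cover, their overall percentage reduction in these tables is smaller than for im4gn.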

Amazon S3 backed storage deployments

The preceding deployments are useful as a baseline and for comparison. In actuality, you would choose one of the Amazon S3 backed storage options to keep costs manageable.

OpenSearch Service UltraWarm instances store all data in Amazon S3, using UltraWarm nodes as a hot cache on top of this full dataset. UltraWarm works best for interactive querying of data in small time-bound slices, such as running multiple queries against 1 day of data from 6 months ago. Evaluate your access patterns carefully and consider whether UltraWarm’s cache-like behavior will serve you well. UltraWarm first-query latency scales with the amount of data you need to query.

When designing an OpenSearch Service domain for UltraWarm, you need to decide on your hot retention window and your warm retention window. Most OpenSearch Service customers use a hot retention window that varies between 7–14 days, with warm retention making up the rest of the full retention period. For our Fizzywig scenario, we use 14 days hot retention and 351 days of UltraWarm retention. We also use a two-copy (primary and one replica) deployment in the hot tier.

The 14-day hot storage need (based on 4.125 TB of storage per day per copy, with two copies) is 115.5 TB. You can deploy six instances of any of the three instance types to support this indexing and storage. UltraWarm stores a single copy in Amazon S3 and doesn’t need additional storage overhead, making your 351-day storage need 1.158 PB (351 days at 3.3 TB of indexed data per day). You can support this with 58 ultrawarm1.large.search instances. The following table gives the total cost for this deployment, with 3-year AURIs for the hot tier. The or1 instances’ Amazon S3 cost is rolled into the S3 column.

| Instance type | Hot | UltraWarm | S3 | Total |
|---|---|---|---|---|
| im4gn.16xlarge.search | $220,278 | $1,361,654 | $333,590 | $1,915,523 |
| or1.16xlarge.search | $337,696 | $1,361,654 | $418,136 | $2,117,487 |
| r6g.12xlarge.search | $270,410 | $1,361,654 | $333,590 | $1,965,655 |
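
The hot and UltraWarm sizing behind this table can be sketched in Python as follows. The names are hypothetical, and the sketch assumes 20 TB of usable storage per ultrawarm1.large.search node; that figure is not stated in this post, so verify it against the current service limits before relying on it.

```python
import math

DAILY_SOURCE_TB = 3
INDEX_OVERHEAD = 0.10      # index structures are about 10% larger than the source
FREE_SPACE = 0.25          # operating headroom, needed in the hot tier only
ULTRAWARM_NODE_TB = 20     # assumed storage per ultrawarm1.large.search node

def hot_storage_tb(days, copies):
    """Hot tier storage: indexed size plus headroom, for every copy."""
    return DAILY_SOURCE_TB * (1 + INDEX_OVERHEAD) * (1 + FREE_SPACE) * days * copies

def warm_storage_tb(days):
    """UltraWarm storage: a single copy with no free-space headroom."""
    return DAILY_SOURCE_TB * (1 + INDEX_OVERHEAD) * days

def ultrawarm_nodes(days):
    """Number of UltraWarm nodes needed to address the warm data."""
    return math.ceil(warm_storage_tb(days) / ULTRAWARM_NODE_TB)

print(f"Hot (14 days, 2 copies): {hot_storage_tb(14, 2):.1f} TB")
print(f"Warm (351 days):         {warm_storage_tb(351):.1f} TB")
print(f"UltraWarm nodes:         {ultrawarm_nodes(351)}")
```

Changing the number of warm days in this sketch shows how quickly the UltraWarm node count falls when older data moves out of the warm tier.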

You can further reduce the cost by moving data to UltraWarm cold storage. Cold storage reduces cost by reducing availability of the data: to query the data, you must issue an API call to reattach the target indexes to the UltraWarm tier. A typical pattern for 1 year of data keeps 14 days hot, 76 days in UltraWarm, and 275 days in cold storage. Following this pattern, you use 6 hot nodes and 13 ultrawarm1.large.search nodes. The following table illustrates the cost to run Fizzywig’s 3 TB daily workload. The or1 cost for Amazon S3 usage is rolled into the UltraWarm nodes + S3 column.

| Instance type | Hot | UltraWarm nodes + S3 | Cold | Total |
|---|---|---|---|---|
| im4gn.16xlarge.search | $220,278 | $377,429 | $261,360 | $859,067 |
| or1.16xlarge.search | $337,696 | $461,975 | $261,360 | $1,061,031 |
| r6g.12xlarge.search | $270,410 | $377,429 | $261,360 | $909,199 |

By employing Amazon S3 backed storage options, you’re able to reduce cost even further, with a single-copy or1 deployment at $337,000, and a maximum of $1 million annually with or1 instances.

OpenSearch Service zero-ETL for Amazon S3

When you use OpenSearch Service zero-ETL for Amazon S3, you keep all your secondary and older data in Amazon S3. Secondary data is the higher-volume data that has lower value for direct inspection, such as VPC Flow Logs and WAF logs. For these deployments, you keep the majority of infrequently queried data in Amazon S3, and only the most recent data in your hot tier. In some cases, you sample your secondary data, keeping a percentage in the hot tier as well. Fizzywig decides that they want to have 7 days of all of their data in the hot tier. They will access the rest with direct query (DQ).

When you use direct query, you can store your data in JSON, Parquet, and CSV formats. Parquet format is optimal for direct query and provides about 75% compression on the data. Fizzywig is using Amazon OpenSearch Ingestion, which can write Parquet format data directly to Amazon S3. Their 3 TB of daily source data compresses to 750 GB of daily Parquet data. OpenSearch Service maintains a pool of compute units for direct query. You are billed hourly for these OpenSearch Compute Units (OCUs), scaling based on the amount of data you access. For this discussion, we assume that Fizzywig will have some debugging sessions and run 50 queries daily over one day’s worth of data (750 GB). The following table summarizes the annual cost to run Fizzywig’s 3 TB daily workload, with 7 days hot and 358 days in Amazon S3.

| Instance type | Hot | DQ cost | OR1 S3 | Raw data S3 | Total |
|---|---|---|---|---|---|
| im4gn.16xlarge.search | $220,278 | $2,195 | $0 | $65,772 | $288,245 |
| or1.16xlarge.search | $337,696 | $2,195 | $84,546 | $65,772 | $490,209 |
| r6g.12xlarge.search | $270,410 | $2,195 | $0 | $65,772 | $338,377 |

That’s quite a journey! Fizzywig’s cost for logging has come down from as high as $14 million annually to as low as $288,000 annually using direct query with zero-ETL from Amazon S3. The highest cost is about 48 times (roughly 4,800% of) the lowest, a reduction of about 98%!
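
As a quick check of the headline comparison, the following Python sketch relates the highest and lowest annual costs from the preceding tables, both as a ratio and as a percentage reduction. The dollar figures come from this post; the function names are hypothetical.

```python
HIGHEST_ANNUAL_COST = 14_018_041  # or1, three copies, on-demand
LOWEST_ANNUAL_COST = 288_245      # im4gn, 7 days hot plus direct query

def cost_ratio(before, after):
    """How many times larger the original cost is than the new cost."""
    return before / after

def reduction_pct(before, after):
    """Percentage of the original cost that is saved."""
    return (before - after) / before * 100

ratio = cost_ratio(HIGHEST_ANNUAL_COST, LOWEST_ANNUAL_COST)
reduction = reduction_pct(HIGHEST_ANNUAL_COST, LOWEST_ANNUAL_COST)
print(f"Cost ratio: {ratio:.1f}x")
print(f"Reduction:  {reduction:.1f}%")
```

The two numbers describe the same change: a ratio of roughly 48 to 1 corresponds to a reduction of roughly 98%.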

Sampling and compression

In this post, we have looked at one data footprint to let you focus on data size, and the trade-offs you can make depending on how you want to access that data. OpenSearch has additional features that can further change the economics by reducing the amount of data you store.

For logs workloads, you can employ OpenSearch Ingestion sampling to reduce the size of data you send to OpenSearch Service. Sampling is appropriate when your data as a whole has statistical characteristics where a part can be representative of the whole. For example, if you’re running an observability workload, you can often send as little as 10% of your data to get a representative sampling of the traces of request handling in your system.

You can further employ a compression algorithm for your workloads. OpenSearch Service recently released support for Zstandard (zstd) compression that can bring higher compression rates and lower decompression latencies as compared to the default, best compression.

Conclusion

With OpenSearch Service, Fizzywig was able to balance cost, latency, throughput, durability and availability, data retention, and preferred access patterns. They reduced the cost of their logging solution by a factor of about 48 (the 4,800% in the title), and management was thrilled.

Across the board, im4gn comes out with the lowest absolute dollar amounts. However, there are a couple of caveats. First, or1 instances can provide higher throughput, especially for write-intensive workloads. This may mean additional savings through reduced need for compute. Additionally, with or1’s added durability, you can maintain availability and durability with lower replication, and therefore lower cost. Another factor to consider is RAM; the r6g instances provide additional RAM, which speeds up queries for lower latency. When coupled with UltraWarm, and with different hot/warm/cold ratios, r6g instances can also be an excellent choice.

Do you have a high-volume, logging workload? Have you benefitted from some or all of these methods? Let us know!


About the Author

Jon Handler is a Senior Principal Solutions Architect at Amazon Web Services based in Palo Alto, CA. Jon works closely with OpenSearch and Amazon OpenSearch Service, providing help and guidance to a broad range of customers who have vector, search, and log analytics workloads that they want to move to the AWS Cloud. Prior to joining AWS, Jon’s career as a software developer included 4 years of coding a large-scale, ecommerce search engine. Jon holds a Bachelor’s of the Arts from the University of Pennsylvania, and a Master’s of Science and a PhD in Computer Science and Artificial Intelligence from Northwestern University.

Embed Amazon OpenSearch Service dashboards in your application

Post Syndicated from Vibhu Pareek original https://aws.amazon.com/blogs/big-data/embed-amazon-opensearch-service-dashboards-in-your-application/

Customers across diverse industries rely on Amazon OpenSearch Service for interactive log analytics, real-time application monitoring, website search, vector database, deriving meaningful insights from data, and visualizing these insights using OpenSearch Dashboards. Additionally, customers often seek out capabilities that enable effortless sharing of visual dashboards and seamless embedding of these dashboards within their applications, further enhancing user experience and streamlining workflows.

In this post, we show how to embed a live Amazon OpenSearch Service dashboard in your application, allowing your end customers to access a consolidated, real-time view without ever leaving your website.

Solution overview

We demonstrate how to deploy a sample flight data dashboard using OpenSearch Dashboards and embed it into your application through an iFrame. The following diagram provides a high-level overview of the end-to-end solution.

BDB3004-ArchitectureImage1

The workflow includes the following steps:

  1. The user requests the embedded dashboard by opening the static web server’s endpoint in a browser.
  2. The request reaches the NGINX endpoint. The NGINX endpoint routes the traffic to the self-managed OpenSearch Dashboards server. The OpenSearch Dashboards server acts as the UI layer that connects to the OpenSearch Service domain as the server.
  3. The self-managed OpenSearch Dashboards server interacts with the Amazon managed OpenSearch Service domain to fetch the required data.
  4. The requested data is sent to the OpenSearch Dashboards server.
  5. The requested data is sent from the self-managed OpenSearch Dashboards server to the web server using the NGINX proxy.
  6. The dashboard renders the visualization with the data and displays it on the website.

Prerequisites

You will launch a self-managed OpenSearch Dashboards server on an Amazon Elastic Compute Cloud (Amazon EC2) instance and link it to the managed OpenSearch Service domain to create your visualization. The self-managed OpenSearch Dashboards server acts as the UI layer that connects to the OpenSearch Service domain as the server. The post assumes the presence of a VPC with public as well as private subnets.

Create an OpenSearch Service domain

If you already have an OpenSearch Service domain set up, you can skip this step.

For instructions to create an OpenSearch Service domain, refer to Getting started with Amazon OpenSearch Service. The domain creation takes around 15–20 minutes. When the domain is in Active status, note the domain endpoint, which you will need to set up a proxy in subsequent steps.

Deploy an EC2 instance to act as the NGINX proxy to the OpenSearch Service domain and OpenSearch Dashboards

In this step, you launch an AWS CloudFormation stack that deploys the following resources:

  • A security group for the EC2 instance
  • An ingress rule for the security group attached to the OpenSearch Service domain that allows the traffic on port 443 from the proxy instance
  • An EC2 instance with the NGINX proxy and self-managed OpenSearch Dashboards set up

Complete the following steps to create the stack:

  1. Choose Launch Stack to launch the CloudFormation stack with some preconfigured values in us-east-1. You can change the AWS Region as required.
    BDB3004-CFNStack
  2. Provide the parameters for your OpenSearch Service domain.
  3. Choose Create stack.
    The process may take 3–4 minutes to complete as it sets up an EC2 instance and the required stack. Wait until the status of the stack changes to CREATE_COMPLETE.
  4. On the Outputs tab of the stack, note the value for DashboardURL.

Access OpenSearch Dashboards using the NGINX proxy and set it up for embedding

In this step, you create a new dashboard in OpenSearch Dashboards, which will be used for embedding. Because you launched the OpenSearch Service domain within the VPC, you don’t have direct access to it. To establish a connection with the domain, you use the NGINX proxy setup that you configured in the previous steps.

  • Navigate to the link for DashboardURL (as demonstrated in the previous step) in your web browser.
  • Enter the user name and password you configured while creating the OpenSearch Service domain.

You will use a sample dataset for ease of demonstration, which has some preconfigured visualizations and dashboards.

  • Import the sample dataset by choosing Add data.

  • Choose the Sample flight data dataset and choose Add data.

  • To open the newly imported dashboard and get the iFrame code, choose Embed Code on the Share menu.
  • Under Generate the link as, select Snapshot and choose Copy iFrame code.

The iFrame code will look similar to the following code:

<iframe src="https://<ec2_instance_elastic_ip>/_dashboards/app/dashboards?security_tenant=global#/view/7adfa750-4c81-11e8-b3d7-01146121b73d?embed=true&_g=(filters%3A!()%2CrefreshInterval%3A(pause%3A!f%2Cvalue%3A900000)%2Ctime%3A(from%3Anow-24h%2Cto%3Anow))" height="600" width="800"></iframe>

  1. Copy the code to your preferred text editor, remove the /_dashboards part, and change the frame height and width from height="600" width="800" to height="800" width="100%".
  2. Wrap the iFrame code with HTML code as shown in the following example and save it as an index.html file on your local system:
    <!DOCTYPE html>
    <html lang="en">
       <head>
          <title>Flight Dashboard</title>
          <style>
             body {
             font-family: Arial;
             margin: 0;
             }
             .header {
             padding: 1px;
             text-align: center;
             font-family: Arial;
             background: black;
             color: white;
             }
             .content {padding:20px;}
          </style>
       </head>
       <body>
          <div class="header">
             <h1>
             Flight Dashboard
             </h1>
          </div>
          <iframe src="https://<ec2_instance_elastic_ip>/app/dashboards#/view/7adfa750-4c81-11e8-b3d7-01146121b73d?embed=true&_g=(filters%3A!()%2CrefreshInterval%3A(pause%3A!f%2Cvalue%3A900000)%2Ctime%3A(from%3Anow-24h%2Cto%3Anow))" height="800" width="100%"></iframe>
       </body>
    </html>
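
The manual edits from step 1 can also be scripted. The following is a minimal sketch, not part of the original solution: the function name prepare_iframe and the sample IP address are hypothetical, and it assumes the copied snippet uses the default height="600" width="800" attributes shown earlier.

```python
def prepare_iframe(snippet: str, height: str = "800", width: str = "100%") -> str:
    """Apply the step 1 edits to an iFrame snippet copied from OpenSearch Dashboards."""
    # Drop the /_dashboards path prefix (first occurrence only).
    edited = snippet.replace("/_dashboards", "", 1)
    # Swap the default frame size for the one used in this post.
    edited = edited.replace('height="600"', f'height="{height}"')
    edited = edited.replace('width="800"', f'width="{width}"')
    return edited


# Hypothetical snippet; 203.0.113.10 stands in for the EC2 instance Elastic IP.
copied = (
    '<iframe src="https://203.0.113.10/_dashboards/app/dashboards#/view/abc?embed=true" '
    'height="600" width="800"></iframe>'
)
print(prepare_iframe(copied))
```

The result can then be pasted into the body of index.html as in step 2.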

Host the HTML code

The next step is to host the index.html file. The index.html file can be served from any local laptop or desktop with Firefox or Chrome browser for a quick test.

There are different options available to host the web server, such as Amazon EC2 or Amazon S3. For instructions to host the web server on Amazon S3, refer to Tutorial: Configuring a static website on Amazon S3.

The following screenshot shows our embedded dashboard.

Clean up

If you no longer need the resources you created, delete the CloudFormation stack and the OpenSearch Service domain (if you created a new one) to prevent incurring additional charges.

Summary

In this post, we showed how you can embed your dashboard created with OpenSearch Dashboards into your application to provide insights to users. If you found this post useful, check out Using OpenSearch Dashboards with Amazon OpenSearch Service and OpenSearch Dashboards quickstart guide.


About the Authors

Vibhu Pareek is a Sr. Solutions Architect at AWS. Since 2016, he has guided customers in cloud adoption using well-architected, repeatable patterns. With his specialization in databases, data analytics, and AI, he thrives on transforming complex challenges into innovative solutions. Outside work, he enjoys short treks and sports like badminton, football, and swimming.

Kamal Manchanda is a Senior Solutions Architect at AWS, specializing in building and designing data solutions with focus on lake house architectures, data governance, search platforms, log analytics solutions as well as generative AI solutions. In his spare time, Kamal loves to travel and spend time with family.

Adesh Jaiswal is a Cloud Support Engineer in the Support Engineering team at Amazon Web Services. He specializes in Amazon OpenSearch Service. He provides guidance and technical assistance to customers thus enabling them to build scalable, highly available, and secure solutions in the AWS Cloud. In his free time, he enjoys watching movies, TV series, and of course, football.

OpenSearch optimized instance (OR1) is game changing for indexing performance and cost

Post Syndicated from Cedric Pelvet original https://aws.amazon.com/blogs/big-data/opensearch-optimized-instance-or1-is-game-changing-for-indexing-performance-and-cost/

Amazon OpenSearch Service securely unlocks real-time search, monitoring, and analysis of business and operational data for use cases like application monitoring, log analytics, observability, and website search.

In this post, we examine the OR1 instance type, an OpenSearch optimized instance introduced on November 29, 2023.

OR1 is an instance type for Amazon OpenSearch Service that provides a cost-effective way to store large amounts of data. A domain with OR1 instances uses Amazon Elastic Block Store (Amazon EBS) volumes for primary storage, with data copied synchronously to Amazon Simple Storage Service (Amazon S3) as it arrives. OR1 instances provide increased indexing throughput with high durability.

To learn more about OR1, see the introductory blog post.

While actively writing to an index, we recommend that you keep one replica. However, you can switch to zero replicas after a rollover, when the index is no longer being actively written to.

This can be done safely because the data is persisted in Amazon S3 for durability.

Note that if a node fails and is replaced, your data is automatically restored from Amazon S3 but is partially unavailable during the repair operation. Don’t use zero replicas when searches on indices that are no longer actively written require high availability.

Goal

In this blog post, we’ll explore how OR1 impacts the performance of OpenSearch workloads.

By providing segment replication, OR1 instances save CPU cycles by indexing only on the primary shards. By doing that, the nodes are able to index more data with the same amount of compute, or to use fewer resources for indexing and thus have more available for search and other operations.

For this post, we’re going to consider an indexing-heavy workload and do some performance testing.

Traditionally, Amazon Elastic Compute Cloud (Amazon EC2) R6g instances are a high-performance choice for indexing-heavy workloads, relying on Amazon EBS storage. Im4gn instances provide local NVMe SSD for high-throughput, low-latency disk writes.

We will compare OR1 indexing performance relative to these two instance types. The scope of this post is limited to indexing performance.

Setup

For our performance testing, we set up multiple components, as shown in the following figure:

Architecture diagram

For the testing process:

The index mapping, which is part of our initialization step, is as follows:

{
  "index_patterns": [
    "logs-*"
  ],
  "data_stream": {
    "timestamp_field": {
      "name": "time"
    }
  },
  "template": {
    "settings": {
      "number_of_shards": <VARYING>,
      "number_of_replicas": 1,
      "refresh_interval": "20s"
    },
    "mappings": {
      "dynamic": false,
      "properties": {
        "traceId": {
          "type": "keyword"
        },
        "spanId": {
          "type": "keyword"
        },
        "severityText": {
          "type": "keyword"
        },
        "flags": {
          "type": "long"
        },
        "time": {
          "type": "date",
          "format": "date_time"
        },
        "severityNumber": {
          "type": "long"
        },
        "droppedAttributesCount": {
          "type": "long"
        },
        "serviceName": {
          "type": "keyword"
        },
        "body": {
          "type": "text"
        },
        "observedTime": {
          "type": "date",
          "format": "date_time"
        },
        "schemaUrl": {
          "type": "keyword"
        },
        "resource": {
          "type": "flat_object"
        },
        "instrumentationScope": {
          "type": "flat_object"
        }
      }
    }
  }
}

As you can see, we’re using a data stream to simplify the rollover configuration and keep the maximum primary shard size under 50 GiB, as per best practices.

We optimized the mapping to avoid any unnecessary indexing activity and use the flat_object field type to avoid field mapping explosion.

For reference, the Index State Management (ISM) policy we used is as follows:

{
  "policy": {
    "default_state": "hot",
    "states": [
      {
        "name": "hot",
        "actions": [
          {
            "rollover": {
              "min_primary_shard_size": "50gb"
            }
          }
        ],
        "transitions": []
      }
    ],
    "ism_template": [
      {
        "index_patterns": [
          "logs-*"
        ]
      }
    ]
  }
}

Our average document size is 1.6 KiB and the bulk size is 4,000 documents per bulk, which makes approximately 6.26 MiB per bulk (uncompressed).
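
As a quick sanity check of the bulk size, here is a small sketch of the arithmetic. It uses the rounded 1.6 KiB average, so it yields about 6.25 MiB; the small gap to the approximately 6.26 MiB quoted above presumably comes from rounding the average document size. The function name is ours, not part of the test harness.

```python
AVG_DOC_SIZE_KIB = 1.6   # average document size from the test setup (rounded)
DOCS_PER_BULK = 4_000    # documents per bulk request


def bulk_size_mib(docs_per_bulk: int, avg_doc_kib: float) -> float:
    """Uncompressed payload size of one bulk request, in MiB."""
    return docs_per_bulk * avg_doc_kib / 1024


print(f"{bulk_size_mib(DOCS_PER_BULK, AVG_DOC_SIZE_KIB):.2f} MiB per bulk")
```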

Testing protocol

The protocol parameters are as follows:

  • Number of data nodes: 6 or 12
  • Jobs parallelism: 75, 40
  • Primary shard count: 12, 48, 96 (for 12 nodes)
  • Number of replicas: 1 (total of 2 copies)
  • Instance types (each with 16 vCPUs):
    • or1.4xlarge.search
    • r6g.4xlarge.search
    • im4gn.4xlarge.search

Cluster | Instance type | vCPU | RAM (GiB) | JVM size (GiB)
or1-target | or1.4xlarge.search | 16 | 128 | 32
im4gn-target | im4gn.4xlarge.search | 16 | 64 | 32
r6g-target | r6g.4xlarge.search | 16 | 128 | 32

Note that the im4gn cluster has half the memory of the other two, but still each environment has the same JVM heap size of approximately 32 GiB.

Performance testing results

For the performance testing, we started with 75 parallel jobs and 750 batches of 4,000 documents per client (a total of 225 million documents). We then adjusted the number of shards, data nodes, replicas, and jobs.

Configuration 1: 6 data nodes, 12 primary shards, 1 replica

With 6 data nodes, 12 primary shards, and 1 replica, we observed the following performance:

Cluster | CPU usage | Time taken | Indexing speed (kdoc/s) | Indexing speed (MiB/s)
or1-target | 65-80% | 24 min | 156 | 243
im4gn-target | 89-97% | 34 min | 110 | 172
r6g-target | 88-95% | 34 min | 110 | 172
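
The indexing speeds reported for this configuration follow from the document count and the elapsed time. The sketch below reproduces them approximately from the figures in this post; the function names are ours, and small differences (for example in MiB/s) are expected because the published durations and the average document size are rounded.

```python
DOCS_PER_BATCH = 4_000
BATCHES_PER_JOB = 750
AVG_DOC_SIZE_KIB = 1.6  # rounded average from the setup section


def total_documents(parallel_jobs: int) -> int:
    """Documents sent in one test run."""
    return parallel_jobs * BATCHES_PER_JOB * DOCS_PER_BATCH


def indexing_speed(total_docs: int, minutes: float) -> tuple:
    """Return (thousand documents per second, MiB per second)."""
    docs_per_s = total_docs / (minutes * 60)
    mib_per_s = docs_per_s * AVG_DOC_SIZE_KIB / 1024
    return docs_per_s / 1000, mib_per_s


total = total_documents(75)  # 225,000,000 documents
for cluster, minutes in [("or1-target", 24), ("im4gn-target", 34), ("r6g-target", 34)]:
    kdocs, mib = indexing_speed(total, minutes)
    print(f"{cluster}: {kdocs:.0f} kdoc/s, {mib:.0f} MiB/s")
```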

As the table shows, the im4gn and r6g clusters have very high CPU usage, which triggers admission control and causes documents to be rejected.

The OR1 cluster shows sustained CPU usage below 80 percent, which is a very good target.

Things to keep in mind:

  • In production, don’t forget to retry indexing with exponential backoff to avoid dropping unindexed documents because of intermittent rejections.
  • The bulk indexing operation returns 200 OK but can have partial failures. The body of the response must be checked to validate that all the documents were indexed successfully.
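
The two points above can be combined in client code. The following is a minimal sketch under stated assumptions: send_bulk is a hypothetical callable that wraps your HTTP client, sends one action per document, and returns the parsed _bulk response body; treating only 429 and 503 as retryable is our choice, not a rule from this post.

```python
import random
import time

RETRYABLE_STATUSES = {429, 503}  # assumption: throttling and temporary unavailability


def split_failures(response: dict) -> tuple:
    """Return (retryable, permanent) positions of failed items in a _bulk response."""
    retryable, permanent = [], []
    if not response.get("errors"):
        return retryable, permanent
    for pos, item in enumerate(response.get("items", [])):
        # Each item is keyed by its action type: index, create, update, or delete.
        result = next(iter(item.values()))
        status = result.get("status", 0)
        if status < 300:
            continue
        (retryable if status in RETRYABLE_STATUSES else permanent).append(pos)
    return retryable, permanent


def bulk_with_retry(send_bulk, docs, max_attempts=5, base_delay=0.5, sleep=time.sleep):
    """Send docs, retrying only rejected ones with exponential backoff and jitter.

    Returns the documents that could not be indexed.
    """
    pending = list(docs)
    dropped = []
    for attempt in range(max_attempts):
        response = send_bulk(pending)
        retryable, permanent = split_failures(response)
        dropped.extend(pending[i] for i in permanent)
        pending = [pending[i] for i in retryable]
        if not pending:
            return dropped
        if attempt < max_attempts - 1:
            # Delay doubles on every attempt; jitter spreads out competing clients.
            sleep(base_delay * 2 ** attempt + random.uniform(0, base_delay))
    return dropped + pending
```

Anything returned by bulk_with_retry should be logged or sent to a dead-letter destination rather than silently discarded.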

By reducing the number of parallel jobs from 75 to 40, while maintaining 750 batches of 4,000 documents per client (total 120M documents), we get the following:

Cluster | CPU usage | Time taken | Indexing speed (kdoc/s) | Indexing speed (MiB/s)
or1-target | 25-60% | 20 min | 100 | 156
im4gn-target | 75-93% | 19 min | 105 | 164
r6g-target | 77-90% | 20 min | 100 | 156

The throughput and CPU usage decreased, but the CPU remains high on Im4gn and R6g, while the OR1 is showing more CPU capacity to spare.

Configuration 2: 6 data nodes, 48 primary shards, 1 replica

For this configuration, we increased the number of primary shards from 12 to 48, which provides more parallelism for indexing:

Cluster | CPU usage | Time taken | Indexing speed (kdoc/s) | Indexing speed (MiB/s)
or1-target | 60-80% | 21 min | 178 | 278
im4gn-target | 67-95% | 34 min | 110 | 172
r6g-target | 70-88% | 37 min | 101 | 158

The indexing throughput increased for the OR1, but the Im4gn and R6g didn’t see an improvement because their CPU utilization is still very high.

Reducing the parallel jobs to 40 and keeping 48 primary shards puts slightly more pressure on the OR1, whose minimum CPU usage is higher than with 12 primary shards, while the CPU for R6g looks much better. For the Im4gn, however, the CPU is still high.

Cluster | CPU usage | Time taken | Indexing speed (kdoc/s) | Indexing speed (MiB/s)
or1-target | 40-60% | 16 min | 125 | 195
im4gn-target | 80-94% | 18 min | 111 | 173
r6g-target | 70-80% | 21 min | 95 | 148

Configuration 3: 12 data nodes, 96 primary shards, 1 replica

For this configuration, we started with the original configuration and added more compute capacity, moving from 6 nodes to 12 and increasing the number of primary shards to 96.

Cluster | CPU usage | Time taken | Indexing speed (kdoc/s) | Indexing speed (MiB/s)
or1-target | 40-60% | 18 min | 208 | 325
im4gn-target | 74-90% | 20 min | 187 | 293
r6g-target | 60-78% | 24 min | 156 | 244

The OR1 and the R6g are performing well with CPU usage below 80 percent, with OR1 giving 33 percent better performance with 30 percent less CPU usage compared to R6g.

The Im4gn is still at 90 percent CPU, but the performance is also very good.

Reducing the number of parallel jobs from 75 to 40, we get:

Cluster | CPU usage | Time taken | Indexing speed (kdoc/s) | Indexing speed (MiB/s)
or1-target | 40-60% | 11 min | 182 | 284
im4gn-target | 70-90% | 11 min | 182 | 284
r6g-target | 60-77% | 12 min | 167 | 260

Reducing the number of parallel jobs to 40 from 75 brought the OR1 and Im4gn instances on par and the R6g very close.

Interpretation

The OR1 instances speed up indexing because only the primary shards need to be written while the replica is produced by copying segments. While being more performant than Im4gn and R6g instances, OR1 instances also show lower CPU usage, which leaves room for additional load (search) or a reduction in cluster size.

We can compare a 6-node OR1 cluster with 48 primary shards, indexing at 178 thousand documents per second, to a 12-node Im4gn cluster with 96 primary shards, indexing at 187 thousand documents per second or to a 12-node R6g cluster with 96 primary shards, indexing at 156 thousand documents per second.

The OR1 performs almost as well as the larger Im4gn cluster, and better than the larger R6g cluster.
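
One way to make this comparison concrete is to normalize throughput by node count. The sketch below uses only the numbers quoted above (75 parallel jobs); the per-node metric itself is our illustration, not a measurement from the tests.

```python
# (indexing speed in kdoc/s, number of data nodes), taken from the results above
results = {
    "or1, 6 nodes, 48 primary shards": (178, 6),
    "im4gn, 12 nodes, 96 primary shards": (187, 12),
    "r6g, 12 nodes, 96 primary shards": (156, 12),
}


def per_node(kdocs_per_s: float, nodes: int) -> float:
    """Indexing throughput per data node, in thousand documents per second."""
    return kdocs_per_s / nodes


for name, (speed, nodes) in results.items():
    print(f"{name}: {per_node(speed, nodes):.1f} kdoc/s per node")
```

On these figures, each OR1 node indexes roughly twice as many documents per second as a node in either of the larger clusters.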

How to size when using OR1 instances

As you can see in the results, OR1 instances can process more data at higher throughput rates. However, when increasing the number of primary shards, they don’t perform as well because of the remote backed storage.

To get the best throughput from the OR1 instance type, you can use larger batch sizes than usual, and use an Index State Management (ISM) policy to roll over your index based on size so that you can effectively limit the number of primary shards per index. You can also increase the number of connections because the OR1 instance type can handle more parallelism.

For search, OR1 doesn’t directly impact the search performance. However, as you can see, the CPU usage is lower on OR1 instances than on Im4gn and R6g instances. That enables either more activity (search and ingest), or the possibility to reduce the instance size or count, which would result in a cost reduction.

Conclusion and recommendations for OR1

The new OR1 instance type gives you more indexing power than the other instance types. This is important for indexing-heavy workloads, where you index in batch every day or have a high sustained throughput.

The OR1 instance type also enables cost reduction because its price-performance is 30 percent better than that of existing instance types. When you add more than one replica, the price-performance advantage grows, because the CPU is barely impacted on an OR1 instance, whereas other instance types see their indexing throughput decrease.

Check out the complete instructions for optimizing your workload for indexing in this re:Post article.


About the author

Cédric Pelvet is a Principal AWS Specialist Solutions Architect. He helps customers design scalable solutions for real-time data and search workloads. In his free time, his activities are learning new languages and practicing the violin.

Amazon OpenSearch Serverless cost-effective search capabilities, at any scale

Post Syndicated from Satish Nandi original https://aws.amazon.com/blogs/big-data/amazon-opensearch-serverless-cost-effective-search-capabilities-at-any-scale/

We’re excited to announce the new lower entry cost for Amazon OpenSearch Serverless. With support for half (0.5) OpenSearch Compute Units (OCUs) for indexing and search workloads, the entry cost is cut in half. Amazon OpenSearch Serverless is a serverless deployment option for Amazon OpenSearch Service that you can use to run search and analytics workloads without the complexities of infrastructure management, shard tuning or data lifecycle management. OpenSearch Serverless automatically provisions and scales resources to provide consistently fast data ingestion rates and millisecond query response times during changing usage patterns and application demand. 

OpenSearch Serverless offers three types of collections to help meet your needs: time-series, search, and vector. The new lower cost of entry benefits all collection types. Vector collections have come to the fore as a predominant workload when using OpenSearch Serverless as an Amazon Bedrock knowledge base. With the introduction of half OCUs, the cost for small vector workloads is halved. Time-series and search collections also benefit, especially for small workloads like proof-of-concept deployments and development and test environments.

A full OCU includes one vCPU, 6 GB of RAM, and 120 GB of storage. A half OCU offers half a vCPU, 3 GB of RAM, and 60 GB of storage. OpenSearch Serverless scales up a half OCU first to one full OCU and then in one-OCU increments. Each OCU also uses Amazon Simple Storage Service (Amazon S3) as a backing store; you pay for data stored in Amazon S3 regardless of the OCU size. The number of OCUs needed for the deployment depends on the collection type, along with ingestion and search patterns. We will go over the details later in the post and contrast how the new half OCU base brings benefits.

OpenSearch Serverless separates indexing and search computes, deploying sets of OCUs for each compute need. You can deploy OpenSearch Serverless in two forms: 1) Deployment with redundancy for production, and 2) Deployment without redundancy for development or testing.

Note: OpenSearch Serverless deploys two times the compute for both indexing and searching in redundant deployments.

OpenSearch Serverless Deployment Type

The following figure shows the architecture for OpenSearch Serverless in redundancy mode.

In redundancy mode, OpenSearch Serverless deploys two base OCUs for each compute set (indexing and search) across two Availability Zones. For small workloads under 60 GB, OpenSearch Serverless uses half OCUs as the base size. The minimum deployment is four base units, two each for indexing and search. The minimum cost is approximately $350 per month (four half OCUs). All prices are quoted based on the US-East region and 30 days a month. During normal operation, all OCUs are in operation to serve traffic. OpenSearch Serverless scales up from this baseline as needed.

For non-redundant deployments, OpenSearch Serverless deploys one base OCU for each compute set, costing $174 per month (two half OCUs).

Redundant configurations are recommended for production deployments to maintain availability; if one Availability Zone goes down, the other can continue serving traffic. Non-redundant deployments are suitable for development and testing to reduce costs. In both configurations, you can set a maximum OCU limit to manage costs. The system will scale up to this limit during peak loads if necessary, but will not exceed it.

OpenSearch Serverless collections and resource allocations

OpenSearch Serverless uses compute units differently depending on the type of collection and keeps your data in Amazon S3. When you ingest data, OpenSearch Serverless writes it to the OCU disk and Amazon S3 before acknowledging the request, to ensure the data’s durability and the system’s performance. Depending on collection type, it additionally keeps data in the local storage of the OCUs, scaling to accommodate the storage and compute needs.

The time-series collection type is designed to be cost-efficient by limiting the amount of data kept in local storage, and keeping the remainder in Amazon S3. The number of OCUs needed depends on the amount of data and the collection’s retention period. The number of OCUs OpenSearch Serverless uses for your workload is the larger of the default minimum OCUs, or the minimum number of OCUs needed to hold the most recent portion of your data, as defined by your OpenSearch Serverless data lifecycle policy. For example, if you ingest 1 TiB per day and have a 30-day retention period, the size of the most recent data will be 1 TiB. You will need 20 OCUs [10 OCUs x 2] for indexing and another 20 OCUs [10 OCUs x 2] for search (based on the 120 GiB of storage per OCU). Access to older data in Amazon S3 raises the latency of the query responses. This tradeoff in query latency for older data is made to save on OCU cost.

The vector collection type uses RAM to store vector graphs, as well as disk to store indices. Vector collections keep index data in OCU local storage. When sizing for vector workloads, take both needs into account. OCU RAM limits are reached faster than OCU disk limits, causing vector collections to be bound by RAM space.

OpenSearch Serverless allocates OCU resources for vector collections as follows. Considering full OCUs, it uses 2 GB for the operating system, 2 GB for the Java heap, and the remaining 2 GB for vector graphs. It uses 120 GB of local storage for OpenSearch indices. The RAM required for a vector graph depends on the vector dimensions, number of vectors stored, and the algorithm chosen. See Choose the k-NN algorithm for your billion-scale use case with OpenSearch for a review and formulas to help you pre-calculate vector RAM needs for your OpenSearch Serverless deployment.
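
To get a rough feel for how far 2 GB of graph memory goes, the sketch below applies the commonly cited HNSW memory estimate from the OpenSearch k-NN documentation, 1.1 * (4 * dimension + 8 * M) bytes per vector. This is an assumption on our part: OpenSearch Serverless may account for memory differently, the 768-dimension example and M = 16 are illustrative, and we treat 2 GB as 2 GiB.

```python
GIB = 1024 ** 3
GRAPH_RAM_PER_FULL_OCU = 2 * GIB  # RAM left for vector graphs on a full OCU (from the text)


def hnsw_bytes_per_vector(dimension: int, m: int = 16) -> float:
    """Estimated HNSW memory per vector (assumption: OpenSearch k-NN docs formula)."""
    return 1.1 * (4 * dimension + 8 * m)


def vectors_per_ocu(dimension: int, m: int = 16) -> int:
    """Approximate number of vectors whose graph fits in one full OCU's graph RAM."""
    return int(GRAPH_RAM_PER_FULL_OCU // hnsw_bytes_per_vector(dimension, m))


print(f"~{vectors_per_ocu(768):,} vectors of 768 dimensions per full OCU")
```

Treat the result as an order-of-magnitude estimate for planning, and validate it against a test collection.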

Note: Many of the behaviors of the system are explained as of June 2024. Check back in coming months as new innovations continue to drive down cost.

Supported AWS Regions

The support for the new OCU minimums for OpenSearch Serverless is now available in all regions that support OpenSearch Serverless. See AWS Regional Services List for more information about OpenSearch Service availability. See the documentation to learn more about OpenSearch Serverless.

Conclusion

The introduction of half OCUs gives you a significant reduction in the base costs of Amazon OpenSearch Serverless. If you have a smaller data set and limited usage, you can now take advantage of this lower cost. The cost-effective nature of this solution and the simplified management of search and analytics workloads help keep operations running smoothly even as traffic demands vary.


About the authors 

Satish Nandi is a Senior Product Manager with Amazon OpenSearch Service. He is focused on OpenSearch Serverless and Geospatial and has years of experience in networking, security and ML and AI. He holds a BEng in Computer Science and an MBA in Entrepreneurship. In his free time, he likes to fly airplanes, hang glide, and ride his motorcycle.

Jon Handler is a Senior Principal Solutions Architect at Amazon Web Services based in Palo Alto, CA. Jon works closely with OpenSearch and Amazon OpenSearch Service, providing help and guidance to a broad range of customers who have search and log analytics workloads that they want to move to the AWS Cloud. Prior to joining AWS, Jon’s career as a software developer included four years of coding a large-scale, eCommerce search engine. Jon holds a Bachelor of the Arts from the University of Pennsylvania, and a Master of Science and a Ph. D. in Computer Science and Artificial Intelligence from Northwestern University.

Deliver Amazon CloudWatch logs to Amazon OpenSearch Serverless

Post Syndicated from Balaji Mohan original https://aws.amazon.com/blogs/big-data/deliver-amazon-cloudwatch-logs-to-amazon-opensearch-serverless/

Amazon CloudWatch Logs lets you collect, aggregate, and analyze logs from different systems in one place. CloudWatch provides subscriptions as a real-time feed of these logs to other services like Amazon Kinesis Data Streams, AWS Lambda, and Amazon OpenSearch Service. These subscriptions are a popular mechanism to enable custom processing and advanced analysis of log data to gain additional valuable insights. At the time of publishing this blog post, these subscription filters support delivering logs to Amazon OpenSearch Service provisioned clusters only. Customers are increasingly adopting Amazon OpenSearch Serverless as a cost-effective option for infrequent, intermittent, and unpredictable workloads.

In this blog post, we will show how to use Amazon OpenSearch Ingestion to deliver CloudWatch logs to OpenSearch Serverless in near real-time. We outline a mechanism to connect a Lambda subscription filter with OpenSearch Ingestion and deliver logs to OpenSearch Serverless without explicitly needing a separate subscription filter for it.

Solution overview

The following diagram illustrates the solution architecture.

  1. CloudWatch Logs: Collects and stores logs from various AWS resources and applications. It serves as the source of log data in this solution.
  2. Subscription filter: A CloudWatch Logs subscription filter filters and routes specific log data from CloudWatch Logs to the next component in the pipeline.
  3. CloudWatch exporter Lambda function: This is a Lambda function that receives the filtered log data from the subscription filter. Its purpose is to transform and prepare the log data for ingestion into the OpenSearch Ingestion pipeline.
  4. OpenSearch Ingestion: This is a component of OpenSearch Service. The Ingestion pipeline is responsible for processing and enriching the log data received from the CloudWatch exporter Lambda function before storing it in the OpenSearch Serverless collection.
  5. OpenSearch Service: This is a fully managed service that stores and indexes log data, making it searchable and available for analysis and visualization. OpenSearch Service offers two configurations: provisioned domains and serverless. In this setup, we use serverless, which is an auto-scaling configuration for OpenSearch Service.
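
To illustrate component 3, here is a minimal sketch of how an exporter function might unpack what a subscription filter delivers. CloudWatch Logs sends Lambda a base64-encoded, gzip-compressed JSON payload under awslogs.data; the output document shape and the function names are our assumptions, and the signed HTTP request to the ingestion pipeline is deliberately left out.

```python
import base64
import gzip
import json


def decode_cloudwatch_event(event: dict) -> dict:
    """Decode the base64 + gzip payload that CloudWatch Logs delivers to Lambda."""
    compressed = base64.b64decode(event["awslogs"]["data"])
    return json.loads(gzip.decompress(compressed))


def to_documents(payload: dict) -> list:
    """Flatten log events into one document per event (hypothetical document shape)."""
    if payload.get("messageType") != "DATA_MESSAGE":
        return []  # control messages carry no log events
    return [
        {
            "id": log_event["id"],
            "timestamp": log_event["timestamp"],
            "message": log_event["message"],
            "log_group": payload["logGroup"],
            "log_stream": payload["logStream"],
        }
        for log_event in payload["logEvents"]
    ]


def handler(event, context):
    docs = to_documents(decode_cloudwatch_event(event))
    # Assumption: a real function would now POST `docs` as JSON to the pipeline's
    # ingestion URL using a SigV4-signed request; that step is omitted here.
    return {"documents": len(docs)}
```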

Prerequisites

Deploy the solution

With the prerequisites in place, you can create and deploy the pieces of the solution.

Step 1: Create PipelineRole for ingestion

  • Open the AWS Management Console for AWS Identity and Access Management (IAM).
  • Choose Policies, and then choose Create policy.
  • Select JSON and paste the following policy into the editor:
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Action": [
                "aoss:BatchGetCollection",
                "aoss:APIAccessAll"
            ],
            "Effect": "Allow",
            "Resource": "arn:aws:aoss:us-east-1:{accountId}:collection/{collectionId}"
        },
        {
            "Action": [
                "aoss:CreateSecurityPolicy",
                "aoss:GetSecurityPolicy",
                "aoss:UpdateSecurityPolicy"
            ],
            "Effect": "Allow",
            "Resource": "*",
            "Condition": {
                "StringEquals": {
                    "aoss:collection": "{collection}"
                }
            }
        }
    ]
}

Replace {accountId}, {collectionId}, and {collection} with your own values.

  • Choose Next, choose Next, and name your policy collection-pipeline-policy.
  • Choose Create policy.
  • Next, create a role and attach the policy to it. Choose Roles, and then choose Create role.
  • Select Custom trust policy and paste the following policy into the editor:
{
   "Version":"2012-10-17",
   "Statement":[
      {
         "Effect":"Allow",
         "Principal":{
            "Service":"osis-pipelines.amazonaws.com"
         },
         "Action":"sts:AssumeRole"
      }
   ]
}
  • Choose Next, and then search for and select the collection-pipeline-policy you just created.
  • Choose Next and name the role PipelineRole.
  • Choose Create role.

Step 2: Configure the network and data policy for OpenSearch collection

  • In the OpenSearch Service console, navigate to the Serverless menu.
  • Create a VPC endpoint by following the instructions in Create an interface endpoint for OpenSearch Serverless.
  • Go to Security and choose Network policies.
  • Choose Create network policy.
  • Configure the following policy:
[
  {
    "Rules": [
      {
        "Resource": [
          "collection/{collection name}"
        ],
        "ResourceType": "collection"
      }
    ],
    "AllowFromPublic": false,
    "SourceVPCEs": [
      "{VPC Endpoint Id}"
    ]
  },
  {
    "Rules": [
      {
        "Resource": [
          "collection/{collection name}"
        ],
        "ResourceType": "dashboard"
      }
    ],
    "AllowFromPublic": true
  }
]
  • Go to Security and choose Data access policies.
  • Choose Create access policy.
  • Configure the following policy:
[
  {
    "Rules": [
      {
        "Resource": [
          "index/{collection name}/*"
        ],
        "Permission": [
          "aoss:CreateIndex",
          "aoss:UpdateIndex",
          "aoss:DescribeIndex",
          "aoss:ReadDocument",
          "aoss:WriteDocument"
        ],
        "ResourceType": "index"
      }
    ],
    "Principal": [
      "arn:aws:iam::{accountId}:role/PipelineRole",
      "arn:aws:iam::{accountId}:role/Admin"
    ],
    "Description": "Rule 1"
  }
]
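
If you prefer to generate the two policies from Step 2 rather than paste them by hand, the following is a minimal sketch that builds the same JSON documents in Python. The function names and the sample values (my-collection, vpce-0123456789abcdef0, 123456789012) are hypothetical and used only for illustration; the policy structure mirrors the documents shown above.

```python
import json


def build_network_policy(collection_name, vpc_endpoint_id):
    """Private access to the collection endpoint, public access to Dashboards."""
    resource = [f"collection/{collection_name}"]
    return [
        {
            "Rules": [{"Resource": resource, "ResourceType": "collection"}],
            "AllowFromPublic": False,
            "SourceVPCEs": [vpc_endpoint_id],
        },
        {
            "Rules": [{"Resource": resource, "ResourceType": "dashboard"}],
            "AllowFromPublic": True,
        },
    ]


def build_data_access_policy(collection_name, account_id, role_names):
    """Grant index-level permissions to the listed IAM roles."""
    return [
        {
            "Rules": [
                {
                    "Resource": [f"index/{collection_name}/*"],
                    "Permission": [
                        "aoss:CreateIndex",
                        "aoss:UpdateIndex",
                        "aoss:DescribeIndex",
                        "aoss:ReadDocument",
                        "aoss:WriteDocument",
                    ],
                    "ResourceType": "index",
                }
            ],
            "Principal": [
                f"arn:aws:iam::{account_id}:role/{name}" for name in role_names
            ],
            "Description": "Rule 1",
        }
    ]


# Hypothetical sample values, for illustration only.
network_policy = build_network_policy("my-collection", "vpce-0123456789abcdef0")
data_policy = build_data_access_policy(
    "my-collection", "123456789012", ["PipelineRole", "Admin"]
)
print(json.dumps(network_policy, indent=2))
print(json.dumps(data_policy, indent=2))
```

The printed JSON can be pasted into the console editors described in this step.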

Step 3: Create an OpenSearch Ingestion pipeline

  • Navigate to the OpenSearch Service.
  • Go to the Ingestion pipelines section.
  • Choose Create pipeline.
  • Define the pipeline configuration.
version: "2"
cwlogs-ingestion-pipeline:
  source:
    http:
      path: /logs/ingest
  sink:
    - opensearch:
        # Provide an AWS OpenSearch Service domain endpoint
        hosts: ["https://{collectionId}.{region}.aoss.amazonaws.com"]
        index: "cwl-%{yyyy-MM-dd}"
        aws:
          # Provide a Role ARN with access to the domain. This role should have a trust relationship with osis-pipelines.amazonaws.com
          sts_role_arn: "arn:aws:iam::{accountId}:role/PipelineRole"
          # Provide the region of the domain.
          region: "{region}"
          serverless: true
          serverless_options:
            network_policy_name: "{Network policy name}"
# To get the values for the placeholders:
# 1. {collectionId}: Find the collection ID by navigating to the Amazon OpenSearch Serverless collection in the AWS Management Console and choosing the collection. The collection ID is listed under the "Overview" section.
# 2. {region}: This is the AWS Region where your OpenSearch Serverless collection is located. You can find this information in the AWS Management Console when you navigate to the collection.
# 3. {accountId}: This is your AWS account ID. You can find your account ID by choosing your username in the top-right corner of the AWS Management Console and selecting "My Account" from the dropdown menu.
# 4. {Network policy name}: This is the name of the network policy you have configured for your Amazon OpenSearch Serverless collection. If you haven't configured a network policy, you can leave this placeholder as is or remove it from the configuration.
# After obtaining the necessary values, replace the placeholders in the configuration with the actual values.

Step 4: Create a Lambda function

  • Create a Lambda layer for the requests and sigv4 packages. Run the following commands in AWS CloudShell.
mkdir lambda_layers
cd lambda_layers
mkdir python
cd python
pip install requests -t ./
pip install requests_auth_aws_sigv4 -t ./
cd ..
zip -r python_modules.zip .

aws lambda publish-layer-version --layer-name Data-requests --description "My Python layer" --zip-file fileb://python_modules.zip --compatible-runtimes python3.x
  • Create the Lambda function with the following code:
import base64
import gzip
import json
import logging
from datetime import datetime

import jmespath
import requests
from requests_auth_aws_sigv4 import AWSSigV4

LOGGER = logging.getLogger(__name__)
LOGGER.setLevel(logging.INFO)


def lambda_handler(event, context):
    # Extract the data from the event
    data = jmespath.search("awslogs.data", event)
    # Decompress the logs
    cwLogs = decompress_json_data(data)
    # Construct the payload to send to OpenSearch Ingestion
    payload = prepare_payload(cwLogs)
    print(payload)
    # Ingest the set of events to the pipeline
    response = ingestData(payload)
    return {
        'statusCode': 200
    }


def decompress_json_data(data):
    # CloudWatch Logs delivers the payload base64-encoded and gzip-compressed
    compressed_data = base64.b64decode(data)
    uncompressed_data = gzip.decompress(compressed_data)
    return json.loads(uncompressed_data)


def prepare_payload(cwLogs):
    # Build one document per log event, copying the log group metadata onto each
    payload = []
    logEvents = cwLogs['logEvents']
    for logEvent in logEvents:
        request = {}
        request['id'] = logEvent['id']
        # The event timestamp is in milliseconds since the epoch
        dt = datetime.fromtimestamp(logEvent['timestamp'] / 1000)
        request['timestamp'] = dt.isoformat()
        request['message'] = logEvent['message']
        request['owner'] = cwLogs['owner']
        request['log_group'] = cwLogs['logGroup']
        request['log_stream'] = cwLogs['logStream']
        payload.append(request)
    return payload


def ingestData(payload):
    # POST the documents to the pipeline's HTTP source, signed with SigV4 for the osis service
    ingestionEndpoint = '{OpenSearch Pipeline Endpoint}'
    endpoint = 'https://' + ingestionEndpoint
    headers = {'Content-Type': 'application/json', 'Accept': 'application/json'}
    r = requests.request('POST', f'{endpoint}/logs/ingest', json=payload, auth=AWSSigV4('osis'), headers=headers)
    LOGGER.info('Response received: ' + r.text)
    return r
  • Replace {OpenSearch Pipeline Endpoint} with the endpoint of your OpenSearch Ingestion pipeline.
  • Attach the following inline policy to the function's execution role.
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "PermitsWriteAccessToPipeline",
            "Effect": "Allow",
            "Action": "osis:Ingest",
            "Resource": "arn:aws:osis:{region}:{accountId}:pipeline/{OpenSearch Pipeline Name}"
        }
    ]
}
  • Deploy the function.
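
Before wiring up the subscription, it can help to check the decoding logic locally. The sketch below builds a synthetic event in the shape CloudWatch Logs sends to Lambda (a base64-encoded, gzip-compressed JSON document under awslogs.data) and decodes it again. The helper names and sample values (/demo/group, demo-stream, demo-filter, the account ID) are hypothetical; the decode step mirrors what decompress_json_data does in the function above.

```python
import base64
import gzip
import json


def make_subscription_event(log_events, log_group="/demo/group",
                            log_stream="demo-stream", owner="123456789012"):
    """Build a synthetic CloudWatch Logs subscription event for local testing."""
    body = {
        "messageType": "DATA_MESSAGE",
        "owner": owner,
        "logGroup": log_group,
        "logStream": log_stream,
        "subscriptionFilters": ["demo-filter"],
        "logEvents": log_events,
    }
    compressed = gzip.compress(json.dumps(body).encode("utf-8"))
    return {"awslogs": {"data": base64.b64encode(compressed).decode("ascii")}}


def decode_subscription_event(event):
    """Reverse the encoding: base64-decode, gunzip, then parse the JSON."""
    compressed = base64.b64decode(event["awslogs"]["data"])
    return json.loads(gzip.decompress(compressed))


sample_event = make_subscription_event(
    [{"id": "1", "timestamp": 1721606400000, "message": "Simple Lambda Test"}]
)
decoded = decode_subscription_event(sample_event)
print(decoded["logGroup"], len(decoded["logEvents"]))
```

You can also paste the output of make_subscription_event into the Lambda console as a test event to exercise the handler without waiting for real log traffic.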

Step 5: Set up a CloudWatch Logs subscription

  • Grant permission to a specific AWS service or AWS account to invoke the specified Lambda function. The following command grants permission to the CloudWatch Logs service to invoke the cloud-logs Lambda function for the specified log group. This is necessary because CloudWatch Logs cannot directly invoke a Lambda function without being granted permission. Run the following command in CloudShell to add permission.
aws lambda add-permission \
  --function-name "{function name}" \
  --statement-id "{function name}" \
  --principal "logs.amazonaws.com" \
  --action "lambda:InvokeFunction" \
  --source-arn "arn:aws:logs:{region}:{accountId}:log-group:{log_group}:*" \
  --source-account "{accountId}"
  • Create a subscription filter for a log group. The following command creates a subscription filter on the log group, which forwards all log events (because the filter pattern is an empty string) to the Lambda function. Run the following command in CloudShell to create the subscription filter.
aws logs put-subscription-filter \
  --log-group-name "{log_group}" \
  --filter-name "{filter name}" \
  --filter-pattern "" \
  --destination-arn "arn:aws:lambda:{region}:{accountId}:function:{function name}"

Step 6: Testing and verification

  • Generate some logs in your CloudWatch log group. Run the following command in CloudShell to create sample logs in the log group.
aws logs put-log-events --log-group-name {log_group} --log-stream-name {stream_name} --log-events "[{\"timestamp\":{timestamp in millis} , \"message\": \"Simple Lambda Test\"}]"
  • Check the OpenSearch collection to ensure logs are indexed correctly.
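
The {timestamp in millis} placeholder in the test command expects milliseconds since the Unix epoch. As a small sketch, the following helper (a hypothetical name, not part of the AWS CLI) produces a JSON string in the format shown for the --log-events argument, using the current time by default.

```python
import json
import time


def build_log_events(messages, now_ms=None):
    """Return a JSON array of log events stamped in epoch milliseconds."""
    if now_ms is None:
        now_ms = int(time.time() * 1000)
    return json.dumps([{"timestamp": now_ms, "message": m} for m in messages])


print(build_log_events(["Simple Lambda Test"]))
```

Pass the printed string, in single quotes, as the value of --log-events.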

Clean up

Remove the infrastructure for this solution when not in use to avoid incurring unnecessary costs.

Conclusion

You saw how to set up a pipeline to send CloudWatch logs to an OpenSearch Serverless collection within a VPC. This integration uses CloudWatch for log aggregation, Lambda for log processing, and OpenSearch Serverless for querying and visualization. You can use this solution to take advantage of the pay-as-you-go pricing model for OpenSearch Serverless to optimize operational costs for log analysis.

To further explore, you can:


About the Authors

Balaji Mohan is a senior modernization architect specializing in application and data modernization to the cloud. His business-first approach ensures seamless transitions, aligning technology with organizational goals. Using cloud-native architectures, he delivers scalable, agile, and cost-effective solutions, driving innovation and growth.

Souvik Bose is a Software Development Engineer working on Amazon OpenSearch Service.

Muthu Pitchaimani is a Search Specialist with Amazon OpenSearch Service. He builds large-scale search applications and solutions. Muthu is interested in the topics of networking and security, and is based out of Austin, Texas.

Transition from Amazon CloudSearch to Amazon OpenSearch Service

Post Syndicated from Arvind Mahesh original https://aws.amazon.com/blogs/big-data/transition-from-amazon-cloudsearch-to-amazon-opensearch-service/

At AWS, we are constantly innovating and evolving our services to meet the ever-changing needs of our customers. In this post, we want to help you understand the differences between Amazon CloudSearch and Amazon OpenSearch Service, and how you can transition to OpenSearch Service.

Comparing Amazon CloudSearch and Amazon OpenSearch Service

CloudSearch is a fully managed service in the cloud that makes it straightforward to set up, manage, and scale a search solution for your website or application. With CloudSearch, you can search large collections of data such as webpages, document files, forum posts, or product information. You can quickly add search capabilities without having to become a search expert or worry about hardware provisioning, setup, and maintenance. As your volume of data and traffic fluctuates, CloudSearch scales to meet your needs. CloudSearch is internally powered by a customized version of Apache Solr, and supports features such as full-text search, Boolean search, prefix search, term boosting, faceting, hit highlighting, and auto-complete suggestions.

OpenSearch Service is a managed service that makes it seamless to deploy, operate, and scale OpenSearch, a popular open source search and analytics engine. OpenSearch provides best-in-class search capabilities, providing you with all the search features of CloudSearch plus a vector engine supporting semantic search on vector embeddings, and support for both dense and sparse vectors. In addition, with OpenSearch Service, you get advanced security with fine-grained access control, the ability to store and analyze log data for observability and security, along with dashboarding and alerting. You’ll have all of CloudSearch’s capabilities and more.

With OpenSearch Serverless, you get improved, out-of-the-box, hands-free operation. Like CloudSearch, OpenSearch Serverless lets you deploy and use OpenSearch through a REST endpoint. You send your documents to OpenSearch Serverless, which indexes them for search using the OpenSearch REST API. If you want deeper control over your infrastructure for cost and latency optimization, you can choose OpenSearch Service’s managed clusters deployment option. With managed clusters, you get granular control over the instances you would like to use, indexing and data-sharding strategy, and more. OpenSearch Service brings with it the flexibility and extensibility of open source, provides powerful querying and analytics capabilities, and enables cost-effective scalability for growing workloads, with high availability and durability. For more information on the capabilities and benefits of using OpenSearch Service, see Amazon OpenSearch Service.

Transitioning to OpenSearch Service

When transitioning from CloudSearch to OpenSearch Service, you need to re-ingest and index your data into OpenSearch Service. Because OpenSearch Service uses a REST API, numerous methods exist for indexing documents. You can use standard clients like curl or any programming language that can send HTTP requests. To further simplify the process of interacting with it, OpenSearch Service has clients for many programming languages.

We recommend that you use Amazon OpenSearch Ingestion to ingest data. OpenSearch Ingestion is a fully managed data collector built within OpenSearch Service that can route data to an OpenSearch Service domain or an OpenSearch Serverless collection. OpenSearch Ingestion can ingest data from a wide variety of sources, such as Amazon Simple Storage Service (Amazon S3) buckets and HTTP endpoints, and has a rich ecosystem of built-in processors to take care of your most complex data transformation needs. OpenSearch Ingestion is serverless in nature and will scale automatically to meet the requirements of your most demanding workloads, helping you focus on your business logic while abstracting away the complexity of managing data pipelines for your ingestion use cases. For more information about how to ingest a document into an OpenSearch Serverless collection or a managed cluster using OpenSearch Ingestion, see Getting started with Amazon OpenSearch Ingestion. For detailed information on using OpenSearch Ingestion to ingest data into OpenSearch Service, refer to Amazon OpenSearch Ingestion.
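
For the direct REST route, one common task is converting existing CloudSearch document batches into the newline-delimited body that the OpenSearch _bulk API accepts. The following is a minimal sketch, assuming your batches use the CloudSearch JSON format with add and delete operations; the function name and the movies index name are hypothetical, and the sketch does not cover field mapping differences between the two services.

```python
import json


def cloudsearch_batch_to_bulk(batch, index_name):
    """Convert a CloudSearch JSON batch into an OpenSearch _bulk request body."""
    lines = []
    for op in batch:
        if op["type"] == "add":
            # An add becomes an index action line followed by the document source.
            lines.append(json.dumps({"index": {"_index": index_name, "_id": op["id"]}}))
            lines.append(json.dumps(op["fields"]))
        elif op["type"] == "delete":
            # A delete is a single action line with no source document.
            lines.append(json.dumps({"delete": {"_index": index_name, "_id": op["id"]}}))
        else:
            raise ValueError(f"Unsupported operation type: {op['type']}")
    # The _bulk API requires a trailing newline after the last line.
    return "\n".join(lines) + "\n"


# Hypothetical sample batch, for illustration only.
sample_batch = [
    {"type": "add", "id": "tt0484562", "fields": {"title": "The Seeker", "year": 2007}},
    {"type": "delete", "id": "tt0301199"},
]
print(cloudsearch_batch_to_bulk(sample_batch, "movies"))
```

The resulting string can be sent as the body of a POST request to the _bulk endpoint with the Content-Type header set to application/x-ndjson, using whichever signed HTTP client you already use for your domain or collection.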

Summary

AWS continues to support CloudSearch and continues to invest in security and availability improvements. However, with the advancements in OpenSearch, we recommend that you explore OpenSearch Service to get the latest search capabilities and to meet the rapid evolution of search experience users have come to expect in the machine learning age.


About the Authors

Arvind Mahesh is a Senior Manager-Product at Amazon Web Services for Amazon OpenSearch Service. He has close to two decades of technology experience across a variety of domains such as Analytics, Search, Cloud, Network Security, and Telecom.

Jon Handler is a Senior Principal Solutions Architect at Amazon Web Services based in Palo Alto, CA. Jon works closely with OpenSearch and Amazon OpenSearch Service, providing help and guidance to a broad range of customers who have search and log analytics workloads that they want to move to the AWS Cloud. Prior to joining AWS, Jon’s career as a software developer included 4 years of coding a large-scale, ecommerce search engine. Jon holds a Bachelor of the Arts from the University of Pennsylvania, and a Master of Science and a PhD in Computer Science and Artificial Intelligence from Northwestern University.

Configure SAML federation with Amazon OpenSearch Serverless and Keycloak

Post Syndicated from Arpad Csoke original https://aws.amazon.com/blogs/big-data/configure-saml-federation-with-amazon-opensearch-serverless-and-keycloak/

Amazon OpenSearch Serverless is a serverless version of Amazon OpenSearch Service, a fully managed open search and analytics platform. With Amazon OpenSearch Service, you can run petabyte-scale search and analytics workloads without the heavy lifting of managing the underlying OpenSearch Service clusters. Amazon OpenSearch Serverless supports workloads of up to 30 TB of data for time-series collections, and provides an installation of OpenSearch Dashboards with every collection created.

The network configuration for an OpenSearch Serverless collection controls how the collection can be accessed over the network. You have the option to make the collection publicly accessible over the internet from any network, or to restrict access to the collection only privately through OpenSearch Serverless-managed virtual private cloud (VPC) endpoints. This network access setting can be defined separately for the collection’s OpenSearch endpoint (used for data operations) and its corresponding OpenSearch Dashboards endpoint (used for visualizing and analyzing data). In this post, we work with a publicly accessible OpenSearch Serverless collection.

SAML enables users to access multiple applications or services with a single set of credentials, eliminating the need for separate logins for each application or service. This improves the user experience and reduces the overhead of managing multiple credentials. We provide SAML authentication for OpenSearch Serverless. With this you can use your existing identity provider (IdP) to offer single sign-on (SSO) for the OpenSearch Dashboards endpoints of serverless collections. OpenSearch Serverless supports IdPs that adhere to the SAML 2.0 standard, including services like AWS IAM Identity Center, Okta, Keycloak, Active Directory Federation Services (AD FS), and Auth0. This SAML authentication mechanism is solely intended for accessing the OpenSearch Dashboards interface through a web browser.

In this post, we show you how to configure SAML authentication for controlling access to public OpenSearch Dashboards using Keycloak as an IdP.

Solution overview

The following diagram illustrates a sample architecture of a solution that allows users to authenticate to OpenSearch Dashboards using SSO with Keycloak.

The sign-in flow includes the following steps:

  1. A user accesses OpenSearch Dashboards in a browser and chooses an IdP from the list.
  2. OpenSearch Serverless generates a SAML authentication request.
  3. OpenSearch Serverless redirects the request back to the browser.
  4. The browser redirects the user to the selected IdP (Keycloak). Keycloak provides a login page, where users can provide their login credentials.
  5. If authentication was successful, Keycloak returns the SAML response to the browser.
  6. The SAML assertion is sent back to OpenSearch Serverless.
  7. OpenSearch Serverless validates the SAML assertion, and logs the user in to OpenSearch Dashboards.

Prerequisites

To get started, you should have the following prerequisites:

  1. An active OpenSearch Serverless collection
  2. A working Keycloak server (on premises or in the cloud)
  3. The following AWS Identity and Access Management (IAM) permissions to configure SAML authentication in OpenSearch Serverless:
    • aoss:CreateSecurityConfig – Create a SAML provider.
    • aoss:ListSecurityConfig – List all SAML providers in the current account.
    • aoss:GetSecurityConfig – View SAML provider information.
    • aoss:UpdateSecurityConfig – Modify a given SAML provider configuration, including the XML metadata.
    • aoss:DeleteSecurityConfig – Delete a SAML provider.

Create and configure a client in Keycloak

Complete the following steps to create your Keycloak client:

  1. Log in to your Keycloak admin page.
  2. In the navigation pane, choose Clients.
  3. Choose Create client.
  4. For Client type, choose SAML.
  5. For Client ID, enter aws:opensearch:AWS_ACCOUNT_ID, where AWS_ACCOUNT_ID is your AWS account ID.
  6. Enter a name and description for your client.
  7. Choose Next.
  8. For Valid redirect URIs, enter the address of the assertion consumer service (ACS), where REGION is the AWS Region in which you have created the OpenSearch Serverless collection.
  9. For Master SAML Processing URL, also enter the preceding ACS address.
  10. Complete your client creation.
  11. After you create the client, disable the Signing keys config setting, because OpenSearch Serverless does not support signed and encrypted requests. For more details, refer to Considerations.
  12. After you have created the client and disabled the client signature, you can export the SAML 2.0 IdP Metadata by choosing the link on the Realm settings page. You need this metadata when you create the SAML provider in OpenSearch Serverless.
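
Before uploading the exported metadata, you may want to confirm that it describes the realm you expect. The following is a minimal sketch that reads the entity ID and single sign-on endpoints from a SAML 2.0 metadata document using the Python standard library. The function name, the sample XML, and the keycloak.example.com URLs are hypothetical; your exported file will contain additional elements such as signing certificates.

```python
import xml.etree.ElementTree as ET

# Namespace defined by the SAML 2.0 metadata specification.
MD_NS = "urn:oasis:names:tc:SAML:2.0:metadata"


def summarize_idp_metadata(xml_text):
    """Return the entity ID and SSO endpoints found in SAML IdP metadata."""
    root = ET.fromstring(xml_text)
    # iter() also yields the root itself, so this works whether or not the
    # EntityDescriptor is wrapped in an EntitiesDescriptor element.
    entity = next(root.iter(f"{{{MD_NS}}}EntityDescriptor"), None)
    if entity is None:
        raise ValueError("No EntityDescriptor element found")
    sso_endpoints = {
        el.get("Binding"): el.get("Location")
        for el in entity.iter(f"{{{MD_NS}}}SingleSignOnService")
    }
    return {"entity_id": entity.get("entityID"), "sso_endpoints": sso_endpoints}


# Hypothetical, trimmed-down metadata for illustration only.
SAMPLE_METADATA = """\
<md:EntityDescriptor xmlns:md="urn:oasis:names:tc:SAML:2.0:metadata"
                     entityID="https://keycloak.example.com/realms/demo">
  <md:IDPSSODescriptor protocolSupportEnumeration="urn:oasis:names:tc:SAML:2.0:protocol">
    <md:SingleSignOnService
        Binding="urn:oasis:names:tc:SAML:2.0:bindings:HTTP-POST"
        Location="https://keycloak.example.com/realms/demo/protocol/saml"/>
  </md:IDPSSODescriptor>
</md:EntityDescriptor>
"""

summary = summarize_idp_metadata(SAMPLE_METADATA)
print(summary["entity_id"])
```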

Create a SAML provider

When your OpenSearch Serverless collection is active, you then create a SAML provider. This SAML provider can be assigned to any collection in the same Region. Complete the following steps:

  1. On the OpenSearch Service console, under Serverless in the navigation pane, choose SAML authentication under Security.
  2. Choose Create SAML provider.
  3. Enter a name and description for your SAML provider.
  4. Enter the IdP metadata you downloaded earlier from Keycloak.
  5. Under Additional settings, you can optionally add custom user ID and group attributes (for this example, we leave this empty).
  6. Choose Create a SAML provider.

You have now configured a SAML provider for OpenSearch Serverless. Next, you configure the data access policy for accessing collections.

Create a data access policy

After you have configured the SAML provider, you have to create data access policies for OpenSearch Serverless to grant users access.

  1. On the OpenSearch Service console, under Serverless in the navigation pane, choose Data access policies under Security.
  2. Choose Create access policy.
  3. Enter a name and optional description for your access policy.
  4. For Policy definition method, select Visual editor.
  5. For Rule name, enter a name.
  6. Under Select principals, for Add principals, choose SAML users and groups.

  7. For SAML provider name, choose the provider you created before.
  8. Choose Save.

  9. Specify the user or group in the format user/USERNAME or group/GROUPNAME. The value of USERNAME or GROUPNAME should match the user name or group name you specified in Keycloak.
  10. Choose Save.
  11. Choose Grant to grant permissions to resources.
  12. In the Grant resources and permissions section, you can specify access you want to provide for a given user at the collection level, and also at the index pattern level.
    For more information about how to set up more granular access for your users, refer to Supported OpenSearch API operations and permissions and Supported policy permissions.
  13. Choose Save.
  14. You can create additional rules if needed.
  15. Choose Create to create the data access policy.

You now have a data access policy that allows users to access OpenSearch Dashboards and perform the allowed actions there.
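
If you choose the JSON policy definition method instead of the visual editor, SAML principals are written as saml/ACCOUNT_ID/PROVIDER_NAME/user/USERNAME or saml/ACCOUNT_ID/PROVIDER_NAME/group/GROUPNAME. The following sketch builds such a policy in Python. The function names, the provider name keycloak, the group name opensearch-readers, and the read-only permission set are assumptions chosen for illustration; adjust the permissions to what your users need.

```python
import json


def saml_principal(account_id, provider_name, kind, name):
    """Return a SAML principal string; kind is 'user' or 'group'."""
    if kind not in ("user", "group"):
        raise ValueError("kind must be 'user' or 'group'")
    return f"saml/{account_id}/{provider_name}/{kind}/{name}"


def build_saml_access_policy(collection_name, principals):
    """Build a read-only data access policy for the given SAML principals."""
    return [
        {
            "Description": "Read-only access for SAML principals",
            "Rules": [
                {
                    "ResourceType": "collection",
                    "Resource": [f"collection/{collection_name}"],
                    "Permission": ["aoss:DescribeCollectionItems"],
                },
                {
                    "ResourceType": "index",
                    "Resource": [f"index/{collection_name}/*"],
                    "Permission": ["aoss:DescribeIndex", "aoss:ReadDocument"],
                },
            ],
            "Principal": principals,
        }
    ]


# Hypothetical sample values, for illustration only.
readers = saml_principal("123456789012", "keycloak", "group", "opensearch-readers")
policy = build_saml_access_policy("my-collection", [readers])
print(json.dumps(policy, indent=2))
```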

Access the OpenSearch Dashboards

Complete the following steps to sign in to the OpenSearch Dashboards:

  1. On the OpenSearch Service console, under Serverless in the navigation pane, choose Dashboard.
  2. In the Collection section, locate your collection and choose Dashboard.

    The OpenSearch login page will open in a new browser tab.
  3. Choose your IdP on the dropdown menu and choose Login.

    You will be redirected to the Keycloak sign-in page.
  4. Log in with your SSO credentials.

After a successful login, you will be redirected to OpenSearch Dashboards, and you can perform the actions allowed by the data access policy.

You have successfully federated OpenSearch Dashboards with Keycloak as an IdP.

Cleaning up

When you’re done with this solution, delete the resources you created if you no longer need them.

  1. Delete your OpenSearch Serverless collection.
  2. Delete your data access policy.
  3. Delete the SAML provider.

Conclusion

In this post, we demonstrated how to set up Keycloak as an IdP to access an OpenSearch Serverless dashboard using SAML authentication. For more details, refer to SAML authentication for Amazon OpenSearch Serverless.


About the Author

Arpad Csoke is a Solutions Architect at Amazon Web Services. His responsibilities include helping large enterprise customers understand and utilize the AWS environment, acting as a technical consultant to contribute to solving their issues.

AWS Weekly Roundup: Global AWS Heroes Summit, AWS Lambda, Amazon Redshift, and more (July 22, 2024)

Post Syndicated from Donnie Prakoso original https://aws.amazon.com/blogs/aws/aws-weekly-roundup-global-aws-heroes-summit-aws-lambda-amazon-redshift-and-more-july-22-2024/

Last week, AWS Heroes from around the world gathered to celebrate the 10th anniversary of the AWS Heroes program at Global AWS Heroes Summit. This program recognizes a select group of AWS experts worldwide who go above and beyond in sharing their knowledge and making an impact within developer communities.

Matt Garman, CEO of AWS and a long-time supporter of developer communities, made a special appearance for a Q&A session with the Heroes to listen to their feedback and respond to their questions.

Here’s an epic photo from the AWS Heroes Summit:

As Matt mentioned in his LinkedIn post, “The developer community has been core to everything we have done since the beginning of AWS.” Thank you, Heroes, for all you do. Wishing you all a safe flight home.

Last week’s launches
Here are some launches that caught my attention last week:

Announcing the July 2024 updates to Amazon Corretto — The latest updates for the Corretto distribution of OpenJDK are now available. This includes security and critical updates for the Long-Term Supported (LTS) and Feature (FR) versions.

New open-source Advanced MySQL ODBC Driver now available for Amazon Aurora and RDS — The new AWS ODBC Driver for MySQL provides faster switchover and failover times, and authentication support for AWS Secrets Manager and AWS Identity and Access Management (IAM), making it a more efficient and secure option for connecting to Amazon RDS and Amazon Aurora MySQL-compatible edition databases.

Productionize Fine-tuned Foundation Models from SageMaker Canvas — Amazon SageMaker Canvas now allows you to deploy fine-tuned Foundation Models (FMs) to SageMaker real-time inference endpoints, making it easier to integrate generative AI capabilities into your applications outside the SageMaker Canvas workspace.

AWS Lambda now supports SnapStart for Java functions that use the ARM64 architecture — Lambda SnapStart for Java functions on ARM64 architecture delivers up to 10x faster function startup performance and up to 34% better price performance compared to x86, enabling the building of highly responsive and scalable Java applications using AWS Lambda.

Amazon QuickSight improves controls performance — Amazon QuickSight has improved the performance of controls, allowing readers to interact with them immediately without having to wait for all relevant controls to reload. This enhancement reduces the loading time experienced by readers.

Amazon OpenSearch Serverless levels up speed and efficiency with smart caching — The new smart caching feature for indexing in Amazon OpenSearch Serverless automatically fetches and manages data, leading to faster data retrieval, efficient storage usage, and cost savings.

Amazon Redshift Serverless with lower base capacity available in the Europe (London) Region — Amazon Redshift Serverless now allows you to start with a lower data warehouse base capacity of 8 Redshift Processing Units (RPUs) in the Europe (London) region, providing more flexibility and cost-effective options for small to large workloads.

AWS Lambda now supports Amazon MQ for ActiveMQ and RabbitMQ in five new regions — AWS Lambda now supports Amazon MQ for ActiveMQ and RabbitMQ in five new regions, enabling you to build serverless applications with Lambda functions that are invoked based on messages posted to Amazon MQ message brokers.

From community.aws
Here are my top five personal favorite posts from community.aws:

Upcoming AWS events
Check your calendars and sign up for upcoming AWS events:

AWS Summits — Join free online and in-person events that bring the cloud computing community together to connect, collaborate, and learn about AWS. To learn more about future AWS Summit events, visit the AWS Summit page. Register in your nearest city: AWS Summit Taipei (July 23–24), AWS Summit Mexico City (Aug. 7), and AWS Summit Sao Paulo (Aug. 15).

AWS Community Days — Join community-led conferences that feature technical discussions, workshops, and hands-on labs led by expert AWS users and industry leaders from around the world. Upcoming AWS Community Days are in Aotearoa (Aug. 15), Nigeria (Aug. 24), New York (Aug. 28), and Belfast (Sept. 6).

You can browse all upcoming in-person and virtual events.

That’s all for this week. Check back next Monday for another Weekly Roundup!

Donnie

This post is part of our Weekly Roundup series. Check back each week for a quick roundup of interesting news and announcements from AWS!