Tag Archives: Amazon OpenSearch Service

AWS Weekly Roundup: Global AWS Heroes Summit, AWS Lambda, Amazon Redshift, and more (July 22, 2024)

Post Syndicated from Donnie Prakoso original https://aws.amazon.com/blogs/aws/aws-weekly-roundup-global-aws-heroes-summit-aws-lambda-amazon-redshift-and-more-july-22-2024/

Last week, AWS Heroes from around the world gathered to celebrate the 10th anniversary of the AWS Heroes program at Global AWS Heroes Summit. This program recognizes a select group of AWS experts worldwide who go above and beyond in sharing their knowledge and making an impact within developer communities.

Matt Garman, CEO of AWS and a long-time supporter of developer communities, made a special appearance for a Q&A session with the Heroes to listen to their feedback and respond to their questions.

Here’s an epic photo from the AWS Heroes Summit:

As Matt mentioned in his Linkedin post, “The developer community has been core to everything we have done since the beginning of AWS.” Thank you, Heroes, for all you do. Wishing you all a safe flight home.

Last week’s launches
Here are some launches that caught my attention last week:

Announcing the July 2024 updates to Amazon Corretto — The latest updates for the Corretto distribution of OpenJDK is now available. This includes security and critical updates for the Long-Term Supported (LTS) and Feature (FR) versions.

New open-source Advanced MYSQL ODBC Driver now available for Amazon Aurora and RDS — The new AWS ODBC Driver for MYSQL provides faster switchover and failover times, and authentication support for AWS Secrets Manager and AWS Identity and Access Management (IAM), making it a more efficient and secure option for connecting to Amazon RDS and Amazon Aurora MySQL-compatible edition databases.

Productionize Fine-tuned Foundation Models from SageMaker Canvas — Amazon SageMaker Canvas now allows you to deploy fine-tuned Foundation Models (FMs) to SageMaker real-time inference endpoints, making it easier to integrate generative AI capabilities into your applications outside the SageMaker Canvas workspace.

AWS Lambda now supports SnapStart for Java functions that use the ARM64 architecture — Lambda SnapStart for Java functions on ARM64 architecture delivers up to 10x faster function startup performance and up to 34% better price performance compared to x86, enabling the building of highly responsive and scalable Java applications using AWS Lambda.

Amazon QuickSight improves controls performance — Amazon QuickSight has improved the performance of controls, allowing readers to interact with them immediately without having to wait for all relevant controls to reload. This enhancement reduces the loading time experienced by readers.

Amazon OpenSearch Serverless levels up speed and efficiency with smart caching — The new smart caching feature for indexing in Amazon OpenSearch Serverless automatically fetches and manages data, leading to faster data retrieval, efficient storage usage, and cost savings.

Amazon Redshift Serverless with lower base capacity available in the Europe (London) Region — Amazon Redshift Serverless now allows you to start with a lower data warehouse base capacity of 8 Redshift Processing Units (RPUs) in the Europe (London) region, providing more flexibility and cost-effective options for small to large workloads.

AWS Lambda now supports Amazon MQ for ActiveMQ and RabbitMQ in five new regions — AWS Lambda now supports Amazon MQ for ActiveMQ and RabbitMQ in five new regions, enabling you to build serverless applications with Lambda functions that are invoked based on messages posted to Amazon MQ message brokers.

From community.aws
Here’s my top 5 personal favorites posts from community.aws:

Upcoming AWS events
Check your calendars and sign up for upcoming AWS events:

AWS Summits — Join free online and in-person events that bring the cloud computing community together to connect, collaborate, and learn about AWS. To learn more about future AWS Summit events, visit the AWS Summit page. Register in your nearest city: AWS Summit Taipei (July 23–24), AWS Summit Mexico City (Aug. 7), and AWS Summit Sao Paulo (Aug. 15).

AWS Community Days — Join community-led conferences that feature technical discussions, workshops, and hands-on labs led by expert AWS users and industry leaders from around the world. Upcoming AWS Community Days are in Aotearoa (Aug. 15), Nigeria (Aug. 24), New York (Aug. 28), and Belfast (Sept. 6).

You can browse all upcoming in-person and virtual events.

That’s all for this week. Check back next Monday for another Weekly Roundup!

Donnie

This post is part of our Weekly Roundup series. Check back each week for a quick roundup of interesting news and announcements from AWS!

Migrate from Apache Solr to OpenSearch

Post Syndicated from Aswath Srinivasan original https://aws.amazon.com/blogs/big-data/migrate-from-apache-solr-to-opensearch/

OpenSearch is an open source, distributed search engine suitable for a wide array of use-cases such as ecommerce search, enterprise search (content management search, document search, knowledge management search, and so on), site search, application search, and semantic search. It’s also an analytics suite that you can use to perform interactive log analytics, real-time application monitoring, security analytics and more. Like Apache Solr, OpenSearch provides search across document sets. OpenSearch also includes capabilities to ingest and analyze data. Amazon OpenSearch Service is a fully managed service that you can use to deploy, scale, and monitor OpenSearch in the AWS Cloud.

Many organizations are migrating their Apache Solr based search solutions to OpenSearch. The main driving factors include lower total cost of ownership, scalability, stability, improved ingestion connectors (such as Data Prepper, Fluent Bit, and OpenSearch Ingestion), elimination of external cluster managers like Zookeeper, enhanced reporting, and rich visualizations with OpenSearch Dashboards.

We recommend approaching a Solr to OpenSearch migration with a full refactor of your search solution to optimize it for OpenSearch. While both Solr and OpenSearch use Apache Lucene for core indexing and query processing, the systems exhibit different characteristics. By planning and running a proof-of-concept, you can ensure the best results from OpenSearch. This blog post dives into the strategic considerations and steps involved in migrating from Solr to OpenSearch.

Key differences

Solr and OpenSearch Service share fundamental capabilities delivered through Apache Lucene. However, there are some key differences in terminology and functionality between the two:

  • Collection and index: In OpenSearch, a collection is called an index.
  • Shard and replica: Both Solr and OpenSearch use the terms shard and replica.
  • API-driven Interactions: All interactions in OpenSearch are API-driven, eliminating the need for manual file changes or Zookeeper configurations. When creating an OpenSearch index, you define the mapping (equivalent to the schema) and the settings (equivalent to solrconfig) as part of the index creation API call.

Having set the stage with the basics, let’s dive into the four key components and how each of them can be migrated from Solr to OpenSearch.

Collection to index

A collection in Solr is called an index in OpenSearch. Like a Solr collection, an index in OpenSearch also has shards and replicas.

Although the shard and replica concept is similar in both the search engines, you can use this migration as a window to adopt a better sharding strategy. Size your OpenSearch shards, replicas, and index by following the shard strategy best practices.

As part of the migration, reconsider your data model. In examining your data model, you can find efficiencies that dramatically improve your search latencies and throughput. Poor data modeling doesn’t only result in search performance problems but extends to other areas. For example, you might find it challenging to construct an effective query to implement a particular feature. In such cases, the solution often involves modifying the data model.

Differences: Solr allows primary shard and replica shard collocation on the same node. OpenSearch doesn’t place the primary and replica on the same node. OpenSearch Service zone awareness can automatically ensure that shards are distributed to different Availability Zones (data centers) to further increase resiliency.

The OpenSearch and Solr notions of replica are different. In OpenSearch, you define a primary shard count using number_of_primaries that determines the partitioning of your data. You then set a replica count using number_of_replicas. Each replica is a copy of all the primary shards. So, if you set number_of_primaries to 5, and number_of_replicas to 1, you will have 10 shards (5 primary shards, and 5 replica shards). Setting replicationFactor=1 in Solr yields one copy of the data (the primary).

For example, the following creates a collection called test with one shard and no replicas.

http://localhost:8983/solr/admin/collections?
  _=action=CREATE
  &maxShardsPerNode=2
  &name=test
  &numShards=1
  &replicationFactor=1
  &wt=json

In OpenSearch, the following creates an index called test with five shards and one replica

PUT test
{
  "settings": {
    "number_of_shards": 5,
    "number_of_replicas": 1
  }
}

Schema to mapping

In Solr schema.xml OR managed-schema has all the field definitions, dynamic fields, and copy fields along with field type (text analyzers, tokenizers, or filters). You use the schema API to manage schema. Or you can run in schema-less mode.

OpenSearch has dynamic mapping, which behaves like Solr in schema-less mode. It’s not necessary to create an index beforehand to ingest data. By indexing data with a new index name, you create the index with OpenSearch managed service default settings (for example: "number_of_shards": 5, "number_of_replicas": 1) and the mapping based on the data that’s indexed (dynamic mapping).

We strongly recommend you opt for a pre-defined strict mapping. OpenSearch sets the schema based on the first value it sees in a field. If a stray numeric value is the first value for what is really a string field, OpenSearch will incorrectly map the field as numeric (integer, for example). Subsequent indexing requests with string values for that field will fail with an incorrect mapping exception. You know your data, you know your field types, you will benefit from setting the mapping directly.

Tip: Consider performing a sample indexing to generate the initial mapping and then refine and tidy up the mapping to accurately define the actual index. This approach helps you avoid manually constructing the mapping from scratch.

For Observability workloads, you should consider using Simple Schema for Observability. Simple Schema for Observability (also known as ss4o) is a standard for conforming to a common and unified observability schema. With the schema in place, Observability tools can ingest, automatically extract, and aggregate data and create custom dashboards, making it easier to understand the system at a higher level.

Many of the field types (data types), tokenizers, and filters are the same in both Solr and OpenSearch. After all, both use Lucene’s Java search library at their core.

Let’s look at an example:

<!-- Solr schema.xml snippets -->
<field name="id" type="string" indexed="true" stored="true" required="true" multiValued="false" /> 
<field name="name" type="string" indexed="true" stored="true" multiValued="true"/>
<field name="address" type="text_general" indexed="true" stored="true"/>
<field name="user_token" type="string" indexed="false" stored="true"/>
<field name="age" type="pint" indexed="true" stored="true"/>
<field name="last_modified" type="pdate" indexed="true" stored="true"/>
<field name="city" type="text_general" indexed="true" stored="true"/>

<uniqueKey>id</uniqueKey>

<copyField source="name" dest="text"/>
<copyField source="address" dest="text"/>

<fieldType name="string" class="solr.StrField" sortMissingLast="true" />
<fieldType name="pint" class="solr.IntPointField" docValues="true"/>
<fieldType name="pdate" class="solr.DatePointField" docValues="true"/>

<fieldType name="text_general" class="solr.TextField" positionIncrementGap="100">
<analyzer type="index">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.ASCIIFoldingFilterFactory" preserveOriginal="false" />
    <filter class="solr.LowerCaseFilterFactory"/>
</analyzer>
<analyzer type="query">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.ASCIIFoldingFilterFactory" preserveOriginal="false" />
    <filter class="solr.LowerCaseFilterFactory"/>
</analyzer>
</fieldType>
PUT index_from_solr
{
  "settings": {
    "analysis": {
      "analyzer": {
        "text_general": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": [
            "lowercase",
            "asciifolding"
          ]
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "name": {
        "type": "keyword",
        "copy_to": "text"
      },
      "address": {
        "type": "text",
        "analyzer": "text_general"
      },
      "user_token": {
        "type": "keyword",
        "index": false
      },
      "age": {
        "type": "integer"
      },
      "last_modified": {
        "type": "date"
      },
      "city": {
        "type": "text",
        "analyzer": "text_general"
      },
      "text": {
        "type": "text",
        "analyzer": "text_general"
      }
    }
  }
}

Notable things in OpenSearch compared to Solr:

  1. _id is always the uniqueKey and cannot be defined explicitly, because it’s always present.
  2. Explicitly enabling multivalued isn’t necessary because any OpenSearch field can contain zero or more values.
  3. The mapping and the analyzers are defined during index creation. New fields can be added and certain mapping parameters can be updated later. However, deleting a field isn’t possible. A handy ReIndex API can overcome this problem. You can use the Reindex API to index data from one index to another.
  4. By default, analyzers are for both index and query time. For some less-common scenarios, you can change the query analyzer at search time (in the query itself), which will override the analyzer defined in the index mapping and settings.
  5. Index templates are also a great way to initialize new indexes with predefined mappings and settings. For example, if you continuously index log data (or any time-series data), you can define an index template so that all the indices have the same number of shards and replicas. It can also be used for dynamic mapping control and component templates

Look for opportunities to optimize the search solution. For instance, if the analysis reveals that the city field is solely used for filtering rather than searching, consider changing its field type to keyword instead of text to eliminate unnecessary text processing. Another optimization could involve disabling doc_values for the user_token field if it’s only intended for display purposes. doc_values are disabled by default for the text datatype.

SolrConfig to settings

In Solr, solrconfig.xml carries the collection configuration. All sorts of configurations pertaining to everything from index location and formatting, caching, codec factory, circuit breaks, commits and tlogs all the way up to slow query config, request handlers, and update processing chain, and so on.

Let’s look at an example:

<codecFactory class="solr.SchemaCodecFactory">
<str name="compressionMode">`BEST_COMPRESSION`</str>
</codecFactory>

<autoCommit>
    <maxTime>${solr.autoCommit.maxTime:15000}</maxTime>
    <openSearcher>false</openSearcher>
</autoCommit>

<autoSoftCommit>
    <maxTime>${solr.autoSoftCommit.maxTime:-1}</maxTime>
    </autoSoftCommit>

<slowQueryThresholdMillis>1000</slowQueryThresholdMillis>

<maxBooleanClauses>${solr.max.booleanClauses:2048}</maxBooleanClauses>

<requestHandler name="/query" class="solr.SearchHandler">
    <lst name="defaults">
    <str name="echoParams">explicit</str>
    <str name="wt">json</str>
    <str name="indent">true</str>
    <str name="df">text</str>
    </lst>
</requestHandler>

<searchComponent name="spellcheck" class="solr.SpellCheckComponent"/>
<searchComponent name="suggest" class="solr.SuggestComponent"/>
<searchComponent name="elevator" class="solr.QueryElevationComponent"/>
<searchComponent class="solr.HighlightComponent" name="highlight"/>

<queryResponseWriter name="json" class="solr.JSONResponseWriter"/>
<queryResponseWriter name="velocity" class="solr.VelocityResponseWriter" startup="lazy"/>
<queryResponseWriter name="xslt" class="solr.XSLTResponseWriter"/>

<updateRequestProcessorChain name="script"/>

Notable things in OpenSearch compared to Solr:

  1. Both OpenSearch and Solr have BEST_SPEED codec as default (LZ4 compression algorithm). Both offer BEST_COMPRESSION as an alternative. Additionally OpenSearch offers zstd and zstd_no_dict. Benchmarking for different compression codecs is also available.
  2. For near real-time search, refresh_interval needs to be set. The default is 1 second which is good enough for most use cases. We recommend increasing refresh_interval to 30 or 60 seconds to improve indexing speed and throughput, especially for batch indexing.
  3. Max boolean clause is a static setting, set at node level using the indices.query.bool.max_clause_count setting.
  4. You don’t need an explicit requestHandler. All searches use the _search or _msearch endpoint. If you’re used to using the requestHandler with default values then you can use search templates.
  5. If you’re used to using /sql requestHandler, OpenSearch also lets you use SQL syntax for querying and has a Piped Processing Language.
  6. Spellcheck, also known as Did-you-mean, QueryElevation (known as pinned_query in OpenSearch), and highlighting are all supported during query time. You don’t need to explicitly define search components.
  7. Most API responses are limited to JSON format, with CAT APIs as the only exception. In cases where Velocity or XSLT is used in Solr, it must be managed on the application layer. CAT APIs respond in JSON, YAML, or CBOR formats.
  8. For the updateRequestProcessorChain, OpenSearch provides the ingest pipeline, allowing the enrichment or transformation of data before indexing. Multiple processor stages can be chained to form a pipeline for data transformation. Processors include GrokProcessor, CSVParser, JSONProcessor, KeyValue, Rename, Split, HTMLStrip, Drop, ScriptProcessor, and more. However, it’s strongly recommended to do the data transformation outside OpenSearch. The ideal place to do that would be at OpenSearch Ingestion, which provides a proper framework and various out-of-the-box filters for data transformation. OpenSearch Ingestion is built on Data Prepper, which is a server-side data collector capable of filtering, enriching, transforming, normalizing, and aggregating data for downstream analytics and visualization.
  9. OpenSearch also introduced search pipelines, similar to ingest pipelines but tailored for search time operations. Search pipelines make it easier for you to process search queries and search results within OpenSearch. Currently available search processors include filter query, neural query enricher, normalization, rename field, scriptProcessor, and personalize search ranking, with more to come.
  10. The following image shows how to set refresh_interval and slowlog. It also shows you the other possible settings.
  11. Slow logs can be set like the following image but with much more precision with separate thresholds for the query and fetch phases.

Before migrating every configuration setting, assess if the setting can be adjusted based on your current search system experience and best practices. For instance, in the preceding example, the slow logs threshold of 1 second might be intensive for logging, so that can be revisited. In the same example, max.booleanClauses might be another thing to look at and reduce.

Differences: Some settings are done at the cluster level or node level and not at the index level. Including settings such as max boolean clause, circuit breaker settings, cache settings, and so on.

Rewriting queries

Rewriting queries deserves its own blog post; however we want to at least showcase the autocomplete feature available in OpenSearch Dashboards, which helps ease query writing.

Similar to the Solr Admin UI, OpenSearch also features a UI called OpenSearch Dashboards. You can use OpenSearch Dashboards to manage and scale your OpenSearch clusters. Additionally, it provides capabilities for visualizing your OpenSearch data, exploring data, monitoring observability, running queries, and so on. The equivalent for the query tab on the Solr UI in OpenSearch Dashboard is Dev Tools. Dev Tools is a development environment that lets you set up your OpenSearch Dashboards environment, run queries, explore data, and debug problems.

Now, let’s construct a query to accomplish the following:

  1. Search for shirt OR shoe in an index.
  2. Create a facet query to find the number of unique customers. Facet queries are called aggregation queries in OpenSearch. Also known as aggs query.

The Solr query would look like this:

http://localhost:8983/solr/solr_sample_data_ecommerce/select?q=shirt OR shoe
  &facet=true
  &facet.field=customer_id
  &facet.limit=-1
  &facet.mincount=1
  &json.facet={
   unique_customer_count:"unique(customer_id)"
  }

The image below demonstrates how to re-write the above Solr query into an OpenSearch query DSL:

Conclusion

OpenSearch covers a wide variety of uses cases, including enterprise search, site search, application search, ecommerce search, semantic search, observability (log observability, security analytics (SIEM), anomaly detection, trace analytics), and analytics. Migration from Solr to OpenSearch is becoming a common pattern. This blog post is designed to be a starting point for teams seeking guidance on such migrations.

You can try out OpenSearch with the OpenSearch Playground. You can get started with Amazon OpenSearch Service, a managed implementation of OpenSearch in the AWS Cloud.


About the Authors

Aswath Srinivasan is a Senior Search Engine Architect at Amazon Web Services currently based in Munich, Germany. With over 17 years of experience in various search technologies, Aswath currently focuses on OpenSearch. He is a search and open-source enthusiast and helps customers and the search community with their search problems.

Jon Handler is a Senior Principal Solutions Architect at Amazon Web Services based in Palo Alto, CA. Jon works closely with OpenSearch and Amazon OpenSearch Service, providing help and guidance to a broad range of customers who have search and log analytics workloads that they want to move to the AWS Cloud. Prior to joining AWS, Jon’s career as a software developer included 4 years of coding a large-scale, ecommerce search engine. Jon holds a Bachelor of the Arts from the University of Pennsylvania, and a Master of Science and a PhD in Computer Science and Artificial Intelligence from Northwestern University.

Building a scalable streaming data platform that enables real-time and batch analytics of electric vehicles on AWS

Post Syndicated from Ayush Agrawal original https://aws.amazon.com/blogs/big-data/building-a-scalable-streaming-data-platform-that-enables-real-time-and-batch-analytics-of-electric-vehicles-on-aws/

The automobile industry has undergone a remarkable transformation because of the increasing adoption of electric vehicles (EVs). EVs, known for their sustainability and eco-friendliness, are paving the way for a new era in transportation. As environmental concerns and the push for greener technologies have gained momentum, the adoption of EVs has surged, promising to reshape our mobility landscape.

The surge in EVs brings with it a profound need for data acquisition and analysis to optimize their performance, reliability, and efficiency. In the rapidly evolving EV industry, the ability to harness, process, and derive insights from the massive volume of data generated by EVs has become essential for manufacturers, service providers, and researchers alike.

As the EV market is expanding with many new and incumbent players trying to capture the market, the major differentiating factor will be the performance of the vehicles.

Modern EVs are equipped with an array of sensors and systems that continuously monitor various aspects of their operation including parameters such as voltage, temperature, vibration, speed, and so on. From battery management to motor performance, these data-rich machines provide a wealth of information that, when effectively captured and analyzed, can revolutionize vehicle design, enhance safety, and optimize energy consumption. The data can be used to do predictive maintenance, device anomaly detection, real-time customer alerts, remote device management, and monitoring.

However, managing this deluge of data isn’t without its challenges. As the adoption of EVs accelerates, the need for robust data pipelines capable of collecting, storing, and processing data from an exponentially growing number of vehicles becomes more pronounced. Moreover, the granularity of data generated by each vehicle has increased significantly, making it essential to efficiently handle the ever-increasing number of data points. The challenges include not only the technical intricacies of data management but also concerns related to data security, privacy, and compliance with evolving regulations.

In this blog post, we delve into the intricacies of building a reliable data analytics pipeline that can scale to accommodate millions of vehicles, each generating hundreds of metrics every second using Amazon OpenSearch Ingestion. We also provide guidelines and sample configurations to help you implement a solution.

Of the prerequisites that follow, the IOT topic rule and the Amazon Managed Streaming for Apache Kafka (Amazon MSK) cluster can be set up by following How to integrate AWS IoT Core with Amazon MSK. The steps to create an Amazon OpenSearch Service cluster are available in Creating and managing Amazon OpenSearch Service domains.

Prerequisites

Before you begin the implementing the solution, you need the following:

  • IOT topic rule
  • Amazon MSK Simple Authentication and Security Layer/Salted Challenge Response Mechanism (SASL/SCRAM) cluster
  • Amazon OpenSearch Service domain

Solution overview

The following architecture diagram provides a scalable and fully managed modern data streaming platform. The architecture uses Amazon OpenSearch Ingestion to stream data into OpenSearch Service and Amazon Simple Storage Service (Amazon S3) to store the data. The data in OpenSearch powers real-time dashboards. The data can also be used to notify customers of any failures occurring on the vehicle (see Configuring alerts in Amazon OpenSearch Service). The data in Amazon S3 is used for business intelligence and long-term storage.

Architecture diagram

In the following sections, we focus on the following three critical pieces of the architecture in depth:

1. Amazon MSK to OpenSearch ingestion pipeline

2. Amazon OpenSearch Ingestion pipeline to OpenSearch Service

3. Amazon OpenSearch Ingestion to Amazon S3

Solution Walkthrough

Step 1: MSK to Amazon OpenSearch Ingestion pipeline

Because each electric vehicle streams massive volumes of data to Amazon MSK clusters through AWS IoT Core, making sense of this data avalanche is critical. OpenSearch Ingestion provides a fully managed serverless integration to tap into these data streams.

The Amazon MSK source in OpenSearch Ingestion uses Kafka’s Consumer API to read records from one or more MSK topics. The MSK source in OpenSearch Ingestion seamlessly connects to MSK to ingest the streaming data into OpenSearch Ingestion’s processing pipeline.

The following snippet illustrates the pipeline configuration for an OpenSearch Ingestion pipeline used to ingest data from an MSK cluster.

While creating an OpenSearch Ingestion pipeline, add the following snippet in the Pipeline configuration section.

version: "2"
msk-pipeline: 
  source: 
    kafka: 
      acknowledgments: true                  
      topics: 
         - name: "ev-device-topic " 
           group_id: "opensearch-consumer" 
           serde_format: json                 
      aws: 
        # Provide the Role ARN with access to MSK. This role should have a trust relationship with osis-pipelines.amazonaws.com 
        sts_role_arn: "arn:aws:iam:: ::<<account-id>>:role/opensearch-pipeline-Role"
        # Provide the region of the domain. 
        region: "<<region>>" 
        msk: 
          # Provide the MSK ARN.  
          arn: "arn:aws:kafka:<<region>>:<<account-id>>:cluster/<<name>>/<<id>>" 

When configuring Amazon MSK and OpenSearch Ingestion, it’s essential to establish an optimal relationship between the number of partitions in your Kafka topics and the number of OpenSearch Compute Units (OCUs) allocated to your ingestion pipelines. This optimal configuration ensures efficient data processing and maximizes throughput. You can read more about it in Configure recommended compute units (OCUs) for the Amazon MSK pipeline.

Step 2: OpenSearch Ingestion pipeline to OpenSearch Service

OpenSearch Ingestion offers a direct method for streaming EV data into OpenSearch. The OpenSearch sink plugin channels data from multiple sources directly into the OpenSearch domain. Instead of manually provisioning the pipeline, you define the capacity for your pipeline using OCUs. Each OCU provides 6 GB of memory and two virtual CPUs. To use OpenSearch Ingestion auto-scaling optimally, it’s essential to configure the maximum number of OCUs for a pipeline based on the number of partitions in the topics being ingested. If a topic has a large number of partitions (for example, more than 96, which is the maximum OCUs per pipeline), it’s recommended to configure the pipeline with a maximum of 1–96 OCUs. This way, the pipeline can automatically scale up or down within this range as needed. However, if a topic has a low number of partitions (for example, fewer than 96), it’s advisable to set the maximum number of OCUs to be equal to the number of partitions. This approach ensures that each partition is processed by a dedicated OCU enabling parallel processing and optimal performance. In scenarios where a pipeline ingests data from multiple topics, the topic with the highest number of partitions should be used as a reference to configure the maximum OCUs. Additionally, if higher throughput is required, you can create another pipeline with a new set of OCUs for the same topic and consumer group, enabling near-linear scalability.

OpenSearch Ingestion provides several pre-defined configuration blueprints that can help you quickly build your ingestion pipeline on AWS

The following snippet illustrates pipeline configuration for an OpenSearch Ingestion pipeline using OpenSearch as a SINK with a dead letter queue (DLQ) to Amazon S3. When a pipeline encounters write errors, it creates DLQ objects in the configured S3 bucket. DLQ objects exist within a JSON file as an array of failed events.

sink: 
      - opensearch: 
          # Provide an AWS OpenSearch Service domain endpoint 
          hosts: [ "https://<<domain-name>>.<<region>>.es.amazonaws.com" ] 
          aws: 
          # Provide a Role ARN with access to the domain. This role should have a trust relationship with osis-pipelines.amazonaws.com 
            sts_role_arn: "arn:aws:iam::<<account-id>>:role/<<role-name>>" 
          # Provide the region of the domain. 
            region: "<<region>>" 
          # Enable the 'serverless' flag if the sink is an Amazon OpenSearch Serverless collection 
          # serverless: true 
          # index name can be auto-generated from topic name 
          index: "index_ev_pipe-%{yyyy.MM.dd}" 
          # Enable 'distribution_version' setting if the AWS OpenSearch Service domain is of version Elasticsearch 6.x 
          #distribution_version: "es6" 
          # Enable the S3 DLQ to capture any failed requests in Ohan S3 bucket 
          dlq: 
            s3: 
            # Provide an S3 bucket 
              bucket: "<<bucket-name>>"
            # Provide a key path prefix for the failed requests
              key_path_prefix: "oss-pipeline-errors/dlq"
            # Provide the region of the bucket.
              region: "<<region>>"
            # Provide a Role ARN with access to the bucket. This role should have a trust relationship with osis-pipelines.amazonaws.com
              sts_role_arn: "arn:aws:iam:: <<account-id>>:role/<<role-name>>"

Step 3: OpenSearch Ingestion to Amazon S3

OpenSearch Ingestion offers a built-in sink for loading streaming data directly into S3. The service can compress, partition, and optimize the data for cost-effective storage and analytics in Amazon S3. Data loaded into S3 can be partitioned for easier query isolation and lifecycle management. Partitions can be based on vehicle ID, date, geographic region, or other dimensions as needed for your queries.

The following snippet illustrates how we’ve partitioned and stored EV data in Amazon S3.

- s3:
            aws:
              # Provide a Role ARN with access to the bucket. This role should have a trust relationship with osis-pipelines.amazonaws.com
                sts_role_arn: "arn:aws:iam::<<account-id>>:role/<<role-name>>"
              # Provide the region of the domain.
                region: "<<region>>"
            # Replace with the bucket to send the logs to
            bucket: "evbucket"
            object_key:
              # Optional path_prefix for your s3 objects
              path_prefix: "index_ev_pipe/year=%{yyyy}/month=%{MM}/day=%{dd}/hour=%{HH}"
            threshold:
              event_collect_timeout: 60s
            codec:
              parquet:
                auto_schema: true

The pipeline can be created following the steps in Creating Amazon OpenSearch Ingestion pipelines.

The following is the complete pipeline configuration, combining the configuration of all three steps. Update the Amazon Resource Names (ARNs), AWS Region, Open Search Service domain endpoint, and S3 names as needed.

The entire OpenSearch Ingestion pipeline configuration can be directly copied into the ‘Pipeline configuration’ field in the AWS Management Console while creating the OpenSearch Ingestion pipeline

version: "2"
msk-pipeline: 
  source: 
    kafka: 
      acknowledgments: true           # Default is false  
      topics: 
         - name: "<<msk-topic-name>>" 
           group_id: "opensearch-consumer" 
           serde_format: json        
      aws: 
        # Provide the Role ARN with access to MSK. This role should have a trust relationship with osis-pipelines.amazonaws.com 
        sts_role_arn: "arn:aws:iam::<<account-id>>:role/<<role-name>>"
        # Provide the region of the domain. 
        region: "<<region>>" 
        msk: 
          # Provide the MSK ARN.  
          arn: "arn:aws:kafka:us-east-1:<<account-id>>:cluster/<<cluster-name>>/<<cluster-id>>" 
  processor:
      - parse_json:
  sink: 
      - opensearch: 
          # Provide an AWS OpenSearch Service domain endpoint 
          hosts: [ "https://<<opensearch-service-domain-endpoint>>.us-east-1.es.amazonaws.com" ] 
          aws: 
          # Provide a Role ARN with access to the domain. This role should have a trust relationship with osis-pipelines.amazonaws.com 
            sts_role_arn: "arn:aws:iam::<<account-id>>:role/<<role-name>>" 
          # Provide the region of the domain. 
            region: "<<region>>" 
          # Enable the 'serverless' flag if the sink is an Amazon OpenSearch Serverless collection 
          # index name can be auto-generated from topic name 
          index: "index_ev_pipe-%{yyyy.MM.dd}" 
          # Enable 'distribution_version' setting if the AWS OpenSearch Service domain is of version Elasticsearch 6.x 
          #distribution_version: "es6" 
          # Enable the S3 DLQ to capture any failed requests in Ohan S3 bucket 
          dlq: 
            s3: 
            # Provide an S3 bucket 
              bucket: "<<bucket-name>>"
            # Provide a key path prefix for the failed requests
              key_path_prefix: "oss-pipeline-errors/dlq"
            # Provide the region of the bucket.
              region: "<<region>>"
            # Provide a Role ARN with access to the bucket. This role should have a trust relationship with osis-pipelines.amazonaws.com
              sts_role_arn: "arn:aws:iam::<<account-id>>:role/<<role-name>>"
      - s3:
            aws:
              # Provide a Role ARN with access to the bucket. This role should have a trust relationship with osis-pipelines.amazonaws.com
                sts_role_arn: "arn:aws:iam::<<account-id>>:role/<<role-name>>"
              # Provide the region of the domain.
                region: "<<region>>"
            # Replace with the bucket to send the logs to
            bucket: "<<bucket-name>>"
            object_key:
              # Optional path_prefix for your s3 objects
              path_prefix: "index_ev_pipe/year=%{yyyy}/month=%{MM}/day=%{dd}/hour=%{HH}"
            threshold:
              event_collect_timeout: 60s
            codec:
              parquet:
                auto_schema: true

Real-time analytics

After the data is available in OpenSearch Service, you can build real-time monitoring and notifications. OpenSearch Service has robust support for multiple notification channels, allowing you to receive alerts through services like Slack, Chime, custom webhooks, Microsoft Teams, email, and Amazon Simple Notification Service (Amazon SNS).

The following screenshot illustrates supported notification channels in OpenSearch Service.

The notification feature in OpenSearch Service allows you to create monitors that will watch for certain conditions or changes in your data and launch alerts, such as monitoring vehicle telemetry data and launching alerts for issues like battery degradation or abnormal energy consumption. For example, you can create a monitor that analyzes battery capacity over time and notifies the on-call team using Slack if capacity drops below expected degradation curves in a significant number of vehicles. This could indicate a potential manufacturing defect requiring investigation.

In addition to notifications, OpenSearch Service makes it easy to build real-time dashboards to visually track metrics across your fleet of vehicles. You can ingest vehicle telemetry data like location, speed, fuel consumption, and so on, and visualize it on maps, charts, and gauges. Dashboards can provide real-time visibility into vehicle health and performance.

The following screenshot illustrates creating a sample dashboard on OpenSearch Service

Opensearch Dashboard

A key benefit of OpenSearch Service is its ability to handle high sustained ingestion and query rates with millisecond latencies. It distributes incoming vehicle data across data nodes in a cluster for parallel processing. This allows OpenSearch to scale out to handle very large fleets while still delivering the real-time performance needed for operational visibility and alerting.

Batch analytics

After the data is available in Amazon S3, you can build a secure data lake to power a variety of analytics use cases deriving powerful insights. As an immutable store, new data is continually stored in S3 while existing data remains unaltered. This serves as a single source of truth for downstream analytics.

For business intelligence and reporting, you can analyze trends, identify insights, and create rich visualizations powered by the data lake. You can use Amazon QuickSight to build and share dashboards without needing to set up servers or infrastructure. Here’s an example of a Quicksight dashboard for IoT device data. For example, you can use a dashboard to gain insights from historical data that can help with better vehicle and battery design.

The Amazon Quicksight public gallery shows examples of dashboards across different domains.

You should consider Amazon OpenSearch dashboards for your operational day-to-day use cases to identify issues and alert in near real time whereas Amazon Quicksight should be used to analyze big data stored in a lake house and generate actionable insights from them.

Clean up

Delete the OpenSearch pipeline and Amazon MSK cluster to stop incurring costs on these services.

Conclusion

In this post, you learned how Amazon MSK, OpenSearch Ingestion, OpenSearch Services, and Amazon S3 can be integrated to ingest, process, store, analyze, and act on endless streams of EV data efficiently.

With OpenSearch Ingestion as the integration layer between streams and storage, the entire pipeline scales up and down automatically based on demand. No more complex cluster management or lost data from bursts in streams.

See Amazon OpenSearch Ingestion to learn more.


About the authors

Ayush Agrawal is a Startups Solutions Architect from Gurugram, India with 11 years of experience in Cloud Computing. With a keen interest in AI, ML, and Cloud Security, Ayush is dedicated to helping startups navigate and solve complex architectural challenges. His passion for technology drives him to constantly explore new tools and innovations. When he’s not architecting solutions, you’ll find Ayush diving into the latest tech trends, always eager to push the boundaries of what’s possible.

Fraser SequeiraFraser Sequeira is a Solutions Architect with AWS based in Mumbai, India. In his role at AWS, Fraser works closely with startups to design and build cloud-native solutions on AWS, with a focus on analytics and streaming workloads. With over 10 years of experience in cloud computing, Fraser has deep expertise in big data, real-time analytics, and building event-driven architecture on AWS.

How Zurich Insurance Group built a log management solution on AWS

Post Syndicated from Jake Obi original https://aws.amazon.com/blogs/big-data/how-zurich-insurance-group-built-a-log-management-solution-on-aws/

This post is written in collaboration with Clarisa Tavolieri, Austin Rappeport and Samantha Gignac from Zurich Insurance Group.

The growth in volume and number of logging sources has been increasing exponentially over the last few years, and will continue to increase in the coming years. As a result, customers across all industries are facing multiple challenges such as:

  • Balancing storage costs against meeting long-term log retention requirements
  • Bandwidth issues when moving logs between the cloud and on premises
  • Resource scaling and performance issues when trying to analyze massive amounts of log data
  • Keeping pace with the growing storage requirements, while also being able to provide insights from the data
  • Aligning license costs for Security Information and Event Management (SIEM) vendors with log processing, storage, and performance requirements. SIEM solutions help you implement real-time reporting by monitoring your environment for security threats and alerting on threats once detected.

Zurich Insurance Group (Zurich) is a leading multi-line insurer providing property, casualty, and life insurance solutions globally. In 2022, Zurich began a multi-year program to accelerate their digital transformation and innovation through the migration of 1,000 applications to AWS, including core insurance and SAP workloads.

The Zurich Cyber Fusion Center management team faced similar challenges, such as balancing licensing costs to ingest and long-term retention requirements for both business application log and security log data within the existing SIEM architecture. Zurich wanted to identify a log management solution to work in conjunction with their existing SIEM solution. The new approach would need to offer the flexibility to integrate new technologies such as machine learning (ML), scalability to handle long-term retention at forecasted growth levels, and provide options for cost optimization. In this post, we discuss how Zurich built a hybrid architecture on AWS incorporating AWS services to satisfy their requirements.

Solution overview

Zurich and AWS Professional Services collaborated to build an architecture that addressed decoupling long-term storage of logs, distributing analytics and alerting capabilities, and optimizing storage costs for log data. The solution was based on categorizing and prioritizing log data into priority levels between 1–3, and routing logs to different destinations based on priority. The following diagram illustrates the solution architecture.

Flow of logs from source to destination. All logs are sent to Cribl which routes portions of logs to the SIEM, portions to Amazon OpenSearch, and copies of logs to Amazon S3.

The workflow steps are as follows:

  1. All of the logs (P1, P2, and P3) are collected and ingested into an extract, transform, and load (ETL) service, AWS Partner Cribl’s Stream product, in real time. Capturing and streaming of logs is configured per use case based on the capabilities of the source, such as using built-in forwarders, installing agents, using Cribl Streams, and using AWS services like Amazon Data Firehose. This ETL service performs two functions before data reaches the analytics layer:
    1. Data normalization and aggregation – The raw log data is normalized and aggregated in the required format to perform analytics. The process consists of normalizing log field names, standardizing on JSON, removing unused or duplicate fields, and compressing to reduce storage requirements.
    2. Routing mechanism – Upon completing data normalization, the ETL service will apply necessary routing mechanisms to ingest log data to respective downstream systems based on category and priority.
  2. Priority 1 logs, such as network detection & response (NDR), endpoint detection and response (EDR), and cloud threat detection services (for example, Amazon GuardDuty), are ingested directly to the existing on-premises SIEM solution for real-time analytics and alerting.
  3. Priority 2 logs, such as operating system security logs, firewall, identity provider (IdP), email metadata, and AWS CloudTrail, are ingested into Amazon OpenSearch Service to enable the following capabilities. Previously, P2 logs were ingested into the SIEM.
    1. Systematically detect potential threats and react to a system’s state through alerting, and integrating those alerts back into Zurich’s SIEM for larger correlation, reducing by approximately 85% the amount of data ingestion into Zurich’s SIEM. Eventually, Zurich plans to use ML plugins such as anomaly detection to enhance analysis.
    2. Develop log and trace analytics solutions with interactive queries and visualize results with high adaptability and speed.
    3. Reduce the average time to ingest and average time to search that accommodates the increasing scale of log data.
    4. In the future, Zurich plans to use OpenSearch’s security analytics plugin, which can help security teams quickly detect potential security threats by using over 2,200 pre-built, publicly available Sigma security rules or create custom rules.
  4. Priority 3 logs, such as logs from enterprise applications and vulnerability scanning tools, are not ingested into the SIEM or OpenSearch Service, but are forwarded to Amazon Simple Storage Service (Amazon S3) for storage. These can be queried as needed using one-time queries.
  5. Copies of all log data (P1, P2, P3) are sent in real time to Amazon S3 for highly durable, long-term storage to satisfy the following:
    1. Long-term data retentionS3 Object Lock is used to enforce data retention per Zurich’s compliance and regulatory requirements.
    2. Cost-optimized storageLifecycle policies automatically transition data with less frequent access patterns to lower-cost Amazon S3 storage classes. Zurich also uses lifecycle policies to automatically expire objects after a predefined period. Lifecycle policies provide a mechanism to balance the cost of storing data and meeting retention requirements.
    3. Historic data analysis – Data stored in Amazon S3 can be queried to satisfy one-time audit or analysis tasks. Eventually, this data could be used to train ML models to support better anomaly detection. Zurich has done testing with Amazon SageMaker and has plans to add this capability in the near future.
  6. One-time query analysis – Simple audit use cases require historical data to be queried based on different time intervals, which can be performed using Amazon Athena and AWS Glue analytic services. By using Athena and AWS Glue, both serverless services, Zurich can perform simple queries without the heavy lifting of running and maintaining servers. Athena supports a variety of compression formats for reading and writing data. Therefore, Zurich is able to store compressed logs in Amazon S3 to achieve cost-optimized storage while still being able to perform one-time queries on the data.

As a future capability, supporting on-demand, complex query, analysis, and reporting on large historical datasets could be performed using Amazon OpenSearch Serverless. Also, OpenSearch Service supports zero-ETL integration with Amazon S3, where users can query their data stored in Amazon S3 using OpenSearch Service query capabilities.

The solution outlined in this post provides Zurich an architecture that supports scalability, resilience, cost optimization, and flexibility. We discuss these key benefits in the following sections.

Scalability

Given the volume of data currently being ingested, Zurich needed a solution that could satisfy existing requirements and provide room for growth. In this section, we discuss how Amazon S3 and OpenSearch Service help Zurich achieve scalability.

Amazon S3 is an object storage service that offers industry-leading scalability, data availability, security, and performance. The total volume of data and number of objects you can store in Amazon S3 are virtually unlimited. Based on its unique architecture, Amazon S3 is designed to exceed 99.999999999% (11 nines) of data durability. Additionally, Amazon S3 stores data redundantly across a minimum of three Availability Zones (AZs) by default, providing built-in resilience against widespread disaster. For example, the S3 Standard storage class is designed for 99.99% availability. For more information, check out the Amazon S3 FAQs.

Zurich uses AWS Partner Cribl’s Stream solution to route copies of all log information to Amazon S3 for long-term storage and retention, enabling Zurich to decouple log storage from their SIEM solution, a common challenge facing SIEM solutions today.

OpenSearch Service is a managed service that makes it straightforward to run OpenSearch without having to manage the underlying infrastructure. Zurich’s current on-premises SIEM infrastructure is comprised of more than 100 servers, all of which have to be operated and maintained. Zurich hopes to reduce this infrastructure footprint by 75% by offloading priority 2 and 3 logs from their existing SIEM solution.

To support geographies with restrictions on cross-border data transfer and to meet availability requirements, AWS and Zurich worked together to define an Amazon OpenSearch Service configuration that would support 99.9% availability using multiple AZs in a single region.

OpenSearch Service supports cross-region and cross-cluster queries, which helps with distributing analysis and processing of logs without moving data, and provides the ability to aggregate information across clusters. Since Zurich plans to deploy multiple OpenSearch domains in different regions, they will use cross-cluster search functionality to query data seamlessly across different regional domains without moving data. Zurich also configured a connector for their existing SIEM to query OpenSearch, which further allows distributed processing from on premises, and enables aggregation of data across data sources. As a result, Zurich is able to distribute processing, decouple storage, and publish key information in the form of alerts and queries to their SIEM solution without having to ship log data.

In addition, many of Zurich’s business units have logging requirements that could also be satisfied using the same AWS services (OpenSearch Service, Amazon S3, AWS Glue, and Amazon Athena). As such, the AWS components of the architecture were templatized using Infrastructure as Code (IaC) for consistent, repeatable deployment. These components are already being used across Zurich’s business units.

Cost optimization

In thinking about optimizing costs, Zurich had to consider how they would continue to ingest 5 TB per day of security log information just for their centralized security logs. In addition, lines of businesses needed similar capabilities to meet requirements, which could include processing 500 GB per day.

With this solution, Zurich can control (by offloading P2 and P3 log sources) the portion of logs that are ingested into their primary SIEM solution. As a result, Zurich has a mechanism to manage licensing costs, as well as improve the efficiency of queries by reducing the amount of information the SIEM needs to parse on search.

Because copies of all log data are going to Amazon S3, Zurich is able to take advantage of the different Amazon S3 storage tiers, such as using S3 Intelligent-Tiering to automatically move data among Infrequent Access and Archive Access tiers, to optimize the cost of retaining multiple years’ worth of log data. When data is moved to the Infrequent Access tier, costs are reduced by up to 40%. Similarly, when data is moved to the Archive Instant Access tier, storage costs are reduced by up to 68%.

Refer to Amazon S3 pricing for current pricing, as well as for information by region. Moving data to S3 Infrequent Access and Archive Access tiers provides a significant cost savings opportunity while meeting long-term retention requirements.

The team at Zurich analyzed priority 2 log sources, and based on historical analytics and query patterns, determined that only the most recent 7 days of logs are typically required. Therefore, OpenSearch Service was right-sized for retaining 7 days of logs in a hot tier. Rather than configuring UltraWarm and cold storage tiers for OpenSearch Service, copies of the remaining logs were simultaneously being sent to Amazon S3 for long-term retention and could be queried using Athena.

The combination of cost-optimization options is projected to reduce by 53% the cost of per GB of log data ingested and stored for 13 months when compared to the previous approach.

Flexibility

Another key consideration for the architecture was the flexibility to integrate with existing alerting systems and data pipelines, as well as the ability to incorporate new technology into Zurich’s log management approach. For example, Zurich also configured a connector for their existing SIEM to query OpenSearch, which further allows distributed processing from on premises and enables aggregation of data across data sources.

Within the OpenSearch Service software, there are options to expand log analysis using security analytics with predefined indicators of compromise across common log types. OpenSearch Service also offers the capability to integrate with ML capabilities such as anomaly detection and alert correlation to enhance log analysis.

With the introduction of Amazon Security Lake, there is another opportunity to expand the solution to more efficiently manage AWS logging sources and add to this architecture. For example, you can use Amazon OpenSearch Ingestion to generate security insights on security data from Amazon Security Lake.

Summary

In this post, we reviewed how Zurich was able to build a log data management architecture that provided the scalability, flexibility, performance, and cost-optimization mechanisms needed to meet their requirements.

To learn more about components of this solution, visit the Centralized Logging with OpenSearch implementation guide, review Querying AWS service logs, or run through the SIEM on Amazon OpenSearch Service workshop.


About the Authors

Clarisa Tavolieri is a Software Engineering graduate with qualifications in Business, Audit, and Strategy Consulting. With an extensive career in the financial and tech industries, she specializes in data management and has been involved in initiatives ranging from reporting to data architecture. She currently serves as the Global Head of Cyber Data Management at Zurich Group. In her role, she leads the data strategy to support the protection of company assets and implements advanced analytics to enhance and monitor cybersecurity tools.

Austin RappeportAustin Rappeport is a Computer Engineer who graduated from the University of Illinois Urbana/Champaign in 2011 with a focus in Computer Security. After graduation, he worked for the Federal Energy Regulatory Commission in the Office of Electric Reliability, working with the North American Electric Reliability Corporation’s Critical Infrastructure Protection Standards on both the audit and enforcement side, as well as standards development. Austin currently works for Zurich Insurance as the Global Head of Detection Engineering and Automation, where he leads the team responsible for using Zurich’s security tools to detect suspicious and malicious activity and improve internal processes through automation.

Samantha Gignac is a Global Security Architect at Zurich Insurance. She graduated from Ferris State University in 2014 with a Bachelor’s degree in Computer Systems & Network Engineering. With experience in the insurance, healthcare, and supply chain industries, she has held roles such as Storage Engineer, Risk Management Engineer, Vulnerability Management Engineer, and SOC Engineer. As a Cybersecurity Architect, she designs and implements secure network systems to protect organizational data and infrastructure from cyber threats.

Claire Sheridan is a Principal Solutions Architect with Amazon Web Services working with global financial services customers. She holds a PhD in Informatics and has more than 15 years of industry experience in tech. She loves traveling and visiting art galleries.

Jake Obi is a Principal Security Consultant with Amazon Web Services based in South Carolina, US, with over 20 years’ experience in information technology. He helps financial services customers improve their security posture in the cloud. Prior to joining Amazon, Jake was an Information Assurance Manager for the US Navy, where he worked on a large satellite communications program as well as hosting government websites using the public cloud.

Srikanth Daggumalli is an Analytics Specialist Solutions Architect in AWS. Out of 18 years of experience, he has over a decade of experience in architecting cost-effective, performant, and secure enterprise applications that improve customer reachability and experience, using big data, AI/ML, cloud, and security technologies. He has built high-performing data platforms for major financial institutions, enabling improved customer reach and exceptional experiences. He is specialized in services like cross-border transactions and architecting robust analytics platforms.

Freddy Kasprzykowski is a Senior Security Consultant with Amazon Web Services based in Florida, US, with over 20 years’ experience in information technology. He helps customers adopt AWS services securely according to industry best practices, standards, and compliance regulations. He is a member of the Customer Incident Response Team (CIRT), helping customers during security events, a seasoned speaker at AWS re:Invent and AWS re:Inforce conferences, and a contributor to open source projects related to AWS security.

AWS Weekly Roundup: Advanced capabilities in Amazon Bedrock and Amazon Q, and more (July 15, 2024).

Post Syndicated from Abhishek Gupta original https://aws.amazon.com/blogs/aws/aws-weekly-roundup-advanced-capabilities-in-amazon-bedrock-and-amazon-q-and-more-july-15-2024/

As expected, there were lots of exciting launches and updates announced during the AWS Summit New York. You can quickly scan the highlights in Top Announcements of the AWS Summit in New York, 2024.

NY-Summit-feat-img

My colleagues and fellow AWS News Blog writers Veliswa Boya and Sébastien Stormacq were at the AWS Community Day Cameroon last week. They were energized to meet amazing professionals, mentors, and students – all willing to learn and exchange thoughts about cloud technologies. You can access the video replay to feel the vibes or just watch some of the talks!

AWS Community Day Cameroon 2024

Last week’s launches
In addition to the launches at the New York Summit, here are a few others that got my attention.

Advanced RAG capabilities Knowledge Bases for Amazon Bedrock – These include custom chunking options to enable customers to write their own chunking code as a Lambda function; smart parsing to extract information from complex data such as tables; and query reformulation to break down queries into simpler sub-queries, retrieve relevant information for each, and combine the results into a final comprehensive answer.

Amazon Bedrock Prompt Management and Prompt Flows – This is a preview launch of Prompt Management that help developers and prompt engineers get the best responses from foundation models for their use cases; and Prompt Flows accelerates the creation, testing, and deployment of workflows through an intuitive visual builder.

Fine-tuning for Anthropic’s Claude 3 Haiku in Amazon Bedrock (preview) – By providing your own task-specific training dataset, you can fine tune and customize Claude 3 Haiku to boost model accuracy, quality, and consistency to further tailor generative AI for your business.

IDE workspace context awareness in Amazon Q Developer chat – Users can now add @workspace to their chat message in Q Developer to ask questions about the code in the project they currently have open in the IDE. Q Developer automatically ingests and indexes all code files, configurations, and project structure, giving the chat comprehensive context across your entire application within the IDE.

New features in Amazon Q Business –  The new personalization capabilities in Amazon Q Business are automatically enabled and will use your enterprise’s employee profile data to improve their user experience. You can now get answers from text content in scanned PDFs, and images embedded in PDF documents, without having to use OCR for preprocessing and text extraction.

Amazon EC2 R8g instances powered by AWS Graviton4 are now generally available – Amazon EC2 R8g instances are ideal for memory-intensive workloads such as databases, in-memory caches, and real-time big data analytics. These are powered by AWS Graviton4 processors and deliver up to 30% better performance compared to AWS Graviton3-based instances.

Vector search for Amazon MemoryDB is now generally available – Vector search for MemoryDB enables real-time machine learning (ML) and generative AI applications. It can store millions of vectors with single-digit millisecond query and update latencies at the highest levels of throughput with >99% recall.

Introducing Valkey GLIDE, an open source client library for Valkey and Redis open sourceValkey is an open source key-value data store that supports a variety of workloads such as caching, and message queues. Valkey GLIDE is one of the official client libraries for Valkey and it supports all Valkey commands. GLIDE supports Valkey 7.2 and above, and Redis open source 6.2, 7.0, and 7.2.

Amazon OpenSearch Service enhancementsAmazon OpenSearch Serverless now supports workloads up to 30TB of data for time-series collections enabling more data-intensive use cases, and an innovative caching mechanism that automatically fetches and intelligently manages data, leading to faster data retrieval, efficient storage usage, and cost savings. Amazon OpenSearch Service has now added support for AI powered Natural Language Query Generation in OpenSearch Dashboards Log Explorer so you can get started quickly with log analysis without first having to be proficient in PPL.

Open source release of Secrets Manager Agent for AWS Secrets Manager – Secrets Manager Agent is a language agnostic local HTTP service that you can install and use in your compute environments to read secrets from Secrets Manager and cache them in memory, instead of making a network call to Secrets Manager.

Amazon S3 Express One Zone now supports logging of all events in AWS CloudTrail – This capability lets you get details on who made API calls to S3 Express One Zone and when API calls were made, thereby enhancing data visibility for governance, compliance, and operational auditing.

Amazon CloudFront announces managed cache policies for web applications – Previously, Amazon CloudFront customers had two options for managed cache policies, and had to create custom cache policies for all other cases. With the new managed cache policies, CloudFront caches content based on the Cache-Control headers returned by the origin, and defaults to not caching when the header is not returned.

For a full list of AWS announcements, be sure to keep an eye on the What’s New at AWS page.

We launched existing services in additional Regions:

Other AWS news
Here are some additional projects, blog posts, and news items that you might find interesting:

Context window overflow: Breaking the barrierThis blog post dives into intricate workings of generative artificial intelligence (AI) models, and why is it crucial to understand and mitigate the limitations of CWO (context window overflow).

Using Agents for Amazon Bedrock to interactively generate infrastructure as code – This blog post explores how Agents for Amazon Bedrock can be used to generate customized, organization standards-compliant IaC scripts directly from uploaded architecture diagrams.

Automating model customization in Amazon Bedrock with AWS Step Functions workflow – This blog post covers orchestrating repeatable and automated workflows for customizing Amazon Bedrock models and how AWS Step Functions can help overcome key pain points in model customization.

AWS open source news and updates – My colleague Ricardo Sueiras writes about open source projects, tools, and events from the AWS Community; check out Ricardo’s page for the latest updates.

Upcoming AWS events
Check your calendars and sign up for upcoming AWS events:

AWS Summits – Join free online and in-person events that bring the cloud computing community together to connect, collaborate, and learn about AWS. To learn more about future AWS Summit events, visit the AWS Summit page. Register in your nearest city: Bogotá (July 18), Taipei (July 23–24), AWS Summit Mexico City (Aug. 7), and AWS Summit Sao Paulo (Aug. 15).

AWS Community Days – Join community-led conferences that feature technical discussions, workshops, and hands-on labs led by expert AWS users and industry leaders from around the world. Upcoming AWS Community Days are in Aotearoa (Aug. 15), Nigeria (Aug. 24), New York (Aug. 28), and Belfast (Sept. 6).

Browse all upcoming AWS led in-person and virtual events and developer-focused events.

That’s all for this week. Check back next Monday for another Weekly Roundup!

— Abhishek

This post is part of our Weekly Roundup series. Check back each week for a quick roundup of interesting news and announcements from AWS!

Protein similarity search using ProtT5-XL-UniRef50 and Amazon OpenSearch Service

Post Syndicated from Camillo Anania original https://aws.amazon.com/blogs/big-data/protein-similarity-search-using-prott5-xl-uniref50-and-amazon-opensearch-service/

A protein is a sequence of amino acids that, when chained together, creates a 3D structure. This 3D structure allows the protein to bind to other structures within the body and initiate changes. This binding is core to the working of many drugs.

A common workflow within drug discovery is searching for similar proteins, because similar proteins likely have similar properties. Given an initial protein, researchers often look for variations that exhibit stronger binding, better solubility, or reduced toxicity. Despite advances in protein structure prediction, it’s still sometimes necessary to predict protein properties based on sequence alone. Thus, there is a need to quickly and at-scale get similar sequences based on an input sequence. In this blog post, we propose a solution based on Amazon OpenSearch Service for similarity search and the pretrained model ProtT5-XL-UniRef50, which we will use to generate embeddings. A repository providing such solution is available here. ProtT5-XL-UniRef50 is based on the t5-3b model and was pretrained on a large corpus of protein sequences in a self-supervised fashion.

Before diving into our solution, it’s important to understand what embeddings are and why they’re crucial for our task. Embeddings are dense vector representations of objects—proteins in our case—that capture the essence of their properties in a continuous vector space. An embedding is essentially a compact vector representation that encapsulates the significant features of an object, making it easier to process and analyze. Embeddings play an important role in understanding and processing complex data. They not only reduce dimensionality but also capture and encode intrinsic properties. This means that objects (such as words or proteins) with similar characteristics result in embeddings that are closer in the vector space. This proximity allows us to perform similarity searches efficiently, making embeddings invaluable for identifying relationships and patterns in large datasets.

Consider the analogy of fruits and their properties. In an embedding space, fruits such as mandarins and oranges would be close to each other because they share some characteristics, such as being round, color, and having similar nutritional properties. Similarly, bananas would be close to plantains, reflecting their shared properties. Through embeddings, we can understand and explore these relationships intuitively.

ProtT5-XL-UniRef50 is a machine learning (ML) model specifically designed to understand the language of proteins by converting protein sequences into multidimensional embeddings. These embeddings capture biological properties, allowing us to identify proteins with similar functions or structures in a multi-dimensional space because similar proteins will be encoded close together. This direct encoding of proteins into embeddings is crucial for our similarity search, providing a robust foundation for identifying potential drug targets or understanding protein functions.

Embeddings for the UniProtKB/Swiss-Prot protein database, which we use for this post, have been pre-computed and are available for download. If you have your own novel proteins, you can compute embeddings using ProtT5-XL-UniRef50, and then use these pre-computed embeddings to find known proteins with similar properties

In this post, we outline the broad functionalities of the solution and its components. Following this, we provide a brief explanation of what embeddings are, discussing the specific model used in our example. We then show how you can run this model on Amazon SageMaker. In addition, we dive into how to use the OpenSearch Service as a vector database. Finally, we demonstrate some practical examples of running similarity searches on protein sequences.

Solution overview

Let’s walk through the solution and all its components. Code for this solution is available on GitHub.

  1. We use OpenSearch Service vector database (DB) capabilities to store a sample of 20 thousand pre-calculated embeddings. These will be used to demonstrate similarity search. OpenSearch Service has advanced vector DB capabilities supporting multiple popular vector DB algorithms. For an overview of such capabilities see Amazon OpenSearch Service’s vector database capabilities explained.
  2. The open source prot_t5_xl_uniref50 ML model, hosted on Huggingface Hub, was used to calculate protein embeddings. We use the SageMaker Huggingface Inference Toolkit to quickly customize and deploy the model on SageMaker.
  3. The model is deployed and the solution is ready to calculate embeddings on any input protein sequence and perform similarity search against the protein embeddings we have preloaded on OpenSearch Service.
  4. We use a SageMaker Studio notebook to show how to deploy the model on SageMaker and then use an endpoint to extract protein features in the form of embeddings.
  5. After we have generated the embeddings in real time from the SageMaker endpoint, we run a query on OpenSearch Service to determine the five most similar proteins currently stored on OpenSearch Service index.
  6. Finally, the user can see the result directly from the SageMaker Studio notebook.
  7. To understand if the similarity search works well, we choose the Immunoglobulin Heavy Diversity 2/OR15-2A protein and we calculate its embeddings. The embeddings returned by the model are pre-residue, which is a detailed level of analysis where each individual residue (amino acid) in the protein is considered. In our case, we want to focus on the overall structure, function, and properties of the protein, so we calculate the per-protein embeddings. We achieve that by doing dimensionality reduction, calculating the mean overall per-residue features. Finally, we use the resulting embeddings to perform a similarity search and the first five proteins ordered by similarity are:
    • Immunoglobulin Heavy Diversity 3/OR15-3A
    • T Cell Receptor Gamma Joining 2
    • T Cell Receptor Alpha Joining 1
    • T Cell Receptor Alpha Joining 11
    • T Cell Receptor Alpha Joining 50

These are all immune cells with T cell receptors being a subtype of immunoglobulin. The similarity surfaced proteins that are all bio-functionally similar.

Costs and clean up

The solution we just walked through creates an OpenSearch Service domain which is billed according to number and instance type selected during creation time, see the OpenSearch Service Pricing page for the rate of those. You will also be charged for the SageMaker endpoint created by the deploy-and-similarity-search notebook, which is currently using a ml.g4dn.8xlarge instance type. See SageMaker pricing for details.

Finally, you are charged for the SageMaker Studio Notebooks according to the instance type you are using as detailed on the pricing page.

To clean up the resources created by this solution:

Conclusion

In this blog post we described a solution capable of calculating protein embeddings and performing similarity searches to find similar proteins. The solution uses the open source ProtT5-XL-UniRef50 model to calculate the embeddings and it deploys it on SageMaker Inference. We used OpenSearch Service as the vector DB. OpenSearch Service is pre-populated with 20 thousand human proteins from UniProt. Finally, the solution was validated by performing a similarity search on the Immunoglobulin Heavy Diversity 2/OR15-2A protein. We successfully evaluated that the proteins returned from OpenSearch Service are all in the immunoglobulin family and are bio-functionally similar. Code for this solution is available in GitHub.

The solution can be further tuned by testing different supported OpenSearch Service KNN algorithms and scaled by importing additional protein embeddings into OpenSearch Service indexes.

Resources:

  • Elnaggar A, et al. “ProtTrans: Toward Understanding the Language of Life Through Self-Supervised Learning”. IEEE Trans Pattern Anal Mach Intell. 2020.
  • Mikolov, T.; Yih, W.; Zweig, G. “Linguistic Regularities in Continuous Space Word Representations”. HLT-Naacl: 746–751. 2013.

About the Authors

that's meCamillo Anania is a Senior Solutions Architect at AWS. He is a tech enthusiast who loves helping healthcare and life science startups get the most out of the cloud. With a knack for cloud technologies, he’s all about making sure these startups can thrive and grow by leveraging the best cloud solutions. He is excited about the new wave of use cases and possibilities unlocked by GenAI and does not miss a chance to dive into them.

Adam McCarthy is the EMEA Tech Leader for Healthcare and Life Sciences Startups at AWS. He has over 15 years’ experience researching and implementing machine learning, HPC, and scientific computing environments, especially in academia, hospitals, and drug discovery.

Improve your Amazon OpenSearch Service performance with OpenSearch Optimized Instances

Post Syndicated from Jatinder Singh original https://aws.amazon.com/blogs/big-data/improve-your-amazon-opensearch-service-performance-with-opensearch-optimized-instances/

Amazon OpenSearch Service introduced the OpenSearch Optimized Instances (OR1), deliver price-performance improvement over existing instances. The newly introduced OR1 instances are ideally tailored for heavy indexing use cases like log analytics and observability workloads.

OR1 instances use a local and a remote store. The local storage utilizes either Amazon Elastic Block Store (Amazon EBS) of type gp3 or io1 volumes, and the remote storage uses Amazon Simple Storage Service (Amazon S3). For more details about OR1 instances, refer to Amazon OpenSearch Service Under the Hood: OpenSearch Optimized Instances (OR1).

In this post, we conduct experiments using OpenSearch Benchmark to demonstrate how the OR1 instance family improves indexing throughput and overall domain performance.

Getting started with OpenSearch Benchmark

OpenSearch Benchmark, a tool provided by the OpenSearch Project, comprehensively gathers performance metrics from OpenSearch clusters, including indexing throughput and search latency. Whether you’re tracking overall cluster performance, informing upgrade decisions, or assessing the impact of workflow changes, this utility proves invaluable.

In this post, we compare the performance of two clusters: one powered by memory-optimized instances and the other by OR1 instances. The dataset comprises HTTP server logs from the 1998 World Cup website. With the OpenSearch Benchmark tool, we conduct experiments to assess various performance metrics, such as indexing throughput, search latency, and overall cluster efficiency. Our aim is to determine the most suitable configuration for our specific workload requirements.

You can install OpenSearch Benchmark directly on a host running Linux or macOS, or you can run OpenSearch Benchmark in a Docker container on any compatible host.

OpenSearch Benchmark includes a set of workloads that you can use to benchmark your cluster performance. Workloads contain descriptions of one or more benchmarking scenarios that use a specific document corpus to perform a benchmark against your cluster. The document corpus contains indexes, data files, and operations invoked when the workflow runs.

When assessing your cluster’s performance, it is recommended to use a workload similar to your cluster’s use cases, which can save you time and effort. Consider the following criteria to determine the best workload for benchmarking your cluster:

  • Use case – Selecting a workload that mirrors your cluster’s real-world use case is essential for accurate benchmarking. By simulating heavy search or indexing tasks typical for your cluster, you can pinpoint performance issues and optimize settings effectively. This approach makes sure benchmarking results closely match actual performance expectations, leading to more reliable optimization decisions tailored to your specific workload needs.
  • Data – Use a data structure similar to that of your production workloads. OpenSearch Benchmark provides examples of documents within each workload to understand the mapping and compare with your own data mapping and structure. Every benchmark workload is composed of the following directories and files for you to compare data types and index mappings.
  • Query types – Understanding your query pattern is crucial for detecting the most frequent search query types within your cluster. Employing a similar query pattern for your benchmarking experiments is essential.

Solution overview

The following diagram explains how OpenSearch Benchmark connects to your OpenSearch domain to run workload benchmarks.Scope of solution

The workflow comprises the following steps:

  1. The first step involves running OpenSearch Benchmark using a specific workload from the workloads repository. The invoke operation collects data about the performance of your OpenSearch cluster according to the selected workload.
  2. OpenSearch Benchmark ingests the workload dataset into your OpenSearch Service domain.
  3. OpenSearch Benchmark runs a set of predefined test procedures to capture OpenSearch Service performance metrics.
  4. When the workload is complete, OpenSearch Benchmark outputs all related metrics to measure the workload performance. Metric records are by default stored in memory, or you can set up an OpenSearch Service domain to store the generated metrics and compare multiple workload executions.

In this post, we used the http_logs workload to conduct performance benchmarking. The dataset comprises 247 million documents designed for ingestion and offers a set of sample queries for benchmarking. Follow the steps outlined in the OpenSearch Benchmark User Guide to deploy OpenSearch Benchmark and run the http_logs workload.

Prerequisites

You should have the following prerequisites:

In this post, we deployed OpenSearch Benchmark in an AWS Cloud9 host using an Amazon Linux 2 instance type m6i.2xlarge with a capacity of 8 vCPUs, 32 GiB memory, and 512 TiB storage.

Performance analysis using the OR1 instance type in OpenSearch Service

In this post, we conducted a performance comparison between two different configurations of OpenSearch Service:

  • Configuration 1 – Cluster manager nodes and three data nodes of memory-optimized r6g.large instances
  • Configuration 2 – Cluster manager nodes and three data nodes of or1.larges instances

In both configurations, we use the same number and type of cluster manager nodes: three c6g.xlarge.

You can set up different configurations with the supported instance types in OpenSearch Service to run performance benchmarks.

The following table summarizes our OpenSearch Service configuration details.

  Configuration 1 Configuration 2
Number of cluster manager nodes 3 3
Type of cluster manager nodes c6g.xlarge c6g.xlarge
Number of data nodes 3 3
Type of data node r6g.large or1.large
Data node: EBS volume size (GP3) 200 GB 200 GB
Multi-AZ with standby enabled Yes Yes

Now let’s examine the performance details between the two configurations.

Performance benchmark comparison

The http_logs dataset contains HTTP server logs from the 1998 World Cup website between April 30, 1998 and July 26, 1998. Each request consists of a timestamp field, client ID, object ID, size of the request, method, status, and more. The uncompressed size of the dataset is 31.1 GB with 247 million JSON documents. The amount of load sent to both domain configurations is identical. The following table displays the amount of time taken to run various aspects of an OpenSearch workload on our two configurations.

Category Metric Name

Configuration 1

(3* r6g.large data nodes)

Runtimes

Configuration 2

(3* or1.large data nodes)

Runtimes

Performance Difference
Indexing Cumulative indexing time of primary shards 207.93 min 142.50 min 31%
Indexing Cumulative flush time of primary shards 21.17 min 2.31 min 89%
Garbage Collection Total Young Gen GC time 43.14 sec 24.57 sec 43%
bulk-index-append p99 latency 10857.2 ms 2455.12 ms 77%
query-Mean Throughput 29.76 ops/sec 36.24 ops/sec 22%
query-match_all(default) p99 latency 40.75 ms 32.99 ms 19%
query-term p99 latency 7675.54 ms 4183.19 ms 45%
query-range p99 latency 59.5316 ms 51.2864 ms 14%
query-hourly_aggregation p99 latency 5308.46 ms 2985.18 ms 44%
query-multi_term_aggregation p99 latency 8506.4 ms 4264.44 ms 50%

The benchmarks show a notable enhancement across various performance metrics. Specifically, OR1.large data nodes demonstrate a 31% reduction in indexing time for primary shards compared to r6g.large data nodes. OR1.large data nodes also exhibit a 43% improvement in garbage collection efficiency and significant enhancements in query performance, including term, range, and aggregation queries.

The extent of improvement depends on the workload. Therefore, make sure to run custom workloads as expected in your production environments in terms of indexing throughput, type of search queries, and concurrent requests.

Migration journey to OR1

The OR1 instance family is available in OpenSearch Service 2.11 or higher. Usually, if you’re using OpenSearch Service and you want to benefit from new released features in a specific version, you would follow the supported upgrade paths to upgrade your domain.

However, to use the OR1 instance type, you need to create a new domain with OR1 instances and then migrate your existing domain to the new domain. The migration journey to OpenSearch Service domain using an OR1 instance is similar to a typical OpenSearch Service migration scenario. Critical aspects involve determining the appropriate size for the target environment, selecting suitable data migration methods, and devising a seamless cutover strategy. These elements provide optimal performance, smooth data transition, and minimal disruption throughout the migration process.

To migrate data to a new OR1 domain, you can use the snapshot restore option or use Amazon OpenSearch Ingestion to migrate the data for your source.

For instructions on migration, refer to Migrating to Amazon OpenSearch Service.

Clean up

To avoid incurring continued AWS usage charges, make sure you delete all the resources you created as part of this post, including your OpenSearch Service domain.

Conclusion

In this post, we ran a benchmark to review the performance of the OR1 instance family compared to the memory-optimized r6g instance. We used OpenSearch Benchmark, a comprehensive tool for gathering performance metrics from OpenSearch clusters.

Learn more about how OR1 instances work and experiment with OpenSearch Benchmark to make sure your OpenSearch Service configuration matches your workload demand.


About the Authors

Jatinder Singh is a Senior Technical Account Manager at AWS and finds satisfaction in aiding customers in their cloud migration and innovation endeavors. Beyond his professional life, he relishes spending moments with his family and indulging in hobbies such as reading, culinary pursuits, and playing chess.

Hajer Bouafif is an Analytics Specialist Solutions Architect at Amazon Web Services. She focuses on Amazon OpenSearch Service and helps customers design and build well-architected analytics workloads in diverse industries. Hajer enjoys spending time outdoors and discovering new cultures.

Puneetha Kumara is a Senior Technical Account Manager at AWS, with over 15 years of industry experience, including roles in cloud architecture, systems engineering, and container orchestration.

Manpreet Kour is a Senior Technical Account Manager at AWS and is dedicated to ensuring customer satisfaction. Her approach involves a deep understanding of customer objectives, aligning them with software capabilities, and effectively driving customer success. Outside of her professional endeavors, she enjoys traveling and spending quality time with her family.

Introducing self-managed data sources for Amazon OpenSearch Ingestion

Post Syndicated from Muthu Pitchaimani original https://aws.amazon.com/blogs/big-data/introducing-self-managed-data-sources-for-amazon-opensearch-ingestion/

Enterprise customers increasingly adopt Amazon OpenSearch Ingestion (OSI) to bring data into Amazon OpenSearch Service for various use cases. These include petabyte-scale log analytics, real-time streaming, security analytics, and searching semi-structured key-value or document data. OSI makes it simple, with straightforward integrations, to ingest data from many AWS services, including Amazon DynamoDB, Amazon Simple Storage Service (Amazon S3), Amazon Managed Streaming for Apache Kafka (Amazon MSK), and Amazon DocumentDB (with MongoDB compatibility).

Today we are announcing support for ingesting data from self-managed OpenSearch/Elasticsearch and Apache Kafka clusters. These sources can either be on Amazon Elastic Compute Cloud (Amazon EC2) or on-premises environments.

In this post, we outline the steps to get started with these sources.

Solution overview

OSI supports the AWS Cloud Development Kit (AWS CDK), AWS CloudFormation, the AWS Command Line Interface (AWS CLI), Terraform, AWS APIs, and the AWS Management Console to deploy pipelines. In this post, we use the console to demonstrate how to create a self-managed Kafka pipeline.

Prerequisites

To make sure OSI can connect and read data successfully, the following conditions should be met:

  • Network connectivity to data sources – OSI is generally deployed in a public network, such as the internet, or in a virtual private cloud (VPC). OSI deployed in a customer VPC is able to access data sources in the same or different VPC and on the internet with an attached internet gateway. If your data sources are in another VPC, common methods for network connectivity include direct VPC peering, using a transit gateway, or using customer managed VPC endpoints powered by AWS PrivateLink. If your data sources are on your corporate data center or other on-premises environment, common methods for network connectivity include AWS Direct Connect and using a network hub like a transit gateway. The following diagram shows a sample configuration of OSI running in a VPC and using Amazon OpenSearch Service as a sink. OSI runs in a service VPC and creates an Elastic Network interface (ENI) in the customer VPC. For self-managed data source these ENIs are used for reading data from on-premises environment. OSI creates an VPC endpoint in the service VPC to send data to the sink.
  • Name resolution for data sources – OSI uses an Amazon Route 53 resolver. This resolver automatically answers queries to names local to a VPC, public domain names on the internet, and records hosted in private hosted zones. If you’re are using a private hosted zone, make sure you have a DHCP option set enabled, attached to the VPC using AmazonProvidedDNS as domain name server. For more information, see Work with DHCP option sets. Additionally, you can use resolver inbound and outbound endpoints if you need a complex resolution schemes with conditions that are beyond a simple private hosted zone.
  • Certificate verification for data source names – OSI supports only SASL_SSL for transport for Apache Kafka source. Within SASL, Amazon OpenSearch Service supports most authentication mechanisms like PLAIN, SCRAM, IAM, GSAPI and others. When using SASL_SSL, make sure you have access to certificates needed for OSI to authenticate. For self-managed OpenSearch data sources, make sure verifiable certificates are installed on the clusters. Amazon OpenSearch Service doesn’t support insecure communication between OSI and OpenSearch. Certificate verification cannot be turned off. In particular, the “insecure” configuration option is not supported.
  • Access to AWS Secrets Manager – OSI uses AWS Secrets Manager to retrieve credentials and certificates needed to communicate with self-managed data sources. For more information, see Create and manage secrets with AWS Secrets Manager.
  • IAM role for pipelines – You need an AWS Identity and Access Management (IAM) pipeline role to write to data sinks. For more information, see Identity and Access Management for Amazon OpenSearch Ingestion.

Create a pipeline with self-managed Kafka as a source

After you complete the prerequisites, you’re ready to create a pipeline for your data source. Complete the following steps:

  1. On the OpenSearch Service console, choose Pipelines under Ingestion in the navigation pane.
  2. Choose Create pipeline.
  3. Choose Streaming under Use case in the navigation pane.
  4. Select Self managed Apache Kafka under Ingestion pipeline blueprints and choose Select blueprint.

This will populate a sample configuration for this pipeline.

  1. Provide a name for this pipeline and choose the appropriate pipeline capacity.
  2. Under Pipeline configuration, provide your pipeline configuration in YAML format. The following code snippet shows sample configuration in YAML for SASL_SSL authentication:
    version: '2'
    kafka-pipeline:
      source:
        kafka:
          acknowledgments: true
          bootstrap_servers:
            - 'node-0.example.com:9092'
          encryption:
            type: "ssl"
            certificate: '${{aws_secrets:kafka-cert}}'
            
          authentication:
            sasl:
              plain:
                username: '${{aws_secrets:secrets:username}}'
                password: '${{aws_secrets:secrets:password}}'
          topics:
            - name: on-prem-topic
              group_id: osi-group-1
      processor:
        - grok:
            match:
              message:
                - '%{COMMONAPACHELOG}'
        - date:
            destination: '@timestamp'
            from_time_received: true
      sink:
        - opensearch:
            hosts: ["https://search-domain-12345567890.us-east-1.es.amazonaws.com"]
            aws:
              region: us-east-1
              sts_role_arn: 'arn:aws:iam::123456789012:role/pipeline-role'
            index: "on-prem-kakfa-index"
    extension:
      aws:
        secrets:
          kafka-cert:
            secret_id: kafka-cert
            region: us-east-1
            sts_role_arn: 'arn:aws:iam::123456789012:role/pipeline-role'
          secrets:
            secret_id: secrets
            region: us-east-1
            sts_role_arn: 'arn:aws:iam::123456789012:role/pipeline-role'

  1. Choose Validate pipeline and confirm there are no errors.
  2. Under Network configuration, choose Public access or VPC access. (For this post, we choose VPC access).
  3. If you chose VPC access, specify your VPC, subnets, and an appropriate security group so OSI can reach the outgoing ports for the data source.
  4. Under VPC attachment options, select Attach to VPC and choose an appropriate CIDR range.

OSI resources are created in a service VPC managed by AWS that is separate from the VPC you chose in the last step. This selection allows you to configure what CIDR ranges OSI should use inside this service VPC. The choice exists so you can make sure there is no address collision between CIDR ranges in your VPC that is attached to your on-premises network and this service VPC. Many pipelines in your account can share same CIDR ranges for this service VPC.

  1. Specify any optional tags and log publishing options, then choose Next.
  2. Review the configuration and choose Create pipeline.

You can monitor the pipeline creation and any log messages in the Amazon CloudWatch Logs log group you specified. Your pipeline should now be successfully created. For more information about how to provision capacity for the performance of this pipeline, see the section Recommended Compute Units (OCUs) for the MSK pipeline in Introducing Amazon MSK as a source for Amazon OpenSearch Ingestion.

Create a pipeline with self-managed OpenSearch as a source

The steps for creating a pipeline for self-managed OpenSearch are similar to the steps for creating one for Kafka. During the blueprint selection, choose Data Migration under Use case and select Self managed OpenSearch/Elasticsearch. OpenSearch Ingestion can source data from all versions of OpenSearch and Elasticsearch from version 7.0  to  version 7.10.

The following blueprint shows a sample configuration YAML for this data source:

version: "2"
opensearch-migration-pipeline:
  source:
    opensearch:
      acknowledgments: true
      hosts: [ "https://node-0.example.com:9200" ]
      username: "${{aws_secrets:secret:username}}"
      password: "${{aws_secrets:secret:password}}"
      indices:
        include:
        - index_name_regex: "opensearch_dashboards_sample_data*"
        exclude:
          - index_name_regex: '\..*'
  sink:
    - opensearch:
        hosts: [ "https://search-domain-12345567890.us-east-1.es.amazonaws.com" ]
        aws:
          sts_role_arn: "arn:aws:iam::123456789012:role/pipeline-role"
          region: "us-east-1"
        index: "on-prem-os"
extension:
  aws:
    secrets:
      secret:
        secret_id: "self-managed-os-credentials"
        region: "us-east-1"
        sts_role_arn: "arn:aws:iam::123456789012:role/pipeline-role"
        refresh_interval: PT1H

Considerations for self-managed OpenSearch data source

Certificates installed on the OpenSearch cluster need to be verifiable for OSI to connect to this data source before reading data. Insecure connections are currently not supported.

After you’re connected, make sure the cluster has sufficient read bandwidth to allow for OSI to read data. Use the Min and Max OCU setting to limit OSI read bandwidth consumption. Your read bandwidth will vary depending upon data volume, number of indexes, and provisioned OCU capacity. Start small and increase the number of OCUs to balance between available bandwidth and acceptable migration time.

This source is typically meant for one-time migration of data and not as continuous ingestion to keep data in sync between data sources and sinks.

OpenSearch Service domains support remote reindexing, but that consumes resources in your domains. Using OSI will move this compute out of the domain, and OSI can achieve significantly higher bandwidth than remote reindexing, thereby resulting in faster migration times.

OSI doesn’t support deferred replay or traffic recording today; refer to Migration Assistant for Amazon OpenSearch Service if your migration needs those capabilities.

Conclusion

In this post, we introduced self-managed sources for OpenSearch Ingestion that enable you to ingest data from corporate data centers or other on-premises environments. OSI also supports various other data sources and integrations. Refer to Working with Amazon OpenSearch Ingestion pipeline integrations to learn about these other data sources.


About the Authors

Muthu Pitchaimani is a Search Specialist with Amazon OpenSearch Service. He builds large-scale search applications and solutions. Muthu is interested in the topics of networking and security, and is based out of Austin, Texas.

Arjun Nambiar is a Product Manager with Amazon OpenSearch Service. He focuses on ingestion technologies that enable ingesting data from a wide variety of sources into Amazon OpenSearch Service at scale. Arjun is interested in large-scale distributed systems and cloud-centered technologies, and is based out of Seattle, Washington.

Build a real-time streaming generative AI application using Amazon Bedrock, Amazon Managed Service for Apache Flink, and Amazon Kinesis Data Streams

Post Syndicated from Felix John original https://aws.amazon.com/blogs/big-data/build-a-real-time-streaming-generative-ai-application-using-amazon-bedrock-amazon-managed-service-for-apache-flink-and-amazon-kinesis-data-streams/

Generative artificial intelligence (AI) has gained a lot of traction in 2024, especially around large language models (LLMs) that enable intelligent chatbot solutions. Amazon Bedrock is a fully managed service that offers a choice of high-performing foundation models (FMs) from leading AI companies such as AI21 Labs, Anthropic, Cohere, Meta, Mistral AI, Stability AI, and Amazon through a single API, along with a broad set of capabilities to help you build generative AI applications with security, privacy, and responsible AI. Use cases around generative AI are vast and go well beyond chatbot applications; for instance, generative AI can be used for analysis of input data such as sentiment analysis of reviews.

Most businesses generate data continuously in real-time. Internet of Things (IoT) sensor data, application log data from your applications, or clickstream data generated by users of your website are only some examples of continuously generated data. In many situations, the ability to process this data quickly (in real-time or near real-time) helps businesses increase the value of insights they get from their data.

One option to process data in real-time is using stream processing frameworks such as Apache Flink. Flink is a framework and distributed processing engine for processing data streams. AWS provides a fully managed service for Apache Flink through Amazon Managed Service for Apache Flink, which enables you to build and deploy sophisticated streaming applications without setting up infrastructure and managing resources.

Data streaming enables generative AI to take advantage of real-time data and provide businesses with rapid insights. This post looks at how to integrate generative AI capabilities when implementing a streaming architecture on AWS using managed services such as Managed Service for Apache Flink and Amazon Kinesis Data Streams for processing streaming data and Amazon Bedrock to utilize generative AI capabilities. We focus on the use case of deriving review sentiment in real-time from customer reviews in online shops. We include a reference architecture and a step-by-step guide on infrastructure setup and sample code for implementing the solution with the AWS Cloud Development Kit (AWS CDK). You can find the code to try it out yourself on the GitHub repo.

Solution overview

The following diagram illustrates the solution architecture. The architecture diagram depicts the real-time streaming pipeline in the upper half and the details on how you gain access to the Amazon OpenSearch Service dashboard in the lower half.

Architecture Overview

The real-time streaming pipeline consists of a producer that is simulated by running a Python script locally that is sending reviews to a Kinesis Data Stream. The reviews are from the Large Movie Review Dataset and contain positive or negative sentiment. The next step is the ingestion to the Managed Service for Apache Flink application. From within Flink, we are asynchronously calling Amazon Bedrock (using Anthropic Claude 3 Haiku) to process the review data. The results are then ingested into an OpenSearch Service cluster for visualization with OpenSearch Dashboards. We directly call the PutRecords API of Kinesis Data Streams within the Python script for the sake of simplicity and to cost-effectively run this example. You should consider using an Amazon API Gateway REST API as a proxy in front of Kinesis Data Streams when using a similar architecture in production, as described in Streaming Data Solution for Amazon Kinesis.

To gain access to the OpenSearch dashboard, we need to use a bastion host that is deployed in the same private subnet within your virtual private cloud (VPC) as your OpenSearch Service cluster. To connect with the bastion host, we use Session Manager, a capability of Amazon Systems Manager, which allows us to connect to our bastion host securely without having to open inbound ports. To access it, we use Session Manager to port forward the OpenSearch dashboard to our localhost.

The walkthrough consists of the following high-level steps:

  1. Create the Flink application by building the JAR file.
  2. Deploy the AWS CDK stack.
  3. Set up and connect to OpenSearch Dashboards.
  4. Set up the streaming producer.

Prerequisites

For this walkthrough, you should have the following prerequisites:

Implementation details

This section focuses on the Flink application code of this solution. You can find the code on GitHub. The StreamingJob.java file inside the flink-async-bedrock directory file serves as entry point to the application. The application uses the FlinkKinesisConsumer, which is a connector for reading streaming data from a Kinesis Data Stream. It applies a map transformation to convert each input string into an instance of Review class object, resulting in DataStream<Review> to ease processing.

The Flink application uses the helper class AsyncDataStream defined in the StreamingJob.java file to incorporate an asynchronous, external operation into Flink. More specifically, the following code creates an asynchronous data stream by applying the AsyncBedrockRequest function to each element in the inputReviewStream. The application uses unorderedWait to increase throughput and reduce idle time because event ordering is not required. The timeout is set to 25,000 milliseconds to give the Amazon Bedrock API enough time to process long reviews. The maximum concurrency or capacity is limited to 1,000 requests at a time. See the following code:

DataStream<ProcessedReview> processedReviewStream = AsyncDataStream.unorderedWait(inputReviewStream, new AsyncBedrockRequest(applicationProperties), 25000, TimeUnit.MILLISECONDS, 1000).uid("processedReviewStream");

The Flink application initiates asynchronous calls to the Amazon Bedrock API, invoking the Anthropic Claude 3 Haiku foundation model for each incoming event. We use Anthropic Claude 3 Haiku on Amazon Bedrock because it is Anthropic’s fastest and most compact model for near-instant responsiveness. The following code snippet is part of the AsyncBedrockRequest.java file and illustrates how we set up the required configuration to call the Anthropic’s Claude Messages API to invoke the model:

@Override
public void asyncInvoke(Review review, final ResultFuture<ProcessedReview> resultFuture) throws Exception {

    // [..]

    JSONObject user_message = new JSONObject()
        .put("role", "user")
        .put("content", "<review>" + reviewText + "</review>");

    JSONObject assistant_message = new JSONObject()
        .put("role", "assistant")
        .put("content", "{");

    JSONArray messages = new JSONArray()
            .put(user_message)
            .put(assistant_message);

    String payload = new JSONObject()
            .put("system", systemPrompt)
            .put("anthropic_version", "bedrock-2023-05-31")
            .put("temperature", 0.0)
            .put("max_tokens", 4096)
            .put("messages", messages)
            .toString();

    InvokeModelRequest request = InvokeModelRequest.builder()
            .body(SdkBytes.fromUtf8String(payload))
            .modelId("anthropic.claude-3-haiku-20240307-v1:0")
            .build();

    CompletableFuture<InvokeModelResponse> completableFuture = client.invokeModel(request)
            .whenComplete((response, exception) -> {
                if (exception != null) {
                    LOG.error("Model invocation failed: " + exception);
                }
            })
            .orTimeout(250000, TimeUnit.MILLISECONDS);

Prompt engineering

The application uses advanced prompt engineering techniques to guide the generative AI model’s responses and provide consistent responses. The following prompt is designed to extract a summary as well as a sentiment from a single review:

String systemPrompt = 
     "Summarize the review within the <review> tags 
     into a single and concise sentence alongside the sentiment 
     that is either positive or negative. Return a valid JSON object with 
     following keys: summary, sentiment. 
     <example> {\\\"summary\\\": \\\"The reviewer strongly dislikes the movie, 
     finding it unrealistic, preachy, and extremely boring to watch.\\\", 
     \\\"sentiment\\\": \\\"negative\\\"} 
     </example>";

The prompt instructs the Anthropic Claude model to return the extracted sentiment and summary in JSON format. To maintain consistent and well-structured output by the generative AI model, the prompt uses various prompt engineering techniques to improve the output. For example, the prompt uses XML tags to provide a clearer structure for Anthropic Claude. Moreover, the prompt contains an example to enhance Anthropic Claude’s performance and guide it to produce the desired output. In addition, the prompt pre-fills Anthropic Claude’s response by pre-filling the Assistant message. This technique helps provide a consistent output format. See the following code:

JSONObject assistant_message = new JSONObject()
    .put("role", "assistant")
    .put("content", "{");

Build the Flink application

The first step is to download the repository and build the JAR file of the Flink application. Complete the following steps:

  1. Clone the repository to your desired workspace:
    git clone https://github.com/aws-samples/aws-streaming-generative-ai-application.git

  2. Move to the correct directory inside the downloaded repository and build the Flink application:
    cd flink-async-bedrock && mvn clean package

Building Jar File

Maven will compile the Java source code and package it in a distributable JAR format in the directory flink-async-bedrock/target/ named flink-async-bedrock-0.1.jar. After you deploy your AWS CDK stack, the JAR file will be uploaded to Amazon Simple Storage Service (Amazon S3) to create your Managed Service for Apache Flink application.

Deploy the AWS CDK stack

After you build the Flink application, you can deploy your AWS CDK stack and create the required resources:

  1. Move to the correct directory cdk and deploy the stack:
    cd cdk && npm install & cdk deploy

This will create the required resources in your AWS account, including the Managed Service for Apache Flink application, Kinesis Data Stream, OpenSearch Service cluster, and bastion host to quickly connect to OpenSearch Dashboards, deployed in a private subnet within your VPC.

  1. Take note of the output values. The output will look similar to the following:
 ✅  StreamingGenerativeAIStack

✨  Deployment time: 1414.26s

Outputs:
StreamingGenerativeAIStack.BastionHostBastionHostIdC743CBD6 = i-0970816fa778f9821
StreamingGenerativeAIStack.accessOpenSearchClusterOutput = aws ssm start-session --target i-0970816fa778f9821 --document-name AWS-StartPortForwardingSessionToRemoteHost --parameters '{"portNumber":["443"],"localPortNumber":["8157"], "host":["vpc-generative-ai-opensearch-qfssmne2lwpzpzheoue7rkylmi.us-east-1.es.amazonaws.com"]}'
StreamingGenerativeAIStack.bastionHostIdOutput = i-0970816fa778f9821
StreamingGenerativeAIStack.domainEndpoint = vpc-generative-ai-opensearch-qfssmne2lwpzpzheoue7rkylmi.us-east-1.es.amazonaws.com
StreamingGenerativeAIStack.regionOutput = us-east-1
Stack ARN:
arn:aws:cloudformation:us-east-1:<AWS Account ID>:stack/StreamingGenerativeAIStack/3dec75f0-cc9e-11ee-9b16-12348a4fbf87

✨  Total time: 1418.61s

Set up and connect to OpenSearch Dashboards

Next, you can set up and connect to OpenSearch Dashboards. This is where the Flink application will write the extracted sentiment as well as the summary from the processed review stream. Complete the following steps:

  1. Run the following command to establish connection to OpenSearch from your local workspace in a separate terminal window. The command can be found as output named accessOpenSearchClusterOutput.
    • For Mac/Linux, use the following command:
aws ssm start-session --target <BastionHostId> --document-name AWS-StartPortForwardingSessionToRemoteHost --parameters '{"portNumber":["443"],"localPortNumber":["8157"], "host":["<OpenSearchDomainHost>"]}'
    • For Windows, use the following command:
aws ssm start-session ^
    —target <BastionHostId> ^
    —document-name AWS-StartPortForwardingSessionToRemoteHost ^    
    —parameters host="<OpenSearchDomainHost>",portNumber="443",localPortNumber="8157"

It should look similar to the following output:

Session Manager CLI

  1. Create the required index in OpenSearch by issuing the following command:
    • For Mac/Linux, use the following command:
curl --location -k --request PUT https://localhost:8157/processed_reviews \
--header 'Content-Type: application/json' \
--data-raw '{
  "mappings": {
    "properties": {
        "reviewId": {"type": "integer"},
        "userId": {"type": "keyword"},
        "summary": {"type": "keyword"},
        "sentiment": {"type": "keyword"},
        "dateTime": {"type": "date"}}}}}'
    • For Windows, use the following command:
$url = https://localhost:8157/processed_reviews
$headers = @{
    "Content-Type" = "application/json"
}
$body = @{
    "mappings" = @{
        "properties" = @{
            "reviewId" = @{ "type" = "integer" }
            "userId" = @{ "type" = "keyword" }
            "summary" = @{ "type" = "keyword" }
            "sentiment" = @{ "type" = "keyword" }
            "dateTime" = @{ "type" = "date" }
        }
    }
} | ConvertTo-Json -Depth 3
Invoke-RestMethod -Method Put -Uri $url -Headers $headers -Body $body -SkipCertificateCheck
  1. After the session is established, you can open your browser and navigate to https://localhost:8157/_dashboards. Your browser might consider the URL not secure. You can ignore this warning.
  2. Choose Dashboards Management under Management in the navigation pane.
  3. Choose Saved objects in the sidebar.
  4. Import export.ndjson, which can be found in the resources folder within the downloaded repository.

OpenSearch Dashboards Upload

  1. After you import the saved objects, you can navigate to Dashboards under My Dashboard in the navigation pane.

At the moment, the dashboard appears blank because you haven’t uploaded any review data to OpenSearch yet.

Set up the streaming producer

Finally, you can set up the producer that will be streaming review data to the Kinesis Data Stream and ultimately to the OpenSearch Dashboards. The Large Movie Review Dataset was originally published in 2011 in the paper “Learning Word Vectors for Sentiment Analysis” by Andrew L. Maas, Raymond E. Daly, Peter T. Pham, Dan Huang, Andrew Y. Ng, and Christopher Potts. Complete the following steps:

  1. Download the Large Movie Review Dataset here.
  2. After the download is complete, extract the .tar.gz file to retrieve the folder named aclImdb 3 or similar that contains the review data. Rename the review data folder to aclImdb.
  3. Move the extracted dataset to data/ inside the repository that you previously downloaded.

Your repository should look like the following screenshot.

Folder Overview

  1. Modify the DATA_DIR path in producer/producer.py if the review data is named differently.
  2. Move to the producer directory using the following command:
    cd producer

  3. Install the required dependencies and start generating the data:
    pip install -r requirements.txt && python produce.py

The OpenSearch dashboard should be populated after you start generating streaming data and writing it to the Kinesis Data Stream. Refresh the dashboard to view the latest data. The dashboard shows the total number of processed reviews, the sentiment distribution of the processed reviews in a pie chart, and the summary and sentiment for the latest reviews that have been processed.

When you have a closer look at the Flink application, you will notice that the application marks the sentiment field with the value error whenever there is an error with the asynchronous call made by Flink to the Amazon Bedrock API. The Flink application simply filters the correctly processed reviews and writes them to the OpenSearch dashboard.

For robust error handling, you should write any incorrectly processed reviews to a separate output stream and not discard them completely. This separation allows you to handle failed reviews differently than successful ones for simpler reprocessing, analysis, and troubleshooting.

Clean up

When you’re done with the resources you created, complete the following steps:

  1. Delete the Python producer using Ctrl/Command + C.
  2. Destroy your AWS CDK stack by returning to the root folder and running the following command in your terminal:
    cd cdk && cdk destroy

  3. When asked to confirm the deletion of the stack, enter yes.

Conclusion

In this post, you learned how to incorporate generative AI capabilities in your streaming architecture using Amazon Bedrock and Managed Service for Apache Flink using asynchronous requests. We also gave guidance on prompt engineering to derive the sentiment from text data using generative AI. You can build this architecture by deploying the sample code from the GitHub repository.

For more information on how to get started with Managed Service for Apache Flink, refer to Getting started with Amazon Managed Service for Apache Flink (DataStream API). For details on how to set up Amazon Bedrock, refer to Set up Amazon Bedrock. For other posts on Managed Service for Apache Flink, browse through the AWS Big Data Blog.


About the Authors

Felix John is a Solutions Architect and data streaming expert at AWS, based in Germany. He focuses on supporting small and medium businesses on their cloud journey. Outside of his professional life, Felix enjoys playing Floorball and hiking in the mountains.

Michelle Mei-Li Pfister is a Solutions Architect at AWS. She is supporting customers in retail and consumer packaged goods (CPG) industry on their cloud journey. She is passionate about topics around data and machine learning.

Perform reindexing in Amazon OpenSearch Serverless using Amazon OpenSearch Ingestion

Post Syndicated from Utkarsh Agarwal original https://aws.amazon.com/blogs/big-data/perform-reindexing-in-amazon-opensearch-serverless-using-amazon-opensearch-ingestion/

Amazon OpenSearch Serverless is a serverless deployment option for Amazon OpenSearch Service that makes it straightforward to run search and analytics workloads without managing infrastructure. Customers using OpenSearch Serverless often need to copy documents between two indexes within the same collection or across different collections. This primarily arises from two scenarios:

  • Reindexing – You frequently need to update or modify index mapping due to evolving data needs or schema changes
  • Disaster recovery – Although OpenSearch Serverless data is inherently durable, you may want to copy data across AWS Regions for added redundancy and resiliency

Amazon OpenSearch Ingestion had recently introduced a feature supporting OpenSearch as a source. OpenSearch Ingestion, a fully managed, serverless data collector, facilitates real-time ingestion of log, metric, and trace data into OpenSearch Service domains and OpenSearch Serverless collections. We can leverage this feature to address these two scenarios, by reading the data from an OpenSearch Serverless Collection. This capability allows you to effortlessly copy data between indexes, making data management tasks more streamlined and eliminating the need for custom code.

In this post, we outline the steps to copy data between two indexes in the same OpenSearch Serverless collection using the new OpenSearch source feature of OpenSearch Ingestion. This is particularly useful for reindexing operations where you want to change your data schema. OpenSearch Serverless and OpenSearch Ingestion are both serverless services that enable you to seamlessly handle your data workflows, providing optimal performance and scalability.

Solution overview

The following diagram shows the flow of copying documents from the source index to the destination index using an OpenSearch Ingestion pipeline.

Implementing the solution consists of the following steps:

  1. Create an AWS Identity and Access Management (IAM) role to use as an OpenSearch Ingestion pipeline role.
  2. Update the data access policy attached to the OpenSearch Serverless collection.
  3. Create an OpenSearch Ingestion pipeline that simply copies data from one index to another, or you can even create an index template using the OpenSearch Ingestion pipeline to define explicit mapping, and then copy the data from the source index to the destination index with the defined mapping applied.

Prerequisites

To get started, you must have an active OpenSearch Serverless collection with an index that you want to reindex (copy). Refer to Creating collections to learn more about creating a collection.

When the collection is ready, note the following details:

  • The endpoint of the OpenSearch Serverless collection
  • The name of the index from which the documents need to be copied
  • If the collection is defined as a VPC collection, note down the name of the network policy attached to the collection

You use these details in the ingestion pipeline configuration.

Create an IAM role to use as a pipeline role

An OpenSearch Ingestion pipeline needs certain permissions to pull data from the source and write to its sink. For this walkthrough, both the source and sink are the same, but if the source and sink collections are different, modify the policy accordingly.

Complete the following steps:

  1. Create an IAM policy (opensearch-ingestion-pipeline-policy) that provides permission to read and send data to the OpenSearch Serverless collection. The following is a sample policy with least privileges (modify {account-id}, {region}, {collection-id} and {collection-name} accordingly):
    {
        "Version": "2012-10-17",
        "Statement": [{
                "Action": [
                    "aoss:BatchGetCollection",
                    "aoss:APIAccessAll"
                ],
                "Effect": "Allow",
                "Resource": "arn:aws:aoss:{region}:{account-id}:collection/{collection-id}"
            },
            {
                "Action": [
                    "aoss:CreateSecurityPolicy",
                    "aoss:GetSecurityPolicy",
                    "aoss:UpdateSecurityPolicy"
                ],
                "Effect": "Allow",
                "Resource": "*",
                "Condition": {
                    "StringEquals": {
                        "aoss:collection": "{collection-name}"
                    }
                }
            }
        ]
    }

  2. Create an IAM role (opensearch-ingestion-pipeline-role) that the OpenSearch Ingestion pipeline will assume. While creating the role, use the policy you created (opensearch-ingestion-pipeline-policy). The role should have the following trust relationship (modify {account-id} and {region} accordingly):
    {
        "Version": "2012-10-17",
        "Statement": [{
            "Effect": "Allow",
            "Principal": {
                "Service": "osis-pipelines.amazonaws.com"
            },
            "Action": "sts:AssumeRole",
            "Condition": {
                "StringEquals": {
                    "aws:SourceAccount": "{account-id}"
                },
                "ArnLike": {
                    "aws:SourceArn": "arn:aws:osis:{region}:{account-id}:pipeline/*"
                }
            }
        }]
    }

  3. Record the ARN of the newly created IAM role (arn:aws:iam::111122223333:role/opensearch-ingestion-pipeline-role).

Update the data access policy attached to the OpenSearch Serverless collection

After you create the IAM role, you need to update the data access policy attached to the OpenSearch Serverless collection. Data access policies control access to the OpenSearch operations that OpenSearch Serverless supports, such as PUT <index> or GET _cat/indices. To perform the update, complete the following steps:

  1. On the OpenSearch Service console, under Serverless in the navigation pane, choose Collections.
  2. From the list of the collections, choose your OpenSearch Serverless collection.
  3. On the Overview tab, in the Data access section, choose the associated policy.
  4. Choose Edit.
  5. Edit the policy in the JSON editor to add the following JSON rule block in the existing JSON (modify {account-id} and {collection-name} accordingly):
    {
        "Rules": [{
            "Resource": [
                "index/{collection-name}/*"
            ],
            "Permission": [
                "aoss:CreateIndex",
                "aoss:UpdateIndex",
                "aoss:DescribeIndex",
                "aoss:ReadDocument",
                "aoss:WriteDocument"
            ],
            "ResourceType": "index"
        }],
        "Principal": [
            "arn:aws:iam::{account-id}:role/opensearch-ingestion-pipeline-role"
        ],
        "Description": "Provide access to OpenSearch Ingestion Pipeline Role"
    }

You can also use the Visual Editor method to choose Add another rule and add the preceding permissions for arn:aws:iam::{account-id}:role/opensearch-ingestion-pipeline-role.

  1. Choose Save.

Now you have successfully allowed the OpenSearch Ingestion role to perform OpenSearch operations against the OpenSearch Serverless collection.

Create and configure the OpenSearch Ingestion pipeline to copy the data from one index to another

Complete the following steps:

  1. On the OpenSearch Service console, choose Pipelines under Ingestion in the navigation pane.
  2. Choose Create a pipeline.
  3. In Choose Blueprint, select OpenSearchDataMigrationPipeline.
  4. For Pipeline name, enter a name (for example, sample-ingestion-pipeline).
  5. For Pipeline capacity, you can define the minimum and maximum capacity to scale up the resources. For this walkthrough, you can use the default value of 2 Ingestion OCUs for Min capacity and 4 Ingestion OCUs for Max capacity. However, you can even choose different values as OpenSearch Ingestion automatically scales your pipeline capacity according to your estimated workload, based on the minimum and maximum Ingestion OpenSearch Compute Units (Ingestion OCUs) that you specify.
  6. Update the following information for the source:
    1. Uncomment hosts and specify the endpoint of the existing OpenSearch Serverless collection that was copied as part of prerequisites.
    2. Uncomment include and index_name_regex, and specify the name of the index that will act as the source (in this demo, we’re using logs-2024.03.01).
    3. Uncomment region under aws and specify the AWS Region where your OpenSearch Serverless collection is (for example, us-east-1).
    4. Uncomment sts_role_arn under aws and specify the role that has permission to read data from the OpenSearch Serverless collection (for example, arn:aws:iam::111122223333:role/opensearch-ingestion-pipeline-role). This is the same role that was added in the data access policy of the collection.
    5. Update the serverless flag to true.
    6. If the OpenSearch Serverless collection has VPC access, uncomment serverless_options and network_policy_name and specify the name of the network policy used for the collection.
    7. Uncomment scheduling, interval, index_read_count, and start_time and modify these parameters accordingly.
      Using these parameters makes sure the OpenSearch Ingestion pipeline processes the indexes multiple times (to pick up new documents).
      Note – If the collection specified in the sink is of the Time series or Vector search type, you can keep the scheduling, interval, index_read_count, and start_time parameters commented.
  1. Update the following information for the sink:
    1. Uncomment hosts and specify the endpoint of the existing OpenSearch Serverless collection.
    2. Uncomment sts_role_arn under aws and specify the role that has permission to write data into the OpenSearch Serverless collection (for example, arn:aws:iam::111122223333:role/opensearch-ingestion-pipeline-role). This is the same role that was added in the data access policy of the collection.
    3. Update the serverless flag to true.
    4. If the OpenSearch Serverless collection has VPC access, uncomment serverless_options and network_policy_name and specify the name of the network policy used for the collection.
    5. Update the value for index and provide the index name to which you want to transfer the documents (for example, new-logs-2024.03.01).
    6. For document_id, you can get the ID from the document metadata in the source and use the same in the target.
      However, it is important to note that custom document IDs are only supported for the Search type of collection. If your collection is of the Time Series or Vector Search type, you should comment out the document_id line.
    7. (Optional) The values for bucket, region and sts_role_arn keys within the dlq section can be modified to capture any failed requests in an S3 bucket.
      Note – Additional permission to opensearch-ingestion-pipeline-role needs to be given, if configuring DLQ. Please refer Writing to a dead-letter queue, for the changes required.
      For this walkthrough, you will not set up a DLQ. You can remove the entire dlq block.
  1. Now click on Validate pipeline to validate the pipeline configuration.
  2. For Network settings, choose your preferred setting:
    1. Choose VPC access and select your VPC, subnet, and security group to set up the access privately. Choose this option if the OpenSearch Serverless collection has VPC access. AWS recommends using a VPC endpoint for all production workloads.
    2. Choose Public to use public access. For this walkthrough, we select Public because the collection is also accessible from public network.
  3. For Log Publishing Option, you can either create a new Amazon CloudWatch group or use an existing CloudWatch group to write the ingestion logs. This provides access to information about errors and warnings raised during the operation, which can help during troubleshooting. For this walkthrough, choose Create new group.
  4. Choose Next, and verify the details you specified for your pipeline settings.
  5. Choose Create pipeline.

It will take a couple of minutes to create the ingestion pipeline. After the pipeline is created, you will see the documents in the destination index, specified in the sink (for example, new-logs-2024.03.01). After all the documents are copied, you can validate the number of documents by using the count API.

When the process is complete, you have the option to stop or delete the pipeline. If you choose to keep the pipeline running, it will continue to copy new documents from the source index according to the defined schedule, if specified.

In this walkthrough, the endpoint defined in the hosts parameter under source and sink of the pipeline configuration belonged to the same collection which was of the Search type. If the collections are different, you need to modify the permissions for the IAM role (opensearch-ingestion-pipeline-role) to allow access to both collections. Additionally, make sure you update the data access policy for both the collections to grant access to the OpenSearch Ingestion pipeline.

Create an index template using the OpenSearch Ingestion pipeline to define mapping

In OpenSearch, you can define how documents and their fields are stored and indexed by creating a mapping. The mapping specifies the list of fields for a document. Every field in the document has a field type, which defines the type of data the field contains. OpenSearch Service dynamically maps data types in each incoming document if an explicit mapping is not defined. However, you can use the template_type parameter with the index-template value and template_content with JSON of the content of the index-template in the pipeline configuration to define explicit mapping rules. You also need to define the index_type parameter with the value as custom.

The following code shows an example of the sink portion of the pipeline and the usage of index_type, template_type, and template_content:

sink:
    - opensearch:
        # Provide an AWS OpenSearch Service domain endpoint
        hosts: [ "<<OpenSearch-Serverless-Collection-Endpoint>>" ]
        aws:
          # Provide a Role ARN with access to the domain. This role should have a trust relationship with osis-pipelines.amazonaws.com
          sts_role_arn: "arn:aws:iam::111122223333:role/opensearch-ingestion-pipeline-role"
          # Provide the region of the domain.
          region: "us-east-1"
          # Enable the 'serverless' flag if the sink is an Amazon OpenSearch Serverless collection
          serverless: true
          # serverless_options:
            # Specify a name here to create or update network policy for the serverless collection
            # network_policy_name: "network-policy-name"
        # This will make it so each document in the source cluster will be written to the same index in the destination cluster
        index: "new-logs-2024.03.01"
        index_type: custom
        template_type: index-template
        template_content: >
          {
            "template" : {
              "mappings" : {
                "properties" : {
                  "Data" : {
                    "type" : "text"
                  },
                  "EncodedColors" : {
                    "type" : "binary"
                  },
                  "Type" : {
                    "type" : "keyword"
                  },
                  "LargeDouble" : {
                    "type" : "double"
                  }          
                }
              }
            }
          }
        # This will make it so each document in the source cluster will be written with the same document_id in the destination cluster
        document_id: "${getMetadata(\"opensearch-document_id\")}"
        # Enable the 'distribution_version' setting if the AWS OpenSearch Service domain is of version Elasticsearch 6.x
        # distribution_version: "es6"
        # Enable and switch the 'enable_request_compression' flag if the default compression setting is changed in the domain. See https://docs.aws.amazon.com/opensearch-service/latest/developerguide/gzip.html
        # enable_request_compression: true/false
        # Enable the S3 DLQ to capture any failed requests in an S3 bucket
        # dlq:
          # s3:
            # Provide an S3 bucket
            # bucket: "<<your-dlq-bucket-name>>"
            # Provide a key path prefix for the failed requests
            # key_path_prefix: "<<logs/dlq>>"
            # Provide the region of the bucket.
            # region: "<<us-east-1>>"
            # Provide a Role ARN with access to the bucket. This role should have a trust relationship with osis-pipelines.amazonaws.com
            # sts_role_arn: "<<arn:aws:iam::111122223333:role/opensearch-ingestion-pipeline-role>>"

Or you can create the index first, with the mapping in the collection before you start the pipeline.

If you want to create a template using an OpenSearch Ingestion pipeline, you need to provide aoss:UpdateCollectionItems and aoss:DescribeCollectionItems permission for the collection in the data access policy for the pipeline role (opensearch-ingestion-pipeline-role). The updated JSON block for the rule would look like the following:

{
    "Rules": [
      {
        "Resource": [
          "collection/{collection-name}"
        ],
        "Permission": [
          "aoss:UpdateCollectionItems",
          "aoss:DescribeCollectionItems"
        ],
        "ResourceType": "collection"
      },
      {
        "Resource": [
          "index/{collection-name}/*"
        ],
        "Permission": [
          "aoss:CreateIndex",
          "aoss:UpdateIndex",
          "aoss:DescribeIndex",
          "aoss:ReadDocument",
          "aoss:WriteDocument"
        ],
        "ResourceType": "index"
      }
    ],
    "Principal": [
      "arn:aws:iam::{account-id}:role/opensearch-ingestion-pipeline-role"
    ],
    "Description": "Provide access to OpenSearch Ingestion Pipeline Role"
  }

Conclusion

In this post, we showed how to use an OpenSearch Ingestion pipeline to copy data from one index to another in an OpenSearch Serverless collection. OpenSearch Ingestion also allows you to perform transformation of data using various processors. AWS offers various resources for you to quickly start building pipelines using OpenSearch Ingestion. You can use various built-in pipeline integrations to quickly ingest data from Amazon DynamoDB, Amazon Managed Streaming for Apache Kafka (Amazon MSK), Amazon Security Lake, Fluent Bit, and many more. You can use the following OpenSearch Ingestion blueprints to build data pipelines with minimal configuration changes.


About the Authors

Utkarsh Agarwal is a Cloud Support Engineer in the Support Engineering team at Amazon Web Services. He specializes in Amazon OpenSearch Service. He provides guidance and technical assistance to customers thus enabling them to build scalable, highly available, and secure solutions in the AWS Cloud. In his free time, he enjoys watching movies, TV series, and of course, cricket. Lately, he has also been attempting to master the art of cooking in his free time – the taste buds are excited, but the kitchen might disagree.

Prashant Agrawal is a Sr. Search Specialist Solutions Architect with Amazon OpenSearch Service. He works closely with customers to help them migrate their workloads to the cloud and helps existing customers fine-tune their clusters to achieve better performance and save on cost. Before joining AWS, he helped various customers use OpenSearch and Elasticsearch for their search and log analytics use cases. When not working, you can find him traveling and exploring new places. In short, he likes doing Eat → Travel → Repeat.

Build multimodal search with Amazon OpenSearch Service

Post Syndicated from Praveen Mohan Prasad original https://aws.amazon.com/blogs/big-data/build-multimodal-search-with-amazon-opensearch-service/

Multimodal search enables both text and image search capabilities, transforming how users access data through search applications. Consider building an online fashion retail store: you can enhance the users’ search experience with a visually appealing application that customers can use to not only search using text but they can also upload an image depicting a desired style and use the uploaded image alongside the input text in order to find the most relevant items for each user. Multimodal search provides more flexibility in deciding how to find the most relevant information for your search.

To enable multimodal search across text, images, and combinations of the two, you generate embeddings for both text-based image metadata and the image itself. Text embeddings capture document semantics, while image embeddings capture visual attributes that help you build rich image search applications.

Amazon Titan Multimodal Embeddings G1 is a multimodal embedding model that generates embeddings to facilitate multimodal search. These embeddings are stored and managed efficiently using specialized vector stores such as Amazon OpenSearch Service, which is designed to store and retrieve large volumes of high-dimensional vectors alongside structured and unstructured data. By using this technology, you can build rich search applications that seamlessly integrate text and visual information.

Amazon OpenSearch Service and Amazon OpenSearch Serverless support the vector engine, which you can use to store and run vector searches. In addition, OpenSearch Service supports neural search, which provides out-of-the-box machine learning (ML) connectors. These ML connectors enable OpenSearch Service to seamlessly integrate with embedding models and large language models (LLMs) hosted on Amazon Bedrock, Amazon SageMaker, and other remote ML platforms such as OpenAI and Cohere. When you use the neural plugin’s connectors, you don’t need to build additional pipelines external to OpenSearch Service to interact with these models during indexing and searching.

This blog post provides a step-by-step guide for building a multimodal search solution using OpenSearch Service. You will use ML connectors to integrate OpenSearch Service with the Amazon Bedrock Titan Multimodal Embeddings model to infer embeddings for your multimodal documents and queries. This post illustrates the process by showing you how to ingest a retail dataset containing both product images and product descriptions into your OpenSearch Service domain and then perform a multimodal search by using vector embeddings generated by the Titan multimodal model. The code used in this tutorial is open source and available on GitHub for you to access and explore.

Multimodal search solution architecture

We will provide the steps required to set up multimodal search using OpenSearch Service. The following image depicts the solution architecture.

Multimodal search architecture

Figure 1: Multimodal search architecture

The workflow depicted in the preceding figure is:

  1. You download the retail dataset from Amazon Simple Storage Service (Amazon S3) and ingest it into an OpenSearch k-NN index using an OpenSearch ingest pipeline.
  2. OpenSearch Service calls the Amazon Bedrock Titan Multimodal Embeddings model to generate multimodal vector embeddings for both the product description and image.
  3. Through an OpenSearch Service client, you pass a search query.
  4. OpenSearch Service calls the Amazon Bedrock Titan Multimodal Embeddings model to generate vector embedding for the search query.
  5. OpenSearch runs the neural search and returns the search results to the client.

Let’s look at steps 1, 2, and 4 in more detail.

Step 1: Ingestion of the data into OpenSearch

This step involves the following OpenSearch Service features:

  • Ingest pipelines – An ingest pipeline is a sequence of processors that are applied to documents as they’re ingested into an index. Here you use a text_image_embedding processor to generate combined vector embeddings for the image and image description.
  • k-NN index – The k-NN index introduces a custom data type, knn_vector, which allows users to ingest vectors into an OpenSearch index and perform different kinds of k-NN searches. You use the k-NN index to store both the general field data types, such as text, numeric, etc., and specialized field data types, such as knn_vector.

Steps 2 and 4: OpenSearch calls the Amazon Bedrock Titan model

OpenSearch Service uses the Amazon Bedrock connector to generate embeddings for the data. When you send the image and text as part of your indexing and search requests, OpenSearch uses this connector to exchange the inputs with the equivalent embeddings from the Amazon Bedrock Titan model. The highlighted blue box in the architecture diagram depicts the integration of OpenSearch with Amazon Bedrock using this ML-connector feature. This direct integration eliminates the need for an additional component (for example, AWS Lambda) to facilitate the exchange between the two services.

Solution overview

In this post, you will build and run multimodal search using a sample retail dataset. You will use the same multimodal generated embeddings and experiment by running text search only, image search only and both text and image search in OpenSearch Service.

Prerequisites

  1. Create an OpenSearch Service domain. For instructions, see Creating and managing Amazon OpenSearch Service domains. Make sure the following settings are applied when you create the domain, while leaving other settings as default.
    • OpenSearch version is 2.13
    • The domain has public access
    • Fine-grained access control is enabled
    • A master user is created
  2. Set up a Python client to interact with the OpenSearch Service domain, preferably on a Jupyter Notebook interface.
  3. Add model access in Amazon Bedrock. For instructions, see add model access.

Note that you need to refer to the Jupyter Notebook in the GitHub repository to run the following steps using Python code in your client environment. The following sections provide the sample blocks of code that contain only the HTTP request path and the request payload to be passed to OpenSearch Service at every step.

Data overview and preparation

You will be using a retail dataset that contains 2,465 retail product samples that belong to different categories such as accessories, home decor, apparel, housewares, books, and instruments. Each product contains metadata including the ID, current stock, name, category, style, description, price, image URL, and gender affinity of the product. You will be using only the product image and product description fields in the solution.

A sample product image and product description from the dataset are shown in the following image:

Sample product image and description

Figure 2: Sample product image and description

In addition to the original product image, the textual description of the image provides additional metadata for the product, such as color, type, style, suitability, and so on. For more information about the dataset, visit the retail demo store on GitHub.

Step 1: Create the OpenSearch-Amazon Bedrock ML connector

The OpenSearch Service console provides a streamlined integration process that allows you to deploy an Amazon Bedrock-ML connector for multimodal search within minutes. OpenSearch Service console integrations provide AWS CloudFormation templates to automate the steps of Amazon Bedrock model deployment and Amazon Bedrock-ML connector creation in OpenSearch Service.

  1. In the OpenSearch Service console, navigate to Integrations as shown in the following image and search for Titan multi-modal. This returns the CloudFormation template named Integrate with Amazon Bedrock Titan Multi-modal, which you will use in the following steps.Configure domainFigure 3: Configure domain
  2. Select Configure domain and choose ‘Configure public domain’.
  3. You will be automatically redirected to a CloudFormation template stack as shown in the following image, where most of the configuration is pre-populated for you, including the Amazon Bedrock model, the ML model name, and the AWS Identity and Access Management (IAM) role that is used by Lambda to invoke your OpenSearch domain. Update Amazon OpenSearch Endpoint with your OpenSearch domain endpoint and Model Region with the AWS Region in which your model is available.Create a CloudFormation stackFigure 4: Create a CloudFormation stack
  4. Before you deploy the stack by clicking ‘Create Stack’, you need to give necessary permissions for the stack to create the ML connector. The CloudFormation template creates a Lambda IAM role for you with the default name LambdaInvokeOpenSearchMLCommonsRole, which you can override if you want to choose a different name. You need to map this IAM role as a Backend role for ml_full_access role in OpenSearch dashboards Security plugin, so that the Lambda function can successfully create the ML connector. To do so,
    • Login to the OpenSearch Dashboards using the master user credentials that you created as a part of prerequisites. You can find the Dashboards endpoint on your domain dashboard on the OpenSearch Service console.
    • From the main menu choose SecurityRoles, and select the ml_full_access role.
    • Choose Mapped usersManage mapping.
    • Under Backend roles, add the ARN of the Lambda role (arn:aws:iam::<account-id>:role/LambdaInvokeOpenSearchMLCommonsRole) that needs permission to call your domain.
    • Select Map and confirm the user or role shows up under Mapped users.Set permissions in OpenSearch dashboardsFigure 5: Set permissions in OpenSearch dashboards security plugin
  5. Return back to the CloudFormation stack console, check the box, ‘I acknowledge that AWS CloudFormation might create IAM resources with customised names‘ and click on ‘Create Stack’.
  6. After the stack is deployed, it will create the Amazon Bedrock-ML connector (ConnectorId) and a model identifier (ModelId). CloudFormation stack outputsFigure 6: CloudFormation stack outputs
  7. Copy the ModelId from the Outputs tab of the CloudFormation stack starting with prefix ‘OpenSearch-bedrock-mm-’ from your CloudFormation console. You will be using this ModelId in the further steps.

Step 2: Create the OpenSearch ingest pipeline with the text_image_embedding processor

You can create an ingest pipeline with the text_image_embedding processor, which transforms the images and descriptions into embeddings during the indexing process.

In the following request payload, you provide the following parameters to the text_image_embedding processor. Specify which index fields to convert to embeddings, which field should store the vector embeddings, and which ML model to use to perform the vector conversion.

  • model_id (<model_id>) – The model identifier from the previous step.
  • Embedding (<vector_embedding>) – The k-NN field that stores the vector embeddings.
  • field_map (<product_description> and <image_binary>) – The field name of the product description and the product image in binary format.
path = "_ingest/pipeline/<bedrock-multimodal-ingest-pipeline>"

..
payload = {
"description": "A text/image embedding pipeline",
"processors": [
{
"text_image_embedding": {
"model_id":<model_id>,
"embedding": <vector_embedding>,
"field_map": {
"text": <product_description>,
"image": <image_binary>
}}}]}

Step 4: Create the k-NN index and ingest the retail dataset

Create the k-NN index and set the pipeline created in the previous step as the default pipeline. Set index.knn to True to perform an approximate k-NN search. The vector_embedding field type must be mapped as a knn_vector. vector_embedding field dimension must be mapped with the number of dimensions of the vector that the model provides.

Amazon Titan Multimodal Embeddings G1 lets you choose the size of the output vector (either 256, 512, or 1024). In this post, you will be using the default 1024 dimensional vectors from the model. You can check the size of dimensions of the model by selecting ‘Providers’ -> ‘Amazon’ tab -> ‘Titan Multimodal Embeddings G1’ tab -> ‘Model attributes’, from your Bedrock console.

Given the smaller size of the dataset and to bias for better recall, you use the faiss engine with the hnsw algorithm and the default l2 space type for your k-NN index. For more information about different engines and space types, refer to k-NN index.

payload = {
"settings": {
"index.knn": True,
"default_pipeline": <ingest-pipeline>
},
"mappings": {
"properties": {
"vector_embedding": {
"type": "knn_vector",
"dimension": 1024
"method": {
"engine": "faiss",
"space_type": "l2",
"name": "hnsw",
"parameters": {}
}
},
"product_description": {"type": "text"},
"image_url": {"type": "text"},
"image_binary": {"type": "binary"}
}}}

Finally, you ingest the retail dataset into the k-NN index using a bulk request. For the ingestion code, refer to the step 7, ‘Ingest the dataset into k-NN index using Bulk request‘ in the Jupyter notebook.

Step 5: Perform multimodal search experiments

Perform the following experiments to explore multimodal search and compare results. For text search, use the sample query “Trendy footwear for women” and set the number of results to 5 (size) throughout the experiments.

Experiment 1: Lexical search

This experiment shows you the limitations of simple lexical search and how the results can be improved using multimodal search.

Run a match query against the product_description field by using the following example query payload:

payload = {
"query": {
"match": {
"product_description": {
"query": "Trendy footwear for women"
}
}
},
"size": 5
}

Results:

Lexical search results

Figure 7: Lexical search results

Observation:

As shown in the preceding figure, the first three results refer to a jacket, glasses, and scarf, which are irrelevant to the query. These were returned because of the matching keywords between the query, “Trendy footwear for women” and the product descriptions, such as “trendy” and “women.” Only the last two results are relevant to the query because they contain footwear items.

Only the last two products fulfil the intent of the query, which was to find products that match all words in the query.

Experiment 2: Multimodal search with only text as input

In this experiment, you will use the Titan Multimodal Embeddings model that you deployed previously and run a neural search with only “Trendy footwear for women” (text) as input.

In the k-NN vector field (vector_embedding) of the neural query, you pass the model_id, query_text, and k value as shown in the following example. k denotes the number of results returned by the k-NN search.

payload = {
"query": {
"neural": {
"vector_embedding": {
"query_text": "Trendy footwear for women",
"model_id": <model_id>,
"k": 5
}
}
},
"size": 5
}

Results:

Results from multimodal search using text

Figure 8: Results from multimodal search using text

Observation:

As shown in the preceding figure, all five results are relevant because each represents a style of footwear. Additionally, the gender preference from the query (women) is also matched in all the results, which indicates that the Titan multimodal embeddings preserved the gender context in both the query and nearest document vectors.

Experiment 3: Multimodal search with only an image as input

In this experiment, you will use only a product image as the input query.

You will use the same neural query and parameters as in the previous experiment but pass the query_image parameter instead of using the query_text parameter. You need to convert the image into binary format and pass the binary string to the query_image parameter:

Image of a woman’s sandal used as the query input

Figure 9: Image of a woman’s sandal used as the query input

payload = {
"query": {
"neural": {
"vector_embedding": {
"query_image": <query_image_binary>,
"model_id": <model_id>,
"k": 5
}
}
},
"size": 5
}

Results:

Results from multimodal search using an image

Figure 10: Results from multimodal search using an image

Observation:

As shown in the preceding figure, by passing an image of a woman’s sandal, you were able to retrieve similar footwear styles. Though this experiment provides a different set of results compared to the previous experiment, all the results are highly related to the search query. All the matching documents are similar to the searched product image, not only in terms of the product category (footwear) but also in terms of the style (summer footwear), color, and gender affinity of the product.

Experiment 4: Multimodal search with both text and an image

In this last experiment, you will run the same neural query but pass both the image of a woman’s sandal and the text, “dark color” as inputs.

Figure 11: Image of a woman’s sandal used as part of the query input

As before, you will convert the image into its binary form before passing it to the query:

payload = {
"query": {
"neural": {
"vector_embedding": {
"query_image": <query_image_binary>,
"query_text": "dark color",
"model_id": <model_id>,
"k": 5
}
}
},
"size": 5
}

Results:

payload = { "query": { "neural": { "vector_embedding": { "query_image": <query_image_binary>, "query_text": "dark color", "model_id": <model_id>, "k": 5 } } }, "size": 5 }” width=”904″ height=”796″></a></p>
<p><em>Figure 12: Results of query using text and an image</em></p>
<h4><strong>Observation:</strong></h4>
<p>In this experiment, you augmented the image query with a text query to return dark, summer-style shoes. This experiment provided more comprehensive options by taking into consideration both text and image input.</p>
<h2>Overall observations</h2>
<p>Based on the experiments, all the variants of multimodal search provided more relevant results than a basic lexical search. After experimenting with text-only search, image-only search, and a combination of the two, it’s clear that the combination of text and image modalities provides more search flexibility and, as a result, more specific footwear options to the user.</p>
<h2>Clean up</h2>
<p>To avoid incurring continued AWS usage charges, delete the Amazon OpenSearch Service domain that you created and delete the CloudFormation stack starting with prefix ‘<strong>OpenSearch-bedrock-mm-</strong>’ that you deployed to create the ML connector.</p>
<h2>Conclusion</h2>
<p>In this post, we showed you how to use OpenSearch Service and the Amazon Bedrock Titan Multimodal Embeddings model to run multimodal search using both text and images as inputs. We also explained how the new multimodal processor in OpenSearch Service makes it easier for you to generate text and image embeddings using an OpenSearch ML connector, store the embeddings in a k-NN index, and perform multimodal search.</p>
<p>Learn more about <a href=ML-powered search with OpenSearch and set up you multimodal search solution in your own environment using the guidelines in this post. The solution code is also available on the GitHub repo.


About the Authors

Praveen Mohan Prasad is an Analytics Specialist Technical Account Manager at Amazon Web Services and helps customers with pro-active operational reviews on analytics workloads. Praveen actively researches on applying machine learning to improve search relevance.

Hajer Bouafif is an Analytics Specialist Solutions Architect at Amazon Web Services. She focuses on Amazon OpenSearch Service and helps customers design and build well-architected analytics workloads in diverse industries. Hajer enjoys spending time outdoors and discovering new cultures.

Aruna Govindaraju is an Amazon OpenSearch Specialist Solutions Architect and has worked with many commercial and open-source search engines. She is passionate about search, relevancy, and user experience. Her expertise with correlating end-user signals with search engine behavior has helped many customers improve their search experience. Her favourite pastime is hiking the New England trails and mountains.

Ingest and analyze your data using Amazon OpenSearch Service with Amazon OpenSearch Ingestion

Post Syndicated from Sharmila Shanmugam original https://aws.amazon.com/blogs/big-data/ingest-and-analyze-your-data-using-amazon-opensearch-service-with-amazon-opensearch-ingestion/

In today’s data-driven world, organizations are continually confronted with the task of managing extensive volumes of data securely and efficiently. Whether it’s customer information, sales records, or sensor data from Internet of Things (IoT) devices, the importance of handling and storing data at scale with ease of use is paramount.

A common use case that we see amongst customers is to search and visualize data. In this post, we show how to ingest CSV files from Amazon Simple Storage Service (Amazon S3) into Amazon OpenSearch Service using the Amazon OpenSearch Ingestion feature and visualize the ingested data using OpenSearch Dashboards.

OpenSearch Service is a fully managed, open source search and analytics engine that helps you with ingesting, searching, and analyzing large datasets quickly and efficiently. OpenSearch Service enables you to quickly deploy, operate, and scale OpenSearch clusters. It continues to be a tool of choice for a wide variety of use cases such as log analytics, real-time application monitoring, clickstream analysis, website search, and more.

OpenSearch Dashboards is a visualization and exploration tool that allows you to create, manage, and interact with visuals, dashboards, and reports based on the data indexed in your OpenSearch cluster.

Visualize data in OpenSearch Dashboards

Visualizing the data in OpenSearch Dashboards involves the following steps:

  • Ingest data – Before you can visualize data, you need to ingest the data into an OpenSearch Service index in an OpenSearch Service domain or Amazon OpenSearch Serverless collection and define the mapping for the index. You can specify the data types of fields and how they should be analyzed; if nothing is specified, OpenSearch Service automatically detects the data type of each field and creates a dynamic mapping for your index by default.
  • Create an index pattern – After you index the data into your OpenSearch Service domain, you need to create an index pattern that enables OpenSearch Dashboards to read the data stored in the domain. This pattern can be based on index names, aliases, or wildcard expressions. You can configure the index pattern by specifying the timestamp field (if applicable) and other settings that are relevant to your data.
  • Create visualizations – You can create visuals that represent your data in meaningful ways. Common types of visuals include line charts, bar charts, pie charts, maps, and tables. You can also create more complex visualizations like heatmaps and geospatial representations.

Ingest data with OpenSearch Ingestion

Ingesting data into OpenSearch Service can be challenging because it involves a number of steps, including collecting, converting, mapping, and loading data from different data sources into your OpenSearch Service index. Traditionally, this data was ingested using integrations with Amazon Data Firehose, Logstash, Data Prepper, Amazon CloudWatch, or AWS IoT.

The OpenSearch Ingestion feature of OpenSearch Service introduced in April 2023 makes ingesting and processing petabyte-scale data into OpenSearch Service straightforward. OpenSearch Ingestion is a fully managed, serverless data collector that allows you to ingest, filter, enrich, and route data to an OpenSearch Service domain or OpenSearch Serverless collection. You configure your data producers to send data to OpenSearch Ingestion, which automatically delivers the data to the domain or collection that you specify. You can configure OpenSearch Ingestion to transform your data before delivering it.

OpenSearch Ingestion scales automatically to meet the requirements of your most demanding workloads, helping you focus on your business logic while abstracting away the complexity of managing complex data pipelines. It’s powered by Data Prepper, an open source streaming Extract, Transform, Load (ETL) tool that can filter, enrich, transform, normalize, and aggregate data for downstream analysis and visualization.

OpenSearch Ingestion uses pipelines as a mechanism that consists of three major components:

  • Source – The input component of a pipeline. It defines the mechanism through which a pipeline consumes records.
  • Processors – The intermediate processing units that can filter, transform, and enrich records into a desired format before publishing them to the sink. The processor is an optional component of a pipeline.
  • Sink – The output component of a pipeline. It defines one or more destinations to which a pipeline publishes records. A sink can also be another pipeline, which allows you to chain multiple pipelines together.

You can process data files written in S3 buckets in two ways: by processing the files written to Amazon S3 in near real time using Amazon Simple Queue Service (Amazon SQS), or with the scheduled scans approach, in which you process the data files in batches using one-time or recurring scheduled scan configurations.

In the following section, we provide an overview of the solution and guide you through the steps to ingest CSV files from Amazon S3 into OpenSearch Service using the S3-SQS approach in OpenSearch Ingestion. Additionally, we demonstrate how to visualize the ingested data using OpenSearch Dashboards.

Solution overview

The following diagram outlines the workflow of ingesting CSV files from Amazon S3 into OpenSearch Service.

solution_overview

The workflow comprises the following steps:

  1. The user uploads CSV files into Amazon S3 using techniques such as direct upload on the AWS Management Console or AWS Command Line Interface (AWS CLI), or through the Amazon S3 SDK.
  2. Amazon SQS receives an Amazon S3 event notification as a JSON file with metadata such as the S3 bucket name, object key, and timestamp.
  3. The OpenSearch Ingestion pipeline receives the message from Amazon SQS, loads the files from Amazon S3, and parses the CSV data from the message into columns. It then creates an index in the OpenSearch Service domain and adds the data to the index.
  4. Lastly, you create an index pattern and visualize the ingested data using OpenSearch Dashboards.

OpenSearch Ingestion provides a serverless ingestion framework to effortlessly ingest data into OpenSearch Service with just a few clicks.

Prerequisites

Make sure you meet the following prerequisites:

Create an SQS queue

Amazon SQS offers a secure, durable, and available hosted queue that lets you integrate and decouple distributed software systems and components. Create a standard SQS queue and provide a descriptive name for the queue, then update the access policy by navigating to the Amazon SQS console, opening the details of your queue, and editing the policy on the Advanced tab.

The following is a sample access policy you could use for reference to update the access policy:

{
  "Version": "2008-10-17",
  "Id": "example-ID",
  "Statement": [
    {
      "Sid": "example-statement-ID",
      "Effect": "Allow",
      "Principal": {
        "Service": "s3.amazonaws.com"
      },
      "Action": "SQS:SendMessage",
      "Resource": "<SQS_QUEUE_ARN>"
    }
  ]
}

SQS FIFO (First-In-First-Out) queues aren’t supported as an Amazon S3 event notification destination. To send a notification for an Amazon S3 event to an SQS FIFO queue, you can use Amazon EventBridge.

create_sqs_queue

Create an S3 bucket and enable Amazon S3 event notification

Create an S3 bucket that will be the source for CSV files and enable Amazon S3 notifications. The Amazon S3 notification invokes an action in response to a specific event in the bucket. In this workflow, whenever there in an event of type S3:ObjectCreated:*, the event sends an Amazon S3 notification to the SQS queue created in the previous step. Refer to Walkthrough: Configuring a bucket for notifications (SNS topic or SQS queue) to configure the Amazon S3 notification in your S3 bucket.

create_s3_bucket

Create an IAM policy for the OpenSearch Ingest pipeline

Create an AWS Identity and Access Management (IAM) policy for the OpenSearch pipeline with the following permissions:

  • Read and delete rights on Amazon SQS
  • GetObject rights on Amazon S3
  • Describe domain and ESHttp rights on your OpenSearch Service domain

The following is an example policy:

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": "es:DescribeDomain",
      "Resource": "<OPENSEARCH_SERVICE_DOMAIN_ENDPOINT>:domain/*"
    },
    {
      "Effect": "Allow",
      "Action": "es:ESHttp*",
      "Resource": "<OPENSEARCH_SERVICE_DOMAIN_ENDPOINT>/*"
    },
    {
      "Effect": "Allow",
      "Action": "s3:GetObject",
      "Resource": "<S3_BUCKET_ARN>/*"
    },
    {
      "Effect": "Allow",
      "Action": [
        "sqs:DeleteMessage",
        "sqs:ReceiveMessage"
      ],
      "Resource": "<SQS_QUEUE_ARN>"
    }
  ]
}

create_policy

Create an IAM role and attach the IAM policy

A trust relationship defines which entities (such as AWS accounts, IAM users, roles, or services) are allowed to assume a particular IAM role. Create an IAM role for the OpenSearch Ingestion pipeline (osis-pipelines.amazonaws.com), attach the IAM policy created in the previous step, and add the trust relationship to allow OpenSearch Ingestion pipelines to write to domains.

create_iam_role

Configure an OpenSearch Ingestion pipeline

A pipeline is the mechanism that OpenSearch Ingestion uses to move data from its source (where the data comes from) to its sink (where the data goes). OpenSearch Ingestion provides out-of-the-box configuration blueprints to help you quickly set up pipelines without having to author a configuration from scratch. Set up the S3 bucket as the source and OpenSearch Service domain as the sink in the OpenSearch Ingestion pipeline with the following blueprint:

version: '2'
s3-pipeline:
  source:
    s3:
      acknowledgments: true
      notification_type: sqs
      compression: automatic
      codec:
        newline: 
          #header_destination: <column_names>
      sqs:
        queue_url: <SQS_QUEUE_URL>
      aws:
        region: <AWS_REGION>
        sts_role_arn: <STS_ROLE_ARN>
  processor:
    - csv:
        column_names_source_key: column_names
        column_names:
          - row_id
          - order_id
          - order_date
          - date_key
          - contact_name
          - country
          - city
          - region
          - sub_region
          - customer
          - customer_id
          - industry
          - segment
          - product
          - license
          - sales
          - quantity
          - discount
          - profit
    - convert_entry_type:
        key: sales
        type: double
    - convert_entry_type:
        key: profit
        type: double
    - convert_entry_type:
        key: discount
        type: double
    - convert_entry_type:
        key: quantity
        type: integer
    - date:
        match:
          - key: order_date
            patterns:
              - MM/dd/yyyy
        destination: order_date_new
  sink:
    - opensearch:
        hosts:
          - <OPEN_SEARCH_SERVICE_DOMAIN_ENDPOINT>
        index: csv-ingest-index
        aws:
          sts_role_arn: <STS_ROLE_ARN>
          region: <AWS_REGION>

On the OpenSearch Service console, create a pipeline with the name my-pipeline. Keep the default capacity settings and enter the preceding pipeline configuration in the Pipeline configuration section.

Update the configuration setting with the previously created IAM roles to read from Amazon S3 and write into OpenSearch Service, the SQS queue URL, and the OpenSearch Service domain endpoint.

create_pipeline

Validate the solution

To validate this solution, you can use the dataset SaaS-Sales.csv. This dataset contains transaction data from a software as a service (SaaS) company selling sales and marketing software to other companies (B2B). You can initiate this workflow by uploading the SaaS-Sales.csv file to the S3 bucket. This invokes the pipeline and creates an index in the OpenSearch Service domain you created earlier.

Follow these steps to validate the data using OpenSearch Dashboards.

First, you create an index pattern. An index pattern is a way to define a logical grouping of indexes that share a common naming convention. This allows you to search and analyze data across all matching indexes using a single query or visualization. For example, if you named your indexes csv-ingest-index-2024-01-01 and csv-ingest-index-2024-01-02 while ingesting the monthly sales data, you can define an index pattern as csv-* to encompass all these indexes.

create_index_pattern

Next, you create a visualization.  Visualizations are powerful tools to explore and analyze data stored in OpenSearch indexes. You can gather these visualizations into a real time OpenSearch dashboard. An OpenSearch dashboard provides a user-friendly interface for creating various types of visualizations such as charts, graphs, maps, and dashboards to gain insights from data.

You can visualize the sales data by industry with a pie chart with the index pattern created in the previous step. To create a pie chart, update the metrics details as follows on the Data tab:

  • Set Metrics to Slice
  • Set Aggregation to Sum
  • Set Field to sales

create_dashboard

To view the industry-wise sales details in the pie chart, add a new bucket on the Data tab as follows:

  • Set Buckets to Split Slices
  • Set Aggregation to Terms
  • Set Field to industry.keyword

create_pie_chart

You can visualize the data by creating more visuals in the OpenSearch dashboard.

add_visuals

Clean up

When you’re done exploring OpenSearch Ingestion and OpenSearch Dashboards, you can delete the resources you created to avoid incurring further costs.

Conclusion

In this post, you learned how to ingest CSV files efficiently from S3 buckets into OpenSearch Service with the OpenSearch Ingestion feature in a serverless way without requiring a third-party agent. You also learned how to analyze the ingested data using OpenSearch dashboard visualizations. You can now explore extending this solution to build OpenSearch Ingestion pipelines to load your data and derive insights with OpenSearch Dashboards.


About the Authors

Sharmila Shanmugam is a Solutions Architect at Amazon Web Services. She is passionate about solving the customers’ business challenges with technology and automation and reduce the operational overhead. In her current role, she helps customers across industries in their digital transformation journey and build secure, scalable, performant and optimized workloads on AWS.

Harsh Bansal is an Analytics Solutions Architect with Amazon Web Services. In his role, he collaborates closely with clients, assisting in their migration to cloud platforms and optimizing cluster setups to enhance performance and reduce costs. Before joining AWS, he supported clients in leveraging OpenSearch and Elasticsearch for diverse search and log analytics requirements.

Rohit Kumar works as a Cloud Support Engineer in the Support Engineering team at Amazon Web Services. He focuses on Amazon OpenSearch Service, offering guidance and technical help to customers, helping them create scalable, highly available, and secure solutions on AWS Cloud. Outside of work, Rohit enjoys watching or playing cricket. He also loves traveling and discovering new places. Essentially, his routine revolves around eating, traveling, cricket, and repeating the cycle.

Optimize storage costs in Amazon OpenSearch Service using Zstandard compression

Post Syndicated from Sarthak Aggarwal original https://aws.amazon.com/blogs/big-data/optimize-storage-costs-in-amazon-opensearch-service-using-zstandard-compression/

This post is co-written with Praveen Nischal and Mulugeta Mammo from Intel.

Amazon OpenSearch Service is a managed service that makes it straightforward to secure, deploy, and operate OpenSearch clusters at scale in the AWS Cloud. In an OpenSearch Service domain, the data is managed in the form of indexes. Based on the usage pattern, an OpenSearch cluster may have one or more indexes, and their shards are spread across the data nodes in the cluster. Each data node has a fixed disk size and the disk usage is dependent on the number of index shards stored on the node. Each index shard may occupy different sizes based on its number of documents. In addition to the number of documents, one of the important factors that determine the size of the index shard is the compression strategy used for an index.

As part of an indexing operation, the ingested documents are stored as immutable segments. Each segment is a collection of various data structures, such as inverted index, block K dimensional tree (BKD), term dictionary, or stored fields, and these data structures are responsible for retrieving the document faster during the search operation. Out of these data structures, stored fields, which are largest fields in the segment, are compressed when stored on the disk and based on the compression strategy used, the compression speed and the index storage size will vary.

In this post, we discuss the performance of the Zstandard algorithm, which was introduced in OpenSearch v2.9, amongst other available compression algorithms in OpenSearch.

Importance of compression in OpenSearch

Compression plays a crucial role in OpenSearch, because it significantly impacts the performance, storage efficiency and overall usability of the platform. The following are some key reasons highlighting the importance of compression in OpenSearch:

  1. Storage efficiency and cost savings OpenSearch often deals with vast volumes of data, including log files, documents, and analytics datasets. Compression techniques reduce the size of data on disk, leading to substantial cost savings, especially in cloud-based and/or distributed environments.
  2. Reduced I/O operations Compression reduces the number of I/O operations required to read or write data. Fewer I/O operations translate into reduced disk I/O, which is vital for improving overall system performance and resource utilization.
  3. Environmental impact By minimizing the storage requirements and reduced I/O operations, compression contributes to a reduction in energy consumption and a smaller carbon footprint, which aligns with sustainability and environmental goals.

When configuring OpenSearch, it’s essential to consider compression settings carefully to strike the right balance between storage efficiency and query performance, depending on your specific use case and resource constraints.

Core concepts

Before diving into various compression algorithms that OpenSearch offers, let’s look into three standard metrics that are often used while comparing compression algorithms:

  1. Compression ratio The original size of the input compared with the compressed data, expressed as a ratio of 1.0 or greater
  2. Compression speed The speed at which data is made smaller (compressed), expressed in MBps of input data consumed
  3. Decompression speed The speed at which the original data is reconstructed from the compressed data, expressed in MBps

Index codecs

OpenSearch provides support for codecs that can be used for compressing the stored fields. Until OpenSearch 2.7, OpenSearch provided two codecs or compression strategies: LZ4 and Zlib. LZ4 is analogous to best_speed because it provides faster compression but a lesser compression ratio (consumes more disk space) when compared to Zlib. LZ4 is used as the default compression algorithm if no explicit codec is specified during index creation and is preferred by most because it provides faster indexing and search speeds though it consumes relatively more space than Zlib. Zlib is analogous to best_compression because it provides a better compression ratio (consumes less disk space) when compared to LZ4, but it takes more time to compress and decompress, and therefore has higher latencies for indexing and search operations. Both LZ4 and Zlib codecs are part of the Lucene core codecs.

Zstandard codec

The Zstandard codec was introduced in OpenSearch as an experimental feature in version 2.7, and it provides Zstandard-based compression and decompression APIs. The Zstandard codec is based on JNI binding to the Zstd native library.

Zstandard is a fast, lossless compression algorithm aimed at providing a compression ratio comparable to Zlib but with faster compression and decompression speed comparable to LZ4. The Zstandard compression algorithm is available in two different modes in OpenSearch: zstd and zstd_no_dict. For more details, see Index codecs.

Both codec modes aim to balance compression ratio, index, and search throughput. The zstd_no_dict option excludes a dictionary for compression at the expense of slightly larger index sizes.

With the recent OpenSearch 2.9 release, the Zstandard codec has been promoted from experimental to mainline, making it suitable for production use cases.

Create an index with the Zstd codec

You can use the index.codec during index creation to create an index with the Zstd codec. The following is an example using the curl command (this command requires the user to have necessary privileges to create an index):

# Creating an index
curl -XPUT "http://localhost:9200/your_index" -H 'Content-Type: application/json' -d'
{
  "settings": {
    "index.codec": "zstd"
  }
}'

Zstandard compression levels

With Zstandard codecs, you can optionally specify a compression level using the index.codec.compression_level setting, as shown in the following code. This setting takes integers in the [1, 6] range. A higher compression level results in a higher compression ratio (smaller storage size) with a trade-off in speed (slower compression and decompression speeds lead to higher indexing and search latencies). For more details, see Choosing a codec.

# Creating an index
curl -XPUT "http://localhost:9200/your_index" -H 'Content-Type: application/json' -d'
{
  "settings": {
    "index.codec": "zstd",
    "index.codec.compression_level": 2
  }
}
'

Update an index codec setting

You can update the index.codec and index.codec.compression_level settings any time after the index is created. For the new configuration to take effect, the index needs to be closed and reopened.

You can update the setting of an index using a PUT request. The following is an example using curl commands.

Close the index:

# Close the index 
curl -XPOST "http://localhost:9200/your_index/_close"

Update the index settings:

# Update the index.codec and codec.compression_level setting
curl -XPUT "http://localhost:9200/your_index/_settings" -H 'Content-Type: application/json' -d' 
{ 
  "index": {
    "codec": "zstd_no_dict", 
    "codec.compression_level": 3 
  } 
}'

Reopen the index:

# Reopen the index
curl -XPOST "http://localhost:9200/your_index/_open"

Changing the index codec settings doesn’t immediately affect the size of existing segments. Only new segments created after the update will reflect the new codec setting. To have consistent segment sizes and compression ratios, it may be necessary to perform a reindexing or other indexing processes like merges.

Benchmarking compression performance of compression in OpenSearch

To understand the performance benefits of Zstandard codecs, we carried out a benchmark exercise.

Setup

The server setup was as follows:

  1. Benchmarking was performed on an OpenSearch cluster with a single data node which acts as both data and coordinator node and with a dedicated cluster_manager node.
  2. The instance type for the data node was r5.2xlarge and the cluster_manager node was r5.xlarge, both backed by an Amazon Elastic Block Store (Amazon EBS) volume of type GP3 and size 100GB.

Benchmarking was set up as follows:

  1. The benchmark was run on a single node of type c5.4xlarge (sufficiently large to avoid hitting client-side resource constraints) backed by an EBS volume of type GP3 and size 500GB.
  2. The number of clients was 16 and bulk size was 1024
  3. The workload was nyc_taxis

The index setup was as follows:

  1. Number of shards: 1
  2. Number of replicas: 0

Results

From the experiments, zstd provides a better compression ratio compared to Zlib (best_compression) with a slight gain in write throughput and with similar read latency as LZ4 (best_speed). zstd_no_dict provides 14% better write throughput than LZ4 (best_speed) and a slightly lower compression ratio than Zlib (best_compression).

The following table summarizes the benchmark results.

Limitations

Although Zstd provides the best of both worlds (compression ratio and compression speed), it has the following limitations:

  1. Certain queries that fetch the entire stored fields for all the matching documents may observe an increase in latency. For more information, see Changing an index codec.
  2. You can’t use the zstd and zstd_no_dict compression codecs for k-NN or Security Analytics indexes.

Conclusion

Zstandard compression provides a good balance between storage size and compression speed, and is able to tune the level of compression based on the use case. Intel and the OpenSearch Service team collaborated on adding Zstandard as one of the compression algorithms in OpenSearch. Intel contributed by designing and implementing the initial version of compression plugin in open-source which was released in OpenSearch v2.7 as experimental feature. OpenSearch Service team worked on further improvements, validated the performance results and integrated it into the OpenSearch server codebase where it was released in OpenSearch v2.9 as a generally available feature.

If you would want to contribute to OpenSearch, create a GitHub issue and share your ideas with us. We would also be interested in learning about your experience with Zstandard in OpenSearch Service. Please feel free to ask more questions in the comments section.


About the Authors

Praveen Nischal is a Cloud Software Engineer, and leads the cloud workload performance framework at Intel.

Mulugeta Mammo is a Senior Software Engineer, and currently leads the OpenSearch Optimization team at Intel.

Akash Shankaran is a Software Architect and Tech Lead in the Xeon software team at Intel. He works on pathfinding opportunities, and enabling optimizations for data services such as OpenSearch.

Sarthak Aggarwal is a Software Engineer at Amazon OpenSearch Service. He has been contributing towards open-source development with indexing and storage performance as a primary area of interest.

Prabhakar Sithanandam is a Principal Engineer with Amazon OpenSearch Service. He primarily works on the scalability and performance aspects of OpenSearch.

AWS Weekly Roundup: New AWS Heroes, Amazon API Gateway, Amazon Q and more (June 10, 2024)

Post Syndicated from Donnie Prakoso original https://aws.amazon.com/blogs/aws/aws-weekly-roundup-new-aws-heroes-amazon-api-gateway-amazon-q-and-more-june-10-2024/

In the last AWS Weekly Roundup, Channy reminded us on how life has ups and downs. It’s just how life is. But, that doesn’t mean that we should do it alone. Farouq Mousa, AWS Community Builder, is fighting brain cancer and Allen Helton, AWS Serverless Hero, his daughter is fighting leukemia.

If you have a moment, please visit their campaign pages and give your support.

Meanwhile, we’ve just finished a few AWS Summits in India, Korea and also Thailand. As always, I had so much fun working together at Developer Lounge with AWS Heroes, AWS Community Builders, and AWS User Group leaders. Here’s a photo from everyone here.

Last Week’s Launches
Here are some launches that caught my attention last week:

Welcome, new AWS Heroes! — Last week, we just announced new cohort for AWS Heroes, worldwide group of AWS experts who go above and beyond to share knowledge and empower their communities.

Amazon API Gateway increased integration timeout limit — If you’re using Regional REST APIs and private REST APIs in Amazon API Gateway, now you can increase the integration timeout limit greater than 29 seconds. This allows you to run various workloads requiring longer timeouts.

Amazon Q offers inline completion in the command line — Now, Amazon Q Developer provides real-time AI-generated code suggestions as you type in your command line. As a regular command line interface (CLI) user, I’m really excited about this.

New common control library in AWS Audit Manager — This announcement helps you to save time when mapping enterprise controls into AWS Audit Manager. Check out Danilo’s post where he elaborated how that you can simplify risk and complicance assessment with the new common control library.

Amazon Inspector container image scanning for Amazon CodeCatalyst and GitHub actions — If you need to integrate your CI/CD with software vulnerabilities checking, you can use Amazon Inspector. Now, with this native integration in GitHub actions and Amazon CodeCatalyst, it streamlines your development pipeline process.

Ingest streaming data with Amazon OpenSearch Ingestion and Amazon Managed Streaming for Apache Kafka — With this new capability, now you can build more efficient data pipelines for your complex analytics use cases. Now, you can seamlessly index the data from your Amazon MSK Serverless clusters in Amazon OpenSearch service.

Amazon Titan Text Embeddings V2 now available in Amazon Bedrock Knowledge Base — You now can embed your data into a vector database using Amazon Titan Text Embeddings V2. This will be helpful for you to retrieve relevant information for various tasks.

Max tokens 8,192
Languages 100+ in pre-training
Fine-tuning supported No
Normalization supported Yes
Vector size 256, 512, 1,024 (default)

From Community.aws
Here’s my 3 personal favorites posts from community.aws:

Upcoming AWS events
Check your calendars and sign up for these AWS and AWS Community events:

  • AWS Summits — Join free online and in-person events that bring the cloud computing community together to connect, collaborate, and learn about AWS. Register in your nearest city: Japan (June 20), Washington, DC (June 26–27), and New York (July 10).

  • AWS re:Inforce — Join us for AWS re:Inforce (June 10–12) in Philadelphia, PA. AWS re:Inforce is a learning conference focused on AWS security solutions, cloud security, compliance, and identity. Connect with the AWS teams that build the security tools and meet AWS customers to learn about their security journeys.

  • AWS Community Days — Join community-led conferences that feature technical discussions, workshops, and hands-on labs led by expert AWS users and industry leaders from around the world: Midwest | Columbus (June 13), Sri Lanka (June 27), Cameroon (July 13), New Zealand (August 15), Nigeria (August 24), and New York (August 28).

You can browse all upcoming in-person and virtual events.

That’s all for this week. Check back next Monday for another Weekly Roundup!

Donnie

This post is part of our Weekly Roundup series. Check back each week for a quick roundup of interesting news and announcements from AWS!

Modernize your data observability with Amazon OpenSearch Service zero-ETL integration with Amazon S3

Post Syndicated from Joshua Bright original https://aws.amazon.com/blogs/big-data/modernize-your-data-observability-with-amazon-opensearch-service-zero-etl-integration-with-amazon-s3/

We are excited to announce the general availability of Amazon OpenSearch Service zero-ETL integration with Amazon Simple Storage Service (Amazon S3) for domains running 2.13 and above. The integration is new way for customers to query operational logs in Amazon S3 and Amazon S3-based data lakes without needing to switch between tools to analyze operational data. By querying across OpenSearch Service and S3 datasets, you can evaluate multiple data sources to perform forensic analysis of operational and security events. The new integration with OpenSearch Service supports AWS’s zero-ETL vision to reduce the operational complexity of duplicating data or managing multiple analytics tools by enabling you to directly query your operational data, reducing costs and time to action.

OpenSearch is an open source, distributed search and analytics suite derived from Elasticsearch 7.10. OpenSearch Service currently has tens of thousands of active customers with hundreds of thousands of clusters under management processing hundreds of trillions of requests per month.

Amazon S3 is an object storage service offering industry-leading scalability, data availability, security, and performance. Organizations of all sizes and industries can store and protect any amount of data for virtually any use case, such as data lakes, cloud-centered applications, and mobile apps. With cost-effective storage classes and user-friendly management features, you can optimize costs, organize data, and configure fine-tuned access controls to meet specific business, organizational, and compliance requirements. Let’s dig into this exciting new feature for OpenSearch Service.

Benefits of using OpenSearch Service zero-ETL integration with Amazon S3

OpenSearch Service zero-ETL integration with Amazon S3 allows you to use the rich analytics capabilities of OpenSearch Service SQL and PPL directly on infrequently queried data stored outside of OpenSearch Service in Amazon S3. It also integrates with other OpenSearch integrations so you can install prepackaged queries and visualizations to analyze your data, making it straightforward to quickly get started.

The following diagram illustrates how OpenSearch Service unlocks value stored in infrequently queried logs from popular AWS log types.

You can use OpenSearch Service direct queries to query data in Amazon S3. OpenSearch Service provides a direct query integration with Amazon S3 as a way to analyze operational logs in Amazon S3 and data lakes based in Amazon S3 without having to switch between services. You can now analyze data in cloud object stores and simultaneously use the operational analytics and visualizations of OpenSearch Service.

Many customers currently use Amazon S3 to store event data for their solutions. For operational analytics, Amazon S3 is typically used as a destination for VPC Flow Logs, Amazon S3 Access Logs, AWS Load Balancer Logs, and other event sources from AWS services. Customers also store data directly from application events in Amazon S3 for compliance and auditing needs. The durability and scalability of Amazon S3 makes it an obvious data destination for many customers that want a longer-term storage or archival option at a cost-effective price point.

Bringing data from these sources into OpenSearch Service stored in hot and warm storage tiers may be prohibitive due to the size and volume of the events being generated. For some of these event sources that are stored into OpenSearch Service indexes, the volume of queries run against the data doesn’t justify the cost to continue to store them in their cluster. Previously, you would pick and choose which event sources you brought in for ingestion into OpenSearch Service based on the storage provisioned in your cluster. Access to other data meant using different tools such as Amazon Athena to view the data on Amazon S3.

For a real-world example, let’s see how using the new integration benefited Arcesium.

“Arcesium provides advanced cloud-native data, operations, and analytics capabilities for the financial services industry. Our software platform processes many millions of transactions a day, emitting large volumes of log and audit records along the way. The volume of log data we needed to process, store, and analyze was growing exponentially given our retention and compliance needs. Amazon OpenSearch Service’s new zero-ETL integration with Amazon S3 is helping our business scale by allowing us to analyze infrequently queried logs already stored in Amazon S3 instead of incurring the operational expense of maintaining large and costly online OpenSearch clusters or building ad hoc ingestion pipelines.”

– Kyle George, SVP & Global Head of Infrastructure at Arcesium.

With direct queries with Amazon S3, you no longer need to build complex extract, transform, and load (ETL) pipelines or incur the expense of duplicating data in both OpenSearch Service and Amazon S3 storage.

Fundamental concepts

After configuring a direct query connection, you’ll need to create tables in the AWS Glue Data Catalog using the OpenSearch Service Query Workbench. The direct query connection relies on the metadata in Glue Data Catalog tables to query data stored in Amazon S3. Note that tables created by AWS Glue crawlers or Athena are not currently supported.

By combining the structure of Data Catalog tables, SQL indexing techniques, and OpenSearch Service indexes, you can accelerate query performance, unlock advanced analytics capabilities, and contain querying costs. Below are a few examples of how you can accelerate your data:

  • Skipping indexes – You ingest and index only the metadata of the data stored in Amazon S3. When you query a table with a skipping index, the query planner references the index and rewrites the query to efficiently locate the data, instead of scanning all partitions and files. This allows the skipping index to quickly narrow down the specific location of the stored data that’s relevant to your analysis.
  • Materialized views – With materialized views, you can use complex queries, such as aggregations, to power dashboard visualizations. Materialized views ingest a small amount of your data into OpenSearch Service storage.
  • Covering indexes – With a covering index, you can ingest data from a specified column in a table. This is the most performant of the three indexing types. Because OpenSearch Service ingests all data from your desired column, you get better performance and can perform advanced analytics. OpenSearch Service creates a new index from the covering index data. You can use this new index for dashboard visualizations and other OpenSearch Service functionality, such as anomaly detection or geospatial capabilities.

As new data comes in to your S3 bucket, you can configure a refresh interval for your materialized views and covering indexes to provide local access to the most current data on Amazon S3.

Solution overview

Let’s take a test drive using VPC Flow Logs as your source! As mentioned before, many AWS services emit logs to Amazon S3. VPC Flow Logs is a feature of Amazon Virtual Private Cloud (Amazon VPC) that enables you to capture information about the IP traffic going to and from network interfaces in your VPC. For this walkthrough, you perform the following steps:

  1. Create an S3 bucket if you don’t already have one available.
  2. Enable VPC Flow Logs using an existing VPC that can generate traffic and store the logs as Parquet on Amazon S3.
  3. Verify the logs exist in your S3 bucket.
  4. Set up a direct query connection to the Data Catalog and the S3 bucket that has your data.
  5. Install the integration for VPC Flow Logs.

Create an S3 bucket

If you have an existing S3 bucket, you can reuse that bucket by creating a new folder inside of the bucket. If you need to create a bucket, navigate to the Amazon S3 console and create an Amazon S3 bucket with a name that is suitable for your organization.

Enable VPC Flow Logs

Complete the following steps to enable VPC Flow Logs:

  1. On the Amazon VPC console, choose a VPC that has application traffic that can generate logs.
  2. On the Flow Logs tab, choose Create flow log.
  3. For Filter, choose ALL.
  4. Set Maximum aggregation interval to 1 minute.
  5. For Destination, choose Send to an Amazon S3 bucket and provide the S3 bucket ARN from the bucket you created earlier.
  6. For Log record format, choose Custom format and select Standard attributes.

For this post, we don’t select any of the Amazon Elastic Container Service (Amazon ECS) attributes because they’re not implemented with OpenSearch integrations as of this writing.

  1. For Log file format, choose Parquet.
  2. For Hive-compatible S3 prefix, choose Enable.
  3. Set Partition logs by time to every 1 hour (60 minutes).

Validate you are receiving logs in your S3 bucket

Navigate to the S3 bucket you created earlier to see that data is streaming into your S3 bucket. If you drill down and navigate the directory structure, you find that the logs are delivered in an hourly folder and emitted every minute.

Now that you have VPC Flow Logs flowing into an S3 bucket, you need to set up a connection between your data on Amazon S3 and your OpenSearch Service domain.

Set up a direct query data source

In this step, you create a direct query data source which uses Glue Data Catalog tables and your Amazon S3 data. The action creates all the necessary infrastructure to give you access to the Hive metastore (databases and tables in Glue Data Catalog and the data housed in Amazon S3 for the bucket and folder combination you want the data source to have access to. It will also wire in all the appropriate permissions with the Security plugin’s fine-grained access control so you don’t have to worry about permissions to get started.

Complete the following steps to set up your direct query data source:

  1. On the OpenSearch Service domain, choose Domains in the navigation pane.
  2. Choose your domain.
  3. On the Connections tab, choose Create new connection.
  4. For Name, enter a name without dashes, such as zero_etl_walkthrough.
  5. For Description, enter a descriptive name.
  6. For Data source type, choose Amazon S3 with AWS Glue Data Catalog.
  7. For IAM role, if this is your first time, let the direct query setup take care of the permissions by choosing Create a new role. You can edit it later based on your organization’s compliance and security needs. For this post, we name the role zero_etl_walkthrough.
  8. For S3 buckets, use the one you created.
  9. Do not select the check box to grant access to all new and existing buckets.
  10. For Checkpoint S3 bucket, use the same bucket you created. The checkpoint folders get created for you automatically.
  11. For AWS Glue tables, because you don’t have anything that you have created in the Data Catalog, enable Grant access to all existing and new tables.

The VPC Flow Logs OpenSearch integration will create resources in the Data Catalog, and you will need access to pick those resources up.

  1. Choose Create.

Now that the initial setup is complete, you can install the OpenSearch integration for VPC Flow Logs.

Install the OpenSearch integration for VPC Flow Logs

The integrations plugin contains a wide variety of prebuilt dashboards, visualizations, mapping templates, and other resources that make visualizing and working with data generated by your sources simpler. The integration for Amazon VPC installs a variety of resources to view your VPC Flow Logs data as it sits in Amazon S3.

In this section, we show you how to make sure you have the most up-to-date integration packages for installation. We then show you how to install the OpenSearch integration. In most cases, you will have the latest integrations such as VPC Flow Logs, NGINX, HA Proxy, or Amazon S3 (access logs) at the time of the release of a minor or major version. However, OpenSearch is an open source community-led project, and you can expect that there will be version changes and new integrations not yet included with your current deployment.

Verify the latest version of the OpenSearch integration for Amazon VPC

You may have upgraded from earlier versions of OpenSearch Service to OpenSearch Service version 2.13. Let’s confirm that your deployment matches what is present in this post.

On OpenSearch Dashboards, navigate to the Integrations tab and choose Amazon VPC. You will see a release version for the integration.

Confirm that you have version 1.1.0 or higher. If your deployment doesn’t have it, you can install the latest version of the integration from the OpenSearch catalog. Complete the following steps:

  1. Navigate to the OpenSearch catalog.
  2. Choose Amazon VPC Flow Logs.
  3. Download the 1.1.0 Amazon VPC Integration file from the repository folder labeled amazon_vpc_flow_1.1.0.
  4. In the OpenSearch Dashboard’s Dashboard Management plugin, choose Saved objects.
  5. Choose Import and browse your local folders.
  6. Import the downloaded file.

The file contains all the necessary objects to create an integration. After it’s installed, you can proceed to the steps to set up the Amazon VPC OpenSearch integration.

Set up the OpenSearch integration for Amazon VPC

Let’s jump in and install the integration:

  1. In OpenSearch Dashboards, navigate to the Integrations tab.
  2. Choose the Amazon VPC integration.
  3. Confirm the version is 1.1.0 or higher and choose Set Up.
  4. For Display Name, keep the default.
  5. For Connection Type, choose S3 Connection.
  6. For Data Source, choose the direct query connection alias you created in prior steps. In this post, we use zero_etl_walkthrough.
  7. For Spark Table Name, keep the prepopulated value of amazon_vpc_flow.
  8. For S3 Data Location, enter the S3 URI of your log folder created by VPC Flow Logs set up in the prior steps. In this post, we use s3://zero-etl-walkthrough/AWSLogs/.

S3 bucket names are globally unique, and you may want to consider using bucket names that conform to your company’s compliance guidance. UUIDs plus a descriptive name are good options to guarantee uniqueness.

  1. For S3 Checkpoint Location, enter the S3 URI of your checkpoint folder which you define. Checkpoints store metadata for the direct query feature. Make sure you pick any empty or unused path in the bucket you choose. In this post, we use s3://zero-etl-walkthrough/CP/, which is in the same bucket we created earlier.
  2. Select Queries (recommended) and Dashboards and Visualizations for Flint Integrations using live queries.

You get a message that states “Setting Up the Integration – this can take several minutes.” This particular integration sets up skipping indexes and materialized views on top of your data in Amazon S3. The materialized view aggregates the data into a backing index that occupies a significantly smaller data footprint in your cluster compared to ingesting all the data and building visualizations on top of it.

When the Amazon VPC integration installation is complete, you have a broad variety of assets to play with. If you navigate to the installed integrations, you will find queries, visualizations, and other assets that can help you jumpstart your data exploration using data sitting on Amazon S3. Let’s look at the dashboard that gets installed for this integration.

I love it! How much does it cost?

With OpenSearch Service direct queries, you only pay for the resources consumed by your workload. OpenSearch Service charges for only the compute needed to query your external data as well as maintain optional indexes in OpenSearch Service. The compute capacity is measured in OpenSearch Compute Units (OCUs). If no queries or indexing activities are active, no OCUs are consumed. The following table contains sample compute prices based on searching HTTP logs in IAD.

Data scanned per query (GB) OCU price per query (USD)
1-10 $0.026
100 $0.24
1000 $1.35

Because the price is based on the OCUs used per query, this solution is tailored for infrequently queried data. If your users query data often, it makes more sense to fully ingest into OpenSearch Service and take advantage of storage optimization techniques such as using OR1 instances or UltraWarm.

OCUs consumed by zero-ETL integrations will be populated in AWS Cost Explorer. This will be at the account level. You can account for OCU usage at the account level and set thresholds and alerts when thresholds have been crossed. The format of the usage type to filter on in Cost Explorer is RegionCode-DirectQueryOCU (OCU-hours). You can create a budget using AWS Budgets and configure an alert to be notified when DirectQueryOCU (OCU-Hours) usage meets the threshold you set. You can also optionally use an Amazon Simple Notification Service (Amazon SNS) topic with an AWS Lambda function as a target to turn off a data source when a threshold criterion is met.

Summary

Now that you have a high-level understanding of the direct query connection feature, OpenSearch integrations, and how the OpenSearch Service zero-ETL integration with Amazon S3 works, you should consider using the feature as part of your organization’s toolset. With OpenSearch Service zero-ETL integration with Amazon S3, you now have a new tool for event analysis. You can bring hot data into OpenSearch Service for near real-time analysis and alerting. For the infrequently queried, larger data, mainly used for post-event analysis and correlation, you can query that data on Amazon S3 without moving the data. The data stays in Amazon S3 for cost-effective storage, and you access that data as needed without building additional infrastructure to move the data into OpenSearch Service for analysis.

For more information, refer to Working with Amazon OpenSearch Service direct queries with Amazon S3.


About the authors

Joshua Bright is a Senior Product Manager at Amazon Web Services. Joshua leads data lake integration initiatives within the OpenSearch Service team. Outside of work, Joshua enjoys listening to birds while walking in nature.

Kevin Fallis is an Principal Specialist Search Solutions Architect at Amazon Web Services. His passion is to help customers leverage the correct mix of AWS services to achieve success for their business goals. His after-work activities include family, DIY projects, carpentry, playing drums, and all things music.


Sam Selvan
is a Principal Specialist Solution Architect with Amazon OpenSearch Service.

Implement a full stack serverless search application using AWS Amplify, Amazon Cognito, Amazon API Gateway, AWS Lambda, and Amazon OpenSearch Serverless

Post Syndicated from Anand Komandooru original https://aws.amazon.com/blogs/big-data/implement-a-full-stack-serverless-search-application-using-aws-amplify-amazon-cognito-amazon-api-gateway-aws-lambda-and-amazon-opensearch-serverless/

Designing a full stack search application requires addressing numerous challenges to provide a smooth and effective user experience. This encompasses tasks such as integrating diverse data from various sources with distinct formats and structures, optimizing the user experience for performance and security, providing multilingual support, and optimizing for cost, operations, and reliability.

Amazon OpenSearch Serverless is a powerful and scalable search and analytics engine that can significantly contribute to the development of search applications. It allows you to store, search, and analyze large volumes of data in real time, offering scalability, real-time capabilities, security, and integration with other AWS services. With OpenSearch Serverless, you can search and analyze a large volume of data without having to worry about the underlying infrastructure and data management. An OpenSearch Serverless collection is a group of OpenSearch indexes that work together to support a specific workload or use case. Collections have the same kind of high-capacity, distributed, and highly available storage volume that’s used by provisioned Amazon OpenSearch Service domains, but they remove complexity because they don’t require manual configuration and tuning. Each collection that you create is protected with encryption of data at rest, a security feature that helps prevent unauthorized access to your data. OpenSearch Serverless also supports OpenSearch Dashboards, which provides an intuitive interface for analyzing data.

OpenSearch Serverless supports three primary use cases:

  • Time series – The log analytics workloads that focus on analyzing large volumes of semi-structured, machine-generated data in real time for operational, security, user behavior, and business insights
  • Search – Full-text search that powers applications in your internal networks (content management systems, legal documents) and internet-facing applications, such as ecommerce website search and content search
  • Vector search – Semantic search on vector embeddings that simplifies vector data management and powers machine learning (ML) augmented search experiences and generative artificial intelligence (AI) applications, such as chatbots, personal assistants, and fraud detection

In this post, we walk you through a reference implementation of a full-stack cloud-centered serverless text search application designed to run using OpenSearch Serverless.

Solution overview

The following services are used in the solution:

  • AWS Amplify is a set of purpose-built tools and features that enables frontend web and mobile developers to quickly and effortlessly build full-stack applications on AWS. These tools have the flexibility to use the breadth of AWS services as your use cases evolve. This solution uses the Amplify CLI to build the serverless movie search web application. The Amplify backend is used to create resources such as the Amazon Cognito user pool, API Gateway, Lambda function, and Amazon S3 storage.
  • Amazon API Gateway is a fully managed service that makes it straightforward for developers to create, publish, maintain, monitor, and secure APIs at any scale. We use API Gateway as a “front door” for the movie search application for searching movies.
  • AWS CloudFront accelerates the delivery of web content such as static and dynamic web pages, video streams, and APIs to users across the globe by caching content at edge locations closer to the end-users. This solution uses CloudFront with Amazon S3 to deliver the search application user interface to the end users.
  • Amazon Cognito makes it straightforward for adding authentication, user management, and data synchronization without having to write backend code or manage any infrastructure. We use Amazon Cognito for creating a user pool so the end-user can log in to the movie search application through Amazon Cognito.
  • AWS Lambda is a serverless, event-driven compute service that lets you run code for virtually any type of application or backend service without provisioning or managing servers. Our solution uses a Lambda function to query OpenSearch Serverless. API Gateway forwards all requests to the Lambda function to serve up the requests.
  • Amazon OpenSearch Serverless is a serverless option for OpenSearch Service. In this post, you use common methods for searching documents in OpenSearch Service that improve the search experience, such as request body searches using domain-specific language (DSL) for queries. The query DSL lets you specify the full range of OpenSearch search options, including pagination and sorting the search results. Pagination and sorting are implemented on the server side using DSL as part of this implementation.
  • Amazon Simple Storage Service (Amazon S3) is an object storage service that offers industry-leading scalability, data availability, security, and performance. The solution uses Amazon S3 as storage for storing movie trailers.
  • AWS WAF helps protects web applications from attacks by allowing you to configure rules that allow, block, or monitor (count) web requests based on conditions that you define. We use AWS WAF to allow access to the movie search app from only IP addresses on an allow list.

The following diagram illustrates the solution architecture.

The workflow includes the following steps:

  1. The end-user accesses the CloudFront and Amazon S3 hosted movie search web application from their browser or mobile device.
  2. The user signs in with their credentials.
  3. A request is made to an Amazon Cognito user pool for a login authentication token, and a token is received for a successful sign-in request.
  4. The search application calls the search API method with the token in the authorization header to API Gateway. API Gateway is protected by AWS WAF to enforce rate limiting and implement allow and deny lists.
  5. API Gateway passes the token for validation to the Amazon Cognito user pool. Amazon Cognito validates the token and sends a response to API Gateway.
  6. API Gateway invokes the Lambda function to process the request.
  7. The Lambda function queries OpenSearch Serverless and returns the metadata for the search.
  8. Based on metadata, content is returned from Amazon S3 to the user.

In the following sections, we walk you through the steps to deploy the solution, ingest data, and test the solution.

Prerequisites

Before you get started, make sure you complete the following prerequisites:

  1. Install Nodejs latest LTS version.
  2. Install and configure the AWS Command Line Interface (AWS CLI).
  3. Install awscurl for data ingestion.
  4. Install and configure the Amplify CLI. At the end of configuration, you should successfully set up the new user using the amplify-dev user’s AccessKeyId and SecretAccessKey in your local machine’s AWS profile.
  5. Amplify users need additional permissions in order to deploy AWS resources. Complete the following steps to create a new inline AWS Identity and Access Management (IAM) policy and attach it to the user:
    • On the IAM console, choose Users in the navigation pane.
    • Choose the user amplify-dev.
    • On the Permissions tab, choose the Add permissions dropdown menu, then choose Inline policy.
    • In the policy editor, choose JSON.

You should see the default IAM statement in JSON format.

This environment name needs to be used when performing amplify init when bringing up the backend. The actions in the IAM statement are largely open (*) but restricted or limited by the target resources; this is done to satisfy the maximum inline policy length (2,048 characters).

    • Enter the updated JSON into the policy editor, then choose Next.
    • For Policy name, enter a name (for this post, AddionalPermissions-Amplify).
    • Choose Create policy.

You should now see the new inline policy attached to the user.

Deploy the solution

Complete the following steps to deploy the solution:

  1. Clone the repository to a new folder on your desktop using the following command:
    git clone https://github.com/aws-samples/amazon-opensearchserverless-searchapp.git

  2. Deploy the movie search backend.
  3. Deploy the movie search frontend.

Ingest data

To ingest the sample movie data into the newly created OpenSearch Serverless collection, complete the following steps:

  • On the OpenSearch Service console, choose Ingestion: Pipelines in the navigation pane.
  • Choose the pipeline movie-ingestion and locate the ingestion URL.

  • Replace the ingestion endpoint and Region in the following snippet and run the awscurl command to save data into the collection:
awscurl --service osis --region <region> \
-X POST \
-H "Content-Type: application/json" \
-d "@project_assets/movies-data.json" \
https://<ingest_url>/movie-ingestion/data 

You should see a 200 OK response.

  • On the Amazon S3 console, open the trailer S3 bucket (created as part of the backend deployment.
  • Upload some movie trailers.

Storage

Make sure the file name matches the ID field in sample movie data (for example, tt1981115.mp4, tt0800369.mp4, and tt0172495.mp4). Uploading a trailer with ID tt0172495.mp4 is used as the default trailer for all movies, without having to upload one for each movie.

Test the solution

Access the application using the CloudFront distribution domain name. You can find this by opening the CloudFront console, choosing the distribution, and copying the distribution domain name into your browser.

Sign up for application access by entering your user name, password, and email address. The password should be at least eight characters in length, and should include at least one uppercase character and symbol.

Sign Up

After you’re logged in, you’re redirected to the Movie Finder home page.

Home Page

You can search using a movie name, actor, or director, as shown in the following example. The application returns results using OpenSearch DSL.

Search Results

If there’s a large number of search results, you can navigate through them using the pagination option at the bottom of the page. For more information about how the application uses pagination, see Paginating search results.

Pagination

You can choose movie tiles to get more details and watch the trailer if you took the optional step of uploading a movie trailer.

Movie Details

You can sort the search results using the Sort by feature. The application uses the sort functionality within OpenSearch.

Sort

There are many more DSL search patterns that allow for intricate searches. See Query DSL for complete details.

Monitoring OpenSearch Serverless

Monitoring is an important part of maintaining the reliability, availability, and performance of OpenSearch Serverless and your other AWS services. AWS provides Amazon CloudWatch and AWS CloudTrail to monitor OpenSearch Serverless, report when something is wrong, and take automatic actions when appropriate. For more information, see Monitoring Amazon OpenSearch Serverless.

Clean up

To avoid unnecessary charges, clean up the solution implementation by running the following command at the project root folder you created using the git clone command during deployment:

amplify delete

You can also clean up the solution by deleting the AWS CloudFormation stack you deployed as part of the setup. For instructions, see Deleting a stack on the AWS CloudFormation console.

Conclusion

In this post, we implemented a full-stack serverless search application using OpenSearch Serverless. This solution seamlessly integrates with various AWS services, such as Lambda for serverless computing, API Gateway for constructing RESTful APIs, IAM for robust security, Amazon Cognito for streamlined user management, and AWS WAF for safeguarding the web application against threats. By adopting a serverless architecture, this search application offers numerous advantages, including simplified deployment processes and effortless scalability, with the benefits of a managed infrastructure.

With OpenSearch Serverless, you get the same interactive millisecond response times as OpenSearch Service with the simplicity of a serverless environment. You pay only for what you use by automatically scaling resources to provide the right amount of capacity for your application without impacting performance and scale as needed. You can use OpenSearch Serverless and this reference implementation to build your own full-stack text search application.


About the Authors

Anand Komandooru is a Principal Cloud Architect at AWS. He joined AWS Professional Services organization in 2021 and helps customers build cloud-native applications on AWS cloud. He has over 20 years of experience building software and his favorite Amazon leadership principle is “Leaders are right a lot“.

Rama Krishna Ramaseshu is a Senior Application Architect at AWS. He joined AWS Professional Services in 2022 and with close to two decades of experience in application development and software architecture, he empowers customers to build well architected solutions within the AWS cloud. His favorite Amazon leadership principle is “Learn and Be Curious”.

Sachin Vighe is a Senior DevOps Architect at AWS. He joined AWS Professional Services in 2020, and specializes in designing and architecting solutions within the AWS cloud to guide customers through their DevOps and Cloud transformation journey. His favorite leadership principle is “Customer Obsession”.

Molly Wu is an Associate Cloud Developer at AWS. She joined AWS Professional Services in 2023 and specializes in assisting customers in building frontend technologies in AWS cloud. Her favorite leadership principle is “Bias for Action”.

Andrew Yankowsky is a Security Consultant at AWS. He joined AWS Professional Services in 2023, and helps customers build cloud security capabilities and follow security best practices on AWS. His favorite leadership principle is “Earn Trust”.

Build a decentralized semantic search engine on heterogeneous data stores using autonomous agents

Post Syndicated from Dhaval Shah original https://aws.amazon.com/blogs/big-data/build-a-decentralized-semantic-search-engine-on-heterogeneous-data-stores-using-autonomous-agents/

Large language models (LLMs) such as Anthropic Claude and Amazon Titan have the potential to drive automation across various business processes by processing both structured and unstructured data. For example, financial analysts currently have to manually read and summarize lengthy regulatory filings and earnings transcripts in order to respond to Q&A on investment strategies. LLMs could automate the extraction and summarization of key information from these documents, enabling analysts to query the LLM and receive reliable summaries. This would allow analysts to process the documents to develop investment recommendations faster and more efficiently. Anthropic Claude and other LLMs on Amazon Bedrock can bring new levels of automation and insight across many business functions that involve both human expertise and access to knowledge spread across an organization’s databases and content repositories.

Amazon Bedrock is a fully managed service that offers a choice of high-performing foundation models (FMs) from leading AI companies like AI21 Labs, Anthropic, Cohere, Meta, Stability AI, and Amazon via a single API, along with a broad set of capabilities you need to build generative AI applications with security, privacy, and responsible AI.

In this post, we show how to build a Q&A bot with RAG (Retrieval Augmented Generation). RAG uses data sources like Amazon Redshift and Amazon OpenSearch Service to retrieve documents that augment the LLM prompt. For getting data from Amazon Redshift, we use the Anthropic Claude 2.0 on Amazon Bedrock, summarizing the final response based on pre-defined prompt template libraries from LangChain. To get data from Amazon OpenSearch Service, we chunk, and convert the source data chunks to vectors using Amazon Titan Text Embeddings model.

For client interaction we use Agent Tools based on ReAct. A ReAct prompt consists of few-shot task-solving trajectories, with human-written text reasoning traces and actions, as well as environment observations in response to actions. In this example, we use ReAct for zero-shot training to generate responses to fit in a pre-defined template. The additional information is concatenated as context with the original input prompt and fed to the text generator which produces the final output. This makes RAG adaptive for situations where facts could evolve over time.

Solution overview

Our solution demonstrates how financial analysts can use generative artificial intelligence (AI) to adapt their investment recommendations based on financial reports and earnings transcripts with RAG to use LLMs to generate factual content.

The hybrid architecture uses multiple databases and LLMs, with foundation models from Amazon Bedrock for data source identification, SQL generation, and text generation with results. In the following architecture, Steps 1 and 2 represent data ingestion to be done by data engineering in batch mode. Steps 3, 4, and 5 are the queries and response formation.

The following diagram shows a more detailed view of the Q&A processing chain. The user asks a question, and LangChain queries the Redshift and OpenSearch Service data stores for relevant information to build the prompt. It sends the prompt to the Anthropic Claude on Amazon Bedrock model, and returns the response.

The details of each step are as follows:

  1. Populate the Amazon Redshift Serverless data warehouse with company stock information stored in Amazon Simple Storage Service (Amazon S3). Redshift Serverless is a fully functional data warehouse holding data tables maintained in real time.
  2. Load the unstructured data from your S3 data lake to OpenSearch Service to create an index to store and perform semantic search. The LangChain library loads knowledge base documents, splits the documents into smaller chunks, and uses Amazon Titan to generate embeddings for chunks.
  3. The client submits a question via an interface like a chatbot or website.
  4. You will create multiple steps to transform a user query passed from Amazon SageMaker Notebook to execute API calls to LLMs from Amazon Bedrock. Use LLM-based Agents to generate SQL from Text and then validate if query is relevant to data warehouse tables. If yes, run query to extract information. The LangChain library calls Amazon Titan embeddings to generate a vector for the user’s question. It calls OpenSearch vector search to get similar documents.
  5. LangChain calls Anthropic Claude on Amazon Bedrock model with the additional, retrieved knowledge as context, to generate an answer for the question. It returns generated content to client

In this deployment, you will choose Amazon Redshift Serverless, use Anthropic Claude 2.0  model on Amazon Bedrock and Amazon Titan Text Embeddings model. Overall spend for the deployment will be directly proportional to number of input/output tokens for Amazon Bedrock models, Knowledge base volume, usage hours and so on.

To deploy the solution, you need two datasets: SEC Edgar Annual Financial Filings and Stock pricing data. To join these datasets for analysis, you need to choose Stock Symbol as the join key. The provided AWS CloudFormation template deploys the datasets required for this post, along with the SageMaker notebook.

Prerequisites

To follow along with this post, you should have an AWS account with AWS Identity and Access Management (IAM) user credentials to deploy AWS services.

Deploy the chat application using AWS CloudFormation

To deploy the resources, complete the following steps:

  1. Deploy the following CloudFormation template to create your stack in the us-east-1 AWS Region.The stack will deploy an OpenSearch Service domain, Redshift Serverless endpoint, SageMaker notebook, and other services like VPC and IAM roles that you will use in this post. The template sets a default user name password for the OpenSearch Service domain, and sets up a Redshift Serverless admin. You can choose to modify them or use the default values.
  2. On the AWS CloudFormation console, navigate to the stack you created.
  3. On the Outputs tab, choose the URL for SageMakerNotebookURL to open the notebook.
  4. In Jupyter, choose semantic-search-with-amazon-opensearch, thenblog, then the LLM-Based-Agentfolder.
  5. Open the notebook Generative AI with LLM based autonomous agents augmented with structured and unstructured data.ipynb.
  6. Follow the instructions in the notebook and run the code sequentially.

Run the notebook

There are six major sections in the notebook:

  • Prepare the unstructured data in OpenSearch Service – Download the SEC Edgar Annual Financial Filings dataset and convert the company financial filing document into vectors with Amazon Titan Text Embeddings model and store the vector in an Amazon OpenSearch Service vector database.
  • Prepare the structured data in a Redshift database – Ingest the structured data into your Amazon Redshift Serverless table.
  • Query the unstructured data in OpenSearch Service with a vector search – Create a function to implement semantic search with OpenSearch Service. In OpenSearch Service, match the relevant company financial information to be used as context information to LLM. This is unstructured data augmentation to the LLM.
  • Query the structured data in Amazon Redshift with SQLDatabaseChain – Use the LangChain library LLM text to SQL to query company stock information stored in Amazon Redshift. The search result will be used as context information to the LLM.
  • Create an LLM-based ReAct agent augmented with data in OpenSearch Service and Amazon Redshift – Use the LangChain library to define a ReAct agent to judge whether the user query is stock- or investment-related. If the query is stock related, the agent will query the structured data in Amazon Redshift to get the stock symbol and stock price to augment context to the LLM. The agent also uses semantic search to retrieve relevant financial information from OpenSearch Service to augment context to the LLM.
  • Use the LLM-based agent to generate a final response based on the template used for zero-shot training – The following is a sample user flow for a stock price recommendation for the query, “Is ABC a good investment choice right now.”

Example questions and responses

In this section, we show three example questions and responses to test our chatbot.

Example 1: Historical data is available

In our first test, we explore how the bot responds to a question when historical data is available. We use the question, “Is [Company Name] a good investment choice right now?” Replace [Company Name] with a company you want to query.

This is a stock-related question. The company stock information is in Amazon Redshift and the financial statement information is in OpenSearch Service. The agent will run the following process:

  1. Determine if this is a stock-related question.
  2. Get the company name.
  3. Get the stock symbol from Amazon Redshift.
  4. Get the stock price from Amazon Redshift.
  5. Use semantic search to get related information from 10k financial filing data from OpenSearch Service.
response = zero_shot_agent("\n\nHuman: Is {company name} a good investment choice right now? \n\nAssistant:")

The output may look like the following:

Final Answer: Yes, {company name} appears to be a good investment choice right now based on the stable stock price, continued revenue and earnings growth, and dividend payments. I would recommend investing in {company name} stock at current levels.

You can view the final response from the complete chain in your notebook.

Example 2: Historical data is not available

In this next test, we see how the bot responds to a question when historical data is not available. We ask the question, “Is Amazon a good investment choice right now?”

This is a stock-related question. However, there is no Amazon stock price information in the Redshift table. Therefore, the bot will answer “I cannot provide stock analysis without stock price information.” The agent will run the following process:

  1. Determine if this is a stock-related question.
  2. Get the company name.
  3. Get the stock symbol from Amazon Redshift.
  4. Get the stock price from Amazon Redshift.
response = zero_shot_agent("\n\nHuman: Is Amazon a good investment choice right now? \n\nAssistant:")

The output looks like the following:

Final Answer: I cannot provide stock analysis without stock price information.

Example 3: Unrelated question and historical data is not available

For our third test, we see how the bot responds to an irrelevant question when historical data is not available. This is testing for hallucination. We use the question, “What is SageMaker?”

This is not a stock-related query. The agent will run the following process:

  1. Determine if this is a stock-related question.
response = zero_shot_agent("\n\nHuman: What is SageMaker? \n\nAssistant:")

The output looks like the following:

Final Answer: What is SageMaker? is not a stock related query.

This was a simple RAG-based ReAct chat agent analyzing the corpus from different data stores. In a realistic scenario, you might choose to further enhance the response with restrictions or guardrails for input and output like filtering harsh words for robust input sanitization, output filtering, conversational flow control, and more. You may also want to explore the programmable guardrails to LLM-based conversational systems.

Clean up

To clean up your resources, delete the CloudFormation stack llm-based-agent.

Conclusion

In this post, you explored how LLMs play a part in answering user questions. You looked at a scenario for helping financial analysts. You could employ this methodology for other Q&A scenarios, like supporting insurance use cases, by quickly contextualizing claims data or customer interactions. You used a knowledge base of structured and unstructured data in a RAG approach, merging the data to create intelligent chatbots. You also learned how to use autonomous agents to help provide responses that are contextual and relevant to the customer data and limit irrelevant and inaccurate responses.

Leave your feedback and questions in the comments section.

References


About the Authors

Dhaval Shah is a Principal Solutions Architect with Amazon Web Services based out of New York, where he guides global financial services customers to build highly secure, scalable, reliable, and cost-efficient applications on the cloud. He brings over 20 years of technology experience on Software Development and Architecture, Data Engineering, and IT Management.

Soujanya Konka is a Senior Solutions Architect and Analytics specialist at AWS, focused on helping customers build their ideas on cloud. Expertise in design and implementation of Data platforms. Before joining AWS, Soujanya has had stints with companies such as HSBC & Cognizant

Jon Handler is a Senior Principal Solutions Architect at Amazon Web Services based in Palo Alto, CA. Jon works closely with OpenSearch and Amazon OpenSearch Service, providing help and guidance to a broad range of customers who have search and log analytics workloads that they want to move to the AWS Cloud. Prior to joining AWS, Jon’s career as a software developer included 4 years of coding a large-scale, ecommerce search engine. Jon holds a Bachelor of the Arts from the University of Pennsylvania, and a Master of Science and a PhD in Computer Science and Artificial Intelligence from Northwestern University.

Jianwei Li is a Principal Analytics Specialist TAM at Amazon Web Services. Jianwei provides consultant service for customers to help customer design and build modern data platform. Jianwei has been working in big data domain as software developer, consultant and tech leader.

Hrishikesh Karambelkar is a Principal Architect for Data and AIML with AWS Professional Services for Asia Pacific and Japan. He is proactively engaged with customers in APJ region to enable enterprises in their Digital Transformation journey on AWS Cloud in the areas of Generative AI, machine learning and Data, Analytics, Previously, Hrishikesh has authored books on enterprise search, biig data and co-authored research publications in the areas of Enterprise Search and AI-ML.

AWS Weekly Roundup – LlamaIndex support for Amazon Neptune, force AWS CloudFormation stack deletion, and more (May 27, 2024)

Post Syndicated from Antje Barth original https://aws.amazon.com/blogs/aws/aws-weekly-roundup-llamaindex-support-for-amazon-neptune-force-aws-cloudformation-stack-deletion-and-more-may-27-2024/

Last week, Dr. Matt Wood, VP for AI Products at Amazon Web Services (AWS), delivered the keynote at the AWS Summit Los Angeles. Matt and guest speakers shared the latest advancements in generative artificial intelligence (generative AI), developer tooling, and foundational infrastructure, showcasing how they come together to change what’s possible for builders. You can watch the full keynote on YouTube.

AWS Summit LA 2024 keynote

Announcements during the LA Summit included two new Amazon Q courses as part of Amazon’s AI Ready initiative to provide free AI skills training to 2 million people globally by 2025. The courses are part of the Amazon Q learning plan. But that’s not all that happened last week.

Last week’s launches
Here are some launches that got my attention:

LlamaIndex support for Amazon Neptune — You can now build Graph Retrieval Augmented Generation (GraphRAG) applications by combining knowledge graphs stored in Amazon Neptune and LlamaIndex, a popular open source framework for building applications with large language models (LLMs) such as those available in Amazon Bedrock. To learn more, check the LlamaIndex documentation for Amazon Neptune Graph Store.

AWS CloudFormation launches a new parameter called DeletionMode for the DeleteStack API — You can use the AWS CloudFormation DeleteStack API to delete your stacks and stack resources. However, certain stack resources can prevent the DeleteStack API from successfully completing, for example, when you attempt to delete non-empty Amazon Simple Storage Service (Amazon S3) buckets. The DeleteStack API can enter into the DELETE_FAILED state in such scenarios. With this launch, you can now pass FORCE_DELETE_STACK value to the new DeletionMode parameter and delete such stacks. To learn more, check the DeleteStack API documentation.

Mistral Small now available in Amazon Bedrock — The Mistral Small foundation model (FM) from Mistral AI is now generally available in Amazon Bedrock. This a fast-follow to our recent announcements of Mistral 7B and Mixtral 8x7B in March, and Mistral Large in April. Mistral Small, developed by Mistral AI, is a highly efficient large language model (LLM) optimized for high-volume, low-latency language-based tasks. To learn more, check Esra’s post.

New Amazon CloudFront edge location in Cairo, Egypt — The new AWS edge location brings the full suite of benefits provided by Amazon CloudFront, a secure, highly distributed, and scalable content delivery network (CDN) that delivers static and dynamic content, APIs, and live and on-demand video with low latency and high performance. Customers in Egypt can expect up to 30 percent improvement in latency, on average, for data delivered through the new edge location. To learn more about AWS edge locations, visit CloudFront edge locations.

Amazon OpenSearch Service zero-ETL integration with Amazon S3 — This Amazon OpenSearch Service integration offers a new efficient way to query operational logs in Amazon S3 data lakes, eliminating the need to switch between tools to analyze data. You can get started by installing out-of-the-box dashboards for AWS log types such as Amazon VPC Flow Logs, AWS WAF Logs, and Elastic Load Balancing (ELB). To learn more, check out the Amazon OpenSearch Service Integrations page and the Amazon OpenSearch Service Developer Guide.

For a full list of AWS announcements, be sure to keep an eye on the What’s New at AWS page.

Other AWS news
Here are some additional news items and a Twitch show that you might find interesting:

AWS Build On Generative AIBuild On Generative AI — Now streaming every Thursday, 2:00 PM US PT on twitch.tv/aws, my colleagues Tiffany and Mike discuss different aspects of generative AI and invite guest speakers to demo their work. Check out show notes and the full list of episodes on community.aws.

Amazon Bedrock Studio bootstrapper script — We’ve heard your feedback! To everyone who struggled setting up the required AWS Identity and Access Management (IAM) roles and permissions to get started with Amazon Bedrock Studio: You can now use the Bedrock Studio bootstrapper script to automate the creation of the permissions boundary, service role, and provisioning role.

Upcoming AWS events
Check your calendars and sign up for these AWS events:

AWS SummitsAWS Summits — It’s AWS Summit season! Join free online and in-person events that bring the cloud computing community together to connect, collaborate, and learn about AWS. Register in your nearest city: Dubai (May 29), Bangkok (May 30), Stockholm (June 4), Madrid (June 5), and Washington, DC (June 26–27).

AWS re:InforceAWS re:Inforce — Join us for AWS re:Inforce (June 10–12) in Philadelphia, PA. AWS re:Inforce is a learning conference focused on AWS security solutions, cloud security, compliance, and identity. Connect with the AWS teams that build the security tools and meet AWS customers to learn about their security journeys.

AWS Community DaysAWS Community Days — Join community-led conferences that feature technical discussions, workshops, and hands-on labs led by expert AWS users and industry leaders from around the world: Midwest | Columbus (June 13), Sri Lanka (June 27), Cameroon (July 13), New Zealand (August 15), Nigeria (August 24), and New York (August 28).

You can browse all upcoming in-person and virtual events.

That’s all for this week. Check back next Monday for another Weekly Roundup!

— Antje

This post is part of our Weekly Roundup series. Check back each week for a quick roundup of interesting news and announcements from AWS!

Introducing blueprint discovery and other UI enhancements for Amazon OpenSearch Ingestion

Post Syndicated from Jagadish Kumar original https://aws.amazon.com/blogs/big-data/introducing-blueprint-discovery-and-other-ui-enhancements-for-amazon-opensearch-ingestion/

Amazon OpenSearch Ingestion is a fully managed serverless pipeline that allows you to ingest, filter, transform, enrich, and route data to an Amazon OpenSearch Service domain or Amazon OpenSearch Serverless collection. OpenSearch Ingestion is capable of ingesting data from a wide variety of sources and has a rich ecosystem of built-in processors to take care of your most complex data transformation needs.

In this post, we walk you through the new UI enhancements and blueprint discovery features that are now available with OpenSearch Ingestion for a richer user experience.

Blueprint discovery

Rather than create a pipeline definition from scratch, you can use configuration blueprints, which are preconfigured templates for common ingestion scenarios such as trace analytics or Apache logs. Configuration blueprints help you provision pipelines without having to author a configuration from scratch.

With the older interface, you had to scroll through a dropdown menu of all the available blueprints and couldn’t filter the blueprints using search. OpenSearch Ingestion now offers a new UI that enables searching for blueprints using full-text search on the AWS Management Console, helping you discover all the sources that you can ingest data from into OpenSearch Service.

When you create a new pipeline on the OpenSearch Service console, you’re presented with a new catalog page. You now have a top-down view of all the sources and sinks supported by OpenSearch Ingestion. The blueprints are listed as tiles with icons and are grouped together thematically.

The new catalog has a navigation menu that lists the blueprints, grouped by use case or the service that you want to ingest data from. You can either search for a blueprint or choose the shortcuts in the navigation pane. For example, if you’re building a pipeline to perform a zero-ETL integration with Amazon DynamoDB, you can either search for DynamoDB, choose ZeroETL under Use case in the navigation pane, or choose DynamoDB under Service.

Customized getting started guide

After you choose your blueprint, you’re directed to the Pipeline settings and configuration page. The new UI for blueprints offers a customized getting started guide for each source, detailing key steps in setting up a successful end-to-end integration. The guide on this page will change depending on the blueprint you chose for your pipeline.

The pipeline configuration is pre-filled with a template that you can modify with your specific sources, AWS Identity and Access Management (IAM) roles, and sinks. For example, if you chose the Zero-ETL with DocumentDB blueprint, the getting started guide will provide information on setting up a DocumentDB collection, permissions, and destination.

Support for JSON pipeline configuration

As part of the visual overhaul, OpenSearch Ingestion now offers support for specifying the pipeline configuration in JSON format on the console in addition to the existing YAML support. You can now view the configurations in JSON format in addition to the YAML format and edit them in place.

Although YAML is more human readable, it’s whitespace sensitive and is often challenging when copy/pasting from text editors. The JSON pipeline configuration allows you to confidently copy and paste configurations from your text or code editors without having to worry about formatting errors due to inconsistent whitespace. In addition, you can now make any edits in place with JSON configuration and the changes will be propagated to the YAML automatically.

Availability and pricing

These features are available in all commercial AWS Regions where OpenSearch Ingestion is currently available.

Ingestion-OCUs remain at the same price of $0.24 cents per hour. OCUs are billed on an hourly basis with per-minute granularity. You can control the costs OCUs incur by configuring maximum OCUs that a pipeline is allowed to scale.

Conclusion

In this post, we showed you the new blueprint discovery and other UI enhancements for OpenSearch Ingestion. Refer to Amazon OpenSearch Ingestion to learn about other capabilities provided by OpenSearch Ingestion to build scalable pipelines for your OpenSearch data ingestion needs.


About the author

Jagadish Kumar (Jag) is a Senior Specialist Solutions Architect at AWS focused on Amazon OpenSearch Service. He is deeply passionate about Data Architecture and helps customers build analytics solutions at scale on AWS.


Sam Selvan
is a Principal Specialist Solution Architect with Amazon OpenSearch Service.

AVB accelerates search in LINQ with Amazon OpenSearch Service

Post Syndicated from Patrick Duffy original https://aws.amazon.com/blogs/big-data/avb-accelerates-search-in-linq-with-amazon-opensearch-service/

This post is co-written with Mike Russo from AVB Marketing.

AVB Marketing delivers custom digital solutions for their members across a wide range of products. LINQ, AVB’s proprietary product information management system, empowers their appliance, consumer electronics, and furniture retailer members to streamline the management of their product catalog.

A key challenge for AVB’s members is the ability to retrieve, sort, and search through product data, which is crucial for sales activities within their stores. Floor sales use AVB’s Hub, a custom in-store customer relationship management (CRM) product, which relies on LINQ. Initially, searches from Hub queried LINQ’s Microsoft SQL Server database hosted on Amazon Elastic Compute Cloud (Amazon EC2), with search times averaging 3 seconds, leading to reduced adoption and negative feedback.

In this post, we share how AVB reduced their average search time from 3 seconds to 300 milliseconds in LINQ by adopting Amazon OpenSearch Service while processing 14.5 million record updates daily.

Overview of solution

To meet the demands of their users, the LINQ team set a goal to reduce search time response to under 2 seconds while supporting retrieval of over 60 million product data records. Additionally, the team aimed to reduce operational costs, reduce administrative overhead, and scale the solution to meet demand, especially during peak retail periods. Over a 6-month period, the team evaluated multiple architecture options, eventually moving forward with a solution using OpenSearch Service, Amazon EventBridge, AWS Lambda, Amazon Simple Queue Service (Amazon SQS), and AWS Step Functions.

During implementation, the LINQ team worked with OpenSearch Service specialists to optimize the OpenSearch Service cluster configuration to maximize performance and optimize cost of the solution. Following the best practices section of the OpenSearch Service Developer Guide, AVB selected an optimal cluster configuration with three dedicated cluster manager nodes and six data nodes, across three Availability Zones, while keeping shard size between 10–30 GiB.

Updates to the primary LINQ database come from various sources, including partner APIs for manufacturer metadata updates, LINQ’s frontend, and LINQ PowerTools. A Lambda function reads the updates from change data capture (CDC) tables on a schedule, which sends the updated records to a Step Functions workflow. This workflow prepares and indexes the record into OpenSearch Service in JSON format, allowing for individual customizations of the record on a per-customer basis. The LINQ team exposes access to the OpenSearch Service index through a search API hosted on Amazon EC2. The following figure outlines the solution.

Overview of the solution

AVB developed the LINQ Product Data Search solution with the expertise of a diverse team including software engineers and database administrators. Despite their limited experience with AWS, they set a timeline to complete the project in 6 months. AVB had several goals for this new workload, including search APIs to support in-store sales floor associates’ ability to quickly find products based on customer requirements, scalability to support future growth, and real-time analytics to support AVB’s needs around understanding their data.

AVB split this project into three key phases:

  • Research and development
  • Proof of concept
  • Implementation and iteration

Research and development

AVB’s LINQ team received a task to identify the most efficient solution to expedite product searches across AVB’s suite of software products. The team completed a comprehensive evaluation of various technologies and techniques to meet their requirements, including a close examination of various NoSQL databases and caching mechanisms. Following this exploration, AVB selected OpenSearch Service, an open source, distributed search and analytics suite, for use in a proof of concept. AVB chose OpenSearch Service for its powerful search capabilities, including full-text search and complex query support, as well as its ability to integrate seamlessly with other AWS services.

Proof of concept

In the proof of concept phase, the AVB team focused on validating the effectiveness of their chosen technology stack, with a particular emphasis on data loading and synchronization processes. This was essential to achieve real-time data consistency with their primary system of record to provide correct and up-to-date information to floor sales agents. A significant part of this phase involved the innovative process of data flattening, a technique crucial for managing complex product data.

For example, let’s explore a use case of a refrigerator listed in the SQL Server database. This product is linked to several related tables: one for basic details like model number and manufacturer, another for pricing, and another for features such as energy efficiency and capacity. The original database stores elements separately but connected through relational keys. The following figure provides an example data schema of the SQL Server database.

Microsoft SQL Server Schema

To enhance search capabilities in OpenSearch Service, the team merged all these disparate data elements into a single, comprehensive JSON document. This document includes both standard manufacturer details and member-specific customizations, like special pricing or additional features. This results in an optimized record for each product for quick and efficient search in OpenSearch Service. The following figure shows the data schema in OpenSearch Service.

Amazon OpenSearch Service Schema

Transforming relational data into a consolidated, searchable format allowed the LINQ team to ingest the data into OpenSearch Service. In the proof of concept, AVB shifted to updating data by using reference IDs, which are directly linked to the primary IDs of the product records or their relational entities in the SQL database. This approach allows updates to be executed independently and asynchronously. Crucially, it supports non-first in, first out (FIFO) processing models, which are vital in high-scale environments susceptible to data discrepancies like drops or replays. By using reference IDs, the system fetches the most current data for each entity at the time a change occurs, ensuring that the latest data is always used when processed. This method maintains data integrity by preventing outdated data from superseding newer information, thereby keeping the database accurate and current. A noteworthy technique used in the proof of concept was index aliases, allowing for zero downtime re-indexes for adding new fields or fixing bugs. AVB built robust performance monitoring and alerts using Amazon CloudWatch and Splunk, which enabled swift identification of issues.

The proof of concept improved search relevance by flattening relational data, which improved indexing and queryability. This restructuring reduced search response latency to 300 milliseconds, which was well under the 2-second goal set for this proof of concept. With this successful proof of concept demonstrating the effectiveness of the architectural approach, AVB moved on to the next phase of implementation and iteration.

Implementation and iteration

With AVB exceeding their initial goal of reducing search latency to under 2 seconds, the team then adopted an iterative approach to implement the complete solution, with a series of deployments designed to make data available in OpenSearch Service from different business verticals. Each business vertical has records consisting of different attributes, and this incremental approach allowed AVB to bring in and inspect data to make sure the documents in OpenSearch Service are what the team expected. Each deployment focused on specific data categories and included refinements to the indexing process from lessons learned in prior deployments. AVB also places a strong emphasis on cost optimization and security of the solution, and deployed OpenSearch Service into a private VPC to allow strict access control. Access to the new search capabilities is controlled through their Hub product using a middleware service provided by LINQ’s API. AVB uses robust API keys and tokens to provide API security to the new search product. This systematic progression meant that the completed LINQ Product Data Search catalog met AVB’s speed and accuracy requirements.

Conclusion

In this post, you learned how AVB reduced their average search time from 3 seconds to 300 milliseconds in LINQ by adopting OpenSearch Service while processing 14.5 million record updates daily, resulting in a 500% increase in adoption by AVB’s internal teams. Tim Hatfield, AVB Marketing’s VP of Engineering, reflected on the project and stated, “By partnering with AWS, we’ve not only supercharged Hub’s search speeds but also forged a cost-efficient foundation for LINQ’s future, where swift searches translate into reduced operating costs and maintain the competitive edge in retail technology.”

To get started with OpenSearch Service, see Getting started with Amazon OpenSearch Service.


About the Authors

Mike RussoMike Russo is a Director of Software Engineering at AVB Marketing. He leads the software delivery for AVB’s e-commerce and product catalog solutions. Outside work, Mike enjoys spending time with his family and playing basketball.

Patrick Duffy is a Senior Solutions Architect in the at AWS. He is passionate about raising awareness and increasing security of AWS workloads. Outside work, he loves to travel and try new cuisines, and you may match up against him in a game on Magic Arena.