All posts by Hang Zuo

Accelerate your migration to Amazon OpenSearch Service with Reindexing-from-Snapshot

Post Syndicated from Hang Zuo original https://aws.amazon.com/blogs/big-data/accelerate-your-migration-to-amazon-opensearch-service-with-reindexing-from-snapshot/

It is appealing to migrate from self-managed OpenSearch and Elasticsearch clusters in legacy versions to Amazon OpenSearch Service to enjoy the ease of use, native integration with AWS services, and rich features from the open-source environment (OpenSearch is now part of Linux Foundation). However, the data migration process can be daunting, especially when downtime and data consistency are critical concerns for your production workload.

In this post, we will introduce a new mechanism called Reindexing-from-Snapshot (RFS), and explain how it can address your concerns and simplify migrating to OpenSearch.

Key concepts

To understand the value of RFS and how it works, let’s look at a few key concepts in OpenSearch (and the same in Elasticsearch):

  1. OpenSearch index: An OpenSearch index is a logical container that stores and manages a collection of related documents. OpenSearch indices are composed of multiple OpenSearch shards, and each OpenSearch shard contains a single Lucene index.
  2. Lucene index and shard: OpenSearch is built as a distributed system on top of Apache Lucene, an open-source high-performance text search engine library. An OpenSearch index can contain multiple OpenSearch shards, and each OpenSearch shard maps to a single Lucene index. Each Lucene index (and, therefore, each OpenSearch shard) represents a completely independent search and storage capability hosted on a single machine. OpenSearch combines many independent Lucene indices into a single higher-level system to extend the capability of Lucene beyond what a single machine can support. OpenSearch provides resilience by creating and managing replicas of the Lucene indices as well as managing the allocation of data across Lucene indices and combining search results across all Lucene indices.
  3. Snapshots: Snapshots are backups of an OpenSearch cluster’s indexes and state in an off-cluster storage location (snapshot repository) such as Amazon Simple Storage Service (Amazon S3). As a backup strategy, snapshots can be created automatically in OpenSearch, or users can create a snapshot manually for restoring it on to a different domain or for data migration.

For example, when a document is added to the OpenSearch index, the distributed system layer picks a specific shard to host the document, and the document is ingested into that shard’s Lucene index. Operations on that document are then routed to the same shard (though the shard might have replicas). Search operations are performed across the shards in OpenSearch index individually and then a combined result is returned. A snapshot can be created to backup the cluster’s indexes and state, including cluster settings, node information, index settings and shard allocation, so that the snapshot can be used for data migration.

Why RFS?

RFS can transfer data from OpenSearch and Elasticsearch clusters at high throughput without impacting the performance of the source cluster. This is achieved by using the shard-level codependency and snapshots:

  1. Minimized performance impact to source clusters: Instead of retrieving data directly from the source cluster, RFS can use a snapshot of the source cluster for data migration. Documents are parsed from the snapshot and then reindexed to the target cluster, so that performance impact to the source clusters is minimized during migration. This maintains a smooth transition and minimal performance impact to end users, especially for production workloads.
  2. High throughput: Because shards are separate entities, RFS can retrieve, parse, extract and reindex the documents from each shard in parallel, to achieve high data throughput.
  3. Multi-version upgrades: RFS supports migrating data across multiple major versions (for example, from Elasticsearch 6.8 to OpenSearch 2.x), which can be a significant challenge with other data migration approaches. This is because the data indexed into OpenSearch (and Lucene) is only backward compatible for one major version. By incorporating reindexing as the core mechanism of the migration process, RFS can migrate data across multiple versions in one hop and make sure the data is fully updated and readable in the target cluster’s version, so that you don’t need to worry about the hidden technical debt imposed by having previous-version Lucene files in the new OpenSearch cluster.

How RFS works

OpenSearch and Elasticsearch snapshots are a directory tree that contains both data and metadata. Each index has its own sub-directory, and each shard has its own sub-directory under the directory of its parent index. The raw data for a given shard is stored in its corresponding shard sub-directory as a collection of Lucene files, which OpenSearch and Elasticsearch lightly obfuscates. Metadata files exist in the snapshot to provide details about the snapshot as a whole, the source cluster’s global metadata and settings, each index in the snapshot, and each shard in the snapshot.

The following is an example for the structure of an Elasticsearch 7.10 snapshot, along with a breakdown of its contents:

/snapshot/root
├── index-0 <-------------------------------------------- [1]
├── index.latest
├── indices
│   ├── DG4Ys006RDGOkr3_8lfU7Q <------------------------- [2]
│   │   ├── 0 <------------------------------------------ [3]
│   │   │   ├── __iU-NaYifSrGoeo_12o_WaQ <--------------- [4]
│   │   │   ├── __mqHOLQUtToG23W5r2ZWaKA <--------------- [4]
│   │   │   ├── index-gvxJ-ifiRbGfhuZxmVj9Hg 
│   │   │   └── snap-eBHv508cS4aRon3VuqIzWg.dat <-------- [5]
│   │   └── meta-tDcs8Y0BelM_jrnfY7OE.dat <-------------- [6]
│   └── _iayRgRXQaaRNvtfVfRdvg
│       ├── 0
│       │   ├── __DNRvbH6tSxekhRUifs35CA
│       │   ├── __NRek2UuKTKSBOGczcwftng
│       │   ├── index-VvqHYPQaRcuz0T_vy_bMyw
│       │   └── snap-eBHv508cS4aRon3VuqIzWg.dat
│       └── meta-tTcs8Y0BelM_jrnfY7OE.dat
├── meta-eBHv508cS4aRon3VuqIzWg.dat <-------------------- [7]
└── snap-eBHv508cS4aRon3VuqIzWg.dat <-------------------- [8]

The structure includes the following elements:

  1. Repository metadata file: JSON encoded and contains a mapping between the snapshots within the repository and the OpenSearch or Elasticsearch indices and shards stored within it.
  2. Index directory: Contains the data and metadata for a specific OpenSearch or Elasticsearch index.
  3. Shard directory: Contains the data and metadata for a specific shard of an OpenSearch or Elasticsearch index
  4. Lucene Files: Lucene index files, lightly obfuscated by the snapshotting process. Large files from the source file system are split into multiple parts.
  5. Shard metadata file: SMILE encoded and contains details about all the Lucene files in the shard and a mapping between their in-snapshot representation and their original representation on the source machine they were pulled from (including the original file name and other details).
  6. Index metadata file: SMILE encoded and contains things such as the index aliases, settings, mappings, and number of shards.
  7. Global metadata file: SMILE encoded and contains things such as the legacy, index, and component templates.
  8. Snapshot metadata file: SMILE encoded and contains things such as whether the snapshot succeeded, the number of shards, how many shards succeeded, the OpenSearch or Elasticsearch version, and the indices in the snapshot.

RFS works by retrieving a local copy of a shard-level directory, unpacking its contents and de-obfuscating them, reading them as a Lucene index, and extracting the documents within. This is enabled because OpenSearch and Elasticsearch store the original format of documents added to an OpenSearch or Elasticsearch index in Lucene using the _source field; this feature is enabled by default and is what allows the standard _reindex REST API to work (among other things).

The user workflow for performing a document migration with RFS using the Migration Assistant is shown in the following figure:

The workflow is:

  1. The operator shells into the Migration Assistant console
  2. The operator uses the console command line interface (CLI) to initiate a snapshot on their source cluster. The source cluster stores the snapshot in an S3 Bucket.
  3. The operator starts the document migration with RFS using the console CLI. This creates a single RFS Worker, which is a Docker container running in AWS Fargate.
  4. Each RFS worker provisioned pulls down an un-migrated shard from the snapshot bucket and reindexes its documents against the target cluster. Once finished, it proceeds to the next shard until all shards are completed.
  5. The operator monitors the progress of the migration using the console CLI, which reports both the number of shards yet to be migrated and the number that have been completed. The operator can scale the RFS worker fleet up or down to increase or reduce the rate of indexing on the target cluster.
  6. After all shards have been migrated to the target cluster, the operator scales the RFS worker fleet down to zero.

As previously mentioned, the RFS workers operate at the shard-level, so that you can provision one RFS worker for every shard in the snapshot to achieve maximum throughput. If a RFS worker stops unexpectedly in the middle of migrating a shard, another RFS worker will restart its migration from the beginning. The original document identifiers are preserved in the migration process, so that the restarted migration will be able to over-write the failed attempt. RFS workers coordinate amongst themselves using metadata that they store in an index on the target cluster.

How RFS performs

To highlight the performance of RFS, let’s consider the following scenario: you have an Elasticsearch 7.10 source cluster containing 5 TiB (3.9 billion documents) and wants to migrate to OpenSearch 2.15. With RFS, you can perform this migration in approximately 35 minutes, spending approximately $10 in Amazon Elastic Container Service (Amazon ECS) usage to run the RFS workers during the migration.

To demonstrate this capability, we created an Elasticsearch 7.10 source cluster in Amazon OpenSearch Service, with 1,024 shards and 0 replicas. We used AWS Glue to bulk-load sample data into the source cluster with the AWS Public Blockchain Dataset, and repeated the bulk-load process until 5 TiB of data (3.9 billion documents) was stored. We created an OpenSearch 2.15 cluster as the target cluster in Amazon OpenSearch Service, with 15 r7gd.16xlarge data nodes and 3 m7g.large master nodes, and used Sigv4 for authentication. Using the Migration Assistant solution, we created a snapshot of the source cluster, stored it in S3, and performed a metadata migration so that the indices on the source were recreated on the target cluster with the same shard and replica counts. We then ran console backfill start and console backfill scale 200 to begin the RFS migration with 200 workers. RFS indexed data into the target cluster at 2,497 MiB per second. The migration was completed in approximately 35 minutes. We metered approximately $10 in ECS cost for running the RFS workers.

To better highlight the performance, the following figures show metrics from the OpenSearch target cluster during this process (presented below).

In the preceding figures, you can see the cyclical variation in the document index rate and target cluster resource utilization as the 200 RFS workers pick up shards, complete a shard, and then pick up a new shard. At peak RFS indexing, we see the target cluster nodes maxing their CPU and begin queuing writes. The queue is cleared as shards complete and more workers transition to the downloading state. In general, we find that RFS performance is limited by the ability of the target cluster to absorb the traffic it generates. You can tune the RFS worker fleet to match what your target cluster can reliably ingest.

Conclusion

This blog post is designed to be a starting point for teams seeking guidance on how to use Reindexing-from-Snapshot as a straightforward, high throughput, and low-cost solution for data migration from self-managed OpenSearch and Elasticsearch clusters to Amazon OpenSearch Service. RFS is now part of the Migration Assistant solution and available from the AWS Solution Library. To use RFS to migrate to Amazon OpenSearch Service, try the Migration Assistant solution. To experience OpenSearch, try the OpenSearch Playground. To use the managed implementation of OpenSearch in the AWS Cloud, see Getting started with Amazon OpenSearch Service.


About the authors

Hang (Arthur) Zuo is a Senior Product Manager with Amazon OpenSearch Service. Arthur leads the core experience in the next-gen OpenSearch UI and data migration to Amazon OpenSearch Service. Arthur is passionate about cloud technologies and building data products that help users and businesses gain actionable insights and achieve operational excellence.

Chris Helma is a Senior Engineer at Amazon Web Services based in Austin, Texas. He is currently developing tools and techniques to enable users to shift petabyte-scale data workloads into OpenSearch. He has extensive experience building highly-scalable technologies in diverse areas such as search, security analytics, cryptography, and developer productivity. He has functional domain expertise in distributed systems, AI/ML, cloud-native design, and optimizing DevOps workflows. In his free time, he loves to explore specialty coffee and run through the West Austin hills.

Andre Kurait is a Software Development Engineer II at Amazon Web Services, based in Austin, Texas. He is currently working on Migration Assistant for Amazon OpenSearch Service. Prior to joining Amazon OpenSearch, Andre worked within Amazon Health Services. In his free time, Andre enjoys traveling, cooking, and playing in his church sport leagues. Andre holds Bachelor of the Science degrees from the University of Kansas in Computer Science and Mathematics.

Prashant Agrawal is a Sr. Search Specialist Solutions Architect with Amazon OpenSearch Service. He works closely with customers to help them migrate their workloads to the cloud and helps existing customers fine-tune their clusters to achieve better performance and save on cost. Before joining AWS, he helped various customers use OpenSearch and Elasticsearch for their search and log analytics use cases. When not working, you can find him traveling and exploring new places. In short, he likes doing Eat → Travel → Repeat.

Amazon OpenSearch Service launches the next-generation OpenSearch UI

Post Syndicated from Hang Zuo original https://aws.amazon.com/blogs/big-data/amazon-opensearch-service-launches-the-next-generation-opensearch-ui/

Amazon OpenSearch Service launches a modernized operational analytics experience that can provide comprehensive observability spanning multiple data sources, so that you can gain insights from OpenSearch and other integrated data sources in one place. The launch also introduces OpenSearch Workspaces that provides tailored experience for popular use cases and supports access control, so that you can create a private space for your use case and share it only to your collaborators. With the next-generation user interface (UI), the Discover experience has been improved to simplify interactive analysis, so that you can easily utilize features such as natural language query generation to gain insights from your data.

Multiple Data Source: You might have already used OpenSearch Dashboards to provide an operational analytics experience for your OpenSearch clusters. OpenSearch Dashboards is co-located with a cluster, so that each OpenSearch Dashboards can only work with one cluster. And as you scale up your workload across multiple clusters, there is not a unified experience to analyze your data in one place. In comparison, the next-generation OpenSearch UI is designed to work across multiple OpenSearch clusters to aggregate the comprehensive insights in one place. An OpenSearch application is an instance of the next-generation OpenSearch UI. Currently, OpenSearch applications can be associated with multiple OpenSearch clusters (above version 1.3), Amazon OpenSearch Service Serverless collections, and integrated data sources such as Amazon S3. Each OpenSearch cluster can be associated with multiple OpenSearch applications, in addition to its co-located OpenSearch Dashboards that will remain functional.

Workspace: With workspaces, you can easily create your use case specific contents in a private space and manage the permissions in team collaboration. Workspace provide curated experiences for popular use cases such as Observability, Security Analytics and Search, so that you can find it straightforward to build contents for your use case. Workspace supports collaborator management, so that you can share your workspace only to your intended collaborators, and manage the permissions for each collaborator.

Discover: The improved Discover feature now provides a unified log exploration experience that adds the support for SQL and Piped Processing Language (PPL), in addition to the existing support for DQL and Lucene. Discover features a new data selector to support multiple data sources, a new visual design, query autocomplete and natural language query generation for improved usability. With the enhanced Discover interface, you can now analyze data from multiple sources without switching tools, reducing complexity and improving efficiency.

Solution Overview

The following diagram illustrates architecture of the OpenSearch Dashboards.

The following diagram illustrates the next-generation OpenSearch UI architecture.

In the following sections, we discuss the following topics

  1. The process of creating an application
  2. Setting up and using the new Workspaces functionality
  3. The enhanced Discover experience

We’ll demonstrate how these improvements streamline data analysis, foster collaboration, and empower you to extract insights more efficiently across various use cases.

Create an application:

To begin using the next-generation OpenSearch UI, you can first create an application. An application is an instance of the OpenSearch UI (Dashboards), and you have the flexibility to create multiple applications within a single account. To create a new application, complete the following steps:

  1. On the Amazon OpenSearch Service console, choose OpenSearch UI (Dashboards)under Central management in the navigation panel.
  2. Choose Create application.
  3. For application name, enter a descriptive name for your new application.
  4. AWS Identity and Access Management (IAM) is the default authentication mechanism. Optionally, you can select Authentication with IAM Identity Center (IDC), so that you can use credentials and access management from your existing identity providers to manage user access.
  5. For OpenSearch application admins, specify the IAM principals or IDC users that will have permissions to update or delete the application configuration. You will automatically be set as the first admin.

This page lists all the existing applications under your account in the current AWS region. You can create new application from this page.

This page is the create application workflow. You can specify the application name, enable/disable IDC and define application admins to create an application.

After you configured these settings and created an application, your new OpenSearch application will be ready for you to associate data sources and start using the enhanced UI capabilities.

Associating data sources:

After you create your new OpenSearch application, the next step is to associate the relevant data sources. This allows you to connect the application to the necessary OpenSearch domains, collections, and other data sources.

  1. On the application details page, choose Manage data sources.

You will be presented with a list of all the OpenSearch data sources you have access to, including managed domains and serverless collections.

  1. Select the data sources you want to associate with this application.

OpenSearch domains below version v1.3 will not be compatible with the next-generation UI, and will be grayed out in the list. Additionally, if you need to connect to a domain within a virtual private cloud (VPC), you will need to authorize OpenSearch application as a new principal under its security configurations. If you need to connect to a collection within a VPC, you will need to configure its network policy to Private, enable AWS service private access with OpenSearch application.

  1. Choose Save to finalize the data source association.

Your OpenSearch application is now ready to use, with access to the connected data.

Working with the OpenSearch application:

To access your new OpenSearch application, you can either choose the application URL or choose Launch application on the application details page. After you’ve successfully logged in either with IAM or IDC, you’ll be directed to the application’s homepage. From here, you can choose to create a new workspace or navigate to an existing one that you have access to.

Creating a new workspace:

A workspace is a tailored experience for your use case and team collaboration. There are five types of workspaces: Observability, Security Analytics, Search, Essentials, and Analytics. You can click on the info button to learn more about each workspace type. Existing workspaces will be listed on the homepage. To create a new workspace, complete the following steps:

  1. Choose Create workspace.
  2. Enter a name for your workspace.
  3. Optionally, you can select a different color for the workspace icon for easier identification.
  4. Select the type of workspace you want to create: Observability, Security Analytics, Search, Essentials, or Analytics
  5. Add at least one data source for this workspace (from the list of data sources you previously associated with the application).

For this post, we create an Observability workspace named MyWorkspace and associate it with one Amazon OpenSearch Serverless collection and one Amazon OpenSearch Service managed cluster. You can always manage the data sources associated with a workspace, even after it has been created.

Invite Collaborators

After you create your new workspace, you can add users or groups as collaborators. Workspace collaborators are the users you want to invite to work with you in this workspace, and there are three available permission levels for collaborators: admin, read/write, and read-only. Read/write permission allows a collaborator to create, edit and delete the dashboards, visualizations, and saved queries within the workspace, whereas collaborators with read-only access can only view the results. Admin level gives a collaborator the same permissions as you to not only read/write but also update the configurations of the workspace or delete it.

To add collaborators to your workspace, complete the following steps:

  1. Choose Collaborators in the navigation panel.
  2. Choose Add collaborators.
  3. Choose the type of users you want to add as collaborators. You can add collaborators by their IAM Amazon Resource Name (ARN) or IDC username.
  4. Select a permission level for the collaborator from the three options: Read only, Read and write, and Admin

If you do not know the ARN of your intended collaborator, follow the instruction to check for their ARN, for example.

Improved navigation:

The improved navigation in workspaces provides a more contextual and purpose-built interface, ensuring that each workspace includes only the tools and features relevant to its use case. With enhanced clarity and better organization, the new navigation system is tailored to help you find the features you need quickly, improving overall productivity and minimizing time spent searching through menus.

Revamped Discover experience

Discover is now revamped to offer improved usability and efficiency. You can access multiple data sources, natural language query generation, a new data selector, and polished design with optimized data density, allowing you to effortlessly navigate and analyze your data:

  • Unified language selector – Discover now offers a unified language selector, allowing users to choose from SQL, PPL, Dashboards Query Language (DQL), or Lucene, making it convenient to work with your preferred query languages in one place.

  • Natural language query generation – Discover now supports natural language query building for PPL. Enter your questions in plain language, and Discover converts them to PPL syntax, making data exploration simpler and more accessible. This new feature empowers users of different skill levels to get insights without needing to fully understand the PPL syntax.

  • Powerful query autocomplete – The enhanced query bar includes autocomplete functionality and natural language query generation support, simplifying query building by offering relevant suggestions as you type, making it faster and more efficient to write complex queries

  • New data selector– The new data selector makes it straightforward to connect to multiple data sources, bringing data from Amazon OpenSearch Service domains and serverless collections, and Amazon S3 into a unified view.

Conclusion

In this post, we discussed the features of the next-generation OpenSearch UI. These improvements streamline data analytics, foster collaboration, and empower you to extract insights more efficiently across various use cases.

You can create your own OpenSearch UI applications today in the US East (N. Virginia), US West (N. California, Oregon), Asia Pacific (Mumbai, Singapore, Sydney, Tokyo), South America (São Paulo), Europe (Frankfurt, Ireland, London, Paris) and Canada (Central) Regions.


About the Authors

Hang (Arthur) Zuo is a Senior Product Manager with Amazon OpenSearch Service. Arthur leads the core experience in the next-gen OpenSearch UI and data migration to Amazon OpenSearch Service. Arthur is passionate about cloud technologies and building data products that help users and businesses gain actionable insights and achieve operational excellence.

Rushabh Vora is a Principal Product Manager for the OpenSearch project of Amazon Web Services. Rushabh leads core experiences in data exploration, dashboards, visualizations, reporting, and data management to help organizations unlock insights at scale. Rushabh is passionate about cloud technologies and building products that enable businesses to make data-driven decisions and achieve operational excellence.

Sohaib Katariwala is a Senior Specialist Solutions Architect at AWS focused on Amazon OpenSearch Service based out of Chicago, IL. His interests are in all things data and analytics. More specifically he loves to help customers use AI in their data strategy to solve modern day challenges.

Arun Lakshmanan is a Search Specialist with Amazon OpenSearch Service based out of Chicago, IL. He works closely with customers on their OpenSearch journey across various use cases including vector search, observability, and security analytics.

Xenia Tupitsyna is a UX Designer at OpenSearch. She is working on user experiences across security analytics solutions, anomaly detection, alerting, and core dashboards.