Amazon OpenSearch Ingestion 101: Set CloudWatch alarms for key metrics

Post Syndicated from Utkarsh Agarwal original https://aws.amazon.com/blogs/big-data/amazon-opensearch-ingestion-service-101-set-cloudwatch-alarms-for-key-metrics/

Amazon OpenSearch Ingestion is a fully managed, serverless data pipeline that simplifies the process of ingesting data into Amazon OpenSearch Service and OpenSearch Serverless collections. Some key concepts include:

  • Source – Input component that specifies how the pipeline ingests the data. Each pipeline has a single source, which can be either push-based or pull-based.
  • Processors – Intermediate processing units that can filter, transform, and enrich records before delivery.
  • Sink – Output component that specifies the destination(s) to which the pipeline publishes data. It can publish records to one or more destinations.
  • Buffer – The layer between the source and the sink. It serves as temporary storage for events, decoupling the source from the downstream processors and sinks. Amazon OpenSearch Ingestion also offers a persistent buffer option for push-based sources.
  • Dead-letter queues (DLQs) – Configures Amazon Simple Storage Service (Amazon S3) to capture records that fail to write to the sink, enabling error handling and troubleshooting.

This end-to-end data ingestion service can help you collect, process, and deliver data to your OpenSearch environments without the need to manage underlying infrastructure.

This post provides an in-depth look at setting up Amazon CloudWatch alarms for OpenSearch Ingestion pipelines. It goes beyond our recommended alarms to help identify bottlenecks in the pipeline, whether that’s in the sink, the OpenSearch clusters data is being sent to, the processors, or the pipeline not pulling or accepting enough from the source. This post will help you proactively monitor and troubleshoot your OpenSearch Ingestion pipelines.

Overview

Monitoring your OpenSearch Ingestion pipelines is crucial for catching and addressing issues early. By understanding the key metrics and setting up the right alarms, you can proactively manage the health and performance of your data ingestion workflows. In the following sections, we provide details about alarm metrics for different sources, processors, sinks, and buffers. The specific values for the threshold, period, and datapoints to alarm can vary based on the individual use case and requirements.

Prerequisites

To create an OpenSearch Ingestion pipeline, refer to Creating Amazon OpenSearch Ingestion pipelines. For creating CloudWatch alarms, refer to Create a CloudWatch alarm based on a static threshold.

You can enable logging for an OpenSearch Ingestion pipeline, which captures various log messages during pipeline operations and ingestion activity, including errors, warnings, and informational messages. For details on enabling and monitoring pipeline logs, refer to Monitoring pipeline logs.

Sources

The entry point of your pipeline is often where monitoring should begin. By setting appropriate alarms for source components, you can quickly identify ingestion bottlenecks or connection issues. The following table summarizes key alarm metrics for different sources.

Source Alarm Description Recommended Action
HTTP/ OpenTelemetry requestsTooLarge.count
Threshold: >0
Statistic: SUM
Period: 5 minutes
Datapoints to alarm: 1 out of 1
The request payload size of the client (data producer) is greater than the maximum request payload size, resulting in the status code HTTP 413. The default maximum request payload size is 10 MB for HTTP sources and 4 MB for OpenTelemetry sources. The limit for the HTTP sources can be increased for the pipelines with persistent buffer enabled. The chunk size for the client can be reduced so that the request payload doesn’t exceed the maximum size. You can examine the distribution of payload sizes of incoming requests using the payloadSize.sum metric.
HTTP requestsRejected.count
Threshold: >0
Statistic: SUM
Period: 5 minutes
Datapoints to alarm: 1 out of 1
The request was sent to the HTTP endpoint of the OpenSearch Ingestion pipeline by the client (data producer), but the request wasn’t accepted by the pipeline, and it rejected the request with the status code 429 in the response. For persistent issues, consider increasing the minimum OCUs for the pipeline to allocate additional resources for request processing.
Amazon S3 s3ObjectsFailed.count
Threshold: >0
Statistic: SUM
Period: 5 minutes
Datapoints to alarm: 1 out of 1
The pipeline is unable to read some objects from the Amazon S3 source. Refer to REF-003 in Reference Guide below.
Amazon DynamoDB Difference for totalOpenShards.max - activeShardsInProcessing.value
Threshold: >0
Statistic: Maximum (totalOpenShards.max) and Sum (activeShardsInProcessing.value)
Datapoints to alarm: 3 out of 3. Additional note: refer to REF-004 for more details on configuring this specific alarm.
This alarm monitors the alignment between the total open shards that should be processed by the pipeline and the active shards currently in processing. The activeShardsInProcessing.value metric goes down periodically as shards close, but it should never diverge from totalOpenShards.max for longer than a couple of minutes. If the alarm is triggered, consider stopping and starting the pipeline. This resets the pipeline’s state, and the pipeline restarts with a new full export. The restart is non-destructive, so it does not delete your index or any data in DynamoDB. If you don’t create a fresh index before you do this, you might see a high number of version conflict errors, because the export tries to insert documents older than the current _version in the index. You can safely ignore these errors. For root cause analysis of the misalignment, reach out to AWS Support.
Amazon DynamoDB dynamodb.changeEventsProcessingErrors.count
Threshold: >0
Statistic: SUM
Period: 5 minutes
Datapoints to alarm: 1 out of 1
The number of processing errors for change events for a pipeline with stream processing for DynamoDB. If the metric reports increasing values, refer to REF-002 in the Reference Guide below.
Amazon DocumentDB documentdb.exportJobFailure.count
Threshold: >0
Statistic: SUM
Period: 5 minutes
Datapoints to alarm: 1 out of 1
The attempt to trigger an export to Amazon S3 failed. Review ERROR-level logs in the pipeline logs for entries beginning with “Received an exception during export from DocumentDB, backing off and retrying.” These logs contain the complete exception details indicating the root cause of the failure.
Amazon DocumentDB documentdb.changeEventsProcessingErrors.count
Threshold: >0
Statistic: SUM
Period: 5 minutes
Datapoints to alarm: 1 out of 1
The number of processing errors for change events for a pipeline with stream processing for Amazon DocumentDB. Refer to REF-002 in the Reference Guide below.
Kafka kafka.numberOfDeserializationErrors.count
Threshold: >0
Statistic: SUM
Period: 5 minutes
Datapoints to alarm: 1 out of 1
The OpenSearch Ingestion pipeline encountered deserialization errors while consuming a record from Kafka. Review WARN-level logs in the pipeline logs and verify serde_format is configured correctly in the pipeline configuration and the pipeline role has access to the AWS Glue Schema Registry (if used).
OpenSearch opensearch.processingErrors.count
Threshold: >0
Statistic: SUM
Period: 5 minutes
Datapoints to alarm: 1 out of 1
Processing errors were encountered while reading from the index. Ideally, the OpenSearch Ingestion pipeline would retry automatically, but for unknown exceptions, it might skip processing. Refer to REF-001 or REF-002 in Reference Guide below, to get the exception details that resulted in processing errors.
Amazon Kinesis Data Streams kinesis_data_streams.recordProcessingErrors.count
Threshold: >0
Statistic: SUM
Period: 5 minutes
Datapoints to alarm: 1 out of 1
The OpenSearch Ingestion pipeline encountered an error while processing the records. If the metrics report increasing values, refer to REF-002 in Reference Guide below, which can help in identifying the cause.
Amazon Kinesis Data Streams kinesis_data_streams.acknowledgementSetFailures.count
Threshold: >0
Statistic: SUM
Period: 5 minutes
Datapoints to alarm: 1 out of 1
The pipeline encountered a negative acknowledgment while processing the streams, causing it to reprocess the stream. Refer to REF-001 or REF-002 in Reference Guide below.
Confluence confluence.searchRequestsFailed.count
Threshold: >0
Statistic: SUM
Period: 5 minutes
Datapoints to alarm: 1 out of 1
While trying to fetch the content, the pipeline encountered an exception. Review ERROR-level logs in the pipeline logs for entries beginning with “Error while fetching content.” These logs contain the complete exception details indicating the root cause of the failure.
Confluence confluence.authFailures.count
Threshold: >0
Statistic: SUM
Period: 5 minutes
Datapoints to alarm: 1 out of 1
The number of UNAUTHORIZED exceptions received while establishing the connection. Although the service should automatically renew tokens, if the metric shows an increasing value, review ERROR-level logs in the pipeline logs to identify why the token refresh is failing.
Jira jira.ticketRequestsFailed.count
Threshold: >0
Statistic: SUM
Period: 5 minutes
Datapoints to alarm: 1 out of 1
While trying to fetch the issue, the pipeline encountered an exception. Review ERROR-level logs in the pipeline logs for entries beginning with “Error while fetching issue.” These logs contain the complete exception details indicating the root cause of the failure.
Jira jira.authFailures.count
Threshold: >0
Statistic: SUM
Period: 5 minutes
Datapoints to alarm: 1 out of 1
The number of UNAUTHORIZED exceptions received while establishing the connection. Although the service should automatically renew tokens, if the metrics show an increasing value, review ERROR-level logs in the pipeline logs to identify why the token refresh is failing.
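
Most alarms in the preceding table share the same shape (threshold >0, Sum, 5 minutes, 1 out of 1), which maps onto the parameters of the CloudWatch PutMetricAlarm API. The following is a minimal Python sketch, assuming boto3 is installed and credentials are configured. The AWS/OSIS namespace and PipelineName dimension are taken from the metric source JSON in REF-004. The pipeline name, sub-pipeline name, alarm naming scheme, and the choice to treat missing data as not breaching are illustrative assumptions.

```python
from typing import Any, Dict


def build_count_alarm(pipeline_name: str, metric_name: str) -> Dict[str, Any]:
    """Build put_metric_alarm arguments for a '>0, Sum, 5 minutes, 1 out of 1' alarm."""
    return {
        "AlarmName": f"{pipeline_name}-{metric_name}",  # hypothetical naming scheme
        "Namespace": "AWS/OSIS",
        "MetricName": metric_name,
        "Dimensions": [{"Name": "PipelineName", "Value": pipeline_name}],
        "Statistic": "Sum",
        "Period": 300,            # 5 minutes, in seconds
        "EvaluationPeriods": 1,   # "1 out of 1"
        "DatapointsToAlarm": 1,
        "Threshold": 0.0,
        "ComparisonOperator": "GreaterThanThreshold",
        # Assumption: a period with no datapoints means no errors were reported.
        "TreatMissingData": "notBreaching",
    }


def create_alarm(params: Dict[str, Any]) -> None:
    """Create the alarm. Requires boto3 and AWS credentials."""
    import boto3  # imported lazily so build_count_alarm works without boto3

    boto3.client("cloudwatch").put_metric_alarm(**params)


if __name__ == "__main__":
    # Hypothetical pipeline and sub-pipeline names; the metric name must match
    # what your pipeline actually publishes in CloudWatch.
    params = build_count_alarm(
        "my-pipeline",
        "my-sub-pipeline.dynamodb.changeEventsProcessingErrors.count",
    )
    print(params["AlarmName"])
```

Keeping the parameter builder separate from the API call makes it easy to review or generate alarms for several metrics before creating any of them.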

Processors

The following table provides details about alarm metrics for different processors.

Processor Alarm Description Recommended Action
AWS Lambda aws_lambda_processor.recordsFailedToSentLambda.count
Threshold: >0
Statistic: SUM
Period: 5 minutes
Datapoints to alarm: 1 out of 1
Some of the records could not be sent to Lambda. In the case of high values for this metric, refer to REF-002 in Reference Guide below.
AWS Lambda aws_lambda_processor.numberOfRequestsFailed.count
Threshold: >0
Statistic: SUM
Period: 5 minutes
Datapoints to alarm: 1 out of 1
The pipeline was unable to invoke the Lambda function. Although this situation should not occur under normal conditions, if it does, review Lambda logs and refer to REF-002 in Reference Guide below.
AWS Lambda aws_lambda_processor.requestPayloadSize.max
Threshold: >= 6292536
Statistic: MAXIMUM
Period: 5 minutes
Datapoints to alarm: 1 out of 1
The payload size is exceeding the 6 MB limit, so the Lambda function can’t be invoked. Consider revisiting the batching thresholds in the pipeline configuration for the aws_lambda processor.
Grok grok.grokProcessingMismatch.count
Threshold: >0
Statistic: SUM
Period: 5 minutes
Datapoints to alarm: 1 out of 1
The incoming data doesn’t match the Grok pattern defined in the pipeline configuration. In the case of high values for this metric, review the Grok processor configuration and make sure the defined pattern matches the incoming data.
Grok grok.grokProcessingErrors.count
Threshold: >0
Statistic: SUM
Period: 5 minutes
Datapoints to alarm: 1 out of 1
The pipeline encountered an exception when extracting the information from the incoming data according to the defined Grok pattern. In the case of high values for this metric, refer to REF-002 in Reference Guide below.
Grok grok.grokProcessingTime.max
Threshold: >= 1000
Statistic: MAXIMUM
Period: 5 minutes
Datapoints to alarm: 1 out of 1
The maximum amount of time that each individual record takes to match against patterns from the match configuration option. If the time taken is equal to or more than 1 second, check the incoming data and the Grok pattern. The maximum amount of time during which matching occurs is 30,000 milliseconds, which is controlled by the timeout_millis parameter.

Sinks and DLQs

The following table contains details about alarm metrics for different sinks and DLQs.

Sink Alarm Description Recommended Action
OpenSearch opensearch.bulkRequestErrors.count
Threshold: >0
Statistic: SUM
Period: 5 minutes
Datapoints to alarm: 1 out of 1
The number of errors encountered while sending a bulk request. Refer to REF-002 in the Reference Guide below, which can help identify the exception details.
OpenSearch opensearch.bulkRequestFailed.count
Threshold: >0
Statistic: SUM
Period: 5 minutes
Datapoints to alarm: 1 out of 1
The number of errors received after sending the bulk request to the OpenSearch domain. Refer to REF-001 in the Reference Guide below, which can help identify the exception details.
Amazon S3 s3.s3SinkObjectsFailed.count
Threshold: >0
Statistic: SUM
Period: 5 minutes
Datapoints to alarm: 1 out of 1
The OpenSearch Ingestion pipeline encountered a failure while writing the object to Amazon S3. Verify that the pipeline role has the necessary permissions to write objects to the specified S3 key. Review the pipeline logs to identify the specific keys where failures occurred.
Monitor the s3.s3SinkObjectsEventsFailed.count metric for granular details on the number of failed write operations.
Amazon S3 DLQ s3.dlqS3RecordsFailed.count
Threshold: >0
Statistic: SUM
Period: 5 minutes
Datapoints to alarm: 1 out of 1
For a pipeline with DLQ enabled, records are sent either to the sink or, if they can’t be written to the sink, to the DLQ. This alarm indicates the pipeline was unable to send the records to the DLQ due to an error. Refer to REF-002 in the Reference Guide below, which can help identify the exception details.

Buffer

The following table contains details about alarm metrics for buffers.

Buffer Alarm Description Recommended Action
BlockingBuffer BlockingBuffer.bufferUsage.value
Threshold: >80
Statistic: AVERAGE
Period: 5 minutes
Datapoints to alarm: 1 out of 1
The percent usage, based on the number of records in the buffer. To investigate further, check whether the pipeline is bottlenecked by the processors or the sink by comparing the timeElapsed.max metrics and analyzing bulkRequestLatency.max.
Persistent persistentBufferRead.recordsLagMax.value
Threshold: > 5000
Statistic: AVERAGE
Period: 5 minutes
Datapoints to alarm: 1 out of 1
The maximum lag in terms of number of records stored in the persistent buffer. If the value for bufferUsage is low, increase the maximum OCUs. If bufferUsage is also high (>80), investigate whether the pipeline is bottlenecked by the processors or the sink.
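
The recommended action for the persistent buffer alarm depends on two metrics together. The following short Python sketch expresses that decision, using the thresholds from the table as defaults. The function name and return strings are hypothetical, and treating any bufferUsage of 80 or below as "low" is an assumption.

```python
def triage_persistent_buffer(
    records_lag_max: float,
    buffer_usage_pct: float,
    lag_threshold: float = 5000,
    usage_threshold: float = 80,
) -> str:
    """Suggest a next step from recordsLagMax and bufferUsage values."""
    if records_lag_max <= lag_threshold:
        return "ok"
    if buffer_usage_pct > usage_threshold:
        # Lag is high and the buffer is also full: look downstream.
        return "investigate processors or sink"
    # Lag is high but the buffer has room: the pipeline needs more capacity.
    return "increase maximum OCUs"
```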

Reference Guide

The following sections provide guidance for resolving common pipeline issues, along with general reference information.

REF-001: WARN-level Log Review

Review WARN-level logs in the pipeline logs to identify the exception details.

REF-002: ERROR-level Log Review

Review ERROR-level logs in the pipeline logs to identify the exception details.

REF-003: S3 Objects Failed

When troubleshooting increasing s3ObjectsFailed.count values, monitor these specific metrics to narrow down the root cause:

  • s3ObjectsAccessDenied.count – This metric increments when the pipeline encounters Access Denied or Forbidden errors while reading S3 objects. Common causes include:
  • Insufficient permissions in the pipeline role.
  • Restrictive S3 bucket policy not allowing the pipeline role access.
  • For cross-account S3 buckets, incorrectly configured bucket_owners mapping.
  • s3ObjectsNotFound.count – This metric increments when the pipeline receives Not Found errors while attempting to read S3 objects.

For further assistance with the recommended actions, contact AWS Support.

REF-004: Configuring Alarm for difference in totalOpenShards.max and activeShardsInProcessing.value for Amazon DynamoDB source.

  1. Open the CloudWatch console at https://console.aws.amazon.com/cloudwatch/.
  2. In the navigation pane, choose Alarms, All alarms.
  3. Choose Create alarm.
  4. Choose Select Metric.
  5. Select Source.
  6. In Source, enter the following JSON after replacing the <sub-pipeline-name>, <pipeline-name>, and <region> placeholders.
    {
        "metrics": [
            [ { "expression": "m1-e1", "label": "Expression2", "id": "e2", "period": 900 } ],
            [ { "expression": "FLOOR((m2/15)+0.5)", "label": "Expression1", "id": "e1", "visible": false, "period": 900 } ],
            [ "AWS/OSIS", "<sub-pipeline-name>.dynamodb.totalOpenShards.max", "PipelineName", "<pipeline-name>", { "stat": "Maximum", "id": "m1", "visible": false } ],
            [ ".", "<sub-pipeline-name>.dynamodb.activeShardsInProcessing.value", ".", ".", { "stat": "Sum", "id": "m2", "visible": false } ]
        ],
        "view": "timeSeries",
        "stacked": false,
        "period": 900,
        "region": "<region>"
    }
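
To see what the metric math in the preceding JSON computes, the following Python sketch reproduces the arithmetic locally. It assumes that activeShardsInProcessing.value is aggregated with the Sum statistic (as listed in the Sources table) over the 900-second period, with one datapoint per minute, so that dividing by 15 yields a per-minute average. The function name and sample values are hypothetical.

```python
import math
from typing import Sequence


def shard_gap(total_open_shards_max: float, active_per_minute: Sequence[float]) -> int:
    """Mirror the alarm expression: m1 - FLOOR((m2/15)+0.5)."""
    m2 = sum(active_per_minute)           # Sum over the 15-minute period
    rounded_avg = math.floor(m2 / 15 + 0.5)  # per-minute average, rounded to nearest
    return int(total_open_shards_max - rounded_avg)


if __name__ == "__main__":
    # All 8 open shards were active for the whole period: no gap.
    print(shard_gap(8, [8] * 15))             # 0
    # Only 5 shards were active for the last 5 minutes: a gap of 1 appears.
    print(shard_gap(8, [8] * 10 + [5] * 5))   # 1
```

A gap greater than 0 for three consecutive periods corresponds to the "3 out of 3" datapoints to alarm setting for this alarm.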

Let’s review a couple of scenarios based on the preceding metrics.

Scenario 1 – Understand and Lower Pipeline Latency

Latency within a pipeline is made up of three main components:

  • The time it takes to send documents via bulk requests to OpenSearch
  • The time it takes for data to go through the pipeline processors
  • The time that data sits in the pipeline buffer

Bulk requests and processors (the first two items in the previous list) are the root causes of buffer buildup, which leads to latency.

To monitor how much data is being stored in the buffer, monitor the bufferUsage.value metric. The only way to lower latency within the buffer is to optimize the pipeline processors and sink bulk request latency, depending on which of those is the bottleneck.

The bulkRequestLatency metric measures the time taken to execute bulk requests, including retries, and can be used to monitor write performance to the OpenSearch sink. If this metric reports an unusually high value, it indicates that the OpenSearch sink may be overloaded, causing increased processing time. To troubleshoot further, review the bulkRequestNumberOfRetries.count metric to confirm whether the high latency is due to rejections from OpenSearch that are leading to retries, such as throttling (429 errors) or other reasons. If document errors are present, examine the configured DLQ to identify the failed document details. Additionally, the max_retries parameter can be configured in the pipeline configuration to limit the number of retries. However, if the documentErrors metric reports zero, the bulkRequestNumberOfRetries.count is also zero, and the bulkRequestLatency remains high, it is likely an indicator that the OpenSearch sink is overloaded. In this case, review the destination metrics for additional details.

If the bulkRequestLatency metric is low (for example, less than 1.5 seconds) and the bulkRequestNumberOfRetries metric is reported as 0, then the bottleneck is likely within the pipeline processors. To monitor the performance of the processors, review the <processorName>.timeElapsed.avg metric. This metric reports the time taken for the processor to complete processing of a batch of records. For example, if a grok processor is reporting a much higher value than other processors for timeElapsed, it may be due to a slow grok pattern that can be optimized or even replaced with a more performant processor, depending on the use case.
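
The troubleshooting flow in this scenario can be summarized as a small decision function. The following Python sketch is one way to encode it. The function name, the category labels, and the order in which retries and document errors are checked are assumptions, and the 1.5-second cutoff is only the example value mentioned above, not a service limit.

```python
def diagnose_latency(
    bulk_latency_s: float,
    retries: int,
    document_errors: int,
    low_latency_s: float = 1.5,
) -> str:
    """Point to the likely bottleneck from sink metrics observed over a period."""
    if bulk_latency_s < low_latency_s and retries == 0:
        # The sink is keeping up: compare <processorName>.timeElapsed.avg values.
        return "processors"
    if retries > 0:
        # Rejections such as throttling (429) are causing retries.
        return "sink-retries"
    if document_errors > 0:
        # Inspect the configured DLQ for the failed documents.
        return "document-errors"
    # High latency with no retries and no document errors.
    return "sink-overloaded"
```

For example, `diagnose_latency(12.0, 0, 0)` returns `"sink-overloaded"`, which is the case where the destination metrics are the next place to look.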

Scenario 2 – Understanding and Resolving Document Errors to OpenSearch

The documentErrors.count metric tracks the number of documents that failed to be sent by bulk requests. Failures can happen for various reasons, such as mapping conflicts, invalid data formats, or schema mismatches. When this metric reports a non-zero value, it indicates that some documents are being rejected by OpenSearch. To identify the root cause, examine the configured dead-letter queue (DLQ), which captures the failed documents along with error details. The DLQ provides information about why specific documents failed, enabling you to identify patterns such as incorrect field types, missing required fields, or data that exceeds size limits. The following sample DLQ objects illustrate two common issues.

Mapper parsing exception:

{"dlqObjects": [{
        "pluginId": "opensearch",
        "pluginName": "opensearch",
        "pipelineName": "<PipelineName>",
        "failedData": {
            "index": "<IndexName>",
            "indexId": null,
            "status": 400,
            "message": "failed to parse field [<fieldname>] of type [integer] in document with id '<DocumentId>'. Preview of field's value: 'N/A' caused by For input string: \"N/A\"",
            "document": {<OriginalDocument>}
        },
        "timestamp": "…"
    }]}

Here, OpenSearch cannot store the text string “N/A” in a field that is only for numbers, so it rejects the document and stores it in the DLQ.

Limit of total fields exceeded:

{"dlqObjects": [{
        "pluginId": "opensearch",
        "pluginName": "opensearch",
        "pipelineName": "<PipelineName>",
        "failedData": {
            "index": "<IndexName>",
            "indexId": null,
            "status": 400,
            "message": "Limit of total fields [<field limit>] has been exceeded",
            "document": {<OriginalDocument>}
        },
        "timestamp": "…"
    }]}

The index.mapping.total_fields.limit setting is the parameter that controls the maximum number of fields allowed in an index mapping, and exceeding this limit will cause indexing operations to fail. You can check if all those fields are required or leverage various processors provided by OpenSearch Ingestion to transform the data.

Once these issues are identified, you can either correct the source data, adjust the pipeline configuration to transform the data appropriately, or modify the OpenSearch index mapping to accommodate the incoming data format.
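
When many documents land in the DLQ, grouping them by failure type helps spot patterns. The following Python sketch counts DLQ entries per index and error category, relying only on the dlqObjects, failedData, index, and message fields shown in the preceding samples. The category labels and function names are hypothetical, and the sketch assumes you have already downloaded the DLQ files from Amazon S3 and read them as JSON text.

```python
import json
from collections import Counter
from typing import Iterable


def classify(message: str) -> str:
    """Map a DLQ error message to a coarse category."""
    if message.startswith("failed to parse field"):
        return "mapper_parsing"
    if message.startswith("Limit of total fields"):
        return "total_fields_limit"
    return "other"


def summarize_dlq(dlq_json_texts: Iterable[str]) -> Counter:
    """Count failed documents per (index, category) across DLQ files."""
    counts: Counter = Counter()
    for text in dlq_json_texts:
        for obj in json.loads(text).get("dlqObjects", []):
            failed = obj.get("failedData", {})
            key = (failed.get("index"), classify(failed.get("message", "")))
            counts[key] += 1
    return counts
```

Calling `summarize_dlq(texts).most_common()` lists the most frequent failure types first, which suggests whether to fix the source data, the pipeline configuration, or the index mapping.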

Clean up

When setting up alarms for monitoring your OpenSearch Ingestion pipelines, it’s important to be mindful of the potential costs involved. Each alarm you configure will incur charges based on the CloudWatch pricing model.

To avoid unnecessary expenses, we recommend carefully evaluating your alarm requirements and configuring them accordingly. Only set up the alarms that are essential for your use case, and regularly review your alarm configurations to identify and remove unused or redundant alarms.

Conclusion

In this post, we explored the comprehensive monitoring capabilities for OpenSearch Ingestion pipelines through CloudWatch alarms, covering key metrics across various sources, processors, and sinks. Although this post highlights the most critical metrics, there’s more to discover. For a deeper dive, refer to the following resources:

Effective monitoring through CloudWatch alarms is crucial for keeping ingestion pipelines healthy and data flowing optimally.


About the authors

Utkarsh Agarwal

Utkarsh is a Cloud Support Engineer in the Support Engineering team at AWS. He provides guidance and technical assistance to customers, helping them build scalable, highly available, and secure solutions in the AWS Cloud. In his free time, he enjoys watching movies, TV series, and of course cricket! Lately, he is also attempting to master foosball.

Ramesh Chirumamilla

Ramesh is a Technical Manager with Amazon Web Services. In his role, Ramesh works proactively to help craft and execute strategies to drive customers’ adoption and use of AWS services. He uses his experience working with Amazon OpenSearch Service to help customers cost-optimize their OpenSearch domains by helping them right-size and implement best practices.

Taylor Gray

Taylor is a Software Engineer in the Amazon OpenSearch Ingestion team at Amazon Web Services. He has contributed many features within both Data Prepper and OpenSearch Ingestion to enable scalable solutions for customers. In his free time, he enjoys pickleball, reading, and playing Rocket League.

Perform reindexing in Amazon OpenSearch Serverless using Amazon OpenSearch Ingestion

Post Syndicated from Utkarsh Agarwal original https://aws.amazon.com/blogs/big-data/perform-reindexing-in-amazon-opensearch-serverless-using-amazon-opensearch-ingestion/

Amazon OpenSearch Serverless is a serverless deployment option for Amazon OpenSearch Service that makes it straightforward to run search and analytics workloads without managing infrastructure. Customers using OpenSearch Serverless often need to copy documents between two indexes within the same collection or across different collections. This primarily arises from two scenarios:

  • Reindexing – You frequently need to update or modify index mapping due to evolving data needs or schema changes
  • Disaster recovery – Although OpenSearch Serverless data is inherently durable, you may want to copy data across AWS Regions for added redundancy and resiliency

Amazon OpenSearch Ingestion recently introduced a feature supporting OpenSearch as a source. OpenSearch Ingestion, a fully managed, serverless data collector, facilitates real-time ingestion of log, metric, and trace data into OpenSearch Service domains and OpenSearch Serverless collections. You can use this feature to address both scenarios by reading the data from an OpenSearch Serverless collection. This capability lets you copy data between indexes without writing custom code, which streamlines data management tasks.

In this post, we outline the steps to copy data between two indexes in the same OpenSearch Serverless collection using the new OpenSearch source feature of OpenSearch Ingestion. This is particularly useful for reindexing operations where you want to change your data schema. OpenSearch Serverless and OpenSearch Ingestion are both serverless services that enable you to seamlessly handle your data workflows, providing optimal performance and scalability.

Solution overview

The following diagram shows the flow of copying documents from the source index to the destination index using an OpenSearch Ingestion pipeline.

Implementing the solution consists of the following steps:

  1. Create an AWS Identity and Access Management (IAM) role to use as an OpenSearch Ingestion pipeline role.
  2. Update the data access policy attached to the OpenSearch Serverless collection.
  3. Create an OpenSearch Ingestion pipeline that simply copies data from one index to another, or you can even create an index template using the OpenSearch Ingestion pipeline to define explicit mapping, and then copy the data from the source index to the destination index with the defined mapping applied.

Prerequisites

To get started, you must have an active OpenSearch Serverless collection with an index that you want to reindex (copy). Refer to Creating collections to learn more about creating a collection.

When the collection is ready, note the following details:

  • The endpoint of the OpenSearch Serverless collection
  • The name of the index from which the documents need to be copied
  • If the collection is defined as a VPC collection, note down the name of the network policy attached to the collection

You use these details in the ingestion pipeline configuration.

Create an IAM role to use as a pipeline role

An OpenSearch Ingestion pipeline needs certain permissions to pull data from the source and write to its sink. For this walkthrough, both the source and sink are the same, but if the source and sink collections are different, modify the policy accordingly.

Complete the following steps:

  1. Create an IAM policy (opensearch-ingestion-pipeline-policy) that provides permission to read and send data to the OpenSearch Serverless collection. The following is a sample policy with least privileges (modify {account-id}, {region}, {collection-id} and {collection-name} accordingly):
    {
        "Version": "2012-10-17",
        "Statement": [{
                "Action": [
                    "aoss:BatchGetCollection",
                    "aoss:APIAccessAll"
                ],
                "Effect": "Allow",
                "Resource": "arn:aws:aoss:{region}:{account-id}:collection/{collection-id}"
            },
            {
                "Action": [
                    "aoss:CreateSecurityPolicy",
                    "aoss:GetSecurityPolicy",
                    "aoss:UpdateSecurityPolicy"
                ],
                "Effect": "Allow",
                "Resource": "*",
                "Condition": {
                    "StringEquals": {
                        "aoss:collection": "{collection-name}"
                    }
                }
            }
        ]
    }

  2. Create an IAM role (opensearch-ingestion-pipeline-role) that the OpenSearch Ingestion pipeline will assume. While creating the role, use the policy you created (opensearch-ingestion-pipeline-policy). The role should have the following trust relationship (modify {account-id} and {region} accordingly):
    {
        "Version": "2012-10-17",
        "Statement": [{
            "Effect": "Allow",
            "Principal": {
                "Service": "osis-pipelines.amazonaws.com"
            },
            "Action": "sts:AssumeRole",
            "Condition": {
                "StringEquals": {
                    "aws:SourceAccount": "{account-id}"
                },
                "ArnLike": {
                    "aws:SourceArn": "arn:aws:osis:{region}:{account-id}:pipeline/*"
                }
            }
        }]
    }

  3. Record the ARN of the newly created IAM role (arn:aws:iam::111122223333:role/opensearch-ingestion-pipeline-role).
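
If you prefer to script these steps, the following Python sketch builds the same two policy documents and creates the policy and role with boto3. It assumes boto3 is installed and the caller has IAM permissions. The function names are hypothetical, and the policy and role names are the ones used in this walkthrough.

```python
import json
from typing import Any, Dict


def permissions_policy(region: str, account_id: str,
                       collection_id: str, collection_name: str) -> Dict[str, Any]:
    """Permissions policy from step 1 with the placeholders filled in."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Action": ["aoss:BatchGetCollection", "aoss:APIAccessAll"],
                "Effect": "Allow",
                "Resource": f"arn:aws:aoss:{region}:{account_id}:collection/{collection_id}",
            },
            {
                "Action": [
                    "aoss:CreateSecurityPolicy",
                    "aoss:GetSecurityPolicy",
                    "aoss:UpdateSecurityPolicy",
                ],
                "Effect": "Allow",
                "Resource": "*",
                "Condition": {"StringEquals": {"aoss:collection": collection_name}},
            },
        ],
    }


def trust_policy(region: str, account_id: str) -> Dict[str, Any]:
    """Trust relationship from step 2 with the placeholders filled in."""
    return {
        "Version": "2012-10-17",
        "Statement": [{
            "Effect": "Allow",
            "Principal": {"Service": "osis-pipelines.amazonaws.com"},
            "Action": "sts:AssumeRole",
            "Condition": {
                "StringEquals": {"aws:SourceAccount": account_id},
                "ArnLike": {"aws:SourceArn": f"arn:aws:osis:{region}:{account_id}:pipeline/*"},
            },
        }],
    }


def create_pipeline_role(region: str, account_id: str,
                         collection_id: str, collection_name: str) -> str:
    """Create the policy and role, attach them, and return the role ARN."""
    import boto3  # imported lazily so the builders above work without boto3

    iam = boto3.client("iam")
    policy = iam.create_policy(
        PolicyName="opensearch-ingestion-pipeline-policy",
        PolicyDocument=json.dumps(
            permissions_policy(region, account_id, collection_id, collection_name)),
    )
    role = iam.create_role(
        RoleName="opensearch-ingestion-pipeline-role",
        AssumeRolePolicyDocument=json.dumps(trust_policy(region, account_id)),
    )
    iam.attach_role_policy(
        RoleName="opensearch-ingestion-pipeline-role",
        PolicyArn=policy["Policy"]["Arn"],
    )
    return role["Role"]["Arn"]
```

The returned ARN is the value to record in step 3.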

Update the data access policy attached to the OpenSearch Serverless collection

After you create the IAM role, you need to update the data access policy attached to the OpenSearch Serverless collection. Data access policies control access to the OpenSearch operations that OpenSearch Serverless supports, such as PUT <index> or GET _cat/indices. To perform the update, complete the following steps:

  1. On the OpenSearch Service console, under Serverless in the navigation pane, choose Collections.
  2. From the list of the collections, choose your OpenSearch Serverless collection.
  3. On the Overview tab, in the Data access section, choose the associated policy.
  4. Choose Edit.
  5. Edit the policy in the JSON editor to add the following JSON rule block in the existing JSON (modify {account-id} and {collection-name} accordingly):
    {
        "Rules": [{
            "Resource": [
                "index/{collection-name}/*"
            ],
            "Permission": [
                "aoss:CreateIndex",
                "aoss:UpdateIndex",
                "aoss:DescribeIndex",
                "aoss:ReadDocument",
                "aoss:WriteDocument"
            ],
            "ResourceType": "index"
        }],
        "Principal": [
            "arn:aws:iam::{account-id}:role/opensearch-ingestion-pipeline-role"
        ],
        "Description": "Provide access to OpenSearch Ingestion Pipeline Role"
    }

You can also use the Visual Editor method to choose Add another rule and add the preceding permissions for arn:aws:iam::{account-id}:role/opensearch-ingestion-pipeline-role.

  6. Choose Save.

Now you have successfully allowed the OpenSearch Ingestion role to perform OpenSearch operations against the OpenSearch Serverless collection.
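
The JSON edit in step 5 amounts to appending one rule block to the existing policy document. The following Python sketch shows that merge on plain data, assuming the existing data access policy is a JSON array of rule blocks like the one shown above. The function names are hypothetical, and the sketch does not call any AWS API; you would still paste or upload the resulting JSON yourself.

```python
import json
from typing import Any, Dict, List


def pipeline_rule_block(account_id: str, collection_name: str) -> Dict[str, Any]:
    """Rule block from step 5 with the placeholders filled in."""
    return {
        "Rules": [{
            "Resource": [f"index/{collection_name}/*"],
            "Permission": [
                "aoss:CreateIndex",
                "aoss:UpdateIndex",
                "aoss:DescribeIndex",
                "aoss:ReadDocument",
                "aoss:WriteDocument",
            ],
            "ResourceType": "index",
        }],
        "Principal": [
            f"arn:aws:iam::{account_id}:role/opensearch-ingestion-pipeline-role"
        ],
        "Description": "Provide access to OpenSearch Ingestion Pipeline Role",
    }


def add_rule_block(existing_policy: List[Dict[str, Any]],
                   block: Dict[str, Any]) -> List[Dict[str, Any]]:
    """Return a new policy with the block appended, skipping exact duplicates."""
    if block in existing_policy:
        return list(existing_policy)
    return [*existing_policy, block]


if __name__ == "__main__":
    # Hypothetical existing policy with one unrelated rule block.
    existing = [{"Rules": [], "Principal": ["arn:aws:iam::111122223333:user/admin"]}]
    merged = add_rule_block(existing, pipeline_rule_block("111122223333", "my-collection"))
    print(json.dumps(merged, indent=4))
```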

Create and configure the OpenSearch Ingestion pipeline to copy the data from one index to another

Complete the following steps:

  1. On the OpenSearch Service console, choose Pipelines under Ingestion in the navigation pane.
  2. Choose Create a pipeline.
  3. In Choose Blueprint, select OpenSearchDataMigrationPipeline.
  4. For Pipeline name, enter a name (for example, sample-ingestion-pipeline).
  5. For Pipeline capacity, you can define the minimum and maximum capacity to scale up the resources. For this walkthrough, you can use the default value of 2 Ingestion OCUs for Min capacity and 4 Ingestion OCUs for Max capacity. However, you can even choose different values as OpenSearch Ingestion automatically scales your pipeline capacity according to your estimated workload, based on the minimum and maximum Ingestion OpenSearch Compute Units (Ingestion OCUs) that you specify.
  6. Update the following information for the source:
    1. Uncomment hosts and specify the endpoint of the existing OpenSearch Serverless collection that was copied as part of prerequisites.
    2. Uncomment include and index_name_regex, and specify the name of the index that will act as the source (in this demo, we’re using logs-2024.03.01).
    3. Uncomment region under aws and specify the AWS Region where your OpenSearch Serverless collection is (for example, us-east-1).
    4. Uncomment sts_role_arn under aws and specify the role that has permission to read data from the OpenSearch Serverless collection (for example, arn:aws:iam::111122223333:role/opensearch-ingestion-pipeline-role). This is the same role that was added in the data access policy of the collection.
    5. Update the serverless flag to true.
    6. If the OpenSearch Serverless collection has VPC access, uncomment serverless_options and network_policy_name and specify the name of the network policy used for the collection.
    7. Uncomment scheduling, interval, index_read_count, and start_time and modify these parameters accordingly.
      Using these parameters makes sure the OpenSearch Ingestion pipeline processes the indexes multiple times (to pick up new documents).
      Note – If the collection specified in the sink is of the Time series or Vector search type, you can keep the scheduling, interval, index_read_count, and start_time parameters commented.
  7. Update the following information for the sink:
    1. Uncomment hosts and specify the endpoint of the existing OpenSearch Serverless collection.
    2. Uncomment sts_role_arn under aws and specify the role that has permission to write data into the OpenSearch Serverless collection (for example, arn:aws:iam::111122223333:role/opensearch-ingestion-pipeline-role). This is the same role that was added in the data access policy of the collection.
    3. Update the serverless flag to true.
    4. If the OpenSearch Serverless collection has VPC access, uncomment serverless_options and network_policy_name and specify the name of the network policy used for the collection.
    5. Update the value for index and provide the index name to which you want to transfer the documents (for example, new-logs-2024.03.01).
    6. For document_id, you can get the ID from the document metadata in the source and use the same in the target.
      However, it is important to note that custom document IDs are only supported for the Search type of collection. If your collection is of the Time Series or Vector Search type, you should comment out the document_id line.
    7. (Optional) The values for bucket, region and sts_role_arn keys within the dlq section can be modified to capture any failed requests in an S3 bucket.
      Note – If you configure a DLQ, you need to grant additional permissions to opensearch-ingestion-pipeline-role. Refer to Writing to a dead-letter queue for the required changes.
      For this walkthrough, you will not set up a DLQ. You can remove the entire dlq block.
  8. Choose Validate pipeline to validate the pipeline configuration.
  9. For Network settings, choose your preferred setting:
    1. Choose VPC access and select your VPC, subnet, and security group to set up the access privately. Choose this option if the OpenSearch Serverless collection has VPC access. AWS recommends using a VPC endpoint for all production workloads.
    2. Choose Public to use public access. For this walkthrough, we select Public because the collection is also accessible from the public network.
  10. For Log Publishing Option, you can either create a new Amazon CloudWatch group or use an existing CloudWatch group to write the ingestion logs. This provides access to information about errors and warnings raised during the operation, which can help during troubleshooting. For this walkthrough, choose Create new group.
  11. Choose Next, and verify the details you specified for your pipeline settings.
  12. Choose Create pipeline.

It will take a couple of minutes to create the ingestion pipeline. After the pipeline is created, you will see the documents in the destination index, specified in the sink (for example, new-logs-2024.03.01). After all the documents are copied, you can validate the number of documents by using the count API.
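
As a rough illustration of the count check, the following Python sketch builds the count API URL (<endpoint>/<index>/_count) and compares the count field of two responses. The helper names and the sample endpoint are assumptions for illustration. The sketch does not send any request: calls to an OpenSearch Serverless collection must be signed with AWS Signature Version 4, so use your preferred signed HTTP client to fetch the response bodies.

```python
import json


def count_url(collection_endpoint: str, index_name: str) -> str:
    """Build the count API URL for an index: <endpoint>/<index>/_count."""
    return f"{collection_endpoint.rstrip('/')}/{index_name}/_count"


def parse_count(response_body: str) -> int:
    """Extract the document count from a count API JSON response body."""
    return int(json.loads(response_body)["count"])


def counts_match(source_body: str, destination_body: str) -> bool:
    """Return True when the source and destination indexes report the same count."""
    return parse_count(source_body) == parse_count(destination_body)


if __name__ == "__main__":
    # Hypothetical endpoint; the index names are the ones used in this walkthrough.
    endpoint = "https://example-id.us-east-1.aoss.amazonaws.com"
    print(count_url(endpoint, "logs-2024.03.01"))
    print(count_url(endpoint, "new-logs-2024.03.01"))

    # Illustrative response bodies, not real output from a collection.
    source = '{"count": 1200, "_shards": {"total": 1, "successful": 1}}'
    destination = '{"count": 1200, "_shards": {"total": 1, "successful": 1}}'
    print("Counts match:", counts_match(source, destination))
```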

When the process is complete, you have the option to stop or delete the pipeline. If you choose to keep the pipeline running, it will continue to copy new documents from the source index according to the defined schedule, if specified.

In this walkthrough, the endpoint defined in the hosts parameter under both source and sink of the pipeline configuration belongs to the same collection, which is of the Search type. If the source and sink collections are different, you need to modify the permissions for the IAM role (opensearch-ingestion-pipeline-role) to allow access to both collections. Additionally, make sure you update the data access policy for both collections to grant access to the OpenSearch Ingestion pipeline.

Create an index template using the OpenSearch Ingestion pipeline to define mapping

In OpenSearch, you can define how documents and their fields are stored and indexed by creating a mapping. The mapping specifies the list of fields for a document. Every field in the document has a field type, which defines the type of data the field contains. OpenSearch Service dynamically maps data types in each incoming document if an explicit mapping is not defined. However, you can define explicit mapping rules in the pipeline configuration by setting the template_type parameter to index-template and providing the JSON content of the index template in template_content. You also need to set the index_type parameter to custom.

The following code shows an example of the sink portion of the pipeline and the usage of index_type, template_type, and template_content:

sink:
    - opensearch:
        # Provide an AWS OpenSearch Service domain endpoint
        hosts: [ "<<OpenSearch-Serverless-Collection-Endpoint>>" ]
        aws:
          # Provide a Role ARN with access to the domain. This role should have a trust relationship with osis-pipelines.amazonaws.com
          sts_role_arn: "arn:aws:iam::111122223333:role/opensearch-ingestion-pipeline-role"
          # Provide the region of the domain.
          region: "us-east-1"
          # Enable the 'serverless' flag if the sink is an Amazon OpenSearch Serverless collection
          serverless: true
          # serverless_options:
            # Specify a name here to create or update network policy for the serverless collection
            # network_policy_name: "network-policy-name"
        # This will make it so each document in the source cluster will be written to the same index in the destination cluster
        index: "new-logs-2024.03.01"
        index_type: custom
        template_type: index-template
        template_content: >
          {
            "template" : {
              "mappings" : {
                "properties" : {
                  "Data" : {
                    "type" : "text"
                  },
                  "EncodedColors" : {
                    "type" : "binary"
                  },
                  "Type" : {
                    "type" : "keyword"
                  },
                  "LargeDouble" : {
                    "type" : "double"
                  }          
                }
              }
            }
          }
        # This will make it so each document in the source cluster will be written with the same document_id in the destination cluster
        document_id: "${getMetadata(\"opensearch-document_id\")}"
        # Enable the 'distribution_version' setting if the AWS OpenSearch Service domain is of version Elasticsearch 6.x
        # distribution_version: "es6"
        # Enable and switch the 'enable_request_compression' flag if the default compression setting is changed in the domain. See https://docs.aws.amazon.com/opensearch-service/latest/developerguide/gzip.html
        # enable_request_compression: true/false
        # Enable the S3 DLQ to capture any failed requests in an S3 bucket
        # dlq:
          # s3:
            # Provide an S3 bucket
            # bucket: "<<your-dlq-bucket-name>>"
            # Provide a key path prefix for the failed requests
            # key_path_prefix: "<<logs/dlq>>"
            # Provide the region of the bucket.
            # region: "<<us-east-1>>"
            # Provide a Role ARN with access to the bucket. This role should have a trust relationship with osis-pipelines.amazonaws.com
            # sts_role_arn: "<<arn:aws:iam::111122223333:role/opensearch-ingestion-pipeline-role>>"

Alternatively, you can create the index with its mapping in the collection before you start the pipeline.
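
If you maintain many field mappings, writing the template_content JSON by hand can be error prone. The following is a minimal Python sketch, assuming a simple field-name-to-type dictionary, that produces JSON in the same shape as the template_content value in the preceding sink example. The function name is hypothetical, and the sketch only covers flat fields with a type; it does not handle nested objects or additional mapping parameters.

```python
import json


def build_template_content(field_types: dict) -> str:
    """Return index-template JSON with one explicit mapping per field."""
    properties = {name: {"type": field_type} for name, field_type in field_types.items()}
    template = {"template": {"mappings": {"properties": properties}}}
    return json.dumps(template, indent=2)


if __name__ == "__main__":
    # Field names and types taken from the sink example above.
    fields = {
        "Data": "text",
        "EncodedColors": "binary",
        "Type": "keyword",
        "LargeDouble": "double",
    }
    print(build_template_content(fields))
```

You can then paste the generated JSON under template_content, indented to match the surrounding YAML.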

If you want to create a template using an OpenSearch Ingestion pipeline, you need to provide the aoss:UpdateCollectionItems and aoss:DescribeCollectionItems permissions for the collection in the data access policy for the pipeline role (opensearch-ingestion-pipeline-role). The updated JSON block for the rule would look like the following:

{
    "Rules": [
      {
        "Resource": [
          "collection/{collection-name}"
        ],
        "Permission": [
          "aoss:UpdateCollectionItems",
          "aoss:DescribeCollectionItems"
        ],
        "ResourceType": "collection"
      },
      {
        "Resource": [
          "index/{collection-name}/*"
        ],
        "Permission": [
          "aoss:CreateIndex",
          "aoss:UpdateIndex",
          "aoss:DescribeIndex",
          "aoss:ReadDocument",
          "aoss:WriteDocument"
        ],
        "ResourceType": "index"
      }
    ],
    "Principal": [
      "arn:aws:iam::{account-id}:role/opensearch-ingestion-pipeline-role"
    ],
    "Description": "Provide access to OpenSearch Ingestion Pipeline Role"
}
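
Before you start a pipeline that creates an index template, it can help to confirm that a rule block grants both sets of permissions. The following Python sketch is one possible local check; the function name is an assumption, and the check compares permission names literally, so it does not account for wildcard permissions or for permissions granted by other rule blocks in the same policy.

```python
# Permissions this walkthrough grants to the pipeline role, by resource type.
REQUIRED_PERMISSIONS = {
    "collection": {
        "aoss:UpdateCollectionItems",
        "aoss:DescribeCollectionItems",
    },
    "index": {
        "aoss:CreateIndex",
        "aoss:UpdateIndex",
        "aoss:DescribeIndex",
        "aoss:ReadDocument",
        "aoss:WriteDocument",
    },
}


def missing_permissions(rule_block: dict) -> dict:
    """Return the required permissions that the rule block does not list."""
    granted = {}
    for rule in rule_block.get("Rules", []):
        resource_type = rule.get("ResourceType")
        granted.setdefault(resource_type, set()).update(rule.get("Permission", []))

    missing = {}
    for resource_type, needed in REQUIRED_PERMISSIONS.items():
        gap = needed - granted.get(resource_type, set())
        if gap:
            missing[resource_type] = sorted(gap)
    return missing


if __name__ == "__main__":
    # A rule block with only the index rule, as in the earlier policy.
    index_only = {
        "Rules": [
            {
                "Resource": ["index/my-collection/*"],
                "Permission": sorted(REQUIRED_PERMISSIONS["index"]),
                "ResourceType": "index",
            }
        ]
    }
    print(missing_permissions(index_only))
```

An empty dictionary means the rule block lists every permission used in this walkthrough.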

Conclusion

In this post, we showed how to use an OpenSearch Ingestion pipeline to copy data from one index to another in an OpenSearch Serverless collection. OpenSearch Ingestion also allows you to transform data using various processors. AWS offers various resources for you to quickly start building pipelines using OpenSearch Ingestion. You can use various built-in pipeline integrations to quickly ingest data from Amazon DynamoDB, Amazon Managed Streaming for Apache Kafka (Amazon MSK), Amazon Security Lake, Fluent Bit, and many more. You can use OpenSearch Ingestion blueprints to build data pipelines with minimal configuration changes.


About the Authors

Utkarsh Agarwal is a Cloud Support Engineer in the Support Engineering team at Amazon Web Services. He specializes in Amazon OpenSearch Service. He provides guidance and technical assistance to customers thus enabling them to build scalable, highly available, and secure solutions in the AWS Cloud. In his free time, he enjoys watching movies, TV series, and of course, cricket. Lately, he has also been attempting to master the art of cooking in his free time – the taste buds are excited, but the kitchen might disagree.

Prashant Agrawal is a Sr. Search Specialist Solutions Architect with Amazon OpenSearch Service. He works closely with customers to help them migrate their workloads to the cloud and helps existing customers fine-tune their clusters to achieve better performance and save on cost. Before joining AWS, he helped various customers use OpenSearch and Elasticsearch for their search and log analytics use cases. When not working, you can find him traveling and exploring new places. In short, he likes doing Eat → Travel → Repeat.

Configure SAML federation for Amazon OpenSearch Serverless with AWS IAM Identity Center

Post Syndicated from Utkarsh Agarwal original https://aws.amazon.com/blogs/big-data/configure-saml-federation-for-amazon-opensearch-serverless-with-aws-iam-identity-center/

Amazon OpenSearch Serverless is a serverless option of Amazon OpenSearch Service that makes it easy for you to run large-scale search and analytics workloads without having to configure, manage, or scale OpenSearch clusters. It automatically provisions and scales the underlying resources to deliver fast data ingestion and query responses for even the most demanding and unpredictable workloads. With OpenSearch Serverless, you can configure SAML to enable users to access data through OpenSearch Dashboards using an external SAML identity provider (IdP).

AWS IAM Identity Center (successor to AWS Single Sign-On) helps you securely create or connect your workforce identities and manage their access centrally across AWS accounts and applications, OpenSearch Dashboards being one of them.

In this post, we show you how to configure SAML authentication for OpenSearch Dashboards using IAM Identity Center as its IdP.

Solution overview

The following diagram illustrates how the solution allows users or groups to authenticate into OpenSearch Dashboards using single sign-on (SSO) with IAM Identity Center using its built-in directory as the identity source.

The workflow steps are as follows:

  1. A user accesses the OpenSearch Dashboards URL in their browser and chooses the SAML provider.
  2. OpenSearch Serverless redirects the login to the specified IdP.
  3. The IdP provides a login form for the user to specify the credentials for authentication.
  4. After the user is authenticated successfully, a SAML assertion is sent back to OpenSearch Serverless.
  5. OpenSearch Serverless validates the SAML assertion, and the user logs in to OpenSearch Dashboards.

Prerequisites

To get started, you must have an active OpenSearch Serverless collection. Refer to Creating and managing Amazon OpenSearch Serverless collections to learn more about creating a collection. Furthermore, you must have the correct AWS Identity and Access Management (IAM) permissions for configuring SAML authentication along with relevant IAM permissions for configuring the data access policy.

IAM Identity Center should be enabled, and you should have the relevant IAM permissions to create an application in IAM Identity Center and create and manage users and groups.

Create and configure the application in IAM Identity Center

To set up your application in IAM Identity Center, complete the following steps:

  1. On the IAM Identity Center dashboard, choose Applications in the navigation pane.
  2. Choose Add application.
  3. For Custom application, select Add custom SAML 2.0 application.
  4. Choose Next.
  5. Under Configure application, enter a name and description for the application.
  6. Under IAM Identity Center metadata, choose Download under IAM Identity Center SAML metadata file.

We use this metadata file to create a SAML provider under OpenSearch Serverless. It contains the public certificate used to verify the signature of the IAM Identity Center SAML assertions.

  7. Under Application properties, leave Application start URL and Relay state blank.
  8. For Session duration, choose 1 hour (the default value).

Note that the session duration you configure in this step takes precedence over the OpenSearch Dashboards timeout setting specified in the configuration of the SAML provider details on the OpenSearch Serverless end.

  9. Under Application metadata, select Manually type your metadata values.
  10. For Application ACS URL, enter your URL using the format https://collection.<REGION>.aoss.amazonaws.com/_saml/acs. For example, we enter https://collection.us-east-1.aoss.amazonaws.com/_saml/acs for this post.
  11. For Application SAML audience, enter your service provider in the format aws:opensearch:<aws account id>.
  12. Choose Submit.
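
The two metadata values follow fixed formats, so they can be derived from your Region and account ID. The following Python sketch shows one way to do that; the function names and the 12-digit account ID check are illustrative assumptions, and 111122223333 is a placeholder account ID.

```python
def acs_url(region: str) -> str:
    """Application ACS URL in the format used in this post."""
    return f"https://collection.{region}.aoss.amazonaws.com/_saml/acs"


def saml_audience(account_id: str) -> str:
    """Application SAML audience in the format aws:opensearch:<aws account id>."""
    # Assumption: guard against typos by requiring a 12-digit account ID.
    if not (account_id.isdigit() and len(account_id) == 12):
        raise ValueError("account_id must be a 12-digit AWS account ID")
    return f"aws:opensearch:{account_id}"


if __name__ == "__main__":
    print(acs_url("us-east-1"))
    print(saml_audience("111122223333"))
```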

Now you modify the attribute settings. The attribute mappings you configure here become part of the SAML assertion that is sent to the application.

  13. On the Actions menu, choose Edit attribute mappings.
  14. Configure Subject to map to ${user:email}, with the format unspecified.

Using ${user:email} here ensures that the email address for the user in IAM Identity Center is passed in the <NameID> element of the SAML response.

  15. Choose Save changes.

Now we assign a user to the application.

  16. Create a user in IAM Identity Center to use to log in to OpenSearch Dashboards.

Alternatively, you can use an existing user.

  17. On the IAM Identity Center console, navigate to your application, choose Assign Users, and select the users you want to assign.

You have now created a custom SAML application. Next, you will configure the SAML provider in OpenSearch Serverless.

Create a SAML provider

The SAML provider you create in this step can be assigned to any collection in the same Region. Complete the following steps:

  1. On the OpenSearch Service console, under Serverless in the navigation pane, choose SAML authentication under Security.
  2. Choose Create SAML provider.
  3. Enter a name and description for your SAML provider.
  4. Enter the metadata from your IdP that you downloaded earlier.
  5. Under Additional settings, you can optionally add custom user ID and group attributes. We leave these settings blank for now.
  6. Choose Create a SAML provider.
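
Before you paste the IdP metadata in step 4, you might want to confirm that the downloaded file is well-formed SAML metadata and contains a signing certificate. The following Python sketch uses only the standard library and the standard SAML 2.0 metadata and XML signature namespaces. The function name and the sample document (entity ID, URL, and certificate text) are made-up placeholders, not values that IAM Identity Center produces.

```python
import xml.etree.ElementTree as ET

# Standard namespaces for SAML 2.0 metadata and XML signatures.
NAMESPACES = {
    "md": "urn:oasis:names:tc:SAML:2.0:metadata",
    "ds": "http://www.w3.org/2000/09/xmldsig#",
}

# Illustrative metadata only; the certificate value is a placeholder string.
SAMPLE_METADATA = """<md:EntityDescriptor
    xmlns:md="urn:oasis:names:tc:SAML:2.0:metadata"
    entityID="https://idp.example.com/saml">
  <md:IDPSSODescriptor protocolSupportEnumeration="urn:oasis:names:tc:SAML:2.0:protocol">
    <md:KeyDescriptor use="signing">
      <ds:KeyInfo xmlns:ds="http://www.w3.org/2000/09/xmldsig#">
        <ds:X509Data>
          <ds:X509Certificate>TUlJQ2V4
YW1wbGU=</ds:X509Certificate>
        </ds:X509Data>
      </ds:KeyInfo>
    </md:KeyDescriptor>
    <md:SingleSignOnService
        Binding="urn:oasis:names:tc:SAML:2.0:bindings:HTTP-POST"
        Location="https://idp.example.com/sso"/>
  </md:IDPSSODescriptor>
</md:EntityDescriptor>"""


def summarize_metadata(xml_text: str) -> dict:
    """Return the entity ID, certificates, and SSO locations found in the metadata."""
    root = ET.fromstring(xml_text)
    certificates = [
        "".join(element.text.split())  # drop line breaks inside the base64 text
        for element in root.iterfind(".//ds:X509Certificate", NAMESPACES)
        if element.text
    ]
    sso_locations = [
        element.get("Location")
        for element in root.iterfind(".//md:SingleSignOnService", NAMESPACES)
    ]
    return {
        "entity_id": root.get("entityID"),
        "certificates": certificates,
        "sso_locations": sso_locations,
    }


if __name__ == "__main__":
    summary = summarize_metadata(SAMPLE_METADATA)
    print("Entity ID:", summary["entity_id"])
    print("Certificates found:", len(summary["certificates"]))
    print("SSO locations:", summary["sso_locations"])
```

To check your own file, read its contents into a string and pass that string to summarize_metadata instead of the sample.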

You have now configured a SAML provider for OpenSearch Serverless. Next, we walk you through configuring the data access policy for accessing collections.

Create the data access policy

In this section, you set up data access policies for OpenSearch Serverless and allow access to the users. Complete the following steps:

  1. On the OpenSearch Service console, under Serverless in the navigation pane, choose Data access policies under Security.
  2. Choose Create access policy.
  3. Enter a name and description for your access policy.
  4. For Policy definition method, select Visual Editor.
  5. In the Rules section, enter a rule name.
  6. Under Select principals, for Add principals, choose SAML users and groups.
  7. For SAML provider name, choose the SAML provider you created earlier.
  8. Specify the user in the format user/<email>.

The value of the email address should match the email address in IAM Identity Center.

  9. Choose Save.
  10. Choose Grant and specify the permissions.

You can configure the access you want to provide to the user at the collection level, and to specific indexes at the index pattern level.

You should select the access the user needs based on the least privilege model. Refer to Supported policy permissions and Supported OpenSearch API operations and permissions to set up more granular access for your users.

  11. Choose Save and configure any additional rules, if required.

You can now review and edit your configuration if needed.

  12. Choose Create to create the data access policy.

Now you have the data access policy that will allow the users to perform the allowed actions on OpenSearch Dashboards.
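
If you define the policy in the JSON editor instead of the Visual Editor, SAML principals are written as a single string. The following Python sketch assembles that string, assuming the saml/<account-id>/<provider-name>/user/<email> form; verify the format against the OpenSearch Serverless documentation before relying on it. The function name, account ID, provider name, and email address are all placeholders.

```python
def saml_principal(account_id: str, provider_name: str, kind: str, identifier: str) -> str:
    """Build a SAML principal string for a data access policy (assumed format)."""
    # Assumption: principals are either users or groups.
    if kind not in ("user", "group"):
        raise ValueError("kind must be 'user' or 'group'")
    return f"saml/{account_id}/{provider_name}/{kind}/{identifier}"


if __name__ == "__main__":
    # Placeholder values; the email must match the one in IAM Identity Center.
    principal = saml_principal(
        "111122223333", "my-saml-provider", "user", "jane@example.com"
    )
    print(principal)
```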

Access OpenSearch Dashboards

To sign in to OpenSearch Dashboards, complete the following steps:

  1. On the OpenSearch Service dashboard, under Serverless in the navigation pane, choose Dashboard.
  2. Locate your dashboard and copy the OpenSearch Dashboards URL (in the format <collection-endpoint>/_dashboards).
  3. Enter this URL into a new browser tab.
  4. On the OpenSearch login page, choose your IdP and specify your SSO credentials.
  5. Choose Login.

Configure SAML authentication using groups in IAM Identity Center

Groups can help you organize your users and permissions in a coherent way. With groups, you can add multiple users from the IdP, and then use groupid as the identifier in the data access policy. For more information, refer to Add groups and Add users to groups.

To configure group access to OpenSearch Dashboards, complete the following steps:

  1. On the IAM Identity Center console, navigate to your application.
  2. In the Attribute mappings section, add an additional attribute named group and map it to ${user:groups}, with the format unspecified.
  3. Choose Save changes.
  4. For the SAML provider in OpenSearch Serverless, under Additional settings, for Group attribute, enter group.
  5. For the data access policy, create a new rule or add an additional principal in the previous rule.
  6. Choose the SAML provider name and enter group/<GroupId>.

You can fetch the value for the group ID by navigating to the Group section on the IAM Identity Center console.

Clean up

If you don’t want to continue using the solution, be sure to delete the resources you created:

  1. On the IAM Identity Center console, remove the application.
  2. On the OpenSearch Service console, delete the following resources:
    1. Delete your collection.
    2. Delete the data access policy.
    3. Delete the SAML provider.

Conclusion

In this post, you learned how to set up IAM Identity Center as an IdP to access OpenSearch Dashboards using SAML for SSO. You also learned how to set up users and groups within IAM Identity Center and control their access to OpenSearch Dashboards. For more details, refer to SAML authentication for Amazon OpenSearch Serverless.

Stay tuned for a series of posts focusing on the various options available for you to build effective log analytics and search solutions using OpenSearch Serverless. You can also refer to the Getting started with Amazon OpenSearch Serverless workshop to learn more about OpenSearch Serverless.

If you have feedback about this post, submit it in the comments section. If you have questions about this post, start a new thread on the OpenSearch Service forum or contact AWS Support.


About the Authors

Utkarsh Agarwal is a Cloud Support Engineer in the Support Engineering team at Amazon Web Services. He specializes in Amazon OpenSearch Service. He provides guidance and technical assistance to customers, enabling them to build scalable, highly available, and secure solutions in the AWS Cloud. In his free time, he enjoys watching movies, TV series, and of course, cricket. Lately, he has also been attempting to master the art of cooking in his free time – the taste buds are excited, but the kitchen might disagree.

Ravi Bhatane is a software engineer with Amazon OpenSearch Serverless Service. He is passionate about security, distributed systems, and building scalable services. When he’s not coding, Ravi enjoys photography and exploring new hiking trails with his friends.

Prashant Agrawal is a Sr. Search Specialist Solutions Architect with Amazon OpenSearch Service. He works closely with customers to help them migrate their workloads to the cloud and helps existing customers fine-tune their clusters to achieve better performance and save on cost. Before joining AWS, he helped various customers use OpenSearch and Elasticsearch for their search and log analytics use cases. When not working, you can find him traveling and exploring new places. In short, he likes doing Eat → Travel → Repeat.