All posts by Muthu Pitchaimani

Reduce Mean Time to Resolution with an observability agent

2026-02-05 Muthu Pitchaimani

Post Syndicated from Muthu Pitchaimani original https://aws.amazon.com/blogs/big-data/reduce-mean-time-to-resolution-with-an-observability-agent/

Customers of all sizes have been successfully using Amazon OpenSearch Service to power their observability workflows and gain visibility into their applications and infrastructure. During incident investigation, Site Reliability Engineers (SREs) and operations center personnel rely on OpenSearch Service to query logs, examine visualizations, analyze patterns, correlate traces to find the root cause of the incident, and reduce Mean Time to Resolution (MTTR). When an incident happens that triggers alerts, SREs typically jump between multiple dashboards, write specific queries, check recent deployments, and correlate between logs and traces to piece together a timeline of events. Not only is this process largely manual, but it also creates a cognitive load on these personnel, even when all the data is readily available. This is where agentic AI can help, by being an intelligent assistant that can understand how to query, interpret various telemetry signals, and systematically investigate an incident.

In this post, we present an observability agent using OpenSearch Service and Amazon Bedrock AgentCore that can help surface root cause and get insights faster, handle multiple query-correlation cycles, and ultimately reduce MTTR even further.

Solution overview

The following diagram shows the overall architecture for the observability agent.

Applications and infrastructure emit telemetry signals in the form of logs, traces, and metrics. These signals are then gathered by OpenTelemetry Collector (Step 1) and exported to Amazon OpenSearch Ingestion using individual pipelines for every signal: logs, traces, and metrics (Step 2). These pipelines deliver the signal data to an OpenSearch Service domain and Amazon Managed Service for Prometheus (Step 3).

OpenTelemetry is the standard for instrumentation, and provides vendor-neutral data collection across a broad range of languages and frameworks. Enterprises of various sizes are adopting this architecture pattern using OpenTelemetry for their observability needs, especially those committed to open source tools. More notably, this architecture builds on open source foundations, helping enterprises avoid vendor lock-in, benefit from the open source community, and implement it across on-premises and various cloud environments.

For this post, we use the OpenTelemetry Demo application to demonstrate our observability use case. This is an ecommerce application powered by about 20 different microservices, and generates realistic telemetry data together with feature sets to generate load and simulate failures.

Model Context Protocol servers for observability signal data

The Model Context Protocol (MCP) provides a standardized mechanism to connect agents to external data sources and tools. In this solution, we built three distinct MCP servers, one for each type of signal.

The Logs MCP server exposes tool functions for searching, filtering, and selecting log data stored in an OpenSearch Service domain. This enables the agent to query the logs using criteria such as simple keyword matching, service name, log level, or time range, mimicking the typical queries you would run during an investigation. The following snippet shows pseudocode of what the tool functions can look like:

# Logs MCP Server - Key Functions
search_otel_logs(
    query: string,           # Text search query for log messages
    service: string,         # Service name to filter logs
    severity: string,        # Log level (INFO, WARN, ERROR)
    startTime: string,       # Start time (ISO format or relative e.g., 'now-1h')
    endTime: string,         # End time (ISO format or relative e.g., 'now')
    size: number             # Number of results to return
)
get_logs_by_trace_id(
    traceId: string,         # Trace ID to retrieve all correlated logs
    size: number             # Maximum number of logs to return
)
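
As a rough illustration of what a tool like search_otel_logs might do internally, the following Python sketch translates the parameters above into an OpenSearch query DSL body. The function name build_log_search_body and the field names (body, severityText, resource.attributes.service.name, @timestamp) are assumptions for illustration only; the actual field names depend on how your ingestion pipeline maps OpenTelemetry logs.

```python
from typing import Any, Dict, List, Optional


def build_log_search_body(
    query: Optional[str] = None,
    service: Optional[str] = None,
    severity: Optional[str] = None,
    start_time: str = "now-1h",
    end_time: str = "now",
    size: int = 50,
) -> Dict[str, Any]:
    """Build an OpenSearch query DSL body for a log search.

    All field names used here are assumed, not taken from the post.
    """
    must: List[Dict[str, Any]] = []
    # The time range is always applied as a non-scoring filter.
    filters: List[Dict[str, Any]] = [
        {"range": {"@timestamp": {"gte": start_time, "lte": end_time}}}
    ]
    if query:
        must.append({"match": {"body": query}})
    if service:
        filters.append({"term": {"resource.attributes.service.name": service}})
    if severity:
        filters.append({"term": {"severityText": severity.upper()}})
    return {
        "size": size,
        "sort": [{"@timestamp": {"order": "desc"}}],
        "query": {"bool": {"must": must, "filter": filters}},
    }


# Hypothetical call: recent error logs from a service named "payment".
example_body = build_log_search_body(
    query="connection refused", service="payment", severity="error", size=20
)
```

Keeping the service, severity, and time constraints in the filter clause means only the free-text match contributes to relevance scoring, which is one reasonable design for this kind of tool.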

The Traces MCP server exposes tool functions for searching and retrieving information about distributed traces. These functions can help look up traces by trace ID and find traces for a particular service, the spans belonging to a trace, the service map information constructed based on the spans, and the rate, error, and duration (also known as RED metrics). This enables the agent to follow a request’s path across the services and pinpoint where failures happened or latency originated.

# Traces MCP Server - Key Functions
get_otel_spans(
    serviceName: string,     # Service name to filter spans
    traceId: string,         # Trace ID to filter spans
    spanId: string,          # Span ID to retrieve a specific span
    operationName: string,   # Operation/span name to filter
    startTime: string,       # Start time (ISO format or relative)
    endTime: string,         # End time (ISO format or relative)
    size: number             # Number of results to return
)
get_spans_by_trace_id(
    traceId: string,         # Trace ID to retrieve all spans for
    size: number             # Maximum number of spans to return
)
get_otel_service_map(
    serviceName: string,     # Service name to filter service map
    startTime: string,       # Start time
    endTime: string,         # End time
    size: number             # Number of results to return
)
get_otel_rate_error_duration_metrics(
    startTime: string,       # Start time (default: 'now-5m')
    endTime: string          # End time (default: 'now')
)
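
To make the RED metrics concrete, the following Python sketch derives rate, error ratio, and a p95 duration per service from a list of spans. The span dictionary keys (service, duration_ms, is_error) and the function name compute_red_metrics are hypothetical; this is not the implementation behind get_otel_rate_error_duration_metrics, only one way such values could be computed.

```python
import math
from collections import defaultdict
from typing import Dict, Iterable, List


def compute_red_metrics(spans: Iterable[dict], window_seconds: float) -> Dict[str, dict]:
    """Compute rate, error ratio, and p95 duration per service.

    Each span is assumed to look like:
    {"service": str, "duration_ms": float, "is_error": bool}
    """
    by_service: Dict[str, List[dict]] = defaultdict(list)
    for span in spans:
        by_service[span["service"]].append(span)

    result: Dict[str, dict] = {}
    for service, items in by_service.items():
        durations = sorted(s["duration_ms"] for s in items)
        errors = sum(1 for s in items if s["is_error"])
        # Nearest-rank percentile: smallest value with at least 95% at or below it.
        p95_index = max(0, math.ceil(0.95 * len(durations)) - 1)
        result[service] = {
            "rate_per_sec": len(items) / window_seconds,
            "error_rate": errors / len(items),
            "p95_duration_ms": durations[p95_index],
        }
    return result


# Made-up spans covering a 60-second window.
sample_spans = [
    {"service": "payment", "duration_ms": 100, "is_error": True},
    {"service": "payment", "duration_ms": 200, "is_error": True},
    {"service": "payment", "duration_ms": 300, "is_error": False},
    {"service": "payment", "duration_ms": 4000, "is_error": True},
    {"service": "cart", "duration_ms": 10, "is_error": False},
    {"service": "cart", "duration_ms": 20, "is_error": False},
]
red = compute_red_metrics(sample_spans, window_seconds=60)
```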

The Metrics MCP server exposes tool functions for querying time series metrics. The agent can use these functions to check error rate percentiles and resource utilization, which are key signals for understanding the overall health of the system and identifying anomalous behavior.

# Metrics MCP Server - Key Functions
query_instant(
    query: string,           # PromQL query expression
    time: string,            # Evaluation timestamp (optional)
    timeout: string          # Evaluation timeout (optional)
)
query_range(
    query: string,           # PromQL query expression
    start: string,           # Start timestamp
    end: string,             # End timestamp
    step: string,            # Query resolution step (e.g., '15s', '1m')
    timeout: string          # Evaluation timeout (optional)
)
get_timeseries(
    metric: string,          # Metric name or PromQL expression
    duration: string,        # Time duration to look back (e.g., '1h', '6h')
    step: string             # Step size (optional)
)
search_metrics(
    pattern: string          # Search pattern (supports regex e.g., 'http.*')
)
explore_metric(
    metric: string           # Metric name to explore (metadata + samples)
)
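
As a sketch of how a convenience function like get_timeseries could map onto a range query, the following Python code converts a look-back duration into the query, start, end, and step parameters accepted by the Prometheus HTTP API range query endpoint. The helper names parse_duration and build_range_params are hypothetical, and request signing and the HTTP call itself are intentionally left out.

```python
import re
from datetime import datetime, timedelta, timezone
from typing import Dict, Optional

_UNITS = {"s": "seconds", "m": "minutes", "h": "hours", "d": "days"}


def parse_duration(text: str) -> timedelta:
    """Parse simple durations such as '15s', '5m', '1h', or '2d'."""
    match = re.fullmatch(r"(\d+)([smhd])", text)
    if not match:
        raise ValueError(f"unsupported duration: {text!r}")
    value, unit = match.groups()
    return timedelta(**{_UNITS[unit]: int(value)})


def build_range_params(
    metric: str,
    duration: str,
    step: str = "1m",
    now: Optional[datetime] = None,
) -> Dict[str, str]:
    """Build range query parameters covering the last `duration`."""
    end = now or datetime.now(timezone.utc)
    start = end - parse_duration(duration)
    return {
        "query": metric,
        "start": str(int(start.timestamp())),  # Unix seconds
        "end": str(int(end.timestamp())),
        "step": step,
    }


# A fixed "now" keeps the example deterministic.
fixed_now = datetime(2026, 1, 1, 12, 0, tzinfo=timezone.utc)
example_params = build_range_params("up", "1h", step="15s", now=fixed_now)
```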

These three MCP servers span the different types of data used by investigation engineers, providing a complete working set for an agent to conduct investigations with autonomous correlation across logs, traces, and metrics to determine the possible root causes for an issue. Additionally, a custom MCP server exposes tool functions over business data on revenue, sales, and other business metrics. For the OpenTelemetry demo application, you can develop synthetic data to aid in providing context for impact and other business-level metrics. For brevity, we don’t show that server as a part of this architecture.

Observability agent

The observability agent is central to the solution. It is built to help with incident investigation. Traditional automations and manual runbooks typically follow predefined operating procedures, but with an observability agent, you don’t need to define them. The agent can analyze, reason based on the data available to it, and adapt its strategy based on what it discovers. It correlates findings across logs, traces, and metrics to arrive at a root cause.

The observability agent is built with the Strands Agents SDK, an open source framework that simplifies development of AI agents. The SDK provides a model-driven approach with flexibility to handle underlying orchestration and reasoning (the agent loop) by invoking exposed tools and maintaining coherent, turn-based interactions. This implementation also discovers tools dynamically, so if there is a change in the capabilities, the agent can make decisions based on up-to-date information.

The agent runs on Amazon Bedrock AgentCore Runtime, which provides fully managed infrastructure for hosting and running agents. The runtime supports popular agent frameworks, including Strands, LangGraph, and CrewAI. The runtime also provides the scaling, availability, and compute that many enterprises require to run production-grade agents.

We use Amazon Bedrock AgentCore Gateway to connect to all three MCP servers. When deploying agents at scale, gateways are indispensable components to reduce management tasks like custom code development, infrastructure provisioning, comprehensive ingress and egress security, and unified access. These are essential enterprise functions needed when bringing a workload to production. In this application, we create gateways that connect all three MCP servers as targets using server-sent events. Gateways work alongside Amazon Bedrock AgentCore Identity to provide secure credentials management and secure identity propagation from the user to the communicating entities. The sample application uses AWS Identity and Access Management (IAM) for identity management and propagation.

Incident investigation is often a multi-step process. It involves iterative hypothesis testing, multiple rounds of querying, and building context over time. We use Amazon Bedrock AgentCore Memory for this purpose. In this solution, we use session-based namespaces to maintain separate conversation threads for different investigations. For example, when a user asks “What about Payment service?” during an investigation, the agent retrieves recent conversation history from memory to maintain awareness of prior findings. We store both user questions and agent responses with timestamps to help the agent reconstruct the conversation chronologically and reason about already completed findings.
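
The session-based pattern described above can be sketched with a small in-memory store. This is only a stand-in to show the idea of per-session namespaces and timestamped turns; it is not the Amazon Bedrock AgentCore Memory API, and the class and method names (SessionMemory, add, recent) are hypothetical.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Turn:
    timestamp: float
    role: str  # "user" or "agent"
    text: str


class SessionMemory:
    """Keeps a separate, timestamped conversation thread per session."""

    def __init__(self) -> None:
        self._turns: Dict[str, List[Turn]] = defaultdict(list)

    def add(self, session_id: str, role: str, text: str, timestamp: float) -> None:
        self._turns[session_id].append(Turn(timestamp, role, text))

    def recent(self, session_id: str, limit: int = 5) -> List[Turn]:
        # Sort by timestamp so the agent can replay the thread chronologically.
        turns = sorted(self._turns.get(session_id, []), key=lambda t: t.timestamp)
        return turns[-limit:]


memory = SessionMemory()
memory.add("incident-1", "user", "Are there any errors?", timestamp=100.0)
memory.add("incident-1", "agent", "Payment service is unreachable.", timestamp=101.0)
memory.add("incident-2", "user", "Check cart latency.", timestamp=50.0)
memory.add("incident-1", "user", "What about Payment service?", timestamp=102.0)

# Only the two most recent turns of the first investigation are returned.
history = memory.recent("incident-1", limit=2)
```

Because each investigation has its own namespace, a follow-up question in one session never pulls in findings from another.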

We configured the observability agent to use Anthropic’s Claude Sonnet v4.5 in Amazon Bedrock for reasoning. The model interprets questions, decides which MCP tool to invoke, analyzes the results, and formulates the next set of questions or conclusions. We use a system prompt to instruct the model to think like an experienced SRE or operations center engineer: start with a high-level check, narrow down the affected components, correlate across telemetry signal types, and derive conclusions with substantiation. We also ask the model to suggest logical next steps, such as drilling down to investigate inter-service dependencies. This makes the agent versatile enough to analyze and reason about common varieties of incident investigations.

Observability agent in action

We built real-time RED (rate, errors, duration) metrics dashboards for the entire application, as shown in the following figure.

To establish a baseline, we asked the agent the following question: “Are there any errors in my application in the last five minutes?” The agent queries the traces and metrics, analyzes the results, and responds saying there are no errors in the system. It notes that all the services are active, traces are healthy, and the system is processing requests normally. The agent also proactively suggests next steps that might be useful for further investigation.

Introducing failures

The OpenTelemetry demo application has a feature flag that we can use to introduce deliberate failures in the system. It also includes load generation so these errors can surface prominently. We use these features to introduce a few failures with the payment service. The real-time RED metrics dashboards in the previous figure reflect the impact and show the error rates climbing.

Investigation and root cause analysis

Now that we are generating errors, we engage the agent again. This is typically the start of the investigation session. In practice, workflows such as alarms firing or pages going out can also trigger the start of an investigation.

We ask the question “Users are complaining that it is taking a long time to buy items. Can you check to see what is going on?”

The agent retrieves the conversation history from memory (if there is any), invokes tools to query RED metrics across services, and analyzes the results. It identifies a critical purchase flow performance issue: payment service is in a connectivity crisis and completely unavailable, with extreme latency observed in fraud detection, ad service, and recommendation service. The agent provides immediate action recommendations—restore payment service connectivity as the top priority—and suggests next steps, including investigating payment service logs.

Following the agent’s suggestion, we ask it to investigate the logs: “Investigate payment service logs to understand the connectivity issue.”

The agent searches logs for the checkout and payment services, correlates them with trace data, and analyzes service dependencies from the service map. It confirms that although cart service, product catalog service, and currency service are healthy, the payment service is completely unreachable, successfully identifying the root cause of our deliberately introduced failure.

Beyond root cause: Analyzing business impact

As mentioned earlier, we have synthetic business sales and revenue data in a separate MCP server, so when the user asks the agent “Analyze the business impact of the checkout and payment service failures,” the agent uses this business data, examines the transaction data from traces, calculates estimated revenue impact, and assesses customer abandonment rates due to checkout failures. This shows how the agent can go beyond identifying the root cause and help with operational activities like creating a runbook for future issue resolution, which can be the first step toward automatic remediation without involving SREs.
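
The kind of arithmetic behind such an impact estimate can be sketched as follows. The formulas and all numbers are illustrative assumptions (the post does not describe how the agent computes these values), and the function names are hypothetical.

```python
def estimate_revenue_impact(
    failed_checkouts: int,
    average_order_value: float,
    recovery_rate: float = 0.0,
) -> float:
    """Lost revenue = failed checkouts x average order value x (1 - recovery rate).

    recovery_rate is the assumed fraction of customers who retry successfully later.
    """
    if not 0.0 <= recovery_rate <= 1.0:
        raise ValueError("recovery_rate must be between 0 and 1")
    return failed_checkouts * average_order_value * (1.0 - recovery_rate)


def abandonment_rate(checkouts_started: int, checkouts_completed: int) -> float:
    """Fraction of started checkouts that were never completed."""
    if checkouts_started == 0:
        return 0.0
    return (checkouts_started - checkouts_completed) / checkouts_started


# Made-up inputs: 120 failed checkouts, $50 average order, 25% later recovered.
lost_revenue = estimate_revenue_impact(120, 50.0, recovery_rate=0.25)
abandoned = abandonment_rate(checkouts_started=200, checkouts_completed=80)
```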

Benefits and results

Although the failure scenario in this post is simplified for illustration, it highlights several key benefits that directly contribute to reducing MTTR.

Accelerated investigation cycles

Traditional workflows for troubleshooting involve multiple iterations of hypotheses, verification, querying, and data analysis at each step, requiring context switching and consuming hours of effort. The observability agent reduces these drastically to a few minutes by autonomous reasoning, correlation, and actioning, which in turn reduces MTTR.

Handling complex workflows

Real-world production scenarios often involve cascading failures and multiple system failures. The observability agent’s capabilities can extend to these scenarios by using historical data and pattern recognition. For instance, it can distinguish related issues from false positives using temporal or identity-based correlation, dependency graphs, and other techniques, helping SREs avoid wasted investigation effort on unrelated anomalies.
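
One simple way to combine temporal correlation with a dependency graph is sketched below: an anomaly is treated as related only if it occurs close in time to the incident and on a service the affected service depends on, directly or transitively. The data shapes, service names, and the five-minute window are assumptions chosen for illustration.

```python
from collections import deque
from typing import Dict, List, Set


def reachable_services(start: str, depends_on: Dict[str, List[str]]) -> Set[str]:
    """Return the start service plus everything it depends on, transitively."""
    seen = {start}
    queue = deque([start])
    while queue:
        current = queue.popleft()
        for dep in depends_on.get(current, []):
            if dep not in seen:
                seen.add(dep)
                queue.append(dep)
    return seen


def related_anomalies(
    incident_service: str,
    incident_time: float,
    anomalies: List[dict],
    depends_on: Dict[str, List[str]],
    window_seconds: float = 300.0,
) -> List[dict]:
    """Keep anomalies that are both in the dependency chain and near in time."""
    candidates = reachable_services(incident_service, depends_on)
    return [
        a
        for a in anomalies
        if a["service"] in candidates
        and abs(a["time"] - incident_time) <= window_seconds
    ]


# Hypothetical dependency graph and anomaly list.
depends_on = {"checkout": ["payment", "cart"], "payment": ["fraud-detection"]}
anomalies = [
    {"service": "payment", "time": 1060.0},  # in chain, within window
    {"service": "ad", "time": 1030.0},       # near in time, but not in chain
    {"service": "cart", "time": 1900.0},     # in chain, but too late
]
related = related_anomalies("checkout", 1000.0, anomalies, depends_on)
```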

Rather than provide a single answer, the agent can provide a probability distribution across potential root causes, helping SREs prioritize remediation methods; for example:

Payment service network connectivity issue: 75%
Downstream payment gateway timeout: 15%
Database connection pool exhaustion: 8%
Other/Unknown: 2%
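
A distribution like this can be produced by normalizing per-hypothesis evidence scores so that they sum to 100%. The following sketch assumes the agent has already assigned a non-negative score to each candidate cause; the scores and the function name are made up for illustration.

```python
from typing import Dict


def to_percentages(scores: Dict[str, float]) -> Dict[str, int]:
    """Normalize non-negative evidence scores into whole-number percentages."""
    total = sum(scores.values())
    if total <= 0:
        raise ValueError("at least one score must be positive")
    return {cause: round(100 * score / total) for cause, score in scores.items()}


# Hypothetical evidence scores for each candidate root cause.
evidence = {
    "Payment service network connectivity issue": 150,
    "Downstream payment gateway timeout": 30,
    "Database connection pool exhaustion": 16,
    "Other/Unknown": 4,
}
distribution = to_percentages(evidence)
```

With other inputs, rounding each share independently can leave the total slightly above or below 100, so a production version might need to redistribute the remainder.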

The agent can compare current symptoms against past incidents, identifying whether similar patterns have happened in the past, thereby evolving from a reactive query tool into a proactive diagnostic assistant.

Conclusion

Incident investigation remains largely manual. SREs juggle dashboards, craft queries, and correlate signals under pressure, even when all the data is readily available. In this post, we showed how an observability agent built with Amazon Bedrock AgentCore and OpenSearch Service can alleviate this cognitive burden by autonomously querying logs, traces, and metrics; correlating findings; and guiding SREs toward root cause faster. Although this pattern represents one approach, the flexibility of Amazon Bedrock AgentCore combined with the search and analytics capabilities of OpenSearch Service enables agents to be designed and deployed in numerous ways (at different stages of the incident lifecycle, with varying levels of autonomy, or focused on specific investigation tasks) to suit your organization’s unique operational needs. Agentic AI doesn’t replace existing observability investments, but amplifies them by providing an effective way to use your data during incident investigations.

OpenSearch UI: Six months in review

2025-05-23 Muthu Pitchaimani

Post Syndicated from Muthu Pitchaimani original https://aws.amazon.com/blogs/big-data/opensearch-ui-six-months-in-review/

OpenSearch UI has been adopted by thousands of customers for various use cases since its launch in November 2024. Exciting customer stories and feedback have helped shape our feature improvements. As we complete 6 months since its general availability, we are sharing major enhancements that have improved OpenSearch UI’s capability, especially in observability and security analytics, in this post.

OpenSearch UI is a serverless, fully managed dashboard to provide a scalable, zero-downtime, web-based interface for data analytics and visualizations. With OpenSearch UI, you can have a unified interface to gain actionable insights across multiple data sources, including Amazon OpenSearch Service domains, Amazon OpenSearch Serverless collections, and AWS services such as Amazon CloudWatch and Amazon Security Lake.

Use natural language for your AI-powered analytics with Amazon Q Developer

OpenSearch UI has transformed complex data analysis to be as simple as asking questions in natural language with its integration with Amazon Q Developer in OpenSearch. You can access the conversational chat pane by choosing the Amazon Q Developer icon in the top right corner of the UI. Amazon Q Developer will answer generic questions such as how to use the features in OpenSearch UI and how to use OpenSearch UI with additional data sources.

You can use the search bar on the Discover page to use the generative AI capabilities with your OpenSearch data. You can enter your question about your data in natural language. The query assistant feature will translate your question to Piped Processing Language (PPL), run the query, and show the results. There will also be an Amazon Q Summary section generated from the query results to answer your question. The query assistant feature now also works with data connections from Amazon Simple Storage Service (Amazon S3).

Additionally, you can use the generative AI feature for anomaly detection and visualizations for your data, so it’s straightforward to identify potential issues earlier and faster, reducing the mean time to resolution.

When an alert is triggered, you can choose the Amazon Q icon to generate a summary of the alert, so you can catch up on the context of this alert. The View insights button will provide further analysis of the alerts in combination with OpenSearch knowledge through a process called Retrieval Augmented Generation (RAG). If you want to further investigate the alert, you can choose View in Discover to proceed to log analytics.

Amazon Q Developer in OpenSearch Service will help you reduce troubleshooting time, resolve more issues without escalation, and extract actionable insights from your operational data using natural language instead of specialized queries. Refer to Amazon Q Developer in Amazon OpenSearch Service to get started with the AI assisted analytics experience.

Enhance enterprise security

We have improved OpenSearch UI’s security capability to meet the demanding needs of large enterprises. Through these enhancements, we’re making it seamless to manage secure access at scale so you can have precise control over who can access your analytics workspaces and data that resides in them.

Use SAML workflows through IAM federation

OpenSearch UI now supports Security Assertion Markup Language (SAML) through AWS Identity and Access Management (IAM) federation so that you can create a single sign-on (SSO) experience for your end-users that initiates authentication workflows from your external identity providers (IdPs), typically called IdP-initiated SSO. You might find this process familiar if your organization is using external IdPs (such as Okta) to manage user permissions and track user activities in accessing AWS services. You can now define a default relay state URL to share with your end-users with this support. Your end-users can use this URL to land directly in OpenSearch UI after authenticating with their IdP. You can also achieve fine-grained access control by defining different permissions for each IAM role assumed by different end-users. To get started, refer to Enabling SAML federation with AWS Identity and Access Management.

Secure access with AWS PrivateLink

OpenSearch UI now supports AWS PrivateLink. You can now access OpenSearch UI privately from within your virtual private cloud (VPC). To learn more, see Managing access to the OpenSearch UI from a VPC endpoint.

Enhancing workspace privacy

There are also new workspace-level privacy settings, so you can quickly configure your workspace with the right permissions with collaborators. For more details, refer to Using Amazon OpenSearch Service workspaces.

Expanded data access capabilities

OpenSearch UI now also offers the following additional data access capabilities.

Support for cross-cluster search

Cross-cluster search is an OpenSearch feature with which you can query multiple connected OpenSearch Service domains across accounts and across AWS Regions. We added the capability to support these connected domains as data sources in OpenSearch UI. With this support, you can view remote connected clusters with an index pattern under the data source for the source cluster. To learn more, see Cross-Region and cross-account data access with cross-cluster search.

Regional expansion

To further expand the data access capabilities of OpenSearch UI, we expanded its availability to two more AWS Regions: Asia Pacific (Hong Kong) and Europe (Stockholm).

Conclusion

The past 6 months after general availability of OpenSearch UI have seen significant progress in making OpenSearch UI more user-friendly, more available, and more secure. From natural language-based exploration to enterprise security, these feature enhancements reflect our commitment to simplify and improve your data analytics experience. To learn more, refer to Using OpenSearch UI in Amazon OpenSearch Service and get updates through Amazon OpenSearch Service user interface release history.

About the Authors

Muthu Pitchaimani is a Search Specialist with Amazon OpenSearch Service. He builds large-scale search applications and solutions. Muthu is interested in the topics of networking and security, and is based out of Austin, Texas.

Hang (Arthur) Zuo is a Senior Product Manager with Amazon OpenSearch Service. Arthur leads generative AI, workspaces, and infrastructural features in OpenSearch UI. Arthur is passionate about cloud technologies and building data products that help users and businesses gain actionable insights and achieve operational excellence.

Introducing Amazon Q Developer in Amazon OpenSearch Service

2025-05-09 Muthu Pitchaimani

Post Syndicated from Muthu Pitchaimani original https://aws.amazon.com/blogs/big-data/introducing-amazon-q-developer-in-amazon-opensearch-service/

Customers use Amazon OpenSearch Service to store their operational and telemetry signal data. They use this data to monitor the health of their applications and infrastructure, so that when a production issue happens, they can identify the cause quickly. The sheer volume and variety in data often makes this process complex and time-consuming, leading to high mean time to repair (MTTR).

To expedite this process and transform how developers interact with their operational data, today we introduced Amazon Q Developer support in OpenSearch Service. With this AI-assisted analysis, both new and experienced users can navigate complex operational data without training, analyze issues, and gain insights in a fraction of the time. Amazon Q Developer in OpenSearch Service reduces MTTR by integrating generative AI capabilities directly into OpenSearch workflows so you can improve your operational capabilities without scaling your specialist teams. You can now investigate issues, analyze patterns, and create visualizations using in-context assistance and natural language interactions.

In this post, we share how to get started using Amazon Q Developer in OpenSearch Service and explore some of its key capabilities.

Solution overview

Setting up observability signal data for analysis involves many steps, including instrumenting application code, creating complex queries, creating visualizations and dashboards, configuring appropriate alerts, and often setting up machine learning-based anomaly detectors. This requires significant upfront investment in time, resources, and expertise. Amazon Q Developer in OpenSearch Service introduces natural language exploration and generative AI-based tooling throughout OpenSearch, simplifying both initial setup and ongoing operations. Customers already use natural language-based query generation to help construct OpenSearch queries; Amazon Q in OpenSearch Service brings in the following additional capabilities:

Natural language-based visualizations
Result summarization for queries generated with natural language queries
Anomaly detector suggestions
Alert summarization and insights
Best practices guidance

Let’s explore each of these capabilities in detail to understand how they help transform traditional observability workflows and streamline the process of data analysis in the centralized OpenSearch UI.

Natural language-based visualization

Natural language-based visualizations with Amazon Q for OpenSearch Service fundamentally transform how users create and interact with data visualizations. You don’t need to know specialized query languages currently used in OpenSearch Service dashboards to create complex visualizations. For example, you can input requests like “show me a chart of error rates over the last 24 hours broken down by region” or “create a chart showing the distribution of HTTP response codes,” and Amazon Q will automatically generate the appropriate visualization.

To get started with this feature, choose Visualizations in the navigation pane and choose Create New Visualization. The OpenSearch UI has many built-in visualization types. To use the new natural language-based visualization, choose Natural language previewer.

This brings up a new visualization page with a text field where you can enter a query in natural language.

Choose an index pattern on the dropdown menu (opensearch_dashboards_sample_data_logs in this case). Amazon Q interprets your intent, identifies relevant fields, automatically selects the most appropriate visualization type, and applies proper formatting and styling. Amazon Q can also understand multiple dimensions in the data, various aggregation methods, and different time ranges.

Now you’re ready to build your visualization in natural language. For example, for the query “Show me number of distinct IP addresses per day in logs,” we see the following visualization.

Amazon Q generates the visualization according to your instruction. The UI also gives you the option to update any component of the data, transformations, marks, and encoding for the visualization. This window also shows the generated PPL query for the data. For this example, Amazon Q generated the following query:

source=opensearch_dashboards_sample_data_logs* | stats DISTINCT_COUNT(`ip`) as unique_ips by span(`timestamp`, 1d)
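
To clarify what this query computes, the following Python sketch performs the same aggregation on a handful of in-memory records: it groups records into one-day buckets by timestamp and counts the distinct ip values in each bucket. The sample records are made up, and the function name is hypothetical.

```python
from collections import defaultdict
from datetime import datetime
from typing import Dict, Iterable, Set


def distinct_ips_per_day(records: Iterable[dict]) -> Dict[str, int]:
    """Count distinct IP addresses per calendar day."""
    ips_by_day: Dict[str, Set[str]] = defaultdict(set)
    for record in records:
        # Truncate the timestamp to its date, like span(`timestamp`, 1d).
        day = datetime.fromisoformat(record["timestamp"]).date().isoformat()
        ips_by_day[day].add(record["ip"])
    return {day: len(ips) for day, ips in sorted(ips_by_day.items())}


sample_records = [
    {"timestamp": "2025-05-01T10:00:00", "ip": "10.0.0.1"},
    {"timestamp": "2025-05-01T11:30:00", "ip": "10.0.0.1"},  # repeat on same day
    {"timestamp": "2025-05-01T12:00:00", "ip": "10.0.0.2"},
    {"timestamp": "2025-05-02T09:00:00", "ip": "10.0.0.3"},
]
unique_ips = distinct_ips_per_day(sample_records)
```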

Using this interactive UI, you can customize different aspects of the visualization if needed. For example, if you prefer to use a bar type instead of what Amazon Q generated, you can change the mark type to bar and choose Update, or choose Edit visual and specify a new set of instructions for this visualization (for example, “change to bar chart”).

After you have adjusted the visualization to your satisfaction, you can save it to retrieve later. What makes this feature particularly powerful is its ability to understand context and suggest refinements by updating your prompts—if the initial visualization doesn’t quite meet your needs, you can describe the desired changes using the Edit visual option.

Result summarization

Amazon Q acts as an interpretation layer that processes query results into a condensed, structured summary. It can also identify patterns and other significant trends in the data by observing both the qualitative and quantitative characteristics of the results. The system’s effectiveness largely depends on the quality of the underlying data, the specificity of the initial query, and the characteristics of query generation, among other things. Amazon Q also samples the result set for generating this result summarization. These summaries are a good starting point for analysis. For example, for the same query we used last time (“Show me number of distinct IP addresses per day in logs”), Amazon Q will analyze the result set in the Amazon Q Summary section.

Anomaly detector suggestions

As it responds to your query, Amazon Q can suggest creating an anomaly detector based on the data source you selected. It does this by recommending relevant fields from your operational data, and you can create the detector with a one-click confirmation.

Features are aggregations of fields or scripts that determine what constitutes an anomaly. Identifying features and creating a detector that uses them typically requires a deep technical understanding of spikes, dips, thresholds, and the interrelationships between multiple features. Amazon Q helps reduce this complexity by automatically identifying these features, as shown in the following figure. You can also change the suggested detector to fine-tune it to your needs.
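
For readers who want to see what a feature looks like outside the UI, the following Python sketch builds a detector definition with a single aggregation feature. The structure is modeled on the create-detector request of the open source OpenSearch anomaly detection plugin as we understand it, so verify the field names against the plugin documentation before using it. The function name, the bytes field, and the default interval are assumptions for illustration.

```python
from typing import Any, Dict


def build_detector_config(
    name: str,
    index_pattern: str,
    time_field: str,
    feature_field: str,
    aggregation: str = "sum",
    interval_minutes: int = 10,
) -> Dict[str, Any]:
    """Build a detector definition with one aggregation-based feature."""
    feature_name = f"{aggregation}_{feature_field}"
    return {
        "name": name,
        "description": f"Detect anomalies in {aggregation}({feature_field})",
        "time_field": time_field,
        "indices": [index_pattern],
        "feature_attributes": [
            {
                "feature_name": feature_name,
                "feature_enabled": True,
                # The feature is an aggregation over one field per interval.
                "aggregation_query": {
                    feature_name: {aggregation: {"field": feature_field}}
                },
            }
        ],
        "detection_interval": {
            "period": {"interval": interval_minutes, "unit": "Minutes"}
        },
    }


# Hypothetical detector over the sample logs index used earlier.
detector = build_detector_config(
    name="bytes-anomaly",
    index_pattern="opensearch_dashboards_sample_data_logs*",
    time_field="timestamp",
    feature_field="bytes",
)
```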

Alerts summarization and insights

Choosing the Amazon Q icon next to alerts generates a concise summary that includes alert definitions, the specific conditions that led to its activation, and an overview of the current state of the monitored system or service.

The insights component provides higher-level insight into the alerts by highlighting their significance and the typical conditions that result in them, along with recommendations to help mitigate those conditions. To get an insight for an alert, you need to provide additional information about your environment with a knowledge base. For instructions on generating insights, see View alert summaries and insights.

By choosing View in Discover, you can dive deeper into the data behind the alert with a single click, facilitating a seamless transition from alert notification to detailed investigation in Discover. Although the insights and summarization features help accelerate your investigations, care must still be taken when identifying the root cause of the problem, because it will likely require human intervention.

Best practices guidance

Amazon Q Developer in OpenSearch Service not only simplifies operations, but also serves as an intelligent assistant for implementing OpenSearch Service best practices. Amazon Q for OpenSearch Service has been trained on the developer and product documentation, so that it can suggest best practices for operating OpenSearch Service domains, Amazon OpenSearch Serverless collections, and configurations based on your needs for capacity and compliance. To get started, choose the Amazon Q icon on the top right. The assistant maintains the history of the conversations. For the guidance it provides, the assistant cites its sources, providing a helpful link to the documentation. It also provides suggestions to continue the conversation. You can ask questions regarding data access policies, index state management, sizing leader nodes, or other best practices or operational questions about OpenSearch.

Cost considerations

OpenSearch UI is available for use without other associated costs. Amazon Q Developer for OpenSearch Service is available within OpenSearch UI in the following AWS Regions: US East (N. Virginia), US West (Oregon), Asia Pacific (Mumbai), Asia Pacific (Sydney), Asia Pacific (Tokyo), Canada (Central), Europe (Frankfurt), Europe (London), Europe (Paris), and South America (São Paulo). Because it’s included in the Free Tier, there is no associated cost.

Conclusion

Amazon Q Developer support in OpenSearch Service brings in AI-powered capabilities to help alleviate the traditional barriers that teams face when setting up, monitoring, and troubleshooting their applications. This allows teams of all experience levels to harness the full power of OpenSearch.

We’re excited to see how you will use these new capabilities to transform your observability workflows and drive better operational outcomes. To get started with Amazon Q Developer in OpenSearch Service, refer to Amazon Q Developer is now generally available in Amazon OpenSearch Service.

About the Authors

Dagney Braun is a Senior Manager of Product on the Amazon Web Services OpenSearch team. She is passionate about improving the ease of use of OpenSearch and expanding the tools available to better support all customer use cases.

Achieve cross-Region resilience with Amazon OpenSearch Ingestion

2024-09-24 Muthu Pitchaimani

Post Syndicated from Muthu Pitchaimani original https://aws.amazon.com/blogs/big-data/achieve-cross-region-resilience-with-amazon-opensearch-ingestion/

Cross-Region deployments provide increased resilience to maintain business continuity during outages, natural disasters, or other operational interruptions. Many large enterprises design and deploy special plans for readiness during such situations. They rely on solutions built with AWS services and features to improve their confidence and response times. Amazon OpenSearch Service is a managed service for OpenSearch, a search and analytics engine at scale. OpenSearch Service provides high availability within an AWS Region through its Multi-AZ deployment model and provides Regional resiliency with cross-cluster replication. Amazon OpenSearch Serverless is a deployment option that provides on-demand auto scaling, to which we continue to bring in many features.

With the existing cross-cluster replication feature in OpenSearch Service, you designate a domain as a leader and another as a follower, using an active-passive replication model. Although this model offers a way to continue operations during Regional impairment, it requires you to manually configure the follower. Additionally, after recovery, you need to reconfigure the leader-follower relationship between the domains.

In this post, we outline two solutions that provide cross-Region resiliency without needing to reestablish relationships during a failback, using an active-active replication model with Amazon OpenSearch Ingestion (OSI) and Amazon Simple Storage Service (Amazon S3). These solutions apply to both OpenSearch Service managed clusters and OpenSearch Serverless collections. We use OpenSearch Serverless as an example for the configurations in this post.

Solution overview

We outline two solutions in this post. In both options, data sources local to a Region write to an OpenSearch Ingestion (OSI) pipeline configured within the same Region. The solutions are extensible to multiple Regions, but we show two Regions as an example because Regional resiliency across two Regions is a popular deployment pattern for many large-scale enterprises.

You can use these solutions to address cross-Region resiliency needs for OpenSearch Serverless deployments and active-active replication needs for both serverless and provisioned options of OpenSearch Service, especially when the data sources produce disparate data in different Regions.

Prerequisites

Complete the following prerequisite steps:

Deploy OpenSearch Service domains or OpenSearch Serverless collections in all the Regions where resiliency is needed.
Create S3 buckets in each Region.
Configure AWS Identity and Access Management (IAM) permissions needed for OSI. For instructions, refer to Amazon S3 as a source. Choose Amazon Simple Queue Service (Amazon SQS) as the method for processing data.

After you complete these steps, you can create two OSI pipelines, one in each Region, with the configurations detailed in the following sections.

Use OpenSearch Ingestion (OSI) for cross-Region writes

In this solution, OSI takes the data that is local to the Region it’s in and writes it to the other Region. To facilitate cross-Region writes and increase data durability, we use an S3 bucket in each Region. The OSI pipeline in the other Region reads this data and writes to the collection in its local Region. The OSI pipeline in the other Region follows a similar data flow.

For reading data, you have two choices: Amazon SQS or Amazon S3 scans. For this post, we use Amazon SQS because it helps provide near real-time data delivery. This solution also facilitates writing directly to these local buckets in the case of pull-based OSI data sources. Refer to Source under Key concepts to understand the different types of sources that OSI uses.
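To make the Amazon SQS option more concrete, the following Python sketch builds the bucket notification configuration that sends object-created events to a queue. The function name, queue ARN, and bucket name are illustrative assumptions, not values from this post, and the queue's access policy must separately allow Amazon S3 to send messages.

```python
import json


def build_s3_to_sqs_notification(queue_arn: str, prefix: str = "") -> dict:
    """Return an S3 notification configuration that sends
    object-created events to a single SQS queue."""
    queue_config = {
        "QueueArn": queue_arn,
        "Events": ["s3:ObjectCreated:*"],
    }
    if prefix:
        # Optionally restrict notifications to keys under one prefix
        queue_config["Filter"] = {
            "Key": {"FilterRules": [{"Name": "prefix", "Value": prefix}]}
        }
    return {"QueueConfigurations": [queue_config]}


if __name__ == "__main__":
    # Hypothetical queue ARN, for illustration only
    config = build_s3_to_sqs_notification(
        "arn:aws:sqs:us-east-1:111122223333:my-osi-cross-region-write-q"
    )
    print(json.dumps(config, indent=2))
    # With boto3 installed and credentials configured, this could be applied as:
    # boto3.client("s3").put_bucket_notification_configuration(
    #     Bucket="osi-cross-region-write-bucket",
    #     NotificationConfiguration=config,
    # )
```

Keeping the configuration in a plain function makes it easy to review before applying it to the bucket in each Region.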

The following diagram shows the flow of data.

The data flow consists of the following steps:

Data sources local to a Region write their data to the OSI pipeline in their Region. (This solution also supports sources directly writing to Amazon S3.)
OSI writes this data to the collection in its own Region and then to the S3 bucket in the other Region.
OSI reads the other Region data from the local S3 bucket and writes it to the local collection.
Collections in both Regions now contain the same data.

The following snippets show the configuration for the two pipelines.

#pipeline config for cross region writes
version: "2"
write-pipeline:
  source:
    http:
      path: "/logs"
  processor:
    - parse_json:
  sink:
    # First sink to same region collection
    - opensearch:
        hosts: [ "https://abcdefghijklmn.us-east-1.aoss.amazonaws.com" ]
        aws:
          sts_role_arn: "arn:aws:iam::1234567890:role/pipeline-role"
          region: "us-east-1"
          serverless: true
        index: "cross-region-index"
    - s3:
        # Second sink to cross region S3 bucket
        aws:
          sts_role_arn: "arn:aws:iam::1234567890:role/pipeline-role"
          region: "us-east-2"
        bucket: "osi-cross-region-bucket"
        object_key:
          path_prefix: "osi-crw/%{yyyy}/%{MM}/%{dd}/%{HH}"
        threshold:
          event_collect_timeout: 60s
        codec:
          ndjson:

The code for the read pipeline is as follows:

#pipeline config to read data from local S3 bucket
version: "2"
read-write-pipeline:
  source:
    s3:
      # S3 source with SQS 
      acknowledgments: true
      notification_type: "sqs"
      compression: "none"
      codec:
        newline:
      sqs:
        queue_url: "https://sqs.us-east-1.amazonaws.com/1234567890/my-osi-cross-region-write-q"
        maximum_messages: 10
        visibility_timeout: "60s"
        visibility_duplication_protection: true
      aws:
        region: "us-east-1"
        sts_role_arn: "arn:aws:iam::1234567890:role/pipeline-role"
  processor:
    - parse_json:
  route:
  # Routing uses the s3 keys to ensure OSI writes data only once to local region 
    - local-region-write: "contains(/s3/key, \"osi-local-region-write\")"
    - cross-region-write: "contains(/s3/key, \"osi-cross-region-write\")"
  sink:
    - pipeline:
        name: "local-region-write-cross-region-write-pipeline"
    - pipeline:
        name: "local-region-write-pipeline"
        routes:
        - local-region-write
local-region-write-cross-region-write-pipeline:
  # Read S3 bucket with cross-region-write
  source:
    pipeline: 
      name: "read-write-pipeline"
  sink:
   # Sink to local-region managed OpenSearch service 
    - opensearch:
        hosts: [ "https://abcdefghijklmn.us-east-1.aoss.amazonaws.com" ]
        aws:
          sts_role_arn: "arn:aws:iam::1234567890:role/pipeline-role"
          region: "us-east-1"
          serverless: true
        index: "cross-region-index"
local-region-write-pipeline:
  # Read local-region write  
  source:
    pipeline: 
      name: "read-write-pipeline"
  processor:
    - delete_entries:
        with_keys: ["s3"]
  sink:
     # Sink to cross-region S3 bucket 
    - s3:
        aws:
          sts_role_arn: "arn:aws:iam::1234567890:role/pipeline-role"
          region: "us-east-2"
        bucket: "osi-cross-region-write-bucket"
        object_key:
          path_prefix: "osi-cross-region-write/%{yyyy}/%{MM}/%{dd}/%{HH}"
        threshold:
          event_collect_timeout: "60s"
        codec:
          ndjson:

To separate management and operations, we use two prefixes, osi-local-region-write and osi-cross-region-write, for buckets in both Regions. OSI uses these prefixes to copy only local Region data to the other Region. OSI also creates the keys s3.bucket and s3.key to decorate documents written to a collection. We remove this decoration while writing across Regions; it will be added back by the pipeline in the other Region.
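The routing behavior described above can be sketched in plain Python. This is only a model of our reading of the configuration, under the assumption that a sink without a routes filter receives every event; the function and destination names are hypothetical and not part of OSI.

```python
def route_event(event: dict) -> dict:
    """Model where one event read from the local bucket is delivered.

    Returns a mapping of destination name to the document written there.
    """
    key = event["s3"]["key"]

    # Mirrors the two `route` conditions based on the S3 key
    routes = set()
    if "osi-local-region-write" in key:
        routes.add("local-region-write")
    if "osi-cross-region-write" in key:
        routes.add("cross-region-write")

    deliveries = {}

    # The first sub-pipeline has no routes filter, so it receives every
    # event and writes it to the collection in the local Region.
    deliveries["local-collection"] = dict(event)

    # Only locally written data is copied to the other Region's bucket,
    # with the "s3" decoration removed (the delete_entries processor).
    if "local-region-write" in routes:
        deliveries["cross-region-bucket"] = {
            k: v for k, v in event.items() if k != "s3"
        }
    return deliveries


if __name__ == "__main__":
    local = {
        "message": "written locally",
        "s3": {"bucket": "b", "key": "osi-local-region-write/2024/09/24/10/a.ndjson"},
    }
    remote = {
        "message": "written by the other Region",
        "s3": {"bucket": "b", "key": "osi-cross-region-write/2024/09/24/10/b.ndjson"},
    }
    print(sorted(route_event(local)))
    print(sorted(route_event(remote)))
```

Because data arriving under the osi-cross-region-write prefix is never forwarded again, each document crosses Regions only once.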

This solution provides near real-time data delivery across Regions, and the same data is available across both Regions. However, although OpenSearch Service contains the same data, the buckets in each Region contain only partial data. The following solution addresses this.

Use Amazon S3 for cross-Region writes

In this solution, we use the Amazon S3 Cross-Region Replication feature. This solution supports all the data sources available with OSI. OSI again uses two pipelines, but the key difference is that OSI writes the data to Amazon S3 first. After you complete the steps that are common to both solutions, refer to Examples for configuring live replication for instructions to configure Amazon S3 cross-Region replication. The following diagram shows the flow of data.

The data flow consists of the following steps:

Data sources local to a Region write their data to OSI. (This solution also supports sources directly writing to Amazon S3.)
This data is first written to the S3 bucket.
OSI reads this data and writes to the collection local to the Region.
Amazon S3 replicates the data to the bucket in the other Region, where OSI reads it and writes it to the collection in that Region.

The following snippets show the configuration for both pipelines.

version: "2"
s3-write-pipeline:
  source:
    http:
      path: "/logs"
  processor:
    - parse_json:
  sink:
    # Write to S3 bucket that has cross region replication enabled
    - s3:
        aws:
          sts_role_arn: "arn:aws:iam::1234567890:role/pipeline-role"
          region: "us-east-2"
        bucket: "s3-cross-region-bucket"
        object_key:
          path_prefix: "pushedlogs/%{yyyy}/%{MM}/%{dd}/%{HH}"
        threshold:
          event_collect_timeout: 60s
          event_count: 2
        codec:
          ndjson:

The code for the read pipeline is as follows:

version: "2"
s3-read-pipeline:
  source:
    s3:
      acknowledgments: true
      notification_type: "sqs"
      compression: "none"
      codec:
        newline:
      # Configure SQS to notify OSI pipeline
      sqs:
        queue_url: "https://sqs.us-east-2.amazonaws.com/1234567890/my-s3-crr-q"
        maximum_messages: 10
        visibility_timeout: "15s"
        visibility_duplication_protection: true
      aws:
        region: "us-east-2"
        sts_role_arn: "arn:aws:iam::1234567890:role/pipeline-role"
  processor:
    - parse_json:
  # Configure OSI sink to move the files from S3 to OpenSearch Serverless
  sink:
    - opensearch:
        hosts: [ "https://abcdefghijklmn.us-east-1.aoss.amazonaws.com" ]
        aws:
          # Role must have access to S3 OpenSearch Pipeline and OpenSearch Serverless
          sts_role_arn: "arn:aws:iam::1234567890:role/pipeline-role"
          region: "us-east-1"
          serverless: true
        index: "cross-region-index"

The configuration for this solution is simpler and relies on Amazon S3 cross-Region replication. This solution makes sure that the data in the S3 bucket and OpenSearch Serverless collection are the same in both Regions.

For more information about the SLA for this replication and metrics that are available to monitor the replication process, refer to S3 Replication Update: Replication SLA, Metrics, and Events.
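As a rough illustration of the replication setup this solution depends on, the following Python sketch builds a replication configuration for one direction. The role ARN, rule ID, and bucket names are hypothetical, versioning must be enabled on both buckets, and an active-active setup would need a matching rule in the opposite direction.

```python
import json


def build_replication_config(role_arn: str, destination_bucket: str,
                             prefix: str = "") -> dict:
    """Return a one-rule S3 replication configuration that copies
    objects under `prefix` to `destination_bucket`."""
    return {
        "Role": role_arn,
        "Rules": [
            {
                "ID": "osi-cross-region-replication",  # hypothetical rule name
                "Priority": 1,
                "Status": "Enabled",
                "Filter": {"Prefix": prefix},  # empty prefix matches all keys
                "DeleteMarkerReplication": {"Status": "Disabled"},
                "Destination": {"Bucket": f"arn:aws:s3:::{destination_bucket}"},
            }
        ],
    }


if __name__ == "__main__":
    config = build_replication_config(
        "arn:aws:iam::111122223333:role/s3-replication-role",  # assumed role
        "s3-cross-region-bucket-west",  # assumed destination bucket
        prefix="pushedlogs/",
    )
    print(json.dumps(config, indent=2))
    # With boto3 installed and credentials configured, this could be applied as:
    # boto3.client("s3").put_bucket_replication(
    #     Bucket="s3-cross-region-bucket",
    #     ReplicationConfiguration=config,
    # )
```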

Impairment scenarios and additional considerations

Let’s consider a Regional impairment scenario. For this use case, we assume that your application is powered by an OpenSearch Serverless collection as a backend. When a Region is impaired, these applications can simply fail over to the OpenSearch Serverless collection in the other Region and continue operations without interruption, because the entirety of the data present before the impairment is available in both collections.

When the Region impairment is resolved, you can fail back to the OpenSearch Serverless collection in that Region either immediately or after you allow some time for the missing data to be backfilled in that Region. The operations can then continue without interruption.

You can automate these failover and failback operations to provide a seamless user experience. This automation is not in scope of this post, but will be covered in a future post.

The existing cross-cluster replication solution requires you to manually reestablish the leader-follower relationship and restart replication from the beginning after you recover from an impairment. The solutions discussed here, in contrast, automatically resume replication from the point where it left off. If only the OpenSearch Service resources (collections or domains) fail, the data is still available in the local buckets and is backfilled as soon as the collection or domain becomes available.

You can effectively use these solutions in an active-passive replication model as well. In those scenarios, it’s sufficient to have a minimum set of resources in the replication Region, such as a single S3 bucket. You can modify this solution to solve different scenarios using additional services like Amazon Managed Streaming for Apache Kafka (Amazon MSK), which has a built-in replication feature.

When building cross-Region solutions, consider cross-Region data transfer costs for AWS. As a best practice, consider adding a dead-letter queue to all your production pipelines.

Conclusion

In this post, we outlined two solutions that achieve Regional resiliency for OpenSearch Serverless and OpenSearch Service managed clusters. If you need explicit control over writing data across Regions, use solution one. In our experiments with a few KBs of data, the majority of writes completed within a second between the two chosen Regions. Choose solution two if you need the simplicity it offers. In our experiments, replication completed within a few seconds; under the Amazon S3 replication SLA referenced earlier, 99.99% of objects will be replicated within 15 minutes. These solutions also serve as an architecture for an active-active replication model in OpenSearch Service using OpenSearch Ingestion.

You can also use OSI as a mechanism to search for data available within other AWS services, like Amazon S3, Amazon DynamoDB, and Amazon DocumentDB (with MongoDB compatibility). For more details, see Working with Amazon OpenSearch Ingestion pipeline integrations.

About the Authors

Aruna Govindaraju is an Amazon OpenSearch Specialist Solutions Architect and has worked with many commercial and open source search engines. She is passionate about search, relevancy, and user experience. Her expertise with correlating end-user signals with search engine behavior has helped many customers improve their search experience.

Introducing self-managed data sources for Amazon OpenSearch Ingestion

2024-07-01 Muthu Pitchaimani

Post Syndicated from Muthu Pitchaimani original https://aws.amazon.com/blogs/big-data/introducing-self-managed-data-sources-for-amazon-opensearch-ingestion/

Enterprise customers increasingly adopt Amazon OpenSearch Ingestion (OSI) to bring data into Amazon OpenSearch Service for various use cases. These include petabyte-scale log analytics, real-time streaming, security analytics, and searching semi-structured key-value or document data. OSI makes it simple, with straightforward integrations, to ingest data from many AWS services, including Amazon DynamoDB, Amazon Simple Storage Service (Amazon S3), Amazon Managed Streaming for Apache Kafka (Amazon MSK), and Amazon DocumentDB (with MongoDB compatibility).

Today we are announcing support for ingesting data from self-managed OpenSearch/Elasticsearch and Apache Kafka clusters. These sources can either be on Amazon Elastic Compute Cloud (Amazon EC2) or on-premises environments.

In this post, we outline the steps to get started with these sources.

Solution overview

OSI supports the AWS Cloud Development Kit (AWS CDK), AWS CloudFormation, the AWS Command Line Interface (AWS CLI), Terraform, AWS APIs, and the AWS Management Console to deploy pipelines. In this post, we use the console to demonstrate how to create a self-managed Kafka pipeline.

Prerequisites

To make sure OSI can connect and read data successfully, the following conditions should be met:

Network connectivity to data sources – OSI is generally deployed in a public network, such as the internet, or in a virtual private cloud (VPC). OSI deployed in a customer VPC can access data sources in the same or a different VPC, and on the internet with an attached internet gateway. If your data sources are in another VPC, common methods for network connectivity include direct VPC peering, using a transit gateway, or using customer managed VPC endpoints powered by AWS PrivateLink. If your data sources are in your corporate data center or another on-premises environment, common methods for network connectivity include AWS Direct Connect and using a network hub like a transit gateway. The following diagram shows a sample configuration of OSI running in a VPC and using Amazon OpenSearch Service as a sink. OSI runs in a service VPC and creates an elastic network interface (ENI) in the customer VPC. For self-managed data sources, these ENIs are used for reading data from the on-premises environment. OSI creates a VPC endpoint in the service VPC to send data to the sink.
Name resolution for data sources – OSI uses an Amazon Route 53 resolver. This resolver automatically answers queries to names local to a VPC, public domain names on the internet, and records hosted in private hosted zones. If you’re using a private hosted zone, make sure you have a DHCP option set enabled, attached to the VPC using AmazonProvidedDNS as the domain name server. For more information, see Work with DHCP option sets. Additionally, you can use resolver inbound and outbound endpoints if you need complex resolution schemes with conditions that are beyond a simple private hosted zone.
Certificate verification for data source names – OSI supports only SASL_SSL as the transport for Apache Kafka sources. Within SASL, Amazon OpenSearch Service supports most authentication mechanisms, like PLAIN, SCRAM, IAM, GSSAPI, and others. When using SASL_SSL, make sure you have access to the certificates needed for OSI to authenticate. For self-managed OpenSearch data sources, make sure verifiable certificates are installed on the clusters. Amazon OpenSearch Service doesn’t support insecure communication between OSI and OpenSearch. Certificate verification cannot be turned off. In particular, the “insecure” configuration option is not supported.
Access to AWS Secrets Manager – OSI uses AWS Secrets Manager to retrieve credentials and certificates needed to communicate with self-managed data sources. For more information, see Create and manage secrets with AWS Secrets Manager.
IAM role for pipelines – You need an AWS Identity and Access Management (IAM) pipeline role to write to data sinks. For more information, see Identity and Access Management for Amazon OpenSearch Ingestion.
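Because certificate verification cannot be turned off, it can help to confirm ahead of time that a source endpoint presents a verifiable certificate. The following Python sketch is one way to do that from a machine with network access to the source. The helper names are our own, and the check uses the local trust store, which we assume (but cannot confirm) is comparable to what OSI trusts.

```python
import socket
import ssl


def build_strict_context() -> ssl.SSLContext:
    """Return a TLS context that requires a valid certificate chain
    and a hostname match (the Python defaults for client connections)."""
    return ssl.create_default_context()


def endpoint_is_verifiable(host: str, port: int, timeout: float = 5.0) -> bool:
    """Return True if a TLS handshake with full verification succeeds."""
    context = build_strict_context()
    try:
        with socket.create_connection((host, port), timeout=timeout) as sock:
            with context.wrap_socket(sock, server_hostname=host):
                return True
    except OSError:
        # Covers connection failures, timeouts, and TLS verification errors
        # (ssl.SSLError is a subclass of OSError)
        return False


if __name__ == "__main__":
    # Hostname and port taken from the sample configuration in this post
    print(endpoint_is_verifiable("node-0.example.com", 9200))
```

A False result does not say why the handshake failed, so treat it as a prompt to inspect the certificate chain and hostname rather than as a diagnosis.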

Create a pipeline with self-managed Kafka as a source

After you complete the prerequisites, you’re ready to create a pipeline for your data source. Complete the following steps:

On the OpenSearch Service console, choose Pipelines under Ingestion in the navigation pane.
Choose Create pipeline.
Choose Streaming under Use case in the navigation pane.
Select Self managed Apache Kafka under Ingestion pipeline blueprints and choose Select blueprint.

This will populate a sample configuration for this pipeline.

Provide a name for this pipeline and choose the appropriate pipeline capacity.

Under Pipeline configuration, provide your pipeline configuration in YAML format. The following code snippet shows a sample configuration in YAML for SASL_SSL authentication:

version: '2'
kafka-pipeline:
  source:
    kafka:
      acknowledgments: true
      bootstrap_servers:
        - 'node-0.example.com:9092'
      encryption:
        type: "ssl"
        certificate: '${{aws_secrets:kafka-cert}}'
        
      authentication:
        sasl:
          plain:
            username: '${{aws_secrets:secrets:username}}'
            password: '${{aws_secrets:secrets:password}}'
      topics:
        - name: on-prem-topic
          group_id: osi-group-1
  processor:
    - grok:
        match:
          message:
            - '%{COMMONAPACHELOG}'
    - date:
        destination: '@timestamp'
        from_time_received: true
  sink:
    - opensearch:
        hosts: ["https://search-domain-12345567890.us-east-1.es.amazonaws.com"]
        aws:
          region: us-east-1
          sts_role_arn: 'arn:aws:iam::123456789012:role/pipeline-role'
        index: "on-prem-kafka-index"
extension:
  aws:
    secrets:
      kafka-cert:
        secret_id: kafka-cert
        region: us-east-1
        sts_role_arn: 'arn:aws:iam::123456789012:role/pipeline-role'
      secrets:
        secret_id: secrets
        region: us-east-1
        sts_role_arn: 'arn:aws:iam::123456789012:role/pipeline-role'

Choose Validate pipeline and confirm there are no errors.
Under Network configuration, choose Public access or VPC access. (For this post, we choose VPC access).
If you chose VPC access, specify your VPC, subnets, and an appropriate security group so OSI can reach the outgoing ports for the data source.
Under VPC attachment options, select Attach to VPC and choose an appropriate CIDR range.

OSI resources are created in a service VPC managed by AWS that is separate from the VPC you chose in the last step. This selection allows you to configure what CIDR ranges OSI should use inside this service VPC. The choice exists so you can make sure there is no address collision between CIDR ranges in your VPC that is attached to your on-premises network and this service VPC. Many pipelines in your account can share the same CIDR ranges for this service VPC.

Specify any optional tags and log publishing options, then choose Next.
Review the configuration and choose Create pipeline.

You can monitor the pipeline creation and any log messages in the Amazon CloudWatch Logs log group you specified. Your pipeline should now be successfully created. For more information about how to provision capacity for the performance of this pipeline, see the section Recommended Compute Units (OCUs) for the MSK pipeline in Introducing Amazon MSK as a source for Amazon OpenSearch Ingestion.
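Returning to the CIDR range chosen for the service VPC, a quick collision check before creating the pipeline can be sketched with Python's standard ipaddress module. The function name and the example ranges are illustrative assumptions.

```python
import ipaddress


def find_cidr_collisions(service_cidr: str, existing_cidrs: list[str]) -> list[str]:
    """Return the existing CIDR ranges that overlap the candidate
    range for the OSI service VPC."""
    service_net = ipaddress.ip_network(service_cidr)
    return [
        cidr
        for cidr in existing_cidrs
        if service_net.overlaps(ipaddress.ip_network(cidr))
    ]


if __name__ == "__main__":
    # Hypothetical ranges for your VPC and on-premises network
    in_use = ["10.0.0.0/16", "192.168.0.0/16"]
    print(find_cidr_collisions("10.99.20.0/24", in_use))
    print(find_cidr_collisions("10.0.5.0/24", in_use))
```

An empty result means the candidate range does not overlap any range you listed. It says nothing about ranges you left out, so include every network that is routable from your VPC.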

Create a pipeline with self-managed OpenSearch as a source

The steps for creating a pipeline for self-managed OpenSearch are similar to the steps for creating one for Kafka. During the blueprint selection, choose Data Migration under Use case and select Self managed OpenSearch/Elasticsearch. OpenSearch Ingestion can source data from all versions of OpenSearch and Elasticsearch from version 7.0 to version 7.10.

The following blueprint shows a sample configuration YAML for this data source:

version: "2"
opensearch-migration-pipeline:
  source:
    opensearch:
      acknowledgments: true
      hosts: [ "https://node-0.example.com:9200" ]
      username: "${{aws_secrets:secret:username}}"
      password: "${{aws_secrets:secret:password}}"
      indices:
        include:
          - index_name_regex: "opensearch_dashboards_sample_data*"
        exclude:
          - index_name_regex: '\..*'
  sink:
    - opensearch:
        hosts: [ "https://search-domain-12345567890.us-east-1.es.amazonaws.com" ]
        aws:
          sts_role_arn: "arn:aws:iam::123456789012:role/pipeline-role"
          region: "us-east-1"
        index: "on-prem-os"
extension:
  aws:
    secrets:
      secret:
        secret_id: "self-managed-os-credentials"
        region: "us-east-1"
        sts_role_arn: "arn:aws:iam::123456789012:role/pipeline-role"
        refresh_interval: PT1H

Considerations for self-managed OpenSearch data source

Certificates installed on the OpenSearch cluster need to be verifiable for OSI to connect to this data source before reading data. Insecure connections are currently not supported.

After you’re connected, make sure the cluster has sufficient read bandwidth to allow for OSI to read data. Use the Min and Max OCU settings to limit OSI read bandwidth consumption. Your read bandwidth will vary depending upon data volume, number of indexes, and provisioned OCU capacity. Start small and increase the number of OCUs to balance between available bandwidth and acceptable migration time.

This source is typically meant for one-time migration of data and not as continuous ingestion to keep data in sync between data sources and sinks.

OpenSearch Service domains support remote reindexing, but that consumes resources in your domains. Using OSI will move this compute out of the domain, and OSI can achieve significantly higher bandwidth than remote reindexing, thereby resulting in faster migration times.

OSI doesn’t support deferred replay or traffic recording today; refer to Migration Assistant for Amazon OpenSearch Service if your migration needs those capabilities.

Conclusion

In this post, we introduced self-managed sources for OpenSearch Ingestion that enable you to ingest data from corporate data centers or other on-premises environments. OSI also supports various other data sources and integrations. Refer to Working with Amazon OpenSearch Ingestion pipeline integrations to learn about these other data sources.

About the Authors

Arjun Nambiar is a Product Manager with Amazon OpenSearch Service. He focuses on ingestion technologies that enable ingesting data from a wide variety of sources into Amazon OpenSearch Service at scale. Arjun is interested in large-scale distributed systems and cloud-centered technologies, and is based out of Seattle, Washington.

Use Amazon OpenSearch Ingestion to migrate to Amazon OpenSearch Serverless

2024-02-27 Muthu Pitchaimani

Post Syndicated from Muthu Pitchaimani original https://aws.amazon.com/blogs/big-data/use-amazon-opensearch-ingestion-to-migrate-to-amazon-opensearch-serverless/

Amazon OpenSearch Serverless is an on-demand auto scaling configuration for Amazon OpenSearch Service. Since its release, interest in OpenSearch Serverless has been steadily growing. Customers prefer to let the service manage its capacity automatically rather than having to manually provision capacity. Until now, customers have had to rely on using custom code or third-party solutions to move the data between provisioned OpenSearch Service domains and OpenSearch Serverless.

We recently introduced a feature with Amazon OpenSearch Ingestion (OSI) to make this migration even more effortless. OSI is a fully managed, serverless data collector that delivers real-time log, metric, and trace data to OpenSearch Service domains and OpenSearch Serverless collections.

In this post, we outline the steps to migrate data between provisioned OpenSearch Service domains and OpenSearch Serverless. Migration of metadata such as security roles and dashboard objects will be covered in a subsequent post.

Solution overview

The following diagram shows the necessary components for moving data between OpenSearch Service provisioned domains and OpenSearch Serverless using OSI. You will use OSI with OpenSearch Service as source and an OpenSearch Serverless collection as sink.

Prerequisites

Before getting started, complete the following steps to create the necessary resources:

Create an AWS Identity and Access Management (IAM) role that the OpenSearch Ingestion pipeline will assume to write to the OpenSearch Serverless collection. This role needs to be specified in the sts_role_arn parameter of the pipeline configuration.

Attach a permissions policy to the role to allow it to read data from the OpenSearch Service domain. The following is a sample policy with least privileges:

{
   "Version":"2012-10-17",
   "Statement":[
      {
         "Effect":"Allow",
         "Action":"es:ESHttpGet",
         "Resource":[
            "arn:aws:es:us-east-1:{account-id}:domain/{domain-name}/",
            "arn:aws:es:us-east-1:{account-id}:domain/{domain-name}/_cat/indices",
            "arn:aws:es:us-east-1:{account-id}:domain/{domain-name}/_search",
            "arn:aws:es:us-east-1:{account-id}:domain/{domain-name}/_search/scroll",
            "arn:aws:es:us-east-1:{account-id}:domain/{domain-name}/*/_search"
         ]
      },
      {
         "Effect":"Allow",
         "Action":"es:ESHttpPost",
         "Resource":[
            "arn:aws:es:us-east-1:{account-id}:domain/{domain-name}/*/_search/point_in_time",
            "arn:aws:es:us-east-1:{account-id}:domain/{domain-name}/*/_search/scroll"
         ]
      },
      {
         "Effect":"Allow",
         "Action":"es:ESHttpDelete",
         "Resource":[
            "arn:aws:es:us-east-1:{account-id}:domain/{domain-name}/_search/point_in_time",
            "arn:aws:es:us-east-1:{account-id}:domain/{domain-name}/_search/scroll"
         ]
      }
   ]
}

Attach a permissions policy to the role to allow it to send data to the collection. The following is a sample policy with least privileges:

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Action": [
        "aoss:BatchGetCollection",
        "aoss:APIAccessAll"
      ],
      "Effect": "Allow",
      "Resource": "arn:aws:aoss:{region}:{your-account-id}:collection/{collection-id}"
    },
    {
      "Action": [
        "aoss:CreateSecurityPolicy",
        "aoss:GetSecurityPolicy",
        "aoss:UpdateSecurityPolicy"
      ],
      "Effect": "Allow",
      "Resource": "*",
      "Condition": {
        "StringEquals": {
          "aoss:collection": "{collection-name}"
        }
      }
    }
  ]
}

Configure the trust relationship so that OpenSearch Ingestion can assume the role, as follows:

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": {
        "Service": "osis-pipelines.amazonaws.com"
      },
      "Action": "sts:AssumeRole"
    }
  ]
}

It’s recommended to add the aws:SourceAccount and aws:SourceArn condition keys to the policy for protection against the confused deputy problem:

"Condition": {
    "StringEquals": {
        "aws:SourceAccount": "{your-account-id}"
    },
    "ArnLike": {
        "aws:SourceArn": "arn:aws:osis:{region}:{your-account-id}:pipeline/*"
    }
}
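To show where the condition block sits relative to the rest of the trust policy, the following Python sketch assembles the complete document. The function name and the account ID are placeholders of our own; the structure combines the two snippets above.

```python
import json


def build_trust_policy(account_id: str, region: str) -> dict:
    """Return a trust policy for the pipeline role that includes the
    confused deputy condition keys inside the statement."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Principal": {"Service": "osis-pipelines.amazonaws.com"},
                "Action": "sts:AssumeRole",
                # The Condition block belongs inside the statement,
                # alongside Effect, Principal, and Action
                "Condition": {
                    "StringEquals": {"aws:SourceAccount": account_id},
                    "ArnLike": {
                        "aws:SourceArn": (
                            f"arn:aws:osis:{region}:{account_id}:pipeline/*"
                        )
                    },
                },
            }
        ],
    }


if __name__ == "__main__":
    # Placeholder account ID and Region, for illustration only
    print(json.dumps(build_trust_policy("111122223333", "us-east-1"), indent=2))
```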

Map the OpenSearch Ingestion pipeline role ARN as a backend role to the all_access role on the source domain. We use the all_access role to keep the example simple. For production scenarios, make sure to use a role with just enough permissions to read and write.
Create an OpenSearch Serverless collection, which is where data will be ingested.

Associate a data policy, as shown in the following code, to grant the OpenSearch Ingestion role permissions on the collection:

[
  {
    "Rules": [
      {
        "Resource": [
          "index/collection-name/*"
        ],
        "Permission": [
          "aoss:CreateIndex",
          "aoss:UpdateIndex",
          "aoss:DescribeIndex",
          "aoss:WriteDocument"
        ],
        "ResourceType": "index"
      }
    ],
    "Principal": [
      "arn:aws:iam::{account-id}:role/pipeline-role"
    ],
    "Description": "Pipeline role access"
  }
]
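
The data access policy can also be generated rather than edited by hand, which helps avoid JSON syntax slips. The following is only a sketch: the function name and the sample values are assumptions, and the permission list simply mirrors the policy above.

```python
import json


def build_data_access_policy(collection_name: str, role_arn: str) -> str:
    """Return the data access policy for the pipeline role as a JSON string."""
    policy = [
        {
            "Rules": [
                {
                    "Resource": [f"index/{collection_name}/*"],
                    "Permission": [
                        "aoss:CreateIndex",
                        "aoss:UpdateIndex",
                        "aoss:DescribeIndex",
                        "aoss:WriteDocument",
                    ],
                    "ResourceType": "index",
                }
            ],
            "Principal": [role_arn],
            "Description": "Pipeline role access",
        }
    ]
    # json.dumps always emits syntactically valid JSON.
    return json.dumps(policy, indent=2)


if __name__ == "__main__":
    # Placeholder collection name and role ARN (assumptions).
    print(build_data_access_policy(
        "collection-name", "arn:aws:iam::111122223333:role/pipeline-role"))
```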

If the collection is defined as a VPC collection, you need to create a network policy and configure it in the ingestion pipeline.

Now you’re ready to move data from your provisioned domain to OpenSearch Serverless.

Move data from provisioned domains to Serverless

Set up Amazon OpenSearch Ingestion
To get started, you must have an active OpenSearch Service domain (source) and OpenSearch Serverless collection (sink). Complete the following steps to set up an OpenSearch Ingestion pipeline for migration:

On the OpenSearch Service console, choose Pipeline under Ingestion in the navigation pane.
Choose Create a pipeline.
For Pipeline name, enter a name (for example, octank-migration).
For Pipeline capacity, you can define the minimum and maximum capacity to scale up the resources. For now, you can leave the default minimum as 1 and maximum as 4.
For Configuration Blueprint, select AWS-OpenSearchDataMigrationPipeline.
Update the following information for the source:
1. Uncomment hosts and specify the endpoint of the existing OpenSearch Service domain.
2. Uncomment distribution_version if your source cluster is an OpenSearch Service cluster with compatibility mode enabled; otherwise, leave it commented.
3. Uncomment indices, include, index_name_regex, and add an index name or pattern that you want to migrate (for example, octank-iot-logs-2023.11.0*).
4. Update region under aws where your source domain is (for example, us-west-2).
5. Update sts_role_arn under aws to the role that has permission to read data from the OpenSearch Service domain (for example, arn:aws:iam::111122223333:role/osis-pipeline). This role should be added as a backend role within the OpenSearch Service security roles.
Update the following information for the sink:
1. Uncomment hosts and specify the endpoint of the existing OpenSearch Serverless collection.
2. Update sts_role_arn under aws to the role that has permission to write data into the OpenSearch Serverless collection (for example, arn:aws:iam::111122223333:role/osis-pipeline). This role should be added as part of the data access policy in the OpenSearch Serverless collection.
3. Update the serverless flag to be true.
4. For index, you can leave the default, which reads the index name from the source metadata and writes to an index of the same name in the destination. Alternatively, if you want a different index name at the destination, replace this value with your desired name.
5. For document_id, you can get the ID from the document metadata in the source and use the same ID in the target. Note that custom document IDs are supported only for the SEARCH type of collection; if your collection is of type TIMESERIES or VECTORSEARCH, comment out this line.
Next, you can validate your pipeline to check the connectivity of source and sink to make sure the endpoint exists and is accessible.
For Network settings, choose your preferred setting:
1. Choose VPC access and select your VPC, subnet, and security group to set up the access privately.
2. Choose Public to use public access. AWS recommends that you use a VPC endpoint for all production workloads, but for this walkthrough, select Public.
For Log Publishing Option, you can either create a new Amazon CloudWatch group or use an existing CloudWatch group to write the ingestion logs. This provides access to information about errors and warnings raised during the operation, which can help during troubleshooting. For this walkthrough, choose Create new group.
Choose Next, and verify the details you specified for your pipeline settings.
Choose Create pipeline.

It should take a couple of minutes to create the ingestion pipeline.
The following graphic gives a quick demonstration of creating the OpenSearch Ingestion pipeline via the preceding steps.
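
The sink choices in the preceding steps can be summarized in a small sketch. The following Python helper is illustrative only: the function name, the dictionary layout, and the metadata expression used for document_id are assumptions modeled on the blueprint fields named in the steps, so compare them with the blueprint text shown in the console before relying on them.

```python
def build_sink_settings(endpoint: str, role_arn: str, region: str,
                        collection_type: str) -> dict:
    """Assemble sink settings for an OpenSearch Serverless collection."""
    sink = {
        "hosts": [endpoint],
        "aws": {
            "sts_role_arn": role_arn,
            "region": region,
            "serverless": True,  # the sink is a Serverless collection
        },
    }
    # Custom document IDs are supported only for SEARCH collections,
    # so document_id is left out for TIMESERIES and VECTORSEARCH.
    if collection_type.upper() == "SEARCH":
        # Assumed expression for reusing the source document ID.
        sink["document_id"] = '${getMetadata("opensearch-document_id")}'
    return sink
```

Calling the helper with collection_type set to TIMESERIES returns settings without a document_id entry, which corresponds to commenting out that line in the blueprint.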

Verify ingested data in the target OpenSearch Serverless collection

After the pipeline is created and active, log in to OpenSearch Dashboards for your OpenSearch Serverless collection and run the following command to list the indexes:

GET _cat/indices?v

The following graphic gives a quick demonstration of listing the indexes before and after the pipeline becomes active.

Conclusion

In this post, we saw how OpenSearch Ingestion can ingest data into an OpenSearch Serverless collection without the need for third-party solutions. With minimal data producer configuration, it automatically ingested data into the collection. OpenSearch Ingestion (OSI) also lets you transform or reindex data from Elasticsearch 7.x before ingesting it into an OpenSearch Service domain or OpenSearch Serverless collection. OSI eliminates the need to provision, scale, or manage servers. AWS offers various resources for you to quickly start building pipelines using OpenSearch Ingestion. You can use various built-in pipeline integrations to quickly ingest data from Amazon DynamoDB, Amazon Managed Streaming for Apache Kafka (Amazon MSK), Amazon Security Lake, Fluent Bit, and many more. The following OpenSearch Ingestion blueprints enable you to build data pipelines with minimal configuration changes.

About the Authors

Prashant Agrawal is a Sr. Search Specialist Solutions Architect with Amazon OpenSearch Service. He works closely with customers to help them migrate their workloads to the cloud and helps existing customers fine-tune their clusters to achieve better performance and save on cost. Before joining AWS, he helped various customers use OpenSearch and Elasticsearch for their search and log analytics use cases. When not working, you can find him traveling and exploring new places. In short, he likes doing Eat → Travel → Repeat.

Rahul Sharma is a Technical Account Manager at Amazon Web Services. He is passionate about the data technologies that help leverage data as a strategic asset and is based out of New York City, New York.

Introducing persistent buffering for Amazon OpenSearch Ingestion

2023-11-21 Muthu Pitchaimani

Post Syndicated from Muthu Pitchaimani original https://aws.amazon.com/blogs/big-data/introducing-persistent-buffering-for-amazon-opensearch-ingestion/

Amazon OpenSearch Ingestion is a fully managed, serverless pipeline that delivers real-time log, metric, and trace data to Amazon OpenSearch Service domains and OpenSearch Serverless collections.

Customers use Amazon OpenSearch Ingestion pipelines to ingest data from a variety of data sources, both pull-based and push-based. When ingesting data from pull-based sources, such as Amazon Simple Storage Service (Amazon S3) and Amazon MSK using Amazon OpenSearch Ingestion, the source handles the data durability and retention. Push-based sources, however, stream records directly to ingestion endpoints, and typically don’t have a means of persisting data once it is generated.

To address this need for such sources, a common architectural pattern is to add a persistent standalone buffer for enhanced durability and reliability of data ingestion. A durable, persistent buffer can mitigate the impact of ingestion spikes, buffer data during downtime, and reduce the need to expand capacity using in-memory buffers, which can overflow. Customers use popular buffering technologies like Apache Kafka or RabbitMQ to add durability to the data flowing through their Amazon OpenSearch Ingestion pipelines. However, these tools add complexity to the data ingestion pipeline architecture and can be time-consuming to set up, right-size, and maintain.

Solution overview

Today we’re introducing persistent buffering for Amazon OpenSearch Ingestion to enhance data durability and simplify data ingestion architectures for Amazon OpenSearch Service customers. You can use persistent buffering to ingest data for all push-based sources supported by Amazon OpenSearch Ingestion without the need to set up a standalone buffer. These include HTTP sources and OTEL sources for logs, traces and metrics. Persistent buffering in Amazon OpenSearch Ingestion is serverless and scales elastically to meet the throughput needs of even the most demanding workloads. You can now focus on your core business logic when ingesting data at scale in Amazon OpenSearch Service without worrying about the undifferentiated heavy lifting of provisioning and managing servers to add durability to your ingest pipeline.

Walkthrough

Enable persistent buffering

You can turn on the persistent buffering for existing or new pipelines using the AWS Management Console, AWS Command Line Interface (AWS CLI), or AWS SDK. If you choose not to enable persistent buffering, then the pipelines continue to use an in-memory buffer.

By default, persistent data is encrypted at rest with a key that AWS owns and manages for you. You can optionally choose your own customer managed key (KMS key) to encrypt data by selecting the checkbox labeled Customize encryption settings and selecting Choose a different AWS KMS key. Please note that if you choose a different KMS key, your pipeline needs additional permission to decrypt and generate data keys. The following snippet shows an example AWS Identity and Access Management (AWS IAM) permission policy that needs to be attached to a role used by the pipeline.

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "KeyAccess",
            "Effect": "Allow",
            "Action": [
              "kms:Decrypt",
              "kms:GenerateDataKeyWithoutPlaintext"
            ],
            "Resource": "arn:aws:kms:us-west-2:111122223333:key/1234abcd-12ab-34cd-56ef-1234567890ab"
        }
    ]
}

Provision for persistent buffering

Once persistent buffering is enabled, data is retained in the buffer for 72 hours. Amazon OpenSearch Ingestion keeps track of the data written into a sink and automatically resumes writing from the last successful checkpoint should there be an outage in the sink or other issues that prevent data from being successfully written. There are no additional services or components needed for persistent buffers other than the minimum and maximum OpenSearch Compute Units (OCUs) set for the pipeline. When persistent buffering is turned on, each Ingestion-OCU is capable of providing persistent buffering along with its existing ability to ingest, transform, and route data. Amazon OpenSearch Ingestion dynamically allocates the buffer from the minimum and maximum allocation of OCUs that you define for the pipelines.

The number of Ingestion-OCUs used for persistent buffering is dynamically calculated based on the source, the transformations on the streaming data, and the sink that the data is written to. Because a portion of the Ingestion-OCUs now applies to persistent buffering, you need to increase the minimum and maximum Ingestion-OCUs when turning on persistent buffering in order to maintain the same ingestion throughput for your pipeline. The number of OCUs that you need with persistent buffering depends on the source that you are ingesting data from and on the type of processing that you are performing on the data. The following table shows the number of OCUs that you need with persistent buffering for different sources and processors.

Sources and processors	Ingestion-OCUs with buffering, compared to the number of OCUs without persistent buffering needed to achieve similar data throughput
HTTP with no processors	3 times
HTTP with Grok	2 times
OTel Trace	2 times
OTel Metrics	2 times

You have complete control over how you want to set up OCUs for your pipelines and decide between increasing OCUs for higher throughput or reducing OCUs for cost control at a lower throughput. Also, when you turn on persistent buffering, the minimum OCUs for a pipeline go up from one to two.
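
To make the sizing rule concrete, the following Python sketch applies the multipliers from the table to an existing OCU setting and enforces the minimum of two OCUs. The function and constant names are assumptions made for illustration; the multipliers and the minimum come from the text above.

```python
# Multipliers from the table above: OCUs needed with persistent buffering,
# relative to the OCUs needed without it for similar throughput.
BUFFERING_MULTIPLIER = {
    "HTTP with no processors": 3,
    "HTTP with Grok": 2,
    "OTel Trace": 2,
    "OTel Metrics": 2,
}

# With persistent buffering, the pipeline minimum rises from one OCU to two.
MIN_OCUS_WITH_BUFFERING = 2


def ocus_with_buffering(source: str, ocus_without_buffering: int) -> int:
    """Estimate the OCUs needed to keep similar throughput with buffering on."""
    scaled = ocus_without_buffering * BUFFERING_MULTIPLIER[source]
    return max(MIN_OCUS_WITH_BUFFERING, scaled)


if __name__ == "__main__":
    # A pipeline capped at 4 OCUs on an HTTP source with no processors.
    print(ocus_with_buffering("HTTP with no processors", 4))
```

For example, a pipeline that previously ran with a maximum of 4 OCUs on an HTTP source with no processors would need a maximum of about 12 OCUs to sustain similar throughput.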

Availability and pricing

Persistent buffering is available in all the AWS Regions where Amazon OpenSearch Ingestion is available as of November 17, 2023. These include US East (Ohio), US East (N. Virginia), US West (Oregon), US West (N. California), Europe (Ireland), Europe (London), Europe (Frankfurt), Asia Pacific (Tokyo), Asia Pacific (Sydney), Asia Pacific (Singapore), Asia Pacific (Mumbai), Asia Pacific (Seoul), and Canada (Central).

Ingestion-OCUs remain at the same price of $0.24 per OCU per hour. OCUs are billed on an hourly basis with per-minute granularity. You can control the costs that OCUs incur by configuring the maximum number of OCUs that a pipeline is allowed to scale to.
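
As a rough illustration of the pricing above, the following Python sketch estimates the charge for a given number of OCUs and a running time in minutes. It assumes that usage is prorated by the minute at the hourly rate quoted in this post; the function name is hypothetical, and the estimate does not cover any other charges.

```python
OCU_PRICE_PER_HOUR = 0.24  # USD per Ingestion-OCU per hour, as stated above


def ocu_cost(ocus: int, minutes: int) -> float:
    """Estimated cost in USD for running `ocus` Ingestion-OCUs for `minutes` minutes."""
    ocu_hours = ocus * minutes / 60  # prorate by the minute (assumption)
    return round(ocu_hours * OCU_PRICE_PER_HOUR, 2)


if __name__ == "__main__":
    # Two OCUs (the minimum with persistent buffering) for one hour.
    print(ocu_cost(2, 60))
```

Multiplying the maximum OCU setting by the hours in a billing period gives an upper bound on the OCU charge for a pipeline.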

Conclusion

In this post, we showed you how to configure persistent buffering for Amazon OpenSearch Ingestion to enhance data durability and simplify the data ingestion architecture for Amazon OpenSearch Service. Please refer to the documentation to learn about other capabilities provided by Amazon OpenSearch Ingestion to build a sophisticated architecture for your ingestion needs.

About the Authors

Arjun Nambiar is a Product Manager with Amazon OpenSearch Service. He focusses on ingestion technologies that enable ingesting data from a wide variety of sources into Amazon OpenSearch Service at scale. Arjun is interested in large scale distributed systems and cloud-native technologies and is based out of Seattle, Washington.

Jay is a Customer Success Engineering leader for OpenSearch Service. He focusses on overall customer experience with OpenSearch. Jay is interested in large scale OpenSearch adoption and distributed data stores, and is based out of Northern Virginia.

Rich Giuli is a Principal Solutions Architect at Amazon Web Services (AWS). He works within a specialized group helping ISVs accelerate adoption of cloud services. Outside of work Rich enjoys running and playing guitar.

Introducing Amazon MSK as a source for Amazon OpenSearch Ingestion

2023-08-31 Muthu Pitchaimani

Post Syndicated from Muthu Pitchaimani original https://aws.amazon.com/blogs/big-data/introducing-amazon-msk-as-a-source-for-amazon-opensearch-ingestion/

Ingesting a high volume of streaming data has been a defining characteristic of operational analytics workloads with Amazon OpenSearch Service. Many of these workloads involve either self-managed Apache Kafka or Amazon Managed Streaming for Apache Kafka (Amazon MSK) to satisfy their data streaming needs. Consuming data from Amazon MSK and writing to OpenSearch Service has been a challenge for customers. AWS Lambda, custom code, Kafka Connect, and Logstash have been used for ingesting this data. These methods involve tools that must be built and maintained. In this post, we introduce Amazon MSK as a source to Amazon OpenSearch Ingestion, a serverless, fully managed, real-time data collector for OpenSearch Service that makes this ingestion even easier.

Solution overview

The following diagram shows the flow from data sources to Amazon OpenSearch Service.

The flow contains the following steps:

Data sources produce data and send that data to Amazon MSK.
OpenSearch Ingestion consumes the data from Amazon MSK.
OpenSearch Ingestion transforms, enriches, and writes the data into OpenSearch Service.
Users search, explore, and analyze the data with OpenSearch Dashboards.

Prerequisites

You will need a provisioned MSK cluster created with appropriate data sources. The sources, as producers, write data into Amazon MSK. The cluster should be created with the appropriate Availability Zone, storage, compute, security and other configurations to suit your workload needs. To provision your MSK cluster and have your sources producing data, see Getting started using Amazon MSK.

As of this writing, OpenSearch Ingestion supports Amazon MSK provisioned, but not Amazon MSK Serverless. However, OpenSearch Ingestion can reside in the same account as Amazon MSK or in a different account. OpenSearch Ingestion uses AWS PrivateLink to read data, so you must turn on multi-VPC connectivity on your MSK cluster. For more information, see Amazon MSK multi-VPC private connectivity in a single Region. OpenSearch Ingestion can write data to Amazon Simple Storage Service (Amazon S3), provisioned OpenSearch Service, and Amazon OpenSearch Serverless. In this solution, we use a provisioned OpenSearch Service domain as a sink for OSI. Refer to Getting started with Amazon OpenSearch Service to create a provisioned OpenSearch Service domain. You will need appropriate permissions to read data from Amazon MSK and write data to OpenSearch Service. The following sections outline the required permissions.

Permissions required

To read from Amazon MSK and write to Amazon OpenSearch Service, you need to create an AWS Identity and Access Management (IAM) role used by Amazon OpenSearch Ingestion. In this post, we use a role called pipeline-Role for this purpose. To create this role, see Creating IAM roles.

Reading from Amazon MSK

OpenSearch Ingestion will need permission to create a PrivateLink connection and other actions that can be performed on your MSK cluster. Edit your MSK cluster policy to include the following snippet with appropriate permissions. If your OpenSearch Ingestion pipeline resides in an account different from your MSK cluster, you will need a second section to allow this pipeline. Use proper semantic conventions when providing the cluster, topic, and group permissions and remove the comments from the policy before using.

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": {
        "Service": "osis-pipelines.aws.internal"
      },
      "Action": [
        "kafka:CreateVpcConnection",
        "kafka:GetBootstrapBrokers",
        "kafka:DescribeCluster"
      ],
      # Change this to your msk arn
      "Resource": "arn:aws:kafka:us-east-1:XXXXXXXXXXXX:cluster/test-cluster/xxxxxxxx-xxxx-xx"
    },    
    ### Following permissions are required if msk cluster is in different account than osi pipeline
    {
      "Effect": "Allow",
      "Principal": {
        # Change this to your sts role arn used in the pipeline
        "AWS": "arn:aws:iam::XXXXXXXXXXXX:role/PipelineRole"
      },
      "Action": [
        "kafka-cluster:*",
        "kafka:*"
      ],
      "Resource": [
        # Change this to your msk arn
        "arn:aws:kafka:us-east-1:XXXXXXXXXXXX:cluster/test-cluster/xxxxxxxx-xxxx-xx",
        # Change this as per your cluster name & kafka topic name
        "arn:aws:kafka:us-east-1:XXXXXXXXXXXX:topic/test-cluster/xxxxxxxx-xxxx-xx/*",
        # Change this as per your cluster name
        "arn:aws:kafka:us-east-1:XXXXXXXXXXXX:group/test-cluster/*"
      ]
    }
  ]
}

Edit the pipeline role’s inline policy to include the following permissions. Ensure that you have removed the comments before using the policy.

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
                "kafka-cluster:Connect",
                "kafka-cluster:AlterCluster",
                "kafka-cluster:DescribeCluster",
                "kafka:DescribeClusterV2",
                "kafka:GetBootstrapBrokers"
            ],
            "Resource": [
                # Change this to your msk arn
                "arn:aws:kafka:us-east-1:XXXXXXXXXXXX:cluster/test-cluster/xxxxxxxx-xxxx-xx"
            ]
        },
        {
            "Effect": "Allow",
            "Action": [
                "kafka-cluster:*Topic*",
                "kafka-cluster:ReadData"
            ],
            "Resource": [
                # Change this to your kafka topic and cluster name
                "arn:aws:kafka:us-east-1:XXXXXXXXXXXX:topic/test-cluster/xxxxxxxx-xxxx-xx/topic-to-consume"
            ]
        },
        {
            "Effect": "Allow",
            "Action": [
                "kafka-cluster:AlterGroup",
                "kafka-cluster:DescribeGroup"
            ],
            "Resource": [
                # change this as per your cluster name
                "arn:aws:kafka:us-east-1:XXXXXXXXXXXX:group/test-cluster/*"
            ]
        }
    ]
}
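
Because the policy templates above carry # comment lines that JSON does not allow, it can help to strip them programmatically and confirm that the result parses. The following Python sketch does that; the function name is an assumption, and it handles only comments that occupy a whole line, which is how they appear in the templates above.

```python
import json


def strip_policy_comments(template: str) -> dict:
    """Drop lines that start with '#' and parse the remainder as JSON."""
    kept = [
        line for line in template.splitlines()
        if not line.lstrip().startswith("#")
    ]
    # json.loads raises json.JSONDecodeError if anything else is malformed.
    return json.loads("\n".join(kept))


if __name__ == "__main__":
    # Shortened, hypothetical template for demonstration only.
    sample = """{
      "Version": "2012-10-17",
      # Change this to your msk arn
      "Statement": []
    }"""
    print(json.dumps(strip_policy_comments(sample), indent=2))
```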

Writing to OpenSearch Service

In this section, you provide the pipeline role with the necessary permissions to write to OpenSearch Service. As a best practice, we recommend using fine-grained access control in OpenSearch Service. Use OpenSearch Dashboards to map the pipeline role to an appropriate backend role. For more information on mapping roles to users, see Managing permissions. For example, all_access is a built-in role that grants administrative permission to all OpenSearch functions. When deploying to a production environment, ensure that you use a role with just enough permissions to write to your OpenSearch domain.

Creating OpenSearch Ingestion pipelines

The pipeline role now has the correct set of permissions to read from Amazon MSK and write to OpenSearch Service. Navigate to the OpenSearch Service console, choose Pipelines, then choose Create pipeline.

Choose a suitable name for the pipeline and set the pipeline capacity with appropriate minimum and maximum OpenSearch Compute Units (OCUs). Then choose ‘AWS-MSKPipeline’ from the dropdown menu as shown below.

Use the provided template to fill in all the required fields. The snippet in the following section shows the fields that need to be filled in.

Configuring Amazon MSK source

The following sample configuration snippet shows every setting you need to get the pipeline running:

msk-pipeline: 
  source: 
    kafka: 
      acknowledgments: true                     # Default is false  
      topics: 
         - name: "<topic name>" 
           group_id: "<consumer group id>" 
           serde_format: json                   # Remove, if Schema Registry is used. (Other option is plaintext)  
 
           # Below defaults can be tuned as needed
           # fetch_max_bytes: 52428800          # Optional
           # fetch_max_wait: 500                # Optional (in msecs)
           # fetch_min_bytes: 1                 # Optional (in bytes)
           # max_partition_fetch_bytes: 1048576 # Optional
           # consumer_max_poll_records: 500     # Optional
           # auto_offset_reset: "earliest"      # Optional (other option is "latest")
           # key_mode: include_as_field         # Optional (other options are include_as_metadata, discard)
 
      # Enable this configuration if Glue schema registry is used            
      # schema:                                 
      #   type: aws_glue 
 
      aws: 
        # Provide the Role ARN with access to MSK. This role should have a trust relationship with osis-pipelines.amazonaws.com 
        # sts_role_arn: "arn:aws:iam::XXXXXXXXXXXX:role/Example-Role" 
        # Provide the region of the domain. 
        # region: "us-west-2" 
        msk: 
          # Provide the MSK ARN.  
          arn: "arn:aws:kafka:us-west-2:XXXXXXXXXXXX:cluster/msk-prov-1/id" 
 
  sink: 
      - opensearch: 
          # Provide an AWS OpenSearch Service domain endpoint 
          # hosts: [ "https://search-mydomain-1a2a3a4a5a6a7a8a9a0a9a8a7a.us-east-1.es.amazonaws.com" ] 
          aws:
            # Provide a Role ARN with access to the domain. This role should have a trust relationship with osis-pipelines.amazonaws.com
            # sts_role_arn: "arn:aws:iam::XXXXXXXXXXXX:role/Example-Role"
            # Provide the region of the domain.
            # region: "us-east-1"
            # Enable the 'serverless' flag if the sink is an Amazon OpenSearch Serverless collection
            # serverless: true
          # index name can be auto-generated from topic name 
          index: "index_${getMetadata(\"kafka_topic\")}-%{yyyy.MM.dd}" 
          # Enable 'distribution_version' setting if the AWS OpenSearch Service domain is of version Elasticsearch 6.x 
          # distribution_version: "es6" 
          # Enable the S3 DLQ to capture any failed requests in an S3 bucket
          # dlq: 
            # s3: 
            # Provide an S3 bucket

We use the following parameters:

acknowledgments – Set to true for OpenSearch Ingestion to ensure that the data is delivered to the sinks before committing the offsets in Amazon MSK. The default value is false.
name – Specifies the topic that OpenSearch Ingestion reads from. You can read a maximum of four topics per pipeline.
group_id – Specifies the consumer group that the pipeline is part of. With this setting, a single consumer group can be scaled to as many pipelines as needed for very high throughput.
serde_format – Specifies the deserialization method to be used for the data read from Amazon MSK. The options are json and plaintext.
AWS sts_role_arn and OpenSearch sts_role_arn – Specifies the role OpenSearch Ingestion uses for reading and writing. Specify the ARN of the role you created in the previous section. OpenSearch Ingestion currently uses the same role for reading and writing.
MSK arn – Specifies the MSK cluster to consume data from.
OpenSearch host and index – Specifies the OpenSearch domain URL and the index to write to.
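
To illustrate what the index setting in the sample configuration produces, the following Python sketch emulates the naming pattern, assuming that the expression substitutes the Kafka topic name and a yyyy.MM.dd date for each record. This is an emulation for illustration, not how OpenSearch Ingestion implements it, and the function name is hypothetical.

```python
from datetime import date


def resolve_index_name(topic: str, day: date) -> str:
    """Emulate the pattern index_<kafka_topic>-<yyyy.MM.dd> from the sample config."""
    return f"index_{topic}-{day.strftime('%Y.%m.%d')}"


if __name__ == "__main__":
    # A hypothetical topic named "orders" consumed on August 31, 2023.
    print(resolve_index_name("orders", date(2023, 8, 31)))  # index_orders-2023.08.31
```

Under that assumption, each topic ends up with one index per day, which is worth keeping in mind when you plan index templates and retention.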

When you have configured the Kafka source, choose the network access type and log publishing options. Public pipelines do not involve PrivateLink and they will not incur a cost associated with PrivateLink. Choose Next and review all configurations. When you are satisfied, choose Create pipeline.

Recommended compute units (OCUs) for the MSK pipeline

Each compute unit has one consumer per topic. Brokers balance partitions among these consumers for a given topic. However, when the number of partitions is greater than the number of consumers, Amazon MSK assigns multiple partitions to each consumer. OpenSearch Ingestion has built-in auto scaling to scale up or down based on CPU usage or the number of pending records in the pipeline. For optimal performance, partitions should be distributed across many compute units for parallel processing. If a topic has a large number of partitions, for example, more than 96 (the maximum OCUs per pipeline), we recommend configuring a pipeline with 1–96 OCUs because it will auto scale as needed. If a topic has a low number of partitions, for example, fewer than 96, keep the maximum compute units the same as the number of partitions. When a pipeline has more than one topic, you can use the topic with the highest number of partitions as a reference for configuring the maximum compute units. By adding another pipeline with a new set of OCUs to the same topic and consumer group, you can scale the throughput almost linearly.
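
The sizing guidance above reduces to a simple rule, sketched here in Python. The function and constant names are assumptions; the limit of 96 OCUs per pipeline and the use of the topic with the most partitions as the reference come from the preceding paragraph.

```python
MAX_OCUS_PER_PIPELINE = 96  # maximum OCUs per pipeline, as stated above


def recommended_max_ocus(partitions_per_topic: dict[str, int]) -> int:
    """Suggest a maximum OCU setting from the partition count of each topic."""
    # With several topics, the topic with the most partitions is the reference.
    busiest = max(partitions_per_topic.values())
    # Never exceed the per-pipeline limit; auto scaling handles the rest.
    return min(busiest, MAX_OCUS_PER_PIPELINE)


if __name__ == "__main__":
    # Hypothetical topics and partition counts.
    print(recommended_max_ocus({"orders": 12, "clicks": 48}))
```

With the hypothetical topics shown, the suggested maximum is 48 OCUs; a topic with 200 partitions would be capped at 96.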

Clean up

To avoid future charges, clean up any unused resources from your AWS account.

Conclusion

In this post, you saw how to use Amazon MSK as a source for OpenSearch Ingestion. This not only addresses the ease of data consumption from Amazon MSK, but it also relieves you of the burden of self-managing and manually scaling consumers for varying and unpredictable high-speed, streaming operational analytics data. Please refer to the ‘sources’ list under the ‘supported plugins’ section for an exhaustive list of sources from which you can ingest data.

About the authors

Raj Sharma is a Sr. SDM with Amazon OpenSearch Service. He builds large-scale distributed applications and solutions. Raj is interested in the topics of Analytics, databases, networking and security, and is based out of Palo Alto, California.

Generate security insights from Amazon Security Lake data using Amazon OpenSearch Ingestion

2023-08-29 Muthu Pitchaimani

Post Syndicated from Muthu Pitchaimani original https://aws.amazon.com/blogs/big-data/generate-security-insights-from-amazon-security-lake-data-using-amazon-opensearch-ingestion/

Amazon Security Lake centralizes access and management of your security data by aggregating security event logs from AWS environments, other cloud providers, on-premises infrastructure, and other software as a service (SaaS) solutions. By converting logs and events using Open Cybersecurity Schema Framework, an open standard for storing security events in a common and shareable format, Security Lake optimizes and normalizes your security data for analysis using your preferred analytics tool.

Amazon OpenSearch Service continues to be a tool of choice for many enterprises for searching and analyzing large volumes of security data. In this post, we show you how to ingest and query Amazon Security Lake data with Amazon OpenSearch Ingestion, a serverless, fully managed data collector with configurable ingestion pipelines. Using OpenSearch Ingestion to ingest data into your OpenSearch Service cluster, you can derive insights more quickly for time-sensitive security investigations. You can respond swiftly to security incidents, helping you protect your business-critical data and systems.

Solution overview

The following architecture outlines the flow of data from Security Lake to OpenSearch Service.

The workflow contains the following steps:

Security Lake persists OCSF schema normalized data in an Amazon Simple Storage Service (Amazon S3) bucket determined by the administrator.
Security Lake notifies subscribers through the chosen subscription method, in this case Amazon Simple Queue Service (Amazon SQS).
OpenSearch Ingestion registers as a subscriber to get the necessary context information.
OpenSearch Ingestion reads Parquet formatted security data from the Security Lake managed Amazon S3 bucket and transforms the security logs into JSON documents.
OpenSearch Ingestion ingests this OCSF compliant data into OpenSearch Service.
Download and import provided dashboards to analyze and gain quick insights into the security data.

OpenSearch Ingestion provides a serverless ingestion framework to easily ingest Security Lake data into OpenSearch Service with just a few clicks.

Prerequisites

Complete the following prerequisite steps:

Create an Amazon OpenSearch Service domain. For instructions, refer to Creating and managing Amazon OpenSearch Service domains.
You must have access to the AWS account in which you wish to set up this solution.

Set up Amazon Security Lake

In this section, we present the steps to set up Amazon Security Lake, which includes enabling the service and creating a subscriber.

Enable Amazon Security Lake

Identify the account in which you want to activate Amazon Security Lake. Note that for accounts that are part of organizations, you have to designate a delegated Security Lake administrator from your management account. For instructions, refer to Managing multiple accounts with AWS Organizations.

Sign in to the AWS Management Console using the credentials of the delegated account.
On the Amazon Security Lake console, choose your preferred Region, then choose Get started.

Amazon Security Lake collects log and event data from a variety of sources and across your AWS accounts and Regions.

Now you’re ready to enable Amazon Security Lake.

You can either select All log and event sources or choose specific logs by selecting Specific log and event sources.
Data is ingested from all Regions. The recommendation is to select All supported regions so activities are logged for accounts that you might not frequently use as well. However, you also have the option to select Specific Regions.
For Select accounts, you can select the accounts in which you want Amazon Security Lake enabled. For this post, we select All accounts.

You’re prompted to either create a new AWS Identity and Access Management (IAM) role or use an existing IAM role. This gives required permissions to Amazon Security Lake to collect the logs and events. Choose the option appropriate for your situation.
Choose Next.
Optionally, specify the Amazon S3 storage class for the data in Amazon Security Lake. For more information, refer to Lifecycle management in Security Lake.
Choose Next.
Review the details and create the data lake.

Create an Amazon Security Lake subscriber

To access and consume data in your Security Lake managed Amazon S3 buckets, you must set up a subscriber.

Complete the following steps to create your subscriber:

On the Amazon Security Lake console, choose Summary in the navigation pane.

Here, you can see the number of Regions selected.

Choose Create subscriber.

A subscriber consumes logs and events from Amazon Security Lake. In this case, the subscriber is OpenSearch Ingestion, which consumes security data and ingests it into OpenSearch Service.

For Subscriber name, enter OpenSearchIngestion.
Enter a description.
Region is automatically populated based on the current selected Region.
For Log and event sources, select whether the subscriber is authorized to consume all log and event sources or specific log and event sources.
For Data access method, select S3.
For Subscriber credentials, enter the subscriber’s AWS account ID (<AWS account ID>) and, as the external ID, OpenSearchIngestion-<AWS account ID>.
For Notification details, select SQS queue.

This prompts Amazon Security Lake to create an SQS queue that the subscriber can poll for object notifications.

Choose Create.

Install templates and dashboards for Amazon Security Lake data

Your subscriber for OpenSearch Ingestion is now ready. Before you configure OpenSearch Ingestion to process the security data, let’s configure an OpenSearch sink (destination to write data) with index templates and dashboards.

Index templates are predefined mappings for security data that select the correct OpenSearch field types for the corresponding Open Cybersecurity Schema Framework (OCSF) schema definitions. Index templates also contain index-specific settings for particular index patterns. OCSF classifies security data into categories such as system activity, findings, identity and access management, network activity, application activity, and discovery.

Amazon Security Lake publishes events from four different AWS sources: AWS CloudTrail with subsets for AWS Lambda and Amazon Simple Storage Service (Amazon S3), Amazon Virtual Private Cloud (Amazon VPC) Flow Logs, Amazon Route 53, and AWS Security Hub. The following table details the event sources and their corresponding OCSF categories and OpenSearch index templates.

Amazon Security Lake Source	OCSF Category ID	OpenSearch Index Pattern
CloudTrail (Lambda and Amazon S3 API subsets)	3005	ocsf-3005*
VPC Flow Logs	4001	ocsf-4001*
Route 53	4003	ocsf-4003*
Security Hub	2001	ocsf-2001*

To easily identify OpenSearch indices containing Security Lake data, we recommend following a structured index naming pattern that includes the log category and its OCSF-defined class in the name of the index. The following is an example:

ocsf-cuid-${/class_uid}-${/metadata/product/name}-${/class_name}-%{yyyy.MM.dd}
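
To make the substitution in this pattern concrete, the following Python sketch renders an index name for a single event. It only imitates the placeholder syntax: the helper names and the sample event values are hypothetical, and OpenSearch Ingestion performs the real substitution inside the pipeline.

```python
import re
from datetime import datetime, timezone

PATTERN = "ocsf-cuid-${/class_uid}-${/metadata/product/name}-${/class_name}-%{yyyy.MM.dd}"


def resolve_pointer(event, pointer):
    """Follow a /a/b/c style path through nested dictionaries."""
    value = event
    for key in pointer.strip("/").split("/"):
        value = value[key]
    return str(value)


def render_index_name(pattern, event, when):
    """Replace ${/path} placeholders with event fields and %{yyyy.MM.dd} with a date."""
    name = re.sub(
        r"\$\{(/[^}]+)\}",
        lambda match: resolve_pointer(event, match.group(1)),
        pattern,
    )
    return name.replace("%{yyyy.MM.dd}", when.strftime("%Y.%m.%d"))


# Hypothetical event carrying only the fields the pattern refers to.
event = {
    "class_uid": 4003,
    "class_name": "dns_activity",
    "metadata": {"product": {"name": "route53"}},
}
index_name = render_index_name(PATTERN, event, datetime(2023, 5, 9, tzinfo=timezone.utc))
print(index_name)
```

Because the class UID comes first in the name, an index pattern such as ocsf-cuid-4003* selects every index for that class regardless of product or date.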

Complete the following steps to install the index templates and dashboards for your data:

Download the component_templates.zip and index_templates.zip files and unzip them on your local device.

Component templates are composable modules with settings, mappings, and aliases that can be shared and used by index templates.

Upload the component templates before the index templates. For example, the following Linux command line shows how to use the OpenSearch _component_template API to upload to your OpenSearch Service domain (change the domain URL and the credentials to appropriate values for your environment):
```
ls component_templates | awk -F'_body' '{print $1}' | xargs -I{} curl  -u adminuser:password -X PUT -H 'Content-Type: application/json' -d @component_templates/{}_body.json https://my-opensearch-domain.es.amazonaws.com/_component_template/{}
```

Once the component templates are successfully uploaded, proceed to upload the index templates:

ls index_templates | awk -F'_body' '{print $1}' | xargs -I{} curl -u adminuser:password -X PUT -H 'Content-Type: application/json' -d @index_templates/{}_body.json https://my-opensearch-domain.es.amazonaws.com/_index_template/{}

To verify that the index templates and component templates uploaded successfully, navigate to OpenSearch Dashboards, choose the hamburger menu, then choose Index Management.

In the navigation pane, choose Templates to see all the OCSF index templates.

Choose Component templates to verify the OCSF component templates.

After successfully uploading the templates, download the pre-built dashboards and other components required to visualize the Security Lake data in OpenSearch indices.
To upload these to OpenSearch Dashboards, choose the hamburger menu, and under Management, choose Stack Management.
In the navigation pane, choose Saved Objects.

Choose Import.

Choose Import, navigate to the downloaded file, then choose Import.

Confirm the dashboard objects are imported correctly, then choose Done.

All the necessary index and component templates, index patterns, visualizations, and dashboards are now successfully installed.

Configure OpenSearch Ingestion

Each OpenSearch Ingestion pipeline has a single data source with one or more sub-pipelines, processors, and sinks. In our solution, the Security Lake managed Amazon S3 bucket is the source and your OpenSearch Service cluster is the sink. Before setting up OpenSearch Ingestion, you need to create the following IAM roles and set up the required permissions:

Pipeline role – Defines permissions to read from Amazon Security Lake and write to the OpenSearch Service domain
Management role – Defines permissions that allow the user to create, update, delete, and validate the pipeline and perform other management operations

The following figure shows the permissions and roles you need and how they interact with the solution services.

Before you create an OpenSearch Ingestion pipeline, the principal or the user creating the pipeline must have permissions to perform management actions on a pipeline (create, update, list, and validate). Additionally, the principal must have permission to pass the pipeline role to OpenSearch Ingestion. If you are performing these operations as a non-administrator, add the following permissions to the user creating the pipelines (replace {your-account-id} with your AWS account ID):

{
	"Version": "2012-10-17",
	"Statement": [
		{
			"Effect": "Allow",
			"Resource": "*",
			"Action": [
				"osis:CreatePipeline",
				"osis:ListPipelineBlueprints",
				"osis:ValidatePipeline",
				"osis:UpdatePipeline"
			]
		},
		{
			"Resource": [
				"arn:aws:iam::{your-account-id}:role/pipeline-role"
			],
			"Effect": "Allow",
			"Action": [
				"iam:PassRole"
			]
		}
	]
}

Configure a read policy for the pipeline role

Security Lake subscribers only have access to the source data in the Region you selected when you created the subscriber. To give a subscriber access to data from multiple Regions, refer to Managing multiple Regions. To create a policy for read permissions, you need the name of the Amazon S3 bucket and the Amazon SQS queue created by Security Lake.

Complete the following steps to configure a read policy for the pipeline role:

On the Security Lake console, choose Regions in the navigation pane.
Choose the S3 location corresponding to the Region of the subscriber you created.

Make a note of this Amazon S3 bucket name.

Choose Subscribers in the navigation pane.
Choose the subscriber OpenSearchIngestion that you created earlier.

Take note of the Amazon SQS queue ARN under Subscription endpoint.

On the IAM console, choose Policies in the navigation pane.
Choose Create policy.
In the Specify permissions section, choose JSON to open the policy editor.

Remove the default policy and enter the following code (replace the S3 bucket and SQS queue ARN with the corresponding values):

{
	"Version": "2012-10-17",
	"Statement": [
		{
			"Sid": "ReadFromS3",
			"Effect": "Allow",
			"Action": "s3:GetObject",
			"Resource": "arn:aws:s3:::{bucket-name}/*"
		},
		{
			"Sid": "ReceiveAndDeleteSqsMessages",
			"Effect": "Allow",
			"Action": [
				"sqs:DeleteMessage",
				"sqs:ReceiveMessage"
			],
			"Resource": "arn:aws:sqs:{region}:{your-account-id}:{sqs-queue-name}"
		}
	]
}

Choose Next.
For policy name, enter read-from-securitylake.
Choose Create policy.

You have successfully created the policy to read data from Security Lake and receive and delete messages from the Amazon SQS queue.
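
If you prefer to generate this policy document rather than edit placeholders by hand, the following Python sketch builds the same two statements from the bucket name and queue ARN you noted earlier. The function name and the sample values are hypothetical; the statement contents mirror the policy shown above.

```python
import json


def build_read_policy(bucket_name, queue_arn):
    """Return the read policy for the pipeline role as a Python dictionary."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "ReadFromS3",
                "Effect": "Allow",
                "Action": "s3:GetObject",
                "Resource": f"arn:aws:s3:::{bucket_name}/*",
            },
            {
                "Sid": "ReceiveAndDeleteSqsMessages",
                "Effect": "Allow",
                "Action": ["sqs:DeleteMessage", "sqs:ReceiveMessage"],
                "Resource": queue_arn,
            },
        ],
    }


# Hypothetical bucket name and queue ARN; substitute the values from your account.
policy = build_read_policy(
    "aws-security-data-lake-example",
    "arn:aws:sqs:us-east-1:123456789012:AmazonSecurityLake-abcde-Main-Queue",
)
print(json.dumps(policy, indent=2))
```

The printed JSON can be pasted into the policy editor in place of the default policy.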

The complete process is shown below.

Configure a write policy for the pipeline role

We recommend using fine-grained access control (FGAC) with OpenSearch Service. When you use FGAC, you don’t have to use a domain access policy; you can skip the rest of this section and proceed to creating your pipeline role with the necessary permissions. If you use a domain access policy instead, create a second policy (for this post, we call it write-to-opensearch) by repeating the policy creation steps from the previous section with the following policy code:

{
	"Version": "2012-10-17",
	"Statement": [
		{
			"Effect": "Allow",
			"Action": "es:DescribeDomain",
			"Resource": "arn:aws:es:*:{your-account-id}:domain/*"
		},
		{
			"Effect": "Allow",
			"Action": "es:ESHttp*",
			"Resource": "arn:aws:es:*:{your-account-id}:domain/{domain-name}/*"
		}
	]
}

If the configured role has permissions to access Amazon S3 and Amazon SQS across accounts, OpenSearch Ingestion can ingest data across accounts.

Create the pipeline role with necessary permissions

Now that you have created the policies, you can create the pipeline role. Complete the following steps:

On the IAM console, choose Roles in the navigation pane.
Choose Create role.
For Use cases for other AWS services, select OpenSearch Ingestion pipelines.
Choose Next.
Search for and select the policy read-from-securitylake.
Search for and select the policy write-to-opensearch (if you’re using a domain access policy).
Choose Next.
For Role Name, enter pipeline-role.
Choose Create.

Make a note of the role name; you will use it when you configure the OpenSearch Ingestion pipeline.

Now you can map the pipeline role to an OpenSearch backend role if you’re using FGAC. You can map the ingestion role to one of the predefined roles or create your own with the necessary permissions. For example, all_access is a built-in role that grants administrative permission to all OpenSearch functions. When deploying to a production environment, make sure to use a role with just enough permissions to write to your Amazon OpenSearch Service domain.

Create the OpenSearch Ingestion pipeline

In this section, you use the pipeline role you created to create an OpenSearch Ingestion pipeline. Complete the following steps:

On the OpenSearch Service console, choose OpenSearch Ingestion in the navigation pane.
Choose Create pipeline.
For Pipeline name, enter a name, such as security-lake-osi.
In the Pipeline configuration section, choose Configuration blueprints and choose AWS-SecurityLakeS3ParquetOCSFPipeline.

Under source, update the following information:
1. Update the queue_url in the sqs section. (This is the SQS queue that Amazon Security Lake created when you created a subscriber. To get the URL, navigate to the Amazon SQS console and look for the queue whose name has the format AmazonSecurityLake-abcde-Main-Queue.)
2. Enter the Region to use for aws credentials.

Under sink, update the following information:
1. Replace the hosts value in the OpenSearch section with the Amazon OpenSearch Service domain endpoint.
2. For sts_role_arn, enter the ARN of pipeline-role.
3. Set region to the Region of your OpenSearch Service domain (for example, us-east-1).
4. For index, enter the index name that was defined in the template created in the previous section ("ocsf-cuid-${/class_uid}-${/metadata/product/name}-${/class_name}-%{yyyy.MM.dd}").
Choose Validate pipeline to verify the pipeline configuration.

If the configuration is valid, a successful validation message appears; you can now proceed to the next steps.

Under Network, select Public for this post. Our recommendation is to select VPC access for an inherent layer of security.
Choose Next.
Review the details and create the pipeline.

When the pipeline is active, you should see the security data ingested into your Amazon OpenSearch Service domain.
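
The queue_url that the source section expects can also be derived from the queue ARN you noted on the subscriber page. The following Python sketch shows that conversion; the function name is hypothetical, and it assumes a queue in the standard aws partition, where queue URLs follow the https://sqs.<region>.amazonaws.com/<account ID>/<queue name> form.

```python
def sqs_arn_to_url(queue_arn):
    """Convert arn:aws:sqs:<region>:<account>:<name> into the matching queue URL."""
    parts = queue_arn.split(":")
    if len(parts) != 6 or parts[:3] != ["arn", "aws", "sqs"]:
        raise ValueError(f"not an SQS queue ARN in the aws partition: {queue_arn}")
    region, account_id, queue_name = parts[3], parts[4], parts[5]
    return f"https://sqs.{region}.amazonaws.com/{account_id}/{queue_name}"


# Hypothetical ARN in the format Security Lake uses for subscriber queues.
queue_url = sqs_arn_to_url(
    "arn:aws:sqs:us-east-1:123456789012:AmazonSecurityLake-abcde-Main-Queue"
)
print(queue_url)
```

If the result does not match the URL shown on the Amazon SQS console, use the console value.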

Visualize the security data

After OpenSearch Ingestion starts writing your data into your OpenSearch Service domain, you should be able to visualize the data using the pre-built dashboards you imported earlier. Navigate to dashboards and choose any one of the installed dashboards.

For example, choosing DNS Activity will give you dashboards of all DNS activity published in Amazon Security Lake.

This dashboard shows the top DNS queries by account and hostname. It also shows the number of queries per account. OpenSearch Dashboards are flexible; you can add, delete, or update any of these visualizations to suit your organization and business needs.

Clean up

To avoid unwanted charges, delete the OpenSearch Service domain and OpenSearch Ingestion pipeline, and disable Amazon Security Lake.

Conclusion

In this post, you successfully configured Amazon Security Lake to send security data from different sources to OpenSearch Service through serverless OpenSearch Ingestion. You installed pre-built templates and dashboards to quickly get insights from the security data. Refer to Amazon OpenSearch Ingestion to find additional sources from which you can ingest data. For additional use cases, refer to Use cases for Amazon OpenSearch Ingestion.

About the authors

Aish Gunasekar is a Specialist Solutions Architect with a focus on Amazon OpenSearch Service. Her passion at AWS is to help customers design highly scalable architectures and help them in their cloud adoption journey. Outside of work, she enjoys hiking and baking.

Jimish Shah is a Senior Product Manager at AWS with 15+ years of experience bringing products to market in log analytics, cybersecurity, and IP video streaming. He’s passionate about launching products that offer delightful customer experiences and solve complex customer problems. In his free time, he enjoys exploring cafes, hiking, and taking long walks.

Use SAML Identities for programmatic access to Amazon OpenSearch Service

2023-05-09 Muthu Pitchaimani

Post Syndicated from Muthu Pitchaimani original https://aws.amazon.com/blogs/big-data/use-saml-identities-for-programmatic-access-to-amazon-opensearch-service/

Customers of Amazon OpenSearch Service can already use Security Assertion Markup Language (SAML) to access OpenSearch Dashboards.

This post outlines two methods by which programmatic users can now access OpenSearch using SAML identities. This applies to all identity providers (IdPs) that support SAML 2.0, including prevalent ones like Active Directory Federation Services (AD FS), Okta, AWS IAM Identity Center (successor to AWS Single Sign-On), Keycloak, and others. Although we outline the methods as they pertain to OpenSearch Service and AWS Identity and Access Management (IAM), programmatic access to each of these individual providers is outside the scope of this post. Most of these providers do provide such a facility.

Single sign-on methods

When you use single sign-on (SSO), there are two different authentication methods:

Identity provider initiated – This is when a user or a user-agent first authenticates with an IdP and gets a SAML assertion that establishes the identity of the user. This assertion is then passed to a service provider (SP) that provides access to a protected resource.
Service provider initiated – Although the IdP-initiated exchange is straightforward, a more typical sign-on experience is when the protected resource is accessed directly. The SP then redirects the user to the IdP for authentication along with a SAML authentication request. The IdP responds with an authentication assertion inside a SAML response. After that, the SSO experience is the same as that of an IdP-initiated flow.

For programmatic access to OpenSearch Service, an external IdP is the IdP, and OpenSearch Service and IAM both serve as SPs. To configure your IdP of choice as the SAML IdP for IAM, refer to Creating IAM SAML identity providers. To configure OpenSearch Service, refer to SAML authentication for OpenSearch Dashboards.

In the following sections, we outline two methods to access the OpenSearch Service API:

Using AWS Security Token Service (AWS STS)
Using the OpenSearch Dashboards’ console proxy

Method 1: Use AWS STS

The following figure shows the sequence of calls to access the OpenSearch Service API using AWS STS.

Let’s explore each step in more detail.

Steps 1 and 2

Steps 1 and 2 vary depending on your chosen IdP. IdPs typically provide an authentication API, a session API, or a similar API to authenticate and retrieve the SAML authentication assertion response. We use this SAML assertion in the next step.

Steps 3 and 4

Call the AssumeRoleWithSAML AWS STS API to exchange the SAML assertion for temporary credentials associated with your SAML identity. See the following code:

curl --location 'https://sts.amazonaws.com/' \
--data-urlencode 'Version=2011-06-15' \
--data-urlencode 'Action=AssumeRoleWithSAML' \
--data-urlencode 'RoleArn=<ARN of the role being assumed>' \
--data-urlencode 'PrincipalArn=<ARN of the IdP integrated with IAM>' \
--data-urlencode 'SAMLAssertion=<Base-64 encoded SAML assertion>'

The response contains the temporary AWS STS credentials with AccessKeyId, SecretAccessKey, and a SessionToken.
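
As a rough illustration of steps 3 and 4, the following Python sketch builds the URL-encoded request parameters and pulls the credentials out of the XML response. The function names and the sample response are hypothetical, and the sketch deliberately stops short of sending anything over the network. URL-encoding matters here because a Base-64 assertion usually contains +, /, and = characters.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode


def build_assume_role_body(role_arn, principal_arn, saml_assertion_b64):
    """URL-encode the AssumeRoleWithSAML parameters for use as a form body."""
    return urlencode(
        {
            "Version": "2011-06-15",
            "Action": "AssumeRoleWithSAML",
            "RoleArn": role_arn,
            "PrincipalArn": principal_arn,
            "SAMLAssertion": saml_assertion_b64,
        }
    )


def parse_credentials(response_xml):
    """Return the children of the Credentials element as a dictionary."""
    root = ET.fromstring(response_xml)
    for element in root.iter():
        # Tags look like "{namespace}Credentials"; compare the local name only.
        if element.tag.split("}")[-1] == "Credentials":
            return {child.tag.split("}")[-1]: child.text for child in element}
    raise ValueError("no Credentials element found in the response")


# Hypothetical, heavily shortened response used only to exercise the parser.
SAMPLE_RESPONSE = """
<AssumeRoleWithSAMLResponse xmlns="https://sts.amazonaws.com/doc/2011-06-15/">
  <AssumeRoleWithSAMLResult>
    <Credentials>
      <AccessKeyId>ASIAEXAMPLE</AccessKeyId>
      <SecretAccessKey>secret-example</SecretAccessKey>
      <SessionToken>token-example</SessionToken>
      <Expiration>2023-03-27T14:47:10Z</Expiration>
    </Credentials>
  </AssumeRoleWithSAMLResult>
</AssumeRoleWithSAMLResponse>
"""

body = build_assume_role_body(
    "arn:aws:iam::123456789012:role/Example-Role",
    "arn:aws:iam::123456789012:saml-provider/ExampleIdP",
    "abc+/==",
)
credentials = parse_credentials(SAMPLE_RESPONSE)
print(sorted(credentials))
```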

Step 5

Use the temporary credentials from the last step to sign all API requests to OpenSearch Service. Also ensure the role that you assumed with the AssumeRoleWithSAML call has sufficient permission to access the requisite data in OpenSearch Service. Refer to Mapping roles to users for more information about mapping this role as a backend role. As an additional step to ensure consistency, this AWS STS role and any SAML group the user is part of can be mapped to the same role in OpenSearch Service. The following code shows a model to make this call:

curl --location '<OpenSearch Service domain URL>/_search' \
--header 'X-Amz-Security-Token: Fwo...==(truncated)' \
--header 'X-Amz-Date: 20230327T134710Z' \
--header 'Authorization: AWS4-HMAC-SHA256 Credential=ASI..(truncated)/20230327/us-east-1/es/aws4_request, SignedHeaders=host;x-amz-date;x-amz-security-token, Signature=95eb…(truncated)'
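
The Authorization header in this call is produced by AWS Signature Version 4. In practice an AWS SDK or a signing library computes it for you; the following Python sketch spells out the calculation for the narrow case shown above, a GET request with an empty body and no query parameters, so you can see where each header value comes from. The function name and all credential values are hypothetical.

```python
import hashlib
import hmac
from urllib.parse import urlparse


def _hmac_sha256(key, message):
    return hmac.new(key, message.encode("utf-8"), hashlib.sha256).digest()


def sign_get_request(url, region, access_key, secret_key, session_token, amz_date):
    """Return SigV4 headers for a body-less GET without query parameters."""
    parsed = urlparse(url)
    if parsed.query:
        raise ValueError("this sketch does not canonicalize query strings")
    date_stamp = amz_date[:8]
    headers = {
        "host": parsed.netloc,
        "x-amz-date": amz_date,
        "x-amz-security-token": session_token,
    }
    signed_headers = ";".join(sorted(headers))
    canonical_headers = "".join(f"{name}:{headers[name]}\n" for name in sorted(headers))
    payload_hash = hashlib.sha256(b"").hexdigest()
    # Method, path, (empty) query string, headers, signed header list, body hash.
    canonical_request = "\n".join(
        ["GET", parsed.path or "/", "", canonical_headers, signed_headers, payload_hash]
    )
    scope = f"{date_stamp}/{region}/es/aws4_request"
    string_to_sign = "\n".join(
        [
            "AWS4-HMAC-SHA256",
            amz_date,
            scope,
            hashlib.sha256(canonical_request.encode("utf-8")).hexdigest(),
        ]
    )
    # Derive the signing key by chaining HMACs over date, Region, service, and terminator.
    key = ("AWS4" + secret_key).encode("utf-8")
    for part in (date_stamp, region, "es", "aws4_request"):
        key = _hmac_sha256(key, part)
    signature = hmac.new(key, string_to_sign.encode("utf-8"), hashlib.sha256).hexdigest()
    return {
        "X-Amz-Date": amz_date,
        "X-Amz-Security-Token": session_token,
        "Authorization": (
            f"AWS4-HMAC-SHA256 Credential={access_key}/{scope}, "
            f"SignedHeaders={signed_headers}, Signature={signature}"
        ),
    }


# Hypothetical endpoint and temporary credentials from the previous step.
signed = sign_get_request(
    "https://search-mydomain.us-east-1.es.amazonaws.com/_search",
    "us-east-1",
    "ASIAEXAMPLE",
    "secret-example",
    "token-example",
    "20230327T134710Z",
)
print(signed["Authorization"])
```

Requests with a body or query parameters need the additional canonicalization rules of Signature Version 4, which is one more reason to rely on an SDK outside of experiments.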

Method 2: Use OpenSearch Dashboards’ console proxy

OpenSearch Dashboards has a component called a console proxy that can proxy requests to OpenSearch. This allows OpenSearch clients to make the same API calls in Domain Specific Language (DSL) to this console proxy instead of directly calling OpenSearch. The console proxy forwards these calls to OpenSearch and responds back to the clients in the same format as OpenSearch.

The following figure shows the sequence of calls you can make to the console proxy to gain programmatic access to OpenSearch Service.

Steps 1 and 2

The first two steps are similar to method 1, and they will vary depending on what IdP is chosen. Essentially, you need to obtain a SAML authentication assertion response from the IdP.

Steps 3 and 4

Use the SAML assertion from the previous steps and POST it to the Assertion Consumer Service (ACS) URL, _opendistro/_security/saml/acs/idpinitiated, to exchange the assertion for the security_authentication token. The following code shows the command line for these steps:

curl --location '<dashboards URL>/_opendistro/_security/saml/acs/idpinitiated' \
--header 'content-type: application/x-www-form-urlencoded' \
--data-urlencode 'SAMLResponse=<Base-64 encoded SAML assertion>' \
--data-urlencode 'RelayState='

If you’re using the OpenSearch engine, the dashboard URL is <domain URL>/_dashboards. If you’re using the Elasticsearch engine, the dashboard URL is <domain URL>/_plugin/kibana. OpenSearch Dashboards processes this and responds with a redirect response with code 302 and an empty body. The response headers now also contain a cookie named security_authentication, which is the token you must use in all subsequent calls.
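
A client has to read the security_authentication cookie from the Set-Cookie headers of that 302 response. The following Python sketch shows one way to do so with plain string handling; the function name and the sample header values are hypothetical.

```python
def extract_cookie(set_cookie_headers, name="security_authentication"):
    """Return the value of the named cookie from a list of Set-Cookie header values."""
    for header in set_cookie_headers:
        # The first "name=value" pair is the cookie; attributes such as Path follow it.
        first_pair = header.split(";", 1)[0].strip()
        key, separator, value = first_pair.partition("=")
        if separator and key == name:
            return value
    return None


# Hypothetical headers shaped like those of the redirect response.
sample_headers = [
    "other_cookie=abc; Path=/",
    "security_authentication=Fe26.2**1example; HttpOnly; Path=/_dashboards",
]
token = extract_cookie(sample_headers)
print(token)
```

When you test with curl, the --include option prints the response headers so you can see the cookie directly.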

Steps 5–8

Use the security_authentication cookie in the API calls to the console proxy to perform programmatic API calls. The following code shows a command line for these steps:

curl --location '<dashboards URL>/api/console/proxy?path=_search&method=GET' \
--header 'content-type: application/json' \
--header 'cookie: security_authentication=Fe26.2**1...(truncated)' \
--header 'osd-xsrf: true' \
--data '{
  "query": {
    "match_all": {}
  }
}'

Make sure to include the header osd-xsrf: true for programmatic access to dashboards. The console proxy path is /api/console/proxy for Elasticsearch engines version 6.x and 7.x and OpenSearch engine version 1.x and 2.x.
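
The same console proxy call can be assembled in Python. The following sketch only constructs the request object and does not send it; the function name, the dashboards URL, and the cookie value are hypothetical. As in the curl command, the request to the proxy is a POST, and the method query parameter tells the proxy which HTTP method to use against OpenSearch.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request


def build_proxy_request(dashboards_url, path, method, cookie, body):
    """Build a console proxy request carrying the SAML session cookie."""
    query = urlencode({"path": path, "method": method})
    return Request(
        url=f"{dashboards_url}/api/console/proxy?{query}",
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Cookie": f"security_authentication={cookie}",
            "osd-xsrf": "true",
        },
        method="POST",
    )


# Hypothetical dashboards URL and cookie value.
request = build_proxy_request(
    "https://search-mydomain.us-east-1.es.amazonaws.com/_dashboards",
    "_search",
    "GET",
    "Fe26.2**1example",
    {"query": {"match_all": {}}},
)
print(request.get_method(), request.full_url)
```

Passing the request to urllib.request.urlopen would send it.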

Similar to method 1, make sure to map roles and groups associated with a particular SAML identity as the correct backend role with requisite permissions.

Comparing these methods

You can use method 1 in any domain regardless of the engine as long as fine-grained access control is enabled. Method 2 only works for domains with Elasticsearch engine versions greater than 6.7 and all OpenSearch engine versions.

The OpenSearch Dashboards process is generally meant for human interactions, which have a lower API call rate and volume than programmatic calls. OpenSearch can handle considerably higher API call rates and volumes, so take care not to send high-volume API calls using method 2. As a best practice for programmatic access with SAML identities, we recommend method 1 wherever possible to avoid performance bottlenecks.

Conclusion

Both of the methods outlined in this post provide a similar flow to access OpenSearch Service programmatically using SAML identities (exchanging a SAML assertion for an authentication token). AssumeRoleWithSAML is a key and fairly straightforward API that enables this access and is our recommended method. Try one of the OpenSearch Service labs and launch an OpenSearch Service domain to experiment with these methods. Good luck!

Top strategies for high volume tracing with Amazon OpenSearch Ingestion

2023-04-27 Muthu Pitchaimani

Post Syndicated from Muthu Pitchaimani original https://aws.amazon.com/blogs/big-data/top-strategies-for-high-volume-tracing-with-amazon-opensearch-ingestion/

Amazon OpenSearch Ingestion is a serverless, auto-scaled, managed data collector that receives, transforms, and delivers data to Amazon OpenSearch Service domains or Amazon OpenSearch Serverless collections. OpenSearch Ingestion is powered by Data Prepper, an open-source, streaming ETL (extract, transform, and load) solution that’s part of the OpenSearch project. When you use OpenSearch Ingestion, you don’t need to maintain self-managed data pipelines to ingest logs, traces, metrics, and other data with OpenSearch Service. Amazon OpenSearch Ingestion responds to changing volumes of data, automatically scaling your ingest pipeline.

Distributed tracing is the leading way to locate, alert on, and remediate problems with your application and infrastructure. Distributed tracing is part of a broader observability solution, often combined with metrics and log data. OpenSearch Service gives you a native toolset to store and analyze large volumes of log, metric, and trace data. However, moving these large volumes of data is non-trivial to set up, monitor, and maintain.

In this post, we outline steps to set up a trace pipeline and strategies to deal with high volume tracing with Amazon OpenSearch Ingestion.

Solution overview

There is now a new option on the OpenSearch Service console called Pipelines under Ingestion in the navigation pane. We use this feature to create a trace pipeline.

You can also use the AWS Command Line Interface (AWS CLI), AWS CloudFormation, or AWS APIs to create a trace pipeline.

Prerequisites

Refer to Security in OpenSearch Ingestion to set up the permissions you need to create a pipeline and write to a pipeline, and the permissions the pipeline needs to write to a sink.

Create a trace pipeline

To create a trace pipeline, complete the following steps:

On the OpenSearch Service console, choose Pipelines under Ingestion in the navigation pane.
Choose Create pipeline.

Amazon OpenSearch Ingestion, powered by Data Prepper, uses pipelines as a mechanism to move the data from a source to a sink, with optional processors to mutate, route, sample, and detect anomalies for the data in the pipe. For more information, refer to Data Prepper. When you use Data Prepper, you build a YAML configuration file. When you use OpenSearch Ingestion, you upload your YAML configuration to the service. If you’re using the OpenSearch Service console, you can use one of the configuration blueprints that we provide. For distributed tracing, you will use an otel_trace_source and an OpenSearch Service domain as the sink.

On the Configuration blueprints menu, choose AWS-TraceAnalyticsPipeline.

Choosing this blueprint will create a sample pipeline with otel_trace_source, an OpenSearch sink, along with span-pipeline and service-map-pipeline.

Enter a name for this pipeline along with a minimum (1) and maximum (96) capacity value for Ingestion-OCUs.

Amazon OpenSearch Ingestion will scale automatically between these values to suit the volume of data you are ingesting.

Edit the configuration’s hosts, aws.sts_role_arn, and region fields of the OpenSearch Service sink.
Follow the rest of the steps to complete the trace pipeline creation.

Sample trace pipeline

The following code shows the components of a sample trace pipeline:

version: "2"
entry-pipeline: 
  source:
    otel_trace_source:
      path: "/${pipelineName}/v1/traces"
  processor:
    - trace_peer_forwarder:
  sink:
    - pipeline:
        name: "span-pipeline"
    - pipeline:
        name: "service-map-pipeline"
span-pipeline:
  source:
    pipeline:
      name: "entry-pipeline"
  processor:
    - otel_trace_raw:
  sink:
    - opensearch:
        hosts: [ "https://search-mydomain-1a2a3a4a5a6a7a8a9a0a9a8a7a.us-east-1.es.amazonaws.com" ]
        aws:
          sts_role_arn: "arn:aws:iam::123456789012:role/Example-Role"
          region: "us-east-1"
        index_type: "trace-analytics-raw"
service-map-pipeline:
  source:
    pipeline:
      name: "entry-pipeline"
  processor:
    - service_map_stateful:
  sink:
    - opensearch:
        hosts: [ "https://search-mydomain-1a2a3a4a5a6a7a8a9a0a9a8a7a.us-east-1.es.amazonaws.com" ]
        aws:
          sts_role_arn: "arn:aws:iam::123456789012:role/Example-Role"
          region: "us-east-1"
        index_type: "trace-analytics-service-map"

The sample trace pipeline has three sub-pipelines in its configuration. These are entry-pipeline, span-pipeline, and service-map-pipeline. The following diagram illustrates the workflow.

entry-pipeline specifies the source of data as otel_trace_source, which creates an HTTP listener for receiving OpenTelemetry traces at the ingestion URL for the pipeline. You use a trace_peer_forwarder processor to eliminate duplicate HTTP requests and forward the data to the span-pipeline and service-map pipelines. span-pipeline gets the raw trace data from entry-pipeline and uses the otel_trace_raw processor to complete trace group-related fields for the incoming span records. You use the service_map_stateful processor to have Data Prepper create the distributed service map for visualization in OpenSearch Dashboards. After the sample trace pipeline is created, it’s ready to receive OpenTelemetry traces at its ingestion URL!

Reduce your storage footprint and optimize for cost

The volume of traces collected from instrumenting a modern production enterprise application can reach tens or hundreds of terabytes very quickly, especially when you store every trace from every request. The problem of managing the storage footprint becomes important. In this section, we discuss strategies for reducing your storage footprint and optimizing for cost.

Use storage tiering

OpenSearch Service has three storage tiers: hot, UltraWarm, and cold. You use the hot tier to store frequently accessed data for quick reading and writing, the UltraWarm tier for infrequently used, read-only data backed by Amazon Simple Storage Service (Amazon S3) for lower cost, and the cold tier to maintain re-attachable data at near-Amazon S3 cost. By adjusting relative retention periods between these tiers, you can store a high volume of traces. For example, instead of storing 1 week’s worth of traces in the hot tier, you can store 2 days of traces in the hot tier and 15 days in the UltraWarm tier.

Extract metrics without storing traces

You can also use Data Prepper’s aggregate processor to extract metrics in the pipeline to avoid delivering all of your data to OpenSearch Service. For example, you may want to analyze request, error, and duration (RED) metrics of your traces to know the current state of your services. OpenSearch Ingestion can calculate these metrics in the pipeline, aggregating them and storing them in separate indexes for analysis, reducing the ingestion and storage footprint of your traces. The following pipeline configuration snippet shows how to use the aggregate processor to calculate a histogram of the duration metric:

...
  processor:
    - aggregate:
        identification_keys: ["serviceName", "traceId"]
        action:
          histogram:
            key: "durationInNanos"
            record_minmax: true
            units: "nanoseconds"
            buckets: [1000000000, 1500000000, 2000000000]
        group_duration: "20s"
  sink:
    - opensearch:
        hosts: ...
        aws_sts_role_arn: ...
        aws_region: ...
        aws_sigv4: true
        index: "red_metrics_from_traces"
  ...
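
To show what a histogram over these buckets looks like, the following Python sketch counts span durations per (serviceName, traceId) group using the same bucket boundaries. It is a simplified illustration and not Data Prepper's implementation: it ignores the 20-second group_duration window, and treating each boundary as an inclusive upper bound is an assumption. The function name and the sample spans are hypothetical.

```python
from bisect import bisect_left
from collections import defaultdict

BUCKETS = [1_000_000_000, 1_500_000_000, 2_000_000_000]  # nanoseconds


def duration_histograms(spans, buckets=BUCKETS):
    """Count durations per group; the last count holds values above the top bucket."""
    histograms = defaultdict(
        lambda: {"counts": [0] * (len(buckets) + 1), "min": None, "max": None}
    )
    for span in spans:
        entry = histograms[(span["serviceName"], span["traceId"])]
        duration = span["durationInNanos"]
        # bisect_left puts a value equal to a boundary into that boundary's bucket.
        entry["counts"][bisect_left(buckets, duration)] += 1
        entry["min"] = duration if entry["min"] is None else min(entry["min"], duration)
        entry["max"] = duration if entry["max"] is None else max(entry["max"], duration)
    return dict(histograms)


# Hypothetical spans from a single trace.
spans = [
    {"serviceName": "checkout", "traceId": "t1", "durationInNanos": 400_000_000},
    {"serviceName": "checkout", "traceId": "t1", "durationInNanos": 1_200_000_000},
    {"serviceName": "checkout", "traceId": "t1", "durationInNanos": 2_500_000_000},
]
result = duration_histograms(spans)
print(result[("checkout", "t1")])
```

Storing one such summary per group instead of every span is what reduces the ingestion and storage footprint.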

Use sampling

When your application is running without issues, the proportion of error traces is just a small percentage of your overall trace volume. Storing all of the traces for successful requests increases the cost substantially, while offering low value. To reduce cost, you can sample your trace data, reducing the number of traces you store in OpenSearch Service. There are generally two techniques for sampling:

Head sampling – When you do head sampling, you ask OpenSearch Ingestion to make a sampling decision without looking at the whole trace. Head sampling is easy to configure and is efficient, but has a downside of possibly missing important traces.
Tail sampling – Tail sampling is where you analyze the entirety of the trace and then decide whether to sample the trace or not. This accurately captures all the needed traces at the cost of complexity in configuring and implementing.

The following configuration snippet shows an example of the percent_sampler action of the aggregate processor. In this example, you send only 25% of your traces to OpenSearch Service, based on head sampling:

  ...
  processor:
    - aggregate:
        identification_keys: ["serviceName"]
        action:
          percent_sampler:
            percent: 25
        group_duration: "30s"
  sink:
    - opensearch:
        hosts: ...
        aws_sts_role_arn: ...
        aws_region: ...
        aws_sigv4: true
        index: "sampled-traces"
  ...
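
The idea behind head sampling can be sketched in a few lines of Python. The version below is not how percent_sampler works internally; it is a deterministic, hash-based variant chosen for illustration, in which the decision depends only on the trace ID, so every span of a trace gets the same verdict without the whole trace being inspected. The function name and trace IDs are hypothetical.

```python
import hashlib


def keep_trace(trace_id, percent):
    """Decide from the trace ID alone whether a trace falls in the sampled share."""
    digest = hashlib.sha256(trace_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % 100  # stable value from 0 to 99
    return bucket < percent


# Hypothetical trace IDs; roughly a quarter of a large population is kept at 25%.
trace_ids = [f"trace-{number}" for number in range(1000)]
kept = [trace_id for trace_id in trace_ids if keep_trace(trace_id, 25)]
print(len(kept), "of", len(trace_ids), "traces kept")
```

Because nothing about the outcome of the request is considered, an error trace is as likely to be dropped as a successful one, which is the downside of head sampling described above.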

Use conditional routing with sampling

Head sampling using the percent_sampler is simple and straightforward, but it is a blunt tool. A better way to sample would be to gather, for example, 10% of successful responses and 100% of failed responses or high-duration traces. To do this, use conditional routing. Routes define conditions that can be used within processors and sinks to direct data through different parts of the pipeline. For example, the following configuration snippet routes traces whose status code indicates a failure to the trace-error-pipeline, where you forward 100% of the data. It routes traces whose duration is 1 second or more to the trace-high-latency-metrics-pipeline, where you sample them at 80%. All other traces go to the trace-normal-pipeline, where they are sampled at only 20%.

  processor:
    - otel_trace_raw:
  route:
    - error_traces: "/traceGroupFields/statusCode == 2"
    - high_latency_traces: '/durationInNanos >= 1000000000'
    - normal_traces: '/traceGroupFields/statusCode != 2 and /durationInNanos < 1000000000'
  sink:
    - pipeline:
        name: "trace-error-pipeline"
        routes:
          - error_traces
    - pipeline: 
        name: "trace-high-latency-metrics-pipeline"
        routes: 
          - high_latency_traces
    - pipeline: 
        name: "trace-normal-pipeline"
        routes: 
          - normal_traces
  ...
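
The three route conditions can be read as simple predicates on each span. The following Python sketch evaluates them outside the pipeline so the overlap between routes is easy to see; the function name and the sample spans are hypothetical, and the sampling that each downstream pipeline applies is left out.

```python
ONE_SECOND_NANOS = 1_000_000_000
ERROR_STATUS = 2  # the status code the route conditions treat as a failure


def matching_routes(span):
    """Return the names of the routes whose condition the span satisfies."""
    status = span["traceGroupFields"]["statusCode"]
    duration = span["durationInNanos"]
    routes = []
    if status == ERROR_STATUS:
        routes.append("error_traces")
    if duration >= ONE_SECOND_NANOS:
        routes.append("high_latency_traces")
    if status != ERROR_STATUS and duration < ONE_SECOND_NANOS:
        routes.append("normal_traces")
    return routes


# Hypothetical spans: a fast success, a slow success, and a slow failure.
fast_ok = {"traceGroupFields": {"statusCode": 0}, "durationInNanos": 200_000_000}
slow_ok = {"traceGroupFields": {"statusCode": 0}, "durationInNanos": 1_500_000_000}
slow_error = {"traceGroupFields": {"statusCode": 2}, "durationInNanos": 3_000_000_000}

for span in (fast_ok, slow_ok, slow_error):
    print(matching_routes(span))
```

In this sketch, a span that is both failed and slow satisfies two conditions, while the normal_traces condition is written so that it never overlaps with the other two.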

Conclusion

In this post, you learned how to configure an OpenSearch Ingestion pipeline and several strategies that help minimize cost while supporting a large-scale production system for distributed tracing. As a next step, refer to the Amazon OpenSearch Developer Guide to explore logs and metric pipelines that you can use to build a scalable observability solution for your enterprise applications.