Tag Archives: Technical How-to

Enhance security and performance with TLS 1.3 and Perfect Forward Secrecy on Amazon OpenSearch Service

Post Syndicated from Shubham Kumar original https://aws.amazon.com/blogs/big-data/enhance-security-and-performance-with-tls-1-3-and-perfect-forward-secrecy-on-amazon-opensearch-service/

Amazon OpenSearch Service recently introduced a new Transport Layer Security (TLS) policy Policy-Min-TLS-1-2-PFS-2023-10, which supports the latest TLS 1.3 protocol and TLS 1.2 with Perfect Forward Secrecy (PFS) cipher suites. This new policy improves security and enhances OpenSearch performance.

OpenSearch Service previously offered predefined TLS policies for domain endpoint security, making it possible to encrypt your traffic end-to-end by enforcing HTTPS. However, these policies were limited to older versions of TLS, such as TLS 1.0 and TLS 1.2, without any PFS offerings.

In this post, we discuss the benefits of this new policy and how to enable it using the AWS Command Line Interface (AWS CLI).

Solution overview

The new TLS security policy provides an upgraded security posture for OpenSearch Service domains by implementing TLS 1.3 and PFS. This makes it possible to enhance the confidentiality and integrity of traffic between clients and your OpenSearch Service domains, providing a more secure and efficient communication channel for your sensitive data. TLS 1.3 is the latest version of the Transport Layer Security protocol, designed to prevent certain attacks targeting legacy TLS ciphers and provide improvements like 0-RTT resumption for faster connection times. TLS 1.3 can establish secure connections faster than TLS 1.2, resulting in reduced latency for your applications. PFS is an important security enhancement that makes sure past communications remain secure, even if the server’s long-term secret key is compromised in the future. By using a unique, randomly generated session key for each connection, PFS adds an extra layer of protection against potential eavesdropping or decryption of encrypted data. Compared to the older TLS 1.2 policy Policy-Min-TLS-1-2-2019-07, TLS 1.2 with PFS offers stronger security by protecting against potential key compromises, while still maintaining compatibility with older clients that don’t support TLS 1.3.

Prerequisites

To start using this new policy, you need the following prerequisites:

Enable the new TLS policy on OpenSearch Service

To create new domains with the new TLS policy enabled, add --domain-endpoint-options '{"TLSSecurityPolicy": "Policy-Min-TLS-1-2-PFS-2023-10"}' to the create-domain AWS CLI command:

aws opensearch create-domain \
--domain-name my-domain \
--domain-endpoint-options '{"TLSSecurityPolicy": "Policy-Min-TLS-1-2-PFS-2023-10"}' <other config options>

For existing domains, you can update the domain configuration to use the new TLS policy by running the update-domain-config AWS CLI command:

aws opensearch update-domain-config \
--domain-name my-domain \
--domain-endpoint-options '{"TLSSecurityPolicy": "Policy-Min-TLS-1-2-PFS-2023-10"}'

Client-side considerations

Most modern clients and libraries should support TLS 1.3 and TLS 1.2 with PFS out of the box. However, if you encounter issues or compatibility concerns, you might need to update your client libraries or configurations to enable support for the new TLS policy.

Conclusion

The new Policy-Min-TLS-1-2-PFS-2023-10 security policy for OpenSearch Service offers significant improvements in security and performance. By supporting TLS 1.3 and TLS 1.2 with PFS, this policy helps protect your data in transit and provides faster connection times. We recommend that you start using this new TLS security policy for improved security posture and performance when connecting to your OpenSearch Service domains. To get started, follow the steps outlined in this post to enable the new policy on your existing or new domains.

For more information on the available TLS options and how to configure them, refer to Infrastructure security in Amazon OpenSearch Service.

At Amazon, security is our top priority, and we are continuously working to enhance the security and performance of our services. Stay tuned for more exciting updates!


About the authors

Shubham Kumar is a Software Development Engineer at Amazon OpenSearch Service, specializing in the security domain. He is passionate about developing robust security features to enhance the protection of customer data and infrastructure.

Sachet Alva is a Software Development Manager at Amazon OpenSearch Service, overseeing the infrastructure security and custom package initiatives. His team’s innovations contribute to the enhanced security and flexibility of Amazon OpenSearch Service deployments.

Naveen Negi is a Senior Tech Product Manager for Amazon OpenSearch Service. He works closely with engineering teams and customers to shape the future of OpenSearch Service, making sure it meets evolving security and performance needs.

How Nexthink built real-time alerts with Amazon Managed Service for Apache Flink

Post Syndicated from Nikos Tragaras, Raphaël Afanyan original https://aws.amazon.com/blogs/big-data/how-nexthink-built-real-time-alerts-with-amazon-managed-service-for-apache-flink/

This post is cowritten with Nikos Tragaras and Raphaël Afanyan from Nexthink.

In this post, we describe Nexthink’s journey as they implemented a new real-time alerting system using Amazon Managed Service for Apache Flink. We explore the architecture, the rationale behind key technology choices, and the Amazon Web Services (AWS) services that enabled a scalable and efficient solution.

Nexthink is a pioneering leader in digital employee experience (DEX). With a mission to empower IT teams and elevate workplace productivity, Nexthink’s Infinity platform offers real-time visibility into end user environments, actionable insights, and robust automation capabilities. By combining real-time analytics, proactive monitoring, and intelligent automation, Infinity enables organizations to deliver an optimal digital workspace.

In the past 5 years, Nexthink completed its transformation into a fully-fledged cloud platform that processes trillions of events per day, reaching over 5 GB per second of aggregated throughput. Internally, Infinity comprises more than 300 microservices that use the power of Apache Kafka through Amazon Managed Service for Apache Kafka (Amazon MSK) for data ingestion and intra-service communication. The Nexthink ecosystem includes several hundreds of Micronaut-based Java microservices deployed in Amazon Elastic Kubernetes Service (Amazon EKS). The vast majority of microservices interact with Kafka through the Kafka Streams framework.

Nexthink alerting system

To help you understand Nexthink’s journey toward a new real-time alerting solution, we begin by examining the existing system and the evolving requirements that led them to seek a new solution.

Nexthink’s existing alerting system provides near real-time notifications, helping users detect and respond to critical events quickly. While effective, this system has limitations in scalability, flexibility, and real-time processing capabilities.

Nexthink gathers telemetry data from thousands of customers’ laptops covering CPU usage, memory, software versions, network performance, and more. Amazon MSK and ClickHouse serve as the backbone for this data pipeline. All endpoint data is ingested in Kafka multi-tenant topics, which are processed and finally stored in a ClickHouse database.

Using the current alerting system, clients can define monitoring rules in Nexthink Query Language (NQL), which are evaluated in near real time by polling the database every 15 minutes. Alerts are triggered when anomalies are detected against client-defined thresholds or long-term baselines. This process is illustrated in the following architecture diagram.

Originally, database-polling allowed great flexibility in the evaluation of complex alerts. However, this approach placed heavy stress on the database. As the company grew and supported larger customers with more endpoints and monitors, the database experienced increasingly heavy loads.

Evolution to a new use-case: Real-time alerts

As Nexthink expanded its data collection to include virtual desktop infrastructure (VDI), the need for real-time alerting became even more critical. Unlike traditional endpoints, such as laptops, where events are gathered every 5 minutes, VDI data is ingested every 30 seconds—significantly increasing the volume and frequency of data. The existing architecture relied on database polling to evaluate alerts, running at a 15-minute interval. This approach was inadequate for the new VDI use case, where alerts needed to be evaluated in near real time on messages arriving every 30 seconds. Merely increasing the polling frequency wasn’t a viable option because it would place excessive load on the database, leading to performance bottlenecks and scalability challenges. To meet these new demands efficiently, we shifted to real-time alert evaluation directly on Kafka topics.

Technology options

As we evaluated solutions for our real-time alerting system, we analyzed two main technology options: Apache Kafka Streams and Apache Flink. Each option had benefits and limitations that needed to be considered.

All Nexthink microservices up to that point integrated with Kafka using Apache Kafka Streams. We’ve observed in practice multiple benefits:

  • Lightweight and seamless integration. No need for additional infrastructure.
  • Low latency using RocksDB as a local key-value store.
  • Team expertise. Nexthink teams have been writing microservices with Kafka-streams for a long time and feel very comfortable using it.

In some use cases however, we found that there were important limitations:

  • Scalability – Scalability was constrained by the tight coupling between parallelism of microservices and the number of partitions in Kafka topics. Many microservices had already scaled out to match the partition count of the topics they consumed, limiting their ability to scale further. One potential solution was increasing the partition count. However, this approach introduced significant operational overhead, especially with microservices consuming topics owned by other domains. It required rebalancing the entire Kafka cluster and needed coordination across multiple teams. Additionally, such modifications impacted downstream services, requiring careful reconfiguration of stateful processing. The alternative approach would be to introduce intermediate topics to redistribute workload, but this would add complexity to the data pipeline and increase resource consumption on Kafka. These challenges made it clear that a more flexible and scalable approach was needed.
  • State management – Services that needed to create large K-tables in memory had an increased startup time. Also, in cases where the internal state was large in volume, we found that it applied significant load to the Kafka cluster during the creation of the internal state.
  • Late event processing – In windowing operations, late events had to be managed manually with techniques that complexified the codebase.

Seeking an alternative that could help us overcome the challenges posed by our current system, we decided to evaluate Flink. Its robust streaming capabilities, scalability, and flexibility made it an excellent choice for building real-time alerting systems based on Kafka topics. Several advantages made Flink particularly appealing:

  • Native integration with Kafka – Flink offers native connectors for Kafka, which is a central component in the Nexthink ecosystem.
  • Event-time processing and support for late events – Flink allows messages to be processed based on the event time (that is, when the event actually occurred) even if they arrive out of order. This feature is crucial for real-time alerts because it guarantees their accuracy.
  • Scalability – Flink’s distributed architecture allows it to scale horizontally independently from the number of partitions in the Kafka topics. This feature weighed a lot in our decision-making because the dependence on the number of partitions was a strong limitation in our platform up to this point.
  • Fault tolerance – Flink supports checkpoints, allowing managed state to be persisted and ensuring consistent recovery in case of failures. Unlike Kafka Streams, which relies on Kafka itself for long-term state persistence (adding extra load to the cluster), Flink’s checkpointing mechanism operates independently and runs out-of-band, minimizing the impact on Kafka while providing efficient state management.
  • Amazon Managed Service for Apache Flink – Amazon Managed Service for Apache Flink is a fully managed service that simplifies the deployment, scaling, and management of Flink applications for real-time data processing. By eliminating the operational complexities of managing Flink clusters, AWS enables organizations to focus on building and running real-time analytics and event-driven applications efficiently. Amazon Managed Service for Apache Flink provided us with significant flexibility. It streamlined our evaluation process, which meant we could quickly set up a proof-of-concept environment without getting into the complexities of managing an internal Flink cluster. Moreover, by reducing the overhead of cluster management, it made Flink a viable technology choice and accelerated our delivery timeline.

Solution

After careful evaluation of both options, we chose Apache Flink as our solution due to its superior scalability, robust event-time processing, and efficient state management capabilities. Here’s how we implemented our new real-time alerting system.

The following diagram is the solution architecture.

The first use case was to detect issues with VDI. However, our intention was to build a generic solution that would give us the option to onboard in the future existing use cases currently implemented through polling. We wanted to maintain a common way of configuring monitoring conditions and allow alert evaluation both with polling as well as in real time, depending on the type of device being monitored.

This solution comprises multiple parts:

  • Monitor configuration – Using Nexthink Query Language (NQL), the alerts administrator defines a monitor that specifies, for example:
    • Data source – VDI events
    • Time window – Every 30 seconds
    • Metric – Average network latency, grouped by desktop pool
    • Trigger condition(s) – Latency exceeding 300 ms for a continual period of 5 minutes

This monitor configuration is then stored in an internally developed document store and propagated downstream in a Kafka topic.

  • Data processing using Generic Stream Services– The Nexthink Collector, an agent installed on endpoints, captures and reports various kinds of activities from the VDI endpoints where it’s installed. These events are forwarded to Amazon MSK in one of Nexthink’s production virtual private clouds (VPCs) and are consumed by Java microservices running on Amazon EKS belonging to several domains within Nexthink

One of them is Generic Stream Services, a system that processes the collected events and aggregates them in buckets of 30 seconds. This component works as self-service for all the feature teams in Nexthink and can query and aggregate data from an NQL query. This way, we were able to keep a unified user experience on monitor configuration using NQL, regardless of how alerts were evaluated. This component is broken down into two services:

    • GS processor – Consumes raw VDI session events and applies initial processing
    • GS aggregator – Groups and aggregates the data according to the monitor configuration
  • Real-time monitoring using Flink – Static threshold alerting and seasonal change detection, which identifies variations in data that follow a recurring pattern over time, are the two types of detection that we offer for VDI issues. The system splits the processing between two applications:
    • Baseline application – Calculates statistical baselines with seasonality using time-of-day anomaly algorithm. For example, the latency by VDI client location or the CPU queue length of a desktop pool.
    • Alert application – Generates alerts based on user-defined thresholds when the unexpected values don’t change over time or dynamic thresholds based on baselines, which trigger when a metric deviates from an expected pattern.

The following diagram illustrates how we join VDI metrics with monitor configurations, aggregate data using sliding time windows, and evaluate threshold rules, all within Apache Flink. From this process, alerts are generated and are then grouped and filtered before being processed further by the consumers of alerts.

  • Alert processing and notifications – After an alert is triggered (when a threshold is exceeded) or recovered (when a metric returns to normal levels), the system will assess their impact to prioritize response through the impact processing module. Alerts are then consumed by notification services that deliver messages through emails or webhooks. The alert and impact data are then ingested into a time series database.

Benefits of the new architecture

One of the key advantages of adopting a streaming-based approach over polling was its ease of configuration and management, especially for a small team of three engineers. There was no need for cluster management, so all we needed to do was to provision the service and start coding.

Given our prior experience with Kafka and Kafka Streams and combined with the simplicity of a managed service, we were able to quickly develop and deploy a new alerting system without the overhead of complex infrastructure setup. We used Amazon Managed Service for Apache Flink to spin up a proof of concept within a few hours, which meant the team could focus on defining the business logic without having concerns related to cluster management.

Initially, we were concerned about the challenges of joining multiple Kafka topics. With our previous Kafka Streams implementation, joined topics required identical partition keys, a constraint known as co-partitioning. This created an inflexible architecture, particularly when integrating topics across different business domains. Each domain naturally had its own optimal partitioning strategy, forcing difficult compromises.

Amazon Managed Service for Apache Flink solved this problem through its internal data partitioning capabilities. Although Flink still incurs some network traffic when redistributing data across the cluster during joins, the overhead is practically negligible. The resulting architecture is both more scalable (because topics can be scaled independently based on their specific throughput requirements) and easier to maintain without complex partition alignment concerns.

This significantly improved our ability to detect and respond to VDI performance degradations in real time while keeping our architecture clean and efficient.

Lessons learnt

As with any new technology, adopting Flink for real-time processing came with its own set of challenges and insights.

One of the primary difficulties we encountered was observing Flink’s internal state. Unlike Kafka Streams, where the internal state is by default backed by a Kafka topic from which its content can be visualized, Flink’s architecture makes it inherently difficult to inspect what is happening inside a running job. This required us to invest in robust logging and monitoring strategies to better understand what is happening during the execution and debug issues effectively.

Another critical insight emerged around late event handling—specifically, managing events with timestamps that fall within a time-window’s boundaries but arrive after that window has closed. Amazon Managed Service for Apache Flink addresses this challenge through its built-in watermarking mechanism. A watermark is a timestamp-based threshold that indicates when Flink should consider all events before a specific time to have arrived. This allows the system to make informed decisions about when to process time-based operations like window aggregations. Watermarks flow through the streaming pipeline, enabling Flink to track the progress of event time processing even with out-of-order events.

Although watermarks provide a mechanism to manage late data, they introduce challenges when dealing with multiple input streams operating at different speeds. Watermarks work well when processing events from a single source but can become problematic when joining streams with varying velocities. This is because they can lead to unintended delays or premature data discards. For example, a slow stream can hold back processing across the entire pipeline, and an idle stream might cause premature window closing. Our implementation required careful tuning of watermark strategies and allowable lateness parameters to balance processing timeliness with data completeness.

Our transition from Kafka Streams to Apache Flink proved smoother than initially anticipated. Teams with Java backgrounds and prior experience with Kafka Streams found Flink’s programming model intuitive and easy to use. The DataStream API offers familiar concepts and patterns, and Flink’s more advanced features could be adopted incrementally as needed. This gradual learning curve gave our developers the flexibility to become productive quickly, focusing first on core stream processing tasks before moving on to more advanced concepts like state management and late event processing.

The future of Flink in Nexthink

Real-time alerting is now deployed to production and available to our clients. A major success of this project was the fact that we successfully introduced a technology as an alternative to Kafka streams, with very little management requirements, guaranteed scalability, data-management flexibility, and comparable cost.

The impact on the Nexthink alerting system was significant because we no longer have a single evaluating alert through database polling. Therefore, we’re already assessing the timeframe for onboarding other alerting use cases to real-time evaluation with Flink. This will alleviate database load and will also provide more accuracy on the alert triggering.

Yet the impact of Flink isn’t limited to the Nexthink alerting system. We now have a proven production-ready alternative for services that are limited in terms of scalability due to the number of partitions of the topics they are consuming. Thus, we’re actively evaluating the option to convert more services to Flink to allow them to scale out more flexibly.

Conclusion

Amazon Managed Service for Apache Flink has been transformative for our real-time alerting system at Nexthink. By handling the complex infrastructure management, AWS enabled our team to deploy a sophisticated streaming solution in less than a month, keeping our focus on delivering business value rather than managing Flink clusters.

The capabilities of Flink have proven it to be more than an alternative to Kafka Streams. It’s become a compelling first choice for both new projects and existing feature refactoring. Windowed processing, late event management, and stateful streaming operations have made complex use cases remarkably straightforward to implement. As our development teams continue to explore Flink’s potential, we’re increasingly confident that it will play a central role in Nexthink’s real-time data processing architecture moving forward.

To get started with Amazon Managed Service for Apache Flink, explore the getting started resources and the hands-on workshop. To learn more about Nexthink’s broader journey with AWS, visit the blog post on Nexthink’s MSK-based architecture.


About the authors

Nikos Tragaras is a Principal Software Architect at Nexthink with around two decades of experience in building distributed systems, from traditional architectures to modern cloud-native platforms. He has worked extensively with streaming technologies, focusing on reliability and performance at scale. Passionate about programming, he enjoys building clean solutions to complex engineering problems

Raphaël Afanyan is a Software Engineer and Tech Lead of the Alerts team at Nexthink. Over the years, he has worked on designing and scaling data processing systems and played a key role in building Nexthink’s alerting platform. He now collaborates across teams to bring innovative product ideas to life, from backend architecture to polished user interfaces.

Simone Pomata is a Senior Solutions Architect at AWS. He has worked enthusiastically in the tech industry for more than 10 years. At AWS, he helps customers succeed in building new technologies every day.

Subham Rakshit is a Senior Streaming Solutions Architect for Analytics at AWS based in the UK. He works with customers to design and build streaming architectures so they can get value from analyzing their streaming data. His two little daughters keep him occupied most of the time outside work, and he loves solving jigsaw puzzles with them. Connect with him on LinkedIn.

Lorenzo Nicora works as a Senior Streaming Solutions Architect at AWS, helping customers across EMEA. He has been building cloud-centered, data-intensive systems for over 25 years, working across industries both through consultancies and product companies. He has used open source technologies extensively and contributed to several projects, including Apache Flink.

Simplify real-time analytics with zero-ETL from Amazon DynamoDB to Amazon SageMaker Lakehouse

Post Syndicated from Narayani Ambashta original https://aws.amazon.com/blogs/big-data/simplify-real-time-analytics-with-zero-etl-from-amazon-dynamodb-to-amazon-sagemaker-lakehouse/

At AWS re:Invent 2024, we introduced a no code zero-ETL integration between Amazon DynamoDB and Amazon SageMaker Lakehouse, simplifying how organizations handle data analytics and AI workflows. This integration alleviates the traditional challenges of building and maintaining complex extract, transform, and load (ETL) pipelines for transforming NoSQL data into analytics-ready formats, which previously required significant time and resources while introducing potential system vulnerabilities. Organizations can now seamlessly combine the strength of DynamoDB in handling rapid, concurrent transactions with immediate analytical processing through the zero-ETL integration. For example, an ecommerce platform storing user session data and cart information in DynamoDB can now analyze this data in near real time without building custom pipelines. Gaming companies using DynamoDB for player data can instantly analyze user behavior as events occur, enabling real-time insights into game balance and player engagement patterns.

The zero-ETL capability uses built-in change data capture (CDC) to automatically synchronize data updates and schema changes between DynamoDB and SageMaker Lakehouse tables. By using Apache Iceberg format, the integration provides reliable performance with ACID transaction support and efficient large-scale data handling. Data scientists can train ML models on fresh data and data analysts can generate reports using current information, with typical synchronization latency in minutes rather than hours.

In this post, we share how to set up this zero-ETL integration from DynamoDB to your SageMaker Lakehouse environment.

Solution overview

We use a SageMaker Lakehouse catalog, AWS Lake Formation, Amazon Athena, AWS Glue, and Amazon SageMaker Unified Studio for this integration. The following is the reference data flow diagram for the zero-ETL integration.

ref architecture

The workflow consists of the following components:

  1. The recently launched zero-ETL integration capability within the AWS Glue console enables direct integration between DynamoDB and SageMaker Lakehouse, storing data in Iceberg format. This streamlined approach opens up new possibilities for data teams by creating a large-scale open and secure data ecosystem without traditional ETL processing overhead.
  2. When building a SageMaker Lakehouse architecture, you can use an Amazon Simple Storage Service (Amazon S3) based managed catalog as your zero-ETL target, providing seamless data integration without transformation overhead. This approach creates a robust foundation for your SageMaker Lakehouse implementation while maintaining the cost-effectiveness and scalability inherent to Amazon S3 storage, enabling efficient analytics and machine learning workflows.
  3. Organizations can use a Redshift Managed Storage (RMS) based managed catalog when they need high-performance SQL analytics and multi-table transactions. This approach uses RMS for storage while maintaining data in the Iceberg format, providing an optimal balance of performance and flexibility.
  4. After you establish your Lakehouse infrastructure, you can access it through diverse analytics engines, including AWS services like Athena, Amazon Redshift, AWS Glue, and Amazon EMR as independent services. For a more streamlined experience, SageMaker Unified Studio offers centralized analytics management, where you can query your data from a single unified interface.

Prerequisites

In this section, we walk through the steps to set up your solution resources and confirm your permission settings.

Create a SageMaker Unified Studio domain, project, and IAM role

Before you begin, you need an AWS Identity and Access Management (IAM) role for enabling the zero-ETL integration. In this post, we use SageMaker Unified Studio, which offers a unified data platform experience. It automatically manages required Lake Formation permissions on data and catalogs for you.

You have to first create a SageMaker Unified Studio domain, an administrative entity that controls user access, permissions, and resources for teams working within the SageMaker Unified Studio environment. Note down the SageMaker Unified Studio URL after you create the domain. You will be using this URL later to log in to the SageMaker Unified Studio portal and query our data across multiple engines.

Then, you create a SageMaker Unified Studio project, an integrated development environment (IDE) that provides a unified experience for data processing, analytics, and AI development. As part of project creation, an IAM role is automatically generated. This role will be used when you access SageMaker Unified Studio later. For more details on how to create a SageMaker Unified Studio project and domain, refer to An integrated experience for all your data and AI with Amazon SageMaker Unified Studio.

Prepare a sample dataset within DynamoDB

To implement this solution, you need a DynamoDB table that can either be used from your existing resources, or created using the sample data file that you can import from an S3 bucket. For this post, we guide you through importing sample data from an S3 bucket into a new DynamoDB table, providing a practical foundation for the concepts discussed.

To create a sample table in DynamoDB, complete the following steps:

  1. Download the fictitious ecommerce_customer_behavior.csv dataset. This dataset captures customer behavior and interactions on an ecommerce platform.
  2. On the Amazon S3 console, open the S3 bucket used by the SageMaker Unified Studio project.
  3. Upload the CSV file you downloaded.

BDB-4928-image-2.png

  1. Select the uploaded file to view its details page.

  1. Copy the value for S3 URI and make a note of it; you will use this path for the subsequent DynamoDB table creation step.

Create a Dynamo DB table

Complete the following steps to create a DynamoDB table from a file from Amazon S3, using the import from Amazon S3 functionality. Then you can enable the settings on the DynamoDB table required to enable zero-ETL integration.

  1. On the DynamoDB console, select Imports from S3 in the navigation pane.
  2. Select Import from S3.

  1. Enter the S3 URI from previous step for Source S3 URL, select CSV for Import file format, and select Next.

  1. Provide the table name as ecommerce_customer_behavior, the partition key as customer_id, and the sort key as product_id, then select Next.

  1. Use the default table settings, then select Next to review the details.

  1. Review the settings and select Import.

It will take a few minutes for the import status to change from Importing to Completed.

When the import is complete, you should be able to see the table created on the Tables page.

  1. Select the ecommerce_customer_behavior table and select Edit PTIR.

  1. Select Turn on point in time recovery and select Save changes.

This is required for setting up zero-ETL using DynamoDB as source.
On the Backups tab, you should see the status for PITR as On.

  1. Additionally, you need to use a table policy to enable access for zero-ETL integration. On the Permissions tab, and copy the following code under Resource-based policy for table:
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "TablePolicy01",
            "Effect": "Allow",
            "Principal": {
                "Service": "glue.amazonaws.com"
            },
            "Action": [
                "dynamodb:ExportTableToPointInTime",
                "dynamodb:DescribeExport",
                "dynamodb:DescribeTable"
            ],
            "Resource": "*"
        }
    ]
}

This policy uses all the resources, which shouldn’t be used in production workload. To deploy this setup in production, restrict it to only specific zero-ETL integration resources by adding a condition to the resource-based policy.

Now that you have used the Amazon S3 import method to load a CSV file to create a DynamoDB table, you can enable zero-ETL integration on the table.

Validate permission settings

To validate if the catalog permission setting is appropriate, complete the following steps:

  1. On the AWS Glue console, select Databases in the navigation pane.

  1. Check for the database salesmarketing_XXX.

  1. Select Catalog settings in the navigation pane, and save the permissions.

The following code is an example of permissions for catalog settings:

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Principal": {
                "AWS": "arn:aws:iam::<Account-id>:root"
            },
            "Action": "glue:CreateInboundIntegration",
            "Resource": "arn:aws:glue:<region>:<Account-id>:database/salesmarketing_XXX"
        },
        {
            "Effect": "Allow",
            "Principal": {
                "Service": "glue.amazonaws.com"
            },
            "Action": "glue:AuthorizeInboundIntegration",
            "Resource": "arn:aws:glue:<region>:<Account-id>:database/salesmarketing_XXX"
        }
    ]
}

Now you’re ready to create your zero-ETL integration.

Create a zero-ETL integration

Complete the following steps to create a zero-ETL integration:

  1. On the AWS Glue console, select Zero-ETL integrations in the navigation pane.

  1. Select “Create zero-ETL integration” to create a new configuration.

  1. Select Amazon DynamoDB as the source type.

  1. Under Source details, select ecommerce_customer_behavior for DynamoDB table.


  1. Under Target details, provide the following information:
    1. For AWS account, select Use the current account.
    2. For Data warehouse or catalog, enter the account ID of your default catalog.
    3. For Target database, enter salesmarketing_XXX.
    4. For Target IAM role, enter datazone_usr_role_XXX.

  1. Under Output settings, select Unnest all fields and Use primary keys from DynamoDB tables, leave Configure target table name as the default value (ecommerce_customer_behavior), then select Next.

  1. Enter zetl-ecommerce-customer-behavior for Name under Integration details, then select Next.

  1. Select Create and launch integration to launch the integration.

The status should be Creating after the integration is successfully initiated.
The status will change to Active in approximately a minute.

Verify that the SageMaker Lakehouse table exists. This process might take up to 15 minutes to complete, because the default refresh interval from DynamoDB is set to 15 minutes.

Validate the SageMaker Lakehouse table

You can now query your SageMaker Lakehouse table, created through zero-ETL integration, using various query engines. Complete the following steps to verify you can you see the table in SageMaker Unified Studio:

  1. Log in to the SageMaker Unified Studio portal using the single sign-on (SSO) option.

  1. Select your project to view its details page.

  1. Select Data in the navigation pane.

  1. Verify that you can see the Iceberg table in the SageMaker Lakehouse catalog.

Query with Athena

In this section, we show how to use Athena to query the SageMaker Lakehouse table from SageMaker Unified Studio. On the project page, locate the ecommerce_customer_behavior table in the catalog, and on the options menu (three dots), select Query with Athena.

This creates a SELECT query against the SageMaker Lakehouse table in a new window, and you should see the query results as shown in the following screenshot.

Query with Amazon Redshift

You can also query the SageMaker Lakehouse table from SageMaker Unified Studio using Amazon Redshift. Complete the following steps:

  1. Select the connection on the top right.
  2. Select Redshift (Lakehouse) from the list of connections.
  3. Select the awsdatacatalog database.
  4. Select the salesmarketing schema.
  5. Select Choose button.

The results will be shown in the Amazon Redshift Query Editor.

Query with Amazon EMR Serverless

You can query the Lakehouse table using Amazon EMR Serverless, which uses Apache Spark’s processing capabilities. Complete the following steps:

  1. On the project page, select Compute in the navigation pane.
  2. Select Add compute on the Data processing tab to create an EMR Serverless compute associated to the project.

  1. You can create new compute resources or connect to existing resources. For this example, select Create new compute resources.

  1. Select EMR Serverless.

  1. Enter a compute name (for example, Sales-Marketing), select the most recent release of EMR Serverless, and select Add compute.

It will take some time to create the compute.

You should see the status as Started for the compute. Now it’s ready to be used as your compute option for querying through a Jupyter notebook.

  1. Select the Build menu and select JupyterLab.

It will take some time to set up the workspace for running JupyterLab.

After the Jupyter Lab space is set up, you should see a page similar to the following screenshot.

  1. Select the new folder icon to create a new folder.

  1. Name the folder lakehouse_zetl_lab.

  1. Navigate to the folder you just created and create a notebook under this folder.
  1. Select the notebook Python3 (ipykernel) on the Launcher tab, and rename the notebook to query_lakehouse_table.

You can observe that the notebook is showing local Python as default language and compute. The two drop down menus show the connection type and compute for the selected connection type, just above the first cell within the Jupyter notebook.

  1. Select PySpark as the connection, and select the EMR Serverless application as compute.

  1. Enter the following sample code to query the table using Spark SQL:
import sys
from pyspark.sql import SparkSession
from pyspark.sql.functions import *

# Set the current database
spark.catalog.setCurrentDatabase("salesmarketing_XXX")

# Execute SQL query and store results in DataFrame
df = spark.sql("select * from ecommerce_customer_behavior limit 10")

# Display the results
df.show()

You can see the Spark DataFrame results.

Clean up

To avoid incurring future charges, delete the SageMaker domain, DynamoDB table, AWS Glue resources, and other objects created from this post.

Conclusion

This post demonstrated how you can establish a zero-ETL connection from DynamoDB to SageMaker Lakehouse, making your data available in Iceberg format without building custom data pipelines. We showed how you can analyze this DynamoDB data through various compute engines within SageMaker Unified Studio. This streamlined approach alleviates traditional data movement complexities, and enables more efficient data analysis workflows directly from your DynamoDB tables.

Try out this solution for your own use case, and share your feedback in the comments.


About the authors

Narayani Ambashta is an Analytics Specialist Solutions Architect at AWS, focusing on the automotive and manufacturing sector, where she guides strategic customers in developing modern data and AI strategies. With over 15 years of cross-industry experience, she specializes in big data architecture, real-time analytics, and AI/ML technologies, helping organizations implement modern data architectures. Her expertise spans across lakehouse, generative AI, and IoT platforms, enabling customers to drive digital transformation initiatives. When not architecting modern solutions, she enjoys staying active through sports and yoga.

Raj Ramasubbu is a Senior Analytics Specialist Solutions Architect focused on big data and analytics and AI/ML with AWS. He helps customers architect and build highly scalable, performant, and secure cloud-based solutions on AWS. Raj provided technical expertise and leadership in building data engineering, big data analytics, business intelligence, and data science solutions for over 18 years prior to joining AWS. He helped customers in various industry verticals like healthcare, medical devices, life sciences, retail, asset management, car insurance, residential REIT, agriculture, title insurance, supply chain, document management, and real estate.

Yadgiri Pottabhathini is a Senior Analytics Specialist Solutions Architect in the media and entertainment sector. He specializes in assisting enterprise customers with their data and analytics cloud transformation initiatives, while providing guidance on accelerating their Generative AI adoption through the development of data foundations and modern data strategies that leverage open-source frameworks and technologies.

Junpei Ozono is a Sr. Go-to-market (GTM) Data & AI solutions architect at AWS in Japan. He drives technical market creation for data and AI solutions while collaborating with global teams to develop scalable GTM motions. He guides organizations in designing and implementing innovative data-driven architectures powered by AWS services, helping customers accelerate their cloud transformation journey through modern data and AI solutions. His expertise spans across modern data architectures including Data Mesh, Data Lakehouse, and Generative AI, enabling customers to build scalable and innovative solutions on AWS.

Use AI agents and the Model Context Protocol with Amazon SES

Post Syndicated from Zip Zieper original https://aws.amazon.com/blogs/messaging-and-targeting/use-ai-agents-and-the-model-context-protocol-with-amazon-ses/

Amazon Simple Email Service (Amazon SES) delivers a cloud-based email solution that empowers businesses to send emails more efficiently and at a larger scale. Its powerful, scalable platform enables organizations from startups to global brands to send personalized, high-volume email communications while maintaining exceptional deliverability and performance.

Amazon SES caters to a wide range of users, from developers and technical marketing professionals to business communicators. In addition to offering robust programmatic access through APIs and SMTP protocols, Amazon SES provides a comprehensive web console and intuitive dashboards that make email configuration and performance monitoring accessible to users with varying technical backgrounds. Historically, navigating email workflows and configuring advanced email capabilities in Amazon SES has required specialized knowledge, resulting in a learning curve for new users. As seen in many other areas, today’s AI tools can offer more intuitive ways to manage Amazon SES to get the most out of your email communications. We have found, however, that these AI tools occasionally produce inconsistent results, often as a result of the underlying large language model’s (LLM’s) training data.

Recognizing the need for a specialized, service-aware, AI-friendly Amazon SES solution, we are introducing the SESv2 MCP Server, a sample Model Context Protocol (MCP) for Amazon SES. We’ve integrated the SESv2 MCP Server sample with the Amazon SES v2 APIs to provide more precise and reliable AI-assisted use, management, and configuration for Amazon SES.

MCP is an open protocol that enables seamless integration between your AI-powered integrated development environment (IDE) or AI assistant, enriching the capabilities of the AI and enabling you to use Amazon SES using natural language. For more info, see the GitHub repo.

We’ve released the SESv2 MCP Server sample on GitHub and invite current and prospective customers to experiment with it in non-production environments. You can use it with your AI tools to explore ways in which AI can be used with Amazon SES to send emails, check configurations, and review deliverability. We’re interested in learning how you use your AI tools and the SESv2 MCP Server to test out email sending in different services or applications. We’re also curious if new customers find it helpful when configuring and learning about their Amazon SES service. No matter how you use it, we are eager for your feedback, comments, and contributions through the GitHub project’s issues.

Solution overview

You can use the SESv2 MCP Server sample with AI assistant applications like Anthropic’s Claude Desktop. You can also integrate it into MCP-compatible agentic AI coding assistants such as Amazon Q Developer, Amazon Q for command line, Cline, Cursor, and Windsurf. When used as an AI coding assistant, the SESv2 MCP Server sample helps developers add Amazon SES email capabilities to their applications and services using plain, natural language prompting. For recommendations from AWS on how to improve your vibe coding experience, refer to Vibe coding tips and tricks.

After you’ve configured the sample and authenticated with your AWS credentials, you can use natural language in your chosen AI tool. For example, an email marketing manager might want to ask Anthropic’s Claude Desktop “provide me with the status of the verified identities in my SES account, along with any recommendations to improve deliverability.” Someone new to Amazon SES can ask the Amazon Q CLI “create a new Amazon SES configuration set for the octank.com identity, enable it for event publishing for bounces and complaints.” Similarly, the developer of an AI-enabled restaurant booking application might ask the Amazon Q CLI “my application needs to send email confirmation of a customers online booking. Can you walk me thru adding this capability to my app using my SES account?”

As you can see from these examples, although it’s helpful to know a bit about email, and Amazon SES in general, with the help of your AI tool and the SESv2 MCP Server sample, you don’t need to be an email or Amazon SES expert. The combination of your creativity, AI tool, and the SESv2 MCP Server sample empowers even non-developers to create, test, and monitor Amazon SES workflows using natural language.

The SESv2 MCP Server sample release uses the open source Smithy Java project, which is still in development. As such, the SESv2 MCP Server is considered a sample, and we do not recommend employing it for production use. When a stable version is available, we might update this post and the GitHub repository accordingly.

Prerequisites

To follow along with the example use cases, make sure you have the following prerequisites set up:

  • AWS credentials with appropriate permissions.
  • An MCP-compatible LLM client (such as Anthropic’s Claude Desktop, Cline, Amazon Q CLI, or Cursor). For this post, we use the Amazon Q Developer CLI. For installation instructions, refer to Installing Amazon Q for command line.
  • Java 21 (or later) runtime (as required by Smithy Java).
  • Access to GitHub.
  • Git installed locally. For instructions, see Getting Started – Installing Git.

Best practices for using MCPs

To maximize the benefits of MCP-assisted development while maintaining security and code quality, we suggest you follow these essential guidelines:

  • Always review generated code for security implications before deployment
  • Use MCP servers as accelerators, not replacements for developer judgment and expertise
  • Keep MCP servers updated with the latest AWS security best practices
  • Follow the principle of least privilege when configuring AWS credentials
  • Run security scanning tools on generated infrastructure code

Configure the AWS CLI

Use the following command to configure the AWS Command Line Interface (AWS CLI) with the AWS credentials for your Amazon SES account and AWS Region:

aws configure

Clone and build the GitHub repository locally

To use macOS or Linux, use the following command to clone and build the GitHub repo:

git clone https://github.com/aws-samples/sample-for-amazon-ses-mcp.git
cd sample-for-amazon-ses-mcp
./build.sh

For Windows, use the following command:

git clone https://github.com/aws-samples/sample-for-amazon-ses-mcp.git
cd sample-for-amazon-ses-mcp
.\build.bat

Copy the absolute path to the .jar file (JAR_PATH_FROM_BUILD_OUTPUT). This will be printed at the end of the build script:

/<your path>/sample-for-amazon-ses-mcp/artifacts/sample-for-amazon-ses-mcp-all.jar

Configure your AI tool to use SESv2 MCP Server

When the build is complete, add SESv2 MCP Server to your AI tool’s MCP configuration:

{
  "mcpServers": {
    "sesv2-mcp-server": {
      "command": "java",
      "args": [
        "-jar",
        "JAR_PATH_FROM_BUILD_OUTPUT"
      ]
    }
  }
}

See MCP configuration for configuration steps. See the Claude Desktop MCP configuration guide for setup instructions.

After you build the SESv2 MCP Server and configure your AWS credentials, you’re ready to interact with Amazon SES. Keep in mind that effective, thoughtful prompting is crucial for successful AI-assisted development. For more information about vibe coding, see Vibe coding tips and tricks.

Example use cases

In this section, we provide some guided examples using the Amazon Q Developer CLI to interact with Amazon SES. Feel free to experiment on your own use cases, and share your comments and ideas through the GitHub project’s issues. Do not disclose any personal, commercially sensitive, or confidential information.

Get information, recommendations, and configurations your Amazon SES account

Open your AI tool; for these examples, we use a macOS terminal and initiate a chat session with the Amazon Q CLI:

q chat

We’ve found it useful to provide your AI tool with some guidance:

You're connected to the SESv2 MCP Server and have access to the AWS SESv2 APIs.

Ask the Amazon Q CLI about your AWS account’s SES email identities:

Tell me about the identities in my account, and also if the account is in the SES sandbox?

The Amazon Q CLI will request permission to use the SESv2 MCP Server (which provides the Amazon Q CLI with the SESv2 APIs ListEmailIdentities and GetAccount) to query your AWS SES account and reply with a detailed summary.

Ask the Amazon Q CLI if it has any recommendations related to improving deliverability for your Amazon SES account:

Do you have any recommendations to improve email deliverability for my SES account?

The Amazon Q CLI will use the SESv2 MCP Server (which provides the CLI with the SESv2 API ListRecommendations) to query your Amazon SES account and reply with a detailed summary.

Ask the Amazon Q CLI to set up Amazon SES click tracking for one of your domains. We have found it helpful to remind the CLI that it has access to additional knowledge of the AWS service APIs. It’s also a good idea to make sure the AI tool doesn’t invent nonexistent APIs.

You also have access to other AWS service APIs via the AWS CLI and your general knowledge, but you may only use known, documented APIs - do not invent or create any APIs or commands.
Set up Amazon SES click tracking with CloudWatch integration for the domain <my verified identity> to monitor email metrics. Use Amazon's default tracking domain (no SSL or https) for the click tracking to ensure immediate functionality without requiring custom domain setup. Include all necessary configuration steps and verify the setup works correctly. Create a test HTML email to <my email address> from <no-reply@verified domain> with subject "Testing SES click tracking". Create an HTML (with fallback to text) body with links and short descriptions taken from the public AWS webpages for Amazon SES, AWS End User Messaging and Amazon Connect. 

Send emails with your Amazon SES account

Using its knowledge of Amazon SES from the SESv2 MCP Server and permissions to use your Amazon SES account (aws configure), you can use your AI tool to create and send emails using Amazon SES.

If your Amazon SES account is in the Amazon SES sandbox, you are limited to sending and receiving email from verified email addresses. You are also limited to 200 messages in 24 hours. For more information about the Amazon SES sandbox, see Request production access (Moving out of the Amazon SES sandbox). If you’re in the sandbox, you can simply ask your AI tool “verify my email address <[email protected]>.”

Ask the Amazon Q CLI to send a test email with a sample HTML body:

Send a test email to <my verified email address> from <verified SES email identity>. Set the from email display name to "MCP testing". Make the email subject "Test sending an email via SES MCP". Use the information found on the Amazon SES website to create an HTML message body with a few sentences and bullet points about SES. Provide a text version of the message body in case of fallback.

Check your email, where you will receive a response.

You can get creative and ask the Amazon Q CLI to create a formatted email template with personalization using a simple table with email recipients, the product they bought, and their postal code:

Use the table below to send each person in the table an html formatted (with fallback) email message. 
-- table --
email,name,product,zipcode
<my verified email address>,Alice,an umbrella,98101
<my verified email address>,Bob,lots of sunscreen,10001
-- end table --
Use the template below. Create a 5-day weather forecast graphic similar to popular weather app graphics based on estimated weather for their ZIP code.
-- template --
"Hi {{name}}, thanks for buying {{product}}; it looks like you'll need it soon based on the 5-day weather forecast for your local area: <5-day weather forecast graphic>.

As we’ve demonstrated, you don’t need to be a seasoned developer to create and test Amazon SES workflows when you have an AI tool and the SESv2 MCP Server sample.

Conclusion

The SESv2 MCP Server sample democratizes the ability to configure, manage, and create sophisticated email automation workflows with Amazon SES.

The examples and guidance in this post demonstrate how even newcomers can use AI tools like the Amazon Q CLI to test out configuring, monitoring, and sending emails with Amazon SES using natural language. More technical users, including developers, can use the SESv2 MCP Server sample to build and test intelligent email applications that use Amazon SES, or to test out building Amazon SES sending into their own application.

We hope you will experiment with the SESv2 MCP Server sample and provide us with your thoughts and feedback, and perhaps contribute to the project through the GitHub project’s issues.

Additional resources

Optimizing ODCR usage through AI-powered capacity insights

Post Syndicated from Ankush Goyal original https://aws.amazon.com/blogs/compute/optimizing-odcr-usage-through-ai-powered-capacity-insights/

Efficient resource management is crucial for organizations seeking to optimize cloud costs while making sure of seamless access to compute capacity. Amazon Elastic Compute Cloud (Amazon EC2) On-Demand Capacity Reservations (ODCRs) provide the flexibility to reserve compute capacity within a specific Availability Zone (AZ) for any duration. This makes sure that critical workloads always have the necessary resources available, minimizing the risk of capacity shortages.

However, managing ODCRs across multiple teams and accounts presents several challenges:

  • Limited visibility across teams: In large organizations, tracking ODCR usage across teams and business units can be difficult, often leading to underused or duplicate reservations.
  • Complexity in optimization: Without clear insights, adjusting, releasing, or modifying reservations to align with changing workloads becomes a cumbersome process.
  • Cost management challenges: Unused or oversized ODCRs contribute to unnecessary expenses, making it essential to continuously monitor and optimize usage.

In this post, we demonstrate how Amazon Bedrock Agents can help organizations gain actionable insights into ODCR usage across their Amazon Web Services (AWS) environment. Unlike traditional approaches that rely on predefined patterns or manual tracking, this solution dynamically retrieves the latest ODCR usage data and provides intelligent, query-based recommendations to fulfill the capacity needs. This solution is serverless and pay-per-use and incurs minimal operational cost.

This approach allows IT leaders, cloud architects, and finance teams to optimize reservations, control costs, and enhance overall resource management—without the complexity of traditional analysis methods.

Solution overview

This solution addresses two specific use cases:

  1. The team must create a new ODCR to accommodate an upcoming project.
  2. The team needs to expand the capacity of their existing ODCR.

The system consists of three essential components:

  • ODCRSupervisorAgent: Functions as the main coordinator, handling user inquiries and managing requests to specialized subordinate agents.
  • CapacityPlanningAgent: Reviews existing ODCRs throughout the organization, recommending SPLIT and MOVE operations to optimize capacity allocation for new projects.
  • AugmentationAgent: Identifies opportunities to increase existing ODCR capacity by recommending MOVE operations from other organizational ODCRs.

The following figure shows the high-level architecture for this solution.

The system architecture uses AWS services for comprehensive ODCR management, as shown in the following figure. The platform combines Amazon Cognito for authentication, AWS Amplify for front-end delivery, and AWS Lambda for serverless computing. AWS Resource Explorer enables efficient discovery and tracking of ODCRs across AWS Regions, while Lambda functions periodically query Resource Explorer to retrieve detailed ODCR information and usage metrics. AWS Identity and Access Management (IAM) plays a crucial role in securing the system, managing fine-grained access controls and permissions across all components. Amazon Bedrock Agents and its multi-agent capabilities generate intelligent recommendations through coordinated agent interactions, allowing organizations to optimize ODCR usage and implement data-driven capacity planning strategies.

How the solution functions

At the heart of this system is an Amazon Bedrock Agent powered by Action Groups, which map user queries to backend tasks using Lambda. These Lambda functions act as the system’s intelligence layer: fetching, filtering, and formatting ODCR data for the agent to reason over. The following three Lambda functions are deployed as part of the AWS CloudFormation template:

get_az_mapping_info Lamba function: This Lambda function takes an AWS account ID and Region as input. It returns a mapping of AZ names to their corresponding AZ IDs for that account and AWS Region. This helps the Capacity Planning Agent align AZs across accounts.

find_eligiable_odcrs Lambda function: This Lambda function supports the Capacity Planning Agent by identifying suitable ODCRs that meet the specified instance type and AZ. It uses Resource Explorer to search for ODCRs across the organization. Then, the function assumes cross-account roles to access ODCRs in different AWS accounts to gather detailed information. After retrieving the necessary data, it filters the ODCRs based on instance type, tenancy, and AZ. Finally, it returns a formatted list of eligible ODCRs, which are sourced from either the specified account or an alternative account with adequate capacity.

find_eligible_odcrs_for_move Lambda function: This Lambda function assists the Augmentation Agent by finding all capacity reservations within the same account as the specified ODCR. It uses Resource Explorer to discover ODCRs and, when necessary, assumes cross-account roles to gather ODCR information. The function filters ODCRs based on matching instance type, tenancy, and AZ, then returns a formatted list of eligible ODCRs that can provide more capacity through MOVE operations.

Each Lambda function receives structured inputs from the Amazon Bedrock Agent, executes its designated task, and returns structured responses. Then, these responses are used by the agent to generate intelligent, actionable answers to user queries.

Together, these Lambda functions extend the capabilities of the Amazon Bedrock Agent by integrating with external data sources and automating complex ODCR management logic—allowing the system to deliver accurate, dynamic recommendations.

Prerequisites

You must have the following in place to implement the solution in this post:

Deploy solution resources using CloudFormation

This solution is designed to be deployed in a single Region. If deploying outside us-east-1, then you must modify the CloudFormation parameters to reference the correct FM version and adjust for Regional availability, such as any necessary cross-Region inference profile configurations.

  • Deploy the Cross Account IAM Role template: To access ODCR data across your Organization, deploy a CloudFormation StackSet from your payer account. During deployment, you must provide the account number where you plan to deploy the ODCR solution. This is the same account number that you used as delegate account for Resource Explorer under the prerequisites. This account number is added to the trust policy of the IAM role, allowing the solution in that account to assume the role created by the StackSet. The deployment automatically provisions a standardized IAM role with the necessary permissions (for example ec2:DescribeCapacityReservations and ec2:DescribeAvailabilityZones) in each target account.
  • Deploy the ODCR Solution template: Deploy the provided template in the delegated administrator account that you chose for Resource Explorer. This sets up the core components of the solution and integrates with your payer account for organization-wide ODCR insights.

Resources deployed by the CloudFormation template

After you have deployed the CloudFormation template, the following AWS resources are created in your account.

Cross Account IAM Role template

  • IAM resource
    • CrossAccountODCRAccessRole

ODCR Solution template

  • Amazon Cognito resources:
    • User Pool: ODCRAgentUserPool
    • App Client: ODCRAgentApp
    • Identity Pool: odcr-identity-pool
    • Groups: ODCRAdmins
    • User: ODCR User
  • IAM resources:
    • IAM roles:
      • GetAZMappingFunctionRole
      • LambdaODCRAccessRole
      • BedrockAgentExecutionRole
      • ODCRSupervisorAgentExecutionRole
      • CognitoAuthRole
  • Lambda functions:
    • get_az_mapping_info
    • find_eligible_odcrs
    • find_eligible_odcrs_for_move
  • Amazon Bedrock Agents
    • ODCRSupervisorAgent
    • CapacityPlanningAgent with action groups:
      • get_az_mapping_info
      • find_eligible_odcrs
    • AugmentationAgent with action group:
      • find_eligible_odcrs_for_move

After you deploy the final CloudFormation template, copy the following from the Outputs tab on the CloudFormation console to use during the configuration of your application after it’s deployed in Amplify:

  • AWSRegion
  • BedrockAgentAliasId
  • BedrockAgentId
  • BedrockAgentName
  • IdentityPoolId
  • UserPoolClientId
  • UserPoolId

The following screenshot shows you what the Outputs tab looks like.

Deploy the Amplify application for front-end

You need to manually deploy the Amplify application using the front-end code found on GitHub. Complete the following steps, as shown in the following figure:

  1. Download the front-end code AWS-Amplify-Frontend.zip from GitHub.
  2. Use the .zip file to manually deploy the application in Amplify.
  3. Return to the Amplify page and use the domain that it automatically generated to access the application.

After completing these steps, your environment is ready to analyze ODCR usage and receive recommendations powered by Amazon Bedrock Agents.

Solution walkthrough

After deploying the application in Amplify, return to the Amplify console. Use the auto-generated domain URL shown there to access the front-end. Upon accessing the application URL, you’re prompted to provide information related to Amazon Cognito and Amazon Bedrock Agents. This information is necessary to securely authenticate users and allow the front end to interact with the Amazon Bedrock Agent. It enables the application to manage user sessions and make authorized API calls to AWS services on behalf of the user.

You can enter information with the values that you collected from the CloudFormation stack outputs, as shown in the following screenshot:

A temporary password was automatically generated during deployment and sent to the email address provided when launching the CloudFormation template. At first sign-in, you’re prompted to reset your password.When you’re signed in, you can begin interacting with the application to analyze and manage your ODCR usage. The interface supports natural language queries to streamline capacity planning. The following are some example questions that you can ask:

  • I require [X units] of capacity for the [INSTANCE TYPE] instance type in Availability Zone [AZ ID], within account [ACCOUNT NUMBER]. Could you please advise if there are any existing On-Demand Capacity Reservations (ODCRs) that can fulfill this requirement? Additionally, what operations would be necessary to utilize the available capacity?

  • I have a requirement to add [X additional units] of capacity to an existing On-Demand Capacity Reservation with ID [cr-0012abcd123456789]. Could you please advise on the steps or operations required to fulfill this request?

Cleaning up

If you decide to discontinue using this solution, then you can follow these steps to remove it, its associated resources deployed using CloudFormation, and the Amplify deployment:

  1. Delete the CloudFormation StackSet:
    1. On the CloudFormation console, choose StackSets in the navigation pane.
    2. Locate your StackSet for the Cross-Account IAM Role (you assigned a name to it).
    3. Choose Delete stacks from StackSet to remove all stack instances.
    4. After all instances are deleted, choose StackSet.
    5. Choose Delete StackSet from the Actions menu.
  2. Delete the CloudFormation stack:
    1. On the CloudFormation console, choose Stacks in the navigation pane.
    2. Locate the stack you created during the deployment process (you assigned a name to it).
    3. Choose the stack and choose Delete.
  3. Delete the Amplify application and its resources. For instructions, refer to Clean Up Resources.

Conclusion

In this post, we demonstrated how to use Amazon Bedrock Agents and AWS services to build an intelligent ODCR management solution. Combining AWS Resource Explorer for organization-wide visibility, Amazon Bedrock Agents for intelligent recommendations, and a secure front-end powered by AWS Amplify, organizations can now make data-driven decisions about their capacity reservations. This solution helps teams optimize ODCR usage, reduce costs, and efficiently manage compute capacity across their AWS Organization. The artificial intelligence (AI)-powered approach eliminates manual tracking and analysis, allowing teams to focus on strategic capacity planning rather than operational overhead.

Get started today by deploying this solution in your AWS environment, and unlock the power of AI-driven capacity insights for more efficient ODCR management.

Using AWS Glue Data Catalog views with Apache Spark in EMR Serverless and Glue 5.0

Post Syndicated from Aarthi Srinivasan original https://aws.amazon.com/blogs/big-data/using-aws-glue-data-catalog-views-with-apache-spark-in-emr-serverless-and-glue-5-0/

The AWS Glue Data Catalog has expanded its Data Catalog views feature, and now supports Apache Spark environments in addition to Amazon Athena and Amazon Redshift. This enhancement, launched in March 2025, now makes it possible to create, share, and query multi-engine SQL views across Amazon EMR Serverless, Amazon EMR on Amazon EKS, and AWS Glue 5.0 Spark, as well as Athena and Amazon Redshift Spectrum. The multi-dialect views empower data teams to create SQL views one time and query them through supported engines—whether it’s Athena for ad-hoc analytics, Amazon Redshift for data warehousing, or Spark for large-scale data processing. This cross-engine compatibility means data engineers can focus on building data products rather than managing multiple view definitions or complex permission schemes. Using AWS Lake Formation permissions, organizations can share these views within the same AWS account, across different AWS accounts, and with AWS IAM Identity Center users and groups, without granting direct access to the underlying tables. Features of Lake Formation such as fine-grained access control (FGAC) using Lake Formation-tag based access control (LF-TBAC) can be applied to Data Catalog views, enabling scalable sharing and access control across organizations.

In an earlier blog post, we demonstrated the creation of Data Catalog views using Athena, adding a SQL dialect for Amazon Redshift, and querying the view using Athena and Amazon Redshift. In this post, we guide you through the process of creating a Data Catalog view using EMR Serverless, adding the SQL dialect to the view for Athena, sharing it with another account using LF-Tags, and then querying the view in the recipient account using a separate EMR Serverless workspace and AWS Glue 5.0 Spark job and Athena. This demonstration showcases the versatility and cross-account capabilities of Data Catalog views and access through various AWS analytics services.

Benefits of Data Catalog views

The following are key benefits of Data Catalog views for business solutions:

  • Targeted data sharing and access control – Data Catalog views, combined with the sharing capabilities of Lake Formation, enable organizations to provide specific data subsets to different teams or departments without duplicating data. For example, a retail company can create views that show sales data to regional managers while restricting access to sensitive customer information. By applying LF-TBAC to these views, companies can efficiently manage data access across large, complex organizational structures, maintaining compliance with data governance policies while promoting data-driven decision-making.
  • Multi-service analytics integration – The ability to create a view in one analytics service and query it across Athena, Amazon Redshift, EMR Serverless, and AWS Glue 5.0 Spark breaks down data silos and promotes a unified analytics approach. This feature allows businesses to use the strengths of different services for various analytics needs. For instance, a financial institution could create a view of transaction data and use Athena for ad-hoc queries, Amazon Redshift for complex aggregations, and EMR Serverless for large-scale data processing—all without moving or duplicating the data. This flexibility accelerates insights and improves resource utilization across the analytics stack.
  • Centralized auditing and compliance – With views stored in the central Data Catalog, businesses can maintain a comprehensive audit trail of data access across connected accounts using AWS CloudTrail logs. This centralization is crucial for industries with strict regulatory requirements, such as healthcare or finance. Compliance officers can seamlessly monitor and report on data access patterns, detect unusual activities, and demonstrate adherence to data protection regulations like GDPR or HIPAA. This centralized approach simplifies compliance processes and reduces the risk of regulatory violations.

These capabilities of Data Catalog views provide powerful solutions for businesses to enhance data governance, improve analytics efficiency, and maintain robust compliance measures across their data ecosystem.

Solution overview

An example company has multiple datasets containing details of their customers’ purchase details mixed with personally identifiable information (PII) data. They categorize their datasets based on sensitivity of the information. The data steward wants to share a subset of their preferred customers data for further analysis downstream by their data engineering team.

To demonstrate this use case, we use sample Apache Iceberg tables customer and customer_address. We create a Data Catalog view from these two tables to filter by preferred customers. We then use LF-Tags to share restricted columns of this view to the downstream engineering team. The solution is represented in the following diagram.

arch diagram

Prerequisites

To implement this solution, you need two AWS accounts with an AWS Identity and Access Management (IAM) admin role. We use the role to run the provided AWS CloudFormation templates and also use the same IAM roles added as Lake Formation administrator.

Set up infrastructure in the producer account

We provide a CloudFormation template that deploys the following resources and completes the data lake setup:

  • Two Amazon Simple Storage Service (Amazon S3) buckets: one for scripts, logs, and query results, and one for the data lake storage.
  • Lake Formation administrator and catalog settings. The IAM admin role that you provide is registered as Lake Formation administrator. Cross-account sharing version is set to 4. Default permissions for newly created databases and tables is set to use Lake Formation permissions only.
    data catalog settings
  • An IAM role with read, write, and delete permissions on the data lake bucket objects. The data lake bucket is registered with Lake Formation using this IAM role.
    data lake locations
  • An AWS Glue database for the data lake.
  • Lake Formation tags. These tags are attached to the database.
    lf-tags
  • CSV and Iceberg format tables in the AWS Glue database. The CSV tables are pointing to s3://redshift-downloads/TPC-DS/2.13/10GB/ and the Iceberg tables are stored in the user account’s data lake bucket.
  • An Athena workgroup.
  • An IAM role and an AWS Lambda function to run Athena queries. Athena queries are run in the Athena workgroup to insert data from CSV tables to Iceberg tables. Relevant Lake Formation permissions are granted to the Lambda role.
    lf-tables
  • An EMR Studio and related virtual private cloud (VPC), subnet, routing table, security groups, and EMR Studio service IAM role.
  • An IAM role with policies for the EMR Studio runtime. Relevant Lake Formation permissions are granted to this role on the Iceberg tables. This role will be used as the definer role to create the Data Catalog view. A definer role is the IAM role with necessary permissions to access the referenced tables, and runs the SQL statement that defines the view.

Complete the following steps in your producer AWS account:

  1. Sign in to the AWS Management Console as an IAM administrator role.
  2. Launch the CloudFormation stack.

Allow approximately 5 minutes for the CloudFormation stack to complete creation. After the CloudFormation has completed launching, proceed with the following instructions.

  1. If you’re using the producer account in Lake Formation for the first time, on the Lake Formation console, create a database named default and grant describe permission on the default database to runtime role GlueViewBlog-EMRStudio-RuntimeRole.
    data permissions

Create an EMR Serverless application

Complete the following steps to create an EMR Serverless application in your EMR Studio:

  1. On the Amazon EMR console, under EMR Studio in the navigation pane, choose Studios.
  2. Choose GlueViewBlog-emrstudio and choose the URL link of the Studio to open it.
    glueviewblog-emrstudio
  3. On the EMR Studio dashboard, choose Create application.
    emr-studio-dashboard

You will be directed to the Create application page on EMR Studio. Let’s create a Lake Formation enabled EMR Serverless application.

  1. Under Application settings, provide the following information:
    1. For Name, enter a name (for example, emr-glueview-application).
    2. For Type, choose Spark.
    3. For Release version, choose emr-7.8.0.
    4. For Architecture, choose x86_64.
  2. Under Application setup options, select Use custom settings.
  3. Under Interactive endpoint, select Enable endpoint for EMR studio.
  4. Under Additional configurations, for Metastore configuration, select Use AWS Glue Data Catalog as metastore, then select Use Lake Formation for fine-grained access control.
  5. Under Network connections, choose emrs-vpc for VPC, enter any two private subnets, and enter emr-serverless-sg for Security groups.
  6. Choose Create and start the application.

Create an EMR Workspace

Complete the following steps to create an EMR Workspace:

  1. On the EMR Studio console, choose Workspaces in the navigation pane and choose Create Workspace.
  2. Enter a Workspace name (for example, emrs-glueviewblog-workspace).
  3. Leave all other settings as default and choose Create Workspace.
  4. Choose Launch Workspace. Your browser might request to allow pop-up permissions for the first time launching the Workspace.
  5. After the Workspace is launched, in the navigation pane, choose Compute.
  6. For Compute type, select EMR Serverless application and enter emr-glueview-application for the application and GlueViewBlog-EMRStudio-RuntimeRole for Interactive runtime role.
  7. Make sure the kernel attached to the Workspace is PySpark.

Create a Data Catalog view and verify

Complete the following steps:

  1. Download the notebook glueviewblog_producer.ipynb. The code creates a Data Catalog view customer_nonpii_view from the two Iceberg tables, customer_iceberg and customer_address_iceberg, in the database glueviewblog_<account-id>_db.
  2. On your EMR Workspace emrs-glueviewblog-workspace, go to the File browser section and choose Upload files.
  3. Upload glueviewblog_producer.ipynb.
  4. Update the data lake bucket name, AWS account ID, and AWS Region to match your resources.
  5. Update the database_name, table1_name, and table2_name to match your resources.
  6. Save the notebook.
  7. Choose the double arrow icon to restart the kernel and rerun the notebook.

The Data Catalog view customer_nonpii_view is created and verified.

  1. In the navigation pane on the Lake Formation console, under Data Catalog, choose Views.
  2. Choose the new view customer_nonpii_view.
  3. On the SQL definitions tab, verify EMR with Apache Spark shows up for Engine name.
  4. Choose the tab LF-Tags. The view should show the LF-Tag sensitivity=pii-confidential inherited from the database.
  5. Choose Edit LF-Tags.
  6. On the Values dropdown menu, choose confidential to overwrite the Data Catalog view’s key value of sensitivity from pii-confidential.
  7. Choose Save.

With this, we have created a non-PII view to share with the data engineering team from the datasets that has PII information of customers.

Add Athena SQL dialect to the view

With the view customer_nonpii_view having been created by the EMR runtime role GlueViewBlog-EMRStudio-RuntimeRole, the Admin will have only describe permissions on it as a database creator and Lake Formation administrator. In this step, the Admin will grant itself alter permissions on the view, in order to add the Athena SQL dialect to the view.

  1. On the Lake Formation console, in the navigation pane, choose Data permissions.
  2. Choose Grant and provide the following information:
    1. For Principals, enter Admin.
    2. For LF-Tags or catalog resources, select Resources matched by LF-Tags.
    3. For Key, choose sensitivity.
    4. For Values, choose confidential and pii-confidential.
    5. Under Database permissions, select Super for Database permissions and Grantable permissions.
    6. Under Table permissions, select Super for Table permissions and Grantable permissions.
    7. Choose Grant.
  3. Verify the LF-Tags based permissions the Admin.
  4. Open the Athena query editor, choose the Workgroup GlueViewBlogWorkgroup and choose the AWS Glue database glueviewblog_<accountID>_db.
  5. Run the following query. Replace <accountID> with your account ID.
    ALTER VIEW glueviewblog_<accountID>_db.customer_nonpii_view ADD DIALECT
    AS
    select c_customer_id, c_customer_sk, c_last_review_date, ca_country, ca_location_type
    from glueviewblog__<accountID>_db.customer_iceberg, glueviewblog__<accountID>_db.customer_address_iceberg
    where c_current_addr_sk = ca_address_sk and c_preferred_cust_flag='Y';

  6. Verify the Athena dialect by running a preview on the view.
  7. On the Lake Formation console, verify the SQL dialects on the view customer_nonpii_view.

Share the view to the consumer account

Complete the following steps to share the Data Catalog view to the consumer account:

  1. On the Lake Formation console, in the navigation pane, choose Data permissions.
  2. Choose Grant and provide the following information:
    1. For Principals, select External accounts and enter the consumer account ID.
    2. For LF-Tags or catalog resources, select Resources matched by LF-Tags.
    3. For Key, choose sensitivity.
    4. For Values, choose confidential.
    5. Under Database permissions, select Describe for Database permissions and Grantable permissions.
    6. Under Table permissions, select Describe and Select for Table permissions and Grantable permissions.
    7. Choose Grant.
  3. Verify granted permissions on the Data permissions page.

With this, the producer account data steward has created a Data Catalog view of a subset of data from two tables in their Data Catalog, using the EMR runtime role as the definer role. They have shared it to their analytics account using LF-Tags to run further processing of the data downstream.

Set up infrastructure in the consumer account

We provide a CloudFormation template to deploy the following resources and set up the data lake as follows:

  • An S3 bucket for Amazon EMR and AWS Glue logs
  • Lake Formation administrator and catalog settings similar to the producer account setup
  • An AWS Glue database for the data lake
  • An EMR Studio and related VPC, subnet, routing table, security groups, and EMR Studio service IAM role
  • An IAM role with policies for the EMR Studio runtime

Complete the following steps in your consumer AWS account:

  1. Sign in to the console as an IAM administrator role.
  2. Launch the CloudFormation stack.

Allow approximately 5 minutes for the CloudFormation stack to complete creation. After the CloudFormation has completed launching, proceed with the following instructions.

  1. If you’re using the consumer account Lake Formation for the first time, on the Lake Formation console, create a database named default and grant describe permission on the default database to runtime role GlueViewBlog-EMRStudio-Consumer-RuntimeRole.

Accept AWS RAM shares in the consumer account

You can now log in to the AWS consumer account and accept the AWS RAM invitation:

  1. Open the AWS RAM console with the IAM role that has AWS RAM access.
  2. In the navigation pane, choose Resource shares under Shared with me.

You should see two pending resource shares from the producer account.

  1. Accept both invitations.

Create a resource link for the shared view

To access the view that was shared by the producer AWS account, you need to create a resource link in the consumer AWS account. A resource link is a Data Catalog object that is a link to a local or shared database, table, or view. After you create a resource link to a view, you can use the resource link name wherever you would use the view name. Furthermore, you can grant permission on the resource link to the job runtime role GlueViewBlog-EMRStudio-Consumer-RuntimeRole to access the view through EMR Serverless Spark.

To create a resource link, complete the following steps:

  1. Open the Lake Formation console as the Lake Formation data lake administrator in the consumer account.
  2. In the navigation pane, choose Tables.
  3. Choose Create and Resource link.
  4. For Resource link name, enter the name of the resource link (for example, customer_nonpii_view_rl).
  5. For Database, choose the glueviewblog_customer_<accountID>_db database.
  6. For Shared table region, choose the Region of the shared table.
  7. For Shared table, choose customer_nonpii_view.
  8. Choose Create.

Grant permissions on the database to the EMR job runtime role

Complete the following steps to grant permissions on the database glueviewblog_customer_<accountID>_db to the EMR job runtime role:

  1. On the Lake Formation console, in the navigation pane, choose Databases.
  2. Select the database glueviewblog_customer_<accountID>_db and on the Actions menu, choose Grant.
  3. In the Principles section, select IAM users and roles, and choose GlueViewBlog-EMRStudio-Consumer-RuntimeRole.
  4. In the Database permissions section, select Describe.
  5. Choose Grant.

Grant permissions on the resource link to the EMR job runtime role

Complete the following steps to grant permissions on the resource link customer_nonpii_view_rl to the EMR job runtime role:

  1. On the Lake Formation console, in the navigation pane, choose Tables.
  2. Select the resource link customer_nonpii_view_rl and on the Actions menu, choose Grant.
  3. In the Principles section, select IAM users and roles, and choose GlueViewBlog-EMRStudio-Consumer-RuntimeRole.
  4. In the Resource link permissions section, select Describe for Resource link permissions.
  5. Choose Grant.

This allows the EMR Serverless job runtime roles to describe the resource link. We don’t make any selections for grantable permissions because runtime roles shouldn’t be able to grant permissions to other principles.

Grant permissions on the target for the resource link to the EMR job runtime role

Complete the following steps to grant permissions on the target for the resource link customer_nonpii_view_rl to the EMR job runtime role:

  1. On the Lake Formation console, in the navigation pane, choose Tables.
  2. Select the resource link customer_nonpii_view_rl and on the Actions menu, choose Grant on target.
  3. In the Principles section, select IAM users and roles, and choose GlueViewBlog-EMRStudio-Consumer-RuntimeRole.
  4. In the View permissions section, select Select and Describe.
  5. Choose Grant.

Set up an EMR Serverless application and Workspace in the consumer account

Repeat the steps to create an EMR Serverless application in the consumer account.

Repeat the steps to create a Workspace in the consumer account. For Compute type, select EMR Serverless application and enter emr-glueview-application for the application and GlueViewBlog-EMRStudio-Consumer-RuntimeRole as the runtime role.

Verify access using interactive notebooks from EMR Studio

Complete the following steps to verify access in EMR Studio:

  1. Download the notebook glueviewblog_emr_consumer.ipynb. The code runs a select statement on the view shared from the producer.
  2. In your EMR Workspace emrs-glueviewblog-workspace, navigate to the File browser section and choose Upload files.
  3. Upload glueviewblog_emr_consumer.ipynb.
  4. Update the data lake bucket name, AWS account ID, and Region to match your resources.
  5. Update the database to match your resources.
  6. Save the notebook.
  7. Choose the double arrow icon to restart the kernel with PySpark kernel and rerun the notebook.

Verify access using interactive notebooks from AWS Glue Studio

Complete the following steps to verify access using AWS Glue Studio:

  1. Download the notebook glueviewblog_glue_consumer.ipynb
  2. Open the AWS Glue Studio console.
  3. Choose Notebook and then choose Upload notebook.
  4. Upload the notebook glueviewblog_glue_consumer.ipynb.
  5. For IAM role, choose GlueViewBlog-EMRStudio-Consumer-RuntimeRole.
  6. Choose Create notebook.
  7. Update the data lake bucket name, AWS account ID, and Region to match your resources.
  8. Update the database to match your resources.
  9. Save the notebook.
  10. Run all the cells to verify fine-grained access.

Verify access using the Athena query editor

Because the view from the producer account was shared to the consumer account, the Lake Formation administrator will have access to the view in the producer account. Also, because the lake admin role created the resource link pointing to the view, it will also have access to the resource link. Go to the Athena query editor and run a simple select query on the resource link.

The analytics team in the consumer account was able to access a subset of the data from a business data producer team, using their analytics tools of choice.

Clean up

To avoid incurring ongoing costs, clean up your resources:

  1. In your consumer account, delete AWS Glue notebook, stop and delete the EMR application, and then delete EMR Workspace.
  2. In your consumer account, delete the CloudFormation stack. This should remove the resources launched by the stack.
  3. In your producer account, log in to the Lake Formation console and revoke the LF-Tags based permissions you had granted to the consumer account.
  4. In your producer account, stop and delete the EMR application and then delete the EMR Workspace.
  5. In your producer account, delete the CloudFormation stack. This should delete the resources launched by the stack.
  6. Review and clean up any additional AWS Glue and Lake Formation resources and permissions.

Conclusion

In this post, we demonstrated a powerful, enterprise-grade solution for cross-account data sharing and analysis using AWS services. We walked you through how to do the following key steps:

  • Create a Data Catalog view using Spark in EMR Serverless within one AWS account
  • Securely share this view with another account using LF-TBAC
  • Access the shared view in the recipient account using Spark in both EMR Serverless and AWS Glue ETL
  • Implement this solution with Iceberg tables (it’s also compatible other open table formats like Apache Hudi and Delta Lake)

The solution approach with multi-dialect data catalog views provided in this post is particularly valuable for enterprises looking to modernize their data infrastructure while optimizing costs, improve cross-functional collaboration while enhancing data governance, and accelerate business insights while maintaining control over sensitive information.

Refer to the following information about Data Catalog views with individual analytics services, and try out the solution. Let us know your feedback and questions in the comments section.


About the Authors

Aarthi Srinivasan is a Senior Big Data Architect with Amazon SageMaker Lakehouse. As part of the SageMaker Lakehouse team, she works with AWS customers and partners to architect lake house solutions, enhance product features, and establish best practices for data governance.

Praveen Kumar is an Analytics Solutions Architect at AWS with expertise in designing, building, and implementing modern data and analytics platforms using cloud-based services. His areas of interest are serverless technology, data governance, and data-driven AI applications.

Dhananjay Badaya is a Software Developer at AWS, specializing in distributed data processing engines including Apache Spark and Apache Hadoop. As a member of the Amazon EMR team, he focuses on designing and implementing enterprise governance features for EMR Spark.

Implementing just-in-time privileged access to AWS with Microsoft Entra and AWS IAM Identity Center

Post Syndicated from Rodney Underkoffler original https://aws.amazon.com/blogs/security/implementing-just-in-time-privileged-access-to-aws-with-microsoft-entra-and-aws-iam-identity-center/

Controlling access to your privileged and sensitive resources is critical for all AWS customers. Preventing direct human interaction with services and systems through automation is the primary means of accomplishing this. For those infrequent times when automation is not yet possible or implemented, providing a secure method for temporary elevated access is the next best option. In a privileged access management solution, there are several elements that should be included:

  • User access should follow the principle of least privileged
  • Users should be granted only the minimum amount of access required to perform their job duties
  • Access granted should persist only for the time necessary to perform the assigned tasks
  • The solution should include:
    • An eligibility process for granting access
    • An approval process for granting access
    • Auditing of the access grants and activities taken

Entra Privileged Identity Management (PIM) is a third-party solution that provides dynamic group management, access control, and audit capabilities that integrate with AWS IAM Identity Center.

In this post, we show you how to configure just-in-time access to AWS using Entra PIM’s integration with IAM Identity Center.

Just-in-time privileged access with Entra PIM and IAM Identity Center

Privileged Identity Management is a Microsoft Entra ID feature that enables management, control, and access monitoring of your important cloud resources. There are many different configuration options when it comes to eligibility and assignment to privileged security groups, including time-bound access with start and end dates, multi-factor authentication (MFA) enforcement, justification tracking, and so on. You can read more about those options in Microsoft’s product documentation.

Figure 1 shows the just-in-time access solution powered by Entra PIM group activation requests. In this solution, Entra PIM is integrated with IAM Identity Center to provide temporary, limited access to AWS resources based on user requests and approvals. Entra ID users can submit requests for specific access to specific AWS permissions sets, which are then automatically granted for a set duration.

Figure 1 – Entra PIM solution integrated with IAM Identity Center

Figure 1 – Entra PIM solution integrated with IAM Identity Center

Prerequisites

To try the solution described in this post, you need to have the following in place:

Step-by-step configuration

In the following steps, you create configurations to enable Entra PIM for Groups to automatically assign users to groups based on approval criteria. The groups will be Entra ID security groups that use direct assignment. Note that, at the time of this writing, dynamic groups and groups that you have synchronized from a self-managed Active Directory cannot be used with Microsoft Entra PIM. While it might be possible to also populate these groups using a third-party synchronization tool, for the purposes of this exercise, we assume that administration is occurring solely within Entra ID.

In the example scenario, the role corresponds to a specific job function within your organization. We use a group called AWS – Amazon EC2 Admin, which corresponds to a DevOps on-call site reliability engineer (SRE) lead.

Step 1: Create a group representing a specific privilege level.

Create a group in Entra ID that represents a specific privilege level that your employees can request for access to the AWS Management Console.

  1. Sign in to the Microsoft Entra admin center with your credentials.
  2. Select Groups and then All groups.
  3. Choose New group.
  4. Specify Security in the Group type dropdown list.
    • In the Group name field, enter AWS - Amazon EC2 Admin.
    • In the Group description field, enter Amazon EC2 administrator permissions.
    • Choose Create.

Step 2: Assign access for the group in Entra ID

Now you need to assign the newly created group to your enterprise application.

  1. Sign in to the Microsoft Entra admin center with your credentials
  2. Select Applications and then Enterprise applications and select the IAM Identity Center application that you created.
  3. Select Users and groups from the Manage menu group and select + Add user/group.
  4. Select the None selected option from the Users and groups section.
  5. Select the AWS – Amazon EC2 Admin group checkbox.
  6. Choose Select and then choose Assign.
  7. Select Provisioning from the Manage menu group and begin synchronizing the empty group by selecting the Start provisioning option.

When you first enable provisioning, the initial Microsoft Entra ID sync is triggered immediately. After that, subsequent syncs are triggered every 40 minutes, with the exact frequency depending on the number of users and groups in the application.

When the initial sync completes, the AWS – Amazon EC2 Admin group will be ready for configuration in IAM Identity Center.

Step 3: Create permission sets in IAM Identity Center

As you prepare to configure your permission set, let’s consider session duration from both the AWS and Entra PIM perspectives. There are two session durations on the AWS side: AWS access portal session duration and permission set session duration. The AWS access portal session duration defines the maximum length of time that a user can be signed in to the AWS access portal without reauthenticating. The default session duration is 8 hours but can be configured anywhere between 15 minutes and 7 days.

Note: Entra does not pass the SessionNotOnOrAfter attribute to IAM Identity Center as part of the SAML assertion. Meaning the duration of the AWS access portal session is controlled by the duration set in IAM Identity Center.

The session duration defined within a permission set specifies the length of time that a user can have a session for an AWS account. The default and minimum value is 1 hour (with a maximum value of 12). Entra PIM allows you to configure an activation maximum duration. The activation maximum duration is the length of time that the specified group will contain the activated user account. The activation maximum duration has a default value of 8 hours but can be configured between 30 minutes and 24 hours.

You should carefully consider the values that you provide for each of these durations. The AWS access portal will display permission sets that the user had access to at the time that they signed in for the duration of the active AWS access portal session.

When you set the permission set session duration, you need to keep in mind that active sessions are not terminated when the Entra PIM activation maximum duration has been reached. Let’s look at an example:

  • AWS access portal session duration: default (8 hours)
  • Session duration defined in the permission set: 1 hour
  • Entra PIM group activation maximum duration: 1 hour

You might be inclined to think that an hour after being added to the group in Entra, the user would no longer have access to AWS resources. This is not necessarily the case. A user could authenticate to the AWS access portal, wait up to 8 hours, and still successfully access AWS through the permission set link. Their session would be active for the duration of the session setting defined in the permission set, which is 1 hour in this case. In this example, we have a potential window of access of 10 hours, as shown in Figure 2 that follows.

Figure 2 – Calculating session duration

Figure 2 – Calculating session duration

With this in mind, configure your test environment with the default setting of 8 hours for the AWS access portal and 1 hour for the permission set session duration value.

  1. Open the IAM Identity Center console.
  2. Under Multi-account permissions, choose Permission sets.
  3. Choose Create permission set.
  4. On the Select permission set type page, under Permission set type, select Custom permission set, and then choose Next.
  5. On the Specify policies and permissions boundary page, expand AWS managed policies.
  6. Search for and select AmazonEC2FullAccess policy, and then choose Next.
  7. On the Specify permission set details page, enter EC2AdminAccess for the Permission set name and choose Next.
  8. On the Review and create page, review the selections, and choose Create.

Step 4: Assign group access in your organization

At this point, you’re ready to assign the Microsoft Entra group to the corresponding permission set in IAM Identity Center. This allows users who are members of the group to be granted the appropriate access level in AWS.

  1. In the navigation pane, under Multi-account permissions, choose AWS accounts.
  2. On the AWS accounts page, select the check box next to one or more AWS accounts to which you want to assign access.
  3. Choose Assign users or groups.
  4. On the Groups tab, select AWS – Amazon EC2 Admin and choose Next
  5. On the Assign permission sets to “<AWS-account-name>” page, select the EC2AdminAccess permission set.
  6. Check that the correct permission set was selected and choose Next.
  7. On the Review and submit page, verify that the correct group and permission set are selected, and choose Submit.

Step 5: Configure Entra PIM

To use this Microsoft Entra group with Entra PIM, you bring the group under the management of PIM by using the Entra admin console to activate the group. You can read more about group management with PIM in the Microsoft documentation. Begin by activating the Entra group that you created.

  1. Sign in to the Microsoft Entra admin center with your credentials.
  2. Select Groups and then All groups
  3. Select the AWS – Amazon EC2 Admin group.
    Figure 3 – Selecting groups for PIM enablement

    Figure 3 – Selecting groups for PIM enablement

  4. Select Privileged Identity Management under the Activity menu list.
  5. Choose Enable PIM for this group.
    Figure 4 – Enable PIM for this group

    Figure 4 – Enable PIM for this group

Now, you will configure the PIM settings for the group. These settings define Member or Owner properties and requirements. It’s here that you can establish MFA requirements, configure notifications, conditional access, approvals, durations, and so on. The Owner role can elevate their permissions using just-in-time access to manage a group, while the Member role is limited to requesting just-in-time membership within the group. In this example, you use the Member properties to demonstrate group membership level temporary elevated access and set a 1-hour duration for the group assignment.

  1. Sign in to the Microsoft Entra admin center with your credentials.
  2. Select Identity Governance, Privileged Identity Management, and then Groups.
  3. Select the AWS – Amazon EC2 Admin group.
    Figure 5 – Selecting groups for PIM configuration

    Figure 5 – Selecting groups for PIM configuration

  4. From the Manage menu select Settings.
  5. Choose Member to view the default role setting details.
    Figure 6 – Settings option for the Member role

    Figure 6 – Settings option for the Member role

  6. Review the default settings. The activation maximum duration should be set to 1 hour and require a justification from the user.
  7. Close the Role setting details – Member blade.
    Figure 7 – Closing the Role setting details – Member blade

    Figure 7 – Closing the Role setting details – Member blade

  8. From the Manage menu select Assignments and choose + Add assignments.
    Figure 8 – Adding eligibility assignments to the PIM enabled groups

    Figure 8 – Adding eligibility assignments to the PIM enabled groups

  9. Select Member from the Select role dropdown menu and choose No member selected. Select the test account, Rich Roe in this example, and then choose Select.
    Figure 9 – Adding the test user as an eligible identity for PIM activation to the group

    Figure 9 – Adding the test user as an eligible identity for PIM activation to the group

  10. Choose Next and leave the default setting of 1 year of eligibility. Duration eligibility defines the period that the user can request activation for the group. Depending on your use case, you will define this as permanent or for a set period. For testing purposes, keep the default setting. Choose Assign.
    Figure 10 – Completing the eligibility assignment

    Figure 10 – Completing the eligibility assignment

Test the configuration

You should now have a test configuration of Entra PIM and IAM Identity Center. Use the test account to verify just-in-time access.

  1. Sign in to the Microsoft Entra admin center using the test account (Rich Roe in this example).
  2. Select Identity Governance, Privileged Identity Management, and then My roles.
    Figure 11 – Browsing to the My Roles section of the Entra admin center

    Figure 11 – Browsing to the My Roles section of the Entra admin center

  3. From the Activate menu list, select Groups. Your eligible group assignments should be listed.
  4. Choose Activate for the AWS – Amazon EC2 Admin group.
    Figure 12 – Activating the just-in-time group membership

    Figure 12 – Activating the just-in-time group membership

  5. In the Activate – Member blade, enter a justification for the access request and choose Activate.
    Figure 13 – Providing a justification for access

    Figure 13 – Providing a justification for access

In this example, there are no approval workflow processes configured for the group, so Entra validates the eligibility requirements and adds the test account to the AWS – Amazon EC2 Admin group. If you want to dive deeper into the approval workflow process, you can read more about it on the Configure PIM for Groups settings page. Because the group is assigned to the enterprise application and configured for provisioning, the updated group membership is automatically synchronized using the SCIM protocol with the connected IAM Identity Center instance. The provisioning time can vary based on the number of PIM enabled users that are activating their memberships within a given 10-second period. In most situations, group memberships are synchronized within 2–10 minutes, but can revert to the standard 40-minute interval if activity runs up against Entra PIM throttling limits. IAM Identity Center responds to SCIM requests as they arrive from Entra ID.

To test access with the newly activated group assignment, use a separate browser or a private window.

  1. Sign in to the My Apps portal with the test user credentials and select the IAM Identity Center app that you created for testing. If you experience an error or don’t see the expected permission set, wait a few minutes until the group membership has synchronized to IAM Identity Center and try again.
    Figure 14 – Accessing IAM Identity Center through the My apps portal

    Figure 14 – Accessing IAM Identity Center through the My apps portal

  2. Expand the associated AWS account and confirm the EC2ReadOnly permission set has been granted.
  3. Close the AWS tab. Wait for the access to be revoked, which has been set to 60 minutes in this example.
    Figure 15 – Just-in-time access to the EC2AdminAccess permission set

    Figure 15 – Just-in-time access to the EC2AdminAccess permission set

  4. Sign back in to the My Apps portal and select the AWS IAM Identity Center app. Notice that the EC2ReadOnly permission set has been revoked.

Conclusion

The combination of AWS IAM Identity Center and Entra PIM provides a robust solution for managing just-in-time elevated access to AWS. By using security groups in Entra and mapping them to permission sets in IAM Identity Center, you can automate the provisioning and deprovisioning of privileged access based on defined policies and approval workflows. This approach helps to make sure the principle of least privilege is enforced, with access granted only for the duration required to complete a task. The detailed auditing capabilities of both services also provide comprehensive visibility into privileged access activities.

For AWS customers seeking a comprehensive, secure, and scalable privileged access management solution, the Entra PIM and IAM Identity Center integration is a common option that’s worth investigating to see if it’s a good fit for your use case.

If you have feedback about this post, submit comments in the Comments section below. If you have questions about this post, contact AWS Support.

Rodney Underkoffler

Rodney Underkoffler

Rodney is a Senior Solutions Architect at Amazon Web Services, focused on guiding enterprise customers on their cloud journey. He has a background in infrastructure, security, and IT business practices. He is passionate about technology and enjoys building and exploring new solutions and methodologies.

Aidan Keane

Aidan Keane

Aidan is a Senior Specialist Solutions Architect at Amazon Web Services, focused on Microsoft Workloads. He partners with enterprise customers to optimize their Microsoft environments on AWS and accelerate their cloud journey. Outside of work, he is a sports enthusiast who enjoys golf, biking, and watching Liverpool FC, while also enjoying family time and travelling to Ireland and South America.

Build a centralized observability platform for Apache Spark on Amazon EMR on EKS using external Spark History Server

Post Syndicated from Sri Potluri original https://aws.amazon.com/blogs/big-data/build-a-centralized-observability-platform-for-apache-spark-on-amazon-emr-on-eks-using-external-spark-history-server/

Monitoring and troubleshooting Apache Spark applications become increasingly complex as companies scale their data analytics workloads. As data processing requirements grow, enterprises deploy these applications across multiple Amazon EMR on EKS clusters to handle diverse workloads efficiently. However, this approach creates a challenge in maintaining comprehensive visibility into Spark applications running across these separate clusters. Data engineers and platform teams need a unified view to effectively monitor and optimize their Spark applications.

Although Spark provides powerful built-in monitoring capabilities through Spark History Server (SHS), implementing a scalable and secure observability solution across multiple clusters requires careful architectural considerations. Organizations need a solution that not only consolidates Spark application metrics but extends its features by adding other performance monitoring and troubleshooting packages while providing secure access to these insights and maintaining operational efficiency.

This post demonstrates how to build a centralized observability platform using SHS for Spark applications running on EMR on EKS. We showcase how to enhance SHS with performance monitoring tools, with a pattern applicable to many monitoring solutions such as SparkMeasure and DataFlint. In this post, we use DataFlint as an example to demonstrate how you can integrate additional monitoring features. We explain how to collect Spark events from multiple EMR on EKS clusters into a central Amazon Simple Storage Service (Amazon S3) bucket; deploy SHS on a dedicated Amazon Elastic Kubernetes Service (Amazon EKS) cluster; and configure secure access using AWS Load Balancer Controller, AWS Private Certificate Authority, Amazon Route 53, and AWS Client VPN. This solution provides teams with a single, secure interface to monitor, analyze, and troubleshoot Spark applications across multiple clusters.

Overview of solution

Consider DataCorp Analytics, a data-driven enterprise running multiple business units with diverse Spark workloads. Their Financial Analytics team processes time-sensitive trading data requiring strict processing times and dedicated resources, and their Marketing Analytics team handles customer behavior data with flexible requirements, requiring multiple EMR on EKS clusters to accommodate these distinct workload patterns. As their Spark applications grow in number and complexity across these clusters, data and platform engineers struggle to maintain comprehensive visibility while maintaining secure access to monitoring tools.

This scenario presents an ideal use case for implementing a centralized observability platform using SHS and DataFlint. The solution deploys SHS on a dedicated EKS cluster, configured to read events from multiple EMR on EKS clusters through a centralized S3 bucket. Access is secured through Load Balancer Controller, AWS Private CA, Route 53, and Client VPN, and DataFlint enhances the monitoring capabilities with additional insights and visualizations. The following architecture diagram illustrates the components and their interactions.

Architecture diagram

The solution workflow is as follows:

  1. Spark applications on EMR on EKS use a custom EMR Docker image that includes DataFlint JARs for enhanced metrics collection. These applications generate detailed event logs containing execution metrics, performance data, and DataFlint-specific insights. The logs are written to a centralized Amazon S3 location through the following configuration (note especially the configurationOverrides section). For additional information, explore the StartJobRun guide to learn how to run Spark jobs and review the StartJobRun API reference.
{
  "name": "${SPARK_JOB_NAME}", 
  "virtualClusterId": "${VIRTUAL_CLUSTER_ID}",  
  "executionRoleArn": "${IAM_ROLE_ARN_FOR_JOB_EXECUTION}",
  "releaseLabel": "emr-7.2.0-latest", 
  "jobDriver": {
    "sparkSubmitJobDriver": {
      "entryPoint": "s3://${S3_BUCKET_NAME}/app/${SPARK_APP_FILE}",
      "entryPointArguments": [
        "--input-path",
        "s3://${S3_BUCKET_NAME}/data/input",
        "--output-path",
        "s3://${S3_BUCKET_NAME}/data/output"
      ],
       "sparkSubmitParameters": "--conf spark.driver.cores=1 --conf spark.driver.memory=4G --conf spark.kubernetes.driver.limit.cores=1200m --conf spark.executor.cores=2  --conf spark.executor.instances=3  --conf spark.executor.memory=4G"
    }
  }, 
  "configurationOverrides": {
    "applicationConfiguration": [
      {
        "classification": "spark-defaults", 
        "properties": {
          "spark.driver.memory":"2G",
          "spark.kubernetes.container.image": "${AWS_ACCOUNT_ID}.dkr.ecr.${AWS_REGION}.amazonaws.com/${EMR_REPO_NAME}:${EMR_IMAGE_TAG}",
          "spark.app.name": "${SPARK_JOB_NAME}"
          "spark.eventLog.enabled": "true",
          "spark.eventLog.dir": "s3://${S3_BUCKET_NAME}/spark-events/"
         }
      }
    ], 
    "monitoringConfiguration": {
      "persistentAppUI": "ENABLED",
      "s3MonitoringConfiguration": {
        "logUri": "s3://${S3_BUCKET_NAME}/spark-events/"
      }
    }
  }
}
  1. A dedicated SHS deployed on Amazon EKS reads these centralized logs. The Amazon S3 location is configured in the SHS to read from the central Amazon S3 location through the following code:
env:
  - name: SPARK_HISTORY_OPTS
    value: "-Dspark.history.fs.logDirectory=s3a://${S3_BUCKET}/spark-events/"
  1. We configure Load Balancer Controller, AWS Private CA, a Route 53 hosted zone, and Client VPN to securely access the SHS UI using a web browser.
  2. Finally, users can access the SHS web interface at https://spark-history-server.example.internal/.

You can find the code base in the AWS Samples GitHub repository.

Prerequisites

Before you deploy this solution, make sure the following prerequisites are in place:

Set up a common infrastructure

Complete the following steps to set up the infrastructure:

  1. Clone the repository to your local machine and set the two environment variables. Replace <AWS_REGION> with the AWS Region where you want to deploy these resources.
git clone [email protected]:aws-samples/sample-centralized-spark-history-server-emr-on-eks.git
cd sample-centralized-spark-history-server-emr-on-eks
export REPO_DIR=$(pwd)
export AWS_REGION=<AWS_REGION>
  1. Execute the following script to create the common infrastructure. The script creates a secure virtual private cloud (VPC) networking environment with public and private subnets and an encrypted S3 bucket to store Spark application logs.
cd ${REPO_DIR}/infra
./deploy_infra.sh
  1. To verify successful infrastructure deployment, open the AWS CloudFormation console, choose your stack, and check the Events, Resources, and Outputs tabs for completion status, details, and list of resources created.

Set up EMR on EKS clusters

This section covers building a custom EMR on EKS Docker image with DataFlint integration, launching two EMR on EKS clusters (datascience-cluster-v and analytics-cluster-v), and configuring the clusters for job submission. Additionally, we set up the necessary IAM roles for service accounts (IRSA) to enable Spark jobs to write events to the centralized S3 bucket. Complete the following steps:

  1. Deploy two EMR on EKS clusters:
cd ${REPO_DIR}/emr-on-eks
./deploy_emr_on_eks.sh
  1. To verify successful creation of the EMR on EKS clusters using the AWS CLI, execute the following command:
aws emr-containers list-virtual-clusters \
    --query "virtualClusters[?state=='RUNNING']"
  1. Execute the following command for the datascience-cluster-v and analytics-cluster-v clusters to verify their respective states, container provider information, and associated EKS cluster details. Replace <VIRTUAL-CLUSTER-ID> with the ID of each cluster obtained from the list-virtual-clusters output.
aws emr-containers describe-virtual-cluster \
    --id <VIRTUAL-CLUSTER-ID>

Configure and execute Spark jobs on EMR on EKS clusters

Complete the following steps to configure and execute Spark jobs on the EMR on EKS clusters:

  1. Generate custom EMR on EKS image and StartJobRun request JSON files to run Spark jobs:
cd ${REPO_DIR}/jobs
./configure_jobs.sh

The script performs the following tasks:

  • Prepares the environment by uploading the sample Spark application spark_history_demo.py to a designated S3 bucket for job execution.
  • Creates a custom Amazon EMR container image by extending the base EMR 7.2.0 image with the DataFlint JAR for additional insights and publishing it to an Amazon Elastic Container Registry (Amazon ECR) repository.
  • Generates cluster-specific StartJobRun request JSON files for datascience-cluster-v and analytics-cluster-v.

Review start-job-run-request-datascience-cluster-v.json and start-job-run-request-analytics-cluster-v.json for additional details.

  1. Execute the following commands to submit Spark jobs on the EMR on EKS virtual clusters:
aws emr-containers start-job-run \
--cli-input-json file://${REPO_DIR}/jobs/start-job-run/start-job-run-request-datascience-cluster-v.json
aws emr-containers start-job-run \
--cli-input-json file://${REPO_DIR}/jobs/start-job-run/start-job-run-request-analytics-cluster-v.json
  1. Verify the successful generation of the logs in the S3 bucket:

aws s3 ls s3://emr-spark-logs-<AWS_ACCOUNT_ID>-<AWS_REGION>/spark-events/

You have successfully set up an EMR on EKS environment, executed Spark jobs, and collected the logs in the centralized S3 bucket. Next, we will deploy SHS, configure its secure access, and visualize the logs using it.

Set up AWS Private CA and create a Route 53 private hosted zone

Use the following code to deploy AWS Private CA and create a Route 53 private hosted zone. This will provide a user-friendly URL to connect to SHS over HTTPS.

cd ${REPO_DIR}/ssl
./deploy_ssl.sh

Set up SHS on Amazon EKS

Complete the following steps to build a Docker image containing SHS with DataFlint, deploy it on an EKS cluster using a Helm chart, and expose it through a Kubernetes service of type LoadBalancer. We use a Spark 3.5.0 base image, which includes SHS by default. However, although this simplifies deployment, it results in a larger image size. For environments where image size is critical, consider building a custom image with just the standalone SHS component instead of using the complete Spark distribution.

  1. Deploy SHS on the spark-history-server EKS cluster:
cd ${REPO_DIR}/shs
./deploy_shs.sh
  1. Verify the deployment by listing the pods and viewing the pod logs:
kubectl get pods --namespace spark-history
kubectl logs <SHS-PODNAME> --namespace spark-history
  1. Review the logs and confirm there are no errors or exceptions.

You have successfully deployed SHS on the spark-history-server EKS cluster, and configured it to read logs from the emr-spark-logs-<AWS_ACCOUNT_ID>-<AWS_REGION> S3 bucket.

Deploy Client VPN and add entry to Route 53 for secure access

Complete the following steps to deploy Client VPN to securely connect your client machine (such as your laptop) to SHS and configure Route 53 to generate a user-friendly URL:

  1. Deploy the Client VPN:
cd ${REPO_DIR}/vpn
./deploy_vpn.sh
  1. Add entry to Route 53:
cd ${REPO_DIR}/dns
./deploy_dns.sh

Add certificates to local trusted stores

Complete the following steps to add the SSL certificate to your operating system’s trusted certificate stores for secure connections:

  1. For macOS users, using Keychain Access (GUI):
    1. Open Keychain Access from Applications, Utilities, choose the System keychain in the navigation pane, and choose File, Import Items.
    2. Browse to and choose ${REPO_DIR}/ssl/certificates/ca-certificate.pem, then choose the imported certificate.
    3. Expand the Trust section and set When using this certificate to Always Trust.
    4. Close and enter your password when prompted and save.
    5. Alternatively, you can execute the following command to include the certificate in Keychain and trust it:
sudo security add-trusted-cert -d -r trustRoot -k /Library/Keychains/System.keychain "${REPO_DIR}/ssl/certificates/ca-certificate.pem"
  1. For Windows users:
    1. Rename ca-certificate.pem to ca-certificate.crt.
    2. Choose (right-click) ca-certificate.crt and choose Install Certificate.
    3. Choose Local Machine (admin rights required).
    4. Select Place all certificates in the following store.
    5. Choose Browse and choose Trusted Root Certification Authorities.
    6. Complete the installation by choosing Next and Finish.

Set up Client VPN on your client machine for secure access

Complete the following steps to install and configure Client VPN on your client machine (such as your laptop) and create a VPN connection to the AWS Cloud:

  1. Download, install, and launch the Client VPN application from the official download page for your operating system.
  2. Create your VPN profile:
    1. Choose File in the menu bar, choose Manage Profiles, and choose Add Profile.
    2. Enter a name for your profile. Example: SparkHistoryServerUI
    3. Browse to ${REPO_DIR}/vpn/client_vpn_certs/client-config.ovpn, choose the certificate file, and choose Add Profile to save your configuration.
  3. Select your newly created profile, choose Connect, and wait for the connection confirmation to establish the VPN connection.

When you’re connected, you will have secure access to the AWS resources in your environment.

VPN connection details

Securely access the SHS URL

Complete the following steps to securely access SHS using a web browser:

  1. Execute the following command to get the SHS URL:

https://spark-history-server.example.internal/

  1. Copy this URL and enter it into your web browser to access the SHS UI.

The following screenshot shows an example of the UI.

Spark History Server job summary page

  1. Choose an App ID to view its detailed execution information and metrics.

Spark History Server job detail page

  1. Choose the DataFlint tab to view detailed application insights and analytics.

DataFlint insights page

DataFlint displays various helpful metrics, including alerts, as shown in the following screenshot.

DataFlint alerts page

Clean up

To avoid incurring future charges from the resources created in this tutorial, clean up your environment after completing the steps. To remove all provisioned resources:

  1. Disconnect from the Client VPN.
  2. Run the cleanup.sh script:
cd ${REPO_DIR}/
./cleanup.sh

Conclusion

In this post, we demonstrated how to build a centralized observability platform for Spark applications using SHS and enhance SHS with performance monitoring tools like DataFlint. The solution aggregates Spark events from multiple EMR on EKS clusters into a unified monitoring interface, providing comprehensive visibility into your Spark applications’ performance and resource utilization. By using a custom EMR image with performance monitoring tool integration, we enhanced the standard Spark metrics to gain deeper insights into application behavior. If your environment uses a mix of EMR on EKS, Amazon EMR on EC2, or Amazon EMR Serverless, you can seamlessly extend this architecture to aggregate the logs from EMR on EC2 and EMR Serverless in a similar way and visualize them using SHS.

Although this solution provides a robust foundation for Spark monitoring, production deployments should consider implementing authentication and authorization. SHS supports custom authentication through javax servlet filters and fine-grained authorization through access control lists (ACLs). We encourage you to explore implementing authentication filters for secure access control, configuring user- and group-based ACLs for view and modify permissions, and setting up group mapping providers for role-based access. For detailed guidance, refer to Spark’s web UI security documentation and SHS security features.

While AWS endeavors to apply best practices for security within this example, each organization has its own policies. Please make sure to use the specific policies of your organization when deploying this solution as a starting point for implementing centralized Spark monitoring in your data processing environment.


About the authors

Sri Potluri is a Cloud Infrastructure Architect at AWS. He is passionate about solving complex problems and delivering well-structured solutions for diverse customers. His expertise spans across a range of cloud technologies, providing scalable and reliable infrastructures tailored to each project’s unique challenges.

Suvojit Dasgupta is a Principal Data Architect at AWS. He leads a team of skilled engineers in designing and building scalable data solutions for AWS customers. He specializes in developing and implementing innovative data architectures to address complex business challenges.

Architecture patterns to optimize Amazon Redshift performance at scale

Post Syndicated from Eddie Yao original https://aws.amazon.com/blogs/big-data/architecture-patterns-to-optimize-amazon-redshift-performance-at-scale/

Tens of thousands of customers use Amazon Redshift as a fully managed, petabyte-scale data warehouse service in the cloud. As an organization’s business data grows in volume, the data analytics need also grows. Amazon Redshift performance needs to be optimized at scale to achieve faster, near real-time business intelligence (BI). You might also consider optimizing Amazon Redshift performance when your data analytics workloads or user base increases, or to meet a data analytics performance service level agreement (SLA). You can also look for ways to optimize Amazon Redshift data warehouse performance after you complete an online analytical processing (OLAP) migration from another system to Amazon Redshift.

In this post, we will show you five Amazon Redshift architecture patterns that you can consider to optimize your Amazon Redshift data warehouse performance at scale using features such as Amazon Redshift Serverless, Amazon Redshift data sharing, Amazon Redshift Spectrum, zero-ETL integrations, and Amazon Redshift streaming ingestion.

Use Amazon Redshift Serverless to automatically provision and scale your data warehouse capacity

To start, let’s review using Amazon Redshift Serverless to automatically provision and scale your data warehouse capacity. The architecture is shown in the following diagram and includes different components within Amazon Redshift Serverless like ML-based workload monitoring and automatic workload management.

Amazon Redshift Serverless architecture diagram

Amazon Redshift Serverless architecture diagram

Amazon Redshift Serverless is a deployment model that you can use to run and scale your Redshift data warehouse without managing infrastructure. Amazon Redshift Serverless will automatically provision and scale your data warehouse capacity to deliver fast performance for even the most demanding, unpredictable, or massive workloads.

Amazon Redshift Serverless measures data warehouse capacity in Redshift Processing Units (RPUs). You pay for the workloads you run in RPU-hours on a per-second basis. You can optionally configure your Base, Max RPU-Hours, and MaxRPU parameters to modify your warehouse performance costs. This post dives deep into understanding cost mechanisms to consider when managing Amazon Redshift Serverless.

Amazon Redshift Serverless scaling is automatic and based on your RPU capacity. To further optimize scaling operations for large scale datasets, Amazon Redshift Serverless has AI-driven scaling and optimization. It uses AI to scale automatically with workload changes across key metrics such as data volume changes, concurrent users, and query complexity, accurately meeting your price performance targets.

There is no maintenance window in Amazon Redshift Serverless, because software version updates are applied automatically. This maintenance occurs with no interruptions for any existing connections or query executions. Make sure to consult the considerations guide to better understand the operation of Amazon Redshift Serverless.

You can migrate from an existing provisioned Amazon Redshift data warehouse to Amazon Redshift Serverless by creating a snapshot of your current provisioned data warehouse and then restoring that snapshot in Amazon Redshift Serverless. Amazon Redshift will automatically convert interleaved keys to compound keys when you restore a provisioned data warehouse snapshot to a Serverless namespace. You can also get started with a new Amazon Redshift Serverless data warehouse.

Amazon Redshift Serverless use cases

You can use Amazon Redshift Serverless for:

  • Self-service analytics
  • Auto scaling for unpredictable or variable workloads
  • New applications
  • Multi-tenant applications

With Amazon Redshift, you can access and query data stored in Amazon S3 Tables – fully managed Apache Iceberg tables optimized for analytics workloads. Amazon Redshift also supports querying data stored using Apache Iceberg tables, and other open table formats like Apache Hudi and Linux Foundation Delta Lake, for more information see External tables for Redshift Spectrum and Expand data access through Apache Iceberg using Delta Lake UniForm on AWS.

You can also use Amazon Redshift Serverless with Amazon Redshift data sharing, which can automatically scale your large dataset in independent datashares and maintain workload isolation controls.

Amazon Redshift data sharing to share live data between separate Amazon Redshift data warehouses

Next, we will look at an Amazon Redshift data sharing architecture pattern, shown in below diagram, to share data between a hub Amazon Redshift data warehouse and spoke Amazon Redshift data warehouses , and to share data across multiple Amazon Redshift data warehouses with each other.

Amazon Redshift data sharing architecture patterns diagram

Amazon Redshift data sharing architecture patterns diagram

With Amazon Redshift data sharing, you can securely share access to live data between separate Amazon Redshift data warehouses without manually moving or copying the data. Because the data is live, all users can see the most up-to-date and consistent information in Amazon Redshift as soon as it’s updated using separate dedicated resources. Because the compute accessing the data is isolated, you can size the data warehouse configurations to individual workload price performance requirements rather than the aggregate of all workloads. This also provides additional flexibility to scale with new workloads without affecting the workloads already being run on Amazon Redshift.

A datashare is the unit of sharing data in Amazon Redshift. A producer data warehouse administrator can create datashares and add datashare objects to share data with other data warehouses, referred to as outbound shares. A consumer data warehouse administrator can receive datashares from other data warehouses, referred to as inbound shares.

To get started, a producer data warehouse needs to add all objects (and potential permissions) that need to be accessed by another data warehouse to a datashare, and share that datashare with a consumer. After that consumer creates a database from the datashare, the shared objects can be accessed using three-part notation consumer_database_name.schema_name.table_name on the consumer, using the consumer’s compute.

Amazon Redshift data sharing use cases

Amazon Redshift data sharing, including multi-warehouse writes in Amazon Redshift, can be used to:

  • Support different kinds of business-critical workloads, including workload isolation and chargeback for individual workloads.
  • Enable cross-group collaboration across teams for broader analytics, data science, and cross-product impact analysis.
  • Deliver data as a service.
  • Share data between environments to improve team agility by sharing data at different granularity levels such as development, test, and production.
  • License access to data in Amazon Redshift by listing Amazon Redshift data sets in the AWS Data Exchange catalog so that customers can find, subscribe to, and query the data in minutes.
  • Update business source data on the producer. You can share data as a service across your organization, but then consumers can also perform actions on the source data.
  • Insert additional records on the producer. Consumers can add records to the original source data.

The following articles provide examples of how you can use Amazon Redshift data sharing to scale performance:

Amazon Redshift Spectrum to query data in Amazon S3

You can use Amazon Redshift Spectrum to query data in , as shown in below diagram using AWS Glue Data Catalog.

Amazon Redshift Spectrum architecture diagram

Amazon Redshift Spectrum architecture diagram

You can use Amazon Redshift Spectrum to efficiently query and retrieve structured and semi-structured data from files in Amazon S3 without having to directly load data into Amazon Redshift tables. Using the large, parallel scale of the Amazon Redshift Spectrum layer, you can run massive, fast, parallel queries against large datasets while most of the data remains in Amazon S3. This can significantly improve the performance and cost-effectiveness of massive analytics workloads, because you can use the scalable storage of Amazon S3 to handle large volumes of data while still benefiting from the powerful query processing capabilities of Amazon Redshift.

Amazon Redshift Spectrum uses separate infrastructure independent of your Amazon Redshift data warehouse, offloading many compute-intensive tasks, such as predicate filtering and aggregation. This means that you can use significantly less data warehouse processing capacity than other queries. Amazon Redshift Spectrum can also automatically scale to potentially thousands of instances, based on the demands of your queries.

When implementing Amazon Redshift Spectrum, make sure to consult the considerations guide which details how to configure your networking, external table creation, and permissions requirements.

Review this best practices guide and this blog post, which outlines recommendations on how to optimize performance including the impact of different file types, how to design around the scaling behavior, and how you can efficiently partition files. You can check out an example architecture in Accelerate self-service analytics with Amazon Redshift Query Editor V2.

To get started with Amazon Redshift Spectrum, you define the structure for your files and register them as an external table in an external data catalog (AWS Glue, Amazon Athena, and Apache Hive metastore are supported). After creating your external table, you can query your data in Amazon S3 directly from Amazon Redshift.

Amazon Redshift Spectrum use cases

You can use Amazon Redshift Spectrum in the following use cases:

  • Huge volume but less frequently accessed data, build lake house architecture to query exabytes of data in an S3 data lake
  • Heavy scan- and aggregation-intensive queries
  • Selective queries that can use partition pruning and predicate pushdown, so the output is fairly small

Zero-ETL to unify all data and achieve near real-time analytics

You can use Zero-ETL integration with Amazon Redshift to integrate with your transactional databases like Amazon Aurora MySQL-Compatible Edition, so you can run near real-time analytics in Amazon Redshift, or BI in Amazon QuickSight, or machine learning workload in Amazon SageMaker AI, shown in below diagram.

Zero-ETL integration with Amazon Redshift architecture diagram

Zero-ETL integration with Amazon Redshift architecture diagram

Zero-ETL integration with Amazon Redshift removes the undifferentiated heavy lifting to build and manage complex extract, transform, and load (ETL) data pipelines; unifies data across databases, data lakes, and data warehouses; and makes data available in Amazon Redshift in near real time for analytics, artificial intelligence (AI) and machine learning (ML) workloads.

Currently Amazon Redshift supports the following zero-ETL integrations:

To create a zero-ETL integration, you specify an integration source, such as an Amazon Aurora DB cluster, and an Amazon Redshift data warehouse, such as Amazon Redshift Serverless workgroup or a provisioned data warehouse (including Multi-AZ deployment on RA3 clusters to automatically recover from any infrastructure or Availability Zone failures and help ensure that your workloads remain uninterrupted), as the target. The integration replicates data from the source to the target and makes data available in the target data warehouse within seconds. The integration also monitors the health of the integration pipeline and recovers from issues when possible.

Make sure to review considerations, limitations, and quotas on both the data source and target when using zero-ETL integrations with Amazon Redshift.

Zero-ETL integration use cases

You can use zero-ETL integration with Amazon Redshift as an architecture pattern to boost analytical query performance at scale, enable a straightforward and secure way to create near real-time analytics on petabytes of transactional data, with continuous change-data-capture (CDC). Plus, you can use other Amazon Redshift capabilities such as built-in machine learning, materialized views, data sharing, and federated access to multiple data stores and data lakes. You can see more other zero-ETL integrations use cases at What is ETL.

Ingest streaming data into Amazon Redshift data warehouse for near real-time analytics

You can ingest streaming data with Amazon Kinesis Data Streams or Amazon Managed Streaming for Apache Kafka (Amazon MSK) to Amazon Redshift and run near real-time analytics in Amazon Redshift, as shown in the following diagram.

Amazon Redshift data streaming architecture diagram

Amazon Redshift data streaming architecture diagram

Amazon Redshift streaming ingestion provides low-latency, high-speed data ingestion directly from Amazon Kinesis Data Streams or Amazon MSK to an Amazon Redshift provisioned or Amazon Redshift Serverless data warehouse, without staging data in Amazon S3. You can connect to and access the data from the stream using standard SQL and simplify data pipelines by creating materialized views in Amazon Redshift on top of the data stream. For best practices, you can review these blog posts:

To get started on Amazon Redshift streaming ingestion, you create an external schema that maps to the streaming data source and create a materialized view that references the external schema. For details on how to set up Amazon Redshift streaming ingestion for Amazon KDS, see Getting started with streaming ingestion from Amazon Kinesis Data Streams. For details on how to set up Amazon Redshift streaming ingestion for Amazon MSK, see Getting started with streaming ingestion from Apache Kafka sources.

Amazon Redshift streaming ingestion use cases

You can use Amazon Redshift streaming ingestion to:

  • Improve gaming experience by analyzing real-time data from gamers
  • Analyze real-time IoT data and use machine learning (ML) within Amazon Redshift to improve operations, predict customer churn, and grow your business
  • Analyze clickstream user data
  • Conduct real-time troubleshooting by analyzing streaming data from log files
  • Perform near real-time retail analytics on streaming point of sale (POS) data

Other Amazon Redshift features to optimize performance

There are other Amazon Redshift features that you can use to optimize performance.

  • You can resize Amazon Redshift provisioned clusters to optimize data warehouse compute and storage use.
  • You can use concurrency scaling, where Amazon Redshift provisioning automatically adds additional capacity to process increases in read, such as dashboard queries; and write operations, such as data ingestion and processing.
  • You can also consider materialized views in Amazon Redshift, applicable to both provisioned and serverless data warehouses, which contains a precomputed result set, based on an SQL query over one or more base tables. They are especially useful for speeding up queries that are predictable and repeated.
  • You can use auto-copy for Amazon Redshift to set up continuous file ingestion from your Amazon S3 prefix and automatically load new files to tables in your Amazon Redshift data warehouse without the need for additional tools or custom solutions.

Cloud security at AWS is the highest priority. Amazon Redshift offers broad security-related configurations and controls to help ensure information is appropriately protected. See Amazon Redshift Security Best Practices for a comprehensive guide to Amazon Redshift security best practices.

Conclusion

In this post, we reviewed Amazon Redshift architecture patterns and features that you can use to help scale your data warehouse to dynamically accommodate different workload combinations, volumes, and data sources to achieve optimal price performance. You can use them alone or together—choosing the best infrastructural set up for your use case requirements—and scale to accommodate for any future growth.

Get started with these Amazon Redshift architecture patterns and features today by following the instructions provided in each section. If you have questions or suggestions, leave a comment below.


About the authors

Eddie Yao is a Principal Technical Account Manager (TAM) at AWS. He helps enterprise customers build scalable, high-performance cloud applications and optimize cloud operations. With over a decade of experience in web application engineering, digital solutions, and cloud architecture, Eddie currently focuses on Media & Entertainment (M&E) and Sports industries and AI/ML and generative AI.

Julia Beck is an Analytics Specialist Solutions Architect at AWS. She supports customers in validating analytics solutions by architecting proof of concept workloads designed to meet their specific needs.

Scott St. Martin is a Solutions Architect at AWS who is passionate about helping customers build modern applications. Scott uses his decade of experience in the cloud to guide organizations in adopting best practices around operational excellence and reliability, with a focus the manufacturing and financial services spaces. Outside of work, Scott enjoys traveling, spending time with family, and playing piano.

Analyze media content using AWS AI services

Post Syndicated from Jack Bradham original https://aws.amazon.com/blogs/architecture/analyze-media-content-using-aws-ai-services/

Organizations managing large audio and video archives face significant challenges in extracting value from their media content. Consider a radio network with thousands of broadcast hours across multiple stations and the challenges they face to efficiently verify ad placements, identify interview segments, and analyze programming patterns.

In this post, we demonstrate how you can automatically transform unstructured media files into searchable, analyzable content. By combining Amazon Transcribe, Amazon Bedrock, Amazon QuickSight, and Amazon Q, organizations can achieve the following:

  • Process and transcribe media files upon upload
  • Identify commercials, interviews, and program segments
  • Extract insights using foundation models (FMs)
  • Create a searchable knowledge base
  • Generate rich visualizations for decision-making
  • Enable natural language queries across their media archive
  • Visualize complex information with intuitive graphics

In the following sections, we explore how these AWS services work together to help organizations unlock the full potential of their media content, whether for advertising compliance, content analysis, or discovering specific segments within thousands of hours of recordings.

Solution overview

This solution provides an event-driven media analysis pipeline that transforms how you manage and extract value from your content:

  • Streamline content management – Automatically process media files the moment they’re uploaded, saving time and reducing manual work
  • Unlock deeper insights – Generate accurate transcriptions that capture not just words, but the full context of your content—including speakers, timing, and key moments
  • Harness AI – Automatically extract meaningful insights and uncover hidden patterns in your media without extensive manual review
  • Build a searchable knowledge base – Turn scattered media files into a discoverable catalog that your entire team can use
  • Build a customizable interface – Create a customizable UI to search the catalog
  • Create powerful visualizations – Bring your insights to life with intuitive visualizations that make complex information immediately understandable

The following diagram illustrates our architecture.

Media Analysis Architecture

This event-driven architecture automatically processes and analyzes multimedia content using AWS services. The workflow consists of the following steps:

  1. A user uploads media files to an Amazon Simple Storage Service (Amazon S3) bucket. A “New Media” event triggers the first AWS Step Functions workflow. This workflow handles the initial cataloging based on values in the file name and launches the transcription process.
  2. Amazon Transcribe converts the audio into accurate, readable text. The transcribed content is securely saved to an S3 bucket for further analysis. A “Transcription Complete” event triggers the next step.
  3. A second Step Functions workflow processes the transcription. Using predefined prompts, Amazon Bedrock analyzes the transcripts to extract meaningful information. Key insights extracted from the transcript are stored in an S3 data lake.
  4. The processed results are organized systematically, structured by date (year/month/day) and tagged with relevant attributes. This organized data enables natural language queries through Amazon Q when used as a knowledge base, interactive visualizations using QuickSight, and straightforward content discovery and analysis.
  5. Amazon Athena serves as the data exploration tool to query the data lake. Athena is used as the data source in QuickSight, which turns complex data into clear, compelling visuals.

This architecture automatically transforms raw media content into searchable, analyzable data while maintaining an organized hierarchy for efficient access and analysis. The event-driven design provides automatic processing of new content as it arrives, and the combination of AWS AI services enables deep content understanding and insight extraction. Each AWS service plays a crucial role in transforming your media content:

  • Amazon Bedrock – Reviews content after transcription for insights and entity extraction:
    • Uses advanced FMs to analyze transcripts
    • Identifies commercials, interviews, and program segments
    • Extracts meaningful insights from content
  • Amazon EventBridge – Triggers actions in the cataloging workflow:
    • Monitors for new media files and completed transcriptions
    • Automatically triggers Step Functions workflows
  • AWS Lambda – Handles custom code actions needed in the workflow:
    • Facilitates interaction with Amazon Bedrock
    • Executes custom prompts on transcripts
    • Enables flexible, scalable processing
  • Amazon Q – Serves as the frontend and Retrieval Augmented Generation (RAG) engine:
    • Addresses enterprise generative AI needs by providing a turnkey solution with built-in security features like single sign-on (SSO) integration and responsible AI governance policies
    • Allows businesses to quickly deploy AI assistance while maintaining compliance, data privacy, and security standards
    • Enables natural language queries across the media archive
    • Links results to the source media files
    • Provides conversational access to content
  • Amazon QuickSight – Turns insights in beautiful visualization for better consumption:
    • Creates interactive dashboards and visualizations
    • Displays comprehensive media analytics
    • Helps track advertising, programming, and content patterns
  • Amazon S3 – Stores assets and the catalog:
    • Securely stores raw media files, transcripts, and processed data
    • Automatically triggers processing when new content is uploaded
  • AWS Step Functions – Orchestrates the entire content processing workflow:
    • Manages transcription and AI analysis steps
    • Provides robust error handling and automatic retries
  • Amazon Transcribe – Converts speech to accurate, readable text:
    • Identifies speakers and timestamps
    • Provides accurate transcriptions of audio content

Security considerations

Although this post focuses on the technical implementation of media content analysis, it’s important to acknowledge that production deployments should include comprehensive security measures:

  • Data storage security (Amazon S3):
    • Enable server-side encryption using AWS Key Management Service (AWS KMS) keys
    • Apply bucket policies restricting access to authorized principals only
    • Enable Amazon S3 Block Public Access at account and bucket levels
    • Enable versioning for data recovery
    • Implement lifecycle policies for data retention
    • Enable S3 access logging
    • Use presigned URLs for temporary access
  • Identity and Access Management (IAM):
    • Create dedicated service roles with minimum required permissions for:
      • Step Functions execution
      • Amazon Transcribe jobs
      • Amazon Bedrock API calls
      • Athena queries
    • Implement role-based access control
    • Regularly rotate credentials
    • Enable multi-factor authentication (MFA) for all users
    • Use AWS Organizations for multi-account management
  • Network security:
    • Deploy virtual private cloud (VPC) endpoints for:
      • Amazon S3
      • Athena
      • QuickSight
    • Implement network access control lists (ACLs) and security groups
    • Enable VPC Flow Logs
    • Use AWS PrivateLink where applicable
    • Configure route tables to control traffic flow
  • Data encryption:
    • Implement AWS KMS encryption for S3 objects
    • Use TLS 1.2+ for all API communications
    • Enable automatic key rotation in AWS KMS
    • Implement envelope encryption for sensitive data
  • Monitoring and detection:
    • Enable AWS CloudTrail for API activity logging
    • Configure Amazon GuardDuty for threat detection
    • Set up Amazon CloudWatch:
      • Metrics for service health
      • Alarms for security events
      • Log groups for application logs
    • Enable S3 server access logging
    • Configure VPC Flow Logs
  • Access controls:
    • Implement fine-grained access controls for:
      • Amazon Bedrock model access
      • Athena query permissions
      • QuickSight dashboard sharing
    • Conduct regular access reviews

Additionally, compliance requirements and data governance policies might impact how you implement this solution in your environment.

These security considerations are crucial but beyond the scope of this post. We recommend consulting AWS security best practices and working with your security team to implement appropriate measures for your specific use case. For more information on AWS security best practices, refer to Best Practices for Security, Identity, & Compliance.

The following sections walk you through setting up each component of the architecture to help you transform raw media into actionable insights.

Prerequisites

The following are the prerequisites to follow along this post:

Create S3 buckets

For this solution, we create three distinct buckets to support the media analytics workflow:

  • Raw media bucket for incoming files
  • Transcription outputs bucket
  • Processed insights bucket

For instructions on creating buckets, refer to Creating a general purpose bucket.

Configure EventBridge

You can enable event notifications on the raw media bucket to trigger your automated workflow through EventBridge. Establish your automation backbone by monitoring S3 bucket activities. When new media arrives or transcription completes, EventBridge will trigger the appropriate workflow, providing continuous processing. For further instructions, refer to Creating rules that react to events in Amazon EventBridge.

The following are two example triggers that can be used to filter events and trigger Step Functions workflows. The following is an example filter for new media files:

{
  "source": ["aws.s3"],
  "detail-type": ["Object Created"],
  "detail": {
    "bucket": {
      "name": ["rawinputbucket"]
    }
  }
}

The following is an example filter for new transcripts added to the data lake:

{
  "source": ["aws.s3"],
  "detail-type": ["Object Created"],
  "detail": {
    "bucket": {
      "name": ["business-data-lake"]
    },
    "object": {
      "key": [{
        "suffix": ".transcription"
      }]
    }
  }
}

Create Step Functions workflows

We design the orchestration layer with two key workflows. The first handles media intake and transcription, and the second manages AI analysis. Each workflow includes safeguards for potential failures and retry mechanisms. For further instructions, refer to Learn how to get started with Step Functions.

The following diagram shows an example of processing new media uploads for indexing and transcription.

Media Analysis - Step Function Example

The following diagram shows an example of the Step Functions workflow that analyzes the transcription.

Set up Amazon Transcribe

To create an Amazon Transcribe job, you need permissions to do so. You can implement a speech-to-text conversion with powerful features like language detection, speaker identification, and custom vocabulary support to provide accurate transcription of your media content. For further instructions, refer to How Amazon Transcribe works.

Configure Amazon Bedrock

You can power your AI analysis engine by setting up precise prompts that extract meaningful insights. Amazon Bedrock processes transcripts to identify key segments, speakers, and topics, transforming raw text into structured data. For instructions, refer to Design a prompt.

The following is a sample prompt:

You will be reviewing a radio transcription to identify advertisements and extract relevant details. Your task is to analyze the provided transcript and output the results in a specific JSON format based on a given schema.
Please follow these steps to complete the task:
1. Carefully read through the entire transcript.
2. Identify all advertisements within the transcript. Look for clear indicators such as product mentions, promotional language, or transitions from regular content to commercial content.
3. For each advertisement you identify, determine the following information:
    - Company: The name of the company being advertised
    - Start time: The timestamp in the transcript where the ad begins
    - End time: The timestamp in the transcript where the ad ends
    - Product: The specific product or service being advertised
4. Format your findings into a JSON object that follows the provided schema. Each advertisement should be a separate object within an array.
5. Ensure these fields in your response are provided for each advertisement.. All are required fields: company, starttime, endtime, product. 
6. Use precise timestamps for start and end times. If exact times are not available, make a best estimate based on the transcript's context.
7. If a particular field is unclear or not explicitly mentioned in the transcript, you may use "Unknown" as the value.
8. Only respond with json and nothing else. Do not provide comments or explain your answer. 
9. Surround the JSON response with standard ```json markers
Here's an example of how your output should be formatted:
{
"advertisements": [
        {
            "company": "TechGadgets Inc.",
            "starttime": "00:05:30",
            "endtime": "00:06:15",
            "product": "SmartHome Hub"
        },
        {
            "company": "FreshFoods Market",
            "starttime": "00:15:45",
            "endtime": "00:16:30",
            "product": "Organic Produce Delivery Service"
        }
    ]
}
Do not add any fields that are not specified in the schema, and ensure all required fields are present for each advertisement.

Create a structured data lake

We create a hierarchical data organization strategy that enables efficient access and analysis. You can use AWS Glue crawlers to automatically discover and catalog your media metadata. For instructions, refer to Using crawlers to populate the Data Catalog. Configure Athena tables to enable SQL-based querying of your media insights:

CREATE OR REPLACE VIEW "commercials_view" AS 
SELECT
  metadata.market market
, metadata.station_call station_call
, metadata.format_type format_type
, CAST(metadata.timestamp AS timestamp) timestamp
, ads.company adCompany
, ads.product adProduct
, ads.starttime
, ads.endtime
FROM
  (commercials
CROSS JOIN UNNEST(advertisements) t (ads))

Set Up Amazon Q

You can enable natural language interaction with your media archive using Amazon Q Business. Configure the knowledge base and metadata to make your content searchable and accessible through conversational queries. Use the processed insights S3 buckets to configure the knowledge base. For instructions, refer to Getting started with Amazon Q Business.

The following screenshot shows example conversations with an AI assistant.

Build QuickSight dashboards

With QuickSight, you can create visual analytics that bring your data to life. Connect to Athena views to display advertising patterns, content analysis, and performance metrics in interactive dashboards. For more information, refer to Tutorial: Create an Amazon QuickSight dashboard.

The following screenshots are a few examples of dashboards created for a fictitious radio station as part of our use case.

Validate and optimize your media analytics solution

After you implement your media analytics architecture, follow these critical steps to achieve robust performance and alignment with your organization’s needs. First, configure a comprehensive testing approach. Imagine you’re preparing to launch your media analysis solution. Your testing journey begins with accuracy validation:

  • Compare transcription outputs against original media
  • Verify AI-generated insights for precision
  • Use representative sample sets from your content library

You start by taking a recently processed radio show and comparing its transcription against the original broadcast. Your team meticulously reviews the AI-generated insights, checking if key moments like ad transitions or interview segments are correctly identified. To make sure your system works across all content types, you select diverse samples from your library—perhaps a morning talk show, an evening news segment, and a weekend sports broadcast. Next, you delve into performance benchmarking:

  • Measure processing time for different media types
  • Evaluate resource utilization across AWS services
  • Identify potential bottlenecks in the workflow

Time how long it takes to process different types of media files, from short commercial segments to lengthy program recordings. As you watch how your AWS services respond under various loads, you can monitor resource consumption patterns. This helps you identify processing bottlenecks—for instance, you might discover that certain file types take longer to transcribe or that concurrent processing needs optimization. Finally, you put yourself in your users’ shoes for a thorough user experience assessment:

  • Test natural language queries with Amazon Q
  • Validate search result relevance
  • Gather feedback from potential end-users

Team members can interact with Amazon Q, asking questions they would naturally pose when searching for specific content. For example, you can test whether searching for “interviews about climate change last week” returns relevant results. Gathering feedback from potential users—perhaps a content manager with different needs than a compliance officer—provides invaluable insights. Their real-world experiences guide your refinements and make sure the system serves its intended purpose. This comprehensive testing approach, combining structured evaluation with real-world scenarios, sets the stage for a robust and user-friendly media analysis solution. As your media analysis solution moves from initial deployment to production, optimizing its performance becomes crucial for both cost-efficiency and user satisfaction. A radio network processing thousands of hours of content weekly might find that even small improvements in transcription accuracy or processing speed can lead to significant cost savings and better content discoverability. Similarly, a marketing team analyzing ad placements across multiple stations needs precise insights to make data-driven decisions about advertising effectiveness. With these business imperatives in mind, consider the following configuration optimization strategies:

  • Transcription refinement:
    • Adjust language models for domain-specific terminology
    • Fine-tune speaker identification settings
    • Implement custom vocabularies for improved accuracy
  • AI insight generation:
    • Refine prompts for more targeted analysis
    • Experiment with different AI models
    • Align extraction parameters with business objectives
  • Scalability considerations:
    • Test workflow performance with increasing media volumes
    • Implement appropriate auto scaling configurations
    • Monitor cost-effectiveness of your architecture
  • Continuous improvement:
    • Establish regular review cycles
    • Track key performance metrics
    • Iterate on your solution based on real-world usage

We recommend starting with a pilot implementation and gradually expanding your media analytics capabilities.

Clean up

To avoid incurring ongoing charges, clean up the resources you created as part of this solution:

  1. Delete QuickSight resources:
    1. Delete dashboards created for media analytics.
    2. Delete the datasets connected to Athena.
    3. If no longer needed, delete the QuickSight Enterprise subscription.
  2. Delete S3 buckets:
    1. Empty and delete the raw media bucket, transcription outputs bucket, and processed insights bucket.
  3. Remove EventBridge rules:
    1. Delete the rules created for monitoring S3 bucket activities.
    2. Remove targets associated with these rules.
  4. Delete Step Functions workflows:
    1. Delete the media intake and transcription workflow.
    2. Delete the AI analysis workflow.
  5. Remove Lambda functions:
    1. Delete Lambda functions created for interaction with Amazon Bedrock.
    2. Remove associated IAM roles and policies.
  6. Clean up data lake components:
    1. Delete Athena views and tables.
    2. Remove AWS Glue crawlers and databases.
    3. Delete stored query results.
  7. Remove Amazon Q configurations:
    1. Delete knowledge bases created.
    2. Remove custom configurations.
  8. Remove Amazon Bedrock settings:
    1. Remove custom prompts.
    2. Disable access to FMs if no longer needed.
  9. Delete Amazon Transcribe settings:
    1. Remove custom vocabularies.
    2. Delete stored transcription jobs.
  10. Remove IAM resources:
    1. Delete custom IAM roles created for this solution.
    2. Remove associated IAM policies.
  11. Complete additional cleanup:
    1. Delete CloudWatch Logs groups associated with Lambda functions.
    2. Remove CloudWatch alarms or metrics created for monitoring.
    3. Delete saved queries in Athena.

Common use cases

Organizations in different sectors can use this architecture to unlock value from their audio and video content. You can adapt this solution to meet your specific needs, such as managing broadcast media, corporate communications, educational materials, and more. Let’s explore how different industries might apply this technology:

  • Media and broadcasting:
    • Track advertising compliance
    • Verify media placement accuracy
    • Analyze broadcast content at scale
  • Corporate and enterprise:
    • Convert meeting recordings into searchable knowledge bases
    • Identify key decisions and action items
    • Enhance organizational knowledge management
  • Education and training:
    • Create comprehensive, topic-based course catalogs
    • Index training materials for quick retrieval
    • Support continuous learning initiatives
  • Legal services:
    • Generate precise, timestamped transcripts
    • Develop searchable legal proceeding archives
    • Improve document review efficiency
  • Healthcare:
    • Extract critical medical insights from consultations
    • Categorize patient interaction data
    • Support clinical documentation processes
  • Government and public sector:
    • Build comprehensive public meeting archives
    • Implement automated topic categorization
    • Enhance transparency and accessibility
  • Customer service:
    • Analyze call recordings for quality improvement
    • Identify service trends and customer pain points
    • Drive continuous customer experience enhancement

This media analytics architecture demonstrates notable versatility. By using AI, organizations can transform raw audio and video content into structured, meaningful insights that drive decision-making across industries.

Conclusion

In this post, we demonstrated how to use AWS services to convert unstructured media content into actionable intelligence. By combining Amazon Transcribe, Amazon Bedrock, QuickSight, and Amazon Q, you can create a scalable, automated solution for media analysis that adapts to your organizational needs.

This solution offers the following key architectural advantages:

  • Automated media file processing at scale
  • AI-powered insight generation
  • Natural language search capabilities
  • Interactive decision-making visualizations
  • Flexible, maintainable infrastructure

Organizations can now convert content into searchable knowledge, extract insights automatically, develop data-driven content strategies, and enhance operational efficiency through automation.

As audio and video content generation continues to accelerate, the ability to efficiently process and extract value becomes increasingly critical. This architecture provides a robust foundation for current needs while remaining adaptable to future technological innovations.

We invite you to explore how this media analytics solution can address your organization’s unique challenges. Consider your specific use cases and unlock the insights waiting to be discovered in your media archives.


About the authors

Enhancing data durability in Amazon EMR HBase on Amazon S3 with the Amazon EMR WAL feature

Post Syndicated from Suthan Phillips original https://aws.amazon.com/blogs/big-data/enhancing-data-durability-in-amazon-emr-hbase-on-amazon-s3-with-the-amazon-emr-wal-feature/

Apache HBase, an open source NoSQL database, enables quick access to massive datasets. Amazon EMR, from version 5.2.0, lets you use HBase on Amazon Simple Storage Service (Amazon S3). This combines HBase’s speed with the durability advantages of Amazon S3. Also, it helps achieve the data lake architecture benefits such as the ability to scale storage and compute requirements separately. We see our customers choosing Amazon S3 over Hadoop Distributed File Systems (HDFS) when they want to achieve greater durability, availability, and simplified storage management. Amazon EMR continually improves HBase on Amazon S3, focusing on performance, availability, and reliability.

Despite these durability benefits of HBase on Amazon S3 architecture, a critical concern remains regarding data recovery when the Write-Ahead Log (WAL) is lost. Within the EMR framework, HBase data attains durability when it’s flushed, or written, to Amazon S3. This flushing process is triggered by reaching specific size and time thresholds or through manual initiation. Until data is successfully flushed to S3, it persists within the WAL, which is stored in HDFS. In this post, we dive deep into the new Amazon EMR WAL feature to help you understand how it works, how it enhances durability, and why it’s needed. We explore several scenarios that are well-suited for this feature.

HBase WAL overview

Each RegionServer in HBase is responsible for managing data from multiple tables. These tables are horizontally partitioned into regions, where each region represents a contiguous range of row keys. A RegionServer can host multiple such regions, potentially from different tables. At the RegionServer level, there is a single, shared WAL that records all write operations across all regions and tables in a sequential, append-only manner. This shared WAL makes sure durability is maintained by persisting each mutation before applying it to in-memory structures, enabling recovery in case of unexpected failures. Within each region, the memory structure of the MemStore is further divided by column families, which are the fundamental units of physical storage and I/O in HBase. Each column family maintains:

  • Its own MemStore, which holds recently written data in memory for fast access and buffering before it flushes to disk.
  • A set of HFiles, which are immutable data files stored on HDFS (or Amazon S3 in HBase on S3 mode) that hold the persistent, flushed data.

Although all column families within a region are served by the same RegionServer process, they operate independently in terms of memory buffering, flushing, and compaction. However, they still share the same WAL and RegionServer-level resources, which introduces a degree of coordination, hence they operate semi-independently within the broader region context. This architecture is shown in the following diagram.

Architecture diagram of HBase Region Server showing WAL, two regions with in-memory Memstores and persistent HFiles

Understanding the HBase write process: WAL, MemStore, and HFiles

The HBase write path initiates when a client issues a write request, typically through an RPC call directed to the appropriate RegionServer that hosts the target region. Upon receiving the request, the RegionServer identifies the correct HBase region based on the row key and forwards the KeyValue pair accordingly. The write operation follows a two-step process. First, the data is appended to the WAL, which promotes durability by recording every change before it’s committed to memory. The WAL resides on HDFS by default and exists independently on each RegionServer. Its primary purpose is to provide a recovery mechanism in the event of a failure, particularly for edits that have not yet been flushed to disk. When the WAL append is successful, the data is written to the MemStore, an in-memory store for each column family within the region. The MemStore accumulates updates until it reaches a predefined size threshold, controlled by the hbase.hregion.memstore.flush.size parameter (default is 128 MB). When this threshold is exceeded, a flush is triggered.Flushing is handled asynchronously by a background thread in the RegionServer. The thread writes the contents of the MemStore to a new HFile, which is then persisted to long-term storage. In Amazon EMR, the location of this HFile depends on the deployment mode: for HBase on Amazon S3, HFiles are stored in Amazon S3, but for HBase on HDFS, they’re stored in HDFS.This workflow is shown in the following diagram.

HBase write process workflow showing data path through WAL, Memstore, and HFile with AWS services

A region server serves multiple regions, and they all share a common WAL. The WAL records all data changes, storing them in local HDFS. Puts and deletes are initially logged to the WAL by the region server before being recorded in the MemStore for the affected store. Scan and get operations in HBase don’t require the use of the WAL. In the event of a region server crash or unavailability before MemStore flushing, the WAL is crucial for replaying data changes, which promotes data integrity. Because this log by default resides on a replicated filesystem, it enables an alternate server to access and replay the log, requiring nothing from the physically failed server for a complete recovery. When a RegionServer fails abruptly, HBase initiates an automated recovery process orchestrated by the HMaster. First, the ZooKeeper session timeout detects the RegionServer failure, notifying the HMaster. The HMaster then identifies all regions previously hosted on the failed RegionServer and marks them as unassigned. The WAL files from the failed RegionServer are split by region, and these split WAL files are distributed to the new RegionServers that will host the reassigned regions. Each new RegionServer replays its assigned WAL segments to recover the MemStore state that existed before the failure, preventing data loss. When WAL replay is complete, the regions become operational on their new RegionServers, and the recovery process concludes.

HBase recovery workflow from RegionServer failure through WAL splitting to regions online

The effectiveness of the HDFS WAL model relies on the successful completion of the write request in the WAL and the subsequent data replication in HDFS. In cases where some nodes are terminated, HDFS can still recover from the WAL files, allowing HBase to autonomously heal by replaying data from the WALs and rebalancing the regions. However, if all CORE nodes are simultaneously terminated, achieving complete cluster recovery is a challenge because the data to replay from the WAL is lost. The issue arises when WALs are lost due to CORE node shutdown (for example, all three replicas of a file block). In this scenario, HBase enters a loop attempting to replay data from the WALs. Unfortunately, the absence of available blocks in this case causes the HBase server crash procedure to fail and retry indefinitely.

Amazon EMR WAL

To address the mentioned challenge of HDFS WAL and to provide data durability in HBase, Amazon EMR introduces a new EMR WAL feature starting from versions emr-7.0 and emr-6.15. This feature facilitates the recovery of data that hasn’t been flushed to Amazon S3 (HFile). Using this feature provides thorough backup for your HBase clusters. Behind the scenes, the RegionServer writes WAL data to EMR WAL, which is a service outside the EMR cluster. With this feature enabled, concerns about loss of WAL data in HDFS are alleviated. Also, in the event of cluster or Availability Zone failure issues, you can create a new cluster, directing it to the same Amazon S3 root directory and EMR WAL workspace. This enables the automatic recovery of data in the WAL in the order of minutes. Recovery of unflushed data is supported for a duration of 30 days, after which remaining unflushed data is deleted. This workflow is shown in the following diagram.

Detailed sequence diagram of HBase write operations with EMR WAL service integration and S3 storage

Key benefits

Upon enabling EMR WAL, the WALs are located external to the EMR cluster. The key benefits are:

  • High availability – You can remain confident about data integrity even in the face of Availability Zone failures. Their HFiles are stored in Amazon S3, and the WALs are externally stored in EMR WAL. This setup enables cluster recovery and WAL replay in the same or a different Availability Zone within the region. However, for true high availability with zero downtime, relying solely on EMR WAL is not sufficient because recovery still involves brief interruptions. To provide seamless failover and uninterrupted service, HBase replication across multiple Availability Zones is essential along with EMR WAL, providing robust zero-downtime high availability.
  • Data durability improvement – Customers no longer need to concern themselves with potential data loss in scenarios involving WAL data corruption in HDFS or the removal of all replicas in HDFS due to instance terminations.

The following flow diagram compares the sequence of events with and without EMR WAL enabled.

HBase write process and failure recovery workflow with EMR WAL and S3 integration

Key EMR WAL features

In this section, we explore the key enhancements introduced in the EMR WAL service across recent Amazon EMR versions. From grouping multiple HBase regions into a single EMR WAL to advanced configuration options, these new capabilities address specific usage scenarios.

Grouping multiple HBase regions into a single Amazon EMR WAL

In Amazon EMR versions up to 7.2, a separate EMR WAL is created for each region, which can become expensive due to the EMR-WAL-WALHours pricing model, especially when the HBase cluster contains many regions. To address this, starting from Amazon EMR 7.3, we introduced the EMR WAL grouping feature, which enables consolidating multiple HBase regions per EMR WAL, offering significant cost savings (over 99% cost savings in our sample evaluation) and improved operational efficiency. By default, each HBase RegionServer has two Amazon EMR WALs. If you have many regions per RegionServer and want to increase throughput, you can customize the number of WALs per RegionServer by configuring the hbase.wal.regiongrouping.numgroups property. For instance, to set 10 EMR WALs per HBase RegionServer, you can use the following configuration:

[
  {
    "Classification": "hbase-site",
    "Properties": {
      "hbase.wal.regiongrouping.numgroups": "10"
    }
  }
]

The two HBase system tables hbase:meta and hbase:master (masterstore) don’t participate in the WAL grouping mechanisms.

In a performance test using m5.8xlarge instances with 1,000 regions per RegionServer, we observed a significant increase in throughput as the number of WALs grew from 1 to 20 per RegionServer (from 1,570 to 3,384 operations per sec). This led to a 54% improvement in average latency (from 40.5 ms to 18.8 ms) and a 72% reduction in 95th percentile latency (from 231 ms to 64 ms). However, beyond 20 WALs, we noted diminishing returns, with only slight performance improvements between 20 and 50 WALs, and average latency stabilized around 18.7ms. Based on these results, we recommend maintaining a lower region density (around 10 regions per WAL) for optimal performance. Nonetheless, it’s crucial to fine-tune this configuration according to your specific workload characteristics and performance requirements and conduct tests in your lower environment to validate the best setup.

Configurable maximum record size in EMR WAL

Until Amazon EMR version 7.4, the EMR WAL had a record size limit of 4 MB, which was insufficient for some customers. Starting from EMR 7.5, the maximum record size in EMR WAL is configurable through the emr.wal.max.payload.size property. The default value is set to 1 GB. The following is an example of how to set the maximum record size to 2 GB:

[
  {
    "Classification": "hbase-site",
    "Properties": {
      "emr.wal.max.payload.size": "2147483648"
    }
  }
]

AWS PrivateLink support

EMR WAL supports AWS PrivateLink, if you want to keep your connection within the AWS network. To set it up, create a virtual private cloud (VPC) endpoint using the AWS Management Console or AWS Command Line Interface (AWS CLI) and select the service labeled com.amazonaws.region.emrwal.prod. Make sure your VPC endpoint uses the same security groups as the EMR cluster. You have two DNS configuration options: enabling private DNS, which uses the standard endpoint format and automatically routes traffic privately, or using the provided VPC endpoint-specific DNS name for more explicit control. Regardless of the DNS option chosen, both methods mean that traffic remains within the AWS network, enhancing security. To implement this in the EMR cluster, update your cluster configuration to use the PrivateLink endpoint, as shown in the following code sample (for private DNS):

[
    {
        "Classification": "hbase-site",
        "Properties": {
            "emr.wal.client.endpoint": "https://prod.emrwal.region.amazonaws.com"
        }
    }
]

For more details, refer to Access Amazon EMR WAL through AWS PrivateLink in the Amazon EMR documentation.

Encryption options for WAL in Amazon EMR

Amazon EMR automatically encrypts data in transit in the EMR WAL service. You can enable server-side encryption (SSE) for WAL (data at rest) with two key management options:

  • SSE-EMR-WAL: Amazon EMR manages the encryption keys
  • SSE-KMS-WAL: You use an AWS Key Management Service (AWS KMS) key for encryption policies

EMR WAL cross-cluster replication

From EMR 7.5, EMR WAL supports cross-cluster replay, allowing clusters in an active-passive HBase replication setup to use EMR WAL.

For more details on the setup, refer to EMR WAL cross-cluster replication in the Amazon EMR documentation.

EMR WAL enhancement: Minimizing CPU load from HBase sync threads

Starting from EMR 7.9, we’ve implemented code optimizations in EMR WAL to address the high CPU utilization caused by sync threads used by HBase processes to write WAL edits, leading to improved CPU efficiency.

Sample use cases benefitting from this feature

Based on our customer interactions and feedback, this feature can help in the following scenarios.

Continuity during service disruptions

If your business demands disaster recovery with no data loss for an HBase on an S3 cluster due to unexpected service disruptions, such as an Availability Zone failure, the newly introduced feature means you don’t have to rely on a persistent event store solution using Amazon Managed Streaming for Apache Kafka (Amazon MSK) or Amazon Kinesis. Without EMR WAL, you had to set up a complex event-streaming pipeline to retain the most recently ingested data and enable replay from the point of failure. This new feature eliminates that dependency by storing Hbase WALs in the EMR WAL service.

Note: During an Availability Zone (AZ) failure or service-level issue, make sure to fully terminate the original Hbase cluster before launching a new one that points to the same S3 root directory. Running two active Hbase clusters that access the same S3 root can lead to data corruption.

Upgrading to the latest EMR releases or cluster rotations

Without EMR WAL, moving to the latest EMR version or managing cluster rotations with HBase on Amazon S3 necessitated manual interruptions for data flushing to S3. With the new feature, the requirement for data flushing is eliminated. However, during cluster termination and the subsequent launch of a new HBase cluster, there is an inevitable service downtime, during which data producers or ingestion pipelines must handle write disruptions or buffer incoming data until the system is fully restored. Also, the downstream services should account for temporary unavailability, which can be mitigated using a read replica cluster.

Overcoming HDFS challenges during HBase auto scaling

Without EMR WAL feature, having HDFS for your WAL files was a requirement. When implementing custom auto scaling for your HBase clusters, it sometimes resulted in WAL data corruption due to issues linked to HDFS. This is because, to prevent data loss, data blocks had to be moved to different HDFS nodes when one HDFS node was being decommissioned. When nodes continued to be terminated swiftly during scale-down process without allowing sufficient time for graceful decommissioning, it could result in WAL data corruption issues, primarily attributed to missing blocks.

Addressing HDFS disk space issues due to old WALs

When a WAL file is no longer required for recovery, indicating that HBase has made sure all data within the WAL file has been flushed, it’s transferred to the oldWALs folder for archival purposes. The log remains in this location until all other references to the WAL file are completed. In HBase use cases with high write activity, some customers have expressed concerns about the oldWALs directory (/usr/hbase/oldWALs) expanding and occupying excessive disk space and eventually causing disk space issues. With the complete relocation of these WALs to an external EMR WAL service, you will no longer encounter this issue.

Assessing HBase in Amazon EMR clusters with and without EMR WAL for fault tolerance

We conducted a data durability test employing two scripts. The first was for installing YCSB, creating a pre-split table, and loading 8 million records on the master node. The second was for terminating a core node every 90 seconds after a 3-minute wait, totaling five terminations. Two EMR clusters with eight core nodes each were created, one configured with EMR WAL enabled and the other as a standard EMR HBase cluster with the WAL stored in HDFS. After completion of EMR steps, a count was run on the HBase table. In the EMR cluster with EMR WAL enabled, all records were successfully inserted without corruption. In the cluster not using EMR WALs, regions in HBase remained “OPENING” if the node hosting the meta was terminated. For other core node terminations, inserts failed, resulting in a lower record count during validation.

Understanding when EMR WAL read charges apply in HBase

In HBase, standard table read operations such as Get and Scan don’t access WALs. Therefore, EMR WAL read (GiB) charges are only incurred during operations that involve reading from WALs, such as:

  • Restoring data from EMR WALs in a newly launched cluster
  • Replaying WALs to recover data on a crashed RegionServer
  • Performing HBase replication, which involves reading WALs to replicate data across clusters

In a normal scenario, you’re billed only for the following two components related to EMR WAL usage:

  • EMR-WAL-WALHours – Represents the hourly cost of WAL storage, calculated based on the number of WALs maintained. You can use the EMRWALCount metric in Amazon CloudWatch to monitor the number of WALs and track associated usage over time.
  • EMR-WAL-WriteRequestGiB – This reflects the volume of data written to the WAL service, charged by the amount of data written in GiB.

For further details on pricing, refer to Amazon EMR pricing and Amazon EMR Release Guide.

To monitor and analyze EMR WAL related costs in the AWS Cost and Usage Reports (CUR), look under product_servicecode = ‘ElasticMapReduce’, where you’ll find the following product_usagetype entries associated with WAL usage:

  • USE1-EMR-WAL-ReadRequestGiB
  • USE1-EMR-WAL-WALHours
  • USE1-EMR-WAL-WriteRequestGiB

The prefix USE1 indicates the Region (in this case, us-east-1) and will vary depending on where your EMR cluster is deployed.

Summary

This new EMR WAL feature allows you to improve durability of your Amazon EMR HBase on S3 clusters, addressing critical workload scenarios by eliminating the need for streaming solutions for Availability Zone level service disruptions, streamlining processes for upgrading or rotating clusters, preventing data corruption during HBase auto scaling or node termination events, and resolving disk space issues associated with old WALs. Because many of the EMR WAL features are added on the latest releases of Amazon EMR, we recommend that customers use Amazon EMR version 7.9 or later to fully benefit from these improvements.


About the authors

Suthan Phillips is a Senior Analytics Architect at AWS, where he helps customers design and optimize scalable, high-performance data solutions that drive business insights. He combines architectural guidance on system design and scalability with best practices to ensure efficient, secure implementation across data processing and experience layers. Outside of work, Suthan enjoys swimming, hiking and exploring the Pacific Northwest.

How Launchpad from Pega enables secure SaaS extensibility with AWS Lambda

Post Syndicated from Anton Aleksandrov original https://aws.amazon.com/blogs/architecture/how-launchpad-from-pega-enables-secure-saas-extensibility-with-aws-lambda/

Large organizations increasingly adopt software as a service (SaaS) solutions to focus on business priorities, reduce infrastructure management overhead, and optimize costs. These organizations expect SaaS vendors to provide customizability facilities for tailoring the solution behavior according to their needs. Although traditional approaches like feature flags and webhooks offer some flexibility, they often fall short of providing a high degree of customizability. A new emerging pattern in this space is tenant-supplied custom code execution, which allows tenants to inject their own code into specific workflow points, enabling deep customization while preserving the core SaaS solutions’ integrity and security.

In this post, we share how Pegasystems (Pega) built Launchpad, its new SaaS development platform, to solve a core challenge in multi-tenant environments: enabling secure customer customization. By running tenant code in isolated environments with AWS Lambda, Launchpad offers its customers a secure, scalable foundation, eliminating the need for bespoke code customizations.

Solution overview

Launchpad, which is built on AWS, is an end-to-end platform on which software providers can build, launch, and operate workflow-centric B2B SaaS applications and AI solutions. It provides a managed, secure, scalable cloud environment for hosting multi-tenant applications and data. It accelerates the build experience with generative AI-powered low code tools, prebuilt capabilities, and subscriber-level configuration. Being a multi-tenant platform at its core, Launchpad had to maintain stringent isolation across tenants in its architecture.

One of the requirements Launchpad had was to allow their tenants to augment the workflows natively by providing custom code. Some common scenarios included communicating with external systems with proprietary non-industry-standard protocols, reuse of existing business logic, and SDK-based custom code development.

The solution necessitated the ability for tenants to provide custom code that would implement the required business logic, which Launchpad would be executing. This required architecting a secure runtime environment for custom code execution that maintains the highest degree of cross-tenant isolation within the multi-tenant architecture, at the same time allowing sufficient access to platform APIs and services. It was essential to build an architecture that would decouple the environment running tenant code from the core SaaS platform, as illustrated in the following diagram.

Architecting the solution topology

To achieve the required high level of compute isolation for running code provided by different tenants, Launchpad has adopted Lambda functions in its architecture as the secure ephemeral compute environment. Each untrusted code snippet provided by tenants is bootstrapped as a stand-alone Lambda function, with strong Firecracker-based isolation across different functions and execution environments addressing Launchpad’s requirements. This isolation provides dedicated resources, customizable access permissions, independent monitoring and operations, and automatic scaling for each function, while maintaining complete separation from other functions and their execution environments, as illustrated in the following diagram.

With Lambda being a serverless compute service, adopting it for the Launchpad architecture yielded several significant benefits. The major business benefit was that tenants could implement thousands of custom workflow augmentations on their own simply by providing code snippets, instead of the Launchpad engineering team being responsible for implementing them in the core platform code. Other benefits included:

  • Managed runtimes – AWS handles patching and updating the underlying infrastructure, operating system, and runtimes for customers, reducing the potential attack surface.
  • Fine-grained permissions – Each function can have its own set of access policies to tightly control what resources and actions it can access.
  • No need to pre-provision and pay for overprovisioned capacity – Lambda functions scale up and down automatically based on traffic patterns.
  • Built-in monitoring – Lambda functions emit detailed metrics, logs, and traces through Amazon CloudWatch and AWS X-Ray out of the box, making it straightforward to monitor tenant code execution.

To further reduce risks, Launchpad runs these Lambda functions with untrusted code in a dedicated AWS account. This account is separated from the core SaaS platform account. When end-users create a new function in the Launchpad authoring portal, they upload their code and specify the code handler to be executed during the invocation. Users can also map function input and output to Launchpad fields for further processing to enable an even higher degree of customizability and integration. The multi-tenant authoring service is a Control Plane component that runs as a microservice on the Amazon Elastic Kubernetes Service (Amazon EKS) cluster and uses the Lambda API for function lifecycle management, as illustrated in the following diagram. After a function resource is created, it can be used for further invocations.

Runtime architecture

At runtime, when Launchpad needs to invoke a function, it calls the Lambda Invoke API. Before the function is invoked, the multi-tenant runtime service performs a tenancy check to make sure the request is coming from an authorized tenant by doing the token validation. After a successful validation, the service invokes the required Lambda function. To invoke functions hosted in a different AWS account, the multi-tenant runtime service uses an AWS Identity and Access Management (IAM) role to assume the required permissions and invokes the Lambda service using the AWS SDK. The sequence of interactions is shown in the following architecture diagram.

The workflow consists of the following steps:

  1. An incoming user request reaches the application gateway service.
  2. The application gateway authenticates the request using the tenancy security service.
  3. After it’s authenticated, the request is forwarded to the multi-tenant runtime service
  4. The multi-tenant runtime service validates the supplied token and performs a tenancy check. This makes sure tenants can only invoke own functions they have permissions for (for example, functions they own).
  5. The multi-tenant runtime service pod assumes the IAM role required for invoking the tenant-specific Lambda function in a different AWS account.
  6. The multi-tenant runtime service pod invokes the required Lambda function.

Invoking the platform API from custom code is as straightforward as connecting to any external API. The custom code can authenticate with the platform using OAuth2. To facilitate the authentication, the developer can pass along the credentials as input parameters to the function from the core platform. Then the developer can create a corresponding record (isolated by tenant) in the platform that stores the credentials per function, and pass credentials as input parameters during invocation.

Distributed architecture observability

Operating a distributed architecture that runs untrusted code across multiple AWS accounts requires a comprehensive observability strategy. Launchpad’s approach combines centralized logging and monitoring with cross-account aggregation to provide a unified operational view of the platform.

The monitoring architecture uses CloudWatch Metrics to observe the Lambda functions, aggregating them through a centralized observability layer. This setup empowers platform operators to correlate Lambda function metrics with the core platform services running on Amazon EKS. Launchpad also collects per-function telemetry like function invocations, error rates, and execution time, which allows them to observe per-tenant metrics. These telemetry dimensions enable both a platform-wide and tenant-specific monitoring perspective.

For logging and troubleshooting, Launchpad implements a unified logging pipeline that aggregates Lambda function logs with application gateway and runtime service logs. Each request flowing through the system carries a correlation ID, so operators can trace execution paths across the core SaaS services and into the tenant functions running in the AWS account running tenant Lambda functions.

With this multi-layer observability architecture, Launchpad can maintain operational excellence while running tenant code securely at scale. Regular operational reviews drive continuous improvements in monitoring coverage and incident response procedures. Having per-tenant Lambda functions make it possible for Launchpad to use tenant-specific cost allocation tags, further empowering them to understand the cost footprint of running tenant custom code.

Best practices

When building a SaaS solution, maintaining a unified core code base is essential for scalability and manageability. Implementing per-tenant variations within the core platform code can lead to maintenance complexity and technical debt. Instead, architect your SaaS solution to have extension points, which allow your tenants to inject their custom code at specific points in the workflow, enabling customization without compromising the platform’s maintainability. This pattern makes sure the core SaaS platform remains clean and standardized while offering the flexibility that customers demand.

Additional best practices include:

  • Use separate accounts for running Lambda functions with untrusted tenant-provided code to make sure it’s isolated from your core SaaS platform code.
  • Grant absolute minimum required access permissions to the execution role assigned to the function. The custom code running within the execution environment gets permissions defined in the execution role when making requests to AWS API endpoints. If the function doesn’t need to reach out to AWS API endpoints, remove all permissions from the execution role and add an explicit AWSDenyAll policy.
  • Use separate Lambda functions for each code snippet and each tenant. This will provide the highest degree of cross-tenant isolation. Resources are not reused across different functions and execution environments.
  • Use Lambda layers in case you need to add a layer of vendor-provided code in order to keep it separated from the untrusted tenant-provided code.
  • Implement additional security controls, such as using Amazon Virtual Private Cloud (Amazon VPC) constructs to restrict network access and VPC Flow Logs for network activity monitoring.

Conclusion

The implementation of a secure untrusted code execution environment within SaaS platforms addresses a critical need for tenant customization while maintaining architectural integrity. Lambda offers a built-in isolation model, fine-grained security controls, and serverless scalability, so SaaS providers such as Launchpad can address the requirements of executing tenant-provided code in a multi-tenant environment and offer robust customization capabilities while maintaining strict security boundaries and operational efficiency. This architectural pattern enables providers to focus on core platform development while confidently supporting tenant-specific workflows through the secure and scalable Lambda execution environment.

To learn more, refer to the Security Overview of AWS Lambda white paper. For additional serverless architectural patterns, see Serverlessland.com.


About the authors

PackScan: Building real-time sort center analytics with AWS Services

Post Syndicated from Sairam Vangapally original https://aws.amazon.com/blogs/big-data/packscan-building-real-time-sort-center-analytics-with-aws-services/

Amazon manages a complex logistics network with multiple touch points, from fulfillment centers to sort centers to final customer delivery. Among these, sort centers play a crucial role in the middle mile, providing faster and more efficient package movement. Within Amazon’s Middle Mile operations, high-volume sort centers process millions of packages daily, making immediate access to operational data essential for optimizing efficiency and decision-making. Real-time visibility into key metrics—such as package movements, container statuses, and associate productivity—is critical for smooth logistics operations. To address the need for real-time operational planning, the Amazon Middle Mile team developed PackScan, a cloud-based platform designed to provide instant insights across the network. By significantly reducing data latency, PackScan enables proactive decision-making, so teams can monitor inbound package flows, optimize outbound shipments based on live data, track associate productivity, identify bottlenecks, and enhance overall operational efficiency—all in real time.

In this post, we explore how PackScan uses Amazon cloud-based services to drive real-time visibility, improve logistics efficiency, and support the seamless movement of packages across Amazon’s Middle Mile network.

Prerequisites

This post assumes a foundational understanding of the following services and concepts:

Although hands-on experience is not required, a conceptual understanding of these services will help in understanding the architecture, design patterns, and components discussed throughout the article.

Business challenges

Amazon’s sort centers handle over 15 million packages daily across more than 120 facilities in North America. Given this scale, even minor delays in operational insights can lead to inefficiencies, increased costs, and escalations. Traditionally, data latencies of up to an hour have restricted the ability to make proactive decisions, directly affecting productivity, resource allocation, and responsiveness—especially during peak periods like holiday seasons and big deal days.

Without immediate visibility into package movements, container statuses, and associate performance, operational teams face challenges in identifying and resolving bottlenecks in real time. The lack of timely insights can disrupt the flow of packages, leading to shipment delays, reduced throughput, and suboptimal facility performance. Addressing these inefficiencies required a solution capable of delivering real-time, high-fidelity data to support rapid decision-making.

To bridge this gap, Amazon’s Middle Mile organization needed a scalable platform that could enhance visibility, minimize latency, and provide up-to-the-minute insights into logistics operations. PackScan was designed to meet these demands, giving teams access to the real-time data necessary to optimize workflows, mitigate bottlenecks, and improve overall efficiency.

Data flow

In 2024, PackScan was deployed across 80 sort centers in the USA, enabling real-time package analytics. The solution powers Grafana dashboards, which refresh every 10 seconds by fetching live package data from OpenSearch Service. With this near real-time visibility, operations teams can monitor package movement and sorting efficiency across sort centers. The following diagram outlines how package scan data is ingested, processed, and made actionable.

Each sort center is equipped with hardware at inbound stations where packages arrive from trailers. Integrated barcode scanners automatically scan each package as it enters the sorting process. Every scan generates an SNS event, capturing key attributes such as the package ID, dimensions, the associate who performed the scan, and the timestamp and location of the scan.

After they’re generated, these SNS events are ingested into Data Firehose through a Lambda function, where the data undergoes real-time enrichment. During this process, additional attributes are appended, including the business logic rules. The enriched data is then streamed into OpenSearch Service, where events are indexed to enable fast and efficient querying. With the indexed package scan events available in OpenSearch Service, real-time analytics and monitoring become possible. The Grafana dashboards query this data every 10 seconds, providing operational insights into package inflow metrics and associate performance.

Solution overview

PackScan was implemented using a structured and scalable approach, using AWS cloud-based services to enable high-frequency data ingestion, real-time processing, and actionable insights. The architecture is designed to minimize latency while providing reliability, scalability, and operational efficiency. The solution is built around a serverless, event-driven architecture that dynamically scales based on data ingestion volumes. The architecture—illustrated in the following figure—enabled us to build a real-time data solution, utilizing the advantages of various AWS services to provide low-latency analytics, high scalability, and real-time operational insights across Amazon’s sort centers.

The following are the key components and features of the solution:

  • Real-time data processing – Lambda functions serve as the processing backbone of the system, handling 500,000 scan events per second. Each incoming event is processed by applying data transformations, enrichment, and validation before passing it downstream.
  • High-frequency data ingestion and streaming – Data Firehose is the primary ingestion pipeline, handling millions of scan events daily from thousands of barcode scanners across multiple sort centers. The Firehose streams handle incoming data of 12,000 PUT requests per second, maintaining smooth ingestion and low-latency streaming. Data retention policies are set to buffer and forward enriched events every 60 seconds or upon reaching 5 MB batch size, optimizing storage and processing efficiency.
  • Optimized querying and operational insights – OpenSearch Service is used to index and store the processed scan events, providing real-time querying and anomaly detection. The OpenSearch cluster consists of 12 data nodes (r5.4xlarge.search) and 3 primary nodes (r5.large.search), processing up to 10 GB of data per day with a rolling index strategy, where indexes are rotated every 24 hours to maintain query performance. The system supports concurrent queries per second, enabling logistics teams to perform rapid lookups and gain instant visibility into package movements.
  • Live visualization and dashboarding – Grafana, hosted on an m5.12xlarge EC2 instance, provides real-time visualization of key logistics metrics. The dashboards refresh every 10 seconds, querying OpenSearch and displaying up-to-the-minute package analytics. The setup includes multiple preconfigured dashboards, monitoring package flow at different inbound stations, and workforce efficiency. These dashboards support concurrent users, enabling supervisors and associates to track and optimize operations proactively. The following screenshot shows one of the real-time dashboards, with details of package flow by different routes within sort centers.

The entire PackScan architecture is designed for automatic scaling, adjusting dynamically based on data ingestion volume to maintain efficiency during peak and off-peak operations. This approach provides cost-effective resource utilization while maintaining high availability and performance.

Business outcomes

The implementation of PackScan has led to measurable improvements in operational efficiency, workforce productivity, and real-time decision-making across Amazon’s sort centers. By reducing data latency and enabling real-time insights, PackScan has transformed logistics operations in meaningful ways:

  • Widespread deployment – PackScan was deployed across 80 sort centers, supporting approximately 1,000 display monitors that provide real-time operational insights.
  • Significant reduction in data latency – Data latency dropped from approximately 1 hour to less than 1 minute, allowing for real-time operational responsiveness and minimizing workflow disruptions.
  • Proactive operational management – With dynamic workload balancing and instant bottleneck identification, supervisors can now address issues as they arise, leading to smoother operations and fewer escalations.
  • Boost in workforce productivity – The real-time performance feedback has enhanced associate engagement, resulting in a 25% increase in throughput per hour and 12% reduction in labor hours.

Overall, PackScan has redefined real-time logistics visibility within Amazon’s Middle Mile operations, empowering operational teams with actionable insights, enhanced workforce efficiency, and a data-driven approach to package movement and sort center performance.

Lessons learned and best practices

The deployment and scaling of PackScan provided valuable insights into optimizing real-time logistics visibility. Several key lessons and best practices emerged from this implementation:

  • Cloud architecture drives efficiency – Adopting Amazon technologies provides seamless scalability, reduced operational overhead, and lower infrastructure costs, while maintaining high reliability. The following table shows an approximate breakdown of monthly service costs observed in production. This is an estimation based on current pricing; we recommend checking the respective AWS service pricing pages to generate the most up-to-date quote. This architecture demonstrates that with combination of provisioned and serverless design, production-ready solutions can be built and scaled at a fraction of the cost of traditional infrastructure.
AWS Service Description Estimated Monthly Cost
Amazon EC2 Three EC2 instances of type m5.12xlarge hosting Grafana $1,700
AWS Lambda Streams SNS events to Data Firehose $4,000
Amazon Data Firehose Real-time data delivery with 12,000 records streaming to OpenSearch Service $1,500
Amazon OpenSearch Service Indexing and querying package scan events $28,000
  • Real-time visibility is a game changer – Immediate access to operational data enhances agility, enabling teams to make timely, data-driven decisions that prevent bottlenecks and improve throughput.
  • Continuous monitoring enhances decision-making – Operational dashboards should evolve with business needs. Regular monitoring and updates provide accuracy, usability, and relevance in driving informed decision-making.

By applying these best practices, PackScan has set a foundation for scalable, real-time logistics management, making sure that Amazon’s Middle Mile operations remain proactive, efficient, and highly responsive to changing business demands.

Conclusion

PackScan has successfully transformed real-time operational visibility within Amazon’s sort centers, addressing critical challenges in data latency, workforce productivity, and logistics efficiency. By using AWS services, particularly Data Firehose for real-time data delivery and OpenSearch Service for analytics, PackScan has enabled proactive decision-making, streamlined operations, and enhanced throughput in high-volume sort environments. Looking ahead, future enhancements will focus on further elevating operational intelligence and scalability, including:

  • Integrating predictive analytics to anticipate workflow bottlenecks and optimize resource allocation
  • Scaling the solution across additional operational scenarios, providing greater resilience and adaptability to dynamic logistics environments

With these advancements, PackScan will continue to drive operational excellence, cost-efficiency, and real-time decision-making capabilities, reinforcing Amazon’s commitment to innovation in logistics and supply chain management.

For those interested in implementing similar solutions, we recommend exploring AWS Serverless Architecture Patterns and the AWS Architecture Blog for additional insights and best practices in building scalable, real-time analytics solutions.


About the authors

Sairam Vangapally is a Data Engineer at Amazon with extensive experience architecting real-time, large-scale data platforms that power critical logistics operations across North America. He has led the design and deployment of end-to-end data pipelines, enabling high-throughput ingestion, transformation, and analytics at scale. He is passionate about building resilient data infrastructure and driving cross-functional collaboration to deliver solutions that accelerate operational insights and business impact.

Nitin Goyal serves as a Data Engineering Manager in Amazon’s Sort Center organization, where he leads initiatives to optimize operational efficiency across North American facilities. With over nine years of tenure at Amazon spanning multiple teams, he specializes in architecting high-performance data systems, with particular emphasis on real-time streaming pipelines, artificial intelligence, and low-latency solutions. His expertise drives the development of sophisticated operational workflows that enhance sort center productivity and effectiveness.

Modernizing applications with AWS AppSync Events

Post Syndicated from Ricardo Marques original https://aws.amazon.com/blogs/compute/modernizing-applications-with-aws-appsync-events/

In today’s fast-paced digital world, organizations are facing challenges for modernizing their applications. A common problem is the smooth shift from synchronous to asynchronous communication without substantial client or frontend alterations. When modernizing applications, it is often necessary to move from a synchronous communication model to an asynchronous one. However, this transition can be complex, especially when the client or frontend communicates synchronously. Adapting the current code for asynchronous communication demands significant time and resources.

AWS AppSync Events helps address this challenge by enabling you to build event-driven APIs that can bridge between synchronous and asynchronous communication models. With AppSync Events, you can modernize your backend architecture to leverage asynchronous patterns while maintaining compatibility with existing synchronous clients.

Overview

The solution comprises an API that converts client synchronous requests to asynchronous backend requests using AppSync Events.

For demonstrating the integration between the API and the backend, I’m simulating the backend processing using an asynchronous AWS Step Functions workflow. This workflow receives a Name and Surname event, waits 10 seconds, and posts a full-name event to the AppSync Event channel. To receive event notifications, the API subscribes to the AppSync channel. At the same time, the backend handles events asynchronously.

Figure 1: Representation of an API integrating a synchronous frontend with an asynchronous backend using AWS AppSync Events.

Figure 1: Representation of an API integrating a synchronous frontend with an asynchronous backend using AWS AppSync Events.

  1. The Amazon API Gateway makes a synchronous request to AWS Lambda and waits for the response.
  2. Lambda function starts the execution of the asynchronous workflow.
  3. After starting the workflow execution, Lambda connects to AppSync and creates a channel to receive asynchronous notifications (channels are ephemeral and unlimited. Here it creates one channel per request using the workflow execution ID).
  4. The workflow executes asynchronously, calling other workflows.
  5. Upon completion of the main workflow, it sends a POST request to the AppSync events API with the processing result. The POST is made to the channel that was created by the Lambda function using the workflow execution ID.
  6. AppSync receives the POST request and sends a notification to the subscriber, which in this case is the Lambda function. The entire process must be finished within the Lambda functions’s timeout limit you defined.
  7. Lambda sends the response to the API Gateway, which has been waiting for the synchronous response.

To better understand the Event API WebSocket Protocol used in this solution, refer to this AppSync documentation.

You can access the GitHub repo through this link: AppSync_Sync_Async_Integration.

The repository includes a comprehensive README file that walks you through the process of setting up and configuring the preceding solution.

Prerequisites

To follow this walkthrough, you need the following prerequisites:

With the full code, including API Gateway and Step Functions, on GitHub, this post only covers the core components: the AppSync Events API and the Lambda function.

Walkthrough

The following steps walk you through this solution.

Creating an AppSync event API with API Key Authorization

An AppSync Event API allows calls using API key, Amazon Cognito user pools, Lambda authorizer, OIDC, or AWS identity and Access Management (IAM). This solution uses API Key.

The infrastructure as code (IaC) has been created using Terraform. However, as of writing this post, there weren’t Terraform AppSync Event API resource available. Therefore, the AppSync Event API resources were made with AWS CloudFormation, which is imported and implemented by Terraform.

In the resource AWS:AppSync:Api, define the API name and Auth method:

Resources:
  #Creating the AppSync Events API
  EventAPI:
    Type: AWS::AppSync::Api
    Properties:
      Name: SyncAsyncAPI
      EventConfig:
        AuthProviders:
          - AuthType: API_KEY
        ConnectionAuthModes:
          - AuthType: API_KEY
        DefaultPublishAuthModes:
          - AuthType: API_KEY
        DefaultSubscribeAuthModes:
          - AuthType: API_KEY
#Creating the Events API Namespace
  DefaultNamespace:
    Type: AWS::AppSync::ChannelNamespace
    Properties:
      Name: AsyncEvents
      ApiId: !GetAtt EventAPI.ApiId
  
  #Creating the Events API APIKey
  EventAPIKey:
    Type: AWS::AppSync::ApiKey
    Properties:
      ApiId: !GetAtt EventAPI.ApiId
      Expires: 1748950672
      Description: 'API Key for Event API'

  #Creating the SecretsManager to store the APIKey
  SecretsManagerAPIKey:
    Type: AWS::SecretsManager::Secret
    Properties:
      Name: 'AppSyncEventAPIKEY'
      SecretString: !GetAtt EventAPIKey.ApiKey

To have the Host DNS, Realtime Endpoint, and Secret Manager created referenced by the Terraform template, output them:

Outputs:
  ApiARN:
    Description: 'The ARN ID'
    Value: !GetAtt EventAPI.ApiArn

  AppSyncHost:
    Description: 'The API Endpoint'
    Value: !GetAtt EventAPI.Dns.Http

  AppSyncRealTimeEndpoint:
    Description: 'The Real-time Endpoint'
    Value: !GetAtt EventAPI.Dns.Realtime

  SecretsManagerARN:
    Description: 'The ARN of the Secrets Manager entry'
    Value: !Ref SecretsManagerAPIKey

The key information needed from the AppSync Event API is:

  1. Host DNS: This DNS is used to send events to the API Channel through HTTP Post requests.
  2. Realtime endpoint: This endpoint is a WebSocket endpoint where the Lambda function connects to receive the events posted in the AppSync Channel.
  3. API Key: This key is used not only in the Post HTTP requests, but also to connect and subscribe to the AppSync channel.

Lambda Sync/Async API

In this solution, the Lambda function runs two tasks:

  1. Start an asynchronous workflow
  2. Subscribe to an event channel through WebSocket

To handle the WebSocket connection, use the websocket-client lib, which is a powerful Python lib developed for working with WebSockets.

Request isolation is maintained by using the same UUID for workflow name and AppSync channel name.

try:
        handler = WebSocketHandler()
        sfn_response = wf.start_workflow_async(event["body"])
        
        if sfn_response["status"] == "started":
            handler.execution_name = sfn_response["id"]
            handler.start_websocket_connection()
            
            return {
                'statusCode': 200,
                'body': json.dumps({ 
                        "id": handler.execution_name,
                        "nome completo": handler.final_name
                        })
            }
        else:
            raise ValueError("Workflow failed to start")

First, to initialize the WebSocket Connection, the subprotocols must be defined:

  • WEBSOCKET_PROTOCOL
  • Headers:
    • Host: The AppSync Host DNS (even with a WebSocket Connection, the HTTP Host must be sent)
    • x-api-key: The API key create fot the Event API.
    • Sec-Websocket-Protocol: WEBSOCKET_PROTOCOL
def start_websocket_connection(self) -> None:
        try: 
            """Initialize and start WebSocket connection."""
            header_str = self._create_connection_header()
            
            self.ws = websocket.WebSocketApp(
                os.environ["API_URL"],
                subprotocols=[WEBSOCKET_PROTOCOL, f'header-{header_str}'],
                on_open=self.on_open,
                on_message=self.on_message,
                on_error=self.on_error,
                on_close=self.on_close
)
            self.ws.run_forever()
        except Exception as e:
            return e
def _create_connection_header(self) -> str:
        """Create and encode connection header."""
        connection_header = {
            "host": os.environ["API_HOST"],
            "x-api-key": APIKEY,
            "Sec-WebSocket-Protocol": WEBSOCKET_PROTOCOL
        }
        return base64.b64encode(json.dumps(connection_header).encode()).decode()

Once the WebSocket connection is established, a first message with the type CONNECTION_INIT_TYPE must be sent.

To subscribe to the channel by which our function is notified when the Step Functions workflow finishes, send a second message with the type SUBSCRIBE_TYPE, an ID, the channel name and authorization.

For more information about types of message, read this AppSync documentation.

def on_open(self, ws: websocket.WebSocketApp) -> None:
        try:
            """Handle WebSocket connection opening and send initial messages."""
            logger.info("Connection opened")
            
            # Send connection initialization
            connection_init = {"type": CONNECTION_INIT_TYPE}
            ws.send(json.dumps(connection_init))

            # Send subscription
            subscription_msg = {
                "type": SUBSCRIBE_TYPE,
                "id": self.execution_name,
                "channel": f"{os.environ["APPSYNC_NAMESPACE"]}/{self.execution_name}",
                "authorization": {
                    "x-api-key": APIKEY,
                    "host": os.environ["API_HOST"]
                }
            }
            
            logger.info("Sending subscription")
            ws.send(json.dumps(subscription_msg))
        except Exception as e:
            self.on_error = e

After receiving the message confirming the subscription, wait for messages with the type data. Whenever a message with this type arrives, execute the logic to identify if the workflow was successfully executed, and then close the connection.

def on_message(self, ws: websocket.WebSocketApp, message: str) -> None:
        """Handle incoming WebSocket messages."""
        logger.info("Message received: %s", message)
        try:
            message_dict = json.loads(message)
            required_keys = ["id", "type", "event"]
            
            if all(key in message_dict for key in required_keys):
                event_json = json.loads(message_dict["event"])
                
                if (message_dict["id"] == self.execution_name and 
                    message_dict["type"] == "data"):
                    
                    self.final_name = event_json["nome_completo"]
                    logger.info("Message received: %s", self.final_name)
                    logger.info("Successfully received return message")
                    logger.info("Ending processing")
                    
                    self.message_queue = {
                        "status": SUCCESS_STATUS,
                        "executionID": message_dict["id"]
                    }
                    ws.close()
        except json.JSONDecodeError as e:
            logger.error("Failed to parse message: %s", str(e))
        except Exception as e:
            logger.error("Error processing message: %s", str(e))

Conclusion

In this post, you learned how to use event-driven architectures and the capabilities of AWS AppSync Events to integrate synchronous and asynchronous communication patterns in your applications. This allows you to modernize your systems without the need for extensive modifications to your existing frontend codebase. Explore the demonstrations and documentation provided in the GitHub repository to gain a deeper understanding of how AppSync Events can be applied to your specific use cases.

To learn more about serverless architectures and asynchronous invocation patterns, see Serverless Land.

Enhancing multi-account activity monitoring with event-driven architectures

Post Syndicated from Anton Aleksandrov original https://aws.amazon.com/blogs/compute/enhancing-multi-account-activity-monitoring-with-event-driven-architectures/

Enterprise cloud environments are growing increasingly complex as they scale, with organizations managing hundreds to thousands of Amazon Web Services (AWS) accounts across multiple business units and AWS Regions. Organizations need efficient ways to collect, transport, and analyze activity data for threat detection and compliance monitoring. This presents unique challenges for enterprise Application Security (AppSec) teams, cloud security vendors, and DevSecOps professionals, because traditional polling-based monitoring approaches struggle to provide real-time activity insights needed for modern cloud operations.

In this post, you will learn to use AWS CloudTrail and Amazon EventBridge for real-time cloud activity monitoring and automated response.

Overview

As organizations expand their cloud footprint, account activity monitoring that comprehensively tracks user actions and successfully identifies security threats becomes crucial for threat detection and compliance. Although AWS provides native tools—such as CloudTrail for API activity capture, EventBridge for real-time event routing, AWS Organizations for multi-account management, and AWS Config for resource evaluation—many enterprises struggle with the volume of activities while maintaining efficiency and controlling costs. Organizations need to carefully architect solutions to effectively use these tools as their environments scale.

Traditional polling-based techniques, which worked well for smaller environments, can become unsustainable when scaled to enterprise deployments, where the volume of activity data grows exponentially with each new account and service. API polling limitations, growing data volumes, and increasing demand for real-time analysis are pushing teams to rethink their architectural approach.

Figure 1. Poll model, periodically retrieving the latest state.

Adopting push-based event-driven architectures offers a compelling solution for AppSec teams and cloud security vendors facing these challenges. Using AWS services, such as CloudTrail and EventBridge, allows these teams and vendors to build scalable activity monitoring solutions that overcome the limitations of traditional polling-based approaches and provide real-time notifications across thousands of AWS accounts. This approach not only enables security use cases but also supports broader real-time operational monitoring, compliance reporting, and automation requirements.

“By integrating AWS CloudTrail and Amazon EventBridge, we’ve built a scalable architecture to monitor activity across thousands of AWS accounts. This provides the visibility needed to detect threats in real time and protect large, distributed AWS environments.” — Rob Solomon, Senior Cloud Solution Architect, CrowdStrike

Solution components

Enterprise AppSec teams and cloud security vendors share common requirements when building multi-account monitoring solutions. They need to efficiently collect activity data across thousands of accounts, transport it to a centralized location for analysis, and process it in real-time to detect threats and compliance violations. The solution must scale seamlessly from dozens to thousands of accounts while remaining highly-performant and cost-efficient. At its core, a scalable multi-account activity monitoring solution consists of three components: activity data collection, cross-account transport to a centralized location, and processing. In the following sections, you will learn how AppSec teams and cloud security vendors can implement each step efficiently while avoiding common pitfalls.

Figure 2. Push model. Account activity is collected at the source, and pushed to the AppSec or cloud security vendor account for further processing.

Data collection strategies

Many teams begin their cloud activity monitoring journey by polling the resource status through service management APIs. Although this approach works good for retrieving the latest resource state on-demand, its fundamental limitation is inability to detect state changes efficiently, necessitating continuous querying of all resources at fixed intervals. Consider a scenario where you’re monitoring 1,000 accounts, with 100 resources in each account. A single polling cycle would necessitate 100,000 API requests, consuming over 28 million API calls daily if running at five-minute intervals. This inefficiency compounds as environments grow, leading to throttling issues, increased costs, and scaling challenges.

AWS Config improved upon this by offering continuous resource configuration tracking without manual polling. Although this works excellent for configuration compliance and a history of changes for auditing, AWS Config reports changes on a best-effort basis and is not optimal for real-time threat detection.

To overcome this constraint, your solution can use services such as CloudTrail and EventBridge as primary data sources, complemented by intelligent on-demand targeted API polling. CloudTrail records API activity across AWS services, providing a detailed history of actions taken by users, roles, and AWS services in your accounts. Over 250 AWS services automatically report their activity and API calls to CloudTrail and EventBridge in real-time. This allows you to capture this information, providing a detailed history of actions taken in your accounts, and enabling security analysis, resource state change tracking, and compliance audit.

Figure 3. Over 250 AWS services are automatically emitting activity events to CloudTrail.

When a resource state changes, commonly as a result of a management API call, the affected service sends an event to CloudTrail and EventBridge. Your monitoring solution can examine the event payload to determine if polling for supplementary data is necessary, particularly when the initial payload lacks complete information. This provides you with comprehensive service coverage with reduced maintenance effort. This hybrid approach guarantees delivery of activity data to eliminate monitoring blind spots, while significantly reducing AWS management API quota consumption.

Cross-account data transport

Your solution should transport activity data from thousands of tenant accounts into a small number of centralized accounts, such as a regional AppSec account, for further processing and analysis. The solution must be secure, scalable, resilient, and cost-efficient while maintaining real-time delivery.

The most direct way to achieve it is to enable Amazon S3 event notifications for new objects that are created in the CloudTrail trails S3 bucket. When you receive the notification, you can retrieve and process new activities.

Figure 4. Exporting CloudTrail events into an S3 bucket and retrieving after receiving a notification.

This direct way to consume CloudTrail events has one important consideration: typically it can take an average of five minutes to deliver events to Amazon S3. Teams and vendors looking for lower mean-time-to-detect (MTTD) and mean-time-to-respond (MTTR) should evaluate transporting CloudTrail events across accounts with EventBridge, which provides close-to-real-time delivery.

Transporting events with EventBridge

EventBridge is a serverless event router that connects applications. It receives events from various sources, such as CloudTrail, and routes them to multiple targets based on defined rules.

Using EventBridge for cross-account data transport comes with several major benefits:

There are two approaches you can take for delivering cross-account events with EventBridge: direct service-to-service or service-to-API-endpoint.

The first approach uses the EventBridge direct bus-to-bus and bus-to-service delivery capabilities. This method is most suitable when you want AWS to handle data ingestion on the receiving end. The delivery target is always either an EventBridge bus, or another AWS service, such as an Amazon Simple Queue Service (Amazon SQS) queue, Amazon Kinesis Data Streams stream, or an AWS Lambda function. With support for up to 18,750 target invocations per second and native AWS Identity and Access Management (IAM) integration, this method is particularly suitable for large multi-account deployments.

The second approach uses the EventBridge API destinations feature. This method is most suitable when you have existing HTTP-based ingestion endpoints in place. Although it offers lower throughput, it provides greater flexibility for ingestion endpoint and authentication methods implementation, making it attractive for AppSec teams and cloud security vendors integrating with existing ingestion infrastructure.

Figure 5. Emitting CloudTrail events in real-time through EventBridge.

The following table summarizes two approaches for transporting events across accounts with EventBridge.

Direct bus-to-bus or bus-to-service API destinations
Data ingestion implementation effort Minimal Needed
Default target invocations per second (TPS) quotas Up to 18,750 (region dependent) Up to 300 (region dependent)
Can the TPS quota be increased Yes Yes
Authorization support Native AWS IAM, fully handled by AWS Basic, OAuth2, API Key. You’re responsible for implementing credentials validation during ingestion.
Cross-account delivery costs $1 per million events $0.20 per million events

Go to the EventBridge quotas and pricing pages for more details.

Processing architecture

Processing would commonly be done by existing products and services the AppSec team or cloud security vendor provides for activity analysis. The architecture for event processing pipeline operating at enterprise scale must consider design decisions to handle large and potentially irregular event volumes while maintaining high performance, as shown in the following figure.

Figure 6. An activity event processing pipeline, with priority-based processing.

Use the following best practices for a robust processing architecture:

  • Buffer ingested events Use services such as Amazon SQS, Amazon Kinesis Data Streams, or Amazon Managed Streaming for Apache Kafka to buffer incoming events, handle traffic surges, and make sure of reliable processing.
  • Use serverless services that scale automatically, or invest in automated scaling mechanisms that adjust processing capacity based on event volume
  • Minimize polling: Resort to intelligent on-demand polling, only poll when you need additional data that is not available in the CloudTrail event payload.
  • Routing and classification: Rather than processing all events equally, implement intelligent classification and routing early in your pipeline. Security-related events such as IAM changes or security group modifications should take priority over routine activities or data events. This approach helps to control costs while maintaining rapid detection of important security events.
  • Cost optimization: At the enterprise scale, cost optimization becomes crucial. Use EventBridge rules in source accounts to filter out irrelevant events before they enter your processing pipeline. Consider implementing regional collection points to optimize data transfer costs. When using Lambda functions for data processing, use batch processing to reduce invocation costs. Evaluate which event types must be delivered in real-time through EventBridge, which event types can be delayed and collected through S3 bucket export, and which events should be discarded.
  • Observability: Monitor the ingestion and processing throughput to react to potential slowdowns early. Detect when source accounts are approaching EventBridge quotas. Consider using AWS Service Quotas to request quota increases automatically through APIs.
  • Cross-Region considerations: Design your architecture to support efficient cross-Region event collection while respecting data sovereignty requirements. Consider implementing regional processing nodes with centralized aggregation for global security analysis.
  • Integration patterns: Modern security solutions must integrate with existing security tools and workflows. Implement standardized output formats that allow seamless integration with SIEM systems, ticketing platforms, and automation frameworks. Consider publishing security findings back to EventBridge buses to enable automated remediation workflows. If you’re a cloud security vendor, then consider integrating with EventBridge as an SaaS partner.

Conclusion

Event-driven architectures present a powerful opportunity for building scalable multi-account activity monitoring solutions. Using services such as AWS CloudTrail and Amazon EventBridge allows teams to overcome the limitations of traditional polling-based approaches while achieving close to real-time delivery.The shift to event-driven security monitoring isn’t just an architectural choice—it’s becoming a necessity for teams operating at enterprise scale. This approach enables security teams to achieve the real-time threat detection capabilities needed in today’s cloud environments while maintaining operational efficiency and cost control.

Unlock self-serve streaming SQL with Amazon Managed Service for Apache Flink

Post Syndicated from Sofie Zilberman original https://aws.amazon.com/blogs/big-data/unlock-self-serve-streaming-sql-with-amazon-managed-service-for-apache-flink/

This post is co-written with Gal Krispel from Riskified.

Riskified is an ecommerce fraud prevention and risk management platform that helps businesses optimize online transactions by distinguishing legitimate customers from fraudulent ones.

Using artificial intelligence and machine learning (AI/ML), Riskified analyzes real-time transaction data to detect and prevent fraud while maximizing transaction approval rates. The platform provides a chargeback guarantee, protecting merchants from losses due to fraudulent transactions. Riskified’s solutions include account protection, policy abuse prevention, and chargeback management software, making it a comprehensive tool for reducing risk and enhancing customer experience. Businesses across various industries, including retail, travel, and digital goods, use Riskified to increase revenue while minimizing fraud-related losses. Riskified’s core business of real-time fraud prevention makes low-latency streaming technologies a fundamental part of its solution.

Businesses often can’t afford to wait for batch processing to make critical decisions. With real-time data streaming technologies like Apache Flink, Apache Spark, and Apache Kafka Streams, organizations can react instantly to emerging trends, detect anomalies, and enhance customer experiences. These technologies are powerful processing engines that perform analytical operations at scale. However, unlocking the full potential of streaming data often requires complex engineering efforts, limiting accessibility for analysts and business users.

Streaming pipelines are in high demand from Riskified’s Engineering department. Therefore, a user-friendly interface for creating streaming pipelines is a critical feature to increase analytical precision for detecting fraudulent transactions.

In this post, we present Riskified’s journey toward enabling self-service streaming SQL pipelines. We walk through the motivations behind the shift from Confluent ksqlDB to Apache Flink, the architecture Riskified built using Amazon Managed Service for Apache Flink, the technical challenges they faced, and the solutions that helped them make streaming accessible, scalable, and production-ready.

Using SQL to create streaming pipelines

Customers have a range of open source data processing technologies to choose from, such as Flink, Spark, ksqlDB, and RisingWave. Each platform offers a streaming API for data processing. SQL streaming jobs offer a powerful and intuitive way to process real-time data with minimal complexity. These pipelines use SQL, a widely known and declarative language, to perform real-time transformations, filtering, aggregations, and joins in continuous data streams.

To illustrate the power of streaming SQL in ecommerce fraud prevention, consider the concept of velocity checks, which are a critical fraud detection pattern. Velocity checks are a type of security measure used to detect unusual or rapid activity by monitoring the frequency and volume of specific actions within a given timeframe. These checks help identify potential fraud or abuse by analyzing repeated behaviors that deviate from normal user patterns. Common examples include detecting multiple transactions from the same IP address in a short time span, monitoring bursts of account creation attempts, or tracking the repeated use of a single payment method across different accounts.

Use case: Riskified’s velocity checks

Riskified implemented a real-time velocity check using streaming SQL to monitor purchasing behavior based on user identifier.

In this setup, transaction data is continuously streamed through a Kafka topic. Each message contains user agent information originating from the browser, along with the raw transaction data. Streaming SQL queries are used to aggregate the number of transactions originating from a single user identifier within short time windows.

For example, if the number of transactions from a given user identifier exceeds a certain threshold within a 10-second period, this might signal fraudulent activity. When that threshold is breached, the system can automatically flag or block the transactions before they are completed. The following figure and accompanying code provide a simplified example of the streaming SQL query used to detect this behavior.

Velocity check SQL flow

SELECT userIdentifier,TUMBLE_START(createdAt, INTERVAL '10' SECONDS) 
  AS windowStart,TUMBLE_END(createdAt, INTERVAL '10' SECONDS) 
  AS windowEnd, COUNT(*) AS paymentAttempts
FROM transactions
  WINDOW TUMBLING (SIZE 10 SECONDS)
GROUP BY userIdentifier;

Although defining SQL queries over static datasets might appear straightforward, developing and maintaining robust streaming applications introduces unique challenges. Traditional SQL operates on bounded datasets, which are finite collections of data stored in tables. In contrast, streaming SQL is designed to process continuous, unbounded data streams resembling the SQL syntax.

To address these challenges at scale and make streaming job creation accessible to engineering teams, Riskified implemented a self-serve solution based on Confluent ksqlDB, using its SQL interface and built-in Kafka integration. Engineers could define and deploy streaming pipelines using SQL, chaining ksqlDB streams from source to sink. The system supported both stateless and stateful processing directly on Kafka topics, with Avro schemas used to define the structure of streaming data.

Although ksqlDB provided a fast and approachable starting point, it eventually revealed several limitations. These included challenges with schema evolution, difficulties in managing compute resources, and the absence of an abstraction for managing pipelines as a cohesive unit. As a result, Riskified began exploring alternative technologies that could better support its expanding streaming use cases. The following sections outline these challenges in more detail.

Evolving the stream processing architecture

In evaluating alternatives, Riskified focused on technologies that could address the specific demands of fraud detection while preserving the simplicity that made the original approach appealing. The team encountered the following challenges in maintaining the previous solution:

  • Schemas are managed in Confluent Schema Registry, and the message format is Avro with FULL compatibility mode enforced. Schemas are constantly evolving according to business requirements. They are version controlled using Git with a strict continuous integration and continuous delivery (CI/CD) pipeline. As schemas grew more complex, ksqlDB’s approach to schema evolution didn’t automatically incorporate newly added fields. This behavior required dropping streams and recreating them to add new fields instead of just restarting the application to incorporate new fields. This approach caused inconsistencies with offset management due to the stream’s tear-down.
  • ksqlDB enforces a TopicNameStrategy schema registration strategy, which provides 1:1 schema-to-topic coupling. This means the exact schema definition has to be registered multiple times, one time for each topic it is used for. Riskified’s schema registry deployment uses RecordNameStrategy for schema registration. It’s an efficient schema registry strategy that allows for sharing schemas across multiple topics, storing fewer schemas, and reducing registry management overhead. Having mixed strategies in the schema registry caused errors with Kafka consumer clients attempting to decode messages, because the client implementation expected a RecordNameStrategy according to Riskified’s standard.
  • ksqlDB internally registers schema definitions in specific ways where fields are interpreted as nullable, and Avro Enum types are converted to Strings. This behavior caused deserialization errors when attempting to migrate native Kafka consumer applications to use the ksqlDB output topic. Riskified’s code base uses the Scala programming language, where optional fields in the schema are interpreted as Option. Transforming every field as optional in the schema definition required heavy refactoring, treating all Enum fields as Strings, and handling the Option data type for every field that requires safe handling. This cascading effect made the migration process more involved, requiring additional time and resources to achieve a smooth transition.

Managing resource contention in ksqlDB streaming workloads

ksqlDB queries are compiled into a Kafka Streams topology. The query definition defines the topology’s behavior.

Streaming query resources are shared rather than isolated. This approach typically leads to the overallocation of cluster resources. Its tasks are distributed across nodes in a ksqlDB cluster. This architecture means processing tasks with no resource isolation, and a specific task can impact other tasks running on the same node.

Resource contention between tasks on the same node is common in a production-intensive environment when using a cluster architecture solution. Operation teams often fine-tune cluster configurations to maintain acceptable performance, frequently mitigating issues by over-provisioning cluster nodes.

Challenges with ksqlDB pipelines

A ksqlDB pipeline is a chain of individual streams and lacks flow-level abstraction. Imagine a complex pipeline where a consumer publishes to multiple topics. In ksqlDB, each topic (both input and output) must be managed as a separate stream abstraction. However, there is no high-level abstraction to represent an entire pipeline that chains these streams together. As a result, engineering teams must manually assemble individual streams into a cohesive data flow, without built-in support for managing them as a single, complete pipeline.

This architectural approach particularly impacts operational tasks. Troubleshooting requires examining each stream separately, making it difficult to monitor and maintain pipelines that contain dozens of interconnected streams. When issues occur, the health of each stream needs to be checked individually, with no logical data flow component to help understand the relationships between streams or their role in the overall pipeline. The absence of a unified view of the data flow significantly increased operational complexity.

Flink as an alternative

Riskified began exploring alternatives for its streaming platform. The requirements were clear: a strong processing technology that combines a rich low-level API and a streaming SQL engine, backed by a strong open source community, proven to perform in the most demanding production environments.

Unlike the previous solution, which supported only Kafka-to-Kafka integration, Flink offers an array of connectors for various databases and Streaming platforms. It was quickly recognized that Flink had the potential to handle complex streaming use cases.

Flink offers multiple deployment options, including standalone clusters, native Kubernetes deployments using operators, and Hadoop YARN clusters. For enterprises seeking a fully managed option, cloud providers like AWS offer managed Flink services that help alleviate operational overhead, such as Managed Service for Apache Flink.

Benefits of using Managed Service for Apache Flink

Riskified decided to implement a solution using Managed Service for Apache Flink. This choice offered several key advantages:

  • It offers a quick and reliable way to run Flink applications and reduces the operational overhead of independently managing the infrastructure.
  • Managed Service for Apache Flink provides true job isolation by running each streaming application in its dedicated cluster. This means you can manage resources separately for each job and reduce the risk of heavy streaming jobs inflicting resource starvation for other running jobs.
  • It offers built-in monitoring using Amazon CloudWatch metrics, application state backup with managed snapshots, and automatic scaling.
  • AWS offers comprehensive documentation and practical examples to help accelerate the implementation process.

With these features, Riskified could focus on what truly matters—getting closer to the business goal and starting to write applications.

Using Flink’s streaming SQL engine

Developers can use Flink to build complex and scalable streaming applications, but Riskified saw it as more than just a tool for experts. They wanted to democratize the power of Flink into a tool for the entire organization, to solve complex business challenges involving real-time analytics requirements without needing a dedicated data professional.

To replace their previous solution, they envisioned maintaining a “build once, deploy many” application, which encapsulates the complexity of the Flink programming and allows the users to focus on the SQL processing logic.

Kafka was maintained as the input and output technology for the initial migration use case, which is similar to the ksqlDB setup. They designed a single, flexible Flink application where end-users can modify the input topics, SQL processing logic, and output destinations through runtime properties. Although ksqlDB primarily focuses on Kafka integration, Flink’s extensive connector ecosystem enables it to expand to diverse data sources and destinations in future phases.

Managed Service for Apache Flink provides a flexible way to configure streaming applications without modifying their code. By using runtime parameters, you can change the application’s behavior without modifying its source code.

Using Managed Service for Apache Flink for this approach includes the following steps:

  1. Apply parameters for the input/output Kafka topic, a SQL query, and the input/output schema ID (assuming you’re using Confluent Schema Registry).
  2. Use AvroSchemaConverter to convert an Avro schema into a Flink table.
  3. Apply the SQL processing logic and save the output as a view.
  4. Sink the view results into Kafka.

The following diagram illustrates this workflow.
Streaming SQL system diagram

Performing Flink SQL query compilation without a Flink runtime environment

Providing end-users with significant control to define their pipelines makes it critical to verify the SQL query defined by the user before deployment. This validation prevents failed or hanging jobs that could consume unnecessary resources and incur unnecessary costs.

A key challenge was validating Flink SQL queries without deploying the full Flink runtime. After investigating Flink’s SQL implementation, Riskified discovered its dependency on Apache Calcite – a dynamic data management framework that handles SQL parsing, optimization, and query planning independently of data storage. This insight enabled using Calcite directly for query validation before job deployment.

You must know how the data is structured to validate a Flink SQL query on a streaming source like a Kafka topic. Otherwise, unexpected errors might occur when attempting to query the streaming source. Although an expected schema is used with relational databases, it’s not enforced for streaming sources.

Schemas guarantee a deterministic structure for the data stored in a Kafka topic when using a schema registry. A schema can be materialized into a Calcite table that defines how data is structured in the Kafka topic. It allows inferring table structures directly from schemas (in this case, Avro format was used), enabling thorough field-level validation, including type checking and field existence, all before job deployment. This table can later be used to validate the SQL query.

The following code is an example of supporting basic field types validation using Calcite’s AbstractTable:

public class FlinkValidator {
    public static void validateSQL(String sqlQuery, Schema avroSchema) throws Exception {
        SqlParser.Config sqlConfig = SqlParser.config()
                .withCaseSensitive(true);
        SqlParser sqlParser = SqlParser.create(sqlQuery, sqlConfig);
        SqlNode parsedQuery = sqlParser.parseQuery();
        RelDataTypeFactory typeFactory = new SqlTypeFactoryImpl(RelDataTypeFactory.DEFAULT);
        CalciteSchema rootSchema = createSchemaWithAvro(avroSchema);
        SqlValidator validator = SqlValidatorUtil.newValidator(
                Frameworks.newConfigBuilder().build().getOperatorTable(),
                rootSchema.createCatalogReader(Collections.emptyList(), typeFactory),
                typeFactory,
                SqlValidator.Config.DEFAULT
        );
        validator.validate(parsedQuery);
    }
    private static CalciteSchema createSchemaWithAvro(Schema avroSchema) {
        CalciteSchema rootSchema = CalciteSchema.createRootSchema(true);
        rootSchema.add("TABLE", new SimpleAvroTable(avroSchema));
        return rootSchema;
    }
    private static class SimpleAvroTable extends org.apache.calcite.schema.impl.AbstractTable {
        private final Schema avroSchema;
        public SimpleAvroTable(Schema avroSchema) {
            this.avroSchema = avroSchema;
        }
        @Override
        public RelDataType getRowType(RelDataTypeFactory typeFactory) {
            RelDataTypeFactory.Builder builder = typeFactory.builder();
            for (Schema.Field field : avroSchema.getFields()) {
                builder.add(field.name(), convertAvroType(field.schema(), typeFactory));
            }
            return builder.build();
        }
        private RelDataType convertAvroType(Schema schema, RelDataTypeFactory typeFactory) {
            switch (schema.getType()) {
                case STRING:
                    return typeFactory.createSqlType(SqlTypeName.VARCHAR);
                case INT:
                    return typeFactory.createSqlType(SqlTypeName.INTEGER);
                default:
                    return typeFactory.createSqlType(SqlTypeName.ANY);
            }
        }
    }
}

You can integrate this validation approach as an intermediate step before creating the application. You can create a streaming job programmatically with the AWS SDK, AWS Command Line Interface (AWS CLI), or Terraform. The validation occurs before submitting the streaming job.

Flink SQL and Confluent Avro data type mapping limitation

Flink provides several APIs designed for different levels of abstraction and user expertise:

  • Flink SQL sits at the highest level, allowing users to express data transformations using familiar SQL syntax, which is ideal for analysts and teams comfortable with relational concepts.
  • The Table API offers a similar approach but is embedded in Java or Python, enabling type-safe and more programmatic expressions.
  • For more control, the DataStream API exposes low-level constructs to manage event time, stateful operations, and complex event processing.
  • At the most granular level, the ProcessFunction API provides full access to Flink’s runtime features. It’s suitable for advanced use cases that demand detailed control over state and processing behavior.

Riskified initially used the Table API to define streaming transformations. However, when deploying their first Flink job to a staging environment, they encountered serialization errors related to the avro-confluent library and Table API. Riskified’s schemas rely heavily on Avro Enum types, which the avro-confluent integration doesn’t fully support. As a result, Enum fields were converted to Strings, leading to mismatches during serialization and errors when attempting to sink processed data back to Kafka using Flink’s Table API.

Riskified developed an alternative approach to overcome the Enum serialization limitations while maintaining schema requirements. They discovered that Flink’s DataStream API could correctly handle Confluent’s Avro records serialization with Enum fields, unlike the Table API. They implemented a hybrid solution combining both APIs because the pipeline only required SQL processing on the source Kafka topic. It can sink to the output without any additional processing. The Table API is used for data processing and transformations, only converting to the DataStream API at the final output stage.

Managed Service for Apache Flink supports Flink APIs. It can switch between the Table API and the DataStream API.
A MapFunction can convert the Row type of the Table API into a DataStream of GenericRecord. The MapFunction maps Flink’s Row data type into GenericRecord types by iterating over the Avro schema fields and building the GenericRecord from the Flink Row type, casting the Row fields into the correct data type according to the Avro schema. This conversion is required to overcome the avro-confluent library limitation with Flink SQL.

The following diagram and illustrates this workflow.

Flink Table and DataStream APIs

The following code is an example query:

// SQL Query for filtering
Table queryResults = tableEnv.sqlQuery(
       "SELECT * FROM InputTable");
// 1. Convert query results from Table API to a DataStream<Row> and use DataStream API to sink query results to Kafka topic
DataStream<Row> rowStream = tableEnv.toDataStream(queryResults);
// Fetch the schema string from the schema registry
String schemaString = fetchSchemaString(schemaRegistryURL, schemaSubjectName);
// 2. Convert Row to GenericRecord with explicit TypeInformation, using custom AvroMapper
TypeInformation<GenericRecord> typeInfo = new GenericRecordAvroTypeInfo(avroSchema);
DataStream<GenericRecord> genericRecordStream = rowStream
       .map(new AvroMapper(schemaString))
       .returns(typeInfo); // Explicitly set TypeInformation
// 3. Define Kafka sink using ConfluentRegistryAvroSerializationSchema
KafkaSink<GenericRecord> kafkaSink = KafkaSink.<GenericRecord>builder()
       .setBootstrapServers(bootstrapServers)
       .setRecordSerializer(
               KafkaRecordSerializationSchema.builder()
                       .setTopic(sinkTopic)
                       .setValueSerializationSchema(
                               ConfluentRegistryAvroSerializationSchema.forGeneric(
                                       schemaSubjectName,
                                       avroSchema,
                                       schemaRegistryURL
                               )
                       )
                       .build()
       )
       .build();
// Sink to Kafka
genericRecordStream.sinkTo(kafkaSink);

CI/CD With Managed Service for Apache Flink

With Managed Service for Apache Flink, you can run a job by selecting an Amazon Simple Storage Service (Amazon S3) key containing the application JAR. Riskified’s Flink code base was structured as a multi-module repository to support additional use cases besides supporting self-service SQL. Each Flink job source code in the repository is an independent Java module. The CI pipeline implemented a robust build and deployment process consisting of the following steps:

  1. Build and compile each module.
  2. Run tests.
  3. Package the modules.
  4. Upload the artifact to the artifacts bucket twice: one JAR under <module>-<version>.jar and the second as <module>-latest.jar, resembling a Docker registry like Amazon Elastic Container Registry (Amazon ECR). Managed Service for Apache Flink jobs uses the latest tag artifact in this case. However, a copy of old artifacts is kept for code rollback reasons.

A CD process follows this process:

  1. When merged, it lists all jobs for each module using the AWS CLI for Managed Service for Apache Flink.
  2. The application JAR location is updated for each application, which triggers a deployment.
  3. When the application is in a running state with no errors, the following application will be continued.

To allow safe deployment, this process is done gradually for every environment, starting with the staging environment.

Self-service interface for submitting SQL jobs

Riskified believes an intuitive UI is crucial for system adoption and efficiency. However, developing a dedicated UI for Flink job submission requires a pragmatic approach, because it might not be worth investing in unless there’s already a web interface for internal development operations.

Investing in UI development should align with the organization’s existing tools and workflows. Riskified had an internal web portal for similar operations, which made the addition of Flink job submission capabilities a natural extension of the self-service infrastructure.

An AWS SDK was installed on the web server to allow interaction with AWS components. The client receives user input from the UI and translates it into runtime properties to adjust the behavior of the Flink application. The web server then uses the CreateApplication API action to submit the job to Managed Service for Apache Flink.

Although an intuitive UI significantly enhances system adoption, it’s not the only path to accessibility. Alternatively, a well-designed CLI tool or REST API endpoint can provide the same self-service capabilities.

The following diagram illustrates this workflow.

Flow sequence diagram

Production experience: Flink’s implementation upsides

The transition to Flink and Managed Service for Apache Flink proved efficient in numerous aspects:

  • Schema evolution and data handling – Riskified can either periodically fetch updated schemas or restart applications when schemas evolve. They can use existing schemas without self-registration.
  • Resource isolation and management – Managed Service for Apache Flink runs each Flink job as an isolated cluster, reducing resource contention between jobs.
  • Resource allocation and cost-efficiency – Managed Service for Apache Flink enables minimum resource allocation with automatic scaling, proving to be more cost-efficient.
  • Job management and flow visibility – Flink provides a cohesive data flow abstraction through its job and task model. It manages the entire data flow in a single job and distributes the workload evenly over multiple nodes. This unified approach enables better visibility into the entire data pipeline, simplifying monitoring, troubleshooting, and optimizing complex streaming workflows.
  • Built-in recovery mechanism – Managed Service for Apache Flink automatically creates checkpoints and savepoints that enable stateful Flink applications to recover from failures and resume processing without data loss. With this feature, streaming jobs are durable and can recover safely from errors.
  • Comprehensive observability – Managed Service for Apache Flink exposes CloudWatch metrics that monitor Flink application performance and statistics. You can also create alarms based on these metrics. Riskfied decided to use the Cloudwatch Prometheus Exporter to export these metrics to Prometheus and build PrometheusRules to align Flink’s monitoring to the Riskified standard, which uses Prometheus and Grafana for monitoring and alerting.

Next steps

Although the initial focus was Kafka-to-Kafka streaming queries, Flink’s wide range of sink connectors offers the possibility of pluggable multi-destination pipelines. This versatility is on Riskfied’s roadmap for future enhancements.

Flink’s DataStream API provides capabilities that extend far beyond self-serving streaming SQL capabilities, opening new avenues for more sophisticated fraud detection use cases. Riskified is exploring ways to use DataStream APIs to enhance ecommerce fraud prevention strategies.

Conclusions

In this post, we shared how Riskified successfully transitioned from ksqlDB to Managed Service for Apache Flink for its self-serve streaming SQL engine. This move addressed key challenges like schema evolution, resource isolation, and pipeline management. Managed Service for Apache Flink offers features such as including isolated jobs environments, automatic scaling, and built-in monitoring, which proved more efficient and cost-effective. Although Flink SQL limitations with Kafka required workarounds, using Flink’s DataStream API and user-defined functions resolved these issues. The transition has paved the way for future expansion with multi-targets and advanced fraud detection capabilities, solidifying Flink as a robust and scalable solution for Riskified’s streaming needs.

If Riskified’s journey has sparked your interest in building a self-service streaming SQL platform, here’s how to get started:


About the authors

Gal Krispel is a Data Platform Engineer at Riskified, specializing in streaming technologies such as Apache Kafka and Apache Flink. He focuses on building scalable, real-time data pipelines that power Riskified’s core products. Gal is particularly interested in making complex data architectures accessible and efficient across the organization. His work spans real-time analytics, event-driven design, and the seamless integration of stream processing into large-scale production systems.

Sofia ZilbermanSofia Zilberman works as a Senior Streaming Solutions Architect at AWS, helping customers design and optimize real-time data pipelines using open-source technologies like Apache Flink, Kafka, and Apache Iceberg. With experience in both streaming and batch data processing, she focuses on making data workflows efficient, observable, and high-performing.

Lorenzo NicoraLorenzo Nicora works as Senior Streaming Solution Architect at AWS, helping customers across EMEA. He has been building cloud-centered, data-intensive systems for over 25 years, working across industries both through consultancies and product companies. He has used open-source technologies extensively and contributed to several projects, including Apache Flink, and is the maintainer of the Flink Prometheus connector.

How to use the new AWS Secrets Manager Cost Allocation Tags feature

Post Syndicated from Jirka Fajfr original https://aws.amazon.com/blogs/security/how-to-use-the-new-aws-secrets-manager-cost-allocation-tags-feature/

AWS Secrets Manager is a service that you can use to manage, retrieve, and rotate database credentials, application credentials, API keys, and other secrets throughout their lifecycles. You can use Secrets Manager to replace hard-coded credentials in application source code with a runtime call to the Secrets Manager service to retrieve credentials dynamically when you need them. Storing the credentials in Secrets Manager helps avoid unintended access by anyone who inspects your application’s source code, configuration, or components.

Until today, your AWS bill would reflect the total cost of Secrets Manager in any given account, and you had no option to break out the cost per secret to a given cost center or organization.

In this post, we introduce a new feature—Secrets Manager Costs Allocation Tags—and walk through how you can use them for improved visibility into your Secrets Manager costs. Before getting into the details of this new feature, we want to give you primer about cost allocation tags.

A tag is a key-value pair that you assign to an AWS resource. In AWS Cost Explorer, you can activate tags as cost allocation tags. With tags activated, you can categorize and track costs by cost allocation tags. For example, you can create a tag named Group with value Engineering and assign it to resources owned by the engineering team of your company. After activating the Group tag as a cost allocation tag, you can track charges with this tag, filter or group by tags in Cost Explorer, and add tags to reports such as cost and usage reports for further analysis and visualization.

Cost allocation in AWS is a four step process:

  1. Create the required cost allocation tags
  2. Attach cost allocation tags to your resources
  3. Activate your tags in the Cost Allocation Tags section of the AWS Billing console
  4. Filter the tags, group by tags in Cost Explorer, and create cost categories

After you create and attach the tags to resources, they appear in the AWS Billing console Cost Allocation Tags section under User-defined cost allocation tags within 24 hours. You must activate these tags for AWS to start tracking them for your resources and for the tags to show up in Cost Explorer. When the tags appear under Tags in the Filter or Group by fields in Cost Explorer, you can start filtering or grouping by tag to view usage and charges by tag.

AWS Secrets Manager now supports cost allocation tags

Secrets Manager now supports cost allocation tags, giving you more granular control and visibility into your Secrets Manager costs. You can use this feature to categorize and track your Secrets Manager usage charges at a more detailed level, helping you to better understand and manage your AWS spending and assign costs per secret back to cost centers or organizations.

Solution overview: Enhanced cost visibility and management

With this new capability, you can:

  • Break down Secrets Manager costs by department, project, environment, or other dimensions important to your organization
  • View itemized Secrets Manager usage in Cost Explorer as well as in cost and usage reports
  • Improve cost allocation and chargeback processes across your business units and organizations

Prerequisites

To walk through the examples in this post, you need to have the following:

  1. An AWS account
  2. Access to the AWS Management Console or the AWS Command Line Interface (AWS CLI) version 2
  3. An existing tagging and cost allocation strategy
  4. Existing secrets inside Secrets Manager

Create user-defined tags for cost allocation purposes using the console

In this example, we assume that you want to manage the cost of your secrets by different cost centers in your organization. Here, we create a tag with CostCenter as a key and the value equal to the cost center codes of the teams that are using secrets.

You’ll walk through two examples, the first one with a cost center for Engineering and a second one with a cost center for Finance. You will reuse those examples throughout this post.

In this example, start by creating and assigning a tag called cost allocation center with the key name: CostCenter and assign a cost center value of 7263 for the engineering department to an existing or new secret.

To assign a user-defined tag to a new or existing secret:

  1. In the Secrets Manager console, choose Secrets from the navigation pane.
  2. In the list of available secrets, select the secret to edit the tags or choose Store A New Secret to create a new secret.
  3. When the secret is displayed, select the Tags option and choose Edit Tags to add new or edit existing tags.
    Figure 1: Assign a user-defined tag to an existing secret

    Figure 1: Assign a user-defined tag to an existing secret

  4. Repeat the process, but assign the cost center value of 7263 for the engineering department and 1121 for the finance department to a second secret.

Create user-defined tags for cost allocation purposes using the AWS CLI

Optionally, you can use the AWS CLI to create the same tags as in the preceding example.

To use the AWS CLI to create tags:

  1. Use the following AWS CLI command to create the first tag:
    aws secretsmanager tag-resource \
        --secret-id prod/mastersecret \
        --tags Key=Role,Value=admin
    

  2. Create the second tag using the following AWS CLI command:
    aws secretsmanager tag-resource \
        --secret-id prod/mastersecret \
        --tags Key=CostCenter,Value=7263
    

    This command produces no output in case of a successful execution.

  3. Use the second AWS CLI command with a value of 1121 for the second secret.

Turn tags into cost allocation tags using the AWS Billing and Cost Management console

The next step is to activate the user-defined tags within the AWS Billing and Cost Management console so they can be used.

To activate cost allocation tags:

  1. Go to the AWS Billing and Cost Management console and choose Cost allocation tags in the navigation pane.
  2. Select the option for user-defined cost allocation tags.
  3. Select the tag keys that you want to activate.
  4. Choose Activate.

Note: After you create and apply user-defined tags to your resources, it can take up to 24 hours for the tag keys to appear on your cost allocation tags page for activation. It can then take up to 24 hours for tag keys to activate.

Figure 2: Activate cost allocation tags

Figure 2: Activate cost allocation tags

Turn tags into cost allocation tags using the AWS CLI

You can also activate user-defined tags within the AWS Billing and Cost Management Console using the AWS CLI.

To activate tags using the AWS CLI:

  1. For activation of the first user-defined tag use the following AWS CLI command:
    aws ce update-cost-allocation-tags-status \
        --cost-allocation-tags-status TagKey=Role,Status=Active
    

  2. To activate the second user-defined tag use the following AWS CLI command:
    aws ce update-cost-allocation-tags-status \
        --cost-allocation-tags-status TagKey=CostCenter,Status=Active
    

View the results in Cost Explorer

The last step is to view the results for secrets in Cost Explorer. When the tag CostCenter shows up under Tags in the Filter or Group By fields in Cost Explorer, you can start filtering or grouping by the tag to view usage and charges by tag.

When applying the tag filter for Secrets Manager, Cost Explorer displays charges only for resources tagged with the selected tag values. And when grouped by a particular tag, the charges are grouped by each value of the selected tag.

To view results:

As an example, use the following parameters to view results.

  1. In the Cost Center console, select the right arrow (>) icon to open the report parameters options to the right of the billing dashboard.
  2. On the Report parameters window:
    1. For Date Range, enter the desired time range.
    2. Under Group by, for Dimension, select Tag, and for Tag select Cost Center.
    3. For Filters, Service, select Secrets Manager.
      Figure 3: Configure report parameters

      Figure 3: Configure report parameters

You can use the resulting report to clearly identify the cost and usage of the two secrets, broken down into the two cost centers: engineering 7263 and finance 1121. Now, you can cross-charge secrets to the corresponding cost centers in your organization and provide a report similar to Figure 4.

Figure 4: Cost and usage report

Figure 4: Cost and usage report

Conclusion

In this post, we introduced the AWS Secrets Manager Cost Allocation Tags feature and showed you how to use AWS Cost Explorer Costs and Usage Reports to gain secrets usage insights. You can now allocate the cost for every secret to one or more cost centers and charge them accordingly. See the AWS Secrets Manager Cost Allocation Tag documentation to learn more about how you can use Secrets Manager Cost Allocation Tags in your accounts.

Go to the AWS Secrets Manager console to get started. For more information, see AWS Secrets Manager.

To learn more about how to build an effective tagging strategy for cost allocation and financial management, see the Tags for cost allocation and financial management whitepaper.

If you have feedback about this post, submit comments in the Comments section below. If you have questions about this post, contact AWS Support.

Jirka Fajfr

Jirka Fajfr

Jirka is a Software Engineer at AWS working in the Cryptography organization focusing on AWS Secrets Manager. He’s passionate about helping customers secure their applications and manage sensitive information and contributes to building and improving secure, scalable solutions for secrets management in the cloud. Before joining AWS, he applied neural networks to predict electricity load demand and price, bringing together data science and utility infrastructure.

Marc Luescher

Marc Luescher

Marc is a Senior Solutions Architect helping enterprise customers be successful, focusing strongly on threat detection, incident response, and data protection. His background is in networking, security, and observability. Previously, he worked in technical architecture and security hands-on positions within the healthcare sector as an AWS customer. Outside of work, Marc enjoys his two dogs, three cats, twenty chickens, and his huge yard.

Control instance placement using Asset Level Capacity Management for AWS Outposts

Post Syndicated from Brianna Rosentrater original https://aws.amazon.com/blogs/compute/control-instance-placement-using-asset-level-capacity-management-for-aws-outposts/

AWS Outposts supports self-service capacity management at the entire Outpost level, or at the individual asset level, making it easy for you to view and manage compute capacity on your Outposts. This feature supports both Outposts rack (such as the recently announced second-generation Outposts rack) and Outposts server. A default capacity configuration for each new Outpost is determined during the ordering process. This default configuration can subsequently be modified to create a range of Amazon Elastic Compute Cloud (Amazon EC2) instance sizes and quantities to meet your changing business needs. For more information on performing Outposts level multi-asset reconfigurations, go to Dynamically reconfigure your AWS Outposts capacity using Capacity Tasks.

The release of Asset Level Capacity Management allows you to control the configuration of specific assets within your Outpost, which can be useful when planning strategies for EC2 Auto Scaling groups and host-level high availability. An Outpost asset can be a single server within an Outposts rack, or an Outposts server. This post focuses on how to use Asset Level Capacity Management to perform single-host reconfigurations, and how this can be used with Amazon EC2 placement groups to control instance placement on your Outpost.

Overview

When you place an Outposts order, you determine the capacity configuration of each Outpost based on the anticipated workload requirements. You can scale your Outposts up or out as needed during your commitment term. For further details on Outpost capacity planning including best practices, refer to the Capacity Planning – AWS Outposts High Availability Design and Architecture whitepaper. We recommend planning spare capacity for N+M host availability per instance family when making modifications to your Outpost capacity configuration for workloads that need to be highly available. To calculate, take the number of assets (N) you need to run all your workloads, and then add (M) additional assets to meet your requirements for server availability during failure and maintenance events.

You also need to plan for instance level high availability when deciding to reconfigure particular assets. For example, say you have two C5 assets, and each one is configured homogeneously to provide C5.2xlarge instances. If you have an Auto Scaling group that specifies C5.2xlarge in its launch template, and you perform an asset level reconfiguration of one of your C5 assets so that it only offers C5.4xlarge instances, then your Auto Scaling group can only launch instances on the one C5 host configured to provide C5.2xlarge instances. If that host fails, then the Auto Scaling group is unable to launch new C5.2xlarge instances on the other host unless the Auto Scaling group launch template is modified. Understanding failure scenario behavior and how much capacity you want to reserve for high availability is key to capacity management and disaster recovery planning. For highly available workloads, we recommend spreading your instances across as many assets as possible.

Understanding EC2 placement groups on AWS Outposts

Outposts rack supports EC2 placement groups, and two placement group options are available only on Outposts: rack level spread, and host level spread. This allows you to spread out instances across underlying hardware on an Outpost at your site. To use a rack level spread placement group, you must have two or more physical Outpost racks. Each spread strategy can be used to create resilient Outposts architectures that can withstand a rack or host failure depending on the respective strategy used.

Rack level spread

Figure 1: Outposts rack showing a rack level spread EC2 placement group

Figure 1: Outposts rack showing a rack level spread EC2 placement group

Using a multi-rack Outpost, you can spread your EC2 instances across multiple racks with a rack level spread EC2 placement group. When used with Auto Scaling groups, this allows you to withstand an individual rack or multi-asset failure. When your Auto Scaling group detects you’ve lost instances on one of your racks, it automatically relaunches the instances using the assets on your other racks if you have available capacity. To use this strategy to increase your workload resiliency, each rack would need to have assets that can support the instance type (C5 is used in the preceding figure) and size used in your Auto Scaling group launch template. The expanded functionality that asset level capacity management brings to capacity tasks allows you to configure your Outpost so that each rack has at least one asset that can support the instances used in your Auto Scaling groups. Configure your assets on each rack to meet your resiliency goals for host failure tolerance as well. This configuration can be done in an on-demand, self-service fashion to meet the needs of your evolving workloads if instance requirements change over time.

Host level spread

Figure 2: Outposts rack showing a host level spread EC2 placement group

Although rack level spread EC2 placement groups need a multi-rack Outpost, host level spread EC2 placement groups can be used within a single rack Outpost to provide resiliency for your workloads at the asset level. When used with Auto Scaling groups, this allows you to withstand an individual asset or multi-asset failure depending on your Outpost configuration. When your Auto Scaling group detects you’ve lost instances on one of your assets due to a hardware failure, it automatically relaunches the instances using your other assets on your Outpost if you have available capacity. To use this strategy to increase your workload resiliency, you would need to have at least two assets within your Outposts rack that can support the instance type (R5 and M5 are used in the preceding figure) and size used in your Auto Scaling group launch template. Outposts also supports using attribute-based instance type selection if multiple instance types meet your workload needs based on some minimum resource requirements. With the expanded functionality that asset level capacity management brings to capacity tasks, you can configure your Outposts rack so that each asset type can support the instance size used in your Auto Scaling groups. This configuration can be done in an on-demand, self-service fashion to meet the needs of your evolving workloads if instance requirements change over time.

Using asset level capacity tasks

Asset Level Capacity Management allows you to target a specific Outpost asset to change its capacity configuration directly, allowing granular control over instance capacity pool configurations. Outpost assets are referred to by a unique ten-digit Asset ID. The first step in this process is identifying a suitable asset on which to perform the capacity task. To do this, you can use the rack view within the Outposts console page to view each asset, its current capacity configuration, and its current usage. Choosing an asset with fewer running instances may increase the chances of the capacity task being successful without needing instances to be stopped.

In the following example, the rack view has been filtered by the R5 family resulting in the two R5 assets being displayed. The Show instance details option has also been chosen to show the instance IDs of the running instances on our Outposts rack.

Figure 3: Rack view of the Outposts console

When you have identified the asset to target for the capacity task, you can either choose the Modify option in the top right of the asset itself or go to Capacity Tasks from the console menu and choose the asset ID directly from the dropdown menu.

Figure 4: Capacity tasks console experience

From here, you have the option to use the capacity configuration builder to interactively modify your Outposts capacity layout, or you can upload a capacity configuration plan JSON document with the necessary configuration. When building the capacity task, you have two options to choose from when handling instances that are blocking the task from executing. The default option is set to fail the capacity task if this occurs. However, this can be set to wait for the instances to be stopped so that the task can continue. If this option is chosen, then the asset is placed into an isolated state until either the capacity task completes or is cancelled, thus preventing any further instances launches on the impacted asset.

If there are instances on the asset that can’t be stopped to complete the capacity task, then they can be chosen from the Instances to keep as-is section. Only the instances running on the impacted asset are listed. If a capacity task can’t be completed while leaving the chosen instances running, the capacity task fails.

In the following example, the capacity configuration requested for the asset results in the removal of one r5.4xlarge and two r5.2xlarge instances, which creates sufficient space for the creation of 12 r5.large instances. This asset also has three instances running on it which have all been chosen to keep as-is during the execution of the task.

Figure 5: Capacity task example showing r5 asset level capacity management

You can also execute capacity tasks programmatically If you prefer through CLI or API calls. For example, using the start-capacity-task CLI to submit the same configuration would look as follows:

aws outposts start-capacity-task \
--outpost-id op-07f6f537e0607d3f1 \
--asset-id 1702928095\
--instances-to-exclude '{
    "Instances": ["i- 03f53189ffedcc72c", "i-044383b9051299b50", "i-0dfd88574237a68a4"],
    "AccountIds": ["450360193046", "450360193046", "450360193046"],
    "Services": ["EC2", "EC2", "EC2"]
}' \
--task-action-on-blocking-instances FAIL_TASK \
--instance-pools '[
    {
        "InstanceType": "r5.large",
        "Count": 12
    },
    {
        "InstanceType": "r5.xlarge",
        "Count": 6
    },
    {
        "InstanceType": "r5.2xlarge",
        "Count": 4
    },
    {
        "InstanceType": "r5.4xlarge",
        "Count": 1
    }
]'

After defining the capacity task, you are presented with an overview of the requested changes before submitting the task for execution. When it’s submitted, the task first enters a Requested status while the configuration is evaluated, before either being moved to In Progress if the task is valid or Failed if it’s invalid or blocked by running instances.

When the capacity task has successfully completed and the capacity pools for the asset are updated, you can validate this by returning to the rack view within the Outpost console, or by using the CLI/API. The following is an example using the list-assets CLI command:

aws outposts list-assets --outpost-identifier op-07f6f537e0607d3f1 --query "Assets[?AssetId=='1702928095']"

[
    {
        "AssetId": " 1702928095",
        "RackId": "1702928115",
        "AssetType": "COMPUTE",
        "ComputeAttributes": {
            "State": "ACTIVE",
            "InstanceFamilies": [
                "R5"
            ],
            "InstanceTypeCapacities": [
                {
                    "InstanceType": "r5.2xlarge",
                    "Count": 4
                },
                {
                    "InstanceType": "r5.4xlarge",
                    "Count": 1
                },
                {
                    "InstanceType": "r5.xlarge",
                    "Count": 6
                },
                {
                    "InstanceType": "r5.large",
                    "Count": 12
                }
            ],
            "MaxVcpus": 96
        },
        "AssetLocation": {
            "RackElevation": 27.0
        }
    }
]

Only a single capacity task for an asset can be executing at any given time. If you attempt to create a second capacity task for the same asset while the original is still in a Requesting or In Progress status, then the submission of the task fails. However, you can submit multiple capacity tasks for unique assets within the same Outpost. For example, using the CLI commands, you could execute a single script to change the capacity configuration of all assets within an Outpost through individual asset level capacity tasks.

Considerations

  • Make sure that if you’re specifying instance type in your launch templates, then this instance type is available on multiple assets if your workload needs to be resilient against host failures.
  • Understand which failure scenarios could exist within your environment, and plan for how each one should be handled. Failure planning is essential for maintaining workload uptime in production environments.
  • Capacity tasks can only be executed from the AWS account that owns the Outpost. If Outpost resources are shared to workload accounts through AWS Resource Access Manager (AWS RAM), then these accounts can’t submit capacity tasks.
  • You can manipulate your capacity configuration to control instance placement at launch. If only certain assets support the instance size and type you want to deploy, then your instance must be launched on one of those assets.
  • If executing capacity tasks through CLI commands, make sure that your CLI has been updated to the latest version. We have updated our CLI with this feature release to include commands for capacity tasks, and they fail if running on outdated versions.

Conclusion

This post demonstrates how to use Asset Level Capacity Management with your AWS Outposts, and reviews considerations for maintaining a highly available capacity configuration. For more information on how to manage and monitor your capacity configuration on Outposts, see the Capacity management for AWS Outposts user guide and the Capacity planning section of the Outposts High Availability Design and Architecture Considerations whitepaper. Reach out to your AWS account team, or fill out this form to learn more about Outposts and self-service capacity management.

Transforming Maya’s API management with Amazon API Gateway

Post Syndicated from Arthi Jaganathan original https://aws.amazon.com/blogs/architecture/transforming-mayas-api-management-with-amazon-api-gateway/

In this post, you will learn how Amazon Web Services (AWS) customer, Maya, the Philippines’ leading fintech company and digital bank, built an API management platform to address the growing complexities of managing multiple APIs hosted on Amazon API Gateway. API Gateway is a fully managed service that you can use to create RESTful and WebSocket APIs.

At Maya, different teams build APIs to expose their services to merchants. As the number of applications grew, the overhead of managing APIs increased. An API platform is a set of tools to simplify and standardize across API management concerns such as security, governance, automated deployments, observability, and integrations with multiple AWS accounts. This frees up application teams to focus on features while offloading management concerns to the API platform.

Initial state

Prior to implementing the API platform, Maya used a decentralized API management approach, which created significant challenges. Individual teams operated independent API gateways, resulting in fragmented infrastructure, leading to several issues:

  1. Lack of standardization: Implementing consistent API standards across the organization proved difficult. Each team maintained its own configurations and practices, leading to inconsistencies in security and documentation.
  2. Security posture maintenance: While Maya maintained a strong security posture, doing so across the numerous independent gateways was unsustainable. The overhead of applying consistent security policies and updates across all gateways was becoming increasingly burdensome.
  3. Inconsistent operational visibility: Observability wasn’t inherently limited, rather inconsistently applied. Having multiple, different gateways makes it challenging to enforce a unified observability strategy and correlate data across the entire API ecosystem.

Solution overview

To address these challenges, Maya implemented an API platform, code-named Unified API Gateway. This centralized API management helps enforce consistent standards and improve overall security and observability. The following image illustrates the architecture of the Unified API Gateway and how it integrates with backend services managed and owned by different teams across different AWS accounts.

Enterprise-level AWS architecture diagram showing secured API gateway with multi-account EKS service distribution

API Platform Architecture

Maya chose to host all APIs in a central API account to centralize governance. This is managed by a dedicated shared services cloud team. Amazon CloudFront with AWS WAF and AWS Shield Advanced integration provides perimeter security. An AWS Lambda authorizer provides application security by managing authentication, authorization, and session management. This mitigates against the OWASP top 10 API security risks.

Integration to backend services is configured through API Gateway private integration and AWS Transit Gateway. In a decentralized API deployment strategy where APIs are co-hosted with the service in the respective AWS account, the integration will be simpler because you won’t need cross-account network connectivity. You will still benefit from the API management techniques covered in this post.

Standardization through structured service on-boarding

OpenAPI Specification (OAS) provides a structured definition for APIs. As shown in the following figure, service teams define the API OAS specification. This is embedded in Terraform infrastructure-as-code template for API Gateway. These are checked into source code repository and deployed using GitLab CI.

End-to-end API infrastructure pipeline showing specification integration through GitLab CI to AWS API Gateway

API Gateway Infrastructure-as-code (IaC) Pipeline

A configuration file used as a Terraform template supplies parameters for components of the solution such as backend integration, Lambda authorizer details, and additional headers for auditing. The following OAS snippets demonstrate this.

  1. Integration with the backend service
    x-amazon-apigateway-integration:
       type: "http_proxy"
       connectionId: "${vpc_link_id}"
       httpMethod: "GET"
       uri: "http://$${stageVariables.url}:11620/v1/api/endpoint/{id}" # double $ is not a typo
  2. Adding headers to the request
    x-amazon-apigateway-integration:
       type: "http_proxy"
       connectionId: "${vpc_link_id}"
       httpMethod: "GET"
       uri: "http://$${stageVariables.url}:11620/v1/api/endpoint/{id}"
       requestParameters:
          integration.request.header.x-requesting-service-id: "'api-gw'"
          integration.request.header.x-org-customer-id: "context.authorizer.x-org-customer-id"
  3. Security definition
    securitySchemes:
       lambda-authorizer:
          type: "${authorizer_type}"  
          name: "${authorizer_name}"
          x-amazon-apigateway-authtype: "custom"
          x-amazon-apigateway-authorizer:
             type: "request"
             authorizerUri: "${authorizer_uri}"
             authorizerCredentials: "${authorizer_credentials}"
             identitySource: "${authorizer_identity_source}"

API Gateway supports most of the OpenAPI 2.0 specification and the OpenAPI 3.0 specification but there are a few exceptions. Maya uses a custom plugin in the pipeline to enforce necessary limiting rules to help ensure compatibility with API Gateway.

To simplify deployment for development teams, a custom Terraform module abstracts away the API Gateway implementation details.

module "test-microservice-api-gateway" {
  # module version parameters
  source = "gitlabinstance.com/platform-engineering/apigw-terraform-module/aws"
  version = "1.2.7"

  # module deployed infrastructure parameters
  api_name = var.api_name
  api_mapping_path = var.api_mapping_path
  environment = var.environment
  aws_region = var.aws_region
  account_id = var.account_id
  tags = var.tags
  domain_name = var.domain_name
  stage_name = var.stage_name

  oas_path = var.oas_path # this value is populated via environment variable in Gitlab CI/CD

  providers = {
     aws = aws.apigw
  }
  authorizer_credentials = var.authorizer_credentials
  authorizer_uri = var.authorizer_uri
  vpc_link_id = var.vpc_link_id
  endpoint_url = var.endpoint_url
}

To use multi-level prefixes for custom domains with REST API Gateway, you need the Terraform module for API Gateway v2.

resource "aws_api_gateway_rest_api" "apigw" {
   name = "${var.environment}-${var.api_name}"
   body = templatefile(
     local.oasFilePath,
     {
       vpc_link_id = var.vpc_link_id
       authorizer_uri = var.authorizer_uri
       authorizer_credentials = var.authorizer_credentials
     }
  )
  description = "API Gateway for ${var.api_name}"
  endpoint_configuration {
    types = ["REGIONAL"]
  }

   # Default endpoint needs to be disabled if CloudFront is used as entry point to API Gateway
  disable_execute_api_endpoint = true
  tags = local.tags
  }

  # Use apigatewayv2 in order to have multi level base path ex. /v1/service_name
  resource "aws_apigatewayv2_api_mapping" "this" {
     domain_name = var.domain_name
    api_id = aws_api_gateway_rest_api.apigw.id
    stage = aws_api_gateway_stage.apigw.stage_name
    api_mapping_key = var.api_mapping_path
  }

Simplify API security with automation

Maya’s Unified API Gateway implements a robust, multi-layered security strategy. This approach helps ensure comprehensive protection from external threats and enforces stringent access control policies.

AWS WAF inspects and filters incoming traffic to protect against common web exploits, including OWASP Top 10, such as SQL injection and cross-site scripting attacks. A combination of custom and managed rule sets blocks malicious requests and enforces security policies. AWS Shield Advanced mitigates distributed denial of service (DDoS) attacks and provides 24/7 access to the AWS Shield Response Team (SRT) for expert support during attack events. This helps ensure high availability and resiliency.

API Gateway is integrated with a Lambda authorizer for authentication and authorization. The custom function implements fine-grained access control based on several factors such as identity, roles, and scopes.

To help ensure the consistency and integrity of the API configurations, all updates and deployments are strictly managed through an automated infrastructure-as-code (IaC) pipeline. This helps eliminate the risk of unauthorized or accidental manual changes to the API Gateway and any underlying infrastructure. The IaC pipeline makes sure that all API configurations, including security settings, are deployed through a controlled and auditable process. This prevents configuration drift and makes sure that security policies are consistently applied across all APIs. This also means that all changes are subject to code reviews and version control, adding another layer of security and traceability.

End-to-end visibility with observability

Maya’s Unified API Gateway prioritizes comprehensive observability to proactively monitor API performance, identify potential issues, and provide a seamless user experience. It uses a combination of AWS services and integrated tools to achieve this.

Amazon CloudWatch is used to monitor key performance metrics, including latency, error rates, and requests counts. CloudWatch provides real-time insights into the health and performance of APIs. Alerts on P95 and P99 values help identify and address performance bottlenecks, ensuring responsiveness.

CloudWatch metrics are streamed to Dynatrace, an application performance monitoring (APM) tool. The centralized view helps correlate data from various sources, create custom dashboards, and configure intelligent alerts based on predefined thresholds.

To help ensure complete visibility into API activity, the Lambda authorizer and API Gateway access logs are centralized in Splunk. This provides a comprehensive audit trail to track authentication and authorization events, identify security incidents, and troubleshoot API requests. Headers generated after authentication and authorization are done are passed down to the backend services for proper log correlation.

Future roadmap

The Unified API Gateway will continue to evolve to meet the growing needs of the organization and its partners and customers. The following are the key future enhancements that will further streamline API management, improve the developer experience, and enhance security.

  1. Integration with the internal developer portal: This will provide a self-service UI for bootstrapping new APIs from scratch and further empower developers. This will also simplify documentation and discovery by cataloging all APIs
  2. A modular, extension-based design for enhanced processing: This will introduce custom processing of requests in-line in the gateway account before integrating with backend services. Examples include digital signature verification, message transformation, and custom business logic. A modular design will offer a flexible and scalable way to enhance the functionality of Maya’s APIs without modifying backend services.
  3. Bring your own (BYO) authorizer: Support a wider range of identity providers and authentication protocols, providing greater flexibility and control over API access.
  4. Centralizing schema validation: Moving schema validation to API Gateway to bring consistency and improve the robustness and security of APIs by preventing malformed or malicious requests from being processed.
  5. API monetization: Create new revenue streams by adding support for usage-based billing, tiered pricing, and subscription models.

Conclusion

This post has described the creation of Maya’s robust API management and governance solution, using a combination of native AWS services and powerful partner tools such as Terraform and Dynatrace. We’ve demonstrated how this Unified API Gateway has streamlined and automated core API processes, transforming Maya’s previously fragmented infrastructure into a secure and observable ecosystem. By establishing clear guardrails, the API solution team empowers developers to rapidly deploy APIs while maintaining consistent standards.

With the recent implementation of this solution across more teams, Maya is focused on defining and tracking key performance indicators (KPIs). We anticipate measuring critical metrics such as API onboarding efficiency, developer experience, API latency, and security incident rates. These insights will serve as a foundation for continuous improvement and optimization, ensuring the solution’s sustained effectiveness and evolution.

Visit the API platform guidance on Serverlessland to learn more about building API platforms. See the API Gateway pattern collection to learn more about designing REST API integrations on AWS.


About the Authors

Unify streaming and analytical data with Amazon Data Firehose and Amazon SageMaker Lakehouse

Post Syndicated from Kalyan Janaki original https://aws.amazon.com/blogs/big-data/unify-streaming-and-analytical-data-with-amazon-data-firehose-and-amazon-sagemaker-lakehouse/

Organizations are increasingly required to derive real-time insights from their data while maintaining the ability to perform analytics. This dual requirement presents a significant challenge: how to effectively bridge the gap between streaming data and analytical workloads without creating complex, hard-to-maintain data pipelines. In this post, we demonstrate how to simplify this process using Amazon Data Firehose (Firehose) to deliver streaming data directly to Apache Iceberg tables in Amazon SageMaker Lakehouse, creating a streamlined pipeline that reduces complexity and maintenance overhead.

Streaming data empowers AI and machine learning (ML) models to learn and adapt in real time, which is crucial for applications that require immediate insights or dynamic responses to changing conditions. This creates new opportunities for business agility and innovation. Key use cases include predicting equipment failures based on sensor data, monitoring supply chain processes in real time, and enabling AI applications to respond dynamically to changing conditions. Real-time streaming data helps customers make quick decisions, fundamentally changing how businesses compete in real-time markets.

Amazon Data Firehose seamlessly acquires, transforms, and delivers data streams to lakehouses, data lakes, data warehouses, and analytics services, with automatic scaling and delivery within seconds. For analytical workloads, a lakehouse architecture has emerged as an effective solution, combining the best elements of data lakes and data warehouses. Apache Iceberg, an open table format, enables this transformation by providing transactional guarantees, schema evolution, and efficient metadata handling that were previously only available in traditional data warehouses. SageMaker Lakehouse unifies your data across Amazon Simple Storage Service (Amazon S3) data lakes, Amazon Redshift data warehouses, and other sources, and gives you the flexibility to access your data in-place with Iceberg-compatible tools and engines. By using SageMaker Lakehouse, organizations can harness the power of Iceberg while benefiting from the scalability and flexibility of a cloud-based solution. This integration removes the traditional barriers between data storage and ML processes, so data workers can work directly with Iceberg tables in their preferred tools and notebooks.

In this post, we show you how to create Iceberg tables in Amazon SageMaker Unified Studio and stream data to these tables using Firehose. With this integration, data engineers, analysts, and data scientists can seamlessly collaborate and build end-to-end analytics and ML workflows using SageMaker Unified Studio, removing traditional silos and accelerating the journey from data ingestion to production ML models.

Solution overview

The following diagram illustrates the architecture of how Firehose can deliver real-time data to SageMaker Lakehouse.

This post includes an AWS CloudFormation template to set up supporting resources so Firehose can deliver streaming data to Iceberg tables. You can review and customize it to suit your needs. The template performs the following operations:

Prerequisites

For this walkthrough, you should have the following prerequisites:

After you create the prerequisites, verify you can log in to SageMaker Unified Studio and the project is created successfully. Every project created in SageMaker Unified Studio gets a project location and project IAM role, as highlighted in the following screenshot.

Create an Iceberg table

For this solution, we use Amazon Athena as the engine for our query editor. Complete the following steps to create your Iceberg table:

  1. In SageMaker Unified Studio, on the Build menu, choose Query Editor.

  1. Choose Athena as the engine for query editor and choose the AWS Glue database created for the project.

  1. Use the following SQL statement to create the Iceberg table. Make sure to provide your project AWS Glue database and project Amazon S3 location (can be found on the project overview page):
CREATE TABLE firehose_events (
type struct<device: string, event: string, action: string>,
customer_id string,
event_timestamp timestamp,
region string)
LOCATION '<PROJECT_S3_LOCATION>/iceberg/events'
TBLPROPERTIES (
'table_type'='iceberg',
'write_compression'='zstd'
);

Deploy the supporting resources

The next step is to deploy the required resources into your AWS environment by using a CloudFormation template. Complete the following steps:

  1. Choose Launch Stack.
  2. Choose Next.
  3. Leave the stack name as firehose-lakehouse.
  4. Provide the user name and password that you want to use for accessing the Amazon Kinesis Data Generator application.
  5. For DatabaseName, enter the AWS Glue database name.
  6. For ProjectBucketName, enter the project bucket name (located on the SageMaker Unified Studio project details page).
  7. For TableName, enter the table name created in SageMaker Unified Studio.
  8. Choose Next.

  1. Select I acknowledge that AWS CloudFormation might create IAM resources and choose Next.

  1. Complete the stack.

Create a Firehose stream

Complete the following steps to create a Firehose stream to deliver data to Amazon S3:

  1. On the Firehose console, choose Create Firehose stream.

  1. For Source, choose Direct PUT.
  2. For Destination, choose Apache Iceberg Tables.

This example chooses Direct PUT as the source, but you can apply the same steps for other Firehose sources, such as Amazon Kinesis Data Streams and Amazon Managed Streaming for Apache Kafka (Amazon MSK).

  1. For Firehose stream name, enter firehose-iceberg-events.

  1. Collect the database name and table name from the SageMaker Unified Studio project to use in the next step.

  1. In the Destination settings section, enable Inline parsing for routing information and provide the database name and table name from the previous step.

Make sure you enclose the database and table names in double quotes if you want to deliver data to a single database and table. Amazon Data Firehose can also route records to different tables based on the content of the record. For more information, refer to Route incoming records to different Iceberg tables.

  1. Under Buffer hints, reduce the buffer size to 1 MiB and the buffer interval to 60 seconds. You can fine-tune these settings based on your use case latency needs.

  1. In the Backup settings section, enter the S3 bucket created by the CloudFormation template (s3://firehose-demo-iceberg-<account_id>-<region>) and the error output prefix (error/events-1/).

  1. In the Advanced settings section, enable Amazon CloudWatch error logging to troubleshoot any failures, and in for Existing IAM roles, choose the role that starts with Firehose-Iceberg-Stack-FirehoseIamRole-*, created by the CloudFormation template.
  2. Choose Create Firehose stream.

Generate streaming data

Use the Amazon Kinesis Data Generator to publish data records into your Firehose stream:

  1. On the AWS CloudFormation console, choose Stacks in the navigation pane and open your stack.
  2. Select the nested stack for the generator, and go to the Outputs tab.
  3. Choose the Amazon Kinesis Data Generator URL.

  1. Enter the credentials that you defined when deploying the CloudFormation stack.

  1. Choose the AWS Region where you deployed the CloudFormation stack and choose your Firehose stream.
  2. For the template, replace the default values with the following code:
{
"type": {
"device": "{{random.arrayElement(["mobile", "desktop", "tablet"])}}",
"event": "{{random.arrayElement(["firehose_events_1", "firehose_events_2"])}}",
"action": "update"
},
"customer_id": "{{random.number({ "min": 1, "max": 1500})}}",
"event_timestamp": "{{date.now("YYYY-MM-DDTHH:mm:ss.SSS")}}",
"region": "{{random.arrayElement(["pdx", "nyc"])}}"
}
  1. Before sending data, choose Test template to see an example payload.
  2. Choose Send data.

You can monitor the progress of the data stream.

Query the table in SageMaker Unified Studio

Now that Firehose is delivering data to SageMaker Lakehouse, you can perform analytics on that data in SageMaker Unified Studio using different AWS analytics services.

Clean up

It’s generally a good practice to clean up the resources created as part of this post to avoid additional cost. Complete the following steps:

  1. On the AWS CloudFormation console, choose Stacks in the navigation pane.
  2. Select the stack firehose-lakehouse* and on the Actions menu, choose Delete Stack.
  3. In SageMaker Unified Studio, delete the domain created for this post.

Conclusion

Streaming data allows models to make predictions or decisions based on the latest information, which is crucial for time-sensitive applications. By incorporating real-time data, models can make more accurate predictions and decisions. Streaming data can help organizations avoid the costs associated with storing and processing large datasets, because it focuses on the most relevant information. Amazon Data Firehose makes it straightforward to bring real-time streaming data to data lakes in Iceberg format and unifying it with other data assets in SageMaker Lakehouse, making streaming data accessible by various analytics and AI services in SageMaker Unified Studio to deliver real-time insights. Try out the solution for your own use case, and share your feedback and questions in the comments.


About the Authors

Kalyan Janaki is Senior Big Data & Analytics Specialist with Amazon Web Services. He helps customers architect and build highly scalable, performant, and secure cloud-based solutions on AWS.

Phaneendra Vuliyaragoli is a Product Management Lead for Amazon Data Firehose at AWS. In this role, Phaneendra leads the product and go-to-market strategy for Amazon Data Firehose.

Maria Ho is a Product Marketing Manager for Streaming and Messaging services at AWS. She works with services including Amazon Managed Streaming for Apache Kafka (Amazon MSK), Amazon Managed Service for Apache Flink, Amazon Data Firehose, Amazon Kinesis Data Streams, Amazon MQ, Amazon Simple Queue Service (Amazon SQS), and Amazon Simple Notification Services (Amazon SNS).