Tag Archives: Analytics

Announcing replication support and Intelligent-Tiering for Amazon S3 Tables

Post Syndicated from Sébastien Stormacq original https://aws.amazon.com/blogs/aws/announcing-replication-support-and-intelligent-tiering-for-amazon-s3-tables/

Today, we’re announcing two new capabilities for Amazon S3 Tables: support for the new Intelligent-Tiering storage class that automatically optimizes costs based on access patterns, and replication support to automatically maintain consistent Apache Iceberg table replicas across AWS Regions and accounts without manual sync.

Organizations working with tabular data face two common challenges. First, they need to manually manage storage costs as their datasets grow and access patterns change over time. Second, when maintaining replicas of Iceberg tables across Regions or accounts, they must build and maintain complex architectures to track updates, manage object replication, and handle metadata transformations.

S3 Tables Intelligent-Tiering storage class
With the S3 Tables Intelligent-Tiering storage class, data is automatically tiered to the most cost-effective access tier based on access patterns. Data is stored in three low-latency tiers: Frequent Access, Infrequent Access (40% lower cost than Frequent Access), and Archive Instant Access (68% lower cost compared to Infrequent Access). After 30 days without access, data moves to Infrequent Access, and after 90 days, it moves to Archive Instant Access. This happens without changes to your applications or impact on performance.

Table maintenance activities, including compaction, snapshot expiration, and unreferenced file removal, operate without affecting the data’s access tiers. Compaction automatically processes only data in the Frequent Access tier, optimizing performance for actively queried data while reducing maintenance costs by skipping colder files in lower-cost tiers.

By default, all existing tables use the Standard storage class. When creating new tables, you can specify Intelligent-Tiering as the storage class, or you can rely on the default storage class configured at the table bucket level. You can set Intelligent-Tiering as the default storage class for your table bucket to automatically store tables in Intelligent-Tiering when no storage class is specified during creation.

Let me show you how it works
You can use the AWS Command Line Interface (AWS CLI) and the put-table-bucket-storage-class and get-table-bucket-storage-class commands to change or verify the storage tier of your S3 table bucket.

# Change the storage class
aws s3tables put-table-bucket-storage-class \
   --table-bucket-arn $TABLE_BUCKET_ARN  \
   --storage-class-configuration storageClass=INTELLIGENT_TIERING

# Verify the storage class
aws s3tables get-table-bucket-storage-class \
   --table-bucket-arn $TABLE_BUCKET_ARN  \

{ "storageClassConfiguration":
   { 
      "storageClass": "INTELLIGENT_TIERING"
   }
}

S3 Tables replication support
The new S3 Tables replication support helps you maintain consistent read replicas of your tables across AWS Regions and accounts. You specify the destination table bucket and the service creates read-only replica tables. It replicates all updates chronologically while preserving parent-child snapshot relationships. Table replication helps you build global datasets to minimize query latency for geographically distributed teams, meet compliance requirements, and provide data protection.

You can now easily create replica tables that deliver similar query performance as their source tables. Replica tables are updated within minutes of source table updates and support independent encryption and retention policies from their source tables. Replica tables can be queried using Amazon SageMaker Unified Studio or any Iceberg-compatible engine including DuckDB, PyIceberg, Apache Spark, and Trino.

You can create and maintain replicas of your tables through the AWS Management Console or APIs and AWS SDKs. You specify one or more destination table buckets to replicate your source tables. When you turn on replication, S3 Tables automatically creates read-only replica tables in your destination table buckets, backfills them with the latest state of the source table, and continually monitors for new updates to keep replicas in sync. This helps you meet time-travel and audit requirements while maintaining multiple replicas of your data.

Let me show you how it works
To show you how it works, I proceed in three steps. First, I create an S3 table bucket, create an Iceberg table, and populate it with data. Second, I configure the replication. Third, I connect to the replicated table and query the data to show you that changes are replicated.

For this demo, the S3 team kindly gave me access to an Amazon EMR cluster already provisioned. You can follow the Amazon EMR documentation to create your own cluster. They also created two S3 table buckets, a source and a destination for the replication. Again, the S3 Tables documentation will help you to get started.

I take a note of the two S3 Tables bucket Amazon Resource Names (ARNs). In this demo, I refer to these as the environment variables SOURCE_TABLE_ARN and DEST_TABLE_ARN.

First step: Prepare the source database

I start a terminal, connect to the EMR cluster, start a Spark session, create a table, and insert a row of data. The commands I use in this demo are documented in Accessing tables using the Amazon S3 Tables Iceberg REST endpoint.

sudo spark-shell \
--packages "org.apache.iceberg:iceberg-spark-runtime-3.5_2.12:1.4.1,software.amazon.awssdk:bundle:2.20.160,software.amazon.awssdk:url-connection-client:2.20.160" \
--master "local[*]" \
--conf "spark.sql.extensions=org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions" \
--conf "spark.sql.defaultCatalog=spark_catalog" \
--conf "spark.sql.catalog.spark_catalog=org.apache.iceberg.spark.SparkCatalog" \
--conf "spark.sql.catalog.spark_catalog.type=rest" \
--conf "spark.sql.catalog.spark_catalog.uri=https://s3tables.us-east-1.amazonaws.com/iceberg" \
--conf "spark.sql.catalog.spark_catalog.warehouse=arn:aws:s3tables:us-east-1:012345678901:bucket/aws-news-blog-test" \
--conf "spark.sql.catalog.spark_catalog.rest.sigv4-enabled=true" \
--conf "spark.sql.catalog.spark_catalog.rest.signing-name=s3tables" \
--conf "spark.sql.catalog.spark_catalog.rest.signing-region=us-east-1" \
--conf "spark.sql.catalog.spark_catalog.io-impl=org.apache.iceberg.aws.s3.S3FileIO" \
--conf "spark.hadoop.fs.s3a.aws.credentials.provider=org.apache.hadoop.fs.s3a.SimpleAWSCredentialProvider" \
--conf "spark.sql.catalog.spark_catalog.rest-metrics-reporting-enabled=false"

spark.sql("""
CREATE TABLE s3tablesbucket.test.aws_news_blog (
customer_id STRING,
address STRING
) USING iceberg
""")

spark.sql("INSERT INTO s3tablesbucket.test.aws_news_blog VALUES ('cust1', 'val1')")

spark.sql("SELECT * FROM s3tablesbucket.test.aws_news_blog LIMIT 10").show()
+-----------+-------+
|customer_id|address|
+-----------+-------+
|      cust1|   val1|
+-----------+-------+

So far, so good.

Second step: Configure the replication for S3 Tables

Now, I use the CLI on my laptop to configure the S3 table bucket replication.

Before doing so, I create an AWS Identity and Access Management (IAM) policy to authorize the replication service to access my S3 table bucket and encryption keys. Refer to the S3 Tables replication documentation for the details. The permissions I used for this demo are:

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
                "s3:*",
                "s3tables:*",
                "kms:DescribeKey",
                "kms:GenerateDataKey",
                "kms:Decrypt"
            ],
            "Resource": "*"
        }
    ]
}

After having created this IAM policy, I can now proceed and configure the replication:

aws s3tables-replication put-table-replication \
--table-arn ${SOURCE_TABLE_ARN} \
--configuration  '{
    "role": "arn:aws:iam::<MY_ACCOUNT_NUMBER>:role/S3TableReplicationManualTestingRole", 
    "rules":[
        {
            "destinations": [
                {
                    "destinationTableBucketARN": "${DST_TABLE_ARN}"
                }]
        }
    ]

The replication starts automatically. Updates are typically replicated within minutes. The time it takes to complete depends on the volume of data in the source table.

Third step: Connect to the replicated table and query the data

Now, I connect to the EMR cluster again, and I start a second Spark session. This time, I use the destination table.

S3 Tables replication - destination table

To verify the replication works, I insert a second row of data on the source table.

spark.sql("INSERT INTO s3tablesbucket.test.aws_news_blog VALUES ('cust2', 'val2')")

I wait a few minutes for the replication to trigger. I follow the status of the replication with the get-table-replication-status command.

aws s3tables-replication get-table-replication-status \
--table-arn ${SOURCE_TABLE_ARN} \
{
    "sourceTableArn": "arn:aws:s3tables:us-east-1:012345678901:bucket/manual-test/table/e0fce724-b758-4ee6-85f7-ca8bce556b41",
    "destinations": [
        {
            "replicationStatus": "pending",
            "destinationTableBucketArn": "arn:aws:s3tables:us-east-1:012345678901:bucket/manual-test-dst",
            "destinationTableArn": "arn:aws:s3tables:us-east-1:012345678901:bucket/manual-test-dst/table/5e3fb799-10dc-470d-a380-1a16d6716db0",
            "lastSuccessfulReplicatedUpdate": {
                "metadataLocation": "s3://e0fce724-b758-4ee6-8-i9tkzok34kum8fy6jpex5jn68cwf4use1b-s3alias/e0fce724-b758-4ee6-85f7-ca8bce556b41/metadata/00001-40a15eb3-d72d-43fe-a1cf-84b4b3934e4c.metadata.json",
                "timestamp": "2025-11-14T12:58:18.140281+00:00"
            }
        }
    ]
}

When replication status shows ready, I connect to the EMR cluster and I query the destination table. Without surprise, I see the new row of data.

S3 Tables replication - target table is up to date

Additional things to know
Here are a couple of additional points to pay attention to:

  • Replication for S3 Tables supports both Apache Iceberg V2 and V3 table formats, giving you flexibility in your table format choice.
  • You can configure replication at the table bucket level, making it straightforward to replicate all tables under that bucket without individual table configurations.
  • Your replica tables maintain the storage class you choose for your destination tables, which means you can optimize for your specific cost and performance needs.
  • Any Iceberg-compatible catalog can directly query your replica tables without additional coordination—they only need to point to the replica table location. This gives you flexibility in choosing query engines and tools.

Pricing and availability
You can track your storage usage by access tier through AWS Cost and Usage Reports and Amazon CloudWatch metrics. For replication monitoring, AWS CloudTrail logs provide events for each replicated object.

There are no additional charges to configure Intelligent-Tiering. You only pay for storage costs in each tier. Your tables continue to work as before, with automatic cost optimization based on your access patterns.

For S3 Tables replication, you pay the S3 Tables charges for storage in the destination table, for replication PUT requests, for table updates (commits), and for object monitoring on the replicated data. For cross-Region table replication, you also pay for inter-Region data transfer out from Amazon S3 to the destination Region based on the Region pair.

As usual, refer to the Amazon S3 pricing page for the details.

Both capabilities are available today in all AWS Regions where S3 Tables are supported.

To learn more about these new capabilities, visit the Amazon S3 Tables documentation or try them in the Amazon S3 console today. Share your feedback through AWS re:Post for Amazon S3 or through your AWS Support contacts.

— seb

Amazon S3 Storage Lens adds performance metrics, support for billions of prefixes, and export to S3 Tables

Post Syndicated from Veliswa Boya original https://aws.amazon.com/blogs/aws/amazon-s3-storage-lens-adds-performance-metrics-support-for-billions-of-prefixes-and-export-to-s3-tables/

Today, we’re announcing three new capabilities for Amazon S3 Storage Lens that give you deeper insights into your storage performance and usage patterns. With the addition of performance metrics, support for analyzing billions of prefixes, and direct export to Amazon S3 Tables, you have the tools you need to optimize application performance, reduce costs, and make data-driven decisions about your Amazon S3 storage strategy.

New performance metric categories
S3 Storage Lens now includes eight new performance metric categories that help identify and resolve performance constraints across your organization. These are available at organization, account, bucket, and prefix levels. For example, the service helps you identify small objects in a bucket or prefix that can  slow down application performance. This can be mitigated by batching small objects or using the Amazon S3 Express One Zone storage class for higher performance small object workloads.

To access the new performance metrics, you need to enable performance metrics in the S3 Storage Lens advanced tier when creating a new Storage Lens dashboard or editing an existing configuration.

Metric category Details Use case Mitigation
Read request size Distribution of read request sizes (GET) by day Identify dataset with small read request patterns that slow down performance Small request: Batch small objects or use Amazon S3 Express One Zone for high-performance small object workloads
Write request size Distribution of write request sizes (PUT, POST, COPY, and UploadPart) by day Identify dataset with small write request patterns that slow down performance Large request: Parallelize requests, use MPU or use AWS CRT
Storage size Distribution of object sizes Identify dataset with small small objects that slow down performance Small object sizes: Consider bundling small objects
Concurrent PUT 503 errors Number of 503s due to concurrent PUT operation on same object Identify prefixes with concurrent PUT throttling that slow down performance For single writer, modify retry behavior or use Amazon S3 Express One Zone. For multiple writers, use consensus mechanism or use Amazon S3 Express One Zone
Cross-Region data transfer Bytes transferred and requests sent across Region, in Region Identify potential performance and cost degradation due to cross-Region data access Co-locate compute with data in the same AWS Region
Unique objects accessed Number or percentage of unique objects accessed per day Identify datasets where small subset of objects are being frequently accessed. These can be moved to higher performance storage tier for better performance Consider moving active data to Amazon S3 Express One Zone or other caching solutions
FirstByteLatency (existing Amazon CloudWatch metric) Daily average of first byte latency metric The daily average per-request time from the complete request being received to when the response starts to be returned
TotalRequestLatency (existing Amazon CloudWatch metric) Daily average of Total Request Latency The daily average elapsed per request time from the first byte received to the last byte sent

How it works
On the Amazon S3 console I choose Create Storage Lens dashboard to create a new dashboard. You can also edit an existing dashboard configuration. I then configure general settings such as providing a Dashboard name, Status, and the optional Tags. Then, I choose Next.


Next, I define the scope of the dashboard by selecting Include all Regions and Include all buckets and specifying the Regions and buckets to be included.


I opt in to the Advanced tier in the Storage Lens dashboard configuration, select Performance metrics, then choose Next.


Next, I select Prefix aggregation as an additional metrics aggregation, then leave the rest of the information as default before I choose Next.


I select the Default metrics report, then General purpose bucket as the bucket type, and then select the Amazon S3 bucket in my AWS account as the Destination bucket. I leave the rest of the information as default, then select Next.


I review all the information before I choose Submit to finalize the process.


After it’s enabled, I’ll receive daily performance metrics directly in the Storage Lens console dashboard. You can also choose to export report in CSV or Parquet format to any bucket in your account or publish to Amazon CloudWatch. The performance metrics are aggregated and published daily and will be available at multiple levels: organization, account, bucket, and prefix. In this dropdown menu, I choose the % concurrent PUT 503 error for the Metric, Last 30 days for the Date range, and 10 for the Top N buckets.


The Concurrent PUT 503 error count metric tracks the number of 503 errors generated by simultaneous PUT operations to the same object. Throttling errors can degrade application performance. For a single writer, modify retry behavior or use higher performance storage tier such as Amazon S3 Express One Zone to mitigate concurrent PUT 503 errors. For multiple writers scenario, use a consensus mechanism to avoid concurrent PUT 503 errors or use higher performance storage tier such as Amazon S3 Express One Zone.

Complete analytics for all prefixes in your S3 buckets
S3 Storage Lens now supports analytics for all prefixes in your S3 buckets through a new Expanded prefixes metrics report. This capability removes previous limitations that restricted analysis to prefixes meeting a 1% size threshold and a maximum depth of 10 levels. You can now track up to billions of prefixes per bucket for analysis at the most granular prefix level, regardless of size or depth.

The Expanded prefixes metrics report includes all existing S3 Storage Lens metric categories: storage usage, activity metrics (requests and bytes transferred), data protection metrics, and detailed status code metrics.

How to get started
I follow the same steps outlined in the How it works section to create or update the Storage Lens dashboard. In Step 4 on the console, where you select export options, you can select the new Expanded prefixes metrics report. Thereafter, I can export the expanded prefixes metrics report in CSV or Parquet format to any general purpose bucket in my account for efficient querying of my Storage Lens data.


Good to know
This enhancement addresses scenarios where organizations need granular visibility across their entire prefix structure. For example, you can identify prefixes with incomplete multipart uploads to reduce costs, track compliance across your entire prefix structure for encryption and replication requirements, and detect performance issues at the most granular level.

Export S3 Storage Lens metrics to S3 Tables
S3 Storage Lens metrics can now be automatically exported to S3 Tables, a fully managed feature on AWS with built-in Apache Iceberg support. This integration provides daily automatic delivery of metrics to AWS managed S3 Tables for immediate querying without requiring additional processing infrastructure.

How to get started
I start by following the process outlined in Step 5 on the console, where I choose the export destination. This time, I choose Expanded prefixes metrics report. In addition to General purpose bucket, I choose Table bucket.

The new Storage Lens metrics are exported to new tables in an AWS managed bucket aws-s3.


I select the expanded_prefixes_activity_metrics table to view API usage metrics for expanded prefix reports.


I can preview the table on the Amazon S3 console or use Amazon Athena to query the table.


Good to know
S3 Tables integration with S3 Storage Lens simplifies metric analysis using familiar SQL tools and AWS analytics services such as Amazon Athena, Amazon QuickSight, Amazon EMR, and Amazon Redshift, without requiring a data pipeline. The metrics are automatically organized for optimal querying, with custom retention and encryption options to suit your needs.

This integration enables cross-account and cross-Region analysis, custom dashboard creation, and data correlation with other AWS services. For example, you can combine Storage Lens metrics with S3 Metadata to analyze prefix-level activity patterns and identify objects in prefixes with cold data that are eligible for transition to lower-cost storage tiers.

For your agentic AI workflows, you can use natural language to query S3 Storage Lens metrics in S3 Tables with the S3 Tables MCP Server. Agents can ask questions such as ‘which buckets grew the most last month?’ or ‘show me storage costs by storage class’ and get instant insights from your observability data.

Now available
All three enhancements are available in all AWS Regions where S3 Storage Lens is currently offered (except the China Regions and AWS GovCloud (US)).

These features are included in the Amazon S3 Storage Lens Advanced tier at no additional charge beyond standard advanced tier pricing. For the S3 Tables export, you pay only for S3 Tables storage, maintenance, and queries. There is no additional charge for the export functionality itself.

To learn more about Amazon S3 Storage Lens performance metrics, support for billions of prefixes, and export to S3 Tables, refer to the Amazon S3 user guide. For pricing details, visit the Amazon S3 pricing page.

Veliswa Boya.

Amazon OpenSearch Service improves vector database performance and cost with GPU acceleration and auto-optimization

Post Syndicated from Channy Yun (윤석찬) original https://aws.amazon.com/blogs/aws/amazon-opensearch-service-improves-vector-database-performance-and-cost-with-gpu-acceleration-and-auto-optimization/

Today we’re announcing serverless GPU acceleration and auto-optimization for vector index in Amazon OpenSearch Service that helps you build large-scale vector databases faster with lower costs and automatically optimize vector indexes for optimal trade-offs between search quality, speed, and cost.

Here are the new capabilities introduced today:

  • GPU acceleration – You can build vector databases up to 10 times faster at a quarter of the indexing cost when compared to non-GPU acceleration, and you can create billion-scale vector databases in under an hour. With significant gains in cost saving and speed, you get an advantage in time-to-market, innovation velocity, and adoption of vector search at scale.
  • Auto-optimization – You can find the best balance between search latency, quality, and memory requirements for your vector field without needing vector expertise. This optimization helps you achieve better cost-savings and recall rates when compared to default index configurations, while manual index tuning can take weeks to complete.

You can use these capabilities to build vector databases faster and more cost-effectively on OpenSearch Service. You can use them to power generative AI applications, search product catalogs and knowledge bases, and more. You can enable GPU acceleration and auto-optimization when you create a new OpenSearch domain or collection, as well as update an existing domain or collection.

Let’s go through how it works!

GPU acceleration for vector index
When you enable GPU acceleration on your OpenSearch Service domain or Serverless collection, OpenSearch Service automatically detects opportunities to accelerate your vector indexing workloads. This acceleration helps build the vector data structures in your OpenSearch Service domain or Serverless collection.

You don’t need to provision the GPU instances, manage their usage or pay for idle time. OpenSearch Service securely isolates your accelerated workloads to your domain’s or collection’s Amazon Virtual Private Cloud (Amazon VPC) within your account. You pay only for useful processing through the OpenSearch Compute Units (OCU) – Vector Acceleration pricing.

To enable GPU acceleration, go to the OpenSearch Service console and choose Enable GPU Acceleration in the Advanced features section when you create or update your OpenSearch Service domain or Serverless collection.

You can use the following AWS Command Line Interface (AWS CLI) command to enable GPU acceleration for an existing OpenSearch Service domain.

$ aws opensearch update-domain-config \
    --domain-name my-domain \
    --aiml-options '{"ServerlessVectorAcceleration": {"Enabled": true}}'

You can create a vector index optimized for GPU processing. This example index stores 768-dimensional vectors for text embeddings by enabling index.knn.remote_index_build.enabled.

PUT my-vector-index
{
    "settings": {
        "index.knn": true,
        "index.knn.remote_index_build.enabled": true
    },
    "mappings": {
        "properties": {
        "vector_field": {
        "type": "knn_vector",
        "dimension": 768,
      },
      "text": {
        "type": "text"
      }
    }
  }
}

Now you can add vector data and optimize your index using standard OpenSearch Service operations using the bulk API. The GPU acceleration is automatically applied to indexing and force-merge operations.

POST my-vector-index/_bulk
{"index": {"_id": "1"}}
{"vector_field": [0.1, 0.2, 0.3, ...], "text": "Sample document 1"}
{"index": {"_id": "2"}}
{"vector_field": [0.4, 0.5, 0.6, ...], "text": "Sample document 2"}

We ran index build benchmarks and observed speed gains from GPU acceleration ranging between 6.4 to 13.8 times. Stay tuned for more benchmarks and further details in upcoming posts.

To learn more, visit GPU acceleration for vector indexing in the Amazon OpenSearch Service Developer Guide.

Auto-optimizing vector databases
You can use the new vector ingestion feature to ingest documents from Amazon Simple Storage Service (Amazon S3), generate vector embeddings, optimize indexes automatically, and build large-scale vector indexes in minutes. During the ingestion, auto-optimization generates recommendations based on your vector fields and indexes of your OpenSearch Service domain or Serverless collection. You can choose one of these recommendations to quickly ingest and index your vector dataset instead of manually configuring these mappings.

To get started, choose Vector ingestion under the Ingestion menu in the left navigation pane of OpenSearch Service console.

You can create a new vector ingestion job with the following steps:

  • Prepare dataset – Prepare OpenSearch Service parquet documents in an S3 bucket and choose a domain or collection for your destination.
  • Configure index and automate optimizations – Auto-optimize your vector fields or manually configure them.
  • Ingest and accelerate indexing – Use OpenSearch ingestion pipelines to load data from Amazon S3 into OpenSearch Service. Build large vector indexes up to 10 times faster at a quarter of the cost.

In Step 2, configure your vector index with auto-optimize vector field. Auto-optimize is currently limited to one vector field. Further index mappings can be input after the auto-optimization job has completed.

Your vector field optimization settings depend on your use case. For example, if you need high search quality (recall rate) and don’t need faster responses, then choose Modest for the Latency requirements (p90) and more than or equal to 0.9 for the Acceptable search quality (recall). When you create a job, it starts to ingest vector data and auto-optimize vector index. The processing time depends on the vector dimensionality.

To learn more, visit Auto-optimize vector index in the OpenSearch Service Developer Guide.

Now available
GPU acceleration in Amazon OpenSearch Service is now available in the US East (N. Virginia), US West (Oregon), Asia Pacific (Sydney), Asia Pacific (Tokyo), and Europe (Ireland) Regions. Auto-optimization in OpenSearch Service is now available in the US East (Ohio), US East (N. Virginia), US West (Oregon), Asia Pacific (Mumbai), Asia Pacific (Singapore), Asia Pacific (Sydney), Asia Pacific (Tokyo), Europe (Frankfurt), and Europe (Ireland) Regions.

OpenSearch Service separately charges for used OCU – Vector Acceleration only to index your vector databases. For more information, visitOpenSearch Service pricing page.

Give it a try and send feedback to the AWS re:Post for Amazon OpenSearch Service or through your usual AWS Support contacts.

Channy

Amazon FSx for NetApp ONTAP now integrates with Amazon S3 for seamless data access

Post Syndicated from Veliswa Boya original https://aws.amazon.com/blogs/aws/amazon-fsx-for-netapp-ontap-now-integrates-with-amazon-s3-for-seamless-data-access/

Today, we’re announcing the ability to access your data in Amazon FSx for NetApp ONTAP file systems using Amazon Simple Storage Service (Amazon S3). With this capability, you can use your enterprise file data to augment generative AI applications with Amazon Bedrock Knowledge Bases for Retrieval Augmented Generation (RAG), train machine learning (ML) models with Amazon SageMaker, generate insights with Amazon S3 integrated third-party services, use comprehensive research capabilities in AI-powered business intelligence (BI) tools such as Amazon Quick Suite, and run analyses using Amazon S3 based cloud-native applications, all while your file data continues to reside in your FSx for NetApp ONTAP file system.

Amazon FSx for NetApp ONTAP is the first and fully AWS managed NetApp ONTAP file system in the cloud to migrate on-premises applications that rely on NetApp ONTAP or other network-attached storage (NAS) appliances to AWS without having to change how you manage your data. FSx for NetApp ONTAP provides the popular capabilities, high performance, and data management APIs of ONTAP file systems with the added benefits of the AWS Cloud, such as simplified management, on-demand scaling, and seamless integration with other AWS services.

Over the years, AWS has developed a broad range of industry-leading AI, ML, and analytics services and applications that work with data in Amazon S3 that organizations use to innovate faster, discover new insights, and make even better data-driven decisions. However, some organizations want to use these services with their enterprise file data stored in NetApp ONTAP or other NAS appliances.

How to get started
You can create and attach an S3 Access Point to your FSx for ONTAP file system using the Amazon FSx console, the AWS Command Line Interface (AWS CLI), or the AWS SDK.

I have an existing FSx for ONTAP file system demo-create-s3access which I created by following the steps in the Creating file systems in the FSx for ONTAP documentation. Using the Amazon FSx console I now choose the file system ID fs-0c45b011a7f071d70 to access the full details of the file system.

I’ll attach the access point to the volume of the file system. I choose the volume vol1 and then select Create S3 Access Point from the Actions dropdown menu.


I enter details such as the access point name, the type of file system user identity and the network configuration, then choose Create s3 Access Point to finalize the process.


After it’s created, the access point my-s3-accesspoint is ready to allow access to the file data stored in my file system demo-create-s3access from Amazon S3. Amazon Access Points are S3 endpoints that can be attached to Amazon FSx volumes and used to perform Amazon S3 object operations.


I can now bring proprietary data stored in the file system demo-create-s3access to Amazon S3 for use in applications that work with Amazon S3 while my file data continues to reside in the FSx for NetApp ONTAP file system using the access point my-s3-accesspoint (this data remains accessible through the file protocols).

For the walkthrough in this post, I’ll integrate with Quick Suite.

Integrating decades of enterprise file data with the latest AI powered BI tools on AWS
In the Quick Suite Console, in the left navigation pane, I choose Connections, then select Integrations. Before you begin, make sure that you have the correct permissions to the Amazon S3 AWS resource. You can control the AWS resources that Quick Suite can access by following the Amazon Quick Suite user guide.


After I’ve selected the Amazon S3 integration I enter my Amazon S3 Access Point alias as the S3 bucket URL, leave the rest of the information as default, then choose Create and continue.


I finalize the process by providing the Name of the knowledge base, the Description, then choose Create.


After the knowledge base has been created it’s automatically synchronized, it’s now available for interaction.


I want to learn more about the AWS European Sovereign Cloud so I’ve updated the file system (accessed through the S3 Access Point my-s3-accesspoin-iyytkgz83djdjj7abn3u711supfgkuse1b-ext-s3alias) with the AWS whitepaper on this topic. In the chat in Amazon Quick Suite. I start asking the first question “do we have any documentation on the europe sovereignty cloud?“. To answer my question, the chat agent accesses and analyzes various types of data sources I have permission to use, including uploaded files in my current conversation, spaces I have access to, knowledge bases from my integrations, and more.

When I verify the source, I see that the document I uploaded to my file system is listed as one of the sources.

Other use cases of Amazon S3 Access Points for Amazon FSx for NetApp ONTAP
Earlier, we looked at use cases such as connecting an organization’s proprietary file data to Amazon Quick Suite for advanced business intelligence. Additionally, Amazon S3 Access Points for Amazon FSx for NetApp ONTAP can be used to seamlessly integrate enterprise file data with comprehensive analytics services, such as Amazon Athena for serverless SQL queries or AWS Glue for ETL processing, to name a few.

Amazon S3 Access Points for Amazon FSx for NetApp ONTAP are also suitable for data access from serverless compute workloads that are cloud-native with containerized microservices that require flexible access to shared enterprise datasets, such as configuration files, reference data, content libraries, model artifacts, and application assets.

Now available
You can get started today using the Amazon FSx console, AWS CLI, or AWS SDK to attach Amazon S3 Access Points to your Amazon FSx for NetApp ONTAP file systems. The feature is available in the following AWS Regions: Africa (Cape Town), Asia Pacific (Hong Kong, Hyderabad, Jakarta, Melbourne, Mumbai, Osaka, Seoul, Singapore, Sydney, Tokyo), Canada (Central, Calgary), Europe (Frankfurt, Ireland, London, Milan, Paris, Spain, Stockholm, Zurich), Israel (Tel Aviv), Middle East (Bahrain, UAE), South America (Sao Paulo), US East (N. Virginia, Ohio), and US West (N. California Oregon). You’re billed by Amazon S3 for the requests and data transfer costs through your S3 Access Point, in addition to your standard Amazon FSx charges. Learn more on the Amazon FSx for NetApp ONTAP pricing page.

PS: Writing a blog post at AWS is always a team effort, even when you see only one name under the post title. In this case, I want to thank Luke Miller, for his expertise and generous help with technical guidance, which made this overview possible and comprehensive.

Veliswa Boya.

AWS Clean Rooms launches privacy-enhancing synthetic dataset generation for ML model training

Post Syndicated from Micah Walter original https://aws.amazon.com/blogs/aws/aws-clean-rooms-launches-privacy-enhancing-synthetic-dataset-generation-for-ml-model-training/

Today, we’re announcing privacy-enhancing synthetic dataset generation for AWS Clean Rooms, a new capability that organizations and their partners can use to generate privacy-enhancing synthetic datasets from their collective data to train regression and classification machine learning (ML) models. You can use this feature to generate synthetic training datasets that preserve the statistical patterns of the original data, without the model having access to original records, opening new opportunities for model training that were previously not possible due to privacy concerns.

When building ML models, data scientists and analysts typically face a fundamental tension between data utility and privacy protection. Access to high-quality, granular data is essential for training accurate models that can recognize trends, personalize experiences, and drive business outcomes. However, using granular data such as user-level event data from multiple parties raises significant privacy concerns and compliance challenges. Organizations want to answer questions like, “What characteristics indicate a high-probability customer conversion?”, but training on the individual-level signals often conflicts with privacy policies and regulatory requirements.

Privacy-enhancing synthetic dataset generation for custom ML
To address this challenge, we’re introducing privacy-enhancing synthetic dataset generation in AWS Clean Rooms ML, which organizations can use to create synthetic versions of sensitive datasets that can be more securely used for ML model training. This capability uses advanced ML techniques to generate new datasets that maintain the statistical properties of the original data while de-identifying subjects from the original source data.

Traditional anonymization techniques such as masking still carry the risk of re-identifying individuals in a dataset—knowing attributes about a person such as zip code and date of birth can be sufficient to identify them with census data. Privacy-enhancing synthetic dataset generation addresses this risk through a fundamentally different approach. The system trains a model that learns the essential statistical patterns of the original dataset, then generates synthetic records by sampling values from the original dataset and using the model to predict the predicted value column. Rather than merely copying or perturbing the original data, the system uses a model capacity reduction technique to mitigate the risk that the model will memorize information about individuals in the training data. The resulting synthetic dataset has the same schema and statistical characteristics as the original data, making it suitable for training classification and regression models. This approach quantifiably reduces the risk of re-identification.

Organizations using this capability have control over the privacy parameters, including the amount of noise applied and the level of protection against membership inference attacks, where an adversary attempts to determine whether a specific individual’s data was included in the training set. After generating the synthetic dataset, AWS Clean Rooms provides detailed metrics to help customers and their compliance teams understand the quality of the synthetic dataset across two critical dimensions: fidelity to the original data and privacy preservation. The fidelity score uses KL-divergence to measure how similar the synthetic data is to the original dataset, and the privacy score quantifies how likely the dataset is protected from membership inference attacks.

Working with synthetic data in AWS Clean Rooms
Getting started with privacy-enhancing synthetic dataset generation follows the established AWS Clean Rooms ML custom models workflow, with new steps to specify privacy requirements and review quality metrics. Organizations begin by creating configured tables with analysis rules using their preferred data sources, then join or create a collaboration with their partners and associate their tables with that collaboration.

The new capability introduces an enhanced analysis template where data owners define not only the SQL query that creates the dataset but also specify that the resulting dataset must be synthetic. Within this template, organizations classify columns to indicate which column the ML model will predict and which columns contain categorical versus numerical values. Critically, the template also includes privacy thresholds that the generated synthetic data must meet to be made available for training. These include an epsilon value that specifies how much noise must be present in the synthetic data to protect against re-identification, and a minimum protection score against membership inference attacks. Setting these thresholds appropriately requires understanding your organization’s specific privacy and compliance requirements, and we recommend engaging with your legal and compliance teams during this process.

After all data owners review and approve the analysis template, a collaboration member creates a machine learning input channel that references the template. AWS Clean Rooms then begins the synthetic dataset generation process, which typically completes within a few hours depending on the size and complexity of the dataset. If the generated synthetic dataset meets the required privacy thresholds defined in the analysis template, a synthetic machine learning input channel becomes available along with detailed quality metrics. Data scientists can review the actual protection score achieved against a simulated membership inference attack.

Once satisfied with the quality metrics, organizations can proceed to train their ML models using the synthetic dataset within the AWS Clean Rooms collaboration. Depending on the use case, they can export the trained model weights or continue to run inference jobs within the collaboration itself.

Let’s try it out
When creating a new AWS Clean Rooms collaboration, I can now set who pays for synthetic dataset generation.

After my Collaboration is configured, I can choose Require analysis template output to be synthetic when creating a new analysis template.

After my synthetic analysis template is ready, I can use it when running protected queries and view all the relevant ML input channel details.

Clean Rooms Synthetic Data Console

Now available
You can start using privacy-enhancing synthetic dataset generation through AWS Clean Rooms today. The feature is available in all commercial AWS Regions where AWS Clean Rooms is available. Learn more about it in the AWS Clean Rooms documentation.

Privacy-enhancing synthetic dataset generation is billed separately based on usage. You pay only for the compute used to generate your synthetic dataset, charged as Synthetic Data Generation Units (SDGUs). The number of SDGUs varies based on the size and complexity of your original dataset. This fee can be configured as a payer setting, meaning any collaboration member can agree to pay the costs. For more information on pricing, refer to the AWS Clean Rooms pricing page.

The initial release supports training classification and regression models on tabular data. The synthetic datasets work with standard ML frameworks and can be integrated into existing model development pipelines without requiring changes to your workflows.

This capability represents a significant advancement in privacy-enhanced machine learning. Organizations can unlock the value of sensitive user-level data for model training while mitigating the risk that sensitive information about individual users could be leaked. Whether you’re optimizing advertising campaigns, personalizing insurance quotes, or enhancing fraud detection systems, privacy-enhancing synthetic dataset generation makes it possible to train more accurate models through data collaboration while respecting individual privacy.

Run Apache Spark and Apache Iceberg write jobs 2x faster with Amazon EMR

Post Syndicated from Atul Payapilly original https://aws.amazon.com/blogs/big-data/run-apache-spark-and-apache-iceberg-write-jobs-2x-faster-with-amazon-emr/

Amazon EMR runtime for Apache Spark offers a high-performance runtime environment while maintaining API compatibility with open source Apache Spark and Apache Iceberg table format. Amazon EMR on EC2, Amazon EMR Serverless, Amazon EMR on Amazon EKS, Amazon EMR on AWS Outposts and AWS Glue use the optimized runtimes.

In this post, we demonstrate the write performance benefits of using the Amazon EMR 7.12 runtime for Spark and Iceberg compares to open source Spark 3.5.6 with Iceberg 1.10.0 tables on a 3TB merge workload.

Write Benchmark Methodology

Our benchmarks demonstrate that Amazon EMR 7.12 can run 3TB merge workloads over 2 times faster than open source Spark 3.5.6 with Iceberg 1.10.0, delivering significant improvements for data ingestion and ETL pipelines while providing the advanced features of Iceberg including ACID transactions, time travel, and schema evolution.

Benchmark workload

To evaluate the write performance improvements in Amazon EMR 7.12, we chose a merge workload that reflects common data ingestion and ETL patterns. The benchmark consists of 37 basic merge operations on TPC-DS 3TB tables, testing the performance of INSERT, UPDATE, and DELETE operations. The workload is inspired by established benchmarking approaches from the open source community, including Delta Lake’s merge benchmark methodology and the LST-Bench framework. We combined and adapted these approaches to create a comprehensive test of Iceberg write performance on AWS. We also started with an initial focus on copy-on-write performance only.

Workload characteristics

The benchmark executes 37 basic sequential merge queries that modify TPC-DS fact tables. The 37 queries are organized into three categories:

  • Inserts (queries m1-m6): Adding new records to tables with varying data volumes. These queries use source tables with 5-100% new records and zero matches, testing pure insert performance at different scales.
  • Upserts (queries m8-m16): Modifying existing records while inserting new ones. These upsert operations combine different ratios of matched and non-matched records—for example, 1% matches with 10% inserts, or 99% matches with 1% inserts—representing typical scenarios where data is both updated and augmented.
  • Deletes (queries m7, m17-m37): Removing records with varying selectivity. These range from small, targeted deletes affecting 5% of files and rows to large-scale deletions, including partition-level deletes that can be optimized to metadata-only operations.

The queries operate on the table state created by previous operations, simulating real ETL pipelines where subsequent steps depend on earlier transformations. For example, the first six queries insert between 607,000 and 11.9 million records into the web_returns table. Later queries then update and delete from this modified table, testing read-after-write performance. Source tables were generated by sampling the TPC-DS web_returns table with controlled match/non-match ratios for consistent test conditions across the benchmark runs.

The merge operations vary in scale and complexity:

  • Small operations affecting 607,000 records
  • Large operations modifying over 12 million records
  • Selective deletes requiring file rewrites
  • Partition-level deletes optimized to metadata operations

Benchmark configuration

We ran the benchmark on identical hardware for both Amazon EMR 7.12 and open source Spark 3.5.6 with Iceberg 1.10.0:

  • Cluster: 9 r5d.4xlarge instances (1 primary, 8 workers)
  • Compute: 144 total vCPUs, 1,152 GB memory
  • Storage: 2 x 300 GB NVMe SSD per instance
  • Catalog: Hadoop Catalog
  • Data format: Parquet files on Amazon S3
  • Table format: Apache Iceberg (default: copy-on-write mode)

Benchmark results

We compared benchmark results for Amazon EMR 7.12 to open source Spark 3.5.6 and Iceberg 1.10.0. We ran the 37 merge queries in three sequential iterations, and the average runtime across these iterations was taken for comparison. The following table shows the results averaged across three iterations:

Amazon EMR 7.12 (seconds) Open Source Spark 3.5.6 + Iceberg 1.10.0 (seconds) Speedup
443.58 926.63 2.08x

The average runtime for the three iterations on Amazon EMR 7.12 with Iceberg enabled was 443.58 seconds, demonstrating a 2.08x speed increase compared to open source Spark 3.5.6 and Iceberg 1.10.0. The following figure presents the total runtimes in seconds.

The following table summarizes the metrics.

Metric Amazon EMR 7.12 on EC2 Open source Spark 3.5.6 and Iceberg 1.10.0
Average runtime in seconds 443.58 926.63
Geometric mean over queries in seconds 6.40746 18.50945
Cost* $1.58 $2.68

*Detailed cost estimates are discussed later in this post.

The following chart demonstrates the per-query performance improvement of Amazon EMR 7.12 relative to open source Spark 3.5.6 and Iceberg 1.10.0. The extent of the speedup varies from one query to another, with the fastest up to 13.3 times faster for query m31, with Amazon EMR outperforming open source Spark with Iceberg tables. The horizontal axis arranges the TPC-DS 3TB benchmark queries in descending order based on the performance improvement seen with Amazon EMR, and the vertical axis depicts the magnitude of this speedup as a ratio.

Performance optimizations in Amazon EMR

Amazon EMR 7.12 achieves over 2x faster write performance through systematic optimizations across the write execution pipeline. These improvements span multiple areas:

  • Metadata-only delete operations: When deleting entire partitions, EMR can now optimize these operations to metadata-only changes, eliminating the need to rewrite data files. This significantly reduces the time and cost for partition-level delete operations.
  • Bloom filter joins for merge operations: Enhanced join strategies using bloom filters reduce the amount of data that needs to be read and processed during merge operations, particularly benefiting queries with selective predicates.
  • Parallel file write out: Optimized parallelism during the write phase of merge operations improves throughput when writing filtered results back to Amazon S3, reducing overall merge operation time. We balanced the parallelism with read performance for overall optimized performance on the entire workload.

These optimizations work together to deliver consistent performance improvements across diverse write patterns. The result is significantly faster data ingestion and ETL pipeline execution while maintaining Iceberg’s ACID assurances and data consistency of Iceberg.

Cost comparison

Our benchmark provides the total runtime and geometric mean data to assess the performance of Spark and Iceberg in a complex, real-world decision support scenario. For additional insights, we also examine the cost aspect. We calculate cost estimates using formulas that account for EC2 On-Demand instances, Amazon Elastic Block Store (Amazon EBS), and Amazon EMR expenses.

  • Amazon EC2 cost (includes SSD cost) = number of instances * r5d.4xlarge hourly rate * job runtime in hours
    • 4xlarge hourly rate = $1.152 per hour
  • Root Amazon EBS cost = number of instances * Amazon EBS per GB-hourly rate * root EBS volume size * job runtime in hours
  • Amazon EMR cost = number of instances * r5d.4xlarge Amazon EMR cost * job runtime in hours
    • 4xlarge Amazon EMR cost = $0.27 per hour
  • Total cost = Amazon EC2 cost + root Amazon EBS cost + Amazon EMR cost

The calculations reveal that the Amazon EMR 7.12 benchmark yields a 1.7x cost efficiency improvement over open source Spark 3.5.6 and Iceberg 1.10.0 in running the benchmark job.

Metric Amazon EMR 7.12 Open source Spark 3.5.6 and Iceberg 1.10.0
Runtime in seconds 443.58 926.63
Number of EC2 instances(Includes primary node) 9 9
Amazon EBS Size 20gb 20gb
Amazon EC2(Total runtime cost) $1.28 $2.67
Amazon EBS cost $0.00 $0.01
Amazon EMR cost $0.30 $0
Total cost $1.58 $2.68
Cost savings Amazon EMR 7.12 is 1.7 times better Baseline

Run open source Spark benchmarks on Iceberg tables

We used separate EC2 clusters, each equipped with nine r5d.4xlarge instances, for testing both open source Spark 3.5.6 and Amazon EMR 7.12 for Iceberg workload. The primary node was equipped with 16 vCPU and 128 GB of memory, and the eight worker nodes together had 128 vCPU and 1024 GB of memory. We conducted tests using the Amazon EMR default settings to showcase the typical user experience and minimally adjusted the settings of Spark and Iceberg to maintain a balanced comparison.

The following table summarizes the Amazon EC2 configurations for the primary node and eight worker nodes of type r5d.4xlarge.

EC2 Instance vCPU Memory (GiB) Instance storage (GB) EBS root volume (GB)
r5d.4xlarge 16 128 2 x 300 NVMe SSD 20 GB

Benchmarking instructions

Follow the steps below to run the benchmark:

  1. For the open source run, create a Spark cluster on Amazon EC2 using Flintrock with the configuration described previously.
  2. Setup the TPC-DS source data with Iceberg in your S3 bucket.
  3. Build the benchmark application jar from the source to run the benchmarking and get the results.

Detailed instructions are provided in the emr-spark-benchmark GitHub repository.

Summarize the results

After the Spark job finishes, retrieve the test result file from the output S3 bucket at s3://<YOUR_S3_BUCKET>/benchmark_run/timestamp=xxxx/summary.csv/xxx.csv. This can be done either through the Amazon S3 console by navigating to the specified bucket location or by using the Amazon Command Line Interface (AWS CLI). The Spark benchmark application organizes the data by creating a timestamp folder and placing a summary file within a folder labeled summary.csv. The output CSV files contain four columns without headers:

  • Query name
  • Median time
  • Minimum time
  • Maximum time

With the data from three separate test runs with one iteration each time, we can calculate the average and geometric mean of the benchmark runtimes.

Clean up

To help prevent future charges, delete the resources you created by following the instructions provided in the Cleanup section of the GitHub repository.

Summary

Amazon EMR is consistently enhancing the EMR runtime for Spark when used with Iceberg tables, achieving write performance that is over 2 times faster than open source Spark 3.5.6 and Iceberg 1.10.0 with EMR 7.12 on 3TB merge workloads. This represents a significant improvement for data ingestion and ETL pipelines, helping to deliver 1.7x cost reduction while maintaining the ACID assurances of Iceberg. We encourage you to keep up to date with the latest Amazon EMR releases to fully benefit from ongoing performance improvements.

To stay informed, subscribe to the RSS feed for the AWS Big Data Blog, where you can find updates on the EMR runtime for Spark and Iceberg, as well as tips on configuration best practices and tuning recommendations.


About the authors

Atul Felix Payapilly is a software development engineer for Amazon EMR at Amazon Web Services.

Akshaya KP is a software development engineer for Amazon EMR at Amazon Web Services.

Hari Kishore Chaparala is a software development engineer for Amazon EMR at Amazon Web Services.

Giovanni Matteo is the Senior Manager for the Amazon EMR Spark and Iceberg group.

Medidata’s journey to a modern lakehouse architecture on AWS

Post Syndicated from Mike Araujo original https://aws.amazon.com/blogs/big-data/medidatas-journey-to-a-modern-lakehouse-architecture-on-aws/

This post was co-authored by Mike Araujo Principal Engineer at Medidata Solutions.

The life sciences industry is transitioning from fragmented, standalone tools towards integrated, platform-based solutions. Medidata, a Dassault Systèmes company, is building a next-generation data platform that addresses the complex challenges of modern clinical research. In this post, we show you how Medidata created a unified, scalable, real-time data platform that serves thousands of clinical trials worldwide with AWS services, Apache Iceberg, and a modern lakehouse architecture.

Challenges with legacy architecture

As the Medidata clinical data repository expanded, the team recognized the shortcomings of the legacy data solution to provide quality data products to their customers across their growing portfolio of data offerings. Several data tenants began to erode. The following diagram shows Medidata’s legacy extract, transform, and load (ETL) architecture.

Built upon a series of scheduled batch jobs, the legacy system proved ill-equipped to provide a unified view of the data across the entire ecosystem. Batch jobs ran at different intervals, often requiring a sufficient degree of scheduling buffer to make sure upstream jobs completed within the expected window. As the data volume expanded, the jobs and their schedules continued to inflate, introducing a latency window between ingestion and processing for dependent consumers. Different consumers operating from various underlying data services further magnified the problem as pipelines had to be continuously built across a variety of data delivery stacks.

The expanding portfolio of pipelines began to overwhelm existing maintenance operations. With more operations, the opportunity for failure expanded and recovery efforts further complicated. Existing observability systems were inundated with operational data, and identifying the root cause of data quality issues became a multi-day endeavor. Increases in the data volume required scaling considerations across the entire data estate.

Additionally, the proliferation of data pipelines and copies of the data in different technologies and storage systems necessitated expanding access controls with enhanced security features to make sure only the correct users had access to the subset of data to which they were permitted. Making sure access control changes were correctly propagated across all systems added a further layer of complexity to consumers and producers.

Solution overview

With the advent of Clinical Data Studio (Medidata’s unified data management and analytics solution for clinical trials) and Data Connect (Medidata’s data solution for acquiring, transforming, and exchanging electronic health record (EHR) data across healthcare organizations), Medidata introduced a new world of data discovery, analysis, and integration to the life sciences industry powered by open source technologies and hosted on AWS. The following diagram illustrates the solution architecture.

Fragmented batch ETL jobs were replaced by real-time Apache Flink streaming pipelines, an open source, distributed engine for stateful processing, and powered by Amazon Elastic Kubernetes Service (Amazon EKS), a fully managed Kubernetes service. The Flink jobs write to Apache Kafka running in Amazon Managed Apache Kafka (Amazon MSK), a streaming data service that manages Kafka infrastructure and operations, before landing in Iceberg tables backed by the AWS Glue Data Catalog, a centralized metadata repository for data assets. From this collection of Iceberg tables, a central, single source of data is now accessible from a variety of consumers without additional downstream processing, alleviating the need for custom pipelines to satisfy the requirements of downstream consumers. Through these fundamental architectural changes, the team at Medidata solved the issues presented by the legacy solution.

Data availability and consistency

With the introduction of the Flink jobs and Iceberg tables, the team was able to deliver a consistent view of their data across the Medidata data experience. Pipeline latency was reduced from days to minutes, helping Medidata customers realize a 99% performance gain from the data ingestion to the data analytics layers. Due to Iceberg’s interoperability, Medidata users saw the same view of the data regardless of where they viewed that data, minimizing the need for consumer-driven custom pipelines because Iceberg could plug into existing consumers.

Maintenance and durability

Iceberg’s interoperability provided a single copy of the data to satisfy their use cases, so the Medidata team could focus its observation and maintenance efforts on a five-times smaller subset of operations than previously required. Observability was enhanced by tapping into the various metadata components and metrics exposed by Iceberg and the Data Catalog. Quality management transformed from cross-system traces and queries to a single analysis of unified pipelines, with an added benefit of point in time data queries thanks to the Iceberg snapshot feature. Data volume increases are handled with out-of-box scaling supported by the entire infrastructure stack and AWS Glue Iceberg optimization features that include compaction, snapshot retention, and orphan file deletion, which provide a set-and-forget experience for solving a number of common Iceberg frustrations, such as the small file problem, orphan file retention, and query performance.

Security

With Iceberg at the center of its solution architecture, the Medidata team no longer had to spend the time building custom access control layers with enhanced security features at each data integration point. Iceberg on AWS centralizes the authorization layer using familiar systems such as AWS Identity and Access Management (IAM), providing a single and durable control for data access. The data also stays entirely within the Medidata virtual private cloud (VPC), further reducing the opportunity for unintended disclosures.

Conclusion

In this post, we demonstrated how legacy universe of consumer-driven custom ETL pipelines can be replaced with a scalable, high-performant streaming lakehouses. By putting Iceberg on AWS at the center of data operations, you can have a single source of data for your consumers.

To learn more about Iceberg on AWS, refer to Optimizing Iceberg tables and Using Apache Iceberg on AWS.


About the authors

Mike Araujo

Mike is a Principal Engineer at Medidata Solutions, working on building a next generation data and AI platform for clinical data and trials. By using the power of open source technologies such as Apache Kafka, Apache Flink, and Apache Iceberg, Mike and his team have enabled the delivery of billions of clinical events and data transformations in near real time to downstream consumers, applications, and AI agents. His core skills focus on architecting and building big data and ETL solutions at scale as well as their integration in agentic workflows.

Sandeep Adwankar

Sandeep is a Senior Product Manager at AWS, who has driven feature launches across Amazon SageMaker, AWS Glue, and AWS Lake Formation. He has led initiatives in Amazon S3 Tables analytics, Iceberg compaction strategies, and AWS Glue Iceberg optimizations. His recent work focuses on generative AI and autonomous systems, including the AWS Glue Data Catalog model context protocol and Amazon Bedrock structured knowledge bases. Based in the California Bay Area, he works with customers around the globe to translate business and technical requirements into products that accelerate their business outcomes.

Ian Beatty

Ian is a Technical Account Manager at AWS, where he specializes in supporting independent software vendor (ISV) customers in the healthcare and life sciences (HCLS) and financial services industry (FSI) sectors. Based in the Rochester, NY area, Ian helps ISV customers navigate their cloud journey by maintaining resilient and optimized workloads on AWS. With over a decade of experience building on AWS since 2014, he brings deep technical expertise from his previous roles as an AWS Architect and DevSecOps team lead for SaaS ISVs before joining AWS more than 3 years ago.

Ashley Chen

Ashley is a Solutions Architect at AWS based in Washington D.C. She supports independent software vendor (ISV) customers in the healthcare and life sciences industries, focusing on customer enablement, generative AI applications, and container workloads.

New one-click onboarding and notebooks with a built-in AI agent in Amazon SageMaker Unified Studio

Post Syndicated from Channy Yun (윤석찬) original https://aws.amazon.com/blogs/aws/new-one-click-onboarding-and-notebooks-with-ai-agent-in-amazon-sagemaker-unified-studio/

Today we’re announcing a faster way to get started with your existing AWS datasets in Amazon SageMaker Unified Studio. You can now start working with any data you have access to in a new serverless notebook with a built-in AI agent, using your existing AWS Identity and Access Management (IAM) roles and permissions.

New updates include:

  • One-click onboarding – Amazon SageMaker can now automatically create a project in Unified Studio with all your existing data permissions from AWS Glue Data Catalog, AWS Lake Formation, and Amazon Simple Storage Services (Amazon S3).
  • Direct integration – You can launch SageMaker Unified Studio directly from Amazon SageMaker, Amazon Athena, Amazon Redshift, and Amazon S3 Tables console pages, giving a fast path to analytics and AI workloads.
  • Notebooks with a built-in AI agent – You can use a new serverless notebook with a built-in AI agent, which supports SQL, Python, Spark, or natural language and gives data engineers, analysts, and data scientists one place to develop and run both SQL queries and code.

You also have access to other tools such as a Query Editor for SQL analysis, JupyterLab integrated developer environment (IDE), Visual ETL and workflows, and machine learning (ML) capabilities.

Try one-click onboarding and connect to Amazon SageMaker Unified Studio
To get started, go to the SageMaker console and choose the Get started button.

You will be prompted either to select an existing AWS Identity and Access Management (AWS IAM) role that has access to your data and compute, or to create a new role.

Choose Set up. It takes a few minutes to complete your environment. After this role is granted access, you’ll be taken to the SageMaker Unified Studio landing page where you will see the datasets that you have access to in AWS Glue Data Catalog as well as a variety of analytics and AI tools to work with.

This environment automatically creates the following serverless compute: Amazon Athena Spark, Amazon Athena SQL, AWS Glue Spark, and Amazon Managed Workflows for Apache Airflow (MWAA) serverless. This means you completely skip provisioning and can start working immediately with just-in-time compute resources, and it automatically scales back down when you finish, helping to save on costs.

You can also get started working on specific tables in Amazon Athena, Amazon Redshift, and Amazon S3 Tables. For example, you can select Query your data in Amazon SageMaker Unified Studio and then choose Get started in Amazon Athena console.

If you start from these consoles, you’ll connect directly to the Query Editor with the data that you were looking at already accessible, and your previous query context preserved. By using this context-aware routing, you can run queries immediately once inside the SageMaker Unified Studio without unnecessary navigation.

Getting started with notebooks with a built-in AI agent
Amazon SageMaker is introducing a new notebook experience that provides data and AI teams with a high-performance, serverless programming environment for analytics and ML jobs. The new notebook experience includes Amazon SageMaker Data Agent, a built-in AI agent that accelerates development by generating code and SQL statements from natural language prompts while guiding users through their tasks.

To start a new notebook, choose the Notebooks menu in the left navigation pane to run SQL queries, Python code, and natural language, and to discover, transform, analyze, visualize, and share insights on data. You can get started with sample data such as customer analytics and retail sales forecasting.

When you choose a sample project for customer usage analysis, you can open sample notebook to explore customer usage patterns and behaviors in a telecom dataset.

As I noted, the notebook includes a built-in AI agent that helps you interact with your data through natural language prompts. For example, you can start with data discovery using prompts like:

Show me some insights and visualizations on the customer churn dataset.

After you identify relevant tables, you can request specific analysis to generate Spark SQL. The AI agent creates step-by-step plans with initial code for data transformations and Python code for visualizations. If you see an error message while running the generated code, choose Fix with AI to get help resolving it. Here is a sample result:

For ML workflows, use specific prompts like:

Build an XGBoost classification model for churn prediction using the churn table, with purchase frequency, average transaction value, and days since last purchase as features.

This prompt receives structured responses including a step-by-step plan, data loading, feature engineering, and model training code using the SageMaker AI capabilities, and evaluation metrics. SageMaker Data Agent works best with specific prompts and is optimized for AWS data processing services including Athena for Apache Spark and SageMaker AI.

To learn more about new notebook experience, visit the Amazon SageMaker Unified Studio User Guide.

Now available
One-click onboarding and the new notebook experience in Amazon SageMaker Unified Studio are now available in US East (Ohio), US East (N. Virginia), US West (Oregon), Asia Pacific (Mumbai), Asia Pacific (Singapore), and Asia Pacific (Sydney), Asia Pacific (Tokyo), Europe (Frankfurt), Europe (Ireland) Regions. To learn more, visit the SageMaker Unified Studio product page.

Give it a try in the SageMaker console and send feedback to AWS re:Post for SageMaker Unified Studio or through your usual AWS Support contacts.

Channy

New business metadata features in Amazon SageMaker Catalog to improve discoverability across organizations

Post Syndicated from Channy Yun (윤석찬) original https://aws.amazon.com/blogs/aws/new-business-metadata-features-in-amazon-sagemaker-catalog-to-improve-discoverability-across-organizations/

Amazon SageMaker Catalog, which is now built in to Amazon SageMaker, can help you collect and organize your data with the accompanying business context people need to understand it. It automatically documents assets generated by AWS Glue and Amazon Redshift, and it connects directly with Amazon Quick Sight, Amazon Simple Storage Service (Amazon S3) buckets, Amazon S3 Tables, and AWS Glue Data Catalog (GDC).

With only a few clicks, you can curate data inventory assets with the required business metadata by adding or updating business names (asset and schema), descriptions (asset and schema), read me, glossary terms (asset and schema), and metadata forms. You can also create AI-generated suggestions, review and refine descriptions, and publish enriched asset metadata directly to the catalog. This helps reduce manual documentation effort, improves metadata consistency, and accelerates asset discoverability across organizations.

Starting today, you can use new capabilities in Amazon SageMaker Catalog metadata to improve business metadata and search:

  • Column-level metadata forms and rich descriptions – You can create custom metadata forms to capture business-specific information directly in individual columns. Columns also support markdown-enabled rich text descriptions for comprehensive data documentation and business context.
  • Enforce metadata rules for glossary terms for asset publishing – You can use metadata enforcement rules for glossary terms, meaning data producers must use approved business vocabulary when publishing assets. By standardizing metadata practices, your organization can improve compliance, enhance audit readiness, and streamline access workflows for greater efficiency and control.

These new SageMaker Catalog metadata capabilities help address consistent data classification and improve discoverability across your organizational catalogs. Let’s take a closer look at each capability.

Column-level metadata forms and rich descriptions
You can now use custom metadata forms and rich text descriptions at the column level, extending existing curation capabilities for business names, descriptions, and glossary term classifications. Custom metadata form field values and rich text content are indexed in real time and become immediately discoverable through search.

To edit column-level metadata, select the schema of your catalog asset used in your project and choose the View/Edit action for each column.

When you choose one of the columns as an asset owner, you can define custom key-value metadata forms and markdown descriptions to provide detailed column documentation.

Now data analysts in your organization can search using custom form field values and rich text content, alongside existing column names, descriptions, and glossary terms.

Enforce metadata rules for glossary terms for asset publishing
You can define mandatory glossary term requirements for data assets during the publishing workflow. Your data producers must now classify their assets with approved business terms from organizational glossaries before publication, promoting consistent metadata standards and improving data discoverability. The enforcement rules validate that required glossary terms are applied, preventing assets from being published without proper business context.

To enable a new metadata rule for glossary terms, choose Add in your domain units under the Domain Management section in the Govern menu.

Now you can select either Metadata forms or Glossary association as a type of requirement for the rule. When you select Glossary association, you can choose up to 5 required glossary terms per rule.

If you attempt to publish assets without adding the required glossary terms, the error message prompting you to enforce the glossary rule appears.

Standardizing metadata and aligning data schemas with business language enhances data governance and improves search relevance, helping your organization better understand and trust published data.

You can use AWS Command Line Interface (AWS CLI) and AWS SDKs to use these features. To learn more, visit the Amazon SageMaker Unified Studio data catalog in the Amazon SageMaker Unified Studio User Guide.

Now available
The new metadata capabilities are now available in AWS Regions where Amazon SageMaker Catalog is available.

Give it a try and send feedback to AWS re:Post for Amazon SageMaker Catalog or through your usual AWS Support contacts.

Channy

Your guide to AWS Analytics at AWS re:Invent 2025

Post Syndicated from Sonu Kumar Singh original https://aws.amazon.com/blogs/big-data/your-guide-to-aws-analytics-at-aws-reinvent-2025/

re:Invent banner

It’s that time of year again — AWS re:Invent is here! At re:Invent, bold ideas come to life. Get a front-row seat to hear inspiring stories from AWS experts, customers, and leaders as they explore today’s most impactful topics, from data analytics to AI.

For all the data enthusiasts and professionals, we’ve curated a comprehensive guide to every analytics session to help you plan your perfect agenda. Make sure to secure your seat early for must-attend sessions via the attendee portal.

Pro tip: Even if a session shows as fully reserved, we encourage you to join the walk-up line at the session location. Based on previous years’ experiences, additional seats often become available due to no-shows or last-minute schedule changes. The walk-up line operates on a first-come, first-served basis, and many attendees have successfully accessed their desired sessions this way. Just be sure to arrive at least 15 minutes before the session starts for the best chance of getting a seat.

Can’t make it in person? No problem — grab a free virtual pass to stream live sessions from anywhere.

And don’t forget to stop by the AWS Kiosk in the AWS Village Expo for AWS Analytics, Amazon SageMaker, Amazon OpenSearch Service and AWS Messaging and Streaming services! See live demos of analytics services, meet AWS experts, get your toughest data questions answered, explore the latest launches, join our data trivia, and even win exclusive AWS-authored books and many more swags.

Data Innovation Talk

INV201 | Harnessing analytics for humans and AI

Emerging trends, ranging from Open Table Formats (OTF) to agentic infrastructure, are rapidly changing how humans and applications interact with analytics to drive mission-critical business decisions. Join Mai-Lan Tomsen Bukovec, VP of AWS Technology, to explore emerging trends, the evolution of analytics engines and applications, and how to future-proof your data foundation for the rapidly changing landscape of analytics at scale. Learn how AWS is transforming data and analytics services to lead in optimized data storage, querying, streaming, processing, and governance – for both human users and agentic infrastructure.

Breakouts

Dive into cutting-edge topics with re:Invent breakout sessions. These immersive, hour-long lectures are led by AWS experts, customers, offering you unparalleled insights and knowledge in a concise format. Whether you’re exploring the latest in cloud technology, AWS Analytics advancements, or industry-specific solutions, these sessions are designed to expand your horizon and inspire your next big idea.

Monday, Dec 1 Tuesday, Dec 2 Wednesday, Dec 3 Thursday, Dec 4
8:30 AM – 9:30 AM PST | Venetian | Level 3 | Lido 3106

ANT203 | Enabling AI innovation with Amazon SageMaker Unified Studio

11:30 AM – 12:30 PM PST | Mandalay Bay | Level 2 South | Oceanside C | Content Hub | Turquoise Theater

BIZ207 | Democratize access to insights with Amazon Quick Suite

8:30 AM – 9:30 AM PST | MGM | Level 1 | Grand 123

ANT204 | Architecting the future: Amazon SageMaker as a data and AI platform

11:00 AM – 12:00 PM PST | Mandalay Bay | Level 2 South | Oceanside C | Content Hub | Pink Theater

ANT317 | Modernize your data warehouse by moving to Amazon Redshift

8:30 AM – 9:30 AM PST | MGM | Level 3 | Chairman’s 366

ANT318 | Scaling Amazon Redshift with a multi-warehouse architecture

11:30 AM – 12:30 PM PST | Mandalay Bay | Level 2 South | Oceanside C | Content Hub | Pink Theater

ANT216 | What’s new with Amazon SageMaker in the era of unified data and AI

10:00 AM – 11:00 AM PST | Mandalay Bay | Level 2 South | Oceanside C | Content Hub | Turquoise Theater

ANT335 | Agentic data engineering with AWS Analytics MCP Servers

11:30 AM – 12:30 PM PST | Wynn | Upper Convention Promenade | Cristal 7

ANT328 | Data Processing architectures for building AI solutions

9:00 AM – 10:00 AM PST | Wynn | Convention Promenade | Lafite 7 | Content Hub | Mint Green Theater

ANT307 | Operating Apache Kafka and Apache Flink at scale

1:30 PM – 2:30 PM PST | MGM | Level 3 | Chairman’s 364

BIZ203 | Amazon’s journey deploying Quick Suite across thousands of users

10:00 AM – 11:00 AM PST | Wynn | Upper Convention Promenade | Bollinger

ANT304 | Build an AI-ready data foundation

1:00 PM – 2:00 PM PST | MGM | Level 1 | Grand 122

BIZ227 | Generate new revenue streams with Amazon Quick Sight embedded

10:00 AM – 11:00 AM PST | Wynn | Upper Convention Promenade | Bollinger

BIZ331 | Build robust data foundations to power enterprise AI and BI

1:30 PM – 2:30 PM PST | Wynn | Convention Promenade | Lafite 7 | Content Hub | Mint Green Theater

ANT206 | What’s new in Amazon Redshift and Amazon Athena

11:30 AM – 12:30 PM PST | Mandalay Bay | Level 2 South | Oceanside C | Content Hub | Turquoise Theater

ANT424 | Autonomous agents powered by streaming data and Retrieval Augmented Generation

2:00 PM – 3:00 PM PST | Mandalay Bay | Level 2 South | Oceanside C | Content Hub | Turquoise Theater

ANT343 | Best practices for building Apache Iceberg based lakehouse architectures on AWS

10:00 AM – 11:00 AM PST | Mandalay Bay | Level 2 South | Oceanside C | Content Hub | Pink Theater

ANT209 | Universal data connectivity with ETL and SQL queries

4:00 PM – 5:00 PM PST | Mandalay Bay | Level 2 South | Oceanside C | Content Hub | Turquoise Theater

ANT308 | Explore what’s new in data and AI governance with SageMaker Catalog

11:30 AM – 12:30 PM PST | Wynn | Convention Promenade | Lafite 7 | Content Hub | Pink Theater

ANT310 | Powering your Agentic AI experience with AWS Streaming and Messaging

4:00 PM – 5:00 PM PST | Mandalay Bay | Level 3 South | South Seas E

ANT344 | Build, govern, and share Amazon Quick Suite dashboards with Amazon SageMaker

10:30 AM – 11:30 AM PST | MGM | Level 1 | Grand 116

ANT314 | Build Advanced Search with Vector, Hybrid, and AI Techniques

4:30 PM – 5:30 PM PST | Mandalay Bay | Level 2 South | Oceanside C | Content Hub | Mint Green Theater

ANT305 | Innovations in AWS analytics: Data processing

2:30 PM – 3:30 PM PST | Mandalay Bay | Level 2 South | Oceanside C | Content Hub | Pink Theater

ANT315 | Intelligent Observability and Modernization with Amazon OpenSearch Service

4:00 PM – 5:00 PM PST | Wynn | Convention Promenade | Lafite 7 | Content Hub | Orange Theater

DAT445 | Deep dive into databases zero-ETL integrations

12:00 PM – 1:00 PM PST | MGM | Level 3 | Chairman’s 360

ANT336 | Enterprise-scale ETL optimization for Apache Spark

. 3:00 PM – 4:00 PM PST | MGM | Level 1 | Grand 122

ANT309 | Accelerate analytics and AI with an open and secure lakehouse architecture

.
12:00 PM – 1:00 PM PST | Mandalay Bay | Level 2 South | Oceanside C | Content Hub | Orange Theater

ANT339 | Turn unstructured data in Amazon S3 into AI-ready assets with SageMaker Catalog

. . .
1:00 PM – 2:00 PM PST | Mandalay Bay | Level 2 South | Oceanside C | Content Hub | Pink Theater

ANT201 | What’s new in search, observability, and vector databases with OpenSearch

. . .
1:30 PM – 2:30 PM PST | Wynn | Convention Promenade | Lafite 7 | Content Hub | Orange Theater

BIZ228 | Reimagine business intelligence with Amazon Quick Sight

. . .
1:30 PM – 2:30 PM PST | Mandalay Bay | Level 3 South | South Seas E

OPN413 | Transforming Apache Kafka into a Scalable Message Queue

. . .
5:30 PM – 6:30 PM PST | Mandalay Bay | Level 3 South | South Seas F

ANT423 | Amazon Kinesis Data Streams under the hood

. . .

Chalk talks

These hour-long, highly engaging sessions offer a unique blend of expert insight and collaborative learning. An AWS specialist kicks off with a concise, informative lecture, setting the stage for an in-depth, interactive Q&A. With a limited audience size, you’ll have the opportunity to dive deep into topics, ask pressing questions, and engage in meaningful discussions with both the presenter and fellow attendees.

Monday, Dec 1 Tuesday, Dec 2 Wednesday, Dec 3 Thursday, Dec 4 Friday, Dec 5
8:30 AM – 9:30 AM PST | MGM | Level 1 | Boulevard 167

ANT301-R1 | Accelerating the shift from batch to real-time streaming

11:30 AM – 12:30 PM PST | Caesars Forum | Level 1 | Academy 411

ANT302-R1 | Accelerate GenAI-powered data discovery and sharing with SageMaker Catalog

9:00 AM – 10:00 AM PST | MGM | Level 3 | Room 353

ANT301-R | Accelerating the shift from batch to real-time streaming

11:30 AM – 12:30 PM PST | MGM | Level 3 | Room 353

ANT207 | Develop with natural language and agentic AI in Amazon SageMaker Unified Studio

10:30 AM – 11:30 AM PST | Caesars Forum | Level 1 | Summit 221

ANT331 | Optimize Cost and Performance in Amazon OpenSearch Service

8:30 AM – 9:30 AM PST | Mandalay Bay | Level 2 South | Reef C

ANT347 | Build a secure and regulated data foundation for AI

11:30 AM – 12:30 PM PST | Mandalay Bay | Level 3 South | South Seas A

ANT217 | Build data pipelines in minutes with the Amazon SageMaker Visual experience

9:00 AM – 10:00 AM PST | Mandalay Bay | Level 3 South | South Seas H

ANT319-R1 | Optimizing Apache Spark workloads with AWS Analytics

12:30 PM – 1:30 PM PST | Mandalay Bay | Level 3 South | South Seas A

ANT346 | Architectural blueprints for your lakehouse in Amazon SageMaker

.
10:00 AM – 11:00 AM PST | Mandalay Bay | Level 3 South | South Seas A

ANT420-R | AI-driven scaling in Amazon Redshift Serverless

12:00 PM – 1:00 PM PST | Caesars Forum | Level 1 | Alliance 305

ANT301-R2 | Accelerating the shift from batch to real-time streaming

10:00 AM – 11:00 AM PST | MGM | Level 1 | Boulevard 158

ANT321 | Top 10 tips to improve query performance in Amazon Redshift

2:00 PM – 3:00 PM PST | MGM | Level 1 | Room 101

ANT303 | Implement data pipelines for analytics using Amazon SageMaker Unified Studio

.
10:30 AM – 11:30 AM PST | Wynn | Convention Promenade | Latour 5

ANT302-R | Accelerate GenAI-powered data discovery and sharing with SageMaker Catalog

1:00 PM – 2:00 PM PST | Mandalay Bay | Level 2 South | Lagoon G

ANT330-R | Design and build Intelligent Observability with Amazon OpenSearch Service

10:00 AM – 11:00 AM PST | Wynn | Convention Promenade | La Tache 2

ANT320 | Strengthening security for Apache Spark workloads

2:00 PM – 3:00 PM PST | Mandalay Bay | Level 3 South | South Seas J

ANT322 | Architectural patterns for real-time data analytics on AWS

.
11:30 AM – 12:30 PM PST | Mandalay Bay | Level 3 South | South Seas A

ANT338 | Bring unified analytics to your data warehouse with the lakehouse architecture

1:30 PM – 2:30 PM PST | MGM | Level 3 | Premier 320

ANT325-R1 | A deep dive into AI/ML development in SageMaker Unified Studio

11:30 AM – 12:30 PM PST | Mandalay Bay | Level 3 South | South Seas C

ANT332 | Building high-quality data products for AI Agents

3:30 PM – 4:30 PM PST | MGM | Level 1 | Room 101

ANT337 | Breaking data silos with the lakehouse architecture

.
11:30 AM – 12:30 PM PST | Wynn | Convention Promenade | Montrachet 1

BIZ323 | Design AI-powered BI architectures for modern enterprises with Amazon Quick Suite

2:30 PM – 3:30 PM PST | Mandalay Bay | Level 2 South | Lagoon G

ANT420-R1 | AI-driven scaling in Amazon Redshift Serverless

1:00 PM – 2:00 PM PST | Mandalay Bay | Level 3 South | South Seas C

ANT340 | Deep dive into data processing in SageMaker Unified Studio

. .
1:30 PM – 2:30 PM PST | MGM | Level 3 | Room 353

ANT325-R | A deep dive into AI/ML development in SageMaker Unified Studio

2:30 PM – 3:30 PM PST | Mandalay Bay | Lower Level North | South Pacific B

ANT341 | Build trust in AI with end-to-end data lineage in Amazon SageMaker Catalog

2:30 PM – 3:30 PM PST | MGM | Level 3 | Chairman’s 356

ANT345 | Building secure and scalable lakehouses for the future

. .
2:30 PM – 3:30 PM PST | Mandalay Bay | Level 3 South | South Seas A

ANT329 | Build Advanced AI-powered Search with OpenSearch MCP and Vectors

2:30 PM – 3:30 PM PST | Mandalay Bay | Level 3 South | South Seas C

BIZ327 | Bridge data silos to unlock complete insights with Amazon Quick Suite

2:30 PM – 3:30 PM PST | Mandalay Bay | Level 3 South | South Seas J

ANT413 | Upgrade Amazon DataZone to Amazon SageMaker Catalog for analytics and AI

. .
3:00 PM – 4:00 PM PST | MGM | Level 3 | Premier 320

BIZ319 | Beyond chatbots: Discover conversational AI in Amazon Quick Suite

3:00 PM – 4:00 PM PST | Wynn | Convention Promenade | Latour 5

ANT421 | Advanced Stream Processing with Apache Flink

4:00 PM – 5:00 PM PST | MGM | Level 3 | Room 350

ANT324 | Building Pipelines for Analytics, ML and AI in Amazon Sagemaker Unified Studio

. .
4:00 PM – 5:00 PM PST | MGM | Level 3 | Chairman’s 356

ANT422 | Building Resilient Multi-Tenant Messaging with Amazon SQS

4:00 PM – 5:00 PM PST | Mandalay Bay | Level 2 South | Reef C

ANT319-R | Optimizing Apache Spark workloads with AWS Analytics

4:00 PM – 5:00 PM PST | Mandalay Bay | Level 3 South | South Seas C

ANT323 | Mastering materialized views: tips for fast, low-latency queries in Redshift

. .
4:30 PM – 5:30 PM PST | Caesars Forum | Level 1 | Alliance 305

ANT330-R1 | Design and build Intelligent Observability with Amazon OpenSearch Service

5:30 PM – 6:30 PM PST | MGM | Level 3 | Room 350

ANT326 | Mastering data transformations with Amazon Athena

5:30 PM – 6:30 PM PST | MGM | Level 1 | Boulevard 167

ANT316 | Orchestrating with Apache Airflow, MWAA, and SageMaker Unified Studio

Builders’ sessions

Immerse yourself in our builders’ sessions – a hands-on learning experience designed to elevate your AWS skills. These focused, hour-long workshops bring together a small group of up to ten attendees with a dedicated AWS expert at each table.

Monday, Dec 1 Tuesday, Dec 2 Wednesday, Dec 3 Thursday, Dec 4
8:30 AM – 9:30 AM PST | Wynn | Convention Promenade | Latour 7

ANT407-R1 | Building event-driven applications with AWS Streaming and Messaging

11:30 AM – 12:30 PM PST | MGM | Level 1 | Room 104

ANT415 | Securely monetize your data with Amazon Redshift

1:00 PM – 2:00 PM PST | Mandalay Bay | Lower Level North | Islander H

ANT407-R | Building event-driven applications with AWS Streaming and Messaging

12:30 PM – 1:30 PM PST | Mandalay Bay | Level 2 South | Oceanside C | Content Hub | Builders’ Session 1

ANT409 | Getting hands on with zero-ETL and data federation

11:30 AM – 12:30 AM PST | MGM | Level 1 | Room 104

ANT410-R | Integrate and orchestrate data workflows with AWS Glue & MWAA

2:30 PM – 3:30 PM PST | MGM | Level 3 | Room 304

ANT405-R1 | Build high performance Apache Iceberg data lakes with Amazon S3 Tables

1:00 PM – 2:00 PM PST | Wynn | Convention Promenade | Latour 7

ANT406-R | Build trust in your data with Amazon SageMaker Catalog

2:00 PM – 3:00 PM PST | Mandalay Bay | Lower Level North | Islander H

ANT419-R | Vector search with Amazon OpenSearch Service

11:30 AM – 12:30 PM PST | Wynn | Convention Promenade | Latour 7

ANT406-R1 | Build trust in your data with Amazon SageMaker Catalog

4:30 PM – 5:30 PM PST | Mandalay Bay | Level 2 South | Oceanside C | Content Hub | Builders’ Session 2

ANT410-R1 | Integrate and orchestrate data workflows with AWS Glue & MWAA

4:00 PM – 5:00 PM PST | MGM | Level 3 | Room 304

ANT419-R1 | Vector search with Amazon OpenSearch Service

3:30 PM – 4:30 PM PST | Caesars Forum | Level 1 | Alliance 315

OPN407-R1 | Performance tuning for streaming Ingestion into Apache Iceberg

2:30 PM – 3:30 PM PST | MGM | Level 1 | Room 104

ANT408 | Data analytics for financial organizations with Amazon SageMaker

. . .
3:00 PM – 4:00 PM PST | Caesars Forum | Level 1 | Alliance 311

OPN407-R | Performance tuning for streaming Ingestion into Apache Iceberg

. . .
4:00 PM – 5:00 PM PST | Mandalay Bay | Lower Level North | Islander H

ANT405-R | Build high performance Apache Iceberg data lakes with Amazon S3 Tables

. . .

Workshops

Roll your sleeves in our dynamic 2-hour workshops, where you’ll tackle real-world challenges using AWS services. These interactive sessions kick off with a brief, informative lecture to set the stage, then quickly transition into hands-on problem-solving. Bring your laptop and prepare to build alongside AWS experts, who will guide you through practical applications of cloud computing concepts. Whether you’re new to AWS or looking to sharpen your skills, these workshops offer a unique opportunity to learn by doing, enabling you to leave with confidence and applicable knowledge in AWS technologies.

Monday, Dec 1 Tuesday, Dec 2 Wednesday, Dec 3 Thursday, Dec 4
8:00 AM – 10:00 AM PST | Mandalay Bay | Lower Level North | Islander C

ANT402-R1 | Build a fraud detection system with Amazon SageMaker Unified Studio

12:00 PM – 2:00 PM PST | MGM | Level 3 | Premier 317

ANT418 | Unleash Apache Kafka’s elasticity and cost-efficiency with Amazon MSK

8:30 AM – 10:30 AM PST | Mandalay Bay | Lower Level North | Islander C

ANT402-R | Build a fraud detection system with Amazon SageMaker Unified Studio

12:00 PM – 2:00 PM PST | MGM | Level 3 | Premier 317

ANT412 | Power streaming analytics on AWS with AI-driven insights

8:00 AM – 10:00 AM PST | Mandalay Bay | Level 2 South | Mandalay Bay Ballroom H

ANT403 | Building Production-Ready Data Systems for AI Applications

12:30 PM – 2:30 PM PST | MGM | Level 3 | Chairman’s 368

ANT404-R1 | Build modern data applications with the lakehouse architecture on AWS

8:30 AM – 10:30 AM PST | Caesars Forum | Level 1 | Alliance 308

BIZ204-R1 | Experience AI-powered BI with Amazon Quick Suite

3:00 PM – 5:00 PM PST | Mandalay Bay | Lower Level North | Islander C

ANT416 | Solve complex data and AI governance challenges with Amazon SageMaker Catalog

8:30 AM – 10:30 AM PST | Wynn | Upper Convention Promenade | Cristal 3

BIZ306 | Create agentic AI chat experiences with Amazon Quick Suite

3:00 PM – 5:00 PM PST | Mandalay Bay | Level 2 South | Mandalay Bay Ballroom K

ANT411 | Low-cost logging and observability with Amazon OpenSearch Service

12:30 PM – 2:30 PM PST | MGM | Level 1 | Grand 113

ANT404-R | Build modern data applications with the lakehouse architecture on AWS

.
12:00 PM – 2:00 PM PST | MGM | Level 3 | Premier 317

ANT417 | Simplifying data interoperability with the lakehouse architecture on AWS

3:00 PM – 5:00 PM PST | Wynn | Upper Convention Promenade | Cristal 1

BIZ204-R | Experience AI-powered BI with Amazon Quick Suite

3:30 PM – 5:30 PM PST | Mandalay Bay | Lower Level North | Islander C

ANT401 | Build an AI-powered enterprise search with Amazon OpenSearch service

.
3:00 PM – 5:00 PM PST | Mandalay Bay | Level 2 South | Mandalay Bay Ballroom K

ANT414 | Scale intelligent analytics with Amazon Redshift multi-cluster architectures

. . .

Lightning Talks

Located in the Expo Hall, each of these 20-minute theater presentations are dedicated to a specific customer story, service demo, or AWS Partner offering.

Monday, Dec 1 Tuesday, Dec 2 Wednesday, Dec 3 Thursday, Dec 4
5:00 PM – 5:20 PM PST | Venetian | Level 2 | Hall B | Expo | Theater 4

ANT334 | High-performance NLP & geospatial analysis with Redshift

. 3:00 PM – 3:20 PM PST | Mandalay Bay | Level 2 South | Oceanside C | Content Hub | Lightning Theater

ANT333 | Fast-track to insights: AWS-SAP data strategy

12:30 PM – 12:50 PM PST | Venetian | Level 2 | Hall B | Expo | Theater 3

ANT342 | ITTI’s Cross-Company Data Mesh Blueprint with Amazon SageMaker

6:00 PM – 6:20 PM PST | Venetian | Level 2 | Hall B | Expo | Theater 3

ANT348 | Seamless data sharing in Amazon Redshift

. . .

Conclusion

We hope this post acts as your go-to resource for navigating the AWS analytics track at re:Invent 2025. For staying in the know about the most recent trends and advancements in AWS Analytics, follow our LinkedIn page.


About the authors

Navnit Shukla

Navnit Shukla

Navnit serves as an AWS Specialist Solutions Architect with a focus on Data and AI. He possesses a strong enthusiasm for assisting clients in discovering valuable insights from their data. Through his expertise, he constructs innovative solutions that empower businesses to arrive at informed, data-driven choices. Notably, he is the author of Data Wrangling on AWS and co-author of AI-Ready Data Blueprints with O’Reilly.

Sonu Kumar Singh

Sonu Kumar Singh

Sonu is a Senior Solutions Architect with over 13 years of experience, with a specialization in Analytics and Healthcare domain. He has been instrumental in catalyzing transformative shifts in organizations by enabling data-driven decision-making thereby fueling innovation and growth. He enjoys it when something he designed or created brings a positive impact.

Scaling data governance with Amazon DataZone: Covestro success story

Post Syndicated from Jörg Janssen original https://aws.amazon.com/blogs/big-data/scaling-data-governance-with-amazon-datazone-covestro-success-story/

Covestro Deutschland AG, headquartered in Leverkusen, Germany, is a global leader in high-performance polymer materials and components. Since its spin-off from Bayer AG in 2015, Covestro has established itself as a key player in the chemical industry, with 48 production sites worldwide, €14.4 billion 2023 revenue, and 17,500 employees. Covestro’s core business focuses on developing innovative, sustainable solutions for products used in various aspects of daily life. The company offers materials for mobility, building and living, electrical and electronics sectors, in addition to sports and leisure, cosmetics, health, and the chemical industry. The company’s products, such as polycarbonates, polyurethanes, coatings, adhesives, and specialty elastomers, are important components in automotive, construction, electronics, and medical device industries.

To support this global operation and diverse product portfolio, Covestro adopted a robust data management solution. In this post, we show you how Covestro transformed its data architecture by implementing Amazon DataZone and AWS Serverless Data Lake Framework (SDLF), transitioning from a centralized data lake to a data mesh architecture. Through this strategic shift, teams can share and consume data while maintaining high quality standards through a consolidated data marketplace and business metadata glossary. The result: streamlined data access, better data quality, and stronger governance at scale that various producer and consumer teams can use to run data and analytics workloads at scale, enabling over 1,000 data pipelines and achieving a 70% reduction in time-to-market.

Business and data challenges

Prior to their transformation, Covestro operated with a centralized data lake managed by a single data platform team that handled the data engineering tasks. This centralized approach created several challenges: bottlenecks in project delivery because of limited engineering resources, complicated prioritization of use cases, and inefficient data sharing processes. The setup often resulted in unnecessary data duplication, which in turn slowed down time-to-market for new analytics initiatives, increased costs, and limited the ability of business units to act quickly on insights.The lack of visibility into data assets created significant operational challenges:

  • Teams could not find existing datasets, often recreating data already stored elsewhere
  • No clear understanding of data lineage or quality metrics
  • Difficulty in determining who owned specific data assets or who to contact for access
  • Absence of metadata and documentation about available datasets
  • Departments shared little knowledge about how they were using data

These visibility issues, combined with the lack of unified access controls, led to:

  • Siloed data initiatives across departments
  • Reduced trust in data quality
  • Inefficient use of resources
  • Delayed project timelines
  • Missed opportunities for cross-functional collaboration and insights

A strategic solution: Why Amazon DataZone and SDLF?

The challenges Covestro faced reflect deeper structural limitations of centralized data architectures. As Covestro scaled, central data teams often became bottlenecks, and lack of domain context led to fragmented quality, inconsistent standards, and poor collaboration. Instead of centralizing control, a data mesh gives ownership to the teams who generate and understand the data, while keeping the governance and interoperability consistent across the organization. This makes it well-suited for Covestro’s environment, which requires agility, scalability, and cross-team collaboration.

AWS Serverless Data Lake Framework (SDLF) is a solution to these challenges, providing a robust foundation for data mesh architectures. Traditional data lake implementations often centralize data ownership and governance, but with the flexible design of SDLF, organizations can build decentralized data domains that align with modern data mesh principles. The framework provides domain-oriented teams with the infrastructure, security controls, and operational patterns needed to own and manage their data products independently, while maintaining consistent governance across the organization. Through its modular architecture and infrastructure as code templates, SDLF accelerates the creation of domain-specific data products, so that Covestro’s teams can deploy standardized yet customizable data pipelines. This approach supports the key pillars of data mesh: domain-oriented decentralization, data as a product, self-serve infrastructure, and federated governance, providing Covestro with a practical path to overcome the limitations of traditional centralized architectures.

Amazon DataZone enhances the data mesh implementation through a unified experience for discovering and accessing data across decentralized domains. As a data management service, Amazon DataZone helps organizations catalog, discover, share, and govern data across organizational boundaries. It provides a central governance layer where organizations can establish data sharing agreements, manage access controls, and enable self-service data access while supporting security and compliance. While teams can use the SDLF framework to build and operate domain-specific data products, Amazon DataZone complements it with a searchable catalog enriched with metadata, business context, and usage policies, making data products easier to find, trust, and reuse.

Through the sharing capabilities of Amazon DataZone, domain teams can share their data products with other domains while maintaining granular access controls and governance policies, enabling cross-domain collaboration and data reuse. This integration means that domain teams can publish their SDLF-managed datasets to an Amazon DataZone catalog, so authorized consumers across the organization can discover and access them. Through the built-in governance capabilities built into Amazon DataZone, organizations can implement standardized data sharing workflows, check data quality, and enforce consistent access controls across their distributed data system, strengthening their data mesh architecture with robust data governance and democratization capabilities.Together, SDLF and Amazon DataZone provide Covestro with a comprehensive solution for implementing a modern data mesh architecture, enabling autonomous data domains to operate with consistent governance, seamless data sharing, and enterprise-wide data discovery.

Solution architecture and implementation

The following architecture illustrates the high-level design of the data mesh solution. The implementation used a comprehensive AWS solution built on AWS services to create a robust, scalable, and governed data mesh that serves multiple business domains across the Covestro organization.

Data domain foundation: Serverless Data Lake Framework

A key pillar of the implementation is the Serverless Data Lake Framework (SDLF), which provides the foundational infrastructure and security needed to support data mesh strategies. SDLF delivers the core building blocks for data domains such as Amazon S3 storage layers, built-in encryption with AWS KMS, IAM-based access control, and infrastructure as code (IaC) automation. By using these components, Covestro can deploy decentralized, domain-owned data products rapidly while maintaining consistent governance across the enterprise.

The framework uses Amazon Simple Storage Service (Amazon S3) as the primary data storage layer, delivering virtually unlimited scalability and eleven nines of durability for diverse data assets. The proposed S3 bucket architecture follows AWS Well-Architected principles, implementing a multi-tiered structure with distinct raw, staging, and analytics data zones. This layered approach helps different business domains to maintain data sovereignty (each domain owns and controls its data, while keeping accessibility patterns organization-wide).

Security is a fundamental aspect in Covestro’s data mesh implementation. SDLF automatically implements encryption at rest and in transit across data storage and processing components. AWS Key Management Service (AWS KMS) provides centralized key management, while carefully crafted AWS Identity and Access Management (IAM) roles enable resource isolation.

Data processing with AWS Glue

AWS Glue serves as the cornerstone of the data processing and transformation capabilities, offering serverless extract, transform, and load ETL services that automatically scale based on workload demands.

Covestro’s pre-existent centralized data lake was fed by more than 1,000 ingestion data pipelines interacting with a variety of source systems. To support the migration of existing ingestion and processing pipelines, Covestro developed reusable blueprints that included the development and security standards defined for the data mesh.Covestro released standardized patterns that teams can deploy across multiple domains while providing the flexibility needed for domain-specific requirements. These blueprints support diverse source systems, from traditional databases like Oracle, SQL Server, and MySQL to modern software as a service (SaaS) applications such as SAP C4C.

They also developed specialized blueprints for processing, standardizing, and cleaning ingested raw data. These blueprints store processed data in Apache Iceberg format, automatically saving metadata in the AWS Glue Data Catalog and providing built-in capabilities to handle schema evolution seamlessly.

Covestro relies on SDLF to quickly configure and deploy the blueprints as AWS Glue jobs inside the domain. With SDLF, teams deploy a data pipeline through a YAML configuration file, and the orchestration and management mechanisms of SDLF handle the rest. The solution includes comprehensive monitoring capabilities built on Amazon DynamoDB, providing real-time visibility into data pipeline health and performance metrics (when teams deploy a pipeline through SDLF, the system automatically integrates it with the monitoring setup).

Data quality with AWS Glue Data Quality

To achieve data reliability across domains, Covestro extended the capabilities of SDLF to incorporate AWS Glue Data Quality into data processing pipelines. This integration enables automated data quality checks as part of the standard data processing workflow. Thanks to the configuration-driven design of SDLF, data producers can implement quality controls either using recommended rules, which are automatically generated through data profiling, or applying their own domain-specific rules.

The integration provides data teams with the flexibility to define quality expectations while maintaining consistency in how quality checks are implemented at the pipeline level. The solution logs quality evaluation results, providing visibility into the data quality metrics for each data product. These elements are illustrated in the following figure.

Enterprise-ready access control with AWS Lake Formation

AWS Lake Formation integration with the Data Catalog supports the security and access control layer that makes the data mesh implementation enterprise-ready. Through Lake Formation, Covestro implemented fine-grained access controls that respect domain boundaries while enabling controlled cross-domain data sharing.

The service’s integration with IAM means that Covestro can implement role-based access patterns that align with their organizational structure, so users can access the data they need while keeping appropriate security boundaries.

Data democratization with Amazon DataZone

Amazon DataZone functions as the heart of the data mesh implementation. Deployed in a dedicated AWS account, it provides the data governance, discovery, and sharing capabilities that were missing in the previous centralized approach. DataZone offers a unified, searchable catalog enriched with business context, automated access controls, and standardized sharing workflows that enable true data democratization across the organization.

Through Amazon DataZone, Covestro established a comprehensive data catalog that helps business users across different domains to discover, understand, and request access to data assets without requiring deep technical expertise. The business glossary functionality supports consistent data definitions across domains, eliminating the confusion that often arises when different teams use different terminology for the same concepts.

Data product owners can use the integration of Amazon DataZone integration with AWS Lake Formation to grant or revoke cross-domain access to data, streamlining the data sharing process while supporting security and compliance requirements.

Managing cross-domain data pipeline dependencies

When implementing Covestro’s data mesh architecture on AWS, one of the most significant challenges was orchestrating data pipelines across multiple domains. The core question to address was “How can Data Domain A determine when a required dataset from Data Domain B has been refreshed and is ready for consumption?”.

In a data mesh architecture, domains maintain ownership of their data products while enabling consumption by other domains. This distributed model creates complex dependency chains where downstream pipelines must wait for upstream data products to complete processing before execution can begin.

To address this cross-domain dependency coordination, Covestro extended the SDLF with a custom dependency checker component that operates through both shared and domain-specific elements.

The shared components consist of two centralized Amazon DynamoDB tables located in a hub AWS account: one collecting successful pipeline execution logs from the domains, and another aggregating pipeline dependencies across the entire data mesh.

These domains deploy local components such as a dependency-tracking Amazon DynamoDB table and an AWS Step Functions state machine. The state machine checks prerequisites using centralized execution logs and integrates seamlessly as the first step in every SDLF-deployed pipeline, without additional configuration. The following diagram shows the process described.

To prevent circular dependencies that could create locks in the distributed orchestration system, Covestro implemented a sophisticated detection mechanism using Amazon Neptune. DynamoDB Streams automatically replicate dependency changes from domain tables to the central registry, triggering an AWS Lambda function that uses the Gremlin graph traversal language (using pygremlin) to construct, update, and analyze a directed acyclic graph (DAG) of the pipeline relationships, with native Gremlin functions detecting circular dependencies and sending automated notifications, as illustrated in the following diagram. This process continuously updates the graph to reflect any new pipeline dependencies or changes across the data mesh.

Operational excellence through infrastructure as code

Infrastructure as code (IaC) practices using AWS CloudFormation and the AWS Cloud Development Kit (AWS CDK) significantly improve the operational efficiency of the data mesh implementation. The infrastructure code is version-controlled in GitHub repositories, providing complete traceability and collaboration capabilities for data engineering teams. This approach uses a dedicated deployment account that uses AWS CodePipeline to orchestrate consistent deployments across multiple data mesh domains.

The centralized deployment model supports that infrastructure changes follow a standardized continuous integration and deployment (CI/CD) process, where code commits trigger automated pipelines that validate, test, and deploy infrastructure components to the appropriate domain accounts. Each data domain resides in its own separate set of AWS accounts (dev, qa, prod), and the centralized deployment pipeline respects these boundaries while enabling controlled infrastructure provisioning.

IaC enables the data mesh to scale horizontally when onboarding new domains, supporting the maintenance of consistent security, governance, and operational standards across the entire environment. Covestro provisions new domains quickly using proven templates, accelerating time-to-value for business teams.

Business impact and technical outcomes

The implementation of the data mesh architecture using Amazon DataZone and SDLF has delivered significant measurable benefits across Covestro’s organization:

Accelerated data pipeline development

  • 70% reduction in time-to-market for new data products through standardized blueprints
  • Successful migration of more than 1,000 data pipelines to the new architecture
  • Automated pipeline creation without manual coding requirements
  • Standardized approach and sharing across domains

Enhanced data governance and quality

  • Comprehensive business glossary implementation that supports consistent terminology
  • Automated data quality checks integrated into pipelines
  • End-to-end data lineage visibility across domains
  • Standardized metadata management through Apache Iceberg integration

Improved data discovery and access

  • Self-service data discovery portal through Amazon DataZone
  • Streamlined cross-domain data sharing with appropriate security controls
  • Reduced data duplication through improved visibility of existing assets
  • Efficient management of cross-domain pipeline dependencies

Operational efficiency

  • Decreased central data team bottlenecks through domain-oriented ownership
  • Reduced operational overhead through automated deployment processes
  • Improved resource utilization through elimination of redundant data processing
  • Enhanced monitoring and troubleshooting capabilities

The new infrastructure has fundamentally transformed how Covestro’s teams interact with data, enabling business domains to operate autonomously while upholding enterprise-wide standards for quality and governance. This has created a more agile, efficient, and collaborative data ecosystem that supports both current needs and future growth.

What’s next

As Covestro’s data platform continues to evolve, the focus is now to support domain teams to effectively built data products for cross domain analytics. In parallel, Covestro is actively working to improve data transparency with data lineage in Amazon DataZone through OpenLineage to support more comprehensive data traceability across a diverse set of processing tools and formats.

Conclusion

In this post, we showed you how Covestro transformed its data architecture transitioning from a centralized data lake to a data mesh architecture, and how this foundation will prove invaluable in supporting their journey toward becoming a more data-driven organization. Their experience demonstrates how modern data architectures, when properly implemented with the right tools and frameworks, can transform business operations and unlock new opportunities for innovation.

This implementation serves as a blueprint for other enterprises looking to modernize their data infrastructure while maintaining security, governance, and scalability. It shows that with careful planning and the right technology choices, organizations can successfully transition from centralized to distributed data architectures without compromising on control or quality.

For more on Amazon DataZone, see the Getting Started guide. To learn about the SDLF, see Deploy and manage a serverless data lake on the AWS Cloud by using infrastructure as code.


About the authors

Jörg Janssen

Jörg Janssen

Jörg serves as the Product Owner of the Covestro Data and Analytics Platform within Covestro IT & D. In this role, he bridges business requirements and technical execution, enabling data-driven innovation across the organization. With a strong background in chemistry and decades of IT experience, he plays a key role in advancing Covestro’s data strategy—empowering business units to develop innovative solutions in chemical manufacturing and operations, while ensuring effective data governance and stewardship.

Mousam Majhi

Mousam Majhi

Mousam is a Senior ProServe Cloud Architect focusing on Data & AI within AWS Professional Services. He works with Manufacturing and Travel, Transportation, and Logistics customers in DACH to achieve their business outcomes by leveraging data and AI powered solutions. Outside of work, Mousam enjoys hiking in the Bavarian Alps.

Giuseppe Perillo

Giuseppe Perillo

Giuseppe is a Data Architect at AWS Professional Services, specializing in data governance and the design of reliable, enterprise-grade data platforms. With a strong foundation in dimensional modeling, data warehousing, integration, and data quality, he helps customers build structured, trusted, and analytics-ready data environments.

Maddyzeth Ariza

Maddyzeth Ariza

Maddyzeth is a Data Architect at AWS Professional Services. She designs and implements scalable, cloud data solutions that support enterprise analytics, machine learning, and real-time processing. She specializes in serverless data architectures on AWS, including data lakes, data mesh, and enterprise-wide data governance. Outside of work, she enjoys exploring historic cities and landmarks across Europe.

Unlock real-time data insights with schema evolution using Amazon MSK Serverless, Iceberg, and AWS Glue streaming

Post Syndicated from Nitin Kumar original https://aws.amazon.com/blogs/big-data/unlock-real-time-data-insights-with-schema-evolution-using-amazon-msk-serverless-iceberg-and-aws-glue-streaming/

Efficient real-time synchronization of data within data lakes present challenges. Any data inaccuracies or latency issues can significantly compromise analytical insights and subsequent business strategies. Organizations increasingly require synchronized data in near real-time to extract actionable intelligence and respond promptly to evolving market dynamics. Additionally, scalability remains a concern for data lake implementations, which must accommodate expanding volumes of streaming data and maintain optimal performance without incurring high operational costs.

Schema evolution is the process of modifying the structure (schema) of a data table to accommodate changes in the data over time, such as adding or removing columns, without disrupting ongoing operations or requiring a complete data rewrite. Schema evolution is vital in streaming data environments for several reasons. Unlike batch processing, streaming pipelines operate continuously, ingesting data in real time from sources that are actively serving production applications. Source systems naturally evolve over time as businesses add new features, refine data models, or respond to changing requirements. Without proper schema evolution capabilities, even minor changes to source schemas can force streaming pipeline shutdowns, requiring developers to manually reconcile schema differences and rebuild tables.

Such disruptions reduce the core value proposition of streaming architectures—continuous, low-latency data processing. Organizations can maintain uninterrupted data flows and keep source systems evolving independently by using the seamless schema evolution provided by Apache Iceberg. This reduces operational friction and maintains the availability of real-time analytics and applications even as underlying data structures change.

Apache Iceberg is an open table format, delivering essential capabilities for streaming workloads, including robust schema evolution support. This critical feature enables table schemas to adapt dynamically as source database structures evolve, maintaining operational continuity. Consequently, when database columns undergo additions, removals, or modifications, the data lake accommodates these changes seamlessly without requiring manual intervention or risking data inconsistencies.

Our comprehensive solution showcases an end-to-end real-time CDC pipeline that enables immediate processing of data modifications from Amazon Relational Database Service (Amazon RDS) for MySQL, streaming altered records directly to AWS Glue streaming jobs using Amazon Managed Streaming for Apache Kafka (Amazon MSK) Serverless. These jobs continually process incoming changes and update Iceberg tables on Amazon Simple Storage Service (Amazon S3) so that the data lake reflects the current state of the operational database environment in real time. By using Apache Iceberg’s comprehensive schema evolution support, our ETL pipeline automatically adapts to database schema modifications, providing data lake consistency and currentness without manual intervention. This approach combines complete process control with instantaneous analytics on operational data, eliminating traditional latency, and future-proofs the solution to address evolving organizational data needs. The architecture’s inherent flexibility facilitates adaptation to diverse use cases requiring immediate data insights.

Solution overview

To effectively address streaming challenges, we propose an architecture using Amazon MSK Serverless, a comprehensive managed Apache Kafka service that autonomously provisions and scales computational and storage resources. This solution offers a frictionless mechanism for ingesting and processing streaming data without the complexity of capacity management. Our implementation uses Amazon MSK Connect with the Debezium MySQL connector to capture and stream database modifications in real time. Rather than employing traditional batch processing methodologies, we implement an AWS Glue streaming job that directly consumes data from Kafka topics, processes CDC events as they occur, and writes transformed data to Apache Iceberg tables on Amazon S3.

The workflow consists of the following:

  1. Data flows from Amazon RDS through Amazon MSK Connect using the Debezium MySQL connector to Amazon MSK Serverless. This represents a CDC pipeline that captures database changes from the relational database and streams them to Kafka.
  2. From Amazon MSK Serverless, the data then moves to AWS Glue job, which processes the data and stores it in Amazon S3 as Iceberg tables. The AWS Glue job interacts with the AWS Glue Data Catalog to maintain metadata about the datasets.
  3. Analyze the data using the serverless interactive query service Amazon Athena, which can be used to query the iceberg table created in Data Catalog. This allows for interactive data analysis without managing infrastructure.

The following diagram illustrates the architecture that we implement through this post. Each number corresponds to the preceding list and shows major components that you implement.

Prerequisites

Before getting started, make sure you have the following:

  • An active AWS account with billing enabled
  • An AWS Identity and Access Management (IAM) user with specific permissions to create and manage resources, such as a virtual private cloud (VPC), subnet, security group, IAM roles, NAT gateway, internet gateway, Amazon Elastic Compute Cloud (Amazon EC2) client, MSK Serverless, MSK Connector and its plugin AWS Glue job, and S3 buckets.
  • Sufficient VPC capacity in your chosen AWS Region.

For this post, we create the solution resources in the US East (N. Virginia) – us-east-1 Region using AWS CloudFormation templates. In the following sections, we show you how to configure your resources and implement the solution.

Configuring CDC and processing using AWS CloudFormation

In this post, you use the CloudFormation template vpc-msk-mskconnect-rds-client-gluejob.yaml. This template sets up the streaming CDC pipeline resources such as a VPC, subnet, security group, IAM roles, NAT, internet gateway, EC2 client, MSK Serverless, MSK Connect, Amazon RDS, S3 buckets, and AWS Glue job.

To create the solution resources for the CDC pipeline, complete the following steps:

  1. Launch the stack vpc-msk-mskconnect-rds-client-gluejob.yaml using the CloudFormation template:
  2. Provide the parameter values as listed in the following table.

    A B C
    1 Parameters Description Sample value
    2 EnvironmentName An environment name that is prefixed to resource names. msk-iceberg-cdc-pipeline
    3 DatabasePassword Database admin account password. ****
    4 InstanceType MSK client EC2 instance type. t2.micro
    5 LatestAmiId Latest AMI ID of Amazon Linux 3 for ec2 instance. You can use the default value. /aws/service/ami-amazon-linux-latest/al2023-ami-kernel-default-x86_64
    6 VpcCIDR IP range (CIDR notation) for this VPC. 10.192.0.0/16
    7 PublicSubnet1CIDR IP range (CIDR notation) for the public subnet in the first Availability Zone. 10.192.10.0/24
    8 PublicSubnet2CIDR IP range (CIDR notation) for the public subnet in the second Availability Zone. 10.192.11.0/24
    9 PrivateSubnet1CIDR IP range (CIDR notation) for the private subnet in the first Availability Zone. 10.192.20.0/24
    10 PrivateSubnet2CIDR IP range (CIDR notation) for the private subnet in the second Availability Zone. 10.192.21.0/24
    11 NumberOfWorkers Number of workers for AWS Glue streaming job. 3
    12 GlueWorkerType Worker type for AWS Glue streaming job. For example, G.1X. G.1X
    13 GlueDatabaseName Name of the AWS Glue Data Catalog database. glue_cdc_blogdb
    14 GlueTableName Name of the AWS Glue Data Catalog table. iceberg_cdc_tbl

The stack creation process can take approximately 25 minutes to complete. You can check the Outputs tab for the stack after the stack is created, as shown in the following screenshot.

Following the successful deployment of the CloudFormation stack, you now have a fully operational Amazon RDS database environment. The database instance contains the salesdb database with the customer table populated with 30 data records.

These records have been streamed to the Kafka topic through the Debezium MySQL connector implementation, establishing a reliable CDC pipeline. With this foundation in place, proceed to the next phase of the data architecture: near real-time data processing using the AWS Glue streaming job.

Run the AWS Glue streaming job

To transfer the data load from the Kafka topic (created by the Debezium MySQL connector for database table customer) to the Iceberg table, run the AWS Glue streaming job configured by the CloudFormation setup. This process will migrate all existing customer data from the source database table to the Iceberg table. Complete the following steps:

  1. On the CloudFormation console, choose the stack vpc-msk-mskconnect-rds-client-gluejob.yaml
  2. On the Outputs tab, retrieve the name of the AWS Glue streaming job from the GlueJobName row. In the following screenshot, the name is IcebergCDC-msk-iceberg-cdc-pipeline.
  3. On the AWS Glue console, choose ETL jobs in the navigation pane.
  4. Search for the AWS Glue job named IcebergCDC-msk-iceberg-cdc-pipeline.
  5. Choose the job name to open its details page.
  6. Choose Run to start the job. On the Runs tab, confirm if the job ran without failure.

You need to wait approximately 2 minutes for the job to process before continuing. This pause allows the jobrun to fully process records from the Kafka topic (initial load) and create the Iceberg table.

Query the Iceberg table using Athena

After the AWS Glue streaming job has successfully started and the Iceberg table has been created in the Data Catalog, follow these steps to validate the data using Athena:

  1. On the Athena console, navigate to the query editor.
  2. Choose the Data Catalog as the data source.
  3. Choose the database glue_cdc_blogdb.
  4. To validate the data, enter the following query to preview the data and find the total count:
    SELECT id, name, mktsegment FROM "glue_cdc_blogdb"."iceberg_cdc_tbl" order by id desc limit 40;
    
    SELECT count(*) as total_rows FROM "glue_cdc_blogdb"."iceberg_cdc_tbl";

    The following screenshot shows the output of the example query.

After performing the preceding steps, you’ve established a complete near real-time data processing pipeline by running an AWS Glue streaming job that transfers data from Kafka topics to an Apache Iceberg table, then verified the successful data migration by querying the results through Amazon Athena.

Upload incremental (CDC) data for further processing

Now that you’ve successfully completed the initial full data load, it’s time to focus on the dynamic aspects of the data pipeline. In this section, we explore how the system handles ongoing data modifications such as insertions, updates, and deletions in Amazon RDS for MySQL database. These changes won’t go unnoticed. Our Debezium MySQL connector stands ready to capture each modification event, transforming database changes into a continuous stream of data. Working in tandem with our AWS Glue streaming job, this architecture is designed to promptly process and propagate every change in our source database through our data pipeline.Let’s see this real-time data synchronization mechanism in action, demonstrating how our modern data infrastructure maintains consistency across systems with minimal latency. Follow these steps:

  1. On the Amazon EC2 console, access the EC2 instance that you created using the CloudFormation template named as KafkaClientInstance.
  2. Log in to the EC2 instance using AWS Systems Manager Agent (SSM Agent). Select the instance named as KafkaClientInstance and then choose Connect.
  3. Enter the following commands to insert the data into the RDS table. Use the same database password you entered when you created the CloudFormation stack.
    sudo su - ec2-user
    RDS_AURORA_ENDPOINT=`aws rds describe-db-instances --region us-east-1 | jq -r '.DBInstances[] | select(.DBName == "salesdb") | .Endpoint.Address'`
    mysql -f -u master -h $RDS_AURORA_ENDPOINT  --password

  4. Now perform the insert, update, and delete in the CUSTOMER table.
    use salesdb;
    
    INSERT INTO customer VALUES(31, 'Customer Name 31', 'Market segment 31');
    INSERT INTO customer VALUES(32, 'Customer Name 32', 'Market segment 32');
    
    UPDATE customer SET name='Customer Name update 29', mktsegment='Market segment update 29' WHERE id = 29;
    UPDATE customer SET name='Customer Name update 30', mktsegment='Market segment update 30' WHERE id = 30;
    
    DELETE FROM customer WHERE id = 27;
    DELETE FROM customer WHERE id = 28;
    

  5. Validate the data to verify the insert, update, and delete records in the Iceberg table from Athena, as shown in the following screenshot.

After performing the preceding steps, you’ve learned how our CDC pipeline handles ongoing data modifications by performing insertions, updates, and deletions in the MySQL database and verifying how these changes are automatically captured by Debezium MySQL connector, streamed through Kafka, and reflected in the Iceberg table in near real time.

Schema evolution: Adding new columns to the Iceberg table

The schema evolution mechanism in this implementation provides an automated approach to detecting and adding new columns from incoming data to existing Iceberg tables. Although Iceberg inherently supports robust schema evolution capabilities (including adding, dropping, and renaming columns, updating types, and reordering), this code specifically automates the column addition process for streaming environments. This automation uses Iceberg’s underlying schema evolution capabilities, which guarantee correctness through unique column IDs that ensure new columns never read existing values from another column. By handling column additions programmatically, the system reduces operational overhead in streaming pipelines where manual schema management would create bottlenecks. However, dropping and renaming columns, updating types, and reordering still required manual intervention.

When new data arrives through Kafka streams, the handle_schema_evolution() function orchestrates a four-step process to ensure seamless table schema updates.

  1. It analyzes the incoming batch DataFrame to infer its schema structure, cataloging all column names and their corresponding data types.
  2. It retrieves the existing Iceberg table’s schema from the AWS Glue catalog to establish a baseline for comparison.
  3. The system then performs a schema comparison using method compare_schemas() between batch schema with existing table schema.
    1. If the incoming frame contains fewer columns than the catalog table, no action is taken.
    2. It identifies any new columns present in the incoming data that don’t exist in the current table structure and returns a list of new columns that need to be added.
    3. New columns will be added at the last.
    4. Handle type evolution isn’t supported. If needed, you can handle the same at comment # Handle type evolution in the compare_schemas() method.
    5. If the destination table has columns that are dropped in the source table, it doesn’t drop those columns. If that is required for your use case, you can use drop column manually using ALTER TABLE ... DROP COLUMN.
    6. Renaming the column isn’t supported. To rename the column use case, manually evolve the schema using ALTER TABLE … RENAME COLUMN.
  4. Finally, if new columns are discovered, the function executes ALTER TABLE … ADD COLUMN statements to evolve the Iceberg table schema, adding the new columns with their appropriate data types.

This approach eliminates the need for manual schema management and prevents data pipeline failures that would typically occur when encountering unexpected fields in streaming data. The implementation also includes proper error handling and logging to track schema evolution events, making it particularly valuable for environments where data structures frequently change.

def infer_schema_from_batch(batch_df):
    """
    Infer schema from the batch DataFrame
    Returns a dictionary with column names and their inferred types
    """
    schema_dict = {}
    for field in batch_df.schema.fields:
        schema_dict[field.name] = field.dataType
    return schema_dict

def get_existing_table_schema(spark, table_identifier):
    """
    Read the existing table schema from the Iceberg table
    Returns a dictionary with column names and their types
    """
    try:
        existing_df = spark.table(table_identifier)
        schema_dict = {}
        for field in existing_df.schema.fields:
            schema_dict[field.name] = field.dataType
        return schema_dict
    except Exception as e:
        print(f"Error reading existing table schema: {e}")
        return {}

def compare_schemas(batch_schema, existing_schema):
    """
    Compare batch schema with existing table schema
    Returns a list of new columns that need to be added
    """
    new_columns = []
    
    for col_name, col_type in batch_schema.items():
        if col_name not in existing_schema:
            new_columns.append((col_name, col_type))
        elif existing_schema[col_name] != col_type:
            # Handle type evolution if needed
            print(f"Warning: Column {col_name} type mismatch - existing: {existing_schema[col_name]}, new: {col_type}")
    
    return new_columns

def spark_type_to_sql_string(spark_type):
    """
    Convert Spark DataType to SQL string representation for ALTER TABLE
    """
    type_mapping = {
        'IntegerType': 'INT',
        'LongType': 'BIGINT',
        'StringType': 'STRING',
        'BooleanType': 'BOOLEAN',
        'DoubleType': 'DOUBLE',
        'FloatType': 'FLOAT',
        'TimestampType': 'TIMESTAMP',
        'DateType': 'DATE'
    }
    
    type_name = type(spark_type).__name__
    return type_mapping.get(type_name, 'STRING')

def evolve_table_schema(spark, table_identifier, new_columns):
    """
    Alter the Iceberg table to add new columns
    """
    if not new_columns:
        return
    
    try:
        for col_name, col_type in new_columns:
            sql_type = spark_type_to_sql_string(col_type)
            alter_sql = f"ALTER TABLE {table_identifier} ADD COLUMN {col_name} {sql_type}"
            print(f"Executing schema evolution: {alter_sql}")
            spark.sql(alter_sql)
            print(f"Successfully added column {col_name} with type {sql_type}")
    except Exception as e:
        print(f"Error during schema evolution: {e}")
        raise e

def handle_schema_evolution(spark, batch_df, table_identifier):
    """
    schema evolution steps
    1. Infer schema from batch DataFrame
    2. Read existing table schema
    3. Compare schemas and identify new columns
    4. Alter table if schema evolved
    """
    # Step 1: Infer schema from batch DataFrame
    batch_schema = infer_schema_from_batch(batch_df)
    print(f"Batch schema: {batch_schema}")
    
    # Step 2: Read existing table schema
    existing_schema = get_existing_table_schema(spark, table_identifier)
    print(f"Existing table schema: {existing_schema}")
    
    # Step 3: Compare schemas
    new_columns = compare_schemas(batch_schema, existing_schema)
    
    # Step 4: Evolve schema if needed
    if new_columns:
        print(f"Schema evolution detected. New columns: {new_columns}")
        evolve_table_schema(spark, table_identifier, new_columns)
        return True
    else:
        print("No schema evolution needed")
        return False

In this section, we demonstrate how our system handles structural changes to the underlying data model by adding a new status column to the customer table and populating it with default values. Our architecture is designed to seamlessly propagate these schema modifications throughout the pipeline so that downstream analytics and processing capabilities remain uninterrupted while accommodating the enhanced data model. This flexibility is essential for maintaining a responsive, business-aligned data infrastructure that can evolve alongside changing organizational needs.

  1. Add a new status column to the customer table and populate it with default values as Green.
    use salesdb;
    
    ALTER TABLE customer ADD COLUMN status VARCHAR(20) NOT NULL;
    
    UPDATE customer SET status = 'Green';
    

  2. Use the Athena console to validate the data and schema evolution, as shown in the following screenshot.

When schema evolution occurs in an Iceberg table, the metadata.json file undergoes specific updates to track and manage these changes. In job when schema evolution detected, it ran the following query to evolve the schema for the Iceberg table.

ALTER TABLE glue_catalog.glue_cdc_blogdb.iceberg_cdc_tbl ADD COLUMN status string

We checked the metadata.json file in Amazon S3 for iceberg table location, and the following screenshot shows how the schema evolved.

We now explain how our implementation handles schema evolution by automatically detecting and adding new columns from incoming data streams to existing Iceberg tables. The system employs a four-step process that analyzes incoming data schemas, compares them with existing table structures, identifies new columns, and executes the necessary ALTER TABLE statements to evolve the schema without manual intervention, though certain schema changes still require manual handling.

Clean up

To clean up your resources, complete the following steps:

  1. Stop the running AWS Glue streaming job:
    1. On the AWS Glue console, choose ETL jobs in the navigation pane.
    2. Search for the AWS Glue job named IcebergCDC-msk-iceberg-cdc-pipeline.
    3. Choose the job name to open its details page.
    4. On the Runs tab, select running jobrun and choose Stop job run. Confirm that the job stopped successfully.
  2. Remove the AWS Glue database and table:
    1. On the AWS Glue console, choose Tables in the navigation pane, select iceberg_cdc_tbl, and choose Delete.
    2. Choose Databases in the navigation pane, select glue_cdc_blogdb, and choose Delete.
  3. Delete the CloudFormation stack vpc-msk-mskconnect-rds-client-gluejob.yaml.

Conclusion

This post showcases a solution that businesses can use to access real-time data insights without the traditional delays between data creation and analysis. By combining Amazon MSK Serverless, Debezium MySQL connector, AWS Glue streaming, and Apache Iceberg tables, the architecture captures database changes instantly and makes them immediately available for analytics through Amazon Athena. A standout feature is the system’s ability to automatically adapt when database structures change—such as adding new columns—without disrupting operations or requiring manual intervention. This eliminates the technical complexity typically associated with real-time data pipelines and provides business users with the most current information for decision-making, effectively bridging the gap between operational databases and analytical systems in a cost-effective, scalable way.


About the Authors

Nitin Kumar

Nitin Kumar

Nitin is a Cloud Engineer (ETL) at AWS, specializing in AWS Glue. With a decade of experience, he excels in aiding customers with their big data workloads, focusing on data processing and analytics. In his free time, he likes to watch movies and spend time with his family.

Shubham Purwar

Shubham Purwar

Shubham is an Analytics Specialist Solutions Architect at AWS. He helps organizations unlock the full potential of their data by designing and implementing scalable, secure, and high-performance analytics solutions on AWS. In his free time, Shubham loves to spend time with his family and travel around the world.

Noritaka Sekiyama

Noritaka Sekiyama

Noritaka is a Principal Big Data Architect on the AWS Glue team. He works based in Tokyo, Japan. He is responsible for building software artifacts to help customers. In his spare time, he enjoys cycling with his road bike.

Stifel’s approach to scalable Data Pipeline Orchestration in Data Mesh

Post Syndicated from Srinivas Kandi, Hossein Johari, Ahmad Rawashdeh, Lei Meng original https://aws.amazon.com/blogs/big-data/stifels-approach-to-scalable-data-pipeline-orchestration-in-data-mesh/

This is a guest post by Hossein Johari, Lead and Senior Architect at Stifel Financial Corp, Srinivas Kandi and Ahmad Rawashdeh, Senior Architects at Stifel, in partnership with AWS.

Stifel Financial Corp, a diversified financial services holding company is expanding its data landscape that requires an orchestration solution capable of managing increasingly complex data pipeline operations across multiple business domains. Traditional time-based scheduling systems fall short in addressing the dynamic interdependencies between data products, requires event-driven orchestration. Key challenges include coordinating cross-domain dependencies, maintaining data consistency across business units, meeting stringent SLAs, and scaling effectively as data volumes grow. Without a flexible orchestration solution, these issues can lead to delayed business operations and insights, increased operational overhead, and heightened compliance risks due to manual interventions and rigid scheduling mechanisms that cannot adapt to evolving business needs.

In this post, we walk through how Stifel Financial Corp, in collaboration with AWS ProServe, has addressed these challenges by building a modular, event-driven orchestration solution using AWS native services that enables precise triggering of data pipelines based on dependency satisfaction, supporting near real-time responsiveness and cross-domain coordination.

Data platform orchestration

Stifel and AWS technology teams identified several key requirements that would guide their solution architecture to overcome the above listed challenges along with traditional data pipeline orchestration.

Coordinated pipeline execution across multiple data domains based on events

  • The orchestration solution must support triggering data pipelines across multiple business domains based on events such as data product publication or completion of upstream jobs.

Smart dependency management

  • The solution should intelligently manage pipeline dependencies across domains and accounts.
  • It must ensure that downstream pipelines wait for all necessary upstream data products, regardless of which team or AWS account owns them.
  • Dependency logic should be dynamic and adaptable to changes in data availability.

Business-aligned configuration

  • A no-code architecture should allow business users and data owners to define pipeline dependencies and triggers using metadata.
  • All changes to dependency configurations should be version-controlled, traceable, and auditable.

Scalable and flexible architecture

  • The orchestration solution should support hundreds of pipelines across multiple domains without performance degradation.
  • It should be easy to onboard new domains, define new dependencies, and integrate with existing data mesh components.

Visibility and monitoring

  • Business users and data owners should have access showing pipeline status, including success, failure, and progress.
  • Alerts and notifications should be sent when issues occur, with clear diagnostics to support rapid resolution.

Example Scenario

The following below illustrates a cross-domain data dependency scenario, where a data product in domain (D1 and D2) relies on the prompt refresh of data products from other domains, each operating on distinct schedules. Upon completion, these upstream data products emit refresh events that automatically trigger the execution of a dependent downstream pipeline.

  • Dataset DS1 for Domain D1 depends on RD1 and RD2 from raw data domain which gets refreshed at different times T1 and T2
  • Dataset DS2 for Domain D1 depends on RD3 from raw data domain which gets refreshed at different times T3
  • Dataset DS3 for Domain D1 depends on data refresh of datasets DS1 and DS2 from Domain D1
  • Dataset DS4 for Domain D1 depends on datasets DS3 from Domain D1 and dataset DS1 from Domain D2 which is refreshed at time T4.

Solution Overview

The orchestration solution involves two main components.

1. Cross account event sharing

The following diagram illustrates the architecture for distributing data refresh events across domains within the orchestration solution using Amazon EventBridge. Data producers emit refresh events to a centralized event bus upon completing their updates. These events are then propagated to all subscribing domains. Each domain evaluates incoming events against its pipeline dependency configurations, enabling precise and prompt triggering of downstream data pipelines.

Cross Account Event Publish Using Eventbridge

The following snippet shows the data refresh event:


Sample EventBridge cross account event forward rule.

The following screenshots depicts a sample data refresh event that will be broadcasted to consumer data domains.


2. Data Pipeline orchestration

The following diagram describes the technical architecture of the orchestration solution using several AWS services such as Amazon Eventbridge, Amazon SQS, AWS Lambda, AWS Glue, Amazon SNS and Amazon Aurora.

The orchestration solution revolves around five core processors.

Data product pipeline scheduler

The scheduler is a daily scheduled Glue job that finds data products that are due for data refresh based on orchestration metadata and, for each identified data product, the scheduler retrieves both internal and external dependencies and stores them in the orchestration state management system database tables with a status of WAITING.

Data refresh events processor

Data refresh events are emitted from a central event bus and routed to domain-specific event buses. These domain buses send the events to a message queue for asynchronous processing. Any undeliverable events are redirected to a dead-letter queue for further inspection and recovery.

The event processor Lambda function consumes messages from the queue and evaluates whether the incoming event corresponds to any defined dependencies within the domain. If a match is found, the dependency status is updated from WAITING to ARRIVED. The processor also checks whether all dependencies for a given data product have been satisfied. If so, it starts the corresponding pipeline execution workflow by triggering an AWS Step Functions state machine.

Data product pipeline processor

Retrieves orchestration metadata to find the pipeline configuration and associated Glue job and parameters for the target data product. Triggers the Glue job using the retrieved configuration and parameters. This step ensures that the pipeline is launched with the correct context and input values. It also captures the Glue job run Id and updates the data product status to PROCESSING within the orchestration state management database, enabling downstream monitoring and status tracking.

Data product pipeline status processor

Each domain’s EventBridge is configured to listen for AWS Glue job state change events, which are routed to a message queue for asynchronous processing. A processing function evaluates incoming job state events:

  • For successful job completions, the corresponding pipeline status is updated from PROCESSING to COMPLETED in the orchestration state database. If the pipeline is configured to publish downstream events, a data refresh event is emitted to the central event bus.
  • For failed jobs, the pipeline status is updated from PROCESSING to ERROR, enabling downstream systems to manage exceptions or start retrying of a failed job.
  • Sample Glue Job state change events for successful completion. The glue job name from the event is used to update the status of the data product.

Data product pipeline monitor

The pipeline monitoring system operates through an EventBridge scheduled trigger that activates every 10 minutes to scan the orchestration state. During this scan, it identifies data products with satisfied dependencies but pending pipeline execution and initiates those pipelines automatically. When pipeline reruns are necessary, the system resets the orchestration state, allowing the monitor to reassess dependencies and trigger the appropriate pipelines. Any pipeline failures are promptly captured as exception notifications and directed to a dedicated notification queue for thorough analysis and team alerting.

Orchestration metadata data model

The following diagram describes the reference data model for storing the dependencies and state management of the data pipelines.

Table Name Description
data_product This table stores information on the data product and settings such publishing event for the data product.
data_product_dependencies This table stores information on the data product dependencies for both internal and external data products.
data_product_schedule This table stores information on the data product run schedule (Ex: daily / weekly)
data_pipeline_config This table stores information about the Glue job used for the data pipeline (ex: Name of the glue job, connections)
data_pipeline_parameters This table stores the Glue job parameters
data_product_status This table tracks the execution status of the data product pipeline, transitioning states from ‘Waiting’ to either ‘Complete’ or ‘Error’ based on runtime outcomes
data_product_dependencies_events_status This table stores the status of data dependencies refresh status. It is used to keep track of the dependencies and updates the status as the data refresh events arrive
data_product_status_history This table stores the historical data of data product data pipeline executions for audit and reporting
data_product_dependencies_events_status_history This table stores the historical data of data product data dependency status for audit and reporting

Outcome

With data pipeline orchestration and use of AWS serverless services, Stifel was able to speed up the data refresh process by cutting down the lag time associated with fixed scheduling of triggering data pipelines as well increase the parallelism of executing the data pipelines which was a constraint with on-premises data platform. This approach offers:

  • Scalability by supporting coordination across multiple data domains.
  • Reliability through automated tracking and resolution of pipeline dependencies.
  • Timeliness by ensuring pipelines are executed precisely when their prerequisites are met.
  • Cost optimization by leveraging AWS serverless technologies Lambda for compute, EventBridge for event routing, Aurora Serverless for database operations, and Step Functions for workflow orchestration and pay only for actual usage rather than provisioned capacity while providing automatic scaling to handle varying workloads.

Conclusion

In this post, we showed how a modular, event-driven orchestration solution can effectively manage cross-domain data pipelines. Organizations can refer to this blog post to build robust data pipeline orchestration avoiding rigid schedules and dependencies by leveraging event-based triggers.

Special thanks: This implementation success is a result of close collaboration between Stifel Financial leadership team (Kyle Broussard Managing Director, Martin Nieuwoudt Director of Data Strategy & Analytics) , AWS ProServe, and the AWS account team. We want to thank Stifel Financial Executives and the Leadership Team for the strong sponsorship and direction.

About the authors

Amit Maindola

Amit Maindola

Amit is a Senior Data Architect with AWS ProServe team focused on data engineering, analytics, and AI/ML. He helps customers in their digital transformation journey and enables them to build highly scalable, robust, and secure cloud-based analytical solutions on AWS to gain timely insights and make critical business decisions.

Srinivas Kandi

Srinivas Kandi

Srinivas is a Senior Architect at Stifel focusing on delivering the next generation of cloud data platform on AWS. Prior to joining Stifel, Srini was a delivery specialist in cloud data analytics at AWS helping several customers in their transformational journey into AWS cloud. In his free time, Srini likes to explore cooking, travel and learn new trends and innovations in AI and cloud computing.

Hossein Johari

Hossein Johari

Hossein is a seasoned data and analytics leader with over 25 years of experience architecting enterprise-scale platforms. As Lead and Senior Architect at Stifel Financial Corp. in St. Louis, Missouri, he spearheads initiatives in Data Platforms and Strategic Solutions, driving the design and implementation of innovative frameworks that support enterprise-wide analytics, strategic decision-making, and digital transformation. Known for aligning technical vision with business objectives, he works closely with cross-functional teams to deliver scalable, forward-looking solutions that advance organizational agility and performance.

Ahmad Rawashdeh

Ahmad Rawashdeh

Ahmad is a Senior Architect at Stifel Financial. He supports Stifel and its clients in designing, implementing, and building scalable and reliable data architectures on Amazon Web Services (AWS), with a strong focus on data lake strategies, database services, and efficient data ingestion and transformation pipelines.

Lei Meng

Lei Meng

Lei is a data architect at Stifel. His focus is working in designing and implementing scalable and secure data solutions on the AWS and helping Stifel’s cloud migration from on-premises systems.

How Twilio built a multi-engine query platform using Amazon Athena and open-source Presto

Post Syndicated from Amber Runnels original https://aws.amazon.com/blogs/big-data/how-twilio-built-a-multi-engine-query-platform-using-amazon-athena-and-open-source-presto/

Twilio is a customer engagement platform that powers real-time, personalized customer experiences for leading brands through APIs that democratize communications channels like voice, text, chat, and video.

At Twilio, we manage a 20 petabyte-scale Amazon Simple Storage Service (Amazon S3) data lake that serves the analytics needs of over 1,500 users, processing 2.5 million queries monthly, and scanning an average of 85 PB of data. To meet our growing demands for scalability, emerging technology support, and data mesh architecture adoption, we built Odin, a multi-engine query platform that provides an abstraction layer built on top of Presto Gateway.

In this post, we discuss how we designed and built Odin, combining Amazon Athena with open-source Presto to create a flexible, scalable data querying solution.

A growing need for a multi-engine platform

Our data platform has been built on Presto since its inception, but over the years as we expanded to support multiple business lines and diverse use cases, we began to encounter challenges related to scalability, operational overhead, and cost management. Maintaining the platform through frequent version upgrades also became difficult. These upgrades required significant time to evaluate backwards compatibility, integrate with our existing data ecosystem, and determine optimal configurations across releases.

The administrative burden of upgrades and our commitment to minimizing user disruption caused our Presto version to fall behind. This prevented us from accessing the latest features and optimizations available in later releases. The adoption of Apache Hudi for our transaction-dependent critical workloads created a new requirement which our existing Presto deployment version couldn’t support. We needed an up-to-date Presto or Trino compatible service to accommodate these use cases while still reducing the operational overhead of maintaining our own query infrastructure.

Building a comprehensive data platform required us to balance multiple competing requirements and business constraints. We needed a solution that could support diverse workload types, from interactive analytics to ETL batch processing, while providing the flexibility to optimize compute resources based on specific use cases. We also wanted to improve upon cost management and attribution in our shared multi-tenanted query platform. Additionally, we needed to ensure that adopting any new technology did not cause any disruption to our users and maintained backward compatibility with existing systems during the transition period.

Selecting Amazon Athena as our modern analytics engine

Our users relied on SQL for interactive analysis, and we wanted to preserve this experience and make use of our existing jobs and application code. This meant we needed a Presto-compatible analytics service to modernize our data platform.

Amazon Athena is a serverless interactive query service built on Presto and Trino that allows you to run queries using a familiar ANSI SQL interface. Athena appealed to us due to its compatibility with open-source Trino and its seamless upgrade experience. Athena helps to ease the burden of managing a large-scale query infrastructure, and with provisioned capacity, offers predictable and scalable pricing for our largest query workloads. Athena’s workgroups provided the query and cost management capabilities we needed to efficiently support diverse teams and workload patterns with minimal overhead.

The ability to blend on-demand and dedicated serverless capacity models allows us to optimize workload distribution for our requirements, achieving the flexibility and scalability needed in a managed query environment. To address latency-sensitive and predictive query workloads, we adopted provisioned capacity for its serverless capacity guarantee and workload concurrency control features. For queries that may be ad-hoc and more flexible in scheduling, we opted to use the cost-efficient multi-tenant on-demand model, which optimizes resource utilization through shared infrastructure. In parallel to migrating workloads to Athena, we also needed a way to support legacy workloads that use custom implementations of Presto features. This requirement drove us to abstract the underlying implementation, allowing us to present users with a unified interface. This would give us the flexibility key to future proof our infrastructure and use the most appropriate compute for the workload and use case.

The birth of Odin

The following diagram shows Twilio’s multi-engine query platform that incorporates both Amazon Athena and open-source Presto.

Comprehensive AWS analytics architecture featuring multiple BI tools, query engines, caching layer, and query history tracking

High Level Architecture of Odin’s Query Engines

Odin is a Presto-based gateway built on Zuul, an open-source L7 application gateway developed by Netflix. Zuul had already demonstrated its scalability at Twilio, having been successfully adopted by other internal teams. Since end users primarily connect to the platform via a JDBC connector using the Presto Driver (which operates through HTTP calls), Zuul’s specialization in HTTP call management made it an ideal technical choice for our needs.

Odin functions as a central hub for query processing, employing a pluggable design that accommodates various query frameworks for maximum extensibility and flexibility. To interact with the Odin platform users are initially directed to an Amazon Application Load Balancer that sits in front of the Odin instances running on Amazon EC2. The Odin instances handle the authentication, routing, and entire query workflow throughout the query’s lifetime. Amazon ElastiCache for Redis handles the query tracking for Athena and Amazon DynamoDB is responsible for the maintaining the query history. Both query engines, Amazon Athena and the Presto clusters running on Amazon EC2,are supported by the AWS Glue Data Catalog as the metastore repository and query data from our Amazon S3-based data lake.

Routing queries to multiple engines

We had a variety of use cases that were being served by this query platform and therefore we opted to use Amazon Athena as our primary query engine while continuing to route certain legacy workloads to our Presto clusters. Prior to our architectural redesign, we encountered operational challenges due to our end users being tightly bound to specific Presto clusters which led to inevitable disruptions during maintenance windows. Additionally, users frequently overloaded individual clusters with diverse workloads ranging from lightweight ad-hoc analytics to complex data warehousing queries and resource-intensive ETL processes. This prompted us to implement a more sophisticated routing solution, one that was use case focused and not tightly bound to the specific underlying compute.

To enable routing across multiple query engines within the same platform, we developed a query hint mechanism that allows users to specify their intended use case. Users append this hint to the JDBC string via the X-Presto-Extra-Credential header, which Odin’s logical routing layer then evaluates alongside multiple factors including user identity, query origin, and fallback planning. The system also assesses whether the target resource has sufficient capacity, if not, it reroutes the query to an alternative resource with available capacity. While users provide initial context through their hints, Odin makes the final routing decisions intelligently on the server side. This approach balances user input with centralized orchestration, ensuring consistent performance and resource availability.

For example, say a user might specify the following connection string when connecting to the Odin platform from a Tableau client:

jdbc:presto://odin.twilio.com:443/hive?SSL=true&extraCredentials=routing:athena

The connection string uses the extraCredentials header to signal execution on Athena, where Odin validates query submission details, including the submitting user and tool, before determining the appropriate Athena workgroup for initial routing. Since this Tableau data source and user qualify as “critical queries,” the system routes them to a workgroup backed by capacity reservations. However, if that workgroup has too many pending queries in the execution queue, Odin’s routing logic automatically redirects to alternative workgroups with greater available resources. When necessary, queries may ultimately route to workgroups running on on-demand capacity. Through this fallback logic, Odin provides built-in load balancing at the routing layer, ensuring optimal utilization across the underlying compute infrastructure.

Here is an example workflow of how our queries are routed to Athena workgroups:

Architecture diagram showing how queries from Looker and Tableau are evaluated and routed through Amazon Athena workgroups with failback mechanisms

Once a query has been submitted to a workgroup for execution, Odin will also log the routing decision in our tracking system based on Amazon ElastiCache for Redis so that Odin’s routing logic can maintain real-time awareness of queue depths across all Athena workgroups. Additionally, Odin uses Amazon EventBridge to integrate with Amazon Athena to keep track of a query state change and create event-based workflows. Our Redis-based query tracking system effectively handles edge cases, such as when a JDBC client terminates mid-query. Even during such unexpected interruptions, the platform consistently maintains and updates the accurate state of the query.

Query history

Following successful query routing to either an Athena workgroup or one of our open-source Presto clusters, Odin persists the query identifier and destination endpoint in a query history table in DynamoDB. This design utilizes a RESTful architecture where initial query submissions operate as POST requests, while subsequent status checks function as GET requests that utilize DynamoDB as the authoritative lookup mechanism to locate and poll the appropriate execution engine. By centralizing query execution records in DynamoDB rather than maintaining state on individual servers, we’ve created a truly stateless system where incoming requests can be handled by any Amazon EC2 instance hosting our Odin web service.

Lessons learned

The transition from open-source Presto to Athena required some adaptation time, due to subtle differences in how these query engines operate. Since our Odin framework was built on the Presto driver, we needed to modify our processing approach to ensure compatibility between both systems.

As we began to adopt Athena for more use cases, we noticed a difference in the record counts between Athena and the original Presto queries. We discovered this was due to open-source Presto returning results with every page containing a header column, whereas Athena results only contain the header column on the first page and subsequent pages containing records only. This difference meant that for a 60-page result set, Athena would return 59 fewer rows than open-source Presto. Once we identified this pagination behavior, we optimized Odin’s result handling logic to properly interpret and process Athena’s format, so that queries would return accurate results.

Due to the nature of using the Odin platform, most of our interactions with the Athena service are API driven so we make use of the ResultSet object with the GetQueryResults API to retrieve query execution data. Using this mechanism, the API returns the data as all VARCHAR data type, even for complex types such as row, map, or array. This created a challenge because Odin uses the Presto driver for query parsing, resulting in a type mismatch between the expected formats and actual returned data. To address this, we implemented a translation layer within the Odin framework that converts all data types to VARCHAR and handles any downstream implications of this conversion internally.

These technical adjustments, while initially challenging, highlighted the importance of carefully managing the subtle differences between different query execution engines when building a unified data platform.

Scale of Odin and looking ahead

The Odin platform serves over 1,500 users who execute approximately 80,000 queries daily, totaling 2.5 million queries per month. Odin also powers more than 5,000 Business Intelligence (BI) reports and dashboards for Tableau and Looker. The queries are executed across our multi-engine landscape of more than 30 workgroups in Athena based on both provisioned capacity and on-demand workgroups and 4 Presto clusters on running on EC2 instances with Auto Scaling enabled that run on average 180 instances each. As Twilio continues to experience rapid growth, our Odin platform has enabled us to mature our technology stacks by both upgrading existing compute resources and integrating new technologies. We can do all this without disrupting the experience for our end users. While Odin serves as our foundation, we’re excited to continue to expand this pluggable infrastructure. Our roadmap includes migrating our self-managed open-source Presto implementation to EMR Trino, introducing Apache Spark as a compute engine via Amazon EMR Serverless or AWS Glue jobs, and integrating generative AI capabilities to intelligently route queries across Odin’s various compute options.

Conclusion

In this post, we’ve shared how we built Odin, our unified multi-engine query platform. By combining AWS services like Amazon Athena, Amazon ElastiCache for Redis, and Amazon DynamoDB with our open-source technology stack, we created a transparent abstraction layer for users. This integration has resulted in a highly available and resilient platform environment that serves our query processing needs.

By embracing this multi-engine approach, not only did we solve our query infrastructure challenges but we also established a flexible foundation that will continue to evolve with our data needs, ensuring we can deliver powerful insights at scale regardless of how technology trends shift in the future.

To learn more and get started using Amazon Athena, please see the Athena User Guide.


About the authors

Aakash Pradeep

Aakash Pradeep

Aakash is a Senior Software Engineer at Adobe with over 15 years of experience across ingestion, compute, storage, and query platforms. Previously, at Twilio, he worked extensively on developing the Odin platform to serve his customers query infrastructure needs. Aakash is a PrestoCon speaker, holds multiple patents in real-time analytics, and is passionate about building high-performance distributed systems.

Venkatram Bondugula

Venkatram Bondugula

Venkatram is a seasoned backend engineer with over a decade of experience specializing in the design and development of scalable data platforms for big data and distributed systems. With a strong background in backend architecture and data engineering, he has built and optimized high-performance systems that power data-driven decision-making at scale.

Amber Runnels

Amber Runnels

Amber is a Senior Analytics Specialist Solutions Architect at AWS specializing in big data and distributed systems. She helps customers optimize workloads in the AWS data ecosystem to achieve a scalable, performant, and cost-effective architecture. Aside from technology, she is passionate about exploring the many places and cultures this world has to offer, reading novels, and building terrariums.

Best practices for upgrading from Amazon Redshift DC2 to RA3 and Amazon Redshift Serverless

Post Syndicated from Ziad Wali original https://aws.amazon.com/blogs/big-data/best-practices-for-upgrading-from-amazon-redshift-dc2-to-ra3-and-amazon-redshift-serverless/

Amazon Redshift is a fast, petabyte-scale cloud data warehouse that makes it simple and cost-effective to analyze your data using standard SQL and your existing business intelligence (BI) tools. Tens of thousands of customers rely on Amazon Redshift to analyze exabytes of data and run complex analytical queries, delivering the best price-performance.

With a fully managed, AI-powered, massively parallel processing (MPP) architecture, Amazon Redshift drives business decision-making quickly and cost-effectively. Previously, Amazon Redshift offered DC2 (Dense Compute) node types optimized for compute-intensive workloads. However, they lacked the flexibility to scale compute and storage independently and didn’t support many of the modern features now available. As analytical demands grow, many customers are upgrading from DC2 to RA3 or Amazon Redshift Serverless, which offer independent compute and storage scaling, along with advanced capabilities such as data sharing, zero-ETL integration, and built-in artificial intelligence and machine learning (AI/ML) support with Amazon Redshift ML.

This post provides a practical guide to plan your target architecture and migration strategy, covering upgrade options, key considerations, and best practices to facilitate a successful and seamless transition.

Upgrade process from DC2 nodes to RA3 and Redshift Serverless

The first step towards upgrade is to understand how the new architecture should be sized; for this, AWS provides a recommendation table for provisioned clusters. When determining the configuration for Redshift Serverless endpoints, you can assess compute capacity details by examining the relationship between RPUs and memory. Each RPU allocates 16 GiB of RAM. To estimate the base RPU requirement, divide your DC2 nodes cluster’s total RAM by 16. These recommendations provide guidance in sizing the initial target architecture but depend on the computing requirements of your workload. To better estimate your requirements, consider conducting a proof of concept that uses Redshift Test Drive to run potential configurations. To learn more, see Find the best Amazon Redshift configuration for your workload using Redshift Test Drive and Successfully conduct a proof of concept in Amazon Redshift. After you decide on the target configuration and architecture, you can build the strategy for upgrading.

Architecture patterns

The first step is to define the target architecture for your solution. You can choose the main architecture pattern that best aligns with your use case from the options presented in Architecture patterns to optimize Amazon Redshift performance at scale. There are two main scenarios, as illustrated in the following diagram.

At the time of writing, Redshift Serverless doesn’t have manual workload management; everything runs with automatic workload management. Consider isolating your workload into multiple endpoints based on use case to enable independent scaling and better performance. For more information, refer to Architecture patterns to optimize Amazon Redshift performance at scale.

Upgrade strategies

You can choose from two possible upgrade options when upgrading from DC2 nodes to RA3 nodes or Redshift Serverless:

  • Full re-architecture – The first step is to evaluate and assess the workloads to determine whether you could benefit from a modern data architecture, then re-architect the existing platform during the upgrade process from DC2 nodes.
  • Phased approach– This is a two-stage strategy. The first stage involves a straightforward migration to the target RA3 or Serverless configuration. In the second stage, you can modernize the target architecture by taking advantage of cutting-edge Redshift features.

We usually recommend a phased approach, which allows for a smoother transition while enabling future optimization. The first stage of a phased approach consists of the following steps:

  • Evaluate an equivalent RA3 nodes or Redshift Serverless configuration for your existing DC2 cluster, using the sizing guidelines for provisioned clusters or the compute capacity options for serverless endpoints.
  • Thoroughly validate the chosen target configuration in a non-production environment using Redshift Test Drive. This automated tool simplifies the process of simulating your production workloads on various potential target configurations, enabling a comprehensive what-if analysis. This step is strongly recommended.
  • Proceed to the upgrade process when you are satisfied with the price-performance ratio of a particular target configuration, using one of the methods detailed in the following section.

Redshift RA3 instances and Redshift Serverless provide access to powerful new capabilities, including zero-ETL, Amazon Redshift Streaming Ingestion, data sharing writes, and independent compute and storage scaling. To maximize these benefits, we recommend conducting a comprehensive review of your current architecture (the second stage of a phased approach) to identify opportunities for modernization using Amazon Redshift’s latest features. For example:

Upgrade options

You can choose from three ways to resize or upgrade a Redshift cluster from DC2 to RA3 or Redshift Serverless: snapshot restore, classic resize, and elastic resize.

Snapshot restore

The snapshot restore method follows a sequential process that begins with capturing a snapshot of your existing (source) cluster. This snapshot is then used to create a new target cluster with your desired specifications. After creation, it’s essential to verify data integrity by confirming that data has been correctly transferred to the target cluster. An important consideration is that any data written to the source cluster after the initial snapshot must be manually transferred to maintain synchronization.

This method offers the following advantages:

  • Allows for the validation of the new RA3 or Serverless setup without affecting the existing DC2 cluster
  • Provides the flexibility to restore to different AWS Regions or Availability Zones
  • Minimizes cluster downtime for write operations during the transition

Keep in mind the following considerations:

  • Setup and data restore might take longer than elastic resize.
  • You might encounter data synchronization challenges. Any new data written to the source cluster after snapshot creation requires manual copying to the target. This process might need multiple iterations to achieve full synchronization and require downtime before cutoff.
  • A new Redshift endpoint is generated, necessitating connection updates. Consider renaming both clusters in order to maintain the original endpoint (make sure the new target cluster adopts the original source cluster’s name)

Classic resize

Amazon Redshift creates a target cluster and migrates your data and metadata to it from the source cluster using a backup and restore operation. All your data, including database schemas and user configurations, is accurately transferred to the new cluster. The source cluster restarts initially and is unavailable for a few minutes, causing minimal downtime. It quickly resumes, allowing both read and write operations as the resize continues in the background.

Classic resize is a two-stage process:

  • Stage 1 (critical path) – During this stage, metadata migration occurs between the source and target configurations, temporarily placing the source cluster in read-only mode. This initial phase is typically brief. When this phase is complete, the cluster is made available for read and write queries. Although tables originally configured with KEY distribution style are temporarily stored using EVEN distribution, they will be redistributed to their original KEY distribution during Stage 2 of the process.
  • Stage 2 (background operations) – This stage focuses on restoring data to its original distribution patterns. This operation runs in the background with low priority without interfering with the primary migration process. The duration of this stage varies based on multiple factors, including the volume of data being redistributed, ongoing cluster workload, and the target configuration being used.

The overall resize duration is primarily determined by the data volume being processed. You can monitor progress on the Amazon Redshift console or by using the SYS_RESTORE_STATE system view, which displays the percentage completed for the table being converted (accessing this view requires superuser privileges).

The classic resize approach offers the following advantages:

  • All possible target node configurations are supported
  • A comprehensive reconfiguration of the source cluster rebalances the data slices to default per node, leading to even data distribution across the nodes

However, keep in mind the following:

  • Stage 2 redistributes the data for optimal performance. However, Stage 2 runs at a lower priority, and in busy clusters, it can take a long time to complete. To speed up the process, you can manually run the ALTER TABLE DISTSTYLE command on your tables having KEY DISTSTYLE. By executing this command, you can prioritize the data redistribution to happen faster, mitigating any potential performance degradation due to the ongoing Stage 2 process.
  • Due to the Stage 2 background redistribution process, queries can take longer to complete during the resize operation. Consider enabling concurrency scaling as a mitigation strategy.
  • Drop unnecessary and unused tables before initiating a resize to speed up data distribution.
  • The snapshot used for the resize operation becomes dedicated to this operation only. Therefore, it can’t be used for a table restore or other purpose.
  • The cluster must operate within a virtual private cloud (VPC).
  • This approach requires a new or a recent manual snapshot taken before initiating a classic resize.
  • We recommend scheduling the operation during off-peak hours or maintenance windows for minimal business impact.

Elastic resize

When using elastic resize to change the node type, Amazon Redshift follows a sequential process. It begins by creating a snapshot of your existing cluster, then provisions a new target cluster using the most recent data from that snapshot. While data transfers to the new cluster in the background, the system remains in read-only mode. As the resize operation approaches completion, Amazon Redshift automatically redirects the endpoint to the new cluster and stops all connections to the original one. If any issues arise during this process, the system typically performs an automatic rollback without requiring manual intervention, though such failures are rare.

Elastic resize offers several advantages:

  • It’s a quick process that takes 10–15 minutes on average
  • Users maintain read access to their data during the process, experiencing only minimal interruption
  • The cluster endpoint remains unchanged throughout and after the operation

When considering this approach, keep in mind the following:

  • Elastic resize operations can only be performed on clusters using the EC2-VPC platform. Therefore, it’s not available for Redshift Serverless.
  • The target node configuration must provide sufficient storage capacity for existing data.
  • Not all target cluster configurations support elastic resize. In such cases, consider using classic resize or snapshot restore.
  • After the process is started, elastic resize can’t be stopped.
  • Data slices remain unchanged; this can potentially cause some data or CPU skew.

Upgrade recommendations

The following flowchart visually guides the decision-making process for choosing the appropriate Amazon Redshift upgrade method.

When upgrading Amazon Redshift, the method depends on the target configuration and operational constraints. For Redshift Serverless, always use the snapshot restore method. If upgrading to an RA3 provisioned cluster, you can choose from two options: use snapshot restore if a full maintenance window with downtime is acceptable, or choose classic resize for minimal downtime, because it rebalances the data slices to default per node, leading to even data distribution across the nodes. Although you can use elastic resize for certain node type changes (for example, DC2 to RA3) within specific ranges, it’s not recommended because elastic resize doesn’t change the number of slices, potentially leading to data or CPU skew, which can later impact the performance of the Redshift cluster. However, elastic resize remains the primary recommendation when you need to add or reduce nodes in an existing cluster.

Best practices for migration

When planning your migration, consider the following best practices:

  • Conduct a pre-migration assessment using Amazon Redshift Advisor or Amazon CloudWatch.
  • Choose the right target architecture based on your use cases and workloads. You can use Redshift Test Drive to determine the right target architecture.
  • Backup using manual snapshots, and enable automated rollback.
  • Communicate timelines, downtime, and changes to stakeholders.
  • Update runbooks with new architecture details and endpoints.
  • Validate workloads using benchmarks and data checksum.
  • Use maintenance windows for final syncs and cutovers.

By following these practices, you can achieve a controlled, low-risk migration that balances performance, cost, and operational continuity.

Conclusion

Migrating from Redshift DC2 nodes to RA3 nodes or Redshift Serverless requires a structured approach to support performance, cost-efficiency, and minimal disruption. By selecting the right architecture for your workload, and validating data and workloads post-migration, organizations can seamlessly modernize their data platforms. This upgrade facilitates long-term success, helping teams fully harness RA3’s scalable storage or Redshift Serverless auto scaling capabilities while optimizing costs and performance.


About the authors

Ziad Wali

Ziad Wali

Ziad is an Analytics Specialist Solutions Architect at AWS. He has over 10 years of experience in databases and data warehousing, where he enjoys building reliable, scalable, and efficient solutions. Outside of work, he enjoys sports and spending time in nature.

Omama Khurshid

Omama Khurshid

Omama is an Analytics Solutions Architect at Amazon Web Services. She focuses on helping customers across various industries build reliable, scalable, and efficient solutions. Outside of work, she enjoys spending time with her family, watching movies, listening to music, and learning new technologies.

Srikant Das

Srikant Das

Srikant is an Analytics Specialist Solutions Architect at Amazon Web Services, designing scalable, robust cloud solutions in Analytics & AI. Beyond his technical expertise, he shares travel adventures and data insights through engaging blogs, blending analytical rigor with storytelling on social media.

Building a real-time ICU patient analytics pipeline with AWS Lambda event source mapping

Post Syndicated from Priyanka Chaudhary original https://aws.amazon.com/blogs/big-data/building-a-real-time-icu-patient-analytics-pipeline-with-aws-lambda-event-source-mapping/

In hospital intensive care units (ICUs), continuous patient monitoring is critical. Medical devices generate vast amounts of real-time data on vital signs such as heart rate, blood pressure, and oxygen saturation. The key challenge lies in early detection of patient deterioration through vital sign trending. Healthcare teams must process thousands of data points daily per patient to identify concerning patterns, a task crucial for timely intervention and potentially life-saving care.

AWS Lambda event source mapping can help in this scenario by automatically polling data streams and triggering functions in real-time without additional infrastructure management. By using AWS Lambda for real-time processing of sensor data and storing aggregated results in secure data structures designed for large analytic datasets called Iceberg tables in Amazon Simple Storage Service (Amazon S3) buckets, medical teams can achieve both immediate alerting capabilities and gain long-term analytical insights, enhancing their ability to provide timely and effective care.

In this post, we demonstrate how to build a serverless architecture that processes real-time ICU patient monitoring data using Lambda event source mapping for immediate alert generation and data aggregation, followed by persistent storage in Amazon S3 with an Iceberg catalog for comprehensive healthcare analytics. The solution demonstrates how to handle high-frequency vital sign data, implement critical threshold monitoring, and create a scalable analytics platform that can grow with your healthcare organization’s needs and help monitor sensor alert fatigue in the ICU.

Architecture

The following architecture diagram illustrates a real-time ICU patient analytics system.

Arch diagram

In this architecture, real-time patient monitoring data from hospital ICU sensors is ingested into AWS IoT Core, which then streams the data into Amazon Kinesis Data Streams. Two Lambda functions consume this streaming data concurrently for different purposes, both using Lambda event source mapping integration with Kinesis Data Streams. The first Lambda function uses the filtering feature of event source mapping to detect critical health events where SpO2(blood oxygen saturation) levels fall below 90%, immediately triggering notifications to caregivers through Amazon Simple Notification Service (Amazon SNS) for rapid response. The second Lambda function employs the tumbling window feature of event source mapping to aggregate sensor data over 10-minute time intervals. This aggregated data is then systematically stored in S3 buckets in Apache Iceberg format for historical analysis and reporting. The entire pipeline operates in a serverless manner, providing scalable, real-time processing of critical healthcare data while maintaining both immediate alerting capabilities and long-term data storage for analytics.

Amazon S3 data, with its support for Apache Iceberg table format, enables healthcare organizations to efficiently store and query large volumes of time-series patient data. This solution allows for complex analytical queries across historical patient data while maintaining high performance and cost efficiency.

Prerequisites

To implement the solution provided in this post, you should have the following:

  • An active AWS account
  • IAM permissions to deploy CloudFormation templates and provision AWS resources
  • Python installed on your machine to run the ICU patient sensor data simulator code

Deploy a real-time ICU patient analytics pipeline using CloudFormation

You use AWS CloudFormation templates to create the resources for a real-time data analytics pipeline.

  1. To get started, Sign in to the console as Account user and select the appropriate Region.
  2. Download and launch CloudFormation template  where you want to host the Lambda functions.
  3. Choose Next.
  4. On the Specify stack details page, enter a Stack name (for example, IoTHealthMonitoring).
  5. For Parameters, enter the following:
    1. IoTTopic: Enter the MQTT topic for your IoT devices (for example, icu/sensors).
    2. EmailAddress: Enter an email address for receiving notifications.
  6. Wait for the stack creation to complete. This process might take 5-10 minutes.
  7. After the CloudFormation stack completes, it creates following resources:
    1. An AWS IoT Core rule to capture data from the specified IoTTopic topic and routes it to Kinesis data stream.
    2. A Kinesis data stream for ingesting IoT sensor data.
    3. Two Lambda functions:
      • FilterSensorData: Monitors critical health metrics and sends alerts.
      • AggregateSensorData: Aggregates sensor data in 10 minutes window.
    4. An Amazon DynamoDB table (NotificationTimestamps) to store notification timestamps for rate limiting alerts.
    5. An Amazon SNS topic and subscription to send email notifications for critical patient conditions.
    6. An Amazon Data Firehose delivery stream to deliver processed data to Amazon S3 using Iceberg format.
    7. Amazon S3 buckets to store sensor data.
    8. Amazon Athena and AWS Glue resources for the database and an Iceberg table for querying aggregated data.
    9. AWS Identity and Access Management (IAM) roles and policies to support required permissions for Amazon IoT rules, Lambda functions, and Data Firehose streams.
    10. Amazon CloudWatch log groups to record for Kinesis Firehose activity and Lambda functions.

Solution walkthrough

Now that you’ve deployed the solution, let’s review a functional walkthrough. First, simulate patient vital signs data and send it to AWS IoT Core using the following Python code on your local machine. To run this code successfully, ensure you have the necessary IAM permissions to publish messages to the IoT topic in the AWS account where the solution is deployed.

import boto3
import json
import random
import time
# AWS IoT Data client
iot_data_client = boto3.client(
    'iot-data',
    region_name='us-west-2'
)
# IOT Topic to publish
topic = 'icu/sensors'
# Fixed set of patient IDs
patient_ids = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
print("Infinite sensor data simulation...")
try:
    while True:
        for patient_id in patient_ids:
            # Generate sensor data
            message = {
                "patient_id": patient_id,
                "timestamp": int(time.time()),
                "spo2": random.randint(91, 99),
                "heart_rate": random.randint(60, 100),
                "temperature_f": round(random.uniform(97.0, 100.0), 1)
            }
            # Publish to topic
            response = iot_data_client.publish(
                topic=topic,
                qos=1,
                payload=json.dumps(message)
            )
            print(f"Published: {message}")
        # Wait 30 seconds before next round
        print("Sleeping for 30 seconds...\n")
        time.sleep(30)
except KeyboardInterrupt:
    print("\nSimulation stopped by user.")

The following is the format of a sample ICU sensor message produced by the simulator.

{
    "patient_id": 1,
    "timestamp": 1683000000,
    "spo2": 85,
    "heart_rate": 75,
    "temperature_f": 98.6
}

Data is published to the icu/sensors IoT topic every 30 seconds for 10 different patients, creating a continuous stream of ICU patient monitoring data. Messages published to AWS IoT Core are passed to Kinesis Data Streams using the following message routing rule deployed by our solution.

Two Lambda functions consume data from Data Streams concurrently, both using the Lambda event source mapping integration with Kinesis Data Streams.

Event source mapping

Lambda event source mapping automatically triggers Lambda functions in response to data changes from supported event sources like Amazon DynamoDB Streams, Amazon Kinesis Data Streams, Amazon Simple Queue Service (Amazon SQS), Amazon MQ, and Amazon Managed Streaming for Apache Kafka. This serverless integration works by having Lambda poll these sources for new records, which are then processed in configurable batch sizes ranging from 1 to 10,000 records. When new data is detected, Lambda automatically invokes the function synchronously, handling the scaling automatically based on the workload. The service supports at-least-once delivery and provides robust error handling through retry policies and dead-letter queues for failed events. Event source mappings can be fine-tuned through various parameters such as batch windows, maximum record age, and retry attempts, making them highly adaptable to different use cases. This feature is particularly valuable in event-driven architectures, so that customers can focus on business logic while AWS manages the complexities of event processing, scaling, and reliability.

Event source mapping uses tumbling windows and filtering to process and analyze data.

Tumbling windows

Tumbling windows in Lambda event processing enable data aggregation in fixed, non-overlapping time intervals, where each event belongs to exactly one window. This is ideal for time-based analytics and periodic reporting. When combined with event source mapping, this approach allows efficient batch processing of events within defined time periods (for example, 10-minute windows), enabling calculations such as average vital signs or cumulative fluid intake and output while optimizing function invocations and resource usage.

When you configure an event source mapping between Kinesis Data Streams and a Lambda function, use the Tumbling Window Duration setting, which appears in the trigger configuration in the Lambda console. The solution you deployed using the CloudFormation template includes the AggregateSensorData Lambda function, which uses a 10-minute tumbling window configuration. Depending on the volume of messages flowing through the Amazon Kinesis stream, the AggregateSensorData function can be invoked multiple times for each 10-minute window, sequentially, with the following attributes in the event supplied to the function.

  • Window start and end: The beginning and ending timestamps for the current tumbling window.
  • State: An object containing the state returned from the previous window, which is initially empty. The state object can contain up to 1 MB of data.
  • isFinalInvokeForWindow: Indicates if this is the last invocation for the tumbling window. This only occurs once per window period.
  • isWindowTerminatedEarly: A window ends early only if the state exceeds the maximum allowed size of 1 MB.

In a tumbling window, there is a series of Lambda invocations in the following pattern:

AggregateSensorData Lambda code snippet:

def handler(event, context):
    
    state_across_window = event['state']
    # Iterate through each record and decode the base64 data
    for record in event['Records']:
        encoded_data = record['kinesis']['data']
        partition_key = record['kinesis']['partitionKey']
        decoded_bytes = base64.b64decode(encoded_data)
        decoded_str = decoded_bytes.decode('utf-8')
        decoded_json = json.loads(decoded_str)
        # create partition_key attribute if it do not exists in state
        if partition_key not in state_across_window:
            state_across_window[partition_key] = {"min_spo2": decoded_json['spo2'], "max_spo2": decoded_json['spo2'], "avg_spo2": decoded_json['spo2'], "sum_spo2": decoded_json['spo2'], "min_heart_rate": decoded_json['heart_rate'], "max_heart_rate": decoded_json['heart_rate'], "avg_heart_rate": decoded_json['heart_rate'], "sum_heart_rate": decoded_json['heart_rate'], "min_temperature_f": decoded_json['temperature_f'], "max_temperature_f": decoded_json['temperature_f'], "avg_temperature_f": decoded_json['temperature_f'], "sum_temperature_f": decoded_json['temperature_f'], "record_count": 1}
        else:
            min_spo2 = state_across_window[partition_key]['min_spo2'] if state_across_window[partition_key]['min_spo2'] < decoded_json['spo2'] else decoded_json['spo2']
            max_spo2 = state_across_window[partition_key]['max_spo2'] if state_across_window[partition_key]['max_spo2'] > decoded_json['spo2'] else decoded_json['spo2']
            sum_spo2 = state_across_window[partition_key]['sum_spo2'] + decoded_json['spo2']
            min_heart_rate = state_across_window[partition_key]['min_heart_rate'] if state_across_window[partition_key]['min_heart_rate'] < decoded_json['heart_rate'] else decoded_json['heart_rate']
            max_heart_rate = state_across_window[partition_key]['max_heart_rate'] if state_across_window[partition_key]['max_heart_rate'] > decoded_json['heart_rate'] else decoded_json['heart_rate']
            sum_heart_rate = state_across_window[partition_key]['sum_heart_rate'] + decoded_json['heart_rate']
            
            min_temperature_f = state_across_window[partition_key]['min_temperature_f'] if state_across_window[partition_key]['min_temperature_f'] < decoded_json['temperature_f'] else decoded_json['temperature_f']
            max_temperature_f = state_across_window[partition_key]['max_temperature_f'] if state_across_window[partition_key]['max_temperature_f'] > decoded_json['temperature_f'] else decoded_json['temperature_f']
            sum_temperature_f = state_across_window[partition_key]['sum_temperature_f'] + decoded_json['temperature_f']
            
            record_count = state_across_window[partition_key]['record_count'] + 1
            avg_spo2 = sum_spo2/record_count
            avg_heart_rate = sum_heart_rate/record_count
            avg_temperature_f = sum_temperature_f/record_count
            
            state_across_window[partition_key] = {"min_spo2": min_spo2, "max_spo2": max_spo2, "avg_spo2": avg_spo2, "sum_spo2": sum_spo2, "min_heart_rate": min_heart_rate, "max_heart_rate": max_heart_rate, "avg_heart_rate": avg_heart_rate, "sum_heart_rate": sum_heart_rate, "min_temperature_f": min_temperature_f, "max_temperature_f": max_temperature_f, "avg_temperature_f": avg_temperature_f, "sum_temperature_f": sum_temperature_f, "record_count": record_count}
        
    # Determine if the window is final (window end)
    is_final_window = event.get('isFinalInvokeForWindow', False)
    # Determine if the window is terminated (window ended early)
    is_terminated_window = event.get('isWindowTerminatedEarly', False)
    window_start = event['window']['start']
    window_end = event['window']['end']
    if is_final_window or is_terminated_window:
        firehose_client = boto3.client('firehose')
        firehose_stream = os.environ['FIREHOSE_STREAM_NAME']
        for key, value in state_across_window.items():
            value['patient_id'] = key
            value['window_start'] = window_start
            value['window_end'] = window_end
            
            firehose_client.put_record(
                DeliveryStreamName= firehose_stream,
                Record={'Data': json.dumps(value) }
            )
        
        return {
            "state": {},
            "batchItemFailures": []
        }
    else:
        print(f"interim call for window: ws: {window_start} we: {window_end}")
        return {
            "state": state_across_window,
            "batchItemFailures": []
        }
  • The first invocation contains an empty state object in the event. The function returns a state object containing custom attributes that are specific to the custom logic in the aggregation.
  • The second invocation contains the state object provided by the first Lambda invocation. This function returns an updated state object with new aggregated values. Subsequent invocations follow this same sequence. Following is a sample of the aggregated state, which can be supplied to subsequent Lambda invocations within the same 10-minute tumbling window.
{
    "min_spo2": 88,
    "max_spo2": 90,
    "avg_spo2": 89.2,
    "sum_spo2": 625,
    "min_heart_rate": 21,
    "max_heart_rate": 22,
    "avg_heart_rate": 21.1,
    "sum_heart_rate": 148,
    "min_temperature_f": 90,
    "max_temperature_f": 91,
    "avg_temperature_f": 90.1,
    "sum_temperature_f": 631,
    "record_count": 7,
    "patient_id": "44",
    "window_start": "2025-05-29T20:51:00Z",
    "window_end": "2025-05-29T20:52:00Z"
}
  • The final invocation in the tumbling window has the isFinalInvokeForWindow flag set to the true. This contains the state returned by the most recent Lambda invocation. This invocation is responsible for passing aggregated state messages to the Data Firehose stream, which delivers data to the Amazon S3 bucket using Iceberg data format.
  • After the aggregated data is sent to Amazon S3, you can query the data using Athena.
Query: SELECT * FROM "cfdb_<<Database>>"."table_<<Table>>"

Sample result of the preceding Athena query:

Event source mapping with filtering

Lambda event source mapping with filtering optimizes data processing from sources like Amazon Kinesis by applying JSON pattern filtering before function invocation. This is demonstrated in the ICU patient monitoring solution, where the system filters for SpO2 readings from Kinesis Data Streams that are below 90%. Instead of processing all incoming data, the filtering capability is used to selectively processes only critical readings, significantly reducing costs and processing overhead. The solution uses DynamoDB for sophisticated state management, tracking low SpO2 events through a schema combining PatientID and timestamp-based keys within defined monitoring windows.

This state-aware implementation balances clinical urgency with operational efficiency by sending immediate Amazon SNS notifications when critical conditions are first detected while implementing a 15-minute alert suppression window to prevent alert fatigue among healthcare providers. By maintaining state across multiple Lambda invocations, the system helps ensure rapid response to potentially life-threatening situations while minimizing unnecessary notifications for the same patient condition. The integration of Lambda’event filtering, DynamoDB state management, and reliable alert delivery provided by Amazon SNS creates a robust, scalable healthcare monitoring solution that exemplifies how AWS services can be strategically combined to address complex requirements while balancing technical efficiency with clinical effectiveness.

Filter sensor data Lambda code snippet:

sns_client = boto3.client('sns')
dynamodb = boto3.resource('dynamodb')
table_name = os.environ['DYNAMODB_TABLE']
sns_topic_arn = os.environ['SNS_TOPIC_ARN']
table = dynamodb.Table(table_name)
FIFTEEN_MINUTES = 15 * 60  # 15 minutes in seconds
def handler(event, context):
    for record in event['Records']:
        print(f"Aggregated event: {record}")
        encoded_data = record['kinesis']['data']
        partition_key = record['kinesis']['partitionKey']
        decoded_bytes = base64.b64decode(encoded_data)
        decoded_str = decoded_bytes.decode('utf-8')
        # Check last notification timestamp from DynamoDB
        try:
            response = table.get_item(Key={'partition_key': partition_key})
            item = response.get('Item')
            now = int(time.time())
            if item:
                last_sent = item.get('timestamp', 0)
                if now - last_sent < FIFTEEN_MINUTES:
                    print(f"Notification for {partition_key} skipped (sent recently)")
                    continue
            # Send SNS Notification
            sns_response = sns_client.publish(
                TopicArn=sns_topic_arn,
                Message=f"Patient SpO2 below 90 percentage event information: {decoded_str}",
                Subject=f"Low SpO2 detected for patient ID {partition_key}"
            )
            print("Message sent to SNS! MessageId:", sns_response['MessageId'])
            # Update DynamoDB with current timestamp and TTL
            table.put_item(Item={
                'partition_key': partition_key,
                'timestamp': now,
                'ttl': now + FIFTEEN_MINUTES + 60  # Add extra buffer to TTL
            })
        except Exception as e:
            print("Error processing event:", e)
            return {
                'statusCode': 500,
                'body': json.dumps('Error processing event')
            }
    return {
        'statusCode': 200,
        'body': {}
    }

To generate an alert notification through the deployed solution, update the preceding simulator code by setting the SpO2 value to less than 90 and run it again. Within 1 minute, you should receive an alert notification at the email address you provided during stack creation. The following image is an example of an alert notification generated by the deployed solution.

Clean up

To avoid ongoing costs after completing this tutorial, delete the CloudFormation stack that you deployed earlier in this post. This will remove most of the AWS resources created for this solution. You might need to manually delete objects created in Amazon S3, because CloudFormation won’t remove non-empty buckets during stack deletion.

Conclusion

As demonstrated in this post, you can build a serverless real-time analytics pipeline for healthcare monitoring by using AWS IoT Core, Amazon S3 buckets with iceberg format, and Amazon Kinesis Data Streams integration with AWS Lambda event source mapping. This architectural approach eliminates the need for complex code while enabling rapid critical patient care alerts and data aggregation for analysis using Lambda. The solution is particularly valuable for healthcare organizations looking to modernize their patient monitoring systems with real-time capabilities. The architecture can be extended to handle various medical devices and sensor data streams, making it adaptable for different healthcare monitoring scenarios. This post presents one implementation approach, and organizations adopting this solution should ensure the architecture and code meets their specific application performance, security, privacy, and regulatory compliance needs.

If this post helps you or inspires you to solve a problem, we would love to hear about it!


About the authors

Nihar Sheth

Nihar Sheth

Nihar is a Senior Product Manager on the AWS Lambda team at Amazon Web Services. He is passionate about developing intuitive product experiences that solve complex customer problems and enable customers to achieve their business goals.

Pratik Patel

Pratik Patel

Pratik is Sr Technical Account Manager and streaming analytics specialist. He works with AWS customers and provides ongoing support and technical guidance to help plan and build solutions using best practices and proactively helps in keeping customers’ AWS environments operationally healthy.

Priyanka Chaudhary

Priyanka Chaudhary

Priyanka is Senior Solutions Architect at AWS. She is specialized in data lake and analytics services and helps many customers in this area. As a Solutions Architect, she plays a crucial role in guiding strategic customers through their cloud journey by designing scalable and secure cloud solutions. Outside of work, she loves spending time with friends and family, watching movies, and traveling.

Bridging data silos: cross-bounded context querying with Vanguard’s Operational Read-only Data Store (ORDS) using Amazon Redshift

Post Syndicated from Naresh Rajaram original https://aws.amazon.com/blogs/big-data/bridging-data-silos-cross-bounded-context-querying-with-vanguards-operational-read-only-data-store-ords-using-amazon-redshift/

Are you modernizing your legacy batch processing systems? At Vanguard, we faced significant challenges with our legacy mainframe system that limited our ability to deliver modern, personalized customer experiences. Our centralized database architecture created performance bottlenecks and made it difficult to scale services independently for our millions of personal and institutional investors.

In this post, we show you how we modernized our data architecture using Amazon Redshift as our Operational Read-only Data Store (ORDS). You’ll learn how we transitioned to a cloud-native, domain-driven architecture while preserving critical batch processing capabilities. We show you how this solution enabled us to create logically isolated data domains while maintaining cross-domain analytics capabilities—all while adhering to the principles of bounded contexts and distributed data ownership.

Background and challenges

As financial needs continue to evolve, Vanguard is committed to delivering adaptable, top-notch experiences that foster long-lasting customer relationships. This commitment spans from enhancing the personal investor journey to bringing personalized mobile dashboards and connecting institutional clients with advanced advice offerings.

To elevate customer experience and drive digital transformation, Vanguard has embraced domain-driven design principles. This approach focuses on creating autonomous teams, fostering faster innovation, and building data mesh architecture. Central to this transformation is the Personal Investor team’s mainframe modernization effort, transitioning from a legacy system to a cloud-based, distributed data architecture organized around bounded contexts – distinct business domains that manage their own data. As part of this shift, each microservice now manages its own local data store using Amazon Aurora PostgreSQL-Compatible Edition or Amazon DynamoDB. This approach enables domain-level data ownership and operational autonomy.

Vanguard’s existing mainframe system, built on a centralized Db2 database, enables cross-domain data access and integration but also introduces several architectural challenges. Though batch processes can join data across multiple bounded contexts using SQL joins and database operations to integrate information from various sources, this tight coupling creates significant risks and operational issues.

Challenges with the centralized database approach include:

  • Resource Contention: Processes from one domain can negatively impact other domains due to shared compute resources, leading to performance degradation across the system.
  • Lack of Domain Isolation: Changes in one bounded context can have unintended ripple effects across other domains, increasing the risk of system-wide failures.
  • Scalability Constraints: The centralized architecture creates bottlenecks as load increases, making it difficult to scale individual components independently.
  • High Coupling: Tight integration between domains makes it challenging to modify or upgrade individual components without affecting the entire system.
  • Limited Fault Tolerance: Issues in one domain can cascade across the entire system due to shared infrastructure and data dependencies.

To address these architectural challenges, we chose to use Amazon Redshift as our Operational Read-only Data Store (ORDS). The Amazon Redshift architecture has compute and storage separation, which enables us to create multi-cluster architectures with a separate endpoint for each domain with independent scaling of compute and storage resources. Our solution leverages the data sharing capabilities of Amazon Redshift to create logically isolated data domains while maintaining the ability to perform cross-domain analytics when needed.

Key benefits of the Amazon Redshift solution include:

  1. Resource Isolation: Each domain can be assigned dedicated Amazon Redshift compute resources, making sure one domain’s workload doesn’t impact others.
  2. Independent Scaling: Domains can scale their compute resources independently based on their specific needs.
  3. Controlled Data Sharing: Amazon Redshift’s data sharing feature enables secure and controlled cross-domain data access without tight coupling, maintaining clear domain boundaries.

Let’s explore the different solutions we evaluated before selecting ORDS with Amazon Redshift as our optimal approach.

Solutions explored

We implemented ORDS as our optimal solution after conducting a comprehensive evaluation of available options. This section outlines our decision-making process and examines the alternatives we considered during our assessment.

Operational Read-only Data Store (ORDS):

In our evaluation, we found that using Amazon Redshift for ORDS provides a powerful solution for handling data across different business areas. It excels at managing large volumes of data from multiple sources, providing fast access to replicated data for batch processes that require cross-bounded context data, and combining information using familiar SQL queries. The solution particularly shines in handling high-volume reads from our data sources.

Advantages:

  • Works well in a relational database
  • Excels at real-time access to data from multiple business areas
  • Improves performance of batch jobs dealing with large data volumes
  • Stores data in familiar table format, accessible via SQL
  • Enforces clear data ownership, with each business area responsible for its data
  • Offers scalable architecture that reduces the risk of single point of failure

Disadvantages:

  • Requires additional data validation during loading processes to maintain data uniqueness
  • Needs careful management of primary key constraints since Amazon Redshift optimizes for analytical performance
  • May require additional monitoring and controls compared to traditional RDBMS systems

Here are the other solutions we evaluated:

Bulk APIs:

We found that Bulk APIs provides an approach for handling large volumes of data.

Advantages:

  • Near real time access to bulk data through a single request
  • Autonomous teams have control over access patterns
  • Efficient batch processing of large datasets with multi-record retrieval

Disadvantages:

  • Each product team needs to create their own bulk API
  • If you need data from different areas, you must combine it yourself
  • The team providing the API must make sure it can handle large amounts of requests
  • You might need to use multiple APIs to get all the data you want
  • If you’re getting data in chunks (pagination), you might miss some information if it changes between requests

While Bulk APIs offer powerful capabilities, we found they require substantial team coordination and careful implementation to be effective.

Data Lake:

Our evaluation showed that data lakes can effectively combine information from different parts of our business. They excel at processing large amounts of data at once, providing search capabilities through unified data formats, and managing large volumes of diverse and complex data.

Advantages:

  • Handles massive data volumes efficiently
  • Supports multiple data formats and structures
  • Enables complex analytics and data science workloads
  • Provides cost-effective storage solutions
  • Accommodates both structured and unstructured data

Disadvantages:

  • May not provide real-time, high-speed data access
  • Requires additional effort with complex data structures, especially those with many interconnected parts
  • Needs specific strategies to organize data in a simple, flat structure
  • Demands significant data governance and management
  • Requires specialized skills for effective implementation

While data lakes excel at big-picture analysis of large datasets, they weren’t optimal for our real-time data needs and complex data relationships.

S3 Export/Exchange: 

In our analysis, we found that S3 Export/Exchange provides a method for sharing data between different business areas using file storage. This approach effectively handles large volumes of data and allows straightforward filtering of information using data frames.

Advantages:

  • Provides simple, cost-effective data storage
  • Supports high-volume data transfers
  • Enables straightforward data filtering capabilities
  • Offers flexible access control
  • Facilitates cross-region data sharing

Disadvantages:

  • Not suitable for real-time data needs
  • Requires extra processing to convert data into usable table format
  • Demands significant data preparation effort
  • Lacks immediate data consistency
  • Needs additional tools for data transformation

While S3 Export/Exchange works well for sharing large datasets between teams, it didn’t meet our requirements for quick, real-time access or immediately usable data formats.

The following table provides a high-level comparison of the different data integration solutions we considered for our modernization efforts. It outlines where each solution is most appropriate to use and when it might not be the best choice:

Solution Bulk APIs Data Lake ORDS S3 Export/Exchange
When to use Real-time operational data is needed

Fetching specific data subsets

Processing large amounts of data at once

Many bounded context

Near real-time access across multiple bounded contexts

Large volume batch processing

Few bounded contextsHandling large volumes of data

Point-in-time export is sufficient

When not to use Many bounded contexts involved Real-time data access needed

Structured, transactional data processing

Within a single bounded context Real-time data needs

Many bounded contexts

Table 1: Data Integration Solutions Comparison

Based on our comparison, we found ORDS to be the optimal solution for our needs, particularly when our batch processes require access to data from multiple bounded contexts in real-time. Our implementation efficiently handles large volumes of data, significantly improving the performance of our batch jobs. We chose ORDS because it stores data in a familiar table format, accessible via SQL, making it simple and efficient for our teams to use.

The architecture also aligns with our domain-driven design principles by enforcing clear data ownership, where each bounded context maintains responsibility for its own data management. This approach provides us with both scalability and reliability, reducing the risk of a single point of failure.

Amazon Redshift: Powering Vanguard’s ORDS Solution

Amazon Redshift serves as the backbone of our ORDS implementation, offering several crucial features that support our modernization goals:

Data Sharing

Our solution leveraged the robust data sharing capabilities of Amazon Redshift, available on both Server-based Redshift RA3 instances and Redshift Serverless options. This functionality provided us with instant, secure, and live data access without copies, maintaining transactional consistency across our environment. The flexibility of same account, cross-account, and cross-Region data sharing has been particularly valuable for our distributed architecture.

High Performance

We’ve achieved significant performance improvements through Amazon Redshift’s efficient query processing and data retrieval capabilities. The system effectively handles our complex data needs while maintaining robust performance across various workloads and data volumes.

Multi-Availability Zone Support

Our implementation benefited from Amazon Redshift’s Multi-AZ support, which maintains high availability and reliability for our critical operations. This feature minimizes downtime without requiring extensive setup and significantly reduces our risk of data loss.

Familiar Interface

The relational environment of Amazon Redshift, similar traditional databases like Amazon RDS and IBM Db2, has enabled a smooth transition for our teams. This familiarity has accelerated adoption and improved productivity, as our teams can leverage their existing SQL expertise. By centralizing data from multiple business areas in ORDS using Amazon Redshift, we maintain consistent, efficient, and secure data access across our product teams. This setup is particularly valuable for our batch processing that requires data from various parts of the business, offering us a blend of performance, reliability, and ease of use.

Operational Read-only Data Store (ORDS) using Amazon Redshift

Here’s how our ORDS architecture implements Amazon Redshift data sharing to solve these challenges:

ORDS Architecture Diagram

Figure 1: Vanguard’s ORDS Architecture using Amazon Redshift Data Sharing

Amazon Redshift Ingestion Pattern:

We utilized Amazon Redshift’s zero-ETL functionality to integrate data and enable real-time analytics directly on operational data, which helped reduce complexity and maintenance overhead. To complement this capability and to fulfill our comprehensive compliance requirements that necessitate complete transaction replication, we implemented additional data ingestion pipelines.

Our data ingestion strategy for Amazon Redshift employs different AWS services depending on the source. For Amazon Aurora PostgreSQL databases, we use AWS Database Migration Service (AWS DMS) to directly replicate data into Amazon Redshift. For data from Amazon DynamoDB, we leverage Amazon Kinesis to stream the data into Amazon Redshift, where it lands in materialized views. These views are then further processed to generate tables for end-users.

This approach allows us to efficiently ingest data from our operational data stores while meeting both analytical needs and compliance requirements.

Amazon Redshift Data Sharing:

We used the Amazon Redshift’s data sharing feature to effectively decouple our data producers from consumers, allowing each group to operate within their own boundaries while maintaining a unified and simplified governed mechanism for data sharing.

Our implementation followed a clear process: once data is ingested and available in Amazon Redshift table format, we created views for consumers to access the data. We then established data shares and granted access to these views to consumer Amazon Redshift data warehouses for batch processing. In our environment with multiple bounded contexts, we’ve established a collaborative model where consumers work with various producer teams to access data from different data shares, each created per bounded context.

This access remained strictly read-only—when consumers need to update or write new data that falls outside their bounded context, they must use APIs or other designated mechanisms for such operations. This approach has proven effective for our organization, promoting clear data ownership and governance while enabling flexible data access across organizational boundaries. It simplified our data management and made sure each team can operate independently while still sharing data effectively.

Example: VG couple of cross bounded context

Disclaimer: This is provided for reference purposes only and does not represent a real example.

Let’s look at a practical example: our brokerage account statement generation process. This cross-bounded context batch process requires integrating data from multiple sources, accessing hundreds of tables and processing large volumes of data monthly. The challenge was to create an efficient, cost-effective solution that minimizes data replication while maintaining data accessibility.ORDS proved ideal for this use case, as it provides data from multiple bounded contexts without replication, offers near real-time access, and enables straightforward data aggregation using SQL-like queries in Amazon Redshift.

The following diagram shows how we implemented this solution:

ORDS example

Figure 2: Cross-Bounded Context Example for Brokerage Account Statement Generation

We need the following bounded contexts to generate brokerage statements for millions of our clients.

  1. Account:
    • Details: Includes information about the client’s brokerage accounts, such as account numbers, types, and statuses.
    • Holdings and Positions: Provides current holdings and positions within the account, detailing the securities owned, their quantities, and current market values.
    • Balance Information: Contains the balance information of the account, including cash balances, margin balances, and total account value.
  2. Client Profile:
    • Personal Information: Information about the client, such as their name, date of birth, and social security number.
    • Contact Information: Includes the client’s email address, physical address, and phone numbers.
  3. Transaction History:
    • Transaction Records: A comprehensive record of transactions associated with the account, including buys, sales, transfers, and dividends.
    • Transaction Details: Each transaction record includes details such as transaction date, type, quantity, price, and associated fees.
    • Historical Data: Historical data of transactions over time, providing a complete view of the account’s activity.

Through this architecture, we efficiently generate accurate and comprehensive brokerage account statements by consolidating data from these bounded contexts, meeting both our clients’ needs and regulatory requirements.

Business Outcome

Our journey with the Operational Read-only Data Store (ORDS) and Amazon Redshift has enhanced our client experience (CX) through improved data management and accessibility. By transitioning from our mainframe system to a cloud-based, domain-driven architecture, we have empowered our autonomous teams and established a resilient batch architecture.

This shift facilitates efficient cross-domain data access, maintains high-quality data consistency, and provides scalability. Our ORDS implementation, supported by Amazon Redshift, offers near-real-time access to large data volumes, guaranteeing high performance, reliability, and cost-effectiveness. This modernization effort aligns with our mission to deliver exceptional, personalized client experiences and sustain long-lasting client relationships.

Call to Action

If you are facing similar challenges with your batch processing systems, we encourage you to explore how an Operational Read-only Data Store (ORDS) can transform your data architecture. Start by assessing your current system’s limitations and identifying opportunities for improvement through domain-driven design and cloud-based solutions. Consider how this approach can help you manage large volumes of data from multiple sources, provide fast access to replicated data for batch processes, and support high-volume reads from various data sources.

Take the next step by conducting a proof of concept (POC) to evaluate ORDS effectiveness in achieving efficient cross-domain data access, improving the performance of batch jobs, and maintaining clear data ownership within your business domains. By implementing this solution, you can enhance your data management capabilities, reduce operational risks, and drive innovation within your organization. Embrace this opportunity to elevate your data architecture and deliver exceptional customer experiences.

Conclusion 

Our transition to a cloud-native, domain-driven architecture with ORDS using Amazon Redshift has successfully transformed our batch processing capabilities in AWS cloud. This modernization effort has significantly enhanced the performance, reliability, and scalability of our batch operations while maintaining seamless data access and integration across different business domains.

The strategic adoption of ORDS has harnessed the potential of cross-domain data access in a distributed environment, providing us with a robust solution for real-time data access and efficient batch processing. This transformation has empowered us to better meet the demands of the digital age, delivering superior customer experiences and reinforcing our commitment to innovation in the financial services industry.


About the authors

Malav Shah

Malav Shah

Malav is a Domain Architect in Vanguard’s Personal Investor Technology division, with over a decade of experience in cloud-native solutions. He focuses on architecting and designing scalable systems, and contributes hands-on through development and proof-of-concept work. Malav holds multiple AWS certifications, including AWS Certified Solutions Architect and AWS Certified AI Practitioner.

Timothy Dickens

Timothy Dickens

Timothy is a Senior Architect at Vanguard, specializing in advanced data streaming designs, AI, real-time data access, and analytics. With expertise in AWS services like Redshift, DynamoDB, and Aurora Postgres, Timothy excels in creating robust distributed architectures that drive innovation and efficiency. Passionate about leveraging cutting-edge technologies, Timothy is dedicated to delivering trustworthy, actionable data that empowers confident, timely decision-making.

Priyadharshini Selvaraj

Priyadharshini Selvaraj

Priyadharshini is a data architect with AWS Professional Services, bringing over a decade of expertise in helping customers navigate their data journeys. She specializes in data migration and modernization projects, focusing on data lakes, data warehouses, and distributed processing using Apache Spark. As an expert in Generative AI and agentic architectures, Priyadharshini enables customers to harness cutting-edge AI technologies for business transformation. Beyond her technical pursuits, she practices yoga, plays piano and enjoys hobby baking, bringing balance to her professional life.

Naresh Rajaram

Naresh Rajaram

Naresh is a seasoned Solutions Architect with over two decades of experience, with primary focus in cloud computing and artificial intelligence. Specializing in enterprise-scale AI implementations and cloud architecture, he is helping customers develop and deploy advanced AI solutions, with particular focus on autonomous AI systems and agent-based architectures. His expertise spans designing cutting-edge AI infrastructures using Amazon Bedrock, Amazon Bedrock AgentCore, and cloud-native AI services, while pioneering work in Agentic AI applications and autonomous systems.

© 2025 The Vanguard Group, Inc. All rights reserved.

How Laravel Nightwatch handles billions of observability events in real time with Amazon MSK and ClickHouse Cloud

Post Syndicated from Masudur Rahaman Sayem original https://aws.amazon.com/blogs/big-data/how-laravel-nightwatch-handles-billions-of-observability-events-in-real-time-with-amazon-msk-and-clickhouse-cloud/

Laravel, one of the world’s most popular web frameworks, launched its first-party observability platform, Laravel Nightwatch, to provide developers with real-time insights into application performance. Built entirely on AWS managed services and ClickHouse Cloud, the service already processes over one billion events per day while maintaining sub-second query latency, giving developers instant visibility into the health of their applications.

By combining Amazon Managed Streaming for Apache Kafka (Amazon MSK) with ClickHouse Cloud and AWS Lambda, Laravel Nightwatch delivers high-volume, low-latency monitoring at scale, while maintaining the simplicity and developer experience Laravel is known for.

The challenge: Delivering real-time monitoring for a global developer community

The Laravel framework powers millions of applications worldwide, serving billions of requests each month. Each request can generate potentially hundreds of observability events, such as database queries, queued jobs, cache lookups, emails, notifications, and exceptions. For Nightwatch’s launch, Laravel anticipated instant adoption from its global community, with tens of thousands of applications sending events around the clock from day one.

Laravel Nightwatch needed an architecture that could:

  • Ingest millions of JSON events per second from customer applications reliably.
  • Provide sub-second analytical queries for real-time dashboards.
  • Scale horizontally to handle unpredictable traffic spikes.
  • Deliver all of this in a cost-effective, low-maintenance manner.

The challenge was to process data on a global scale and provide deep insights into application health without compromising on a straightforward setup experience for developers.

The solution: A decoupled streaming and analytics pipeline

Laravel Nightwatch implemented a dual-database, streaming-first architecture, shown in the preceding figure, that separates transactional and analytical workloads.

  • Transactional workloads – user accounts, organization settings, billing, and similar workloads run on Amazon RDS for PostgreSQL.
  • Analytical workloads – telemetry events, metrics, query logs, and request traces are handled by ClickHouse Cloud.

Key components

The key components of the solution include the following:

  1. Ingestion layer
    • Amazon API Gateway receives telemetry from Laravel agents embedded in customer applications
    • Lambda validates and enriches events. Validated and enriched events are published to Amazon MSK, partitioned for scalability
  2. Streaming to analytics
    • ClickPipes in ClickHouse Cloud subscribe directly to MSK topics, reducing the need to build and manage extract, transform, and load (ETL) pipelines
    • Materialized views in ClickHouse pre-aggregate and transform raw JSON into query-ready formats
  3. Dashboards and delivery

Why Amazon MSK and ClickHouse Cloud?

Nightwatch requires a durable, horizontally scalable, and low maintenance streaming backbone.

With Amazon MSK Express brokers, we have achieved over 1 million events per second during load testing, benefiting from low-latency, elastic scaling, and simplified operations. MSK Express brokers require no storage sizing or provisioning, scale up to 20 times faster, and recover 90% quicker than standard Apache Kafka brokers—all while enforcing best-practice defaults and client quotas for reliable performance. Its seamless integration with other AWS services—such as Lambda, Amazon Simple Storage Service (Amazon S3), and Amazon CloudWatch—made it straightforward to build a resilient, end-to-end streaming architecture.

To ingest and transform these events in real time, Nightwatch uses ClickHouse Cloud and its managed integration platform, ClickPipes. ClickHouse Cloud excels at analytical workloads by delivering up to 100 times faster query performance for analytics compared to traditional row-based databases. Its advanced compression algorithms provide up to 90% storage savings, significantly reducing infrastructure costs while maintaining high performance. With its columnar architecture and optimized execution engine, ClickHouse Cloud can query billions of rows in under 1 second, enabling Laravel Nightwatch to serve real-time dashboards and analytics at global scale.

By integrating Amazon MSK and ClickHouse using ClickPipes, Laravel also reduced the operational burden of building and managing ETL pipelines, reducing latency and complexity.

Overcoming challenges

Testing complexity

While synthetic benchmarking and test datasets yield useful results, a more realistic workload is required to rigorously test infrastructure and code before deployment to production. The team used Terraform to manage infrastructure alongside application code, creating multiple dev and test environments, and allowing them to test the platform internally with their own applications before each release.

Multi-region infrastructure

The need to cater to multiple data storage regions also brought challenges—with latency, complexity, and cost the foremost concerns. However, the AWS, ClickHouse Cloud, and Cloudflare stack made available a powerful set of networking tools and scaling options. While VPC peering, RDS replication, and global server load balancing did the heavy lifting on the networking side, the ability to scale and right-size each resource kept costs to a minimum.

Query performance at scale

Materialized views, intelligent time-series partitioning, and specialized ClickHouse codecs helped ensure that queries remained sub-second even as data volumes grew into the billions. Meanwhile, compute separation allowed distinct workloads to scale separately while accessing the same data, with clusters right-sized horizontally and vertically depending on the requirements of each load.

Results

Laravel Nightwatch’s launch exceeded expectations:

  • 5,300 users registered in the first 24 hours
  • 500 million events processed on day one
  • 97 ms average dashboard request latency
  • 760,000 exceptions logged and analyzed in real time

By building on Amazon MSK and ClickHouse Cloud, we were able to scale from zero to billions of events without sacrificing performance or developer experience.

What’s next

Laravel plans to expand Nightwatch with:

  • More regions to cater to customers with data sovereignty requirements outside the US and EU
  • Broader data collection to provide even deeper insight into customers’ applications
  • SOC 2 certification to cater to customers with tighter compliance requirements
  • More advanced monitoring and analysis to identify issues before they affect users

The current architecture comfortably supports applications of all sizes, from hobby to enterprise (including a generous free tier), and is designed to handle over one trillion monthly events without performance degradation.

Conclusion

Laravel Nightwatch demonstrates how Amazon MSK, ClickHouse Cloud, and AWS serverless technologies can be combined to build a cost-effective, real-time monitoring platform at global scale. By designing for scale from day one, Laravel delivered sub-second analytics across billions of events, while maintaining the developer-friendly experience their community expects.


About the authors

Jess Archer

Jess Archer

Jess is an Engineering Manager and Head of Nightwatch at Laravel, focusing on application observability, performance monitoring, and developer experience. She leads the Nightwatch team while staying hands-on in the codebase. Prior to Laravel, Jess worked on clinical data collection platforms, software for law enforcement, and anti-phishing solutions in banking. She later contributed extensively to Laravel’s open-source ecosystem before moving into her current leadership role. Jess is deeply passionate about open source and creating tools that make developers more productive.

James Carpenter

James Carpenter

James is a Senior Infrastructure Engineer joined Laravel in 2024 as Infrastructure Lead for the Nightwatch team, bringing experience from 15 years in sport and healthcare. Specialising in DevOps and Infrastructure, he is passionate about solving complex problems and creating exceptional experiences for both customers and developers.

Johnny Mirza

Johnny Mirza

Johnny is a Solution Architect with ClickHouse, working with users across APAC. With over 20 years of background in solutions engineering, he’s experienced in architecting and enabling solutions for enterprise clients in the telecommunications, media, insurance, and financial services sectors. Johnny has a high level of expertise of integration between both public cloud and on-premise infrastructure, while focussing on service assurance, monitoring platforms, and open-source technologies. Prior to ClickHouse, Johnny was part of the solution engineering teams at Confluent, Splunk, and Optus, to name a few.

Masudur Rahaman Sayem

Masudur Rahaman Sayem

Masudur is a Streaming Data Architect at AWS with over 25 years of experience in the IT industry. He collaborates with AWS customers worldwide to architect and implement sophisticated data streaming solutions that address complex business challenges. As an expert in distributed computing, Sayem specializes in designing large-scale distributed systems architecture for maximum performance and scalability. He has a keen interest and passion for distributed architecture, which he applies to designing enterprise-grade solutions at internet scale.

Optimize Amazon EMR runtime for Apache Spark with EMR S3A

Post Syndicated from Giovanni Matteo Fumarola original https://aws.amazon.com/blogs/big-data/optimize-amazon-emr-runtime-for-apache-spark-with-emr-s3a/

With the Amazon EMR 7.10 runtime, Amazon EMR has introduced EMR S3A, an improved implementation of the open source S3A file system connector. This enhanced connector is now automatically set as the default S3 file system connector for Amazon EMR deployment options, including Amazon EMR on EC2, Amazon EMR Serverless, Amazon EMR on Amazon EKS, and Amazon EMR on AWS Outposts, maintaining complete API compatibility with open source Apache Spark.

In the Amazon EMR 7.10 runtime for Apache Spark, the EMR S3A connector exhibits performance comparable to EMRFS for read workloads, as demonstrated by TPC-DS query benchmark. The connector’s most significant performance gains are evident in write operations, with a 7% improvement in static partition overwrites and a 215% improvement for dynamic partition overwrites when compared to EMRFS. In this post, we showcase the enhanced read and write performance advantages of using Amazon EMR 7.10.0 runtime for Apache Spark with EMR S3A as compared to EMRFS and the open source S3A file system connector.

Read workload performance comparison

To evaluate the read performance, we used a test environment based on Amazon EMR runtime version 7.10.0 running Spark 3.5.5 and Hadoop 3.4.1. Our testing infrastructure featured an Amazon Elastic Compute Cloud (Amazon EC2) cluster comprised of nine r5d.4xlarge instances. The primary node has 16 vCPU and 128 GB memory, and the eight core nodes have a total of 128 vCPU and 1024 GB memory.

The performance evaluation was conducted using a comprehensive testing methodology designed to provide accurate and meaningful results. For the source data, we chose the 3 TB scale factor, which contains 17.7 billion records, approximately 924 GB of compressed data partitioned in Parquet file format. The setup instructions and technical details can be found in the GitHub repository. We used Spark’s in-memory data catalog to store metadata for TPC-DS databases and tables.

To produce a fair and accurate comparison between EMR S3A vs. EMRFS and open source S3A implementations, we implemented a three-phase testing approach:

  • Phase 1: Baseline performance:
    • Established a baseline using default Amazon EMR configuration with EMR’s S3A connector
    • Created a reference point for subsequent comparisons
  • Phase 2: EMRFS analysis:
    • Maintained the default file system as EMRFS
    • Preserved other configuration settings
  • Phase 3: Open source S3A testing:
    • Modified only the hadoop-aws.jar file by replacing it with the open source Hadoop S3A 3.4.1 version
    • Maintained identical configurations across other components

This controlled testing environment was crucial for our evaluation for the following reasons:

  • We could isolate the performance impact specifically to the S3A connector implementation
  • It removed potential variables that could skew the results
  • It provided accurate measurements of performance improvements between Amazon’s S3A implementation and the open source alternative

Test execution and results

Throughout the testing process, we maintained consistency in test conditions and configurations, making sure any observed performance differences could be directly attributed to the S3A connector implementation variations. A total of 104 SparkSQL queries were run in 10 iterations sequentially, and an average of each query’s runtime in these 10 iterations was used for comparison. The average of the 10 iterations’ runtime on the Amazon EMR 7.10 runtime for Apache Spark with EMR S3A was 1116.87 seconds, which is 1.08 times faster than open source S3A and comparable with EMRFS. The following figure illustrates the total runtime in seconds.

The following table summarizes the metrics.

Metric OSS S3A EMRFS EMR S3A
Average runtime in seconds 1208.26 1129.64 1116.87
Geometric mean over queries in seconds 7.63 7.09 6.99
Total cost * $6.53 $6.40 $6.15

*Detailed cost estimates are discussed later in this post.

The following chart demonstrates the per-query performance improvement of EMR S3A relative to open source S3A on the Amazon EMR 7.10 runtime for Apache Spark. The extent of the speedup varies from one query to another, with the fastest up to 1.51 times faster for q3, with Amazon EMR S3A outperforming open source S3A. The horizontal axis arranges the TPC-DS 3TB benchmark queries in descending order based on the performance improvement seen with Amazon EMR, and the vertical axis depicts the magnitude of this speedup as a ratio.

Read cost comparison

Our benchmark outputs the total runtime and geometric mean figures to measure the Spark runtime performance. The cost metric can provide us with additional insights. Cost estimates are computed using the following formulas. They factor in Amazon EC2, Amazon Elastic Block Store (Amazon EBS), and Amazon EMR costs, but don’t include Amazon Simple Storage Service (Amazon S3) GET and PUT costs.

  • Amazon EC2 cost (include SSD cost) = number of instances * r5d.4xlarge hourly rate * job runtime in hours
    • r5d.4xlarge hourly rate = $1.152 per hour
  • Root Amazon EBS cost = number of instances * Amazon EBS per GB-hourly rate * root EBS volume size * job runtime in hours
  • Amazon EMR cost = number of instances * r5d.4xlarge Amazon EMR cost * job runtime in hours
    • r5d.4xlarge Amazon EMR cost = $0.27 per hour
  • Total cost = Amazon EC2 cost + root Amazon EBS cost + Amazon EMR cost

The following table summarizes these costs.

Metric EMRFS EMR S3A OSS S3A
Runtime in hours 0.5 0.48 0.51
Number of EC2 instances 9 9 9
Amazon EBS size 0 gb 0 gb 0 gb
Amazon EC2 cost $5.18 $4.98 $5.29
Amazon EBS cost $0.00 $0.00 $0.00
Amazon EMR cost $1.22 $1.17 $1.24
Total cost $6.40 $6.15 $6.53
Cost savings Baseline EMR S3A is 1.04 times better than EMRFS EMR S3A is 1.06 times better than OSS S3A

Write workload performance comparison

We conducted benchmark tests to assess the write performance of the Amazon EMR 7.10 runtime for Apache Spark.

Static table/partition overwrite

We evaluated the static table/partition overwrite write performance of the different file system by executing the following INSERT OVERWRITE Spark SQL query. The SELECT * FROM range(...) clause generated data at execution time. This produced approximately 15 GB of data across exactly 100 Parquet files in Amazon S3.

SET rows=4e9; -- 4 Billion
SET partitions=100;
INSERT OVERWRITE DIRECTORY 's3://${bucket}/perf-test/${trial_id}'
USING PARQUET SELECT * FROM range(0, ${rows}, 1, ${partitions});

The test environment was configured as follows:

  • EMR cluster with emr-7.10.0 release label
  • Single m5d.2xlarge instance (primary group)
  • Eight m5d.2xlarge instances (core group)
  • S3 bucket in the same AWS Region as the EMR cluster
  • The trial_id property used a UUID generator to avoid conflict between test runs

Results

After running 10 trials for each file system, we captured and summarized query runtimes in the following chart. Whereas EMR S3A averaged only 26.4 seconds, the EMRFS and open source S3A averaged 28.4 seconds and 31.4 seconds—a 1.07 times and 1.19 times improvement, respectively.

Dynamic partition overwrite

We also evaluated the write performance by executing the following INSERT OVERWRITE dynamic partition Spark SQL query, which joins TPC-DS 3TB partitioned Parquet data of the table web_sales and date_dim tables, which inserts approximately 2,100 partitions, where each partition contains one Parquet file with a combined size of approximately 31.2 GB in Amazon S3.

SET spark.sql.sources.partitionOverwriteMode=DYNAMIC;
INSERT OVERWRITE TABLE <TABLE_NAME> PARTITION(wsdt_year,wsdt_month, wsdt_day) 
SELECT ws_order_number,ws_quantity,ws_list_price,ws_sales_price,
ws_net_paid_inc_ship_tax,ws_net_profit,dt.d_year as wsdt_year,dt.d_moy 
as wsdt_month,dt.d_dom as wsdt_day FROM web_sales, date_dim dt 
WHERE ws_sold_date_sk = d_date_sk;

The test environment was configured as follows:

  • EMR cluster with emr-7.10.0 release label
  • Single r5d.4xlarge instance (master group)
  • Five r5d.4xlarge instances (core group)
  • Approximately 2,100 partitions with one Parquet file each
  • Combined size of approximately 31.2 GB in Amazon S3

Results

After running 10 trials for each file system, we captured and summarized query runtimes in the following chart. Whereas EMR S3A averaged only 90.9 seconds, the EMRFS and open source S3A averaged 286.4 seconds and 1,438.5 seconds—a 3.15 times and 15.82 times improvement, respectively.

Summary

Amazon EMR consistently enhances its Apache Spark runtime and S3A connector, delivering continuous performance improvements that help big data customers execute analytics workloads more cost-effectively. Beyond performance gains, the strategic shift to S3A introduces critical advantages, including enhanced standardization, improved cross-platform portability, and robust community-driven support—all while maintaining or surpassing the performance benchmarks established by the previous EMRFS implementation.

We recommend that you stay up-to-date with the latest Amazon EMR release to take advantage of the latest performance and feature benefits. Subscribe to the AWS Big Data Blog’s RSS feed to learn more about the Amazon EMR runtime for Apache Spark, configuration best practices, and tuning advice.


About the authors

Giovanni Matteo Fumarola

Giovanni Matteo Fumarola

Giovanni is the Senior Manager for the Amazon EMR Spark and Iceberg group. He is an Apache Hadoop Committer and PMC member. He has been focusing in the big data analytics space since 2013.

Sushil Kumar Shivashankar

Sushil Kumar Shivashankar

Sushil is the Engineering Manager for the Amazon EMR Hadoop and Flink team at Amazon Web Services. With a focus on big data analytics since 2014, he leads development, optimizations, and growth strategies for Hadoop and Flink business in Amazon EMR.

Narayanan Venkateswaran

Narayanan Venkateswaran

Narayanan is a Senior Software Development Engineer in the Amazon EMR group. He works on developing Hadoop components in Amazon EMR. He has over 20 years of work experience in the industry across several companies, including Sun Microsystems, Microsoft, Amazon, and Oracle. Narayanan also holds a PhD in databases with a focus on horizontal scalability in relational stores.

Syed Shameerur Rahman

Syed Shameerur Rahman

Syed is a Software Development Engineer at Amazon EMR. He is interested in highly scalable, distributed computing. He is an active contributor of open source projects like Apache Hive, Apache Tez, Apache ORC, and Apache Hadoop, and has contributed important features and optimizations. During his free time, he enjoys exploring new places and trying new foods.

Rajarshi Sarkar

Rajarshi Sarkar

Rajarshi is a Software Development Engineer at Amazon EMR. He works on cutting-edge features of Amazon EMR and is also involved in open source projects such as Apache Hive, Iceberg, Trino, and Hadoop. In his spare time, he likes to travel, watch movies, and hang out with friends.

Unlock the power of Apache Iceberg v3 deletion vectors on Amazon EMR

Post Syndicated from Arun Shanmugam original https://aws.amazon.com/blogs/big-data/unlock-the-power-of-apache-iceberg-v3-deletion-vectors-on-amazon-emr/

As modern data architectures expand, Apache Iceberg has become a widely popular open table format, providing ACID transactions, time travel, and schema evolution. In table format v2, Iceberg introduced merge-on-read, improving delete and update handling through positional delete files. These files improve write performance but can slow down reads when not compacted, since Iceberg must merge them during query execution to return the latest snapshot. Iceberg v3 enhances merge performance during reads by replacing positional delete files with deletion vectors for handling row-level deletes in Merge-on-Read (MoR) tables. This change deprecates the use of positional delete files in v3, which marked specific row positions as deleted, in favor of the more efficient deletion vectors.

In this post, we compare and evaluate the performance of the new binary deletion vectors in Iceberg v3 with respect to traditional position delete files of Iceberg v2 using Amazon EMR version 7.10.0 with Apache Spark 3.5.5. We provide insights into the practical impacts of these advanced row-level delete mechanisms on data management efficiency and performance.

Understanding binary deletion vectors and Puffin files

Binary deletion vectors stored in Puffin files use compressed bitmaps to efficiently represent which rows have been deleted within a data file. In contrast, previous Iceberg versions (v2) relied on positional delete files—Parquet files that enumerated rows to delete by file and position. This older approach resulted in many small delete files, which placed a heavy burden on query engines due to numerous file reads and costly in-memory conversions. Puffin files reduce this overhead by compactly encoding deletions, improving query performance and resource utilization.

Iceberg v3 improves this in the following aspects:

  • Reduced I/O – Fewer small delete files lower metadata overhead by introducing deletion vectors—compressed bitmaps that efficiently represent deleted rows. These vectors are stored persistently in Puffin files, a compact binary format optimized for low-latency access.
  • Query performance – Bitmap-based deletion vectors enable faster scan filtering by allowing multiple vectors to be stored in a single Puffin file. This reduces metadata and file count overhead while preserving file-level granularity for efficient reads. The design supports continuous merging of deletion vectors, promoting ongoing compaction that maintains stable query performance and reduces fragmentation over time. It removes the trade-off between partition-level and file-level delete granularity seen in v2, enabling consistently fast reads even in heavy-update scenarios.
  • Storage efficiency – Iceberg v3 uses a compressed binary format instead of verbose Parquet positioning. Engines maintain a single deletion vector per data file at write time, enabling better compaction and consistent query performance.

Solution overview

To explore the performance characteristics of delete operations in Iceberg v2 and v3, we use PySpark to run our comparison tests focusing on delete operation runtime and delete file size. This implementation helps us effectively benchmark and compare the deletion mechanisms between Iceberg v2’s position-delete files using Parquet and v3’s newer Puffin-based deletion vectors.

Our solution demonstrates how to configure Spark with the AWS Glue Data Catalog and Iceberg, create tables, and run delete operations programmatically. We first create Iceberg tables with format versions 2 and 3, insert 10,000 rows, then perform delete operations on a range of record IDs. We also perform table compaction and then measure delete operation runtime and size and count of associated delete files.

In Iceberg v3, deleting rows introduces binary deletion vectors stored in Puffin files (compact binary sidecar files). These allow more efficient query planning and faster read performance by consolidating deletes and avoiding large numbers of small files.

For this test, the Spark job was submitted by SSH’ing into the EMR cluster and using spark-submit directly from the shell, with the required Iceberg JAR file being referenced directly from the Amazon Simple Storage Service (Amazon S3) bucket in the submission command. When running the job, make sure you provide your S3 bucket name. See the following code:

spark-submit --jars s3://< S3-BUCKET-NAME >/iceberg/jars/iceberg-spark-runtime-3.5_2.12-1.9.2.jar v3_deletion_vector_test.py

Prerequisites

To follow along with this post, you must have the following prerequisites:

  • Amazon EMR on Amazon EC2 with version 7.10.0 integrated with the Glue Data Catalog, which includes Spark 3.5.5.
  • The Iceberg 1.9.2 JAR file from the official Iceberg documentation, which includes important deletion vector improvements such as v2 to v3 rewrites and dangling deletion vector detection. Optionally, you can use the default Iceberg 1.8.1-amzn-0 bundled with Amazon EMR 7.10 if these Iceberg 1.9.x improvements are not required.
  • An S3 bucket to store Iceberg data.
  • An AWS Identity and Access management (IAM) role for Amazon EMR configured with the necessary permissions.

The upcoming Amazon EMR 7.11 will ship with Iceberg 1.9.1-amzn-1, which includes deletion vector improvements such as v2 to v3 rewrites and dangling deletion vector detection. This means you no longer need to manually download or upload the Iceberg JAR file, because it will be included and managed natively by Amazon EMR.

Code walkthrough

The following PySpark script demonstrates how to create, write, compact, and delete records in Iceberg tables with two different format versions (v2 and v3) using the Glue Data Catalog as the metastore. The main goal is to compare both write and read performance, along with storage characteristics (delete file format and size) between Iceberg format versions 2 and 3.

The code performs the following functions:

  • Creates a SparkSession configured to use Iceberg with Glue Data Catalog integration.
  • Creates a synthetic dataset simulating user records:
    • Uses a fixed random seed (42) to provide consistent data generation
    • Creates identical datasets for both v2 and v3 tables for fair comparison
  • Defines the function test_read_performance(table_name) to perform the following actions:
    • Measure full table scan performance
    • Measure filtered read performance (with WHERE clause)
    • Track record counts for both operations
  • Defines the function test_iceberg_table(version, test_df) to perform the following actions:
    • Create or use an Iceberg table for the specified format version
    • Append data to the Iceberg table
    • Trigger Iceberg’s data compaction using a system procedure
    • Delete rows with IDs between 1000–1099
    • Collect statistics about inserted data files and delete-related files
    • Measure and record read performance metrics
    • Track operation timing for inserts, deletes, and reads
  • Defines a function to print a comprehensive comparative report including the following information:
    • Delete operation performance
    • Read performance (both full table and filtered)
    • Delete file characteristics (formats, counts, sizes)
    • Performance improvements as percentages
    • Storage efficiency metrics
  • Orchestrate the main execution flow:
    • Create a single dataset to ensure identical data for both versions
    • Clean up existing tables for fresh testing
    • Run tests for Iceberg format version 2 and version 3
    • Output a detailed comparison report
    • Handle exceptions and shut down the Spark session

See the following code:

from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, IntegerType, StringType
from pyspark.sql import functions as F
import time
import random
import logging
from pyspark.sql.utils import AnalysisException
# Logging
logging.basicConfig(level=logging.INFO, format='%(message)s')
logger = logging.getLogger(__name__)
# Constants
ROWS_COUNT = 10000
DELETE_RANGE_START = 1000
DELETE_RANGE_END = 1099
SAMPLE_NAMES = ["Alice", "Bob", "Charlie", "Diana",
                "Eve", "Frank", "Grace", "Henry", "Ivy", "Jack"]
# Spark Session
spark = (
    SparkSession.builder
    .appName("IcebergWithGlueCatalog")
    .config("spark.sql.extensions", "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions")
    .config("spark.sql.catalog.glue_catalog", "org.apache.iceberg.spark.SparkCatalog")
    .config("spark.sql.catalog.glue_catalog.catalog-impl", "org.apache.iceberg.aws.glue.GlueCatalog")
    .config("spark.sql.catalog.glue_catalog.warehouse", "s3://<S3-BUCKET-NAME>/blog/glue/")
    .config("spark.sql.catalog.glue_catalog.io-impl", "org.apache.iceberg.aws.s3.S3FileIO")
    .getOrCreate()
)
spark.sql("CREATE DATABASE IF NOT EXISTS glue_catalog.blog")
def create_dataset(num_rows=ROWS_COUNT):
    # Set a fixed seed for reproducibility
    random.seed(42)
    
    data = [(i,
             random.choice(SAMPLE_NAMES) + str(i),
             random.randint(18, 80))
            for i in range(1, num_rows + 1)]
    schema = StructType([
        StructField("id", IntegerType(), False),
        StructField("name", StringType(), True),
        StructField("age", IntegerType(), True)
    ])
    df = spark.createDataFrame(data, schema)
    df = df.withColumn("created_at", F.current_timestamp())
    return df
def test_read_performance(table_name):
    """Test read performance of the table"""
    start_time = time.time()
    count = spark.sql(f"SELECT COUNT(*) FROM glue_catalog.blog.{table_name}").collect()[0][0]
    read_time = time.time() - start_time
    
    # Test filtered read performance
    start_time = time.time()
    filtered_count = spark.sql(f"""
        SELECT COUNT(*) 
        FROM glue_catalog.blog.{table_name} 
        WHERE age > 30
    """).collect()[0][0]
    filtered_read_time = time.time() - start_time
    
    return read_time, filtered_read_time, count, filtered_count
def test_iceberg_table(version, test_df):
    try:
        table_name = f"iceberg_table_v{version}"
        logger.info(f"\n=== TESTING ICEBERG V{version} ===")
        spark.sql(f"""
            CREATE TABLE IF NOT EXISTS glue_catalog.blog.{table_name} (
                id int,
                name string,
                age int,
                created_at timestamp
            ) USING iceberg
            TBLPROPERTIES (
                'format-version'='{version}',
                'write.delete.mode'='merge-on-read'
            )
        """)
        start_time = time.time()
        test_df.writeTo(f"glue_catalog.blog.{table_name}").append()
        insert_time = time.time() - start_time
        logger.info("Compaction...")
        spark.sql(
            f"CALL glue_catalog.system.rewrite_data_files('glue_catalog.blog.{table_name}')")
        start_time = time.time()
        spark.sql(f"""
            DELETE FROM glue_catalog.blog.{table_name}
            WHERE id BETWEEN {DELETE_RANGE_START} AND {DELETE_RANGE_END}
        """)
        delete_time = time.time() - start_time
        files_df = spark.sql(
            f"SELECT COUNT(*) as data_files FROM glue_catalog.blog.{table_name}.files")
        delete_files_df = spark.sql(f"""
            SELECT COUNT(*) as delete_files,
                   file_format,
                   SUM(file_size_in_bytes) as total_size
            FROM glue_catalog.blog.{table_name}.delete_files
            GROUP BY file_format
        """)
        data_files = files_df.collect()[0]['data_files']
        delete_stats = delete_files_df.collect()
        # Add read performance testing
        logger.info("\nTesting read performance...")
        read_time, filtered_read_time, total_count, filtered_count = test_read_performance(table_name)
        
        logger.info(f"Insert time: {insert_time:.3f}s")
        logger.info(f"Delete time: {delete_time:.3f}s")
        logger.info(f"Full table read time: {read_time:.3f}s")
        logger.info(f"Filtered read time: {filtered_read_time:.3f}s")
        logger.info(f"Data files: {data_files}")
        logger.info(f"Total records: {total_count}")
        logger.info(f"Filtered records: {filtered_count}")
        if len(delete_stats) > 0:
            stats = delete_stats[0]
            logger.info(f"Delete files: {stats.delete_files}")
            logger.info(f"Delete format: {stats.file_format}")
            logger.info(f"Delete files size: {stats.total_size} bytes")
            return delete_time, stats.total_size, stats.file_format, read_time, filtered_read_time
        else:
            logger.info("No delete files found")
            return delete_time, 0, "N/A", read_time, filtered_read_time
    except AnalysisException as e:
        logger.error(f"SQL Error: {str(e)}")
        raise
    except Exception as e:
        logger.error(f"Error: {str(e)}")
        raise
def print_comparison_results(v2_results, v3_results):
    v2_delete_time, v2_size, v2_format, v2_read_time, v2_filtered_read_time = v2_results
    v3_delete_time, v3_size, v3_format, v3_read_time, v3_filtered_read_time = v3_results
    logger.info("\n=== PERFORMANCE COMPARISON ===")
    logger.info(f"v2 delete time: {v2_delete_time:.3f}s")
    logger.info(f"v3 delete time: {v3_delete_time:.3f}s")
    if v2_delete_time > 0:
        improvement = ((v2_delete_time - v3_delete_time) / v2_delete_time) * 100
        logger.info(f"v3 Delete performance improvement: {improvement:.1f}%")
    logger.info("\n=== READ PERFORMANCE COMPARISON ===")
    logger.info(f"v2 full table read time: {v2_read_time:.3f}s")
    logger.info(f"v3 full table read time: {v3_read_time:.3f}s")
    logger.info(f"v2 filtered read time: {v2_filtered_read_time:.3f}s")
    logger.info(f"v3 filtered read time: {v3_filtered_read_time:.3f}s")
    
    if v2_read_time > 0:
        read_improvement = ((v2_read_time - v3_read_time) / v2_read_time) * 100
        logger.info(f"v3 Read performance improvement: {read_improvement:.1f}%")
    
    if v2_filtered_read_time > 0:
        filtered_improvement = ((v2_filtered_read_time - v3_filtered_read_time) / v2_filtered_read_time) * 100
        logger.info(f"v3 Filtered read performance improvement: {filtered_improvement:.1f}%")
    logger.info("\n=== DELETE FILE COMPARISON ===")
    logger.info(f"v2 delete format: {v2_format}")
    logger.info(f"v2 delete size: {v2_size} bytes")
    logger.info(f"v3 delete format: {v3_format}")
    logger.info(f"v3 delete size: {v3_size} bytes")
    if v2_size > 0:
        size_reduction = ((v2_size - v3_size) / v2_size) * 100
        logger.info(f"v3 size reduction: {size_reduction:.1f}%")
# Main
try:
    # Create dataset once and reuse for both versions
    test_dataset = create_dataset()
    
    # Drop existing tables if they exist
    spark.sql("DROP TABLE IF EXISTS glue_catalog.blog.iceberg_table_v2")
    spark.sql("DROP TABLE IF EXISTS glue_catalog.blog.iceberg_table_v3")
    
    # Test both versions with the same dataset
    v2_results = test_iceberg_table(2, test_dataset)
    v3_results = test_iceberg_table(3, test_dataset)
    print_comparison_results(v2_results, v3_results)
finally:
    spark.stop()

Results summary

The output generated by the code includes the results summary section that shows several key comparisons, as shown in the following screenshot. For delete operations, Iceberg v3 uses the Puffin file format compared to Parquet in v2, resulting in significant improvements. The delete operation time decreased from 3.126 seconds in v2 to 1.407 seconds in v3, achieving a 55.0% performance improvement. Additionally, the delete file size was reduced from 1801 bytes using Parquet in v2 to 475 bytes using Puffin in v3, representing a 73.6% reduction in storage overhead. Read operations also saw notable improvements, with full table reads 28.5% faster and filtered reads 23% faster in v3. These improvements demonstrate the efficiency gains from v3’s implementation of binary deletion vectors through the Puffin format.

style=

The actual measured performance and storage improvements depend on workload and environment and might differ from the preceding example.

This following screenshot from the S3 bucket demonstrates a Puffin delete file stored alongside data files.

style=

Clean up

After you finish your tests, it’s important to clean up your environment to avoid unnecessary costs:

  1. Drop the test tables you created to remove associated data from your S3 bucket and prevent ongoing storage charges.
  2. Delete any temporary data left in the S3 bucket used for Iceberg data.
  3. Delete the EMR cluster to stop billing for running compute resources.

Cleaning up resources promptly helps maintain cost-efficiency and resource hygiene in your AWS environment.

Considerations

Iceberg features are introduced through a phased process: first in the specification, then in the core library, and finally in engine implementations. Deletion vector support is currently available in the specification and core library, with Spark being the only supported engine. We validated this capability on Amazon EMR 7.10 with Spark 3.5.5.

Conclusion

Iceberg v3 introduces a significant advancement in managing row-level deletes for merge-on-read operations through binary deletion vectors stored in compact Puffin files. Our performance tests, conducted with Iceberg 1.9.2 on Amazon EMR 7.10.0 and EMR Spark 3.5.5, show clear improvements in both delete operation speed and read performance, along with a considerable reduction in delete file storage compared to Iceberg v2’s positional delete Parquet files. For more information about deletion vectors, refer to Iceberg v3 deletion vectors.


About the authors

Arun Shanmugam

Arun Shanmugam

Arun is a Senior Analytics Solutions Architect at AWS, with a focus on building modern data architecture. He has been successfully delivering scalable data analytics solutions for customers across diverse industries. Outside of work, Arun is an avid outdoor enthusiast who actively engages in CrossFit, road biking, and cricket.

Suthan Phillips

Suthan Phillips

Suthan is a Senior Analytics Architect at AWS, where he helps customers design and optimize scalable, high-performance data solutions that drive business insights. He combines architectural guidance on system design and scalability with best practices to provide efficient, secure implementation across data processing and experience layers. Outside of work, Suthan enjoys swimming, hiking, and exploring the Pacific Northwest.

Kinshuk Paharae

Kinshuk Paharae

Kinshuk is head of product for data processing, leading product teams for AWS Glue, Amazon EMR, and Amazon Athena. He has been with AWS for over 5 years.

Linda OConnor

Linda OConnor

Linda is a Seasoned Go-To-Market Leader with close to three decades of experience driving growth strategies in the data and analytics space. At AWS, she currently leads pan analytics initiatives including lakehouse architectures, helping customers transform their existing landscapes through non-disruptive innovation. She previously served as Global Vice President at a German software company for 25 years, where she spearheaded Data Warehousing and Big Data portfolios, orchestrating successful product launches and driving global market expansion.