
Apache Spark encryption performance improvement with Amazon EMR 7.9

Post Syndicated from Sonu Kumar Singh original https://aws.amazon.com/blogs/big-data/apache-spark-encryption-performance-improvement-with-amazon-emr-7-9/

The Amazon EMR runtime for Apache Spark is a performance-optimized runtime for Apache Spark that is 100% API compatible with open source Apache Spark. With Amazon EMR release 7.9.0, the EMR runtime for Apache Spark introduces significant performance improvements for encrypted workloads, supporting Spark version 3.5.5.

For compliance and security requirements, many customers need to enable Apache Spark’s local storage encryption (spark.io.encryption.enabled = true) in addition to Amazon Simple Storage Service (Amazon S3) encryption (such as server-side encryption (SSE) or AWS Key Management Service (AWS KMS)). This feature encrypts shuffle files, cached data, and other intermediate data written to local disk during Spark operations, protecting sensitive data at rest on Amazon EMR cluster instances.

Industries subject to regulations such as the Health Insurance Portability and Accountability Act (HIPAA) for healthcare, Payment Card Industry Data Security Standard (PCI-DSS) for financial services, General Data Protection Regulation (GDPR) for personal data, and Federal Risk and Authorization Management Program (FedRAMP) for government often require encryption of all data at rest, including temporary files on local storage. While Amazon S3 encryption protects data in object storage, Spark’s I/O encryption secures the intermediate shuffle and spill data that Spark writes to local disk during distributed processing—data that never reaches Amazon S3 but might contain sensitive information extracted from source datasets. Generally, encrypted operations require additional computational overhead that can impact overall job performance.

With the built-in encryption optimizations of Amazon EMR 7.9.0, customers might see significant performance improvements in their Apache Spark applications without requiring any application changes. In our performance benchmark tests, derived from TPC-DS performance tests at 3 TB scale, we observed up to 20% faster performance with the EMR 7.9 optimized Spark runtime compared to Spark without these optimizations. Individual results may vary depending on specific workloads and configurations.

In this post, we analyze the results from our benchmark tests comparing the Amazon EMR 7.9 optimized Spark runtime against Spark 3.5.5 without encryption optimizations. We walk through a detailed cost analysis and provide step-by-step instructions to reproduce the benchmark.

Results observed

To evaluate the performance improvements, we used an open source Spark performance test utility derived from the TPC-DS performance test toolkit. We ran the tests on two nine-node (eight core nodes and one primary node) r5d.4xlarge Amazon EMR 7.9.0 clusters, comparing two configurations:

  • Baseline: EMR 7.9.0 cluster with a bootstrap action installing Spark 3.5.5 without encryption optimizations
  • Optimized: EMR 7.9.0 cluster using the EMR Spark 3.5.5 runtime with encryption optimizations

Both tests used data stored in Amazon Simple Storage Service (Amazon S3). All data processing was configured identically except for the Spark runtime.

To keep the comparison consistent and equivalent, we disabled Dynamic Resource Allocation (DRA) in both test configurations. This eliminates variability from dynamic scaling, so we can measure pure computational performance improvements.
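As an illustration of how both runs can be pinned to the same settings, the following minimal Python sketch turns a dictionary of Spark properties into the --conf arguments that spark-submit accepts. The property spark.io.encryption.enabled comes from this post. The property spark.dynamicAllocation.enabled is the standard Spark setting for DRA, but using it here is an assumption, because the post doesn't show how DRA was disabled. The helper name build_conf_args is hypothetical.

```python
def build_conf_args(conf):
    """Flatten a dict of Spark properties into spark-submit --conf arguments."""
    args = []
    for key in sorted(conf):  # sorted so both runs get the same argument order
        args.extend(["--conf", f"{key}={conf[key]}"])
    return args


# Settings shared by the baseline and optimized runs (illustrative values).
SHARED_CONF = {
    "spark.io.encryption.enabled": "true",       # local storage encryption, as described in this post
    "spark.dynamicAllocation.enabled": "false",  # assumption: standard property that turns off DRA
}

print(build_conf_args(SHARED_CONF))
```

Keeping the shared settings in one place makes it easier to confirm that the two clusters differ only in the Spark runtime.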

The following table shows the total job runtime for all queries (in seconds) in the 3 TB query dataset between the baseline and Amazon EMR 7.9 optimized configurations:

Configuration | Total runtime (seconds) | Geometric mean (seconds) | Performance improvement
Baseline (Spark 3.5.5 without optimization) | 1,485 | 10.24 | –
EMR 7.9 (with encryption optimization) | 1,176 | 8.15 | 20% faster

We observed that our TPC-DS tests with the Amazon EMR 7.9 optimized Spark runtime completed about 20% faster based on total runtime and 20% faster based on geometric mean compared to the baseline configuration.
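The percentages can be checked directly from the table values. The following short Python sketch assumes "faster" means the relative reduction in runtime against the baseline; the function name percent_faster is ours, not part of the benchmark utility.

```python
def percent_faster(baseline, optimized):
    """Relative reduction in runtime, as a percentage of the baseline."""
    return (baseline - optimized) / baseline * 100


total = percent_faster(1485, 1176)     # total runtime, in seconds
geomean = percent_faster(10.24, 8.15)  # geometric mean, in seconds

print(f"total runtime: {total:.1f}% faster")
print(f"geometric mean: {geomean:.1f}% faster")
```

Both values come out slightly above 20%, consistent with the rounded figure reported in the table.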

The encryption optimizations in Amazon EMR 7.9 deliver performance benefits through:

  • Improved shuffle and decryption operations reducing overhead during data exchange without compromising security
  • Better memory management for intermediate results

Cost analysis

The performance improvements of the Amazon EMR 7.9 optimized Spark runtime directly translate to lower costs. We realized approximately 20% cost savings running the benchmark application with encryption optimizations compared to the baseline configuration, because the shorter runtime reduces the billed hours for Amazon EMR, Amazon Elastic Compute Cloud (Amazon EC2), and Amazon Elastic Block Store (Amazon EBS) General Purpose SSD (gp2) volumes.

The following table summarizes the cost comparison in the us-east-1 AWS Region:

Configuration | Runtime (hours) | Estimated cost | Total EC2 instances | Total vCPU | Total memory (GiB) | Root device (EBS)
Baseline: Spark 3.5.5 without optimization, 1 primary and 8 core nodes | 0.41 | $5.28 | 9 | 144 | 1152 | 64 GiB gp2
Amazon EMR 7.9 with optimization, 1 primary and 8 core nodes | 0.33 | $4.25 | 9 | 144 | 1152 | 64 GiB gp2

Cost breakdown

Formulas used:

  • Amazon EMR cost – Number of instances × EMR hourly rate × Runtime hours
  • Amazon EC2 cost – Number of instances × EC2 hourly rate × Runtime hours
  • Amazon EBS cost – (EBS cost per GB per month ÷ Hours in a month) × EBS volume size × Number of instances × Runtime hours

Note: EBS is priced monthly ($0.1 per GB per month), so we divide by 730 hours to convert to an hourly rate. EMR and EC2 are already priced hourly, so no conversion is needed.

Baseline configuration (0.41 hours):

  • Amazon EMR cost – 9 × $0.27 × 0.41 = $1.00
  • Amazon EC2 cost – 9 × $1.152 × 0.41 = $4.25
  • Amazon EBS cost – $0.1/730 × 64 × 9 × 0.41 = $0.032
  • Total cost – $5.28

EMR 7.9 optimized configuration (0.33 hours):

  • Amazon EMR cost – 9 × $0.27 × 0.33 = $0.80
  • Amazon EC2 cost – 9 × $1.152 × 0.33 = $3.42
  • Amazon EBS cost – $0.1/730 × 64 × 9 × 0.33 = $0.026
  • Total cost – $4.25

Total cost savings: 20% per benchmark run, which scales linearly with your production workload frequency.
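The cost formulas can be reproduced with a few lines of Python. This is a minimal sketch that uses the rates quoted in this post as default arguments; the function name run_cost and its parameter names are ours, and the rates should be replaced with current pricing for your Region.

```python
HOURS_PER_MONTH = 730  # used to convert the monthly EBS price to an hourly rate


def run_cost(hours, instances=9, emr_rate=0.27, ec2_rate=1.152,
             ebs_gb_month=0.10, ebs_gb=64):
    """Estimate the cost of one benchmark run from the formulas in this post."""
    emr = instances * emr_rate * hours
    ec2 = instances * ec2_rate * hours
    ebs = ebs_gb_month / HOURS_PER_MONTH * ebs_gb * instances * hours
    return {"emr": emr, "ec2": ec2, "ebs": ebs, "total": emr + ec2 + ebs}


baseline = run_cost(0.41)
optimized = run_cost(0.33)
savings = (baseline["total"] - optimized["total"]) / baseline["total"] * 100

print(f"baseline:  ${baseline['total']:.2f}")
print(f"optimized: ${optimized['total']:.2f}")
print(f"savings:   {savings:.1f}%")
```

Because every cost component is proportional to runtime, the savings percentage equals the relative reduction in runtime hours: about 19.5% with these inputs, which the post rounds to 20%.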

Set up EMR benchmarking

For detailed instructions and scripts, see the companion GitHub repository.

Prerequisites

To set up Amazon EMR benchmarking, start by completing the following prerequisite steps:

  1. Configure your AWS Command Line Interface (AWS CLI) by running aws configure to point to your benchmarking account.
  2. Create an S3 bucket for test data and results.
  3. Copy the TPC-DS 3TB source data from a publicly available dataset to your S3 bucket using the following command:
    aws s3 cp s3://blogpost-sparkoneks-us-east-1/blog/BLOG_TPCDS-TEST-3T-partitioned s3://<YOUR-BUCKET-NAME>/blog/BLOG_TPCDS-TEST-3T-partitioned --recursive

    Replace <YOUR-BUCKET-NAME> with the name of the S3 bucket you created in step 2.

  4. Build or download the benchmark application JAR file (spark-benchmark-assembly-3.3.0.jar) and upload it to s3://<YOUR-BUCKET-NAME>/jar/, the location that the benchmark steps read it from.
  5. Ensure you have appropriate AWS Identity and Access Management (IAM) roles for EMR cluster creation and Amazon S3 access.

Deploy the baseline EMR cluster (without optimization)

Step 1: Launch EMR 7.9.0 cluster with bootstrap action

The baseline configuration uses a bootstrap action to install Spark 3.5.5 without encryption optimizations. We have made the bootstrap script publicly available in an S3 bucket for your convenience.

Create the default Amazon EMR roles:

aws emr create-default-roles

Now create the cluster:

aws emr create-cluster \
  --name "EMR-7.9-Baseline-Spark-3.5.5" \
  --release-label emr-7.9.0 \
  --applications Name=Spark \
  --ec2-attributes SubnetId=<YOUR-SUBNET-ID>,InstanceProfile=EMR_EC2_DefaultRole \
  --service-role EMR_DefaultRole \
  --instance-groups \
    InstanceGroupType=MASTER,InstanceCount=1,InstanceType=r5d.4xlarge \
    InstanceGroupType=CORE,InstanceCount=8,InstanceType=r5d.4xlarge \
  --bootstrap-actions \
    Path=s3://spark-ba/install-spark-3-5-5-no-encryption.sh,Name="install spark 3.5.5 without encryption optimization" \
  --log-uri s3://<YOUR-BUCKET-NAME>/logs/baseline/

Note: The bootstrap script is available in a public S3 bucket at s3://spark-ba/install-spark-3-5-5-no-encryption.sh. This script installs Apache Spark 3.5.5 without the encryption optimizations present in the Amazon EMR runtime.

Step 2: Submit the benchmark job to the baseline cluster

Next, submit the Spark job using the following command:

aws emr add-steps \
  --cluster-id <YOUR-BASELINE-CLUSTER-ID> \
  --steps 'Type=Spark,Name="EMR-7.9-Baseline-Spark-3.5.5 Step",ActionOnFailure=CONTINUE,Args=["--deploy-mode","client","--conf","spark.io.encryption.enabled=true","--class","com.amazonaws.eks.tpcds.BenchmarkSQL","s3://<YOUR-BUCKET-NAME>/jar/spark-benchmark-assembly-3.3.0.jar","s3://<YOUR-BUCKET-NAME>/blog/BLOG_TPCDS-TEST-3T-partitioned","s3://<YOUR-BUCKET-NAME>/blog/BASELINE_TPCDS-TEST-3T-RESULT","/opt/tpcds-kit/tools","parquet","3000","3","false","q1-v2.4,q10-v2.4,q11-v2.4,q12-v2.4,q13-v2.4,q14a-v2.4,q14b-v2.4,q15-v2.4,q16-v2.4,q17-v2.4,q18-v2.4,q19-v2.4,q2-v2.4,q20-v2.4,q21-v2.4,q22-v2.4,q23a-v2.4,q23b-v2.4,q24a-v2.4,q24b-v2.4,q25-v2.4,q26-v2.4,q27-v2.4,q28-v2.4,q29-v2.4,q3-v2.4,q30-v2.4,q31-v2.4,q32-v2.4,q33-v2.4,q34-v2.4,q35-v2.4,q36-v2.4,q37-v2.4,q38-v2.4,q39a-v2.4,q39b-v2.4,q4-v2.4,q40-v2.4,q41-v2.4,q42-v2.4,q43-v2.4,q44-v2.4,q45-v2.4,q46-v2.4,q47-v2.4,q48-v2.4,q49-v2.4,q5-v2.4,q50-v2.4,q51-v2.4,q52-v2.4,q53-v2.4,q54-v2.4,q55-v2.4,q56-v2.4,q57-v2.4,q58-v2.4,q59-v2.4,q6-v2.4,q60-v2.4,q61-v2.4,q62-v2.4,q63-v2.4,q64-v2.4,q65-v2.4,q66-v2.4,q67-v2.4,q68-v2.4,q69-v2.4,q7-v2.4,q70-v2.4,q71-v2.4,q72-v2.4,q73-v2.4,q74-v2.4,q75-v2.4,q76-v2.4,q77-v2.4,q78-v2.4,q79-v2.4,q8-v2.4,q80-v2.4,q81-v2.4,q82-v2.4,q83-v2.4,q84-v2.4,q85-v2.4,q86-v2.4,q87-v2.4,q88-v2.4,q89-v2.4,q9-v2.4,q90-v2.4,q91-v2.4,q92-v2.4,q93-v2.4,q94-v2.4,q95-v2.4,q96-v2.4,q97-v2.4,q98-v2.4,q99-v2.4,ss_max-v2.4","true"]'

Deploy the optimized EMR cluster (with encryption optimization)

Step 1: Launch EMR 7.9.0 cluster with Spark runtime

The optimized configuration uses the EMR 7.9.0 Spark runtime without any bootstrap actions:

aws emr create-cluster \
  --name "EMR-7.9-Optimized-Native-Spark" \
  --release-label emr-7.9.0 \
  --applications Name=Spark \
  --ec2-attributes SubnetId=<YOUR-SUBNET-ID>,InstanceProfile=EMR_EC2_DefaultRole \
  --service-role EMR_DefaultRole \
  --instance-groups \
    InstanceGroupType=MASTER,InstanceCount=1,InstanceType=r5d.4xlarge \
    InstanceGroupType=CORE,InstanceCount=8,InstanceType=r5d.4xlarge \
  --log-uri s3://<YOUR-BUCKET-NAME>/logs/optimized/

Example:

aws emr create-cluster \
  --name "EMR-7.9-Optimized-Native-Spark" \
  --release-label emr-7.9.0 \
  --applications Name=Spark \
  --ec2-attributes SubnetId=subnet-08a5f71f92bc8a801 \
  --instance-groups \
    InstanceGroupType=MASTER,InstanceCount=1,InstanceType=r5d.4xlarge \
    InstanceGroupType=CORE,InstanceCount=8,InstanceType=r5d.4xlarge \
  --use-default-roles \
  --log-uri s3://aws-logs-123456789012-us-west-2/elasticmapreduce/

Step 2: Submit the benchmark job to optimized cluster

Next, submit the Spark job using the following command:

aws emr add-steps \
  --cluster-id <YOUR-OPTIMIZED-CLUSTER-ID> \
  --steps 'Type=Spark,Name="EMR-7.9-Optimized-Native-Spark Step",ActionOnFailure=CONTINUE,Args=["--deploy-mode","client","--conf","spark.io.encryption.enabled=true","--class","com.amazonaws.eks.tpcds.BenchmarkSQL","s3://<YOUR-BUCKET-NAME>/jar/spark-benchmark-assembly-3.3.0.jar","s3://<YOUR-BUCKET-NAME>/blog/BLOG_TPCDS-TEST-3T-partitioned","s3://<YOUR-BUCKET-NAME>/blog/OPTIMIZED_TPCDS-TEST-3T-RESULT","/opt/tpcds-kit/tools","parquet","3000","3","false","q1-v2.4,q10-v2.4,q11-v2.4,q12-v2.4,q13-v2.4,q14a-v2.4,q14b-v2.4,q15-v2.4,q16-v2.4,q17-v2.4,q18-v2.4,q19-v2.4,q2-v2.4,q20-v2.4,q21-v2.4,q22-v2.4,q23a-v2.4,q23b-v2.4,q24a-v2.4,q24b-v2.4,q25-v2.4,q26-v2.4,q27-v2.4,q28-v2.4,q29-v2.4,q3-v2.4,q30-v2.4,q31-v2.4,q32-v2.4,q33-v2.4,q34-v2.4,q35-v2.4,q36-v2.4,q37-v2.4,q38-v2.4,q39a-v2.4,q39b-v2.4,q4-v2.4,q40-v2.4,q41-v2.4,q42-v2.4,q43-v2.4,q44-v2.4,q45-v2.4,q46-v2.4,q47-v2.4,q48-v2.4,q49-v2.4,q5-v2.4,q50-v2.4,q51-v2.4,q52-v2.4,q53-v2.4,q54-v2.4,q55-v2.4,q56-v2.4,q57-v2.4,q58-v2.4,q59-v2.4,q6-v2.4,q60-v2.4,q61-v2.4,q62-v2.4,q63-v2.4,q64-v2.4,q65-v2.4,q66-v2.4,q67-v2.4,q68-v2.4,q69-v2.4,q7-v2.4,q70-v2.4,q71-v2.4,q72-v2.4,q73-v2.4,q74-v2.4,q75-v2.4,q76-v2.4,q77-v2.4,q78-v2.4,q79-v2.4,q8-v2.4,q80-v2.4,q81-v2.4,q82-v2.4,q83-v2.4,q84-v2.4,q85-v2.4,q86-v2.4,q87-v2.4,q88-v2.4,q89-v2.4,q9-v2.4,q90-v2.4,q91-v2.4,q92-v2.4,q93-v2.4,q94-v2.4,q95-v2.4,q96-v2.4,q97-v2.4,q98-v2.4,q99-v2.4,ss_max-v2.4","true"]'

Benchmark command parameters explained

The Amazon EMR Spark step uses the following parameters:

  • EMR step configuration:
    • Type=Spark: Specifies this is a Spark application step
    • Name="EMR-7.9-Baseline-Spark-3.5.5 Step": Human-readable name for the step
    • ActionOnFailure=CONTINUE: Continue with other steps if this one fails
  • Spark submit arguments:
    • --deploy-mode client: Run the driver on the primary node (client mode, not cluster mode)
    • --class com.amazonaws.eks.tpcds.BenchmarkSQL: Main class for the TPC-DS benchmark
  • Application parameters:
    • JAR file: s3://<YOUR-BUCKET-NAME>/jar/spark-benchmark-assembly-3.3.0.jar
    • Input data: s3://<YOUR-BUCKET-NAME>/blog/BLOG_TPCDS-TEST-3T-partitioned (3 TB TPC-DS dataset)
    • Output location: s3://<YOUR-BUCKET-NAME>/blog/BASELINE_TPCDS-TEST-3T-RESULT (S3 path for results)
    • TPC-DS tools path: /opt/tpcds-kit/tools (local path on EMR nodes)
    • Format: parquet (output format)
    • Scale factor: 3000 (3 TB dataset size)
    • Iterations: 3 (run each query 3 times for averaging)
    • Collect results: false (don’t collect results to driver)
    • Query list: "q1-v2.4,q10-v2.4,...,ss_max-v2.4" (all 104 TPC-DS queries)
    • Final parameter: true (enable detailed logging and metrics)
  • Query coverage:
    • 103 standard TPC-DS benchmark queries (q1-v2.4 through q99-v2.4, with q14, q23, q24, and q39 each split into a and b variants)
    • Plus the ss_max-v2.4 query for additional testing, for 104 queries in total
    • Each query runs 3 times to calculate average performance
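Rather than typing the query list by hand, you can generate it. The following Python sketch rebuilds the comma-separated list from the naming pattern visible in the step arguments (q1 through q99, with a and b variants for q14, q23, q24, and q39, plus ss_max). The function name tpcds_query_list is hypothetical and is not part of the benchmark utility.

```python
def tpcds_query_list(version="v2.4"):
    """Build the query names passed to the benchmark step."""
    split = {14, 23, 24, 39}  # queries that appear as "a" and "b" variants
    names = []
    for n in range(1, 100):
        suffixes = ["a", "b"] if n in split else [""]
        names.extend(f"q{n}{suffix}-{version}" for suffix in suffixes)
    names.append(f"ss_max-{version}")
    # A plain string sort reproduces the order used in the step arguments
    # (q1, q10, q11, ..., q19, q2, q20, ...).
    return sorted(names)


queries = tpcds_query_list()
query_arg = ",".join(queries)  # value for the query list parameter

print(len(queries))
print(query_arg[:40] + "...")
```

To run a subset while testing your setup, slice the list before joining it, for example ",".join(queries[:5]).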

Summarize the results

  1. Download the test result files from both output S3 locations:
    # Baseline results
    aws s3 cp s3://<YOUR-BUCKET-NAME>/blog/BASELINE_TPCDS-TEST-3T-RESULT/timestamp=xxxx/summary.csv/xxx.csv ./baseline-results.csv
       
    # Optimized results
    aws s3 cp s3://<YOUR-BUCKET-NAME>/blog/OPTIMIZED_TPCDS-TEST-3T-RESULT/timestamp=xxxx/summary.csv/xxx.csv ./optimized-results.csv

  2. The CSV files contain four columns (without headers):
    • Query name
    • Median time (seconds)
    • Minimum time (seconds)
    • Maximum time (seconds)
  3. Calculate performance metrics for comparison:
    • Average time per query: AVERAGE(median, min, max) for each query
    • Total runtime: Sum of all median times
    • Geometric mean: GEOMEAN(average times) across all queries
    • Speedup: Calculate the ratio between baseline and optimized for each query
  4. Create the comparison analysis: Improvement (%) = (Baseline time - Optimized time) / Baseline time × 100
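The steps above can be sketched in Python using only the standard library. The sketch assumes the four-column, headerless CSV layout described in step 2. The function names are ours, and the sample rows are made-up numbers used only to show the shape of the data; they are not benchmark results.

```python
import csv
import io
from statistics import geometric_mean, mean


def load_summary(text):
    """Parse headerless rows of: query name, median, minimum, maximum (seconds)."""
    rows = {}
    for name, median, low, high in csv.reader(io.StringIO(text)):
        rows[name] = (float(median), float(low), float(high))
    return rows


def summarize(rows):
    """Total runtime is the sum of medians; geomean is over per-query averages."""
    averages = [mean(times) for times in rows.values()]
    return {
        "total_runtime": sum(times[0] for times in rows.values()),
        "geomean": geometric_mean(averages),
    }


def improvement_pct(baseline, optimized):
    """Percentage reduction relative to the baseline."""
    return (baseline - optimized) / baseline * 100


# Made-up sample rows; replace with the contents of the downloaded files,
# for example load_summary(open("baseline-results.csv").read()).
baseline_rows = load_summary("q1-v2.4,10.0,9.0,11.0\nq2-v2.4,40.0,38.0,42.0\n")
optimized_rows = load_summary("q1-v2.4,8.0,7.0,9.0\nq2-v2.4,32.0,30.0,34.0\n")

base, opt = summarize(baseline_rows), summarize(optimized_rows)
for metric in ("total_runtime", "geomean"):
    print(metric, f"{improvement_pct(base[metric], opt[metric]):.1f}% improvement")
```

The geometric mean is less sensitive than the total runtime to a few long-running queries, which is why both metrics are reported.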

Testing configuration details

The following table summarizes the test environment used for this post:

Parameter | Value
EMR release | emr-7.9.0 (both configurations)
Baseline Spark version | 3.5.5 (installed through bootstrap action)
Baseline bootstrap script | s3://spark-ba/install-spark-3-5-5-no-encryption.sh (public)
Optimized Spark version | Amazon EMR Spark runtime
Cluster size | 9 nodes (1 primary and 8 core)
Instance type | r5d.4xlarge
vCPUs per node | 16
Memory per node | 128 GB
Instance storage | 600 GB SSD
EBS volume | 64 GB gp2 (2 volumes per instance)
Total vCPUs | 144 (9 × 16)
Total memory | 1152 GB (9 × 128)
Dataset | TPC-DS 3 TB (Parquet format)
Queries | 104 queries (TPC-DS v2.4)
Iterations | 3 runs per query
DRA | Disabled for consistent benchmarking

Clean up

To avoid incurring future charges, delete the resources you created:

  1. Terminate both EMR clusters:
    aws emr terminate-clusters --cluster-ids <YOUR-BASELINE-CLUSTER-ID> <YOUR-OPTIMIZED-CLUSTER-ID>

  2. Delete S3 test results if no longer needed:
    aws s3 rm s3://<YOUR-BUCKET-NAME>/blog/BASELINE_TPCDS-TEST-3T-RESULT/ --recursive
    aws s3 rm s3://<YOUR-BUCKET-NAME>/blog/OPTIMIZED_TPCDS-TEST-3T-RESULT/ --recursive
    aws s3 rm s3://<YOUR-BUCKET-NAME>/logs/ --recursive

  3. Remove IAM roles if created specifically for testing.

Key findings

  • Up to 20% performance improvement using the Amazon EMR 7.9 Spark runtime with no code changes required
  • 20% cost savings because of reduced runtime
  • Significant gains for shuffle-heavy, join-intensive workloads
  • 100% API compatibility with open source Apache Spark
  • Simple migration from custom Spark builds to EMR runtime
  • Easy benchmarking using publicly available bootstrap scripts

Conclusion

You can run your Apache Spark workloads up to 20% faster and at lower cost without making any changes to your applications by using the Amazon EMR 7.9.0 optimized Spark runtime. This improvement is achieved through numerous optimizations in the EMR Spark runtime, including enhanced encryption handling, improved data serialization, and optimized shuffle operations.

To learn more about Amazon EMR 7.9 and best practices, see the EMR documentation. For configuration guidance and tuning advice, subscribe to the AWS Big Data Blog.


If you’re running Spark workloads on Amazon EMR today, we encourage you to test the EMR 7.9 Spark runtime with your production workloads and measure the improvements specific to your use case.


About the authors

Sonu Kumar Singh

Sonu is a Senior Solutions Architect with more than 13 years of experience, with a specialization in Analytics and Healthcare domain. He has been instrumental in catalyzing transformative shifts in organizations by enabling data-driven decision-making thereby fueling innovation and growth. He enjoys it when something he designed or created brings a positive impact.

Roshin Babu

Roshin is a Sr. Specialist Solutions Architect at AWS, where he collaborates with the sales team to support public sector clients. His role focuses on developing innovative solutions that solve complex business challenges while driving increased adoption of AWS analytics services. When he’s not working, Roshin is passionate about exploring new destinations, discovering great food, and enjoying soccer both as a player and fan.

Polaris Jhandi

Polaris is a Cloud Application Architect with AWS Professional Services. He has a background in AI/ML and big data. He is currently working with customers to migrate their legacy mainframe applications to the AWS Cloud.

Zheng Yuan

Zheng is a Software Engineer on the Amazon EMR Spark team, where he focuses on improving the performance of the Spark execution engine across various use cases.

Your guide to AWS Analytics at AWS re:Invent 2025

Post Syndicated from Sonu Kumar Singh original https://aws.amazon.com/blogs/big-data/your-guide-to-aws-analytics-at-aws-reinvent-2025/


It’s that time of year again — AWS re:Invent is here! At re:Invent, bold ideas come to life. Get a front-row seat to hear inspiring stories from AWS experts, customers, and leaders as they explore today’s most impactful topics, from data analytics to AI.

For all the data enthusiasts and professionals, we’ve curated a comprehensive guide to every analytics session to help you plan your perfect agenda. Make sure to secure your seat early for must-attend sessions via the attendee portal.

Pro tip: Even if a session shows as fully reserved, we encourage you to join the walk-up line at the session location. Based on previous years’ experiences, additional seats often become available due to no-shows or last-minute schedule changes. The walk-up line operates on a first-come, first-served basis, and many attendees have successfully accessed their desired sessions this way. Just be sure to arrive at least 15 minutes before the session starts for the best chance of getting a seat.

Can’t make it in person? No problem — grab a free virtual pass to stream live sessions from anywhere.

And don’t forget to stop by the AWS Kiosk in the AWS Village Expo for AWS Analytics, Amazon SageMaker, Amazon OpenSearch Service, and AWS Messaging and Streaming services! See live demos of analytics services, meet AWS experts, get your toughest data questions answered, explore the latest launches, join our data trivia, and even win exclusive AWS-authored books and more swag.

Data Innovation Talk

INV201 | Harnessing analytics for humans and AI

Emerging trends, ranging from Open Table Formats (OTF) to agentic infrastructure, are rapidly changing how humans and applications interact with analytics to drive mission-critical business decisions. Join Mai-Lan Tomsen Bukovec, VP of AWS Technology, to explore emerging trends, the evolution of analytics engines and applications, and how to future-proof your data foundation for the rapidly changing landscape of analytics at scale. Learn how AWS is transforming data and analytics services to lead in optimized data storage, querying, streaming, processing, and governance – for both human users and agentic infrastructure.

Breakouts

Dive into cutting-edge topics with re:Invent breakout sessions. These immersive, hour-long lectures are led by AWS experts and customers, offering you unparalleled insights and knowledge in a concise format. Whether you’re exploring the latest in cloud technology, AWS Analytics advancements, or industry-specific solutions, these sessions are designed to expand your horizons and inspire your next big idea.

Monday, Dec 1

  • ANT203 | Enabling AI innovation with Amazon SageMaker Unified Studio
    8:30 AM – 9:30 AM PST | Venetian | Level 3 | Lido 3106
  • ANT318 | Scaling Amazon Redshift with a multi-warehouse architecture
    8:30 AM – 9:30 AM PST | MGM | Level 3 | Chairman’s 366
  • ANT307 | Operating Apache Kafka and Apache Flink at scale
    9:00 AM – 10:00 AM PST | Wynn | Convention Promenade | Lafite 7 | Content Hub | Mint Green Theater
  • BIZ331 | Build robust data foundations to power enterprise AI and BI
    10:00 AM – 11:00 AM PST | Wynn | Upper Convention Promenade | Bollinger
  • ANT209 | Universal data connectivity with ETL and SQL queries
    10:00 AM – 11:00 AM PST | Mandalay Bay | Level 2 South | Oceanside C | Content Hub | Pink Theater
  • ANT314 | Build Advanced Search with Vector, Hybrid, and AI Techniques
    10:30 AM – 11:30 AM PST | MGM | Level 1 | Grand 116
  • ANT336 | Enterprise-scale ETL optimization for Apache Spark
    12:00 PM – 1:00 PM PST | MGM | Level 3 | Chairman’s 360
  • ANT339 | Turn unstructured data in Amazon S3 into AI-ready assets with SageMaker Catalog
    12:00 PM – 1:00 PM PST | Mandalay Bay | Level 2 South | Oceanside C | Content Hub | Orange Theater
  • ANT201 | What’s new in search, observability, and vector databases with OpenSearch
    1:00 PM – 2:00 PM PST | Mandalay Bay | Level 2 South | Oceanside C | Content Hub | Pink Theater
  • BIZ228 | Reimagine business intelligence with Amazon Quick Sight
    1:30 PM – 2:30 PM PST | Wynn | Convention Promenade | Lafite 7 | Content Hub | Orange Theater
  • OPN413 | Transforming Apache Kafka into a Scalable Message Queue
    1:30 PM – 2:30 PM PST | Mandalay Bay | Level 3 South | South Seas E
  • ANT423 | Amazon Kinesis Data Streams under the hood
    5:30 PM – 6:30 PM PST | Mandalay Bay | Level 3 South | South Seas F

Tuesday, Dec 2

  • BIZ207 | Democratize access to insights with Amazon Quick Suite
    11:30 AM – 12:30 PM PST | Mandalay Bay | Level 2 South | Oceanside C | Content Hub | Turquoise Theater
  • ANT216 | What’s new with Amazon SageMaker in the era of unified data and AI
    11:30 AM – 12:30 PM PST | Mandalay Bay | Level 2 South | Oceanside C | Content Hub | Pink Theater
  • BIZ203 | Amazon’s journey deploying Quick Suite across thousands of users
    1:30 PM – 2:30 PM PST | MGM | Level 3 | Chairman’s 364
  • ANT206 | What’s new in Amazon Redshift and Amazon Athena
    1:30 PM – 2:30 PM PST | Wynn | Convention Promenade | Lafite 7 | Content Hub | Mint Green Theater
  • ANT308 | Explore what’s new in data and AI governance with SageMaker Catalog
    4:00 PM – 5:00 PM PST | Mandalay Bay | Level 2 South | Oceanside C | Content Hub | Turquoise Theater
  • ANT305 | Innovations in AWS analytics: Data processing
    4:30 PM – 5:30 PM PST | Mandalay Bay | Level 2 South | Oceanside C | Content Hub | Mint Green Theater

Wednesday, Dec 3

  • ANT204 | Architecting the future: Amazon SageMaker as a data and AI platform
    8:30 AM – 9:30 AM PST | MGM | Level 1 | Grand 123
  • ANT335 | Agentic data engineering with AWS Analytics MCP Servers
    10:00 AM – 11:00 AM PST | Mandalay Bay | Level 2 South | Oceanside C | Content Hub | Turquoise Theater
  • ANT304 | Build an AI-ready data foundation
    10:00 AM – 11:00 AM PST | Wynn | Upper Convention Promenade | Bollinger
  • ANT424 | Autonomous agents powered by streaming data and Retrieval Augmented Generation
    11:30 AM – 12:30 PM PST | Mandalay Bay | Level 2 South | Oceanside C | Content Hub | Turquoise Theater
  • ANT310 | Powering your Agentic AI experience with AWS Streaming and Messaging
    11:30 AM – 12:30 PM PST | Wynn | Convention Promenade | Lafite 7 | Content Hub | Pink Theater
  • ANT315 | Intelligent Observability and Modernization with Amazon OpenSearch Service
    2:30 PM – 3:30 PM PST | Mandalay Bay | Level 2 South | Oceanside C | Content Hub | Pink Theater
  • ANT309 | Accelerate analytics and AI with an open and secure lakehouse architecture
    3:00 PM – 4:00 PM PST | MGM | Level 1 | Grand 122

Thursday, Dec 4

  • ANT317 | Modernize your data warehouse by moving to Amazon Redshift
    11:00 AM – 12:00 PM PST | Mandalay Bay | Level 2 South | Oceanside C | Content Hub | Pink Theater
  • ANT328 | Data Processing architectures for building AI solutions
    11:30 AM – 12:30 PM PST | Wynn | Upper Convention Promenade | Cristal 7
  • BIZ227 | Generate new revenue streams with Amazon Quick Sight embedded
    1:00 PM – 2:00 PM PST | MGM | Level 1 | Grand 122
  • ANT343 | Best practices for building Apache Iceberg based lakehouse architectures on AWS
    2:00 PM – 3:00 PM PST | Mandalay Bay | Level 2 South | Oceanside C | Content Hub | Turquoise Theater
  • ANT344 | Build, govern, and share Amazon Quick Suite dashboards with Amazon SageMaker
    4:00 PM – 5:00 PM PST | Mandalay Bay | Level 3 South | South Seas E
  • DAT445 | Deep dive into databases zero-ETL integrations
    4:00 PM – 5:00 PM PST | Wynn | Convention Promenade | Lafite 7 | Content Hub | Orange Theater

Chalk talks

These hour-long, highly engaging sessions offer a unique blend of expert insight and collaborative learning. An AWS specialist kicks off with a concise, informative lecture, setting the stage for an in-depth, interactive Q&A. With a limited audience size, you’ll have the opportunity to dive deep into topics, ask pressing questions, and engage in meaningful discussions with both the presenter and fellow attendees.

Monday, Dec 1 Tuesday, Dec 2 Wednesday, Dec 3 Thursday, Dec 4 Friday, Dec 5
8:30 AM – 9:30 AM PST | MGM | Level 1 | Boulevard 167

ANT301-R1 | Accelerating the shift from batch to real-time streaming

11:30 AM – 12:30 PM PST | Caesars Forum | Level 1 | Academy 411

ANT302-R1 | Accelerate GenAI-powered data discovery and sharing with SageMaker Catalog

9:00 AM – 10:00 AM PST | MGM | Level 3 | Room 353

ANT301-R | Accelerating the shift from batch to real-time streaming

11:30 AM – 12:30 PM PST | MGM | Level 3 | Room 353

ANT207 | Develop with natural language and agentic AI in Amazon SageMaker Unified Studio

10:30 AM – 11:30 AM PST | Caesars Forum | Level 1 | Summit 221

ANT331 | Optimize Cost and Performance in Amazon OpenSearch Service

8:30 AM – 9:30 AM PST | Mandalay Bay | Level 2 South | Reef C

ANT347 | Build a secure and regulated data foundation for AI

11:30 AM – 12:30 PM PST | Mandalay Bay | Level 3 South | South Seas A

ANT217 | Build data pipelines in minutes with the Amazon SageMaker Visual experience

9:00 AM – 10:00 AM PST | Mandalay Bay | Level 3 South | South Seas H

ANT319-R1 | Optimizing Apache Spark workloads with AWS Analytics

12:30 PM – 1:30 PM PST | Mandalay Bay | Level 3 South | South Seas A

ANT346 | Architectural blueprints for your lakehouse in Amazon SageMaker

.
10:00 AM – 11:00 AM PST | Mandalay Bay | Level 3 South | South Seas A

ANT420-R | AI-driven scaling in Amazon Redshift Serverless

12:00 PM – 1:00 PM PST | Caesars Forum | Level 1 | Alliance 305

ANT301-R2 | Accelerating the shift from batch to real-time streaming

10:00 AM – 11:00 AM PST | MGM | Level 1 | Boulevard 158

ANT321 | Top 10 tips to improve query performance in Amazon Redshift

2:00 PM – 3:00 PM PST | MGM | Level 1 | Room 101

ANT303 | Implement data pipelines for analytics using Amazon SageMaker Unified Studio

.
10:30 AM – 11:30 AM PST | Wynn | Convention Promenade | Latour 5

ANT302-R | Accelerate GenAI-powered data discovery and sharing with SageMaker Catalog

1:00 PM – 2:00 PM PST | Mandalay Bay | Level 2 South | Lagoon G

ANT330-R | Design and build Intelligent Observability with Amazon OpenSearch Service

10:00 AM – 11:00 AM PST | Wynn | Convention Promenade | La Tache 2

ANT320 | Strengthening security for Apache Spark workloads

2:00 PM – 3:00 PM PST | Mandalay Bay | Level 3 South | South Seas J

ANT322 | Architectural patterns for real-time data analytics on AWS

.
11:30 AM – 12:30 PM PST | Mandalay Bay | Level 3 South | South Seas A

ANT338 | Bring unified analytics to your data warehouse with the lakehouse architecture

1:30 PM – 2:30 PM PST | MGM | Level 3 | Premier 320

ANT325-R1 | A deep dive into AI/ML development in SageMaker Unified Studio

11:30 AM – 12:30 PM PST | Mandalay Bay | Level 3 South | South Seas C

ANT332 | Building high-quality data products for AI Agents

3:30 PM – 4:30 PM PST | MGM | Level 1 | Room 101

ANT337 | Breaking data silos with the lakehouse architecture

.
11:30 AM – 12:30 PM PST | Wynn | Convention Promenade | Montrachet 1

BIZ323 | Design AI-powered BI architectures for modern enterprises with Amazon Quick Suite

2:30 PM – 3:30 PM PST | Mandalay Bay | Level 2 South | Lagoon G

ANT420-R1 | AI-driven scaling in Amazon Redshift Serverless

1:00 PM – 2:00 PM PST | Mandalay Bay | Level 3 South | South Seas C

ANT340 | Deep dive into data processing in SageMaker Unified Studio

. .
1:30 PM – 2:30 PM PST | MGM | Level 3 | Room 353

ANT325-R | A deep dive into AI/ML development in SageMaker Unified Studio

2:30 PM – 3:30 PM PST | Mandalay Bay | Lower Level North | South Pacific B

ANT341 | Build trust in AI with end-to-end data lineage in Amazon SageMaker Catalog

2:30 PM – 3:30 PM PST | MGM | Level 3 | Chairman’s 356

ANT345 | Building secure and scalable lakehouses for the future

. .
2:30 PM – 3:30 PM PST | Mandalay Bay | Level 3 South | South Seas A

ANT329 | Build Advanced AI-powered Search with OpenSearch MCP and Vectors

2:30 PM – 3:30 PM PST | Mandalay Bay | Level 3 South | South Seas C

BIZ327 | Bridge data silos to unlock complete insights with Amazon Quick Suite

2:30 PM – 3:30 PM PST | Mandalay Bay | Level 3 South | South Seas J

ANT413 | Upgrade Amazon DataZone to Amazon SageMaker Catalog for analytics and AI

. .
3:00 PM – 4:00 PM PST | MGM | Level 3 | Premier 320

BIZ319 | Beyond chatbots: Discover conversational AI in Amazon Quick Suite

3:00 PM – 4:00 PM PST | Wynn | Convention Promenade | Latour 5

ANT421 | Advanced Stream Processing with Apache Flink

4:00 PM – 5:00 PM PST | MGM | Level 3 | Room 350

ANT324 | Building Pipelines for Analytics, ML and AI in Amazon Sagemaker Unified Studio

. .
4:00 PM – 5:00 PM PST | MGM | Level 3 | Chairman’s 356

ANT422 | Building Resilient Multi-Tenant Messaging with Amazon SQS

4:00 PM – 5:00 PM PST | Mandalay Bay | Level 2 South | Reef C

ANT319-R | Optimizing Apache Spark workloads with AWS Analytics

4:00 PM – 5:00 PM PST | Mandalay Bay | Level 3 South | South Seas C

ANT323 | Mastering materialized views: tips for fast, low-latency queries in Redshift

4:30 PM – 5:30 PM PST | Caesars Forum | Level 1 | Alliance 305

ANT330-R1 | Design and build Intelligent Observability with Amazon OpenSearch Service

5:30 PM – 6:30 PM PST | MGM | Level 3 | Room 350

ANT326 | Mastering data transformations with Amazon Athena

5:30 PM – 6:30 PM PST | MGM | Level 1 | Boulevard 167

ANT316 | Orchestrating with Apache Airflow, MWAA, and SageMaker Unified Studio

Builders’ sessions

Immerse yourself in our builders’ sessions – a hands-on learning experience designed to elevate your AWS skills. These focused, hour-long workshops bring together a small group of up to ten attendees with a dedicated AWS expert at each table.

Monday, Dec 1 | Tuesday, Dec 2 | Wednesday, Dec 3 | Thursday, Dec 4
8:30 AM – 9:30 AM PST | Wynn | Convention Promenade | Latour 7

ANT407-R1 | Building event-driven applications with AWS Streaming and Messaging

11:30 AM – 12:30 PM PST | MGM | Level 1 | Room 104

ANT415 | Securely monetize your data with Amazon Redshift

1:00 PM – 2:00 PM PST | Mandalay Bay | Lower Level North | Islander H

ANT407-R | Building event-driven applications with AWS Streaming and Messaging

12:30 PM – 1:30 PM PST | Mandalay Bay | Level 2 South | Oceanside C | Content Hub | Builders’ Session 1

ANT409 | Getting hands on with zero-ETL and data federation

11:30 AM – 12:30 PM PST | MGM | Level 1 | Room 104

ANT410-R | Integrate and orchestrate data workflows with AWS Glue & MWAA

2:30 PM – 3:30 PM PST | MGM | Level 3 | Room 304

ANT405-R1 | Build high performance Apache Iceberg data lakes with Amazon S3 Tables

1:00 PM – 2:00 PM PST | Wynn | Convention Promenade | Latour 7

ANT406-R | Build trust in your data with Amazon SageMaker Catalog

2:00 PM – 3:00 PM PST | Mandalay Bay | Lower Level North | Islander H

ANT419-R | Vector search with Amazon OpenSearch Service

11:30 AM – 12:30 PM PST | Wynn | Convention Promenade | Latour 7

ANT406-R1 | Build trust in your data with Amazon SageMaker Catalog

4:30 PM – 5:30 PM PST | Mandalay Bay | Level 2 South | Oceanside C | Content Hub | Builders’ Session 2

ANT410-R1 | Integrate and orchestrate data workflows with AWS Glue & MWAA

4:00 PM – 5:00 PM PST | MGM | Level 3 | Room 304

ANT419-R1 | Vector search with Amazon OpenSearch Service

3:30 PM – 4:30 PM PST | Caesars Forum | Level 1 | Alliance 315

OPN407-R1 | Performance tuning for streaming Ingestion into Apache Iceberg

2:30 PM – 3:30 PM PST | MGM | Level 1 | Room 104

ANT408 | Data analytics for financial organizations with Amazon SageMaker

3:00 PM – 4:00 PM PST | Caesars Forum | Level 1 | Alliance 311

OPN407-R | Performance tuning for streaming Ingestion into Apache Iceberg

4:00 PM – 5:00 PM PST | Mandalay Bay | Lower Level North | Islander H

ANT405-R | Build high performance Apache Iceberg data lakes with Amazon S3 Tables


Workshops

Roll up your sleeves in our dynamic 2-hour workshops, where you’ll tackle real-world challenges using AWS services. These interactive sessions kick off with a brief, informative lecture to set the stage, then quickly transition into hands-on problem-solving. Bring your laptop and prepare to build alongside AWS experts, who will guide you through practical applications of cloud computing concepts. Whether you’re new to AWS or looking to sharpen your skills, these workshops offer a unique opportunity to learn by doing, so you leave with confidence and applicable knowledge of AWS technologies.

Monday, Dec 1 | Tuesday, Dec 2 | Wednesday, Dec 3 | Thursday, Dec 4
8:00 AM – 10:00 AM PST | Mandalay Bay | Lower Level North | Islander C

ANT402-R1 | Build a fraud detection system with Amazon SageMaker Unified Studio

12:00 PM – 2:00 PM PST | MGM | Level 3 | Premier 317

ANT418 | Unleash Apache Kafka’s elasticity and cost-efficiency with Amazon MSK

8:30 AM – 10:30 AM PST | Mandalay Bay | Lower Level North | Islander C

ANT402-R | Build a fraud detection system with Amazon SageMaker Unified Studio

12:00 PM – 2:00 PM PST | MGM | Level 3 | Premier 317

ANT412 | Power streaming analytics on AWS with AI-driven insights

8:00 AM – 10:00 AM PST | Mandalay Bay | Level 2 South | Mandalay Bay Ballroom H

ANT403 | Building Production-Ready Data Systems for AI Applications

12:30 PM – 2:30 PM PST | MGM | Level 3 | Chairman’s 368

ANT404-R1 | Build modern data applications with the lakehouse architecture on AWS

8:30 AM – 10:30 AM PST | Caesars Forum | Level 1 | Alliance 308

BIZ204-R1 | Experience AI-powered BI with Amazon Quick Suite

3:00 PM – 5:00 PM PST | Mandalay Bay | Lower Level North | Islander C

ANT416 | Solve complex data and AI governance challenges with Amazon SageMaker Catalog

8:30 AM – 10:30 AM PST | Wynn | Upper Convention Promenade | Cristal 3

BIZ306 | Create agentic AI chat experiences with Amazon Quick Suite

3:00 PM – 5:00 PM PST | Mandalay Bay | Level 2 South | Mandalay Bay Ballroom K

ANT411 | Low-cost logging and observability with Amazon OpenSearch Service

12:30 PM – 2:30 PM PST | MGM | Level 1 | Grand 113

ANT404-R | Build modern data applications with the lakehouse architecture on AWS

12:00 PM – 2:00 PM PST | MGM | Level 3 | Premier 317

ANT417 | Simplifying data interoperability with the lakehouse architecture on AWS

3:00 PM – 5:00 PM PST | Wynn | Upper Convention Promenade | Cristal 1

BIZ204-R | Experience AI-powered BI with Amazon Quick Suite

3:30 PM – 5:30 PM PST | Mandalay Bay | Lower Level North | Islander C

ANT401 | Build an AI-powered enterprise search with Amazon OpenSearch Service

3:00 PM – 5:00 PM PST | Mandalay Bay | Level 2 South | Mandalay Bay Ballroom K

ANT414 | Scale intelligent analytics with Amazon Redshift multi-cluster architectures


Lightning Talks

Located in the Expo Hall, each of these 20-minute theater presentations is dedicated to a specific customer story, service demo, or AWS Partner offering.

Monday, Dec 1 | Tuesday, Dec 2 | Wednesday, Dec 3 | Thursday, Dec 4
5:00 PM – 5:20 PM PST | Venetian | Level 2 | Hall B | Expo | Theater 4

ANT334 | High-performance NLP & geospatial analysis with Redshift

3:00 PM – 3:20 PM PST | Mandalay Bay | Level 2 South | Oceanside C | Content Hub | Lightning Theater

ANT333 | Fast-track to insights: AWS-SAP data strategy

12:30 PM – 12:50 PM PST | Venetian | Level 2 | Hall B | Expo | Theater 3

ANT342 | ITTI’s Cross-Company Data Mesh Blueprint with Amazon SageMaker

6:00 PM – 6:20 PM PST | Venetian | Level 2 | Hall B | Expo | Theater 3

ANT348 | Seamless data sharing in Amazon Redshift


Conclusion

We hope this post serves as your go-to resource for navigating the AWS analytics track at re:Invent 2025. To stay up to date on the latest trends and advancements in AWS Analytics, follow our LinkedIn page.


About the authors

Navnit Shukla

Navnit is an AWS Specialist Solutions Architect with a focus on Data and AI. He is passionate about helping clients discover valuable insights from their data, and he builds innovative solutions that help businesses make informed, data-driven decisions. He is the author of Data Wrangling on AWS and co-author of AI-Ready Data Blueprints with O’Reilly.

Sonu Kumar Singh

Sonu is a Senior Solutions Architect with over 13 years of experience, specializing in the analytics and healthcare domains. He has been instrumental in driving transformative change in organizations by enabling data-driven decision-making, thereby fueling innovation and growth. He enjoys seeing something he designed or created make a positive impact.