All posts by Al MS

Apache Spark 4.0.1 preview now available on Amazon EMR Serverless

Post Syndicated from Al MS original https://aws.amazon.com/blogs/big-data/apache-spark-4-0-1-preview-now-available-on-amazon-emr-serverless/

Amazon EMR Serverless now supports Apache Spark 4.0.1 in preview, making analytics accessible to more users, simplifying data engineering workflows, and strengthening governance capabilities. The release introduces ANSI SQL compliance, VARIANT data types support for JSON handling, Apache Iceberg v3 table format support, and enhanced streaming capabilities. This preview is available in all regions where EMR Serverless is available.

In this post, we explore key benefits, technical capabilities, and considerations for getting started with Spark 4.0.1 on Amazon EMR Serverless—a serverless deployment option that simplifies running open-source big data frameworks, without requiring managing clusters. With the emr-spark-8.0-preview release label, you can evaluate new SQL capabilities, Python API improvements, and streaming enhancements in your existing EMR Serverless environment.

Benefits

Spark 4.0.1 helps you solve data engineering problems with specific improvements. This section shows how new capabilities help with real-world scenarios.

Make analytics accessible to more users

Simplify Extract Transform Load (ETL) development with SQL scripting. Data engineers often switch between SQL and Python to build complex ETL logic with control flow. SQL scripting in Spark 4.0.1 enables loops, conditionals, and session variables directly in SQL, reducing context-switching and simplifying pipeline development. Use pipe syntax (|>) to chain operations for more readable, maintainable queries.

Improve data quality with ANSI SQL mode. Silent type conversion failures can introduce data quality issues. ANSI SQL mode (now default) enforces standard SQL behavior, raising errors for invalid operations instead of producing unexpected results. Important: ANSI SQL mode is now enabled by default. Test your queries thoroughly during this preview evaluation.

Simplify data engineering workflows

Process JSON data efficiently with VARIANT. Teams working with semi-structured data often see slow performance from repeated JSON parsing. The VARIANT data type stores JSON in an optimized binary format, eliminating parsing overhead. You can efficiently store and query JSON data in data lakes without schema rigidity.

Build Python data sources without Scala. Integrating custom data sources previously required Scala expertise. The Python data Source API lets you build connectors entirely in Python, using existing Python skills and libraries without learning a new language.

Debug streaming applications with queryable state. Troubleshooting stateful streaming applications has historically required indirect methods. The new state data source reader shows streaming state as queryable DataFrames. You can inspect state during debugging, test state values in unit tests, and diagnose production incidents.

Strengthen governance capabilities

Establish comprehensive audit trails with Apache Iceberg v3. The Apache Iceberg v3 table format provides transaction guarantees and tracks data changes over time, giving you the audit trails needed for regulatory compliance. When combined with VARIANT data type support, you can maintain governance controls while handling semi-structured data efficiently in data lakes.

Key capabilities

Spark 4.0.1 Preview on EMR Serverless introduces four major capability areas:

  1. SQL enhancements – ANSI mode, pipe syntax, VARIANT type, SQL scripting, user-defined functions (UDFs)
  2. Python API advances – custom data sources, UDF profiling
  3. Streaming improvements – stateful processing API v2, queryable state
  4. Table format support – Amazon S3 Tables, AWS Lake Formation integration

The following sections provide technical details and code examples for each capability.

SQL enhancements

Spark 4.0.1 introduces new SQL capabilities including ANSI mode compliance, SQL UDFs, pipe syntax for readable queries, VARIANT type for JSON handling, and SQL scripting with control flow.

ANSI SQL mode by default

ANSI SQL mode is now enabled by default, enforcing standard SQL behavior for data integrity. Silent casting of out-of-range values now raises errors rather than producing unexpected results. Existing queries may behave differently, particularly around null handling, string casting, and timestamp operations. Use spark.sql.ansi.enabled=false if you need legacy behavior during migration.

SQL pipe syntax

You can now chain SQL operations using the |> operator for improved readability. The following example shows how you can replace nested subqueries with a more maintainable pipeline:

FROM customer
|> LEFT OUTER JOIN orders ON c_custkey = o_custkey
|> AGGREGATE COUNT(o_orderkey) c_count GROUP BY c_custkey
|> AGGREGATE COUNT(*) AS custdist GROUP BY c_count
|> ORDER BY custdist DESC

This replaces nested subqueries, making complex transformations easier to understand and maintain.

VARIANT data type

The VARIANT type handles semi-structured JSON/XML data efficiently without repeated parsing. It uses an optimized binary representation internally while maintaining schema-less flexibility. Previously, JSON expressions required repeated parsing, degrading performance. VARIANT eliminates this overhead. The following snippet shows how to parse JSON into the VARIANT type:

df = spark.sql("SELECT parse_json('{\"name\":\"Alice\",\"age\":30}') as data")

Spark 4.0.1 on EMR Serverless supports Apache Iceberg v3, enabling the VARIANT data type with Iceberg tables. This combination provides efficient storage and querying of semi-structured JSON data in your data lake. Store VARIANT columns in Iceberg tables and use Iceberg’s schema evolution and time travel capabilities alongside Spark’s optimized JSON processing. The following example shows how to create an Iceberg table with a VARIANT column:

CREATE TABLE catalog.db.events (
  event_id BIGINT,
  event_data VARIANT,
  timestamp TIMESTAMP
) USING iceberg;

INSERT INTO catalog.db.events SELECT 1, parse_json('{"user":"alice","action":"login"}'), current_timestamp();

SQL scripting with session variables

Manage state and control flow directly in SQL using session variables, and IF/WHILE/FOR statements. The following example demonstrates a loop that populates a results table:

BEGIN
  DECLARE counter INT = 10;
  WHILE counter > 0 DO
    INSERT INTO results VALUES (counter);
    SET counter = counter - 1;
  END WHILE;
END

This enables complex ETL logic entirely in SQL without switching to Python.

SQL user-defined functions

Define custom functions directly in SQL. Functions can be temporary (session-scoped) or permanent (catalog-stored). The following example shows how to register and use a simple UDF:

CREATE FUNCTION plusOne(x INT) RETURNS INT RETURN x + 1;
SELECT plusOne(5);

Python API advances

This section covers new Python capabilities including custom data sources and UDF profiling tools.

Python data source API

You can now build custom data sources in Python without Scala knowledge. The following example shows how to create a simple data source that returns sample data:

from pyspark.sql.datasource import DataSource, DataSourceReader
from pyspark.sql.types import StructType, StructField, StringType, IntegerType

class SampleDataSource(DataSource):
    def schema(self):
        return StructType([
            StructField("name", StringType()),
            StructField("age", IntegerType())
        ])
    
    def reader(self, schema):
        return SampleReader()

class SampleReader(DataSourceReader):
    def read(self, partition):
        yield ("Alice", 30)
        yield ("Bob", 25)

# Register and use
spark.dataSource.register(SampleDataSource)
spark.read.format("SampleDataSource").load().show()

Unified UDF profiling

Profile Python and Pandas UDFs for performance and memory insights. The following code enables performance profiling:spark.conf.set("spark.sql.pyspark.udf.profiler", "perf") # or "memory"

Structured streaming enhancements

This section covers improvements to stateful stream processing, including queryable state and enhanced state management APIs.

Arbitrary stateful processing API v2

The transformWithState operator provides robust state management with timer and TTL support for automatic cleanup, schema evolution capabilities, and initial state support for pre-populating state from batch DataFrames.

State data source reader

Query streaming state as a DataFrame for debugging and monitoring. Previously, state data was internal to streaming queries. Now you can verify state values in unit tests, diagnose production incidents, detect state corruption, and optimize performance. Note: This feature is experimental. Source options and behavior may change in future releases.

State store improvements

Upgraded changelog checkpointing for RocksDB removes performance bottlenecks. Enhanced checkpoint coordination and improved sorted string table (SST) file reuse management optimizes streaming operations.

Table format support

This section covers support for AWS S3 Tables and full table access (FTA) with AWS Lake Formation.

AWS S3 Tables

Use Spark 4.0.1 with AWS S3 Tables, a storage solution that provides managed Apache Iceberg tables with automatic optimization and maintenance. S3 Tables simplify data lake operations by handling compaction, snapshot management, and metadata cleanup automatically.

Full table access with Lake Formation

FTA is supported for Apache Iceberg, Delta Lake, and Apache Hive tables when using AWS Lake Formation, a managed service that simplifies data access control. FTA provides coarse-grained access control at the table level. Note that fine-grained access control (FGAC) with column-level or row-level permissions is not available in this preview.

Getting started

Follow these steps to create an EMR Serverless application, run sample code to test new features, and provide feedback on the preview.

Prerequisites

Before you begin, confirm you have the following:

Note: EMR Studio Notebooks and SageMaker Unified Studio are not supported during this preview. Use the AWS CLI or AWS SDK to submit jobs.

Step 1: Create your EMR Serverless application

Create or update your application with the emr-spark-8.0-preview release label. The following command creates a new application:

aws emr-serverless create-application --type spark \
  --release-label emr-spark-8.0-preview \
  --region us-east-1 --name spark4-test

Step 2: Test sample code

Run this PySpark job to verify setup and test Spark 4.0.1 features:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("Spark 4.0.1 Test").getOrCreate()
print(f"Spark Version: {spark.version}")

# Create sample data
data = [("Alice", 34, "Engineering"), ("Bob", 45, "Sales"),
        ("Charlie", 28, "Engineering"), ("Diana", 52, "Marketing")]
df = spark.createDataFrame(data, ["name", "age", "department"])
df.createOrReplaceTempView("employees")

# Test SQL PIPE syntax
try:
    result = spark.sql("""
        FROM employees
        |> WHERE age > 30
        |> SELECT name, age, department
        |> ORDER BY age DESC
    """)
    result.show()
    print("✓ SQL pipe syntax test passed")
except Exception as e:
    print(f"✗ SQL pipe syntax test failed: {e}")

# Test VARIANT data type
try:
    json_data = spark.sql("""
        SELECT parse_json('{"name":"Alice","skills":["Python","Spark","SQL"]}') as data
    """)
    json_data.show(truncate=False)
    print("✓ VARIANT data type test passed")
except Exception as e:
    print(f"✗ VARIANT data type test failed: {e}")

Submit the job with the following command:

aws emr-serverless start-job-run \
    --application-id <your-application-id> \
    --execution-role-arn <your-execution-role-arn> \
    --job-driver '{
        "sparkSubmit": {
            "entryPoint": "s3://<your-bucket>/spark_4_test.py",
            "sparkSubmitParameters": "--conf spark.executor.cores=4 --conf spark.executor.memory=16g"
        }
    }'

Step 3: Test your workloads

Review the Spark SQL Migration Guide and PySpark Migration Guide, then test production workloads in non-production environments. Focus on queries affected by ANSI SQL mode and benchmark performance.

Step 4: Clean up resources

After testing, delete all resources created during this evaluation to avoid ongoing charges:

# Delete the EMR Serverless application
aws emr-serverless delete-application \
    --application-id spark4-test \
    --region us-east-1
# Remove the test script from S3
aws s3 rm s3://<your-bucket>/spark_4_test.py

Migration considerations

Before evaluating Spark 4.0.1, review the updated runtime requirements and behavioral changes that may affect your existing code.

Runtime requirements

  • Scala: Version 2.13.16 required (2.12 support dropped)
  • Java: JDK 17 or higher required (JDK 8 and 11 support removed)
  • Python: Version 3.9+ required, continued support for 3.11 and newly added 3.12 (3.8 support removed)
  • Pandas: Minimum version 2.0.0 (previously 1.0.5)
  • SparkR: Deprecated; migrate to PySpark

Behavioral changes

With ANSI SQL mode enforcement, you may see different behavior in:

  • Null handling: Stricter null propagation in expressions
  • String casting: Invalid casts now raise errors instead of returning null
  • Map key operations: Duplicate keys now raise errors
  • Timestamp conversions: Overflow returns null instead of wrapped values
  • CREATE TABLE statements: Now respect the spark.sql.sources.defaultconfiguration instead of defaulting to Hive format when USING or STORED AS clauses are omitted

You can control many of these behaviors via legacy configuration flags. Consult the official migration guides for details refer: Spark SQL Migration Guide: 3.5 to 4.0 and PySpark Migration Guide: 3.5 to 4.0.

Preview limitations

The following capabilities are not available in this preview:

  • Fine-grained access control: Fine-grained access control (FGAC) with row-level or column-level filtering is not supported in this preview. Jobs with spark.emr-serverless.lakeformation.enabled=true will fail.
  • Spark Connect: Not supported in this preview. Use standard Spark job submission with the StartJobRun API.
  • Open Table Format limitations: Hudi is not supported in this preview. Delta 4.0.0 does not support Flink connectors (deprecated in Delta 4.0.0). Delta Universal Format is not supported in this preview.
  • Connectors: spark-sql-kinesis, emr-dynamodb, and spark-redshift are unavailable.
  • Interactive applications: Livy and JupyterEnterpriseGateway are not included. Also, SageMaker Unified Studio and EMR Studio are not supported.
  • EMR features: Serverless Storage and Materialized Views are not supported.

This preview lets you evaluate Spark 4.0.1’s core capabilities on EMR Serverless, including SQL enhancements, Python API improvements, and streaming state management. Test your migration path, assess performance improvements, and provide feedback to shape the general availability release.

Conclusion

This post showed you how to get started with the Apache Spark 4.0.1 preview release on Amazon EMR Serverless. You explored how the VARIANT data type works with Iceberg v3 to process JSON data efficiently, how SQL scripting and pipe syntax eliminate context-switching for ETL development, and how queryable streaming state simplifies debugging stateful applications. You also learned about the preview limitations, runtime requirements, and behavioral changes to consider during evaluation.

Test the Spark 4.0.1 preview on EMR Serverless and provide feedback through AWS Support to help shape the general availability release.

To learn more about Apache Spark 4.0.1 features, see the Spark 4.0.1 Release Notes. For EMR Serverless documentation, see the EMR Release Guide.

Resources

Apache Spark Documentation

Amazon EMR Resources


About the authors

Al MS

Al MS

Al is a product manager for Amazon EMR at Amazon Web Services.

Emilie Faracci

Emilie Faracci

Emilie is a Software Development Engineer Amazon Web Services, working on Amazon EMR. She focuses on Spark development and has contributed to open-source Apache Spark v4.0.1.

Karthik Prabhakar

Karthik Prabhakar

Karthik is a Data Processing Engines Architect for Amazon EMR at AWS. He specializes in distributed systems architecture and query optimization, working with customers to solve complex performance challenges in large-scale data processing workloads. His focus spans engine internals, cost optimization strategies, and architectural patterns that enable customers to run petabyte-scale analytics efficiently.

Amazon EMR launches support for Amazon EC2 C7g (Graviton3) instances to improve cost performance for Spark workloads by 7–13%

Post Syndicated from Al MS original https://aws.amazon.com/blogs/big-data/amazon-emr-launches-support-for-amazon-ec2-c7g-graviton3-instances-to-improve-cost-performance-for-spark-workloads-by-7-13/

Amazon EMR provides a managed service to easily run analytics applications using open-source frameworks such as Apache Spark, Hive, Presto, Trino, HBase, and Flink. The Amazon EMR runtime for Spark and Presto includes optimizations that provide over twice the performance improvements compared to open-source Apache Spark and Presto.

With Amazon EMR release 6.7, you can now use Amazon Elastic Compute Cloud (Amazon EC2) C7g instances, which use the AWS Graviton3 processors. These instances improve price-performance of running Spark workloads on Amazon EMR by 7.93–13.35% over previous generation instances, depending on the instance size. In this post, we describe how we estimated the price-performance benefit.

Amazon EMR runtime performance with EC2 C7g instances

We ran TPC-DS 3 TB benchmark queries on Amazon EMR 6.9 using the Amazon EMR runtime for Apache Spark (compatible with Apache Spark 3.3) with C7g instances. Data was stored in Amazon Simple Storage Service (Amazon S3), and results were compared to equivalent C6g clusters from the previous generation instance family. We measured performance improvements using the total query runtime and geometric mean of the query runtime across TPC-DS 3 TB benchmark queries.

Our results showed 13.65–18.73% improvement in total query runtime performance and 16.98–20.28% improvement in geometric mean on EMR clusters with C7g compared to equivalent EMR clusters with C6g instances, depending on the instance size. In comparing costs, we observed 7.93–13.35% reduction in cost on the EMR cluster with C7g compared to the equivalent with C6g, depending on the instance size. We did not benchmark the C6g xlarge instance because it didn’t have sufficient memory to run the queries.

The following table shows the results from running the TPC-DS 3 TB benchmark queries using Amazon EMR 6.9 compared to equivalent C7g and C6g instance EMR clusters.

Instance Size 16 XL 12 XL 8 XL 4 XL 2 XL
Total size of the cluster (1 leader + 5 core nodes) 6 6 6 6 6
Total query runtime on C6g (seconds) 2774.86205 2752.84429 3173.08086 5108.45489 8697.08117
Total query runtime on C7g (seconds) 2396.22799 2336.28224 2698.72928 4151.85869 7249.58148
Total query runtime improvement with C7g 13.65% 15.13% 14.95% 18.73% 16.64%
Geometric mean query runtime C6g (seconds) 22.2113 21.75459 23.38081 31.97192 45.41656
Geometric mean query runtime C7g (seconds) 18.43905 17.65898 19.01684 25.48695 37.43737
Geometric mean query runtime improvement with C7g 16.98% 18.83% 18.66% 20.28% 17.57%
EC2 C6g instance price ($ per hour) $2.1760 $1.6320 $1.0880 $0.5440 $0.2720
EMR C6g instance price ($ per hour) $0.5440 $0.4080 $0.2720 $0.1360 $0.0680
(EC2 + EMR) instance price ($ per hour) $2.7200 $2.0400 $1.3600 $0.6800 $0.3400
Cost of running on C6g ($ per instance) $2.09656 $1.55995 $1.19872 $0.96493 $0.82139
EC2 C7g instance price ($ per hour) $2.3200 $1.7400 $1.1600 $0.5800 $0.2900
EMR C7g price ($ per hour per instance) $0.5800 $0.4350 $0.2900 $0.1450 $0.0725
(EC2 + EMR) C7g instance price ($ per hour) $2.9000 $2.1750 $1.4500 $0.7250 $0.3625
Cost of running on C7g ($ per instance) $1.930290 $1.411500 $1.086990 $0.836140 $0.729990
Total cost reduction with C7g including performance improvement -7.93% -9.52% -9.32% -13.35% -11.13%

The following graph shows per-query improvements observed on C7g 2xlarge instances compared to equivalent C6g generations.

Benchmarking methodology

The benchmark used in this post is derived from the industry-standard TPC-DS benchmark, and uses queries from the Spark SQL Performance Tests GitHub repo with the following fixes applied.

We calculated TCO by multiplying cost per hour by number of instances in the cluster and time taken to run the queries on the cluster. We used on-demand pricing in the US East (N. Virginia) Region for all instances.

Conclusion

In this post, we described how we estimated the cost-performance benefit from using Amazon EMR with C7g instances compared to using equivalent previous generation instances. Using these new instances with Amazon EMR improves cost-performance by an additional 7–13%.


About the authors

AI MSAl MS is a product manager for Amazon EMR at Amazon Web Services.

Kyeonghyun Ryoo is a Software Development Engineer for EMR at Amazon Web Services. He primarily works on designing and building automation tools for internal teams and customers to maximize their productivity. Outside of work, he is a retired world champion in professional gaming who still enjoy playing video games.

Yuzhou Sun is a software development engineer for EMR at Amazon Web Services.

Steve Koonce is an Engineering Manager for EMR at Amazon Web Services.

Amazon EMR launches support for Amazon EC2 M6A, R6A instances to improve cost performance for Spark workloads by 15–50% 

Post Syndicated from Al MS original https://aws.amazon.com/blogs/big-data/amazon-emr-launches-support-for-amazon-ec2-m6a-r6a-instances-to-improve-cost-performance-for-spark-workloads-by-15-50/

Amazon EMR provides a managed service to easily run analytics applications using open-source frameworks such as Apache Spark, Hive, Presto, Trino, HBase, and Flink. The Amazon EMR runtime for Spark and Presto includes optimizations that provide over 2x performance improvements over open-source Apache Spark and Presto.

With Amazon EMR release 6.8, you can now use Amazon Elastic Compute Cloud (Amazon EC2) instances such as M6A and C6A, which use the third generation AMD EPYC processors. These instances improve the price performance of running Spark workloads on Amazon EMR by 15–50 percent over previous generation instances. In this blog post, we describe how we estimated this price performance benefit.

Amazon EMR runtime performance with EC2 M6A instances

We ran TPC-DS 3 TB benchmark queries on Amazon EMR 6.8 using Amazon EMR runtime for Apache Spark (compatible with Apache Spark 3.3) with M6a instances. Data was stored in Amazon Simple Storage Service (Amazon S3), and results were compared to equivalent clusters with M5a, which is the previous generation instance family. We measured performance improvements using the total query runtime and the geometric mean of query runtime across TPC-DS 3 TB benchmark queries.

Our results showed a 23.6–50.3 percent improvement in total query runtime performance and 22.8–52.4 percent in geometric mean on an EMR cluster with M6a compared to an equivalent EMR cluster with M5a instances. In comparing costs, we observed a 23.2–41.4 percent reduction in cost on the EMR cluster with M6a compared to the equivalent with M5a. M6A 48 XL and 32 XL instances were not benchmarked because the M5A generation does not offer equivalent sizes.

The following table shows the results from running TPC-DS 3 TB benchmark queries using Amazon EMR 6.8 over equivalent M6a and M5a instance EMR clusters.

Instance Size 24 XL 16 XL 12 XL 8 XL 4 XL 2 XL XL
Total size of the cluster (1 Leader + 5 core nodes) 6 6 6 6 6 6 6
Total query runtime on M5A (seconds) 6624.1713838714 5466.7251180433 5269.0578151495 5366.1486275129 7753.6218015794 12118.0922180235 21070.6905510002
Total query runtime on M6A (seconds) 3295.2894058371 3063.7807673078 3399.1509249577 3482.8401591909 4906.2216891762 9184.4366036450 16107.9707619002
Total query runtime improvement with M6A 50.25% 43.96% 35.49% 35.10% 36.72% 24.21% 23.55%
Geometric mean query runtime M5A (sec) 51.1422829354 40.9550798753 38.4890223194 35.3863834186 44.8454957416 61.0454658020 92.6414502105
Geometric mean query runtime M6A (sec) 24.3406154481 22.3484713891 22.9913163520 23.0351017440 28.2855683398 46.4363267349 71.5498816854
Geometric mean query runtime improvement with M6A 52.41% 45.43% 40.27% 34.90% 36.93% 23.93% 22.77%
EC2 M5A instance price ($ per hour) $4.12800 $2.75200 $2.06400 $1.37600 $0.68800 $0.34400 $0.17200
EMR M5A instance price ($ per hour) $0.27000 $0.27000 $0.27000 $0.27000 $0.17200 $0.08600 $0.04300
(EC2 + EMR) M5A instance price ($ per hour) $4.39800 $3.02200 $2.33400 $1.64600 $0.86000 $0.43000 $0.21500
Cost of running on M5A ($ per instance) $8.09253 $4.58901 $3.41611 $2.45352 $1.85225 $1.44744 $1.25839
EC2 M6A instance price ($ per hour) $4.14720 $2.76480 $2.07360 $1.38240 $0.69120 $0.34560 $0.17280
EMR M6A price ($ per hour per instance) $1.03680 $0.69120 $0.51840 $0.34560 $0.17280 $0.08640 $0.04320
(EC2 + EMR) M6A instance price ($ per hour) $5.18400 $3.45600 $2.59200 $1.72800 $0.86400 $0.43200 $0.21600
Cost of running on M6A ($ per instance) $4.74522 $2.94123 $2.44739 $1.67176 $1.17749 $1.10213 $0.96648
Total cost reduction with M6A including performance improvement -41.36% -35.91% -28.36% -31.86% -36.43% -23.86% -23.20%

The following graph shows per query improvements observed on M6a 2XL instances compared to equivalent M5a generation. We observed that two queries take longer to execute on M6a instance clusters compared to M5a instance clusters. Q91 regressed up to 6.64 percent and Q55 regressed up to 1.86 percent on 4 XL instance clusters.

Amazon EMR runtime performance with EC2 R6A instances

R6A instances showed a similar performance improvement while running Apache Spark workloads compared to equivalent R5A instances. R6A 32XL and 48XL instances were not benchmarked since R5A instances do not have 32XL and 48XL sizes available. Our results showed 16–58.22 percent improvement in total query runtime for seven different instance sizes within the instance family and 20.04–59.59 percent improvement in geometric mean. In comparing costs, we observed 15.85–-50.07 percent reduction in cost on R6A instance EMR clusters compared to R5A EMR instance clusters.

The following table shows the results from running TPC-DS 3 TB benchmark queries using Amazon EMR 6.8 over equivalent R6A and R5A instance EMR clusters.

Instance Size 24 XL 16 XL 12 XL 8 XL 4 XL 2 XL XL
Total size of the cluster (1 Leader + 5 core nodes) 6 6 6 6 6 6 6
Total query runtime on R5A (seconds) 6934.22936 5530.74672 5834.32344 5718.72582 7615.58392 11431.37368 20688.58642
Total query runtime on R6A (seconds) 2897.44817 2906.49952 3017.85315 3488.83875 4661.32856 7717.33575 17378.49043
Total query runtime improvement with R6A 58.22% 47.45% 48.27% 38.99% 38.79% 32.49% 16.00%
Geometric mean query runtime R5A (sec) 53.27574 41.76973 42.50324 37.62155 44.58173 58.88182 91.72095
Geometric mean query runtime R6A (sec) 21.52803 21.36831 19.94607 21.59493 26.90097 36.57557 73.3405
Geometric mean query runtime improvement with R6A 59.59% 48.84% 53.07% 42.60% 39.66% 37.88% 20.04%
EC2 R5A instance price ($ per hour) $5.42400 $3.61600 $2.71200 $1.80800 $0.90400 $0.45200 $0.22600
EMR R5A instance price ($ per hour) $0.27000 $0.27000 $0.27000 $0.27000 $0.22600 $0.11300 $0.05700
(EC2 + EMR) R5A instance price ($ per hour) $5.69400 $3.88600 $2.98200 $2.07800 $1.13000 $0.56500 $0.28300
Cost of running on R5A ($ per instance) $10.96764 $5.97013 $4.83276 $3.30098 $2.39045 $1.79409 $1.62635
EC2 R6A instance price ($ per hour) $5.44320 $3.62880 $2.72160 $1.81440 $0.90720 $0.45360 $0.22680
EMR R6A price ($ per hour per instance) $1.36080 $0.90720 $0.68040 $0.45360 $0.22680 $0.11340 $0.05670
(EC2 + EMR) R6A instance price ($ per hour) $6.80400 $4.53600 $3.40200 $2.26800 $1.13400 $0.56700 $0.28350
Cost of running on R6A ($ per instance) $5.47618 $3.66219 $2.85187 $2.19797 $1.46832 $1.21548 $1.36856
Total cost reduction with R6A including performance improvement -50.07% -38.66% -40.99% -33.41% -38.58% -32.25% -15.85%

Benchmarking methodology

The benchmark used in this post is derived from the industry-standard TPC-DS benchmark and uses queries from the Spark SQL Performance Tests GitHub repo with the following fixes applied.

We calculated TCO by multiplying cost per hour by number of instances in the cluster and time taken to run the queries on the cluster. We used the on-demand pricing in the US East (N. Virginia) Region for all instances.

Conclusion

In this post, we described how we estimated the cost-performance benefit from using Amazon EMR with M6A and R6A instances compared to using equivalent previous-generation instances. Using these new instances with Amazon EMR improves price performance by 15–50%.


About the authors

AI MSAl MS is a product manager for Amazon EMR at Amazon Web Services.

Kyeonghyun Ryoo is a Software Development Engineer for EMR at Amazon Web Services. He primarily works on designing and building automation tools for internal teams and customers to maximize their productivity. Outside of work, he is a retired world champion in professional gaming who still enjoy playing video games.

Amazon EMR launches support for Amazon EC2 C6i, M6i, I4i, R6i and R6id instances to improve cost performance for Spark workloads by 6–33%

Post Syndicated from Al MS original https://aws.amazon.com/blogs/big-data/amazon-emr-launches-support-for-amazon-ec2-c6i-m6i-i4i-r6i-and-r6id-instances-to-improve-cost-performance-for-spark-workloads-by-6-33/

Amazon EMR provides a managed service to easily run analytics applications using open-source frameworks such as Apache Spark, Hive, Presto, Trino, HBase, and Flink. The Amazon EMR runtime for Spark and Presto includes optimizations that provide over two times performance improvements over open-source Apache Spark and Presto, so that your applications run faster and at lower cost.

With Amazon EMR release 6.8, you can now use Amazon Elastic Compute Cloud (Amazon EC2) instances such as C6i, M6i, I4i, R6i, and R6id, which use the third-generation Intel Xeon scalable processors. Using these new instances with Amazon EMR improves cost-performance by an additional 5–33% over previous generation instances.

In this post, we describe how we estimated the cost-performance benefit from using Amazon EMR with these new instances compared to using equivalent previous generation instances.

Amazon EMR runtime performance improvements with EC2 I4i instances

We ran TPC-DS 3 TB benchmark queries on Amazon EMR 6.8 using the Amazon EMR runtime for Apache Spark (compatible with Apache Spark 3.3) with five node clusters of I4i instances with data in Amazon Simple Storage Service (Amazon S3), and compared it to equivalent sized I3 instances. We measured performance improvements using the total query runtime and geometric mean of query runtime across the TPC-DS 3 TB benchmark queries.

Our results showed between 36.41–44.39% improvement in total query runtime performance on I4i instance EMR clusters compared to equivalent I3 instance EMR clusters, and between 36–45.2% improvement in geometric mean. To measure cost improvement, we added up the Amazon EMR and Amazon EC2 cost per instance per hour (on-demand) and multiplied it by the total query runtime. Note that I4i 32XL instances were not benchmarked because I3 instances don’t have the 32 XL size available. We observed between 22.56–33.1% reduced instance hour cost on I4i instance EMR clusters compared to equivalent I3 instance EMR clusters to run the TPC-DS benchmark queries. All TPC-DS queries ran faster on I4i instance clusters compared to I3 instance clusters.

The following table shows the results from running TPC-DS 3 TB benchmark queries using Amazon EMR 6.8 over equivalent I3 and I4i instance EMR clusters.

Instance Size 16 XL 8 XL 4 XL 2 XL XL
Number of core instances in EMR cluster 5 5 5 5 5
Total query runtime on I3 (seconds) 4752.15457 4506.43694 7110.03042 11853.40336 21333.05743
Total query runtime on I4I (seconds) 2642.77407 2812.05517 4415.0023 7537.52779 12981.20251
Total query runtime improvement with I4I 44.39% 37.60% 37.90% 36.41% 39.15%
Geometric mean query runtime on I3 (sec) 34.99551 29.14821 41.53093 60.8069 95.46128
Geometric mean query runtime on I4I (sec) 19.17906 18.65311 25.66263 38.13503 56.95073
Geometric mean query runtime improvement with I4I 45.20% 36.01% 38.21% 37.29% 40.34%
EC2 I3 instance price ($ per hour) $4.990 $2.496 $1.248 $0.624 $0.312
EMR I3 instance price ($ per hour) $0.270 $0.270 $0.270 $0.156 $0.078
(EC2 + EMR) I3 instance price ($ per hour) $5.260 $2.766 $1.518 $0.780 $0.390
Cost of running on I3 ($ per instance) $6.943 $3.462 $2.998 $2.568 $2.311
EC2 I4I instance price ($ per hour) $5.491 $2.746 $1.373 $0.686 $0.343
EMR I4I price ($ per hour per instance) $1.373 $0.687 $0.343 $0.172 $0.086
(EC2 + EMR) I4I instance price ($ per hour) $6.864 $3.433 $1.716 $0.858 $0.429
Cost of running on I4I ($ per instance) $5.039 $2.681 $2.105 $1.795 $1.546
Total cost reduction with I4I including performance improvement -27.43% -22.56% -29.79% -30.09% -33.10%

The following graph shows per query improvements we observed on I4i 2XL instances with EMR Runtime for Spark on Amazon EMR version 6.8 compared to equivalent I3 2XL instances for the TPC-DS 3 TB benchmark.

Amazon EMR runtime performance improvements with EC2 M6i instances

M6i instances showed a similar performance improvement while running Apache Spark workloads compared to equivalent M5 instances. Our test results showed between 13.45–29.52% improvement in total query runtime for seven different instance sizes within the instance family, and between 7.98–25.37% improvement in geometric mean. On cost comparison, we observed 7.98–25.37% reduced instance hour cost on M6i instance EMR clusters compared to M5 EMR instance clusters to run the TPC-DS benchmark queries.

The following table shows the results from running TPC-DS 3 TB benchmark queries using Amazon EMR 6.8 over equivalent M6i and M5 instance EMR clusters.

Instance Size 24 XL 16 XL 12 XL 8 XL 4 XL 2 XL XL
Number of core instances in EMR cluster 5 5 5 5 5 5 5
Total query runtime on M5 (seconds) 4027.58043 3782.10766 3348.05362 3516.4308 5621.22532 10075.45109 17278.15146
Total query runtime on M6I (seconds) 3106.43834 2665.70607 2714.69862 3043.5975 4195.02715 8226.88301 14515.50394
Total query runtime improvement with M6I 22.87% 29.52% 18.92% 13.45% 25.37% 18.35% 15.99%
Geometric mean query runtime M5 (sec) 30.45437 28.5207 23.95314 23.55958 32.95975 49.43178 75.95984
Geometric mean query runtime M6I (sec) 23.76853 19.21783 19.16869 19.9574 24.23012 39.09965 60.79494
Geometric mean query runtime improvement with M6I 21.95% 32.62% 19.97% 15.29% 26.49% 20.90% 19.96%
EC2 M5 instance price ($ per hour) $4.61 $3.07 $2.30 $1.54 $0.77 $0.38 $0.19
EMR M5 instance price ($ per hour) $0.27 $0.27 $0.27 $0.27 $0.19 $0.10 $0.05
(EC2 + EMR) M5 instance price ($ per hour) $4.88 $3.34 $2.57 $1.81 $0.96 $0.48 $0.24
Cost of running on M5 ($ per instance) $5.46 $3.51 $2.39 $1.76 $1.50 $1.34 $1.15
EC2 M6I instance price ($ per hour) $4.61 $3.07 $2.30 $1.54 $0.77 $0.38 $0.19
EMR M6I price ($ per hour per instance) $1.15 $0.77 $0.58 $0.38 $0.19 $0.10 $0.05
(EC2 + EMR) M6I instance price ($ per hour) $5.76 $3.84 $2.88 $1.92 $0.96 $0.48 $0.24
Cost of running on M6I ($ per instance) $4.97 $2.84 $2.17 $1.62 $1.12 $1.10 $0.97
Total cost reduction with M6I including performance improvement -8.92% -19.02% -9.28% -7.98% -25.37% -18.35% -15.99%

Amazon EMR runtime performance improvements with EC2 R6i instances

R6i instances showed a similar performance improvement while running Apache Spark workloads compared to equivalent R5 instances. Our test results showed between 14.25–32.23% improvement in total query runtime for six different instance sizes within the instance family, and between 16.12–36.5% improvement in geometric mean. R5.xlarge instances didn’t have sufficient memory to run TPC-DS benchmark queries, and weren’t included in this comparison. On cost comparison, we observed 5.48–23.5% reduced instance hour cost on R6i instance EMR clusters compared to R5 EMR instance clusters to run the TPC-DS benchmark queries.

The following table shows the results from running TPC-DS 3 TB benchmark queries using Amazon EMR 6.8 over equivalent R6i and R5 instance EMR clusters.

Instance Size 24 XL 16 XL 12 XL 8 XL 4 XL 2XL
Number of core instances in EMR cluster 5 5 5 5 5 5
Total query runtime on R5 (seconds) 4024.4737 3715.74432 3552.97298 3535.69879 5379.73168 9121.41532
Total query runtime on R6I (seconds) 2865.83169 2518.24192 2513.4849 3031.71973 4544.44854 6977.9508
Total query runtime improvement with R6I 28.79% 32.23% 29.26% 14.25% 15.53% 23.50%
Geometric mean query runtime R5 (sec) 30.59066 28.30849 25.30903 23.85511 32.33391 47.28424
Geometric mean query runtime R6I (sec) 21.87897 17.97587 17.54117 20.00918 26.6277 34.52817
Geometric mean query runtime improvement with R6I 28.48% 36.50% 30.69% 16.12% 17.65% 26.98%
EC2 R5 instance price ($ per hour) $6.0480 $4.0320 $3.0240 $2.0160 $1.0080 $0.5040
EMR R5 instance price ($ per hour) $0.2700 $0.2700 $0.2700 $0.2700 $0.2520 $0.1260
(EC2 + EMR) R5 instance price ($ per hour) $6.3180 $4.3020 $3.2940 $2.2860 $1.2600 $0.6300
Cost of running on R5 ($ per instance) $7.0630 $4.4403 $3.2510 $2.2452 $1.8829 $1.5962
EC2 R6I instance price ($ per hour) $6.0480 $4.0320 $3.0240 $2.0160 $1.0080 $0.5040
EMR R6I price ($ per hour per instance) $1.5120 $1.0080 $0.7560 $0.5040 $0.2520 $0.1260
(EC2 + EMR) R6I instance price ($ per hour) $7.5600 $5.0400 $3.7800 $2.5200 $1.2600 $0.6300
Cost of running on R6I ($ per instance) $6.0182 $3.5255 $2.6392 $2.1222 $1.5906 $1.2211
Total cost reduction with R6I including performance improvement -14.79% -20.60% -18.82% -5.48% -15.53% -23.50%

Amazon EMR runtime performance improvements with EC2 C6i instances

C6i instances showed a similar performance improvement while running Apache Spark workloads compared to equivalent C5 instances. Our test results showed between 16.9–58.22% improvement in total query runtime for four different instance sizes within the instance family, and between 20.25–59.59% improvement in geometric mean. Only C6i 24, 12, 4, and 2xlarge sizes were benchmarked because C5 doesn’t have 32, 16 and 8 xlarge sizes. C5.xlarge instances didn’t have sufficient memory to run TPC-DS benchmark queries, and weren’t included in this comparison. On cost comparison, we observed 16.75–50.07% reduced instance hour cost on C6i instance EMR clusters compared to C5 EMR instance clusters to run the TPC-DS benchmark queries.

The following table shows the results from running TPC-DS 3 TB benchmark queries using Amazon EMR 6.8 over equivalent C6i and C5 instance EMR clusters.

Instance Size * 24 XL 12 XL 4 XL 2 XL
Number of core instances in EMR cluster 5 5 5 5
Total query runtime on C5 (seconds) 3435.59808 2900.84981 5945.12879 10173.00757
Total query runtime on C6I (seconds) 2711.16147 2471.86778 5195.30093 8787.43422
Total query runtime improvement with C6I 21.09% 14.79% 12.61% 13.62%
Geometric mean query runtime C5 (sec) 25.67058 20.06539 31.76582 46.78632
Geometric mean query runtime C6I (sec) 20.4458 17.14133 26.92196 39.32622
Geometric mean query runtime improvement with C6I 20.35% 14.57% 15.25% 15.95%
EC2 C5 instance price ($ per hour) $4.080 $2.040 $0.680 $0.340
EMR C5 instance price ($ per hour) $0.270 $0.270 $0.170 $0.085
(EC2 + EMR) C5 instance price ($ per hour) $4.35000 $2.31000 $0.85000 $0.42500
Cost of running on C5 ($ per instance) $4.15135 $1.86138 $1.40371 $1.20098
EC2 C6I instance price ($ per hour) $4.0800 $2.0400 $0.6800 $0.3400
EMR C6I price ($ per hour per instance) $1.02000 $0.51000 $0.17000 $0.08500
(EC2 + EMR) C6I instance price ($ per hour) $5.10000 $2.55000 $0.85000 $0.42500
Cost of running on C6I ($ per instance) $3.84081 $1.75091 $1.22667 $1.03741
Total cost reduction with C6I including performance improvement -7.48% -5.93% -12.61% -13.62%

Amazon EMR runtime performance improvements with EC2 R6id instances

R6id instances showed a similar performance improvement while running Apache Spark workloads compared to equivalent R5D instances. Our test results showed between 11.8–28.7% improvement in total query runtime for five different instance sizes within the instance family, and between 15.1–32.0% improvement in geometric mean. R6ID 32 XL instances were not benchmarked because R5D instances don’t have these sizes available. On cost comparison, we observed 6.8–11.5% reduced instance hour cost on R6ID instance EMR clusters compared to R5D EMR instance clusters to run the TPC-DS benchmark queries.

The following table shows the results from running TPC-DS 3 TB benchmark queries using Amazon EMR 6.8 over equivalent R6id and R5d instance EMR clusters.

Instance Size 24 XL 16 XL 12 XL 8 XL 4 XL 2 XL XL
Number of core instances in EMR cluster 5 5 5 5 5 5 5
Total query runtime on R5D (seconds) 4054.4492975042 3691.7569385583 3598.6869168064 3532.7398928104 5397.5330161574 9281.2627059927 16862.8766838096
Total query runtime on R6ID (seconds) 2992.1198446983 2633.7131630720 2632.3186613402 2729.8860537867 4583.1040980373 7921.9960917943 14867.5391541445
Total query runtime improvement with R6ID 26.20% 28.66% 26.85% 22.73% 15.09% 14.65% 11.83%
Geometric mean query runtime R5D (sec) 31.0238156851 28.1432927726 25.7532157307 24.0596427675 32.5800246829 48.2306670294 76.6771994376
Geometric mean query runtime R6ID (sec) 22.8681174894 19.1282742957 18.6161830746 18.0498249257 25.9500918360 39.6580341258 65.0947323858
Geometric mean query runtime improvement with R6ID 26.29% 32.03% 27.71% 24.98% 20.35% 17.77% 15.11%
EC2 R5D instance price ($ per hour) $6.912000 $4.608000 $3.456000 $2.304000 $1.152000 $0.576000 $0.288000
EMR R5D instance price ($ per hour) $0.270000 $0.270000 $0.270000 $0.270000 $0.270000 $0.144000 $0.072000
(EC2 + EMR) R5D instance price ($ per hour) $7.182000 $4.878000 $3.726000 $2.574000 $1.422000 $0.720000 $0.360000
Cost of running on R5D ($ per instance) $8.088626 $5.002331 $3.724641 $2.525909 $2.132026 $1.856253 $1.686288
EC2 R6ID instance price ($ per hour) $7.257600 $4.838400 $3.628800 $2.419200 $1.209600 $0.604800 $0.302400
EMR R6ID price ($ per hour per instance) $1.814400 $1.209600 $0.907200 $0.604800 $0.302400 $0.151200 $0.075600
(EC2 + EMR) R6ID instance price ($ per hour) $9.072000 $6.048000 $4.536000 $3.024000 $1.512000 $0.756000 $0.378000
Cost of running on R6ID ($ per instance) $7.540142 $4.424638 $3.316722 $2.293104 $1.924904 $1.663619 $1.561092
Total cost reduction with R6ID including performance improvement -6.78% -11.55% -10.95% -9.22% -9.71% -10.38% -7.42%

Benchmarking methodology

The benchmark used in this post is derived from the industry-standard TPC-DS benchmark, and uses queries from the Spark SQL Performance Tests GitHub repo with the following fixes applied.

We calculated TCO by multiplying cost per hour by number of instances in the cluster and time taken to run the queries on the cluster. We used the on-demand pricing in the US East (N. Virginia) Region for all instances.

Conclusion

In this post, we described how we estimated the cost-performance benefit from using Amazon EMR with C6i, M6i, I4i, R6i, and R6id, instances compared to using equivalent previous generation instances. Using these new instances with Amazon EMR improves cost-performance by an additional 5–33%.


About the authors

AI MSAl MS is a product manager for Amazon EMR at Amazon Web Services.

Kyeonghyun Ryoo is a Software Development Engineer for EMR at Amazon Web Services. He primarily works on designing and building automation tools for internal teams and customers to maximize their productivity. Outside of work, he is a retired world champion in professional gaming who still enjoy playing video games.

Run Apache Spark 3.0 workloads 1.7 times faster with Amazon EMR runtime for Apache Spark

Post Syndicated from Al MS original https://aws.amazon.com/blogs/big-data/run-apache-spark-3-0-workloads-1-7-times-faster-with-amazon-emr-runtime-for-apache-spark/

With Amazon EMR release 6.1.0, Amazon EMR runtime for Apache Spark is now available for Spark 3.0.0. EMR runtime for Apache Spark is a performance-optimized runtime for Apache Spark that is 100% API compatible with open-source Apache Spark.

In our benchmark performance tests using TPC-DS benchmark queries at 3 TB scale, we found EMR runtime for Apache Spark 3.0 provides a 1.7 times performance improvement on average, and up to 8 times improved performance for individual queries over open-source Apache Spark 3.0.0. With Amazon EMR 6.1.0, you can now run your Apache Spark 3.0 applications faster and cheaper without requiring any changes to your applications.

Results observed using TPC-DS benchmarks

To evaluate the performance improvements, we used TPC-DS benchmark queries with 3 TB scale and ran them on a 6-node c4.8xlarge EMR cluster with data in Amazon Simple Storage Service (Amazon S3). We ran the tests with and without the EMR runtime for Apache Spark. The following two graphs compare the total aggregate runtime and geometric mean for all queries in the TPC-DS 3 TB query dataset between the Amazon EMR releases.

The following table shows the total runtime in seconds.

The following table shows the total runtime in seconds.

The following table shows the geometric mean of the runtime in seconds.

The following table shows the geometric mean of the runtime in seconds.

In our tests, all queries ran successfully on EMR clusters that used the EMR runtime for Apache Spark. However, when using Spark 3.0 without the EMR runtime, 34 out of the 104 benchmark queries failed due to SPARK-32663. To work around these issues, we disabled spark.shuffle.readHostLocalDisk configuration. However, even after this change, queries 14a and 14b continued to fail. Therefore, we chose to exclude these queries from our benchmark comparison.

The per-query speedup on Amazon EMR 6.1 with and without EMR runtime is illustrated in the following chart. The horizontal axis shows each query in the TPC-DS 3 TB benchmark. The vertical axis shows the speedup of each query due to the EMR runtime. We found a 1.7 times performance improvement as measured by the geometric mean of the per-query speedups, with all queries showing a performance improvement with the EMR Runtime.

The per-query speedup on Amazon EMR 6.1 with and without EMR runtime is also illustrated in the following chart.

Conclusion

You can run your Apache Spark 3.0 workloads faster and cheaper without making any changes to your applications by using Amazon EMR 6.1. To keep up to date, subscribe to the Big Data blog’s RSS feed to learn about more great Apache Spark optimizations, configuration best practices, and tuning advice.


About the Authors

AI MSAl MS is a product manager for Amazon EMR at Amazon Web Services.

 

 

 

 

Peter Gvozdjak is a senior engineering manager for EMR at Amazon Web Services.