All posts by Al MS

Apache Spark 4.0.1 preview now available on Amazon EMR Serverless

2026-01-26 Al MS

Post Syndicated from Al MS original https://aws.amazon.com/blogs/big-data/apache-spark-4-0-1-preview-now-available-on-amazon-emr-serverless/

Amazon EMR Serverless now supports Apache Spark 4.0.1 in preview, making analytics accessible to more users, simplifying data engineering workflows, and strengthening governance capabilities. The release introduces ANSI SQL compliance, VARIANT data type support for JSON handling, Apache Iceberg v3 table format support, and enhanced streaming capabilities. This preview is available in all AWS Regions where EMR Serverless is available.

In this post, we explore key benefits, technical capabilities, and considerations for getting started with Spark 4.0.1 on Amazon EMR Serverless, a serverless deployment option that simplifies running open-source big data frameworks without requiring you to manage clusters. With the emr-spark-8.0-preview release label, you can evaluate new SQL capabilities, Python API improvements, and streaming enhancements in your existing EMR Serverless environment.

Benefits

Spark 4.0.1 helps you solve data engineering problems with specific improvements. This section shows how new capabilities help with real-world scenarios.

Make analytics accessible to more users

Simplify Extract Transform Load (ETL) development with SQL scripting. Data engineers often switch between SQL and Python to build complex ETL logic with control flow. SQL scripting in Spark 4.0.1 enables loops, conditionals, and session variables directly in SQL, reducing context-switching and simplifying pipeline development. Use pipe syntax (|>) to chain operations for more readable, maintainable queries.

Improve data quality with ANSI SQL mode. Silent type conversion failures can introduce data quality issues. ANSI SQL mode, now enabled by default, enforces standard SQL behavior and raises errors for invalid operations instead of producing unexpected results. Important: because this changes the default behavior, test your queries thoroughly during this preview evaluation.

Simplify data engineering workflows

Process JSON data efficiently with VARIANT. Teams working with semi-structured data often see slow performance from repeated JSON parsing. The VARIANT data type stores JSON in an optimized binary format, eliminating parsing overhead. You can efficiently store and query JSON data in data lakes without schema rigidity.

Build Python data sources without Scala. Integrating custom data sources previously required Scala expertise. The Python Data Source API lets you build connectors entirely in Python, using existing Python skills and libraries without learning a new language.

Debug streaming applications with queryable state. Troubleshooting stateful streaming applications has historically required indirect methods. The new state data source reader shows streaming state as queryable DataFrames. You can inspect state during debugging, test state values in unit tests, and diagnose production incidents.

Strengthen governance capabilities

Establish comprehensive audit trails with Apache Iceberg v3. The Apache Iceberg v3 table format provides transaction guarantees and tracks data changes over time, giving you the audit trails needed for regulatory compliance. When combined with VARIANT data type support, you can maintain governance controls while handling semi-structured data efficiently in data lakes.

Key capabilities

Spark 4.0.1 Preview on EMR Serverless introduces four major capability areas:

SQL enhancements – ANSI mode, pipe syntax, VARIANT type, SQL scripting, user-defined functions (UDFs)
Python API advances – custom data sources, UDF profiling
Streaming improvements – stateful processing API v2, queryable state
Table format support – Amazon S3 Tables, AWS Lake Formation integration

The following sections provide technical details and code examples for each capability.

SQL enhancements

Spark 4.0.1 introduces new SQL capabilities including ANSI mode compliance, SQL UDFs, pipe syntax for readable queries, VARIANT type for JSON handling, and SQL scripting with control flow.

ANSI SQL mode by default

ANSI SQL mode is now enabled by default, enforcing standard SQL behavior for data integrity. Silent casting of out-of-range values now raises errors rather than producing unexpected results. Existing queries may behave differently, particularly around null handling, string casting, and timestamp operations. Use spark.sql.ansi.enabled=false if you need legacy behavior during migration.
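As a rough illustration of the difference, the following PySpark sketch runs the same invalid string-to-integer cast with CAST and with try_cast. It assumes an existing SparkSession is passed in as spark and that ANSI mode is left at its default; the helper name compare_casts is hypothetical and used only for this example.

```python
def compare_casts(spark, literal="abc"):
    """Cast a non-numeric string to INT with CAST and with try_cast.

    `spark` is assumed to be an existing SparkSession; `compare_casts` is a
    hypothetical helper name used only for this illustration.
    """
    outcome = {}

    # With ANSI mode on (the Spark 4.x default), an invalid CAST raises an
    # error when the query executes instead of silently returning NULL.
    try:
        row = spark.sql(f"SELECT CAST('{literal}' AS INT) AS v").collect()[0]
        outcome["cast"] = row[0]
    except Exception as exc:
        outcome["cast"] = f"error: {type(exc).__name__}"

    # try_cast keeps the query running and yields NULL for invalid input.
    row = spark.sql(f"SELECT try_cast('{literal}' AS INT) AS v").collect()[0]
    outcome["try_cast"] = row[0]
    return outcome
```

Rewriting risky casts with try_cast is one way to keep ANSI mode enabled while still tolerating dirty input, rather than turning the mode off for the whole job.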

SQL pipe syntax

You can now chain SQL operations using the |> operator for improved readability. The following example shows how you can replace nested subqueries with a more maintainable pipeline:

FROM customer
|> LEFT OUTER JOIN orders ON c_custkey = o_custkey
|> AGGREGATE COUNT(o_orderkey) c_count GROUP BY c_custkey
|> AGGREGATE COUNT(*) AS custdist GROUP BY c_count
|> ORDER BY custdist DESC

This replaces nested subqueries, making complex transformations easier to understand and maintain.

VARIANT data type

The VARIANT type handles semi-structured JSON/XML data efficiently without repeated parsing. It uses an optimized binary representation internally while maintaining schema-less flexibility. Previously, JSON expressions required repeated parsing, degrading performance. VARIANT eliminates this overhead. The following snippet shows how to parse JSON into the VARIANT type:

df = spark.sql("SELECT parse_json('{\"name\":\"Alice\",\"age\":30}') as data")
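Once JSON is stored as VARIANT, individual fields can be read back with the variant_get SQL function. The following is a minimal sketch, assuming an existing SparkSession is passed in as spark; the names VARIANT_QUERY and read_variant_fields are hypothetical.

```python
# Query text: parse a JSON literal into VARIANT, then extract typed fields.
# variant_get takes the VARIANT value, a JSON path, and a target SQL type.
VARIANT_QUERY = """
SELECT
  variant_get(data, '$.name', 'string') AS name,
  variant_get(data, '$.age', 'int') AS age
FROM (SELECT parse_json('{"name":"Alice","age":30}') AS data) AS t
"""


def read_variant_fields(spark):
    """Return a DataFrame with the `name` and `age` fields of the sample JSON.

    `spark` is assumed to be an existing SparkSession.
    """
    return spark.sql(VARIANT_QUERY)
```

Calling read_variant_fields(spark).show() on a Spark 4.x session should display one row with the two extracted columns.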

Spark 4.0.1 on EMR Serverless supports Apache Iceberg v3, enabling the VARIANT data type with Iceberg tables. This combination provides efficient storage and querying of semi-structured JSON data in your data lake. Store VARIANT columns in Iceberg tables and use Iceberg’s schema evolution and time travel capabilities alongside Spark’s optimized JSON processing. The following example shows how to create an Iceberg table with a VARIANT column:

CREATE TABLE catalog.db.events (
  event_id BIGINT,
  event_data VARIANT,
  timestamp TIMESTAMP
) USING iceberg;

INSERT INTO catalog.db.events SELECT 1, parse_json('{"user":"alice","action":"login"}'), current_timestamp();

SQL scripting with session variables

Manage state and control flow directly in SQL using session variables and IF, WHILE, and FOR statements. The following example demonstrates a loop that populates a results table:

BEGIN
  DECLARE counter INT = 10;
  WHILE counter > 0 DO
    INSERT INTO results VALUES (counter);
    SET counter = counter - 1;
  END WHILE;
END

This enables complex ETL logic entirely in SQL without switching to Python.

SQL user-defined functions

Define custom functions directly in SQL. Functions can be temporary (session-scoped) or permanent (catalog-stored). The following example shows how to register and use a simple UDF:

CREATE FUNCTION plusOne(x INT) RETURNS INT RETURN x + 1;
SELECT plusOne(5);

Python API advances

This section covers new Python capabilities including custom data sources and UDF profiling tools.

Python data source API

You can now build custom data sources in Python without Scala knowledge. The following example shows how to create a simple data source that returns sample data:

from pyspark.sql.datasource import DataSource, DataSourceReader
from pyspark.sql.types import StructType, StructField, StringType, IntegerType

class SampleDataSource(DataSource):
    def schema(self):
        return StructType([
            StructField("name", StringType()),
            StructField("age", IntegerType())
        ])
    
    def reader(self, schema):
        return SampleReader()

class SampleReader(DataSourceReader):
    def read(self, partition):
        yield ("Alice", 30)
        yield ("Bob", 25)

# Register and use
spark.dataSource.register(SampleDataSource)
spark.read.format("SampleDataSource").load().show()

Unified UDF profiling

Profile Python and Pandas UDFs for performance and memory insights. The following code enables performance profiling:

spark.conf.set("spark.sql.pyspark.udf.profiler", "perf") # or "memory"

Structured streaming enhancements

This section covers improvements to stateful stream processing, including queryable state and enhanced state management APIs.

Arbitrary stateful processing API v2

The transformWithState operator provides robust state management with timer and TTL support for automatic cleanup, schema evolution capabilities, and initial state support for pre-populating state from batch DataFrames.

State data source reader

Query streaming state as a DataFrame for debugging and monitoring. Previously, state data was internal to streaming queries. Now you can verify state values in unit tests, diagnose production incidents, detect state corruption, and optimize performance. Note: This feature is experimental. Source options and behavior may change in future releases.
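A minimal sketch of reading state from a checkpoint follows. It assumes an existing SparkSession is passed in as spark and uses the statestore and state-metadata source formats described in the Spark Structured Streaming documentation; the function name and checkpoint path are hypothetical, and because the feature is experimental the details may differ in your release.

```python
def load_streaming_state(spark, checkpoint_location):
    """Return (metadata_df, state_df) for a streaming query checkpoint.

    `spark` is assumed to be an existing SparkSession and
    `checkpoint_location` the checkpoint path of a stateful streaming query.
    """
    # "state-metadata" describes the stateful operators in the checkpoint.
    metadata_df = spark.read.format("state-metadata").load(checkpoint_location)

    # "statestore" exposes the stored state rows themselves as a DataFrame.
    state_df = spark.read.format("statestore").load(checkpoint_location)
    return metadata_df, state_df
```

In a unit test, you could run a small streaming query to completion, call load_streaming_state on its checkpoint directory, and assert on the rows of the returned state DataFrame.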

State store improvements

Upgraded changelog checkpointing for RocksDB removes performance bottlenecks. Enhanced checkpoint coordination and improved sorted string table (SST) file reuse management optimize streaming operations.

Table format support

This section covers support for Amazon S3 Tables and full table access (FTA) with AWS Lake Formation.

Amazon S3 Tables

Use Spark 4.0.1 with Amazon S3 Tables, a storage solution that provides managed Apache Iceberg tables with automatic optimization and maintenance. S3 Tables simplify data lake operations by handling compaction, snapshot management, and metadata cleanup automatically.

Full table access with Lake Formation

FTA is supported for Apache Iceberg, Delta Lake, and Apache Hive tables when using AWS Lake Formation, a managed service that simplifies data access control. FTA provides coarse-grained access control at the table level. Note that fine-grained access control (FGAC) with column-level or row-level permissions is not available in this preview.

Getting started

Follow these steps to create an EMR Serverless application, run sample code to test new features, and provide feedback on the preview.

Prerequisites

Before you begin, confirm you have the following:

AWS Command Line Interface (CLI): Install and configure the AWS CLI with credentials that have permissions to create and manage EMR Serverless applications
EMR Serverless access: Your Identity and Access Management (IAM) role must have permissions for EMR Serverless operations (emr-serverless:CreateApplication, emr-serverless:StartJobRun). See the EMR Serverless IAM policy examples for minimum required permissions
S3 bucket: An S3 bucket to store your job scripts and data
Runtime role: An IAM execution role for EMR Serverless with access to your S3 resources

Note: EMR Studio Notebooks and SageMaker Unified Studio are not supported during this preview. Use the AWS CLI or AWS SDK to submit jobs.

Step 1: Create your EMR Serverless application

Create or update your application with the emr-spark-8.0-preview release label. The following command creates a new application:

aws emr-serverless create-application --type spark \
  --release-label emr-spark-8.0-preview \
  --region us-east-1 --name spark4-test

Step 2: Test sample code

Run this PySpark job to verify setup and test Spark 4.0.1 features:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("Spark 4.0.1 Test").getOrCreate()
print(f"Spark Version: {spark.version}")

# Create sample data
data = [("Alice", 34, "Engineering"), ("Bob", 45, "Sales"),
        ("Charlie", 28, "Engineering"), ("Diana", 52, "Marketing")]
df = spark.createDataFrame(data, ["name", "age", "department"])
df.createOrReplaceTempView("employees")

# Test SQL PIPE syntax
try:
    result = spark.sql("""
        FROM employees
        |> WHERE age > 30
        |> SELECT name, age, department
        |> ORDER BY age DESC
    """)
    result.show()
    print("✓ SQL pipe syntax test passed")
except Exception as e:
    print(f"✗ SQL pipe syntax test failed: {e}")

# Test VARIANT data type
try:
    json_data = spark.sql("""
        SELECT parse_json('{"name":"Alice","skills":["Python","Spark","SQL"]}') as data
    """)
    json_data.show(truncate=False)
    print("✓ VARIANT data type test passed")
except Exception as e:
    print(f"✗ VARIANT data type test failed: {e}")

Submit the job with the following command:

aws emr-serverless start-job-run \
    --application-id <your-application-id> \
    --execution-role-arn <your-execution-role-arn> \
    --job-driver '{
        "sparkSubmit": {
            "entryPoint": "s3://<your-bucket>/spark_4_test.py",
            "sparkSubmitParameters": "--conf spark.executor.cores=4 --conf spark.executor.memory=16g"
        }
    }'
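Hand-writing the nested JSON for --job-driver is easy to get wrong in a shell. As an alternative sketch, the following standard-library Python script assembles the same command shown above; the function name build_start_job_run_command is hypothetical, and the placeholder values must be replaced with your own.

```python
import json
import shlex


def build_start_job_run_command(application_id, execution_role_arn,
                                entry_point, spark_conf=None):
    """Return the argv list for `aws emr-serverless start-job-run`."""
    conf = spark_conf or {}
    # Each Spark setting becomes one "--conf key=value" pair.
    submit_params = " ".join(f"--conf {key}={value}" for key, value in conf.items())
    job_driver = {
        "sparkSubmit": {
            "entryPoint": entry_point,
            "sparkSubmitParameters": submit_params,
        }
    }
    return [
        "aws", "emr-serverless", "start-job-run",
        "--application-id", application_id,
        "--execution-role-arn", execution_role_arn,
        "--job-driver", json.dumps(job_driver),
    ]


command = build_start_job_run_command(
    "<your-application-id>",
    "<your-execution-role-arn>",
    "s3://<your-bucket>/spark_4_test.py",
    {"spark.executor.cores": 4, "spark.executor.memory": "16g"},
)
# Print a shell-safe version of the command for review before running it.
print(shlex.join(command))
```

To submit the job from Python, pass the list to subprocess.run(command, check=True). Adding "spark.sql.ansi.enabled": "false" to the dictionary is one way to try the legacy cast behavior for a single run.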

Step 3: Test your workloads

Review the Spark SQL Migration Guide and PySpark Migration Guide, then test production workloads in non-production environments. Focus on queries affected by ANSI SQL mode and benchmark performance.

Step 4: Clean up resources

After testing, delete all resources created during this evaluation to avoid ongoing charges:

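# Stop the application first; it must be stopped before it can be deleted
aws emr-serverless stop-application \
    --application-id <your-application-id> \
    --region us-east-1
# Delete the EMR Serverless application (use the application ID, not its name)
aws emr-serverless delete-application \
    --application-id <your-application-id> \
    --region us-east-1
# Remove the test script from S3
aws s3 rm s3://<your-bucket>/spark_4_test.py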

Migration considerations

Before evaluating Spark 4.0.1, review the updated runtime requirements and behavioral changes that may affect your existing code.

Runtime requirements

Scala: Version 2.13.16 required (2.12 support dropped)
Java: JDK 17 or higher required (JDK 8 and 11 support removed)
Python: Version 3.9+ required, continued support for 3.11 and newly added 3.12 (3.8 support removed)
Pandas: Minimum version 2.0.0 (previously 1.0.5)
SparkR: Deprecated; migrate to PySpark

Behavioral changes

With ANSI SQL mode enforcement, you may see different behavior in:

Null handling: Stricter null propagation in expressions
String casting: Invalid casts now raise errors instead of returning null
Map key operations: Duplicate keys now raise errors
Timestamp conversions: Overflow returns null instead of wrapped values
CREATE TABLE statements: Now respect the spark.sql.sources.default configuration instead of defaulting to Hive format when USING or STORED AS clauses are omitted

You can control many of these behaviors via legacy configuration flags. For details, consult the official migration guides: Spark SQL Migration Guide: 3.5 to 4.0 and PySpark Migration Guide: 3.5 to 4.0.

Preview limitations

The following capabilities are not available in this preview:

Fine-grained access control: Fine-grained access control (FGAC) with row-level or column-level filtering is not supported in this preview. Jobs with spark.emr-serverless.lakeformation.enabled=true will fail.
Spark Connect: Not supported in this preview. Use standard Spark job submission with the StartJobRun API.
Open Table Format limitations: Hudi is not supported in this preview. Delta 4.0.0 does not support Flink connectors (deprecated in Delta 4.0.0). Delta Universal Format is not supported in this preview.
Connectors: spark-sql-kinesis, emr-dynamodb, and spark-redshift are unavailable.
Interactive applications: Livy and JupyterEnterpriseGateway are not included. Also, SageMaker Unified Studio and EMR Studio are not supported.
EMR features: Serverless Storage and Materialized Views are not supported.

This preview lets you evaluate Spark 4.0.1’s core capabilities on EMR Serverless, including SQL enhancements, Python API improvements, and streaming state management. Test your migration path, assess performance improvements, and provide feedback to shape the general availability release.

Conclusion

This post showed you how to get started with the Apache Spark 4.0.1 preview release on Amazon EMR Serverless. You explored how the VARIANT data type works with Iceberg v3 to process JSON data efficiently, how SQL scripting and pipe syntax eliminate context-switching for ETL development, and how queryable streaming state simplifies debugging stateful applications. You also learned about the preview limitations, runtime requirements, and behavioral changes to consider during evaluation.

Test the Spark 4.0.1 preview on EMR Serverless and provide feedback through AWS Support to help shape the general availability release.

To learn more about Apache Spark 4.0.1 features, see the Spark 4.0.1 Release Notes. For EMR Serverless documentation, see the EMR Release Guide.

Resources

Apache Spark Documentation

Amazon EMR Resources

Amazon EMR Spark Features


Amazon EMR launches support for Amazon EC2 C7g (Graviton3) instances to improve cost performance for Spark workloads by 7–13%

2023-02-01 Al MS

Post Syndicated from Al MS original https://aws.amazon.com/blogs/big-data/amazon-emr-launches-support-for-amazon-ec2-c7g-graviton3-instances-to-improve-cost-performance-for-spark-workloads-by-7-13/

Amazon EMR provides a managed service to easily run analytics applications using open-source frameworks such as Apache Spark, Hive, Presto, Trino, HBase, and Flink. The Amazon EMR runtime for Spark and Presto includes optimizations that provide over twice the performance improvements compared to open-source Apache Spark and Presto.

With Amazon EMR release 6.7, you can now use Amazon Elastic Compute Cloud (Amazon EC2) C7g instances, which use the AWS Graviton3 processors. These instances improve price-performance of running Spark workloads on Amazon EMR by 7.93–13.35% over previous generation instances, depending on the instance size. In this post, we describe how we estimated the price-performance benefit.

Amazon EMR runtime performance with EC2 C7g instances

We ran TPC-DS 3 TB benchmark queries on Amazon EMR 6.9 using the Amazon EMR runtime for Apache Spark (compatible with Apache Spark 3.3) with C7g instances. Data was stored in Amazon Simple Storage Service (Amazon S3), and results were compared to equivalent C6g clusters from the previous generation instance family. We measured performance improvements using the total query runtime and geometric mean of the query runtime across TPC-DS 3 TB benchmark queries.

Our results showed 13.65–18.73% improvement in total query runtime performance and 16.98–20.28% improvement in geometric mean on EMR clusters with C7g compared to equivalent EMR clusters with C6g instances, depending on the instance size. In comparing costs, we observed 7.93–13.35% reduction in cost on the EMR cluster with C7g compared to the equivalent with C6g, depending on the instance size. We did not benchmark the C6g xlarge instance because it didn’t have sufficient memory to run the queries.
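The improvement percentages can be reproduced from the runtimes in the results table. The following sketch assumes that improvement is computed as one minus the ratio of C7g runtime to C6g runtime, which matches the published 16 XL figures after rounding. The per-query runtimes in sample_runtimes are hypothetical and only show how a geometric mean is taken.

```python
from statistics import geometric_mean


def improvement_pct(baseline_seconds, candidate_seconds):
    """Percentage by which the candidate runtime is shorter than the baseline."""
    return (1 - candidate_seconds / baseline_seconds) * 100


# 16 XL column of the results table (C6g as baseline, C7g as candidate).
total_runtime_gain = improvement_pct(2774.86205, 2396.22799)
geomean_gain = improvement_pct(22.2113, 18.43905)
print(f"Total runtime improvement: {total_runtime_gain:.2f}%")
print(f"Geometric mean improvement: {geomean_gain:.2f}%")

# Hypothetical per-query runtimes, only to show the geometric mean itself.
sample_runtimes = [10.0, 20.0, 40.0]
print(f"Geometric mean of sample: {geometric_mean(sample_runtimes):.2f} s")
```

The geometric mean gives each query equal weight regardless of its absolute runtime, so it complements the total runtime, which is dominated by the longest queries.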

The following table shows the results from running the TPC-DS 3 TB benchmark queries using Amazon EMR 6.9 on equivalent C7g and C6g instance EMR clusters.

Instance Size	16 XL	12 XL	8 XL	4 XL	2 XL
Total size of the cluster (1 leader + 5 core nodes)	6	6	6	6	6
Total query runtime on C6g (seconds)	2774.86205	2752.84429	3173.08086	5108.45489	8697.08117
Total query runtime on C7g (seconds)	2396.22799	2336.28224	2698.72928	4151.85869	7249.58148
Total query runtime improvement with C7g	13.65%	15.13%	14.95%	18.73%	16.64%
Geometric mean query runtime C6g (seconds)	22.2113	21.75459	23.38081	31.97192	45.41656
Geometric mean query runtime C7g (seconds)	18.43905	17.65898	19.01684	25.48695	37.43737
Geometric mean query runtime improvement with C7g	16.98%	18.83%	18.66%	20.28%	17.57%
EC2 C6g instance price ($ per hour)	$2.1760	$1.6320	$1.0880	$0.5440	$0.2720
EMR C6g instance price ($ per hour)	$0.5440	$0.4080	$0.2720	$0.1360	$0.0680
(EC2 + EMR) instance price ($ per hour)	$2.7200	$2.0400	$1.3600	$0.6800	$0.3400
Cost of running on C6g ($ per instance)	$2.09656	$1.55995	$1.19872	$0.96493	$0.82139
EC2 C7g instance price ($ per hour)	$2.3200	$1.7400	$1.1600	$0.5800	$0.2900
EMR C7g price ($ per hour per instance)	$0.5800	$0.4350	$0.2900	$0.1450	$0.0725
(EC2 + EMR) C7g instance price ($ per hour)	$2.9000	$2.1750	$1.4500	$0.7250	$0.3625
Cost of running on C7g ($ per instance)	$1.930290	$1.411500	$1.086990	$0.836140	$0.729990
Total cost reduction with C7g including performance improvement	-7.93%	-9.52%	-9.32%	-13.35%	-11.13%

The following graph shows per-query improvements observed on C7g 2xlarge instances compared to equivalent C6g instances.

Benchmarking methodology

The benchmark used in this post is derived from the industry-standard TPC-DS benchmark, and uses queries from the Spark SQL Performance Tests GitHub repo with the following fixes applied.

We calculated TCO by multiplying cost per hour by number of instances in the cluster and time taken to run the queries on the cluster. We used on-demand pricing in the US East (N. Virginia) Region for all instances.
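As a worked example, the per-instance cost figures in the results table appear to follow hourly price multiplied by total runtime in hours. The following sketch reproduces the 16 XL column under that assumption. Multiplying by the six instances in the cluster would give the cluster total without changing the percentage.

```python
SECONDS_PER_HOUR = 3600


def run_cost(price_per_hour, runtime_seconds):
    """Cost in dollars for one instance over the benchmark run."""
    return price_per_hour * runtime_seconds / SECONDS_PER_HOUR


# 16 XL column: combined (EC2 + EMR) hourly price and total query runtime.
c6g_cost = run_cost(2.7200, 2774.86205)
c7g_cost = run_cost(2.9000, 2396.22799)

# Relative cost reduction of C7g compared to C6g.
cost_reduction_pct = (1 - c7g_cost / c6g_cost) * 100

print(f"C6g cost per instance: ${c6g_cost:.5f}")
print(f"C7g cost per instance: ${c7g_cost:.5f}")
print(f"Cost reduction with C7g: {cost_reduction_pct:.2f}%")
```

C7g costs more per hour than C6g, so the saving comes entirely from the shorter runtime outweighing the higher hourly price.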

Conclusion

In this post, we described how we estimated the cost-performance benefit from using Amazon EMR with C7g instances compared to using equivalent previous generation instances. Using these new instances with Amazon EMR improves cost-performance by an additional 7–13%.

About the authors

Al MS is a product manager for Amazon EMR at Amazon Web Services.

Kyeonghyun Ryoo is a Software Development Engineer for EMR at Amazon Web Services. He primarily works on designing and building automation tools for internal teams and customers to maximize their productivity. Outside of work, he is a retired world champion in professional gaming who still enjoys playing video games.

Yuzhou Sun is a software development engineer for EMR at Amazon Web Services.

Steve Koonce is an Engineering Manager for EMR at Amazon Web Services.

Amazon EMR launches support for Amazon EC2 M6A, R6A instances to improve cost performance for Spark workloads by 15–50%

2022-12-13 Al MS

Post Syndicated from Al MS original https://aws.amazon.com/blogs/big-data/amazon-emr-launches-support-for-amazon-ec2-m6a-r6a-instances-to-improve-cost-performance-for-spark-workloads-by-15-50/

Amazon EMR provides a managed service to easily run analytics applications using open-source frameworks such as Apache Spark, Hive, Presto, Trino, HBase, and Flink. The Amazon EMR runtime for Spark and Presto includes optimizations that provide over 2x performance improvements over open-source Apache Spark and Presto.

With Amazon EMR release 6.8, you can now use Amazon Elastic Compute Cloud (Amazon EC2) instances such as M6A and R6A, which use the third generation AMD EPYC processors. These instances improve the price performance of running Spark workloads on Amazon EMR by 15–50 percent over previous generation instances. In this blog post, we describe how we estimated this price performance benefit.

Amazon EMR runtime performance with EC2 M6A instances

We ran TPC-DS 3 TB benchmark queries on Amazon EMR 6.8 using Amazon EMR runtime for Apache Spark (compatible with Apache Spark 3.3) with M6a instances. Data was stored in Amazon Simple Storage Service (Amazon S3), and results were compared to equivalent clusters with M5a, which is the previous generation instance family. We measured performance improvements using the total query runtime and the geometric mean of query runtime across TPC-DS 3 TB benchmark queries.

Our results showed a 23.6–50.3 percent improvement in total query runtime performance and 22.8–52.4 percent in geometric mean on an EMR cluster with M6a compared to an equivalent EMR cluster with M5a instances. In comparing costs, we observed a 23.2–41.4 percent reduction in cost on the EMR cluster with M6a compared to the equivalent with M5a. M6A 48 XL and 32 XL instances were not benchmarked because the M5A generation does not offer equivalent sizes.

The following table shows the results from running TPC-DS 3 TB benchmark queries using Amazon EMR 6.8 over equivalent M6a and M5a instance EMR clusters.

Instance Size	24 XL	16 XL	12 XL	8 XL	4 XL	2 XL	XL
Total size of the cluster (1 Leader + 5 core nodes)	6	6	6	6	6	6	6
Total query runtime on M5A (seconds)	6624.1713838714	5466.7251180433	5269.0578151495	5366.1486275129	7753.6218015794	12118.0922180235	21070.6905510002
Total query runtime on M6A (seconds)	3295.2894058371	3063.7807673078	3399.1509249577	3482.8401591909	4906.2216891762	9184.4366036450	16107.9707619002
Total query runtime improvement with M6A	50.25%	43.96%	35.49%	35.10%	36.72%	24.21%	23.55%
Geometric mean query runtime M5A (sec)	51.1422829354	40.9550798753	38.4890223194	35.3863834186	44.8454957416	61.0454658020	92.6414502105
Geometric mean query runtime M6A (sec)	24.3406154481	22.3484713891	22.9913163520	23.0351017440	28.2855683398	46.4363267349	71.5498816854
Geometric mean query runtime improvement with M6A	52.41%	45.43%	40.27%	34.90%	36.93%	23.93%	22.77%
EC2 M5A instance price ($ per hour)	$4.12800	$2.75200	$2.06400	$1.37600	$0.68800	$0.34400	$0.17200
EMR M5A instance price ($ per hour)	$0.27000	$0.27000	$0.27000	$0.27000	$0.17200	$0.08600	$0.04300
(EC2 + EMR) M5A instance price ($ per hour)	$4.39800	$3.02200	$2.33400	$1.64600	$0.86000	$0.43000	$0.21500
Cost of running on M5A ($ per instance)	$8.09253	$4.58901	$3.41611	$2.45352	$1.85225	$1.44744	$1.25839
EC2 M6A instance price ($ per hour)	$4.14720	$2.76480	$2.07360	$1.38240	$0.69120	$0.34560	$0.17280
EMR M6A price ($ per hour per instance)	$1.03680	$0.69120	$0.51840	$0.34560	$0.17280	$0.08640	$0.04320
(EC2 + EMR) M6A instance price ($ per hour)	$5.18400	$3.45600	$2.59200	$1.72800	$0.86400	$0.43200	$0.21600
Cost of running on M6A ($ per instance)	$4.74522	$2.94123	$2.44739	$1.67176	$1.17749	$1.10213	$0.96648
Total cost reduction with M6A including performance improvement	-41.36%	-35.91%	-28.36%	-31.86%	-36.43%	-23.86%	-23.20%

The following graph shows per query improvements observed on M6a 2XL instances compared to equivalent M5a generation. We observed that two queries take longer to execute on M6a instance clusters compared to M5a instance clusters. Q91 regressed up to 6.64 percent and Q55 regressed up to 1.86 percent on 4 XL instance clusters.

Amazon EMR runtime performance with EC2 R6A instances

R6A instances showed a similar performance improvement while running Apache Spark workloads compared to equivalent R5A instances. R6A 32XL and 48XL instances were not benchmarked since R5A instances do not have 32XL and 48XL sizes available. Our results showed 16–58.22 percent improvement in total query runtime for seven different instance sizes within the instance family and 20.04–59.59 percent improvement in geometric mean. In comparing costs, we observed 15.85–50.07 percent reduction in cost on R6A instance EMR clusters compared to R5A instance EMR clusters.

The following table shows the results from running TPC-DS 3 TB benchmark queries using Amazon EMR 6.8 over equivalent R6A and R5A instance EMR clusters.

Instance Size	24 XL	16 XL	12 XL	8 XL	4 XL	2 XL	XL
Total size of the cluster (1 Leader + 5 core nodes)	6	6	6	6	6	6	6
Total query runtime on R5A (seconds)	6934.22936	5530.74672	5834.32344	5718.72582	7615.58392	11431.37368	20688.58642
Total query runtime on R6A (seconds)	2897.44817	2906.49952	3017.85315	3488.83875	4661.32856	7717.33575	17378.49043
Total query runtime improvement with R6A	58.22%	47.45%	48.27%	38.99%	38.79%	32.49%	16.00%
Geometric mean query runtime R5A (sec)	53.27574	41.76973	42.50324	37.62155	44.58173	58.88182	91.72095
Geometric mean query runtime R6A (sec)	21.52803	21.36831	19.94607	21.59493	26.90097	36.57557	73.3405
Geometric mean query runtime improvement with R6A	59.59%	48.84%	53.07%	42.60%	39.66%	37.88%	20.04%
EC2 R5A instance price ($ per hour)	$5.42400	$3.61600	$2.71200	$1.80800	$0.90400	$0.45200	$0.22600
EMR R5A instance price ($ per hour)	$0.27000	$0.27000	$0.27000	$0.27000	$0.22600	$0.11300	$0.05700
(EC2 + EMR) R5A instance price ($ per hour)	$5.69400	$3.88600	$2.98200	$2.07800	$1.13000	$0.56500	$0.28300
Cost of running on R5A ($ per instance)	$10.96764	$5.97013	$4.83276	$3.30098	$2.39045	$1.79409	$1.62635
EC2 R6A instance price ($ per hour)	$5.44320	$3.62880	$2.72160	$1.81440	$0.90720	$0.45360	$0.22680
EMR R6A price ($ per hour per instance)	$1.36080	$0.90720	$0.68040	$0.45360	$0.22680	$0.11340	$0.05670
(EC2 + EMR) R6A instance price ($ per hour)	$6.80400	$4.53600	$3.40200	$2.26800	$1.13400	$0.56700	$0.28350
Cost of running on R6A ($ per instance)	$5.47618	$3.66219	$2.85187	$2.19797	$1.46832	$1.21548	$1.36856
Total cost reduction with R6A including performance improvement	-50.07%	-38.66%	-40.99%	-33.41%	-38.58%	-32.25%	-15.85%

Benchmarking methodology

The benchmark used in this post is derived from the industry-standard TPC-DS benchmark and uses queries from the Spark SQL Performance Tests GitHub repo with the following fixes applied.

We calculated TCO by multiplying cost per hour by number of instances in the cluster and time taken to run the queries on the cluster. We used the on-demand pricing in the US East (N. Virginia) Region for all instances.

Conclusion

In this post, we described how we estimated the cost-performance benefit from using Amazon EMR with M6A and R6A instances compared to using equivalent previous-generation instances. Using these new instances with Amazon EMR improves price performance by 15–50%.

About the authors

Al MS is a product manager for Amazon EMR at Amazon Web Services.

Amazon EMR launches support for Amazon EC2 C6i, M6i, I4i, R6i and R6id instances to improve cost performance for Spark workloads by 6–33%

2022-12-07 Al MS

Post Syndicated from Al MS original https://aws.amazon.com/blogs/big-data/amazon-emr-launches-support-for-amazon-ec2-c6i-m6i-i4i-r6i-and-r6id-instances-to-improve-cost-performance-for-spark-workloads-by-6-33/

Amazon EMR provides a managed service to easily run analytics applications using open-source frameworks such as Apache Spark, Hive, Presto, Trino, HBase, and Flink. The Amazon EMR runtime for Spark and Presto includes optimizations that provide over two times performance improvements over open-source Apache Spark and Presto, so that your applications run faster and at lower cost.

With Amazon EMR release 6.8, you can now use Amazon Elastic Compute Cloud (Amazon EC2) instances such as C6i, M6i, I4i, R6i, and R6id, which use the third-generation Intel Xeon scalable processors. Using these new instances with Amazon EMR improves cost-performance by an additional 5–33% over previous generation instances.

In this post, we describe how we estimated the cost-performance benefit from using Amazon EMR with these new instances compared to using equivalent previous generation instances.

Amazon EMR runtime performance improvements with EC2 I4i instances

We ran TPC-DS 3 TB benchmark queries on Amazon EMR 6.8 using the Amazon EMR runtime for Apache Spark (compatible with Apache Spark 3.3) with five node clusters of I4i instances with data in Amazon Simple Storage Service (Amazon S3), and compared it to equivalent sized I3 instances. We measured performance improvements using the total query runtime and geometric mean of query runtime across the TPC-DS 3 TB benchmark queries.

Our results showed a 36.41–44.39% improvement in total query runtime performance on I4i instance EMR clusters compared to equivalent I3 instance EMR clusters, and a 36–45.2% improvement in geometric mean. To measure cost improvement, we added up the Amazon EMR and Amazon EC2 cost per instance per hour (on-demand) and multiplied it by the total query runtime. Note that I4i 32XL instances were not benchmarked because I3 instances don’t have the 32 XL size available. We observed a 22.56–33.1% reduction in instance hour cost on I4i instance EMR clusters compared to equivalent I3 instance EMR clusters when running the TPC-DS benchmark queries. All TPC-DS queries ran faster on I4i instance clusters compared to I3 instance clusters.

The following table shows the results from running TPC-DS 3 TB benchmark queries using Amazon EMR 6.8 over equivalent I3 and I4i instance EMR clusters.

Instance Size	16 XL	8 XL	4 XL	2 XL	XL
Number of core instances in EMR cluster	5	5	5	5	5
Total query runtime on I3 (seconds)	4752.15457	4506.43694	7110.03042	11853.40336	21333.05743
Total query runtime on I4I (seconds)	2642.77407	2812.05517	4415.0023	7537.52779	12981.20251
Total query runtime improvement with I4I	44.39%	37.60%	37.90%	36.41%	39.15%
Geometric mean query runtime on I3 (sec)	34.99551	29.14821	41.53093	60.8069	95.46128
Geometric mean query runtime on I4I (sec)	19.17906	18.65311	25.66263	38.13503	56.95073
Geometric mean query runtime improvement with I4I	45.20%	36.01%	38.21%	37.29%	40.34%
EC2 I3 instance price ($ per hour)	$4.990	$2.496	$1.248	$0.624	$0.312
EMR I3 instance price ($ per hour)	$0.270	$0.270	$0.270	$0.156	$0.078
(EC2 + EMR) I3 instance price ($ per hour)	$5.260	$2.766	$1.518	$0.780	$0.390
Cost of running on I3 ($ per instance)	$6.943	$3.462	$2.998	$2.568	$2.311
EC2 I4I instance price ($ per hour)	$5.491	$2.746	$1.373	$0.686	$0.343
EMR I4I price ($ per hour per instance)	$1.373	$0.687	$0.343	$0.172	$0.086
(EC2 + EMR) I4I instance price ($ per hour)	$6.864	$3.433	$1.716	$0.858	$0.429
Cost of running on I4I ($ per instance)	$5.039	$2.681	$2.105	$1.795	$1.546
Total cost reduction with I4I including performance improvement	-27.43%	-22.56%	-29.79%	-30.09%	-33.10%

The following graph shows per query improvements we observed on I4i 2XL instances with EMR Runtime for Spark on Amazon EMR version 6.8 compared to equivalent I3 2XL instances for the TPC-DS 3 TB benchmark.

Amazon EMR runtime performance improvements with EC2 M6i instances

M6i instances showed a similar performance improvement while running Apache Spark workloads compared to equivalent M5 instances. Our test results showed a 13.45–29.52% improvement in total query runtime for seven different instance sizes within the instance family, and a 15.29–32.62% improvement in geometric mean. On cost comparison, we observed a 7.98–25.37% reduction in instance hour cost on M6i instance EMR clusters compared to M5 instance EMR clusters when running the TPC-DS benchmark queries.

The following table shows the results from running TPC-DS 3 TB benchmark queries using Amazon EMR 6.8 over equivalent M6i and M5 instance EMR clusters.

Instance Size	24 XL	16 XL	12 XL	8 XL	4 XL	2 XL	XL
Number of core instances in EMR cluster	5	5	5	5	5	5	5
Total query runtime on M5 (seconds)	4027.58043	3782.10766	3348.05362	3516.4308	5621.22532	10075.45109	17278.15146
Total query runtime on M6I (seconds)	3106.43834	2665.70607	2714.69862	3043.5975	4195.02715	8226.88301	14515.50394
Total query runtime improvement with M6I	22.87%	29.52%	18.92%	13.45%	25.37%	18.35%	15.99%
Geometric mean query runtime M5 (sec)	30.45437	28.5207	23.95314	23.55958	32.95975	49.43178	75.95984
Geometric mean query runtime M6I (sec)	23.76853	19.21783	19.16869	19.9574	24.23012	39.09965	60.79494
Geometric mean query runtime improvement with M6I	21.95%	32.62%	19.97%	15.29%	26.49%	20.90%	19.96%
EC2 M5 instance price ($ per hour)	$4.61	$3.07	$2.30	$1.54	$0.77	$0.38	$0.19
EMR M5 instance price ($ per hour)	$0.27	$0.27	$0.27	$0.27	$0.19	$0.10	$0.05
(EC2 + EMR) M5 instance price ($ per hour)	$4.88	$3.34	$2.57	$1.81	$0.96	$0.48	$0.24
Cost of running on M5 ($ per instance)	$5.46	$3.51	$2.39	$1.76	$1.50	$1.34	$1.15
EC2 M6I instance price ($ per hour)	$4.61	$3.07	$2.30	$1.54	$0.77	$0.38	$0.19
EMR M6I price ($ per hour per instance)	$1.15	$0.77	$0.58	$0.38	$0.19	$0.10	$0.05
(EC2 + EMR) M6I instance price ($ per hour)	$5.76	$3.84	$2.88	$1.92	$0.96	$0.48	$0.24
Cost of running on M6I ($ per instance)	$4.97	$2.84	$2.17	$1.62	$1.12	$1.10	$0.97
Total cost reduction with M6I including performance improvement	-8.92%	-19.02%	-9.28%	-7.98%	-25.37%	-18.35%	-15.99%

Amazon EMR runtime performance improvements with EC2 R6i instances

R6i instances showed a similar performance improvement while running Apache Spark workloads compared to equivalent R5 instances. Our test results showed between 14.25–32.23% improvement in total query runtime for six different instance sizes within the instance family, and between 16.12–36.5% improvement in geometric mean. R5.xlarge instances didn’t have sufficient memory to run TPC-DS benchmark queries, and weren’t included in this comparison. On cost comparison, we observed 5.48–23.5% reduced instance hour cost on R6i instance EMR clusters compared to R5 EMR instance clusters to run the TPC-DS benchmark queries.

The following table shows the results from running TPC-DS 3 TB benchmark queries using Amazon EMR 6.8 over equivalent R6i and R5 instance EMR clusters.

Instance Size	24 XL	16 XL	12 XL	8 XL	4 XL	2 XL
Number of core instances in EMR cluster	5	5	5	5	5	5
Total query runtime on R5 (seconds)	4024.4737	3715.74432	3552.97298	3535.69879	5379.73168	9121.41532
Total query runtime on R6I (seconds)	2865.83169	2518.24192	2513.4849	3031.71973	4544.44854	6977.9508
Total query runtime improvement with R6I	28.79%	32.23%	29.26%	14.25%	15.53%	23.50%
Geometric mean query runtime R5 (sec)	30.59066	28.30849	25.30903	23.85511	32.33391	47.28424
Geometric mean query runtime R6I (sec)	21.87897	17.97587	17.54117	20.00918	26.6277	34.52817
Geometric mean query runtime improvement with R6I	28.48%	36.50%	30.69%	16.12%	17.65%	26.98%
EC2 R5 instance price ($ per hour)	$6.0480	$4.0320	$3.0240	$2.0160	$1.0080	$0.5040
EMR R5 instance price ($ per hour)	$0.2700	$0.2700	$0.2700	$0.2700	$0.2520	$0.1260
(EC2 + EMR) R5 instance price ($ per hour)	$6.3180	$4.3020	$3.2940	$2.2860	$1.2600	$0.6300
Cost of running on R5 ($ per instance)	$7.0630	$4.4403	$3.2510	$2.2452	$1.8829	$1.5962
EC2 R6I instance price ($ per hour)	$6.0480	$4.0320	$3.0240	$2.0160	$1.0080	$0.5040
EMR R6I price ($ per hour per instance)	$1.5120	$1.0080	$0.7560	$0.5040	$0.2520	$0.1260
(EC2 + EMR) R6I instance price ($ per hour)	$7.5600	$5.0400	$3.7800	$2.5200	$1.2600	$0.6300
Cost of running on R6I ($ per instance)	$6.0182	$3.5255	$2.6392	$2.1222	$1.5906	$1.2211
Total cost reduction with R6I including performance improvement	-14.79%	-20.60%	-18.82%	-5.48%	-15.53%	-23.50%

Amazon EMR runtime performance improvements with EC2 C6i instances

C6i instances showed a similar performance improvement while running Apache Spark workloads compared to equivalent C5 instances. Our test results showed a 12.61–21.09% improvement in total query runtime for four different instance sizes within the instance family, and a 14.57–20.35% improvement in geometric mean. Only the C6i 24, 12, 4, and 2 xlarge sizes were benchmarked because C5 doesn’t have 32, 16, and 8 xlarge sizes. C5.xlarge instances didn’t have sufficient memory to run the TPC-DS benchmark queries, and weren’t included in this comparison. On cost comparison, we observed a 5.93–13.62% reduction in instance hour cost on C6i instance EMR clusters compared to C5 instance EMR clusters when running the TPC-DS benchmark queries.

The following table shows the results from running TPC-DS 3 TB benchmark queries using Amazon EMR 6.8 over equivalent C6i and C5 instance EMR clusters.

Instance Size *	24 XL	12 XL	4 XL	2 XL
Number of core instances in EMR cluster	5	5	5	5
Total query runtime on C5 (seconds)	3435.59808	2900.84981	5945.12879	10173.00757
Total query runtime on C6I (seconds)	2711.16147	2471.86778	5195.30093	8787.43422
Total query runtime improvement with C6I	21.09%	14.79%	12.61%	13.62%
Geometric mean query runtime C5 (sec)	25.67058	20.06539	31.76582	46.78632
Geometric mean query runtime C6I (sec)	20.4458	17.14133	26.92196	39.32622
Geometric mean query runtime improvement with C6I	20.35%	14.57%	15.25%	15.95%
EC2 C5 instance price ($ per hour)	$4.080	$2.040	$0.680	$0.340
EMR C5 instance price ($ per hour)	$0.270	$0.270	$0.170	$0.085
(EC2 + EMR) C5 instance price ($ per hour)	$4.35000	$2.31000	$0.85000	$0.42500
Cost of running on C5 ($ per instance)	$4.15135	$1.86138	$1.40371	$1.20098
EC2 C6I instance price ($ per hour)	$4.0800	$2.0400	$0.6800	$0.3400
EMR C6I price ($ per hour per instance)	$1.02000	$0.51000	$0.17000	$0.08500
(EC2 + EMR) C6I instance price ($ per hour)	$5.10000	$2.55000	$0.85000	$0.42500
Cost of running on C6I ($ per instance)	$3.84081	$1.75091	$1.22667	$1.03741
Total cost reduction with C6I including performance improvement	-7.48%	-5.93%	-12.61%	-13.62%

Amazon EMR runtime performance improvements with EC2 R6id instances

R6id instances showed a similar performance improvement while running Apache Spark workloads compared to equivalent R5d instances. Our test results showed an 11.8–28.7% improvement in total query runtime for seven different instance sizes within the instance family, and a 15.1–32.0% improvement in geometric mean. R6id 32 XL instances were not benchmarked because R5d instances don’t have this size available. On cost comparison, we observed a 6.8–11.5% reduction in instance hour cost on R6id instance EMR clusters compared to R5d instance EMR clusters when running the TPC-DS benchmark queries.

The following table shows the results from running TPC-DS 3 TB benchmark queries using Amazon EMR 6.8 over equivalent R6id and R5d instance EMR clusters.

Instance Size	24 XL	16 XL	12 XL	8 XL	4 XL	2 XL	XL
Number of core instances in EMR cluster	5	5	5	5	5	5	5
Total query runtime on R5D (seconds)	4054.4492975042	3691.7569385583	3598.6869168064	3532.7398928104	5397.5330161574	9281.2627059927	16862.8766838096
Total query runtime on R6ID (seconds)	2992.1198446983	2633.7131630720	2632.3186613402	2729.8860537867	4583.1040980373	7921.9960917943	14867.5391541445
Total query runtime improvement with R6ID	26.20%	28.66%	26.85%	22.73%	15.09%	14.65%	11.83%
Geometric mean query runtime R5D (sec)	31.0238156851	28.1432927726	25.7532157307	24.0596427675	32.5800246829	48.2306670294	76.6771994376
Geometric mean query runtime R6ID (sec)	22.8681174894	19.1282742957	18.6161830746	18.0498249257	25.9500918360	39.6580341258	65.0947323858
Geometric mean query runtime improvement with R6ID	26.29%	32.03%	27.71%	24.98%	20.35%	17.77%	15.11%
EC2 R5D instance price ($ per hour)	$6.912000	$4.608000	$3.456000	$2.304000	$1.152000	$0.576000	$0.288000
EMR R5D instance price ($ per hour)	$0.270000	$0.270000	$0.270000	$0.270000	$0.270000	$0.144000	$0.072000
(EC2 + EMR) R5D instance price ($ per hour)	$7.182000	$4.878000	$3.726000	$2.574000	$1.422000	$0.720000	$0.360000
Cost of running on R5D ($ per instance)	$8.088626	$5.002331	$3.724641	$2.525909	$2.132026	$1.856253	$1.686288
EC2 R6ID instance price ($ per hour)	$7.257600	$4.838400	$3.628800	$2.419200	$1.209600	$0.604800	$0.302400
EMR R6ID price ($ per hour per instance)	$1.814400	$1.209600	$0.907200	$0.604800	$0.302400	$0.151200	$0.075600
(EC2 + EMR) R6ID instance price ($ per hour)	$9.072000	$6.048000	$4.536000	$3.024000	$1.512000	$0.756000	$0.378000
Cost of running on R6ID ($ per instance)	$7.540142	$4.424638	$3.316722	$2.293104	$1.924904	$1.663619	$1.561092
Total cost reduction with R6ID including performance improvement	-6.78%	-11.55%	-10.95%	-9.22%	-9.71%	-10.38%	-7.42%

Benchmarking methodology

The benchmark used in this post is derived from the industry-standard TPC-DS benchmark, and uses queries from the Spark SQL Performance Tests GitHub repo with the following fixes applied.

Conclusion

In this post, we described how we estimated the cost-performance benefit from using Amazon EMR with C6i, M6i, I4i, R6i, and R6id instances compared to using equivalent previous generation instances. Using these new instances with Amazon EMR improves cost-performance by an additional 5–33%.

About the authors

Al MS is a product manager for Amazon EMR at Amazon Web Services.

Run Apache Spark 3.0 workloads 1.7 times faster with Amazon EMR runtime for Apache Spark

2021-01-19 Al MS

Post Syndicated from Al MS original https://aws.amazon.com/blogs/big-data/run-apache-spark-3-0-workloads-1-7-times-faster-with-amazon-emr-runtime-for-apache-spark/

With Amazon EMR release 6.1.0, Amazon EMR runtime for Apache Spark is now available for Spark 3.0.0. EMR runtime for Apache Spark is a performance-optimized runtime for Apache Spark that is 100% API compatible with open-source Apache Spark.

In our benchmark performance tests using TPC-DS benchmark queries at 3 TB scale, we found EMR runtime for Apache Spark 3.0 provides a 1.7 times performance improvement on average, and up to 8 times improved performance for individual queries over open-source Apache Spark 3.0.0. With Amazon EMR 6.1.0, you can now run your Apache Spark 3.0 applications faster and cheaper without requiring any changes to your applications.

Results observed using TPC-DS benchmarks

To evaluate the performance improvements, we used TPC-DS benchmark queries with 3 TB scale and ran them on a 6-node c4.8xlarge EMR cluster with data in Amazon Simple Storage Service (Amazon S3). We ran the tests with and without the EMR runtime for Apache Spark. The following two graphs compare the total aggregate runtime and geometric mean for all queries in the TPC-DS 3 TB query dataset between the Amazon EMR releases.

The following table shows the total runtime in seconds.

The following table shows the geometric mean of the runtime in seconds.

In our tests, all queries ran successfully on EMR clusters that used the EMR runtime for Apache Spark. However, when using Spark 3.0 without the EMR runtime, 34 out of the 104 benchmark queries failed due to SPARK-32663. To work around this issue, we disabled the spark.shuffle.readHostLocalDisk configuration. However, even after this change, queries 14a and 14b continued to fail. Therefore, we chose to exclude these queries from our benchmark comparison.
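For reference, one way to apply such a workaround is to pass the property at submit time. The following Python sketch only assembles a spark-submit command line and does not run it. The helper name and the script name benchmark.py are hypothetical, and the sketch assumes that disabling the property means setting it to false.

```python
import shlex

def build_spark_submit(script, conf):
    """Assemble a spark-submit argument list with one --conf flag per property."""
    args = ["spark-submit"]
    for key, value in sorted(conf.items()):
        args += ["--conf", f"{key}={value}"]
    args.append(script)
    return args

# benchmark.py is a placeholder for the job being submitted.
cmd = build_spark_submit(
    "benchmark.py",
    {"spark.shuffle.readHostLocalDisk": "false"},
)
print(shlex.join(cmd))
```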

The per-query speedup on Amazon EMR 6.1 with and without EMR runtime is illustrated in the following chart. The horizontal axis shows each query in the TPC-DS 3 TB benchmark. The vertical axis shows the speedup of each query due to the EMR runtime. We found a 1.7 times performance improvement as measured by the geometric mean of the per-query speedups, with all queries showing a performance improvement with the EMR Runtime.

The per-query speedup on Amazon EMR 6.1 with and without EMR runtime is also illustrated in the following chart.
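To make the speedup metric concrete, here is a minimal Python sketch that derives per-query speedups and their geometric mean from two sets of runtimes. The function names, query names, and runtimes are hypothetical and are not benchmark data. Queries missing from either run are skipped, which mirrors excluding queries that failed in one configuration.

```python
import math

def per_query_speedups(baseline, optimized):
    """Map each query present in both runs to baseline_runtime / optimized_runtime."""
    return {q: baseline[q] / optimized[q] for q in baseline if q in optimized}

def geomean(values):
    """Geometric mean of positive values, computed via logarithms."""
    values = list(values)
    return math.exp(sum(math.log(v) for v in values) / len(values))

# Hypothetical runtimes in seconds; "q4" has no optimized result and is skipped.
without_runtime = {"q1": 20.0, "q2": 90.0, "q3": 64.0, "q4": 15.0}
with_runtime = {"q1": 10.0, "q2": 30.0, "q3": 48.0}

speedups = per_query_speedups(without_runtime, with_runtime)
for query, speedup in sorted(speedups.items()):
    print(f"{query}: {speedup:.2f}x")
print(f"geometric mean speedup: {geomean(speedups.values()):.2f}x")
```

A speedup above 1.0 means the query ran faster in the optimized run.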

Conclusion

You can run your Apache Spark 3.0 workloads faster and cheaper without making any changes to your applications by using Amazon EMR 6.1. To keep up to date, subscribe to the Big Data blog’s RSS feed to learn about more great Apache Spark optimizations, configuration best practices, and tuning advice.

About the Authors

Al MS is a product manager for Amazon EMR at Amazon Web Services.

Peter Gvozdjak is a senior engineering manager for EMR at Amazon Web Services.

