All posts by Al MS

Amazon EMR launches support for Amazon EC2 C7g (Graviton3) instances to improve cost performance for Spark workloads by 7–13%

Post Syndicated from Al MS original https://aws.amazon.com/blogs/big-data/amazon-emr-launches-support-for-amazon-ec2-c7g-graviton3-instances-to-improve-cost-performance-for-spark-workloads-by-7-13/

Amazon EMR provides a managed service to easily run analytics applications using open-source frameworks such as Apache Spark, Hive, Presto, Trino, HBase, and Flink. The Amazon EMR runtime for Spark and Presto includes optimizations that provide over twice the performance improvements compared to open-source Apache Spark and Presto.

With Amazon EMR release 6.7, you can now use Amazon Elastic Compute Cloud (Amazon EC2) C7g instances, which use the AWS Graviton3 processors. These instances improve price-performance of running Spark workloads on Amazon EMR by 7.93–13.35% over previous generation instances, depending on the instance size. In this post, we describe how we estimated the price-performance benefit.

Amazon EMR runtime performance with EC2 C7g instances

We ran TPC-DS 3 TB benchmark queries on Amazon EMR 6.9 using the Amazon EMR runtime for Apache Spark (compatible with Apache Spark 3.3) with C7g instances. Data was stored in Amazon Simple Storage Service (Amazon S3), and results were compared to equivalent C6g clusters from the previous generation instance family. We measured performance improvements using the total query runtime and geometric mean of the query runtime across TPC-DS 3 TB benchmark queries.

Our results showed 13.65–18.73% improvement in total query runtime performance and 16.98–20.28% improvement in geometric mean on EMR clusters with C7g compared to equivalent EMR clusters with C6g instances, depending on the instance size. In comparing costs, we observed 7.93–13.35% reduction in cost on the EMR cluster with C7g compared to the equivalent with C6g, depending on the instance size. We did not benchmark the C6g xlarge instance because it didn’t have sufficient memory to run the queries.

The following table shows the results from running the TPC-DS 3 TB benchmark queries using Amazon EMR 6.9 compared to equivalent C7g and C6g instance EMR clusters.

Instance Size 16 XL 12 XL 8 XL 4 XL 2 XL
Total size of the cluster (1 leader + 5 core nodes) 6 6 6 6 6
Total query runtime on C6g (seconds) 2774.86205 2752.84429 3173.08086 5108.45489 8697.08117
Total query runtime on C7g (seconds) 2396.22799 2336.28224 2698.72928 4151.85869 7249.58148
Total query runtime improvement with C7g 13.65% 15.13% 14.95% 18.73% 16.64%
Geometric mean query runtime C6g (seconds) 22.2113 21.75459 23.38081 31.97192 45.41656
Geometric mean query runtime C7g (seconds) 18.43905 17.65898 19.01684 25.48695 37.43737
Geometric mean query runtime improvement with C7g 16.98% 18.83% 18.66% 20.28% 17.57%
EC2 C6g instance price ($ per hour) $2.1760 $1.6320 $1.0880 $0.5440 $0.2720
EMR C6g instance price ($ per hour) $0.5440 $0.4080 $0.2720 $0.1360 $0.0680
(EC2 + EMR) instance price ($ per hour) $2.7200 $2.0400 $1.3600 $0.6800 $0.3400
Cost of running on C6g ($ per instance) $2.09656 $1.55995 $1.19872 $0.96493 $0.82139
EC2 C7g instance price ($ per hour) $2.3200 $1.7400 $1.1600 $0.5800 $0.2900
EMR C7g price ($ per hour per instance) $0.5800 $0.4350 $0.2900 $0.1450 $0.0725
(EC2 + EMR) C7g instance price ($ per hour) $2.9000 $2.1750 $1.4500 $0.7250 $0.3625
Cost of running on C7g ($ per instance) $1.930290 $1.411500 $1.086990 $0.836140 $0.729990
Total cost reduction with C7g including performance improvement -7.93% -9.52% -9.32% -13.35% -11.13%

The following graph shows per-query improvements observed on C7g 2xlarge instances compared to equivalent C6g generations.

Benchmarking methodology

The benchmark used in this post is derived from the industry-standard TPC-DS benchmark, and uses queries from the Spark SQL Performance Tests GitHub repo with the following fixes applied.

We calculated TCO by multiplying cost per hour by number of instances in the cluster and time taken to run the queries on the cluster. We used on-demand pricing in the US East (N. Virginia) Region for all instances.

Conclusion

In this post, we described how we estimated the cost-performance benefit from using Amazon EMR with C7g instances compared to using equivalent previous generation instances. Using these new instances with Amazon EMR improves cost-performance by an additional 7–13%.


About the authors

AI MSAl MS is a product manager for Amazon EMR at Amazon Web Services.

Kyeonghyun Ryoo is a Software Development Engineer for EMR at Amazon Web Services. He primarily works on designing and building automation tools for internal teams and customers to maximize their productivity. Outside of work, he is a retired world champion in professional gaming who still enjoy playing video games.

Yuzhou Sun is a software development engineer for EMR at Amazon Web Services.

Steve Koonce is an Engineering Manager for EMR at Amazon Web Services.

Amazon EMR launches support for Amazon EC2 M6A, R6A instances to improve cost performance for Spark workloads by 15–50% 

Post Syndicated from Al MS original https://aws.amazon.com/blogs/big-data/amazon-emr-launches-support-for-amazon-ec2-m6a-r6a-instances-to-improve-cost-performance-for-spark-workloads-by-15-50/

Amazon EMR provides a managed service to easily run analytics applications using open-source frameworks such as Apache Spark, Hive, Presto, Trino, HBase, and Flink. The Amazon EMR runtime for Spark and Presto includes optimizations that provide over 2x performance improvements over open-source Apache Spark and Presto.

With Amazon EMR release 6.8, you can now use Amazon Elastic Compute Cloud (Amazon EC2) instances such as M6A and C6A, which use the third generation AMD EPYC processors. These instances improve the price performance of running Spark workloads on Amazon EMR by 15–50 percent over previous generation instances. In this blog post, we describe how we estimated this price performance benefit.

Amazon EMR runtime performance with EC2 M6A instances

We ran TPC-DS 3 TB benchmark queries on Amazon EMR 6.8 using Amazon EMR runtime for Apache Spark (compatible with Apache Spark 3.3) with M6a instances. Data was stored in Amazon Simple Storage Service (Amazon S3), and results were compared to equivalent clusters with M5a, which is the previous generation instance family. We measured performance improvements using the total query runtime and the geometric mean of query runtime across TPC-DS 3 TB benchmark queries.

Our results showed a 23.6–50.3 percent improvement in total query runtime performance and 22.8–52.4 percent in geometric mean on an EMR cluster with M6a compared to an equivalent EMR cluster with M5a instances. In comparing costs, we observed a 23.2–41.4 percent reduction in cost on the EMR cluster with M6a compared to the equivalent with M5a. M6A 48 XL and 32 XL instances were not benchmarked because the M5A generation does not offer equivalent sizes.

The following table shows the results from running TPC-DS 3 TB benchmark queries using Amazon EMR 6.8 over equivalent M6a and M5a instance EMR clusters.

Instance Size 24 XL 16 XL 12 XL 8 XL 4 XL 2 XL XL
Total size of the cluster (1 Leader + 5 core nodes) 6 6 6 6 6 6 6
Total query runtime on M5A (seconds) 6624.1713838714 5466.7251180433 5269.0578151495 5366.1486275129 7753.6218015794 12118.0922180235 21070.6905510002
Total query runtime on M6A (seconds) 3295.2894058371 3063.7807673078 3399.1509249577 3482.8401591909 4906.2216891762 9184.4366036450 16107.9707619002
Total query runtime improvement with M6A 50.25% 43.96% 35.49% 35.10% 36.72% 24.21% 23.55%
Geometric mean query runtime M5A (sec) 51.1422829354 40.9550798753 38.4890223194 35.3863834186 44.8454957416 61.0454658020 92.6414502105
Geometric mean query runtime M6A (sec) 24.3406154481 22.3484713891 22.9913163520 23.0351017440 28.2855683398 46.4363267349 71.5498816854
Geometric mean query runtime improvement with M6A 52.41% 45.43% 40.27% 34.90% 36.93% 23.93% 22.77%
EC2 M5A instance price ($ per hour) $4.12800 $2.75200 $2.06400 $1.37600 $0.68800 $0.34400 $0.17200
EMR M5A instance price ($ per hour) $0.27000 $0.27000 $0.27000 $0.27000 $0.17200 $0.08600 $0.04300
(EC2 + EMR) M5A instance price ($ per hour) $4.39800 $3.02200 $2.33400 $1.64600 $0.86000 $0.43000 $0.21500
Cost of running on M5A ($ per instance) $8.09253 $4.58901 $3.41611 $2.45352 $1.85225 $1.44744 $1.25839
EC2 M6A instance price ($ per hour) $4.14720 $2.76480 $2.07360 $1.38240 $0.69120 $0.34560 $0.17280
EMR M6A price ($ per hour per instance) $1.03680 $0.69120 $0.51840 $0.34560 $0.17280 $0.08640 $0.04320
(EC2 + EMR) M6A instance price ($ per hour) $5.18400 $3.45600 $2.59200 $1.72800 $0.86400 $0.43200 $0.21600
Cost of running on M6A ($ per instance) $4.74522 $2.94123 $2.44739 $1.67176 $1.17749 $1.10213 $0.96648
Total cost reduction with M6A including performance improvement -41.36% -35.91% -28.36% -31.86% -36.43% -23.86% -23.20%

The following graph shows per query improvements observed on M6a 2XL instances compared to equivalent M5a generation. We observed that two queries take longer to execute on M6a instance clusters compared to M5a instance clusters. Q91 regressed up to 6.64 percent and Q55 regressed up to 1.86 percent on 4 XL instance clusters.

Amazon EMR runtime performance with EC2 R6A instances

R6A instances showed a similar performance improvement while running Apache Spark workloads compared to equivalent R5A instances. R6A 32XL and 48XL instances were not benchmarked since R5A instances do not have 32XL and 48XL sizes available. Our results showed 16–58.22 percent improvement in total query runtime for seven different instance sizes within the instance family and 20.04–59.59 percent improvement in geometric mean. In comparing costs, we observed 15.85–-50.07 percent reduction in cost on R6A instance EMR clusters compared to R5A EMR instance clusters.

The following table shows the results from running TPC-DS 3 TB benchmark queries using Amazon EMR 6.8 over equivalent R6A and R5A instance EMR clusters.

Instance Size 24 XL 16 XL 12 XL 8 XL 4 XL 2 XL XL
Total size of the cluster (1 Leader + 5 core nodes) 6 6 6 6 6 6 6
Total query runtime on R5A (seconds) 6934.22936 5530.74672 5834.32344 5718.72582 7615.58392 11431.37368 20688.58642
Total query runtime on R6A (seconds) 2897.44817 2906.49952 3017.85315 3488.83875 4661.32856 7717.33575 17378.49043
Total query runtime improvement with R6A 58.22% 47.45% 48.27% 38.99% 38.79% 32.49% 16.00%
Geometric mean query runtime R5A (sec) 53.27574 41.76973 42.50324 37.62155 44.58173 58.88182 91.72095
Geometric mean query runtime R6A (sec) 21.52803 21.36831 19.94607 21.59493 26.90097 36.57557 73.3405
Geometric mean query runtime improvement with R6A 59.59% 48.84% 53.07% 42.60% 39.66% 37.88% 20.04%
EC2 R5A instance price ($ per hour) $5.42400 $3.61600 $2.71200 $1.80800 $0.90400 $0.45200 $0.22600
EMR R5A instance price ($ per hour) $0.27000 $0.27000 $0.27000 $0.27000 $0.22600 $0.11300 $0.05700
(EC2 + EMR) R5A instance price ($ per hour) $5.69400 $3.88600 $2.98200 $2.07800 $1.13000 $0.56500 $0.28300
Cost of running on R5A ($ per instance) $10.96764 $5.97013 $4.83276 $3.30098 $2.39045 $1.79409 $1.62635
EC2 R6A instance price ($ per hour) $5.44320 $3.62880 $2.72160 $1.81440 $0.90720 $0.45360 $0.22680
EMR R6A price ($ per hour per instance) $1.36080 $0.90720 $0.68040 $0.45360 $0.22680 $0.11340 $0.05670
(EC2 + EMR) R6A instance price ($ per hour) $6.80400 $4.53600 $3.40200 $2.26800 $1.13400 $0.56700 $0.28350
Cost of running on R6A ($ per instance) $5.47618 $3.66219 $2.85187 $2.19797 $1.46832 $1.21548 $1.36856
Total cost reduction with R6A including performance improvement -50.07% -38.66% -40.99% -33.41% -38.58% -32.25% -15.85%

Benchmarking methodology

The benchmark used in this post is derived from the industry-standard TPC-DS benchmark and uses queries from the Spark SQL Performance Tests GitHub repo with the following fixes applied.

We calculated TCO by multiplying cost per hour by number of instances in the cluster and time taken to run the queries on the cluster. We used the on-demand pricing in the US East (N. Virginia) Region for all instances.

Conclusion

In this post, we described how we estimated the cost-performance benefit from using Amazon EMR with M6A and R6A instances compared to using equivalent previous-generation instances. Using these new instances with Amazon EMR improves price performance by 15–50%.


About the authors

AI MSAl MS is a product manager for Amazon EMR at Amazon Web Services.

Kyeonghyun Ryoo is a Software Development Engineer for EMR at Amazon Web Services. He primarily works on designing and building automation tools for internal teams and customers to maximize their productivity. Outside of work, he is a retired world champion in professional gaming who still enjoy playing video games.

Amazon EMR launches support for Amazon EC2 C6i, M6i, I4i, R6i and R6id instances to improve cost performance for Spark workloads by 6–33%

Post Syndicated from Al MS original https://aws.amazon.com/blogs/big-data/amazon-emr-launches-support-for-amazon-ec2-c6i-m6i-i4i-r6i-and-r6id-instances-to-improve-cost-performance-for-spark-workloads-by-6-33/

Amazon EMR provides a managed service to easily run analytics applications using open-source frameworks such as Apache Spark, Hive, Presto, Trino, HBase, and Flink. The Amazon EMR runtime for Spark and Presto includes optimizations that provide over two times performance improvements over open-source Apache Spark and Presto, so that your applications run faster and at lower cost.

With Amazon EMR release 6.8, you can now use Amazon Elastic Compute Cloud (Amazon EC2) instances such as C6i, M6i, I4i, R6i, and R6id, which use the third-generation Intel Xeon scalable processors. Using these new instances with Amazon EMR improves cost-performance by an additional 5–33% over previous generation instances.

In this post, we describe how we estimated the cost-performance benefit from using Amazon EMR with these new instances compared to using equivalent previous generation instances.

Amazon EMR runtime performance improvements with EC2 I4i instances

We ran TPC-DS 3 TB benchmark queries on Amazon EMR 6.8 using the Amazon EMR runtime for Apache Spark (compatible with Apache Spark 3.3) with five node clusters of I4i instances with data in Amazon Simple Storage Service (Amazon S3), and compared it to equivalent sized I3 instances. We measured performance improvements using the total query runtime and geometric mean of query runtime across the TPC-DS 3 TB benchmark queries.

Our results showed between 36.41–44.39% improvement in total query runtime performance on I4i instance EMR clusters compared to equivalent I3 instance EMR clusters, and between 36–45.2% improvement in geometric mean. To measure cost improvement, we added up the Amazon EMR and Amazon EC2 cost per instance per hour (on-demand) and multiplied it by the total query runtime. Note that I4i 32XL instances were not benchmarked because I3 instances don’t have the 32 XL size available. We observed between 22.56–33.1% reduced instance hour cost on I4i instance EMR clusters compared to equivalent I3 instance EMR clusters to run the TPC-DS benchmark queries. All TPC-DS queries ran faster on I4i instance clusters compared to I3 instance clusters.

The following table shows the results from running TPC-DS 3 TB benchmark queries using Amazon EMR 6.8 over equivalent I3 and I4i instance EMR clusters.

Instance Size 16 XL 8 XL 4 XL 2 XL XL
Number of core instances in EMR cluster 5 5 5 5 5
Total query runtime on I3 (seconds) 4752.15457 4506.43694 7110.03042 11853.40336 21333.05743
Total query runtime on I4I (seconds) 2642.77407 2812.05517 4415.0023 7537.52779 12981.20251
Total query runtime improvement with I4I 44.39% 37.60% 37.90% 36.41% 39.15%
Geometric mean query runtime on I3 (sec) 34.99551 29.14821 41.53093 60.8069 95.46128
Geometric mean query runtime on I4I (sec) 19.17906 18.65311 25.66263 38.13503 56.95073
Geometric mean query runtime improvement with I4I 45.20% 36.01% 38.21% 37.29% 40.34%
EC2 I3 instance price ($ per hour) $4.990 $2.496 $1.248 $0.624 $0.312
EMR I3 instance price ($ per hour) $0.270 $0.270 $0.270 $0.156 $0.078
(EC2 + EMR) I3 instance price ($ per hour) $5.260 $2.766 $1.518 $0.780 $0.390
Cost of running on I3 ($ per instance) $6.943 $3.462 $2.998 $2.568 $2.311
EC2 I4I instance price ($ per hour) $5.491 $2.746 $1.373 $0.686 $0.343
EMR I4I price ($ per hour per instance) $1.373 $0.687 $0.343 $0.172 $0.086
(EC2 + EMR) I4I instance price ($ per hour) $6.864 $3.433 $1.716 $0.858 $0.429
Cost of running on I4I ($ per instance) $5.039 $2.681 $2.105 $1.795 $1.546
Total cost reduction with I4I including performance improvement -27.43% -22.56% -29.79% -30.09% -33.10%

The following graph shows per query improvements we observed on I4i 2XL instances with EMR Runtime for Spark on Amazon EMR version 6.8 compared to equivalent I3 2XL instances for the TPC-DS 3 TB benchmark.

Amazon EMR runtime performance improvements with EC2 M6i instances

M6i instances showed a similar performance improvement while running Apache Spark workloads compared to equivalent M5 instances. Our test results showed between 13.45–29.52% improvement in total query runtime for seven different instance sizes within the instance family, and between 7.98–25.37% improvement in geometric mean. On cost comparison, we observed 7.98–25.37% reduced instance hour cost on M6i instance EMR clusters compared to M5 EMR instance clusters to run the TPC-DS benchmark queries.

The following table shows the results from running TPC-DS 3 TB benchmark queries using Amazon EMR 6.8 over equivalent M6i and M5 instance EMR clusters.

Instance Size 24 XL 16 XL 12 XL 8 XL 4 XL 2 XL XL
Number of core instances in EMR cluster 5 5 5 5 5 5 5
Total query runtime on M5 (seconds) 4027.58043 3782.10766 3348.05362 3516.4308 5621.22532 10075.45109 17278.15146
Total query runtime on M6I (seconds) 3106.43834 2665.70607 2714.69862 3043.5975 4195.02715 8226.88301 14515.50394
Total query runtime improvement with M6I 22.87% 29.52% 18.92% 13.45% 25.37% 18.35% 15.99%
Geometric mean query runtime M5 (sec) 30.45437 28.5207 23.95314 23.55958 32.95975 49.43178 75.95984
Geometric mean query runtime M6I (sec) 23.76853 19.21783 19.16869 19.9574 24.23012 39.09965 60.79494
Geometric mean query runtime improvement with M6I 21.95% 32.62% 19.97% 15.29% 26.49% 20.90% 19.96%
EC2 M5 instance price ($ per hour) $4.61 $3.07 $2.30 $1.54 $0.77 $0.38 $0.19
EMR M5 instance price ($ per hour) $0.27 $0.27 $0.27 $0.27 $0.19 $0.10 $0.05
(EC2 + EMR) M5 instance price ($ per hour) $4.88 $3.34 $2.57 $1.81 $0.96 $0.48 $0.24
Cost of running on M5 ($ per instance) $5.46 $3.51 $2.39 $1.76 $1.50 $1.34 $1.15
EC2 M6I instance price ($ per hour) $4.61 $3.07 $2.30 $1.54 $0.77 $0.38 $0.19
EMR M6I price ($ per hour per instance) $1.15 $0.77 $0.58 $0.38 $0.19 $0.10 $0.05
(EC2 + EMR) M6I instance price ($ per hour) $5.76 $3.84 $2.88 $1.92 $0.96 $0.48 $0.24
Cost of running on M6I ($ per instance) $4.97 $2.84 $2.17 $1.62 $1.12 $1.10 $0.97
Total cost reduction with M6I including performance improvement -8.92% -19.02% -9.28% -7.98% -25.37% -18.35% -15.99%

Amazon EMR runtime performance improvements with EC2 R6i instances

R6i instances showed a similar performance improvement while running Apache Spark workloads compared to equivalent R5 instances. Our test results showed between 14.25–32.23% improvement in total query runtime for six different instance sizes within the instance family, and between 16.12–36.5% improvement in geometric mean. R5.xlarge instances didn’t have sufficient memory to run TPC-DS benchmark queries, and weren’t included in this comparison. On cost comparison, we observed 5.48–23.5% reduced instance hour cost on R6i instance EMR clusters compared to R5 EMR instance clusters to run the TPC-DS benchmark queries.

The following table shows the results from running TPC-DS 3 TB benchmark queries using Amazon EMR 6.8 over equivalent R6i and R5 instance EMR clusters.

Instance Size 24 XL 16 XL 12 XL 8 XL 4 XL 2XL
Number of core instances in EMR cluster 5 5 5 5 5 5
Total query runtime on R5 (seconds) 4024.4737 3715.74432 3552.97298 3535.69879 5379.73168 9121.41532
Total query runtime on R6I (seconds) 2865.83169 2518.24192 2513.4849 3031.71973 4544.44854 6977.9508
Total query runtime improvement with R6I 28.79% 32.23% 29.26% 14.25% 15.53% 23.50%
Geometric mean query runtime R5 (sec) 30.59066 28.30849 25.30903 23.85511 32.33391 47.28424
Geometric mean query runtime R6I (sec) 21.87897 17.97587 17.54117 20.00918 26.6277 34.52817
Geometric mean query runtime improvement with R6I 28.48% 36.50% 30.69% 16.12% 17.65% 26.98%
EC2 R5 instance price ($ per hour) $6.0480 $4.0320 $3.0240 $2.0160 $1.0080 $0.5040
EMR R5 instance price ($ per hour) $0.2700 $0.2700 $0.2700 $0.2700 $0.2520 $0.1260
(EC2 + EMR) R5 instance price ($ per hour) $6.3180 $4.3020 $3.2940 $2.2860 $1.2600 $0.6300
Cost of running on R5 ($ per instance) $7.0630 $4.4403 $3.2510 $2.2452 $1.8829 $1.5962
EC2 R6I instance price ($ per hour) $6.0480 $4.0320 $3.0240 $2.0160 $1.0080 $0.5040
EMR R6I price ($ per hour per instance) $1.5120 $1.0080 $0.7560 $0.5040 $0.2520 $0.1260
(EC2 + EMR) R6I instance price ($ per hour) $7.5600 $5.0400 $3.7800 $2.5200 $1.2600 $0.6300
Cost of running on R6I ($ per instance) $6.0182 $3.5255 $2.6392 $2.1222 $1.5906 $1.2211
Total cost reduction with R6I including performance improvement -14.79% -20.60% -18.82% -5.48% -15.53% -23.50%

Amazon EMR runtime performance improvements with EC2 C6i instances

C6i instances showed a similar performance improvement while running Apache Spark workloads compared to equivalent C5 instances. Our test results showed between 16.9–58.22% improvement in total query runtime for four different instance sizes within the instance family, and between 20.25–59.59% improvement in geometric mean. Only C6i 24, 12, 4, and 2xlarge sizes were benchmarked because C5 doesn’t have 32, 16 and 8 xlarge sizes. C5.xlarge instances didn’t have sufficient memory to run TPC-DS benchmark queries, and weren’t included in this comparison. On cost comparison, we observed 16.75–50.07% reduced instance hour cost on C6i instance EMR clusters compared to C5 EMR instance clusters to run the TPC-DS benchmark queries.

The following table shows the results from running TPC-DS 3 TB benchmark queries using Amazon EMR 6.8 over equivalent C6i and C5 instance EMR clusters.

Instance Size * 24 XL 12 XL 4 XL 2 XL
Number of core instances in EMR cluster 5 5 5 5
Total query runtime on C5 (seconds) 3435.59808 2900.84981 5945.12879 10173.00757
Total query runtime on C6I (seconds) 2711.16147 2471.86778 5195.30093 8787.43422
Total query runtime improvement with C6I 21.09% 14.79% 12.61% 13.62%
Geometric mean query runtime C5 (sec) 25.67058 20.06539 31.76582 46.78632
Geometric mean query runtime C6I (sec) 20.4458 17.14133 26.92196 39.32622
Geometric mean query runtime improvement with C6I 20.35% 14.57% 15.25% 15.95%
EC2 C5 instance price ($ per hour) $4.080 $2.040 $0.680 $0.340
EMR C5 instance price ($ per hour) $0.270 $0.270 $0.170 $0.085
(EC2 + EMR) C5 instance price ($ per hour) $4.35000 $2.31000 $0.85000 $0.42500
Cost of running on C5 ($ per instance) $4.15135 $1.86138 $1.40371 $1.20098
EC2 C6I instance price ($ per hour) $4.0800 $2.0400 $0.6800 $0.3400
EMR C6I price ($ per hour per instance) $1.02000 $0.51000 $0.17000 $0.08500
(EC2 + EMR) C6I instance price ($ per hour) $5.10000 $2.55000 $0.85000 $0.42500
Cost of running on C6I ($ per instance) $3.84081 $1.75091 $1.22667 $1.03741
Total cost reduction with C6I including performance improvement -7.48% -5.93% -12.61% -13.62%

Amazon EMR runtime performance improvements with EC2 R6id instances

R6id instances showed a similar performance improvement while running Apache Spark workloads compared to equivalent R5D instances. Our test results showed between 11.8–28.7% improvement in total query runtime for five different instance sizes within the instance family, and between 15.1–32.0% improvement in geometric mean. R6ID 32 XL instances were not benchmarked because R5D instances don’t have these sizes available. On cost comparison, we observed 6.8–11.5% reduced instance hour cost on R6ID instance EMR clusters compared to R5D EMR instance clusters to run the TPC-DS benchmark queries.

The following table shows the results from running TPC-DS 3 TB benchmark queries using Amazon EMR 6.8 over equivalent R6id and R5d instance EMR clusters.

Instance Size 24 XL 16 XL 12 XL 8 XL 4 XL 2 XL XL
Number of core instances in EMR cluster 5 5 5 5 5 5 5
Total query runtime on R5D (seconds) 4054.4492975042 3691.7569385583 3598.6869168064 3532.7398928104 5397.5330161574 9281.2627059927 16862.8766838096
Total query runtime on R6ID (seconds) 2992.1198446983 2633.7131630720 2632.3186613402 2729.8860537867 4583.1040980373 7921.9960917943 14867.5391541445
Total query runtime improvement with R6ID 26.20% 28.66% 26.85% 22.73% 15.09% 14.65% 11.83%
Geometric mean query runtime R5D (sec) 31.0238156851 28.1432927726 25.7532157307 24.0596427675 32.5800246829 48.2306670294 76.6771994376
Geometric mean query runtime R6ID (sec) 22.8681174894 19.1282742957 18.6161830746 18.0498249257 25.9500918360 39.6580341258 65.0947323858
Geometric mean query runtime improvement with R6ID 26.29% 32.03% 27.71% 24.98% 20.35% 17.77% 15.11%
EC2 R5D instance price ($ per hour) $6.912000 $4.608000 $3.456000 $2.304000 $1.152000 $0.576000 $0.288000
EMR R5D instance price ($ per hour) $0.270000 $0.270000 $0.270000 $0.270000 $0.270000 $0.144000 $0.072000
(EC2 + EMR) R5D instance price ($ per hour) $7.182000 $4.878000 $3.726000 $2.574000 $1.422000 $0.720000 $0.360000
Cost of running on R5D ($ per instance) $8.088626 $5.002331 $3.724641 $2.525909 $2.132026 $1.856253 $1.686288
EC2 R6ID instance price ($ per hour) $7.257600 $4.838400 $3.628800 $2.419200 $1.209600 $0.604800 $0.302400
EMR R6ID price ($ per hour per instance) $1.814400 $1.209600 $0.907200 $0.604800 $0.302400 $0.151200 $0.075600
(EC2 + EMR) R6ID instance price ($ per hour) $9.072000 $6.048000 $4.536000 $3.024000 $1.512000 $0.756000 $0.378000
Cost of running on R6ID ($ per instance) $7.540142 $4.424638 $3.316722 $2.293104 $1.924904 $1.663619 $1.561092
Total cost reduction with R6ID including performance improvement -6.78% -11.55% -10.95% -9.22% -9.71% -10.38% -7.42%

Benchmarking methodology

The benchmark used in this post is derived from the industry-standard TPC-DS benchmark, and uses queries from the Spark SQL Performance Tests GitHub repo with the following fixes applied.

We calculated TCO by multiplying cost per hour by number of instances in the cluster and time taken to run the queries on the cluster. We used the on-demand pricing in the US East (N. Virginia) Region for all instances.

Conclusion

In this post, we described how we estimated the cost-performance benefit from using Amazon EMR with C6i, M6i, I4i, R6i, and R6id, instances compared to using equivalent previous generation instances. Using these new instances with Amazon EMR improves cost-performance by an additional 5–33%.


About the authors

AI MSAl MS is a product manager for Amazon EMR at Amazon Web Services.

Kyeonghyun Ryoo is a Software Development Engineer for EMR at Amazon Web Services. He primarily works on designing and building automation tools for internal teams and customers to maximize their productivity. Outside of work, he is a retired world champion in professional gaming who still enjoy playing video games.

Run Apache Spark 3.0 workloads 1.7 times faster with Amazon EMR runtime for Apache Spark

Post Syndicated from Al MS original https://aws.amazon.com/blogs/big-data/run-apache-spark-3-0-workloads-1-7-times-faster-with-amazon-emr-runtime-for-apache-spark/

With Amazon EMR release 6.1.0, Amazon EMR runtime for Apache Spark is now available for Spark 3.0.0. EMR runtime for Apache Spark is a performance-optimized runtime for Apache Spark that is 100% API compatible with open-source Apache Spark.

In our benchmark performance tests using TPC-DS benchmark queries at 3 TB scale, we found EMR runtime for Apache Spark 3.0 provides a 1.7 times performance improvement on average, and up to 8 times improved performance for individual queries over open-source Apache Spark 3.0.0. With Amazon EMR 6.1.0, you can now run your Apache Spark 3.0 applications faster and cheaper without requiring any changes to your applications.

Results observed using TPC-DS benchmarks

To evaluate the performance improvements, we used TPC-DS benchmark queries with 3 TB scale and ran them on a 6-node c4.8xlarge EMR cluster with data in Amazon Simple Storage Service (Amazon S3). We ran the tests with and without the EMR runtime for Apache Spark. The following two graphs compare the total aggregate runtime and geometric mean for all queries in the TPC-DS 3 TB query dataset between the Amazon EMR releases.

The following table shows the total runtime in seconds.

The following table shows the total runtime in seconds.

The following table shows the geometric mean of the runtime in seconds.

The following table shows the geometric mean of the runtime in seconds.

In our tests, all queries ran successfully on EMR clusters that used the EMR runtime for Apache Spark. However, when using Spark 3.0 without the EMR runtime, 34 out of the 104 benchmark queries failed due to SPARK-32663. To work around these issues, we disabled spark.shuffle.readHostLocalDisk configuration. However, even after this change, queries 14a and 14b continued to fail. Therefore, we chose to exclude these queries from our benchmark comparison.

The per-query speedup on Amazon EMR 6.1 with and without EMR runtime is illustrated in the following chart. The horizontal axis shows each query in the TPC-DS 3 TB benchmark. The vertical axis shows the speedup of each query due to the EMR runtime. We found a 1.7 times performance improvement as measured by the geometric mean of the per-query speedups, with all queries showing a performance improvement with the EMR Runtime.

The per-query speedup on Amazon EMR 6.1 with and without EMR runtime is also illustrated in the following chart.

Conclusion

You can run your Apache Spark 3.0 workloads faster and cheaper without making any changes to your applications by using Amazon EMR 6.1. To keep up to date, subscribe to the Big Data blog’s RSS feed to learn about more great Apache Spark optimizations, configuration best practices, and tuning advice.


About the Authors

AI MSAl MS is a product manager for Amazon EMR at Amazon Web Services.

 

 

 

 

Peter Gvozdjak is a senior engineering manager for EMR at Amazon Web Services.