Amazon EMR Serverless cost estimator

Post Syndicated from Radhika Ravirala original https://aws.amazon.com/blogs/big-data/amazon-emr-serverless-cost-estimator/

Amazon EMR Serverless is a serverless option in Amazon EMR that makes it easy for data analysts and engineers to run applications using open-source big data analytics frameworks such as Apache Spark and Hive without configuring, managing, and scaling clusters or servers. You get all the features of the latest open-source frameworks with the performance-optimized runtime of Amazon EMR, and without having to plan and operate instances and clusters.

With Amazon EMR, you can run your analytics applications on dedicated EMR clusters, on existing Amazon Elastic Kubernetes Service (Amazon EKS) clusters, or using the new EMR Serverless deployment option where you don’t have to manage clusters or instances. When you build a Spark or Hive application using an Amazon EMR release, say Amazon EMR 6.8, you can run the application on EMR clusters, on EKS clusters using Amazon EMR on EKS, or using EMR Serverless without having to change the application.

To learn about the benefits of each Amazon EMR deployment option, refer to What are some of the feature differences between EMR Serverless and Amazon EMR on EC2? in the Amazon EMR FAQ. You can also learn about the pricing for these options from the Amazon EMR pricing page. Many customers already run data analytics applications on EMR clusters, and find that the new serverless option is simpler and less expensive.

In this post, we discuss how you can estimate what it may cost to run an application that currently runs on EMR clusters using the new serverless option, using only your current application metrics. This helps you evaluate and adopt the deployment option that is most cost effective for the application. The Amazon EMR pricing page lists prices for each option, but it doesn’t tell you how to estimate the cost of running your existing EMR cluster applications on EMR Serverless. In the following sections, we describe an approach that enables you to do that.

Although the example in this post discusses how you can get a cost estimate for applications running on EMR clusters, you can also use the approach if you’re running a Spark or Hive application elsewhere, and want to estimate the cost of running it on EMR Serverless. For example, if you run self-managed Spark or Hive applications on Amazon Elastic Compute Cloud (Amazon EC2) clusters, or if you run Spark jobs on AWS Glue, we show you how you can use this approach to estimate the cost of running the application on EMR Serverless.

Estimating the cost of running applications on your EMR cluster

When you run applications on Amazon EMR clusters, you’re separately charged for the following:

(a) The Amazon EC2 price of running cluster instances (the price for the underlying servers)
(b) The price for Amazon Elastic Block Store (Amazon EBS) volumes, if you choose to attach EBS volumes
(c) The Amazon EMR price for the cluster instances

The total cost of running the cluster is the sum of (a) the Amazon EC2 price, (b) the Amazon EBS price, and (c) the Amazon EMR price. There are a variety of Amazon EC2 pricing options you can choose from, including On-Demand, 1-year and 3-year Reserved Instances, Savings Plans, and Spot Instances. The pricing option that you choose determines (a). You can compute this cost for the lifetime of the cluster (from the time the cluster is started to the time it is terminated), or for a specific period while the cluster is running. We recommend the former. If you have set up tags for your Amazon EMR cluster, you can get a detailed cost report for the cluster using AWS Cost Explorer.

Estimating the cost of running the same applications using EMR Serverless

When you run the same applications using EMR Serverless, you pay for the amount of vCPU, memory, and storage resources consumed by your applications. There is no separate charge for EC2 instances or EBS volumes. You only pay for the resources that are actually used by the application, not for the EC2 instances provisioned. For example, when running applications on EMR clusters, when an EC2 instance in the cluster is partially utilized (say, 16 GB memory is used out of 64 GB available on the instance, or 4 vCPUs are utilized out of 16 vCPUs available on the instance), or when the EC2 instance is idle (for example, when the instance is initializing or waiting for an application to start), you still incur Amazon EC2, Amazon EMR, and Amazon EBS charges for the full EC2 instance and for the duration that the instance is active in the EMR cluster. With EMR Serverless, you only pay for the vCPU, memory, and storage resources used from the time workers start to run your Spark or Hive job until the time they stop.

To estimate the cost of running your EMR Spark or Hive application on EMR Serverless, you need to first aggregate the total compute vCore-seconds, memory MB-seconds, and storage GB-seconds consumed by each YARN application that ran on your EMR cluster, from the time the YARN container is started to the time the YARN container is terminated. You can obtain these metrics from YARN resource manager logs accessible from YARN timeline server or YARN CLI tools. You can retrieve the running time, vCore-seconds, and memory MB-seconds used by each of the YARN applications.
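The aggregation step described above can be sketched as follows. This is a minimal illustration, not a definitive tool: it assumes you have already exported one record per YARN application, and that each record carries `vcoreSeconds` and `memorySeconds` fields (the field names used here follow the application objects returned by the YARN ResourceManager REST API, which is an assumption about how you collected the data). The application IDs and values are hypothetical.

```python
from typing import Any, Iterable, Mapping


def aggregate_yarn_usage(apps: Iterable[Mapping[str, Any]]) -> dict:
    """Sum vCore-seconds and memory MB-seconds across YARN application records."""
    total_vcore_seconds = 0
    total_memory_mb_seconds = 0
    for app in apps:
        # Field names are an assumption; adjust them to match your export.
        total_vcore_seconds += app.get("vcoreSeconds", 0)
        total_memory_mb_seconds += app.get("memorySeconds", 0)
    return {
        "vcore_seconds": total_vcore_seconds,
        "memory_mb_seconds": total_memory_mb_seconds,
    }


# Hypothetical records for two applications that ran on the cluster.
sample_apps = [
    {"id": "application_0001", "vcoreSeconds": 1200, "memorySeconds": 4_915_200},
    {"id": "application_0002", "vcoreSeconds": 800, "memorySeconds": 3_276_800},
]

usage = aggregate_yarn_usage(sample_apps)
print(usage)
```

The totals returned here are the inputs to the cost calculation; storage GB-seconds would need to be collected separately.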

If your cluster only runs Spark applications, there is a simpler way to estimate. Instead of obtaining the vCore-seconds, memory MB-seconds, and storage GB-seconds from YARN resource manager logs, you can obtain these metrics from Spark event logs. We have provided the tool EMR Serverless Estimator, which can parse the Spark event logs for your applications and provide the aggregated metrics for your cost estimate.

After you get the usage metrics for your application, you can compute the estimated EMR Serverless cost using EMR Serverless pricing. Convert each aggregate to the unit used by the corresponding price (for example, vCore-seconds to vCPU-hours and memory MB-seconds to GB-hours if the price is quoted per hour), then multiply the aggregated vCPU usage by the EMR Serverless vCPU price, the aggregated memory usage by the EMR Serverless memory price, and the aggregated storage usage by the EMR Serverless storage price (only if the storage requirements exceed 20 GB per worker). Adding up these costs for vCPU, memory, and storage gives the estimated EMR Serverless cost, which you can compare with the cost of running the same applications on the EMR cluster.
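The calculation described above can be sketched as a small function. This is an illustrative sketch under stated assumptions: the function name and parameters are hypothetical, prices are assumed to be quoted per hour, and the usage figures in the example call are made up. The two rates passed in are the per-hour vCPU and memory rates quoted in the example later in this post; check the EMR Serverless pricing page for the rates in your Region.

```python
SECONDS_PER_HOUR = 3600
MB_PER_GB = 1024


def estimate_serverless_cost(
    vcore_seconds: float,
    memory_mb_seconds: float,
    vcpu_price_per_hour: float,
    memory_price_per_gb_hour: float,
    billable_storage_gb_seconds: float = 0.0,
    storage_price_per_gb_hour: float = 0.0,
) -> float:
    """Estimate cost from aggregated usage totals and hourly prices."""
    vcpu_hours = vcore_seconds / SECONDS_PER_HOUR
    memory_gb_hours = memory_mb_seconds / (SECONDS_PER_HOUR * MB_PER_GB)
    # Only storage beyond the 20 GB per worker mentioned above should be
    # passed in as billable storage.
    storage_gb_hours = billable_storage_gb_seconds / SECONDS_PER_HOUR
    return (
        vcpu_hours * vcpu_price_per_hour
        + memory_gb_hours * memory_price_per_gb_hour
        + storage_gb_hours * storage_price_per_gb_hour
    )


# Hypothetical usage: 2 vCPU-hours and 4 GB-hours of memory, no billable storage.
cost = estimate_serverless_cost(
    vcore_seconds=7200,
    memory_mb_seconds=4 * SECONDS_PER_HOUR * MB_PER_GB,
    vcpu_price_per_hour=0.052624,
    memory_price_per_gb_hour=0.0057785,
)
print(f"Estimated cost: ${cost:.4f}")
```

Because the inputs are totals over the whole run, the application runtime does not appear as a separate factor.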

In this approach, we assume that the performance of the application is equivalent. In other words, the number, size (vCPU, memory), and runtime duration of the YARN containers on the EMR cluster are the same as the number, size, and runtime duration of workers needed to run the application on EMR Serverless. We make this assumption because the EMR runtime for an EMR release is the same regardless of whether the application is run on an EMR cluster or on EMR Serverless.

Example

Let’s do a sample cost comparison of Amazon EMR on EC2 and EMR Serverless using a single cluster.

We ran a Spark application on an EMR cluster with five nodes (one primary, two core, and two task) and gathered YARN metrics using the YARN CLI. The following shows our aggregate resource allocation.

[Figure: aggregate resource allocation]
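If you save the text printed by `yarn application -status <application ID>`, the two totals can be pulled out with a short parser. This is a sketch only: the line format matched below ("Aggregate Resource Allocation : N MB-seconds, M vcore-seconds") is an assumption about the CLI output, so verify it against what your Hadoop version prints. The sample line reuses the totals reported for this example.

```python
import re

# Assumed format of the relevant line in the YARN CLI status report.
ALLOCATION_PATTERN = re.compile(
    r"Aggregate Resource Allocation\s*:\s*(\d+)\s*MB-seconds,\s*(\d+)\s*vcore-seconds"
)


def parse_aggregate_allocation(status_text: str) -> tuple:
    """Return (memory_mb_seconds, vcore_seconds) found in a status report."""
    match = ALLOCATION_PATTERN.search(status_text)
    if match is None:
        raise ValueError("No aggregate resource allocation line found")
    return int(match.group(1)), int(match.group(2))


sample_line = "\tAggregate Resource Allocation : 120156631 MB-seconds, 5737 vcore-seconds"
memory_mb_seconds, vcore_seconds = parse_aggregate_allocation(sample_line)
print(memory_mb_seconds, vcore_seconds)
```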

We computed the Amazon EMR on EC2 costs as follows:

Cluster instances
- Primary: m5.2xlarge:1
- Core: r5.2xlarge:2
- Task: r5.2xlarge:2
Cluster runtime = 8 min
Instance on-demand cost
- m5.2xlarge (8 vCPU, 32 GiB memory)
  - Amazon EC2: $0.384/hr
  - Amazon EMR incremental: $0.096/hr
- r5.2xlarge (8 vCPU, 64 GiB memory)
  - Amazon EC2: $0.504/hr
  - Amazon EMR incremental: $0.126/hr

The following is the EMR on EC2 cost calculation:

Amazon EMR cost = ((1 primary node x $0.096/hr) + (2 core nodes x $0.126/hr) + (2 task nodes x $0.126/hr)) = $0.60/hr
Amazon EC2 cost = ((1 primary node x $0.384/hr) + (2 core nodes x $0.504/hr) + (2 task nodes x $0.504/hr)) = $2.40/hr
Amazon EMR on EC2 cluster cost = ($0.60/hr + $2.40/hr) x 8/60 hr (runtime in hours) = $3.00/hr x 8/60 hr = $0.40

The total Amazon EMR on Amazon EC2 cost is $0.40.
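The same cluster arithmetic can be checked with a few lines of Python. This sketch uses the instance counts and On-Demand rates listed above and the 8/60 hr runtime from the calculation; the variable names are illustrative, and Amazon EBS charges are left out because the example does not list any.

```python
# (role, instance count, Amazon EC2 $/hr, Amazon EMR incremental $/hr)
instances = [
    ("primary", 1, 0.384, 0.096),  # m5.2xlarge
    ("core", 2, 0.504, 0.126),     # r5.2xlarge
    ("task", 2, 0.504, 0.126),     # r5.2xlarge
]

runtime_hours = 8 / 60  # cluster runtime used in the calculation above

ec2_cost_per_hour = sum(count * ec2_rate for _, count, ec2_rate, _ in instances)
emr_cost_per_hour = sum(count * emr_rate for _, count, _, emr_rate in instances)
cluster_cost = (ec2_cost_per_hour + emr_cost_per_hour) * runtime_hours

print(f"Amazon EC2: ${ec2_cost_per_hour:.2f}/hr")
print(f"Amazon EMR: ${emr_cost_per_hour:.2f}/hr")
print(f"Total for the run: ${cluster_cost:.2f}")
```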

To calculate EMR Serverless cost, aggregate the vCore-seconds and memory MB-seconds for the same application you ran previously on the EMR cluster. Then convert those totals to vCPU-hours and memory GB-hours and multiply them by the EMR Serverless vCPU and memory prices. Our calculation results are as follows:

Total_vcore_seconds = 5737
Total_Memory_mb_seconds = 120156631
Convert to vCPU-hours and memory GB-hours:
- Aggregated vCPU-hours: 5737/(60*60) = 1.59
- Aggregated memory GB-hours: 120156631/(60*60*1024) = 32.59
Total vCPU cost = 1.59 vCPU-hours * $0.052624 per vCPU-hour = $0.084
Total memory cost = 32.59 GB-hours * $0.0057785 per GB-hour = $0.188

The aggregated totals already span the full run, so they are multiplied by the hourly rates directly. In this example, the total EMR Serverless cost is $0.272, a 32% reduction from the $0.40 Amazon EMR on EC2 cost.

Conclusion

Amazon EMR Serverless is a recently launched serverless option in Amazon EMR that makes it easy to run open-source frameworks such as Spark and Hive without configuring, managing, and scaling clusters. Customers that already use EMR clusters want to understand how they can estimate the cost of running their EMR applications using EMR Serverless. We have presented an approach that you can use to conduct a cost analysis based on analyzing application metrics from your EMR clusters.

We hope you give this a try, and share your feedback with us!

About the authors

Radhika Ravirala is a Principal Product Manager at AWS.

Matthew Liem is a Senior Solution Architecture Manager at AWS.
