Post Syndicated from Lotfi Mouhib original https://aws.amazon.com/blogs/big-data/introducing-amazon-emr-on-eks-job-submission-with-spark-operator-and-spark-submit/
Amazon EMR on EKS provides a deployment option for Amazon EMR that allows organizations to run open-source big data frameworks on Amazon Elastic Kubernetes Service (Amazon EKS). With EMR on EKS, Spark applications run on the Amazon EMR runtime for Apache Spark. This performance-optimized runtime offered by Amazon EMR makes your Spark jobs run fast and cost-effectively. The EMR runtime provides up to 5.37 times better performance and 76.8% cost savings, when compared to using open-source Apache Spark on Amazon EKS.
Building on the success of Amazon EMR on EKS, customers have been running and managing jobs using the emr-containers API, creating EMR virtual clusters, and submitting jobs to the EKS cluster, either through the AWS Command Line Interface (AWS CLI) or Apache Airflow scheduler. However, other customers running Spark applications have chosen Spark Operator or native spark-submit to define and run Apache Spark jobs on Amazon EKS, but without taking advantage of the performance gains from running Spark on the optimized EMR runtime. In response to this need, starting from EMR 6.10, we have introduced a new feature that lets you use the optimized EMR runtime while submitting and managing Spark jobs through either Spark Operator or
spark-submit. This means that anyone running Spark workloads on EKS can take advantage of EMR’s optimized runtime.
In this post, we walk through the process of setting up and running Spark jobs using both Spark Operator and
spark-submit, integrated with the EMR runtime feature. We provide step-by-step instructions to assist you in setting up the infrastructure and submitting a job with both methods. Additionally, you can use the Data on EKS blueprint to deploy the entire infrastructure using Terraform templates.
In this post, we walk through the process of deploying a comprehensive solution using
eksctl, Helm, and AWS CLI. Our deployment includes the following resources:
- A VPC, EKS cluster, and managed node group, set up with the
- Essential Amazon EKS managed add-ons, such as the VPC CNI, CoreDNS, and KubeProxy set up with the
- Cluster Autoscaler and Spark Operator add-ons, set up using Helm
- A Spark job execution AWS Identity and Access Management (IAM) role, IAM policy for Amazon Simple Storage Service (Amazon S3) bucket access, service account, and role-based access control, set up using the AWS CLI and
Verify that the following prerequisites are installed on your machine:
- The AWS CLI, in order to interact with AWS services. For instructions, refer to Installing or updating the latest version of the AWS CLI.
- kubectl, which allows you to run commands against Kubernetes clusters.
- eksctl, a simple CLI tool for creating EKS clusters.
- Helm 3.7+, the package manager for Kubernetes.
Set up AWS credentials
Before proceeding to the next step and running the eksctl command, you need to set up your local AWS credentials profile. For instructions, refer to Configuration and credential file settings.
Deploy the VPC, EKS cluster, and managed add-ons
The following configuration uses
us-west-1 as the default Region. To run in a different Region, update the
availabilityZones fields accordingly. Also, verify that the same Region is used in the subsequent steps throughout the post.
Enter the following code snippet into the terminal where your AWS credentials are set up. Make sure to update the
publicAccessCIDRs field with your IP before you run the command below. This will create a file named
Use the following command to create the EKS cluster :
eksctl create cluster -f eks-cluster.yaml
Deploy Cluster Autoscaler
Cluster Autoscaler is crucial for automatically adjusting the size of your Kubernetes cluster based on the current resource demands, optimizing resource utilization and cost. Create an
autoscaler-helm-values.yaml file and install the Cluster Autoscaler using Helm:
You can also set up Karpenter as a cluster autoscaler to automatically launch the right compute resources to handle your EKS cluster’s applications. You can follow this blog on how to setup and configure Karpenter.
Deploy Spark Operator
Spark Operator is an open-source Kubernetes operator specifically designed to manage and monitor Spark applications running on Kubernetes. It streamlines the process of deploying and managing Spark jobs, by providing a Kubernetes custom resource to define, configure and run Spark applications, as well as manage the job life cycle through Kubernetes API. Some customers prefer using Spark Operator to manage Spark jobs because it enables them to manage Spark applications just like other Kubernetes resources.
Currently, customers are building their open-source Spark images and using S3a committers as part of job submissions with Spark Operator or
spark-submit. However, with the new job submission option, you can now benefit from the EMR runtime in conjunction with EMRFS. Starting with Amazon EMR 6.10 and for each upcoming version of the EMR runtime, we will release the Spark Operator and its Helm chart to use the EMR runtime.
In this section, we show you how to deploy a Spark Operator Helm chart from an Amazon Elastic Container Registry (Amazon ECR) repository and submit jobs using EMR runtime images, benefiting from the performance enhancements provided by the EMR runtime.
Install Spark Operator with Helm from Amazon ECR
The Spark Operator Helm chart is stored in an ECR repository. To install the Spark Operator, you first need to authenticate your Helm client with the ECR repository. The charts are stored under the following path:
Authenticate your Helm client and install the Spark Operator:
You can authenticate to other EMR on EKS supported Regions by obtaining the AWS account ID for the corresponding Region. For more information, refer to how to select a base image URI.
Install Spark Operator
You can now install Spark Operator using the following command:
To verify that the operator has been installed correctly, run the following command:
Set up the Spark job execution role and service account
In this step, we create a Spark job execution IAM role and a service account, which will be used in Spark Operator and
spark-submit job submission examples.
First, we create an IAM policy that will be used by the IAM Roles for Service Accounts (IRSA). This policy enables the driver and executor pods to access the AWS services specified in the policy. Complete the following steps:
- As a prerequisite, either create an S3 bucket (
aws s3api create-bucket --bucket <ENTER-S3-BUCKET> --create-bucket-configuration LocationConstraint=us-west-1 --region us-west-1) or use an existing S3 bucket. Replace <ENTER-S3-BUCKET> in the following code with the bucket name.
- Create a policy file that allows read and write access to an S3 bucket:
- Create the IAM policy with the following command:
- Next, create the service account named
emr-job-execution-sa-roleas well as the IAM roles. The following
eksctlcommand creates a service account scoped to the namespace and service account defined to be used by the executor and driver. Make sure to replace <ENTER_YOUR_ACCOUNT_ID> with your account ID before running the command:
- Create an S3 bucket policy to allow only the execution role create in step 4 to write and read from the S3 bucket create in step 1. Make sure to replace <ENTER_YOUR_ACCOUNT_ID> with your account ID before running the command:
- Create a Kubernetes role and role binding required for the service account used in the Spark job run:
- Apply the Kubernetes role and role binding definition with the following command:
So far, we have completed the infrastructure setup, including the Spark job execution roles. In the following steps, we run sample Spark jobs using both Spark Operator and
spark-submit with the EMR runtime.
Configure the Spark Operator job with the EMR runtime
In this section, we present a sample Spark job that reads data from public datasets stored in S3 buckets, processes them, and writes the results to your own S3 bucket. Make sure that you update the S3 bucket in the following configuration by replacing <ENTER_S3_BUCKET> with the URI to your own S3 bucket refered in step 2 of the “Set up the Spark job execution role and service account” section. Also, note that we are using
data-team-a as a namespace and
emr-job-execution-sa as a service account, which we created in the previous step. These are necessary to run the Spark job pods in the dedicated namespace, and the IAM role linked to the service account is used to access the S3 bucket for reading and writing data.
Most importantly, notice the
image field with the EMR optimized runtime Docker image, which is currently set to
emr-6.10.0. You can change this to a newer version when it’s released by the Amazon EMR team. Also, when configuring your jobs, make sure that you include the
hadoopConf settings as defined in the following manifest. These configurations enable you to benefit from EMR runtime performance, AWS Glue Data Catalog integration, and the EMRFS optimized connector.
- Create the file (
emr-spark-operator-example.yaml) locally and update the S3 bucket location so that you can submit the job as part of the next step:
- Run the following command to submit the job to the EKS cluster:
The job may take 4–5 minutes to complete, and you can verify the successful message in the driver pod logs.
- Verify the job by running the following command:
Enable access to the Spark UI
The Spark UI is an important tool for data engineers because it allows you to track the progress of tasks, view detailed job and stage information, and analyze resource utilization to identify bottlenecks and optimize your code. For Spark jobs running on Kubernetes, the Spark UI is hosted on the driver pod and its access is restricted to the internal network of Kubernetes. To access it, we need to forward the traffic to the pod with
kubectl. The following steps take you through how to set it up.
Run the following command to forward traffic to the driver pod:
You should see text similar to the following:
If you didn’t specify the driver pod name at the submission of the
SparkApplication, you can get it with the following command:
Open a browser and enter
http://localhost:4040 in the address bar. You should be able to connect to the Spark UI.
Spark History Server
If you want to explore your job after its run, you can view it through the Spark History Server. The preceding
SparkApplication definition has the event log enabled and stores the events in an S3 bucket with the following path:
s3://YOUR-S3-BUCKET/. For instructions on setting up the Spark History Server and exploring the logs, refer to Launching the Spark history server and viewing the Spark UI using Docker.
spark-submit is a command line interface for running Apache Spark applications on a cluster or locally. It allows you to submit applications to Spark clusters. The tool enables simple configuration of application properties, resource allocation, and custom libraries, streamlining the deployment and management of Spark jobs.
Beginning with Amazon EMR 6.10,
spark-submit is supported as a job submission method. This method currently only supports cluster mode as the submission mechanism. To submit jobs using the
spark-submit method, we reuse the IAM role for the service account we set up earlier. We also use the S3 bucket used for the Spark Operator method. The steps in this section take you through how to configure and submit jobs with
spark-submit and benefit from EMR runtime improvements.
- In order to submit a job, you need to use the Spark version that matches the one available in Amazon EMR. For Amazon EMR 6.10, you need to download the Spark 3.3 version.
- You also need to make sure you have Java installed in your environment.
- Unzip the file and navigate to the root of the Spark directory.
- In the following code, replace the EKS endpoint as well as the S3 bucket then run the script:
The job takes about 7 minutes to complete with two executors of one core and 1 G of memory.
Using custom kubernetes schedulers
Customers running a large volume of jobs concurrently might face challenges related to providing fair access to compute capacity that they aren’t able to solve with the standard scheduling and resource utilization management Kubernetes offers. In addition, customers that are migrating from Amazon EMR on Amazon Elastic Compute Cloud (Amazon EC2) and are managing their scheduling with YARN queues will not be able to transpose them to Kubernetes scheduling capabilities.
To overcome this issue, you can use custom schedulers like Apache Yunikorn or Volcano.Spark Operator natively supports these schedulers, and with them you can schedule Spark applications based on factors such as priority, resource requirements, and fairness policies, while Spark Operator simplifies application deployment and management. To set up Yunikorn with gang scheduling and use it in Spark applications submitted through Spark Operator, refer to Spark Operator with YuniKorn.
To avoid unwanted charges to your AWS account, delete all the AWS resources created during this deployment:
In this post, we introduced the EMR runtime feature for Spark Operator and
spark-submit, and explored the advantages of using this feature on an EKS cluster. With the optimized EMR runtime, you can significantly enhance the performance of your Spark applications while optimizing costs. We demonstrated the deployment of the cluster using the
eksctl tool, , you can also use the Data on EKS blueprints for deploying a production-ready EKS which you can use for EMR on EKS and leverage these new deployment methods in addition to the EMR on EKS API job submission method. By running your applications on the optimized EMR runtime, you can further enhance your Spark application workflows and drive innovation in your data processing pipelines.
About the Authors
Lotfi Mouhib is a Senior Solutions Architect working for the Public Sector team with Amazon Web Services. He helps public sector customers across EMEA realize their ideas, build new services, and innovate for citizens. In his spare time, Lotfi enjoys cycling and running.
Vara Bonthu is a dedicated technology professional and Worldwide Tech Leader for Data on EKS, specializing in assisting AWS customers ranging from strategic accounts to diverse organizations. He is passionate about open-source technologies, data analytics, AI/ML, and Kubernetes, and boasts an extensive background in development, DevOps, and architecture. Vara’s primary focus is on building highly scalable data and AI/ML solutions on Kubernetes platforms, helping customers harness the full potential of cutting-edge technology for their data-driven pursuits.