Configure Amazon EMR Studio and Amazon EKS to run notebooks with Amazon EMR on EKS

Post Syndicated from Randy DeFauw original https://aws.amazon.com/blogs/big-data/configure-amazon-emr-studio-and-amazon-eks-to-run-notebooks-with-amazon-emr-on-eks/

Amazon EMR on Amazon EKS provides a deployment option for Amazon EMR that allows you to run analytics workloads on Amazon Elastic Kubernetes Service (Amazon EKS). This is an attractive option because it allows you to run applications on a common pool of resources without having to provision infrastructure. In addition, you can use Amazon EMR Studio to build analytics code that runs on Amazon EKS clusters. EMR Studio is a web-based, integrated development environment (IDE) using fully managed Jupyter notebooks that can be attached to any EMR cluster, including EMR on EKS. You log in to EMR Studio directly through a secure URL with your corporate credentials, using AWS Single Sign-On (SSO) or a compatible identity provider (IdP).

Deploying EMR Studio to attach to EMR on EKS requires integrating several AWS services:

In addition, you need to install the following EMR on EKS components:

This post helps you build all the necessary components and stitch them together by running a single script. We also describe the architecture of this setup and how the components work together.

Architecture overview

With EMR on EKS, you can run Spark applications alongside other types of applications on the same Amazon EKS cluster, which improves resource allocation and simplifies infrastructure management. For more information about how Amazon EMR operates inside an Amazon EKS cluster, see New – Amazon EMR on Amazon Elastic Kubernetes Service (EKS). EMR Studio provides a web-based IDE that makes it easy to develop, visualize, and debug applications that run in EMR. For more information, see Amazon EMR Studio (Preview): A new notebook-first IDE experience with Amazon EMR.

Spark kernels are scheduled as pods in a namespace in an Amazon EKS cluster. EMR Studio uses Jupyter Enterprise Gateway (JEG) to launch Spark kernels on Amazon EKS. A managed endpoint of type JEG is provisioned as a Kubernetes deployment in the EMR virtual cluster’s associated namespace and exposed as a Kubernetes service. Each EMR virtual cluster maps to a Kubernetes namespace registered with the Amazon EKS cluster; virtual clusters don’t manage physical compute or storage, but point to the Kubernetes namespace where the workload is scheduled. Each virtual cluster can have several managed endpoints, each with its own configured kernels for different use cases and needs. JEG managed endpoints provide HTTPS endpoints, serviced by an Application Load Balancer (ALB), that are reachable only from EMR Studio and self-hosted notebooks created within a private subnet of the Amazon EKS VPC.
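
To make the namespace mapping concrete, a virtual cluster is created by registering a Kubernetes namespace with Amazon EMR. The following is a minimal sketch, assuming an Amazon EKS cluster named eks-cluster and the sparkns namespace used in this post (the virtual cluster name is an arbitrary placeholder):

aws emr-containers create-virtual-cluster \
  --name virtual-emr-cluster-demo \
  --container-provider '{
    "id": "eks-cluster",
    "type": "EKS",
    "info": { "eksInfo": { "namespace": "sparkns" } }
  }'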

The following diagram illustrates the solution architecture.

The managed endpoint is created in the virtual cluster’s Amazon EKS namespace (in this case, sparkns) and the HTTPS endpoints are serviced from private subnets. The kernel pods run with the job-execution IAM role defined in the managed endpoint. During managed endpoint creation, EMR on EKS uses the AWS Load Balancer Controller in the kube-system namespace to create an ALB with a target group that connects with the JEG managed endpoint in the virtual cluster’s Kubernetes namespace.
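
A quick way to confirm this wiring is to check that the controller is running and to list what the managed endpoint created, assuming the controller was installed under its default name and using the sparkns namespace from this example:

# Confirm the AWS Load Balancer Controller deployment in kube-system
kubectl get deployment aws-load-balancer-controller -n kube-system

# List the ingress resource that fronts the managed endpoint
kubectl get ingress -n sparkns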

You can configure each managed endpoint’s kernel differently. For example, to permit a Spark kernel to use AWS Glue as its catalog, you can apply the following configuration JSON file in the --configuration-overrides flag when creating a managed endpoint:

aws emr-containers create-managed-endpoint \
--type JUPYTER_ENTERPRISE_GATEWAY \
--virtual-cluster-id ${virtclusterid} \
--name ${virtendpointname} \
--execution-role-arn ${role_arn} \
--release-label ${emr_release_label} \
--certificate-arn ${certarn} \
--region ${region} \
--configuration-overrides '{
    "applicationConfiguration": [
      {
        "classification": "spark-defaults",
        "properties": {
          "spark.hadoop.hive.metastore.client.factory.class": "com.amazonaws.glue.catalog.metastore.AWSGlueDataCatalogHiveClientFactory",
          "spark.sql.catalogImplementation": "hive"
        }
      }
    ]
  }'

The managed endpoint is a Kubernetes deployment fronted by a service inside the configured namespace (in this case, sparkns). When we trace the endpoint information, we can see how the Jupyter Enterprise Gateway deployment connects with the ALB and the target group:

# Get the endpoint ID
aws emr-containers list-managed-endpoints --region us-east-1 --virtual-cluster-id idzdhw2qltdr0dxkgx2oh4bp1
{
    "endpoints": [
        {
            "id": "5vbuwntrbzil1",
            "name": "virtual-emr-endpoint-demo",
            ...
            "serverUrl": "https://internal-k8s-default-ingress5-4f482e2d41-2097665209.us-east-1.elb.amazonaws.com:18888",

# List the deployment
kubectl get deployments -n sparkns -l "emr-containers.amazonaws.com/managed-endpoint-id=5vbuwntrbzil1"

NAME                READY   UP-TO-DATE   AVAILABLE   AGE
jeg-5vbuwntrbzil1   1/1     1            1           4h54m


# List the service
kubectl get svc -n sparkns -l "emr-containers.amazonaws.com/managed-endpoint-id=5vbuwntrbzil1"

NAME                    TYPE       CLUSTER-IP       EXTERNAL-IP   PORT(S)           AGE
service-5vbuwntrbzil1   NodePort   10.100.172.157   <none>        18888:30091/TCP   4h58m

# List the TargetGroups to get the TargetGroup ARN

kubectl get targetgroupbinding -n sparkns -o json | jq .items | jq .[].spec.targetGroupARN

"arn:aws:elasticloadbalancing:us-east-1:< account id >:targetgroup/k8s-sparkns-servicey-a37caa5e1e/02d10652a64cebd8"

# Get the TargetGroup Port number

aws elbv2 describe-target-groups --target-group-arns arn:aws:elasticloadbalancing:us-east-1:< account id >:targetgroup/k8s-sparkns-servicey-a37caa5e1e/02d10652a64cebd8 | jq .TargetGroups | jq .[].Port

30091


# Get Load Balancer ARN

aws elbv2 describe-target-groups --target-group-arns arn:aws:elasticloadbalancing:us-east-1:< account id >:targetgroup/k8s-sparkns-servicey-a37caa5e1e/02d10652a64cebd8 | jq .TargetGroups | jq .[].LoadBalancerArns | jq .[]

"arn:aws:elasticloadbalancing:us-east-1:< account id >:loadbalancer/app/k8s-sparkns-ingressy-830efa48aa/12199b1a7baee273"

# Get Listener Port number

aws elbv2 describe-listeners --load-balancer-arn arn:aws:elasticloadbalancing:us-east-1:< account id >:loadbalancer/app/k8s-sparkns-ingressy-830efa48aa/12199b1a7baee273 | jq .Listeners | jq .[].Port

18888

To see how the pieces connect, consider two EMR Studio sessions. The ALB exposes port 18888 to the EMR Studio sessions. The JEG service maps the ALB’s external port 18888 to its dynamic NodePort (in this case, 30091). The JEG service then forwards the traffic to the TargetPort 9547, which routes it to the appropriate Spark driver pod. Each notebook session has its own kernel, with its own respective Spark driver and executor pods, as the following diagram illustrates.
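
You can read this mapping directly from the Kubernetes service definition. A small check, reusing the endpoint ID from the earlier listing:

# Print port -> nodePort -> targetPort for the JEG service
kubectl get svc service-5vbuwntrbzil1 -n sparkns \
  -o jsonpath='{range .spec.ports[*]}{.port} -> {.nodePort} -> {.targetPort}{"\n"}{end}'

# With the values in this example, the output is: 18888 -> 30091 -> 9547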

Attach EMR Studio to a virtual cluster and managed endpoint

Each time a user attaches a virtual cluster and a managed endpoint to their Studio Workspace and launches a Spark session, Spark drivers and Spark executors are scheduled. You can see this when you run kubectl to check which pods were launched:

$ kubectl get all -l app=enterprise-gateway
NAME                                  READY   STATUS      RESTARTS   AGE
pod/kb1a317e8-b77b-448c-9b7d-exec-1   1/1     Running     0          2m30s
pod/kb1a317e8-b77b-448c-9b7d-exec-2   1/1     Running     0          2m30s
pod/kb1a317e8-b77b-448c-9b7d-driver   2/2     Running     0          2m38s

$ kubectl get pods -n sparkns
NAME                                  READY   STATUS      RESTARTS   AGE
jeg-5vbuwntrbzil1-5fc8469d5f-pfdv9    1/1     Running     0          3d7h
kb1a317e8-b77b-448c-9b7d-exec-1       1/1     Running     0          2m38s
kb1a317e8-b77b-448c-9b7d-exec-2       1/1     Running     0          2m38s
kb1a317e8-b77b-448c-9b7d-driver       2/2     Running     0          2m46s

Each notebook Spark kernel session deploys a driver pod and executor pods that continue running until the kernel session is shut down.

The code in the notebook cells runs in the executor pods that were deployed in the Amazon EKS cluster.
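
Because the driver and executors are ordinary Kubernetes pods, you can inspect them with kubectl while a session is active. For example, to follow the driver log for the kernel shown above (the container name below is the usual Spark-on-Kubernetes driver container; use kubectl describe pod to confirm it in your cluster):

kubectl logs -f kb1a317e8-b77b-448c-9b7d-driver -n sparkns -c spark-kubernetes-driver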

Set up EMR on EKS and EMR Studio

Several steps and pieces are required to set up both EMR on EKS and EMR Studio. Enabling AWS SSO is a prerequisite. You can use one of the two launch scripts provided in this section, or deploy the components manually using the steps provided later in this post.

We provide two launch scripts in this post. One is a bash script that uses AWS CloudFormation, eksctl, and AWS Command Line Interface (AWS CLI) commands to provide an end-to-end deployment of a complete solution. The other uses the AWS Cloud Development Kit (AWS CDK) to do so.

The following diagram shows the architecture and components that we deploy.

Prerequisites

Make sure to complete the following prerequisites:

For information about the supported IdPs, see Enable AWS Single Sign-On for Amazon EMR Studio.

Bash script

The script is available on GitHub.

Prerequisites

The script requires you to use AWS Cloud9. Follow the instructions in the Amazon EKS Workshop, making sure to complete these steps carefully:

After you deploy the AWS Cloud9 desktop, proceed to the next steps.

Preparation

Use the following code to clone the GitHub repo and prepare the AWS Cloud9 prerequisites:

# Download script from the repository
$ git clone https://github.com/aws-samples/amazon-emr-on-eks-emr-studio.git

# Prepare the Cloud9 Desktop pre-requisites
$ cd amazon-emr-on-eks-emr-studio
$ bash ./prepare_cloud9.sh

Deploy the stack

Before running the script, provide the following information:

  • The AWS account ID and Region, if your AWS Cloud9 desktop isn’t in the same account or Region where you want to deploy EMR on EKS
  • The name of the Amazon Simple Storage Service (Amazon S3) bucket to create
  • The AWS SSO user to be associated with the EMR Studio session

After the script deploys the stack, the URL to the deployed EMR Studio is displayed:

# Launch the script and follow the instructions to provide user parameters
$ bash ./deploy_eks_cluster_bash.sh

...
Go to https://***.emrstudio-prod.us-east-1.amazonaws.com and log in using < SSO user > ...

AWS CDK script

The AWS CDK scripts are available on GitHub. You need to check out the main branch. The stacks deploy an Amazon EKS cluster and an EMR on EKS virtual cluster in a new VPC with private subnets, and optionally an Amazon Managed Workflows for Apache Airflow (Amazon MWAA) environment and EMR Studio.

Prerequisites

You need the AWS CDK version 1.90.1 or higher. For more information, see Getting started with the AWS CDK.

We use a prefix list to restrict access to some resources to network IP ranges that you approve. Create a prefix list if you don’t already have one.
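
If you need to create one, the following is a minimal sketch with a single approved CIDR (the name and CIDR are placeholders; substitute your own network range):

aws ec2 create-managed-prefix-list \
  --prefix-list-name approved-ips \
  --address-family IPv4 \
  --max-entries 10 \
  --entries Cidr=203.0.113.0/24,Description=office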

If you plan to use EMR Studio, you need AWS SSO configured in your account.

Preparation

After you clone the repository and checkout the main branch, create and activate a new Python virtual environment:

# Clone the repository
$ git clone https://github.com/aws-samples/aws-cdk-for-emr-on-eks.git
$ cd aws-cdk-for-emr-on-eks/
$ git checkout main

# Create and activate a new Python virtual environment
$ python3 -m venv .venv
$ source .venv/bin/activate

Now install the Python dependencies:

$ pip install -r requirements.txt

Lastly, bootstrap the AWS CDK:

$ cdk bootstrap aws://<account>/<region> \
  --context prefix=<prefix list> \
  --context instance=m5.xlarge \
  --context username=<SSO user name>

Deploy the stacks

Synthesize the AWS CDK stacks with the following code:

$ cdk synth \
  --context prefix=<prefix list> \
  --context instance=m5.xlarge \
  --context username=<SSO user name>

This command generates four stacks:

  • emr-eks-cdk – The main stack
  • mwaa-cdk – Adds Amazon MWAA
  • studio-cdk – Adds EMR Studio prerequisites
  • studio-live-cdk – Adds EMR Studio

The following diagram illustrates the resources deployed by the AWS CDK stacks.

Start by deploying the first stack:

$ cdk deploy emr-eks-cdk \
  --context prefix=<prefix list> \
  --context instance=m5.xlarge \
  --context username=<SSO user name>

If you want to use Apache Airflow as your orchestrator, deploy that stack:

$ cdk deploy mwaa-cdk \
  --context prefix=<prefix list> \
  --context instance=m5.xlarge \
  --context username=<SSO user name>

Deploy the first EMR Studio stack:

$ cdk deploy studio-cdk \
  --context prefix=<prefix list> \
  --context instance=m5.xlarge \
  --context username=<SSO user name>

Wait for the managed endpoint to become active. You can check the status by running the following code:

$ aws emr-containers list-managed-endpoints --virtual-cluster-id <cluster ID> | jq '.endpoints[].state'

The virtual cluster ID is available in the AWS CDK output from the emr-eks-cdk stack.

When the endpoint is active, deploy the second EMR Studio stack:

$ cdk deploy studio-live-cdk \
  --context prefix=<prefix list> \
  --context instance=m5.xlarge \
  --context username=<SSO user name>

Manual deployment

If you prefer to manually deploy EMR on EKS and EMR Studio, use the steps in this section.

Set up a VPC

If you’re using Amazon EKS version 1.18, set up a VPC that has private subnets and is tagged appropriately for external load balancers. For the tagging requirements, see Application load balancing on Amazon EKS and Create an EMR Studio service role.
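
As a sketch of the tagging, assuming a placeholder private subnet ID and an Amazon EKS cluster named eks-cluster (EMR Studio also expects the for-use-with-amazon-emr-managed-policies tag on the subnets it uses):

aws ec2 create-tags \
  --resources subnet-0abc1234567890def \
  --tags Key=kubernetes.io/role/internal-elb,Value=1 \
         Key=kubernetes.io/cluster/eks-cluster,Value=shared \
         Key=for-use-with-amazon-emr-managed-policies,Value=true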

Create an Amazon EKS cluster

Launch an Amazon EKS cluster with at least one managed node group. For instructions, see Setting up and Getting Started with Amazon EKS.

Create relevant IAM policies, roles, IdP, and SSL/TLS certificate

To create your IAM policies, roles, IdP, and SSL/TLS certificate, complete the following steps:

  1. Enable cluster access for EMR on EKS (steps 1 and 2 are sketched after this list).
  2. Create an IdP in IAM based on the EKS OIDC provider URL.
  3. Create an SSL/TLS certificate and place it in AWS Certificate Manager.
  4. Create the relevant IAM policies and roles:
    1. Job execution role
    2. Update the trust policy for the job execution role
    3. Deploy and create the IAM policy for the AWS Load Balancer Controller
    4. EMR Studio service role
    5. EMR Studio user role
    6. EMR Studio user policies associated with AWS SSO users and groups
  5. Register the Amazon EKS cluster with Amazon EMR to create the EMR virtual cluster.
  6. Create the appropriate security groups to attach to each EMR Studio that you create:
    1. Workspace security group
    2. Engine security group
  7. Tag the security groups with the appropriate tags. For instructions, see Create an EMR Studio service role.
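
The following is a minimal sketch of steps 1 and 2 with eksctl, assuming an Amazon EKS cluster named eks-cluster and the sparkns namespace:

# Step 1: allow Amazon EMR on EKS to register and run workloads in the namespace
eksctl create iamidentitymapping \
  --cluster eks-cluster \
  --namespace sparkns \
  --service-name "emr-containers"

# Step 2: create an OIDC identity provider for the cluster in IAM
eksctl utils associate-iam-oidc-provider --cluster eks-cluster --approve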

Required installs in Amazon EKS

Deploy the AWS Load Balancer Controller in the Amazon EKS cluster if you haven’t already done so.
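
One way to install the controller is with Helm, assuming that an IAM role and Kubernetes service account for it have already been created as described in the controller documentation:

helm repo add eks https://aws.github.io/eks-charts
helm repo update
helm install aws-load-balancer-controller eks/aws-load-balancer-controller \
  -n kube-system \
  --set clusterName=<EKS cluster name> \
  --set serviceAccount.create=false \
  --set serviceAccount.name=aws-load-balancer-controller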

Create EMR on EKS relevant pieces and map the user to EMR Studio

Complete the following steps:

  1. Create at least one EMR virtual cluster associated with the Amazon EKS cluster. For instructions, see Step 1 of Set up Amazon EMR on EKS for EMR Studio.
  2. Create at least one managed endpoint. For instructions, see Step 2 of Set up Amazon EMR on EKS for EMR Studio.
  3. Create at least one EMR Studio; associate the EMR Studio with the private subnets configured with the Amazon EKS cluster. For instructions, see Create an EMR Studio (steps 3 and 4 are also sketched after this list).
  4. When the EMR Studio is available, map an AWS SSO user or group to the EMR Studio and apply an appropriate IAM policy to that user.
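
The following is a minimal AWS CLI sketch of steps 3 and 4; the IDs and ARNs are placeholders to replace with your own values:

# Step 3: create the Studio in the private subnets of the Amazon EKS VPC
aws emr create-studio \
  --name emr-studio-demo \
  --auth-mode SSO \
  --vpc-id <vpc-id> \
  --subnet-ids <private-subnet-1> <private-subnet-2> \
  --service-role <EMR Studio service role ARN> \
  --user-role <EMR Studio user role ARN> \
  --workspace-security-group-id <workspace security group ID> \
  --engine-security-group-id <engine security group ID> \
  --default-s3-location s3://<bucket>/studio/

# Step 4: map an AWS SSO user to the Studio with a session policy
aws emr create-studio-session-mapping \
  --studio-id <studio ID> \
  --identity-name <SSO user name> \
  --identity-type USER \
  --session-policy-arn <session policy ARN>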

Use EMR Studio

To start using EMR Studio, complete the following steps:

  1. Find the URL for EMR Studio by listing the studios in a Region:
$ aws emr list-studios --region us-east-1
{
    "Studios": [
        {
            "StudioId": "es-XXXXXXXXXXXXXXXXXXXXXX",
            "Name": "emr_studio_1",
            "VpcId": "vpc-XXXXXXXXXXXXXXXXXXXX",
            "Url": "https://es-XXXXXXXXXXXXXXXXXXXXXX.emrstudio-prod.us-east-1.amazonaws.com",
            "CreationTime": "2021-02-10T14:04:13.672000+00:00"
        }
    ]
}
  2. With the listed URL, log in using the AWS SSO username you used earlier.

After authentication, the user is routed to the EMR Studio dashboard.

  3. Choose Create Workspace.
  4. For Workspace name, enter a name.
  5. For Subnet, choose the subnet that corresponds to one of the subnets associated with the managed node group.
  6. For S3 location, enter an S3 bucket where you can store the notebook content.

  7. After you create the Workspace, choose it once it reaches the Ready status.

  8. In the sidebar, choose the EMR cluster icon.
  9. Under Cluster type, choose EMR Cluster on EKS.
  10. Choose the available virtual cluster and available managed endpoint.
  11. Choose Attach.

After it’s attached, EMR Studio displays the kernels available in the Notebook and Console section.

  12. Choose PySpark (Kubernetes) to launch a notebook kernel and start a Spark session.

Because the endpoint configuration here uses AWS Glue for its metastore, you can list the databases and tables connected to the AWS Glue Data Catalog. You can use the following example script to test the setup. Modify the script as necessary for the appropriate database and table that you have in your Data Catalog:

words='Welcome to Amazon EMR Studio'.split(' ')
wordRDD = sc.parallelize(words)
wc = wordRDD.map(lambda word: (word, 1)).reduceByKey(lambda a,b: a+b)
print(wc.collect())

# Connect to Glue Catalog
spark.sql("""show databases like '< Database Name >'""").show(truncate=False)
spark.sql("""show tables in < Database Name >""").show(truncate=False)
# Run a simple select
spark.sql("""select * from < Database Name >.< Table Name > limit 10""").show(truncate=False)


Clean up

To avoid incurring future charges, delete the resources launched here by running remove_setup.sh:

# Launch the script
$ bash ./remove_setup.sh

Conclusion

EMR on EKS allows you to run applications on a common pool of resources inside an Amazon EKS cluster without having to provision infrastructure. EMR Studio is a fully managed Jupyter notebook environment that provisions kernels running on EMR clusters, including virtual clusters on Amazon EKS. In this post, we described the architecture of how EMR Studio connects with EMR on EKS and provided scripts that automatically deploy all the components needed to connect the two services.

If you have questions or suggestions, please leave a comment.


About the Authors

Randy DeFauw is a Principal Solutions Architect at Amazon Web Services. He works with AWS customers to provide guidance and technical assistance on database projects, helping them improve the value of their solutions when using AWS.

Matthew Tan is a Senior Analytics Solutions Architect at Amazon Web Services. He provides guidance to customers building solutions for their analytics workloads with AWS analytics services.