All posts by Damon Cortesi

Use Amazon EMR with S3 Access Grants to scale Spark access to Amazon S3

Post Syndicated from Damon Cortesi original https://aws.amazon.com/blogs/big-data/use-amazon-emr-with-s3-access-grants-to-scale-spark-access-to-amazon-s3/

Amazon EMR is pleased to announce integration with Amazon Simple Storage Service (Amazon S3) Access Grants that simplifies Amazon S3 permission management and allows you to enforce granular access at scale. With this integration, you can scale job-based Amazon S3 access for Apache Spark jobs across all Amazon EMR deployment options and enforce granular Amazon S3 access for better security posture.

In this post, we’ll walk through a few different scenarios of how to use Amazon S3 Access Grants. Before we get started on walking through the Amazon EMR and Amazon S3 Access Grants integration, we’ll set up and configure S3 Access Grants. Then, we’ll use the AWS CloudFormation template below to create an Amazon EMR on Amazon Elastic Compute Cloud (Amazon EC2) Cluster, an EMR Serverless application and two different job roles.

After the setup, we’ll run a few scenarios of how you can use Amazon EMR with S3 Access Grants. First, we’ll run a batch job on EMR on Amazon EC2 to import CSV data and convert to Parquet. Second, we’ll use Amazon EMR Studio with an interactive EMR Serverless application to analyze the data. Finally, we’ll show how to set up cross-account access for Amazon S3 Access Grants. Many customers use different accounts across their organization and even outside their organization to share data. Amazon S3 Access Grants make it easy to grant cross-account access to your data even when filtering by different prefixes.

Besides this post, you can learn more about Amazon S3 Access Grants from Scaling data access with Amazon S3 Access Grants.

Prerequisites

Before you launch the AWS CloudFormation stack, ensure you have the following:

  • An AWS account that provides access to AWS services
  • The latest version of the AWS Command Line Interface (AWS CLI)
  • An AWS Identity and Access Management (AWS IAM) user with an access key and secret key to configure the AWS CLI, and permissions to create an IAM role, IAM policies, and stacks in AWS CloudFormation
  • A second AWS account if you wish to test the cross-account functionality

Walkthrough

Create resources with AWS CloudFormation

In order to use Amazon S3 Access Grants, you’ll need a cluster with Amazon EMR 6.15.0 or later. For more information, see the documentation for using Amazon S3 Access Grants with an Amazon EMR cluster, an Amazon EMR on EKS cluster, and an Amazon EMR Serverless application. For the purpose of this post, we’ll assume that you have two different types of data access users in your organization—analytics engineers with read and write access to the data in the bucket and business analysts with read-only access. We’ll utilize two different AWS IAM roles, but you can also connect your own identity provider directly to IAM Identity Center if you like.

Here’s the architecture for this first portion. The AWS CloudFormation stack creates the following AWS resources:

  • A Virtual Private Cloud (VPC) stack with private and public subnets to use with EMR Studio, route tables, and Network Address Translation (NAT) gateway.
  • An Amazon S3 bucket for EMR artifacts like log files, Spark code, and Jupyter notebooks.
  • An Amazon S3 bucket with sample data to use with S3 Access Grants.
  • An Amazon EMR cluster configured to use runtime roles and S3 Access Grants.
  • An Amazon EMR Serverless application configured to use S3 Access Grants.
  • An Amazon EMR Studio where users can login and create workspace notebooks with the EMR Serverless application.
  • Two AWS IAM roles we’ll use for our EMR job runs: one for Amazon EC2 with write access and another for Serverless with read access.
  • One AWS IAM role that will be used by S3 Access Grants to access bucket data (i.e., the Role to use when registering a location with S3 Access Grants. S3 Access Grants use this role to create temporary credentials).

To get started, complete the following steps:

  1. Choose Launch Stack:
  2. Accept the defaults and select I acknowledge that this template may create IAM resources.

The AWS CloudFormation stack takes approximately 10–15 minutes to complete. Once the stack is finished, go to the outputs tab where you will find information necessary for the following steps.

Create Amazon S3 Access Grants resources

First, we’re going to create an Amazon S3 Access Grants resources in our account. We create an S3 Access Grants instance, an S3 Access Grants location that refers to our data bucket created by the AWS CloudFormation stack that is only accessible by our data bucket AWS IAM role, and grant different levels of access to our reader and writer roles.

To create the necessary S3 Access Grants resources, use the following AWS CLI commands as an administrative user and replace any of the fields between the arrows with the output from your CloudFormation stack.

aws s3control create-access-grants-instance \
  --account-id <YOUR_ACCOUNT_ID>

Next, we create a new S3 Access Grants location. What is a Location? Amazon S3 Access Grants works by vending AWS IAM credentials with access scoped to a particular S3 prefix. An S3 Access Grants location will be associated with an AWS IAM Role from which these temporary sessions will be created.

In our case, we’re going to scope the AWS IAM Role to the bucket created with our AWS CloudFormation stack and give access to the data bucket role created by the stack. Go to the outputs tab to find the values to replace with the following code snippet:

aws s3control create-access-grants-location \
  --account-id <YOUR_ACCOUNT_ID> \
  --location-scope "s3://<DATA_BUCKET>/" \
  --iam-role-arn <DATA_BUCKET_ROLE>

Note the AccessGrantsLocationId value in the response. We’ll need that for the next steps where we’ll walk through creating the necessary S3 Access Grants to limit read and write access to your bucket.

  • For the read/write user, use s3-control create-access-grant to allow READWRITE access to the “output/*” prefix:
    aws s3control create-access-grant \
      --account-id <YOUR_ACCOUNT_ID> \
      --access-grants-location-id <LOCATION_ID_FROM_PREVIOUS_COMMAND> \
      --access-grants-location-configuration S3SubPrefix="output/*" \
      --permission READWRITE \
      --grantee GranteeType=IAM,GranteeIdentifier=<DATA_WRITER_ROLE>

  • For the read user, use s3control create-access-grant again to allow only READ access to the same prefix:
    aws s3control create-access-grant \
      --account-id <YOUR_ACCOUNT_ID> \
      --access-grants-location-id <LOCATION_ID_FROM_PREVIOUS_COMMAND> \
      --access-grants-location-configuration S3SubPrefix="output/*" \
      --permission READ \
      --grantee GranteeType=IAM,GranteeIdentifier=<DATA_READER_ROLE>

Demo Scenario 1: Amazon EMR on EC2 Spark Job to generate Parquet data

Now that we’ve got our Amazon EMR environments set up and granted access to our roles via S3 Access Grants, it’s important to note that the two AWS IAM roles for our EMR cluster and EMR Serverless application have an IAM policy that only allow access to our EMR artifacts bucket. They have no IAM access to our S3 data bucket and instead use S3 Access Grants to fetch short-lived credentials scoped to the bucket and prefix. Specifically, the roles are granted s3:GetDataAccess and s3:GetDataAccessGrantsInstanceForPrefix permissions to request access via the specific S3 Access Grants instance created in our region. This allows you to easily manage your S3 access in one place in a highly scoped and granular fashion that enhances your security posture. By combining S3 Access Grants with job roles on EMR on Amazon Elastic Kubernetes Service (Amazon EKS) and EMR Serverless as well as runtime roles for Amazon EMR steps beginning with EMR 6.7.0, you can easily manage access control for individual jobs or queries. S3 Access Grants are available on EMR 6.15.0 and later. Let’s first run a Spark job on EMR on EC2 as our analytics engineer to convert some sample data into Parquet.

For this, use the sample code provided in converter.py. Download the file and copy it to the EMR_ARTIFACTS_BUCKET created by the AWS CloudFormation stack. We’ll submit our job with the ReadWrite AWS IAM role. Note that for the EMR cluster, we configured S3 Access Grants to fall back to the IAM role if access is not provided by S3 Access Grants. The DATA_WRITER_ROLE has read access to the EMR artifacts bucket through an IAM policy so it can read our script. As before, replace all the values with the <> symbols from the Outputs tab of your CloudFormation stack.

aws s3 cp converter.py s3://<EMR_ARTIFACTS_BUCKET>/code/
aws emr add-steps --cluster-id <EMR_CLUSTER_ID> \
    --execution-role-arn <DATA_WRITER_ROLE> \
    --steps '[
        {
            "Type": "CUSTOM_JAR",
            "Name": "converter",
            "ActionOnFailure": "CONTINUE",
            "Jar": "command-runner.jar",
            "Args": [
                    "spark-submit",
                    "--deploy-mode",
                    "client",
                    "s3://<EMR_ARTIFACTS_BUCKET>/code/converter.py",
                    "s3://<DATA_BUCKET>/output/weather-data/"
            ]
        }
    ]'

Once the job finishes, we should see some Parquet data in s3://<DATA_BUCKET>/output/weather-data/. You can see the status of the job in the Steps tab of the EMR console.

Demo Scenario 2: EMR Studio with an interactive EMR Serverless application to analyze data

Now let’s go ahead and login to EMR Studio and connect to your EMR Serverless application with the ReadOnly runtime role to analyze the data from scenario 1. First we need to enable the interactive endpoint on your Serverless application.

  • Select the EMRStudioURL in the Outputs tab of your AWS CloudFormation stack.
  • Select Applications under the Serverless section on the left-hand side.
  • Select the EMRBlog application, then the Action dropdown, and Configure.
  • Expand the Interactive endpoint section and make sure that Enable interactive endpoint is checked.
  • Scroll down and click Configure application to save your changes.
  • Back on the Applications page, select EMRBlog application, then the Start application button.

Next, create a new workspace in our Studio.

  • Choose Workspaces on the left-hand side, then the Create workspace button.
  • Enter a Workspace name, leave the remaining defaults, and choose Create Workspace.
  • After creating the workspace, it should launch in a new tab in a few seconds.

Now connect your Workspace to your EMR Serverless application.

  • Select the EMR Compute button on the left-hand side as shown in the following code.
  • Choose EMR Serverless as the compute type.
  • Choose the EMRBlog application and the runtime role that starts with EMRBlog.
  • Choose Attach. The window will refresh and you can open a new PySpark notebook and follow along below. To execute the code yourself, download the AccessGrantsReadOnly.ipynb notebook and upload it into your workspace using the Upload Files button in the file browser.

Let’s do a quick read of the data.

df = spark.read.parquet(f"s3://{DATA_BUCKET}/output/weather-data/")
df.createOrReplaceTempView("weather")
df.show()

We’ll do a simple count(*):

spark.sql("SELECT year, COUNT(*) FROM weather GROUP BY 1").show()


You can also see that if we try to write data into the output location, we get an Amazon S3 error.

df.write.format("csv").mode("overwrite").save("s3://<DATA_BUCKET>/output/weather-data-2/")

While you can also grant similar access via AWS IAM policies, Amazon S3 Access Grants can be useful for situations where your organization has outgrown managing access via IAM, wants to map S3 Access Grants to IAM Identity Center principals or roles, or has previously used EMR File System (EMRFS) Role Mappings. S3 Access Grants credentials are also temporary providing more secure access to your data. In addition, as shown below, cross-account access also benefits from the simplicity of S3 Access Grants.

Demo Scenario 3 – Cross-account access

One of the other more common access patterns is accessing data across accounts. This pattern has become increasingly common with the emergence of data mesh, where data producers and consumers are decentralized across different AWS accounts.

Previously, cross-account access required setting up complex cross-account assume role actions and custom credentials providers when configuring your Spark job. With S3 Access Grants, we only need to do the following:

  • Create an Amazon EMR job role and cluster in a second data consumer account
  • The data producer account grants access to the data consumer account with a new instance resource policy
  • The data producer account creates an access grant for the data consumer job role

And that’s it! If you have a second account handy, go ahead and deploy this AWS CloudFormation stack in the data consumer account, to create a new EMR Serverless application and job role. If not, just follow along below. The AWS CloudFormation stack should finish creating in under a minute. Next, let’s go ahead and grant our data consumer access to the S3 Access Grants instance in our data producer account.

  • Replace <DATA_PRODUCER_ACCOUNT_ID> and <DATA_CONSUMER_ACCOUNT_ID> with the relevant 12-digit AWS account IDs.
  • You may also need to change the region in the command and policy.
    aws s3control put-access-grants-instance-resource-policy \
        --account-id <DATA_PRODUCER_ACCOUNT_ID> \
        --region us-east-2 \
        --policy '{
        "Version": "2012-10-17",
        "Id": "S3AccessGrantsPolicy",
        "Statement": [
            {
                "Sid": "AllowAccessToS3AccessGrants",
                "Principal": {
                    "AWS": "<DATA_CONSUMER_ACCOUNT_ID>"
                },
                "Effect": "Allow",
                "Action": [
                    "s3:ListAccessGrants",
                    "s3:ListAccessGrantsLocations",
                    "s3:GetDataAccess"
                ],
                "Resource": "arn:aws:s3:us-east-2:<DATA_PRODUCER_ACCOUNT_ID>:access-grants/default"
            }
        ]
    }'

  • And then grant READ access to the output folder to our EMR Serverless job role in the data consumer account.
    aws s3control create-access-grant \
        --account-id <DATA_PRODUCER_ACCOUNT_ID> \
        --region us-east-2 \
        --access-grants-location-id default \
        --access-grants-location-configuration S3SubPrefix="output/*" \
        --permission READ \
        --grantee GranteeType=IAM,GranteeIdentifier=arn:aws:iam::<DATA_CONSUMER_ACCOUNT_ID>:role/<EMR_SERVERLESS_JOB_ROLE> \
        --region us-east-2

Now that we’ve done that, we can read data in the data consumer account from the bucket in the data producer account. We’ll just run a simple COUNT(*) again. Replace the <APPLICATION_ID>, <DATA_CONSUMER_JOB_ROLE>, and <DATA_CONSUMER_LOG_BUCKET> with the values from the Outputs tab on the AWS CloudFormation stack created in your second account.

And replace <DATA_PRODUCER_BUCKET> with the bucket from your first account.

aws emr-serverless start-job-run \
  --application-id <APPLICATION_ID> \
  --execution-role-arn <DATA_CONSUMER_JOB_ROLE> \
  --configuration-overrides '{
        "monitoringConfiguration": {
            "s3MonitoringConfiguration": {
                "logUri": "s3://<DATA_CONSUMER_LOG_BUCKET>/logs/"
            }
        }
    }' \
  --job-driver '{
    "sparkSubmit": {
        "entryPoint": "SELECT COUNT(*) FROM parquet.`s3://<DATA_PRODUCER_BUCKET>/output/weather-data/`",
        "sparkSubmitParameters": "--class org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver -e"
    }
  }'

Wait for the job to reach a completed state, and then fetch the stdout log from your bucket, replacing the <APPLICATION_ID>, <JOB_RUN_ID> from the job above, and <DATA_CONSUMER_LOG_BUCKET>.

aws emr-serverless get-job-run --application-id <APPLICATION_ID> --job-run-id <JOB_RUN_ID>
{
    "jobRun": {
        "applicationId": "00feq2s6g89r2n0d",
        "jobRunId": "00feqnp2ih45d80e",
        "state": "SUCCESS",
        ...
}

If you are on a unix-based machine and have gunzip installed, then you can use the following command as your administrative user.

Note that this command only uses AWS IAM Role Policies, not Amazon S3 Access Grants.

aws s3 ls s3:// <DATA_CONSUMER_LOG_BUCKET>/logs/applications/<APPLICATION_ID>/jobs/<JOB_RUN_ID>/SPARK_DRIVER/stdout.gz - | gunzip

Otherwise, you can use the get-dashboard-for-job-run command and open the resulting URL in your browser to view the Driver stdout logs in the Executors tab of the Spark UI.

aws emr-serverless get-dashboard-for-job-run --application-id <APPLICATION_ID> --job-run-id <JOB_RUN_ID>

Cleaning up

In order to avoid incurring future costs for examples resources in your AWS accounts, be sure to take the following steps:

  • You must manually delete the Amazon EMR Studio workspace created in the first part of the post
  • Empty the Amazon S3 buckets created by the AWS CloudFormation stacks
  • Make sure you delete the Amazon S3 Access Grants, resource policies, and S3 Access Grants location created in the steps above using the delete-access-grant, delete-access-grants-instance-resource-policy, delete-access-grants-location, and delete-access-grants-instance commands.
  • Delete the AWS CloudFormation Stacks created in each account

Comparison to AWS IAM Role Mapping

In 2018, EMR introduced EMRFS role mapping as a way to provide storage-level authorization by configuring EMRFS with multiple IAM roles. While effective, role mapping required managing users or groups locally on your EMR cluster in addition to maintaining the mappings between those identities and their corresponding IAM roles. In combination with runtime roles on EMR on EC2 and job roles for EMR on EKS and EMR Serverless, it is now easier to grant access to your data on S3 directly to the relevant principal on a per-job basis.

Conclusion

In this post, we showed you how to set up and use Amazon S3 Access Grants with Amazon EMR in order to easily manage data access for your Amazon EMR workloads. With S3 Access Grants and EMR, you can easily configure access to data on S3 for IAM identities or using your corporate directory in IAM Identity Center as your identity source. S3 Access Grants is supported across EMR on EC2, EMR on EKS, and EMR Serverless starting in EMR release 6.15.0.

To learn more, see the S3 Access Grants and EMR documentation and feel free to ask any questions in the comments!


About the author

Damon Cortesi is a Principal Developer Advocate with Amazon Web Services. He builds tools and content to help make the lives of data engineers easier. When not hard at work, he still builds data pipelines and splits logs in his spare time.

Build, deploy, and run Spark jobs on Amazon EMR with the open-source EMR CLI tool

Post Syndicated from Damon Cortesi original https://aws.amazon.com/blogs/big-data/build-deploy-and-run-spark-jobs-on-amazon-emr-with-the-open-source-emr-cli-tool/

Today, we’re pleased to introduce the Amazon EMR CLI, a new command line tool to package and deploy PySpark projects across different Amazon EMR environments. With the introduction of the EMR CLI, you now have a simple way to not only deploy a wide range of PySpark projects to remote EMR environments, but also integrate with your CI/CD solution of choice.

In this post, we show how you can use the EMR CLI to create a new PySpark project from scratch and deploy it to Amazon EMR Serverless in one command.

Overview of solution

The EMR CLI is an open-source tool to help improve the developer experience of developing and deploying jobs on Amazon EMR. When you’re just getting started with Apache Spark, there are a variety of options with respect to how to package, deploy, and run jobs that can be overwhelming or require deep domain expertise. The EMR CLI provides simple commands for these actions that remove the guesswork from deploying Spark jobs. You can use it to create new projects or alongside existing PySpark projects.

In this post, we walk through creating a new PySpark project that analyzes weather data from the NOAA Global Surface Summary of Day open dataset. We’ll use the EMR CLI to do the following:

  1. Initialize the project.
  2. Package the dependencies.
  3. Deploy the code and dependencies to Amazon Simple Storage Service (Amazon S3).
  4. Run the job on EMR Serverless.

Prerequisites

For this walkthrough, you should have the following prerequisites:

  • An AWS account
  • An EMR Serverless application in the us-east-1 Region
  • An S3 bucket for your code and logs in the us-east-1 Region
  • An AWS Identity and Access Management (IAM) job role that can run EMR Serverless jobs and access S3 buckets
  • Python version >= 3.7
  • Docker

If you don’t already have an existing EMR Serverless application, you can use the following AWS CloudFormation template or use the emr bootstrap command after you’ve installed the CLI.

BDB-2063-launch-cloudformation-stack

Install the EMR CLI

You can find the source for the EMR CLI in the GitHub repo, but it’s also distributed via PyPI. It requires Python version >= 3.7 to run and is tested on macOS, Linux, and Windows. To install the latest version, use the following command:

pip3 install emr-cli

You should now be able to run the emr --help command and see the different subcommands you can use:

❯ emr --help                                                                                                
Usage: emr [OPTIONS] COMMAND [ARGS]...                                                                      
                                                                                                            
  Package, deploy, and run PySpark projects on EMR.                                                         
                                                                                                            
Options:                                                                                                    
  --help  Show this message and exit.                                                                       
                                                                                                            
Commands:                                                                                                   
  bootstrap  Bootstrap an EMR Serverless environment.                                                       
  deploy     Copy a local project to S3.                                                                    
  init       Initialize a local PySpark project.                                                            
  package    Package a project and dependencies into dist/                                                  
  run        Run a project on EMR, optionally build and deploy                                              
  status 

If you didn’t already create an EMR Serverless application, the bootstrap command can create a sample environment for you and a configuration file with the relevant settings. Assuming you used the provided CloudFormation stack, set the following environment variables using the information on the Outputs tab of your stack. Set the Region in the terminal to us-east-1 and set a few other environment variables we’ll need along the way:

export AWS_REGION=us-east-1
export APPLICATION_ID=<YOUR_EMR_SERVERLESS_APPLICATION_ID>
export JOB_ROLE_ARN=<YOUR_EMR_SERVERLESS_JOB_ROLE_ARN>
export S3_BUCKET=<YOUR_S3_BUCKET_NAME>

We use us-east-1 because that’s where the NOAA GSOD data bucket is. EMR Serverless can access S3 buckets and other AWS resources in the same Region by default. To access other services, configure EMR Serverless with VPC access.

Initialize a project

Next, we use the emr init command to initialize a default PySpark project for us in the provided directory. The default templates create a standard Python project that uses pyproject.toml to define its dependencies. In this case, we use Pandas and PyArrow in our script, so those are already pre-populated.

❯ emr init my-project
[emr-cli]: Initializing project in my-project
[emr-cli]: Project initialized.

After the project is initialized, you can run cd my-project or open the my-project directory in your code editor of choice. You should see the following set of files:

my-project  
├── Dockerfile  
├── entrypoint.py  
├── jobs  
│ └── extreme_weather.py  
└── pyproject.toml

Note that we also have a Dockerfile here. This is used by the package command to ensure that our project dependencies are built on the right architecture and operating system for Amazon EMR.

If you use Poetry to manage your Python dependencies, you can also add a --project-type poetry flag to the emr init command to create a Poetry project.

If you already have an existing PySpark project, you can use emr init --dockerfile to create the Dockerfile necessary to package things up.

Run the project

Now that we’ve got our sample project created, we need to package our dependencies, deploy the code to Amazon S3, and start a job on EMR Serverless. With the EMR CLI, you can do all of that in one command. Make sure to run the command from the my-project directory:

emr run \
--entry-point entrypoint.py \
--application-id ${APPLICATION_ID} \
--job-role ${JOB_ROLE_ARN} \
--s3-code-uri s3://${S3_BUCKET}/tmp/emr-cli-demo/ \
--build \
--wait

This command performs several actions:

  1. Auto-detects the type of Spark project in the current directory.
  2. Initiates a build for your project to package up dependencies.
  3. Copies your entry point and resulting build files to Amazon S3.
  4. Starts an EMR Serverless job.
  5. Waits for the job to finish, exiting with an error status if it fails.

You should now see the following output in your terminal as the job begins running in EMR Serverless:

[emr-cli]: Job submitted to EMR Serverless (Job Run ID: 00f8uf1gpdb12r0l)
[emr-cli]: Waiting for job to complete...
[emr-cli]: Job state is now: SCHEDULED
[emr-cli]: Job state is now: RUNNING
[emr-cli]: Job state is now: SUCCESS
[emr-cli]: Job completed successfully!

And that’s it! If you want to run the same code on Amazon EMR on Amazon Elastic Compute Cloud (Amazon EC2), you can replace --application-id with --cluster-id j-11111111. The CLI will take care of sending the right spark-submit commands to your EMR cluster.

Now let’s walk through some of the other commands.

emr package

PySpark projects can be packaged in numerous ways, from a single .py file to a complex Poetry project with various dependencies. The EMR CLI can help consistently package your projects without having to worry about the details.

For example, if you have a single .py file in your project directory, the package command doesn’t need to do anything. If, however, you have multiple .py files in a typical Python project style, the emr package command will zip these files up as a package that can later be uploaded to Amazon S3 and provided to your PySpark job using the --py-files option. If you have third party dependencies defined in pyproject.toml, emr package will create a virtual environment archive and start your EMR job with the spark.archive option.

The EMR CLI also supports Poetry for dependency management and packaging. If you have a Poetry project with a corresponding poetry.lock file, there’s nothing else you need to do. The emr package command will detect your poetry.lock file and automatically build the project using the Poetry Bundle plugin. You can use a Poetry project in two ways:

  • Create a project using the emr init command. The commands take a --project-type poetry option that create a Poetry project for you:
    ❯ emr init --project-type poetry emr-poetry  
    [emr-cli]: Initializing project in emr-poetry  
    [emr-cli]: Project initialized.
    ❯ cd emr-poetry
    ❯ poetry install

  • If you have a pre-existing project, you can use the emr init --dockerfile option, which creates a Dockerfile that is automatically used when you run emr package.

Finally, as noted earlier, the EMR CLI provides you a default Dockerfile based on Amazon Linux 2 that you can use to reliably build package artifacts that are compatible with different EMR environments.

emr deploy

The emr deploy command takes care of copying the necessary artifacts for your project to Amazon S3, so you don’t have to worry about it. Regardless of how the project is packaged, emr deploy will copy the resulting files to your Amazon S3 location of choice.

One use case for this is with CI/CD pipelines. Sometimes you want to deploy a specific version of code to Amazon S3 to be used in your data pipelines. With emr deploy, this is as simple as changing the --s3-code-uri parameter.

For example, let’s assume you’ve already packaged your project using the emr package command. Most CI/CD pipelines allow you to access the git tag. You can use that as part of the emr deploy command to deploy a new version of your artifacts. In GitHub actions, this is github.ref_name, and you can use this in an action to deploy a versioned artifact to Amazon S3. See the following code:

emr deploy \
    --entry-point entrypoint.py \
    --s3-code-uri s3://<BUCKET_NAME>/<PREFIX>/${{github.ref_name}}/

In your downstream jobs, you could then update the location of your entry point files to point to this new location when you’re ready, or you can use the emr run command discussed in the next section.

emr run

Let’s take a quick look at the emr run command. We’ve used it before to package, deploy, and run in one command, but you can also use it to run on already-deployed artifacts. Let’s look at the specific options:

❯ emr run --help                                                                                            
Usage: emr run [OPTIONS]                                                                                    
                                                                                                            
  Run a project on EMR, optionally build and deploy                                                         
                                                                                                            
Options:                                                                                                    
  --application-id TEXT     EMR Serverless Application ID                                                   
  --cluster-id TEXT         EMR on EC2 Cluster ID                                                           
  --entry-point FILE        Python or Jar file for the main entrypoint                                      
  --job-role TEXT           IAM Role ARN to use for the job execution                                       
  --wait                    Wait for job to finish                                                          
  --s3-code-uri TEXT        Where to copy/run code artifacts to/from                                        
  --job-name TEXT           The name of the job                                                             
  --job-args TEXT           Comma-delimited string of arguments to be passed                                
                            to Spark job                                                                    
                                                                                                            
  --spark-submit-opts TEXT  String of spark-submit options                                                  
  --build                   Package and deploy job artifacts                                                
  --show-stdout             Show the stdout of the job after it's finished                                  
  --help                    Show this message and exit.

If you want to run your code on EMR Serverless, the emr run command takes an --application-id and --job-role parameters. If you want to run on EMR on EC2, you only need the --cluster-id option.

Required for both options are --entry-point and --s3-code-uri. --entry-point is the main script that will be called by Amazon EMR. If you have any dependencies, --s3-code-uri is where they get uploaded to using the emr deploy command, and the EMR CLI will build the relevant spark-submit properties pointing to these artifacts.

There are a few different ways to customize the job:

  • –job-name – Allows you to specify the job or step name
  • –job-args – Allows you to provide command line arguments to your script
  • –spark-submit-opts – Allows you to add additional spark-submit options like --conf spark.jars or others
  • –show-stdout – Currently only works with single-file .py jobs on EMR on EC2, but will display stdout in your terminal after the job is complete

As we’ve seen before, --build invokes both the package and deploy commands. This makes it easier to iterate on local development when your code still needs to run remotely. You can simply use the same emr run command over and over again to build, deploy, and run your code in your environment of choice.

Future updates

The EMR CLI is under active development. Updates are currently in progress to support Amazon EMR on EKS and allow for the creation of local development environments to make local iteration of Spark jobs even easier. Feel free to contribute to the project in the GitHub repository.

Clean up

To avoid incurring future charges, stop or delete your EMR Serverless application. If you used the CloudFormation template, be sure to delete your stack.

Conclusion

With the release of the EMR CLI, we’ve made it easier for you to deploy and run Spark jobs on EMR Serverless. The utility is available as open source on GitHub. We’re planning a host of new functionalities; if there are specific requests you have, feel free to file an issue or open a pull request!


About the author

Damon is a Principal Developer Advocate on the EMR team at AWS. He’s worked with data and analytics pipelines for over 10 years and splits his team between splitting service logs and stacking firewood.

Announcing Amazon EMR Serverless (Preview): Run big data applications without managing servers

Post Syndicated from Damon Cortesi original https://aws.amazon.com/blogs/big-data/announcing-amazon-emr-serverless-preview-run-big-data-applications-without-managing-servers/

Today we’re happy to announce Amazon EMR Serverless, a new option in Amazon EMR that makes it easy and cost-effective for data engineers and analysts to run petabyte-scale data analytics in the cloud. With EMR Serverless, you can run applications built using open-source frameworks such as Apache Spark, Hive, and Presto, without having to configure, manage, optimize, or secure clusters. EMR Serverless automatically provisions and scales the compute and memory resources required by your applications, and you only pay for the resources that your applications use.

In this post, we discuss the benefits of EMR Serverless, walk you through the core concepts of EMR Serverless and how you can use it, and show you a quick demo.

Overview of EMR Serverless

Tens of thousands of customers use Amazon EMR, a managed service for running open-source analytics frameworks such as Apache Spark, Hive, and Presto, for large-scale data analytics applications. With Amazon EMR, you can provision clusters of any size in minutes. Amazon EMR automatically installs and configures the frameworks you choose, and provides a performance-optimized runtime that is compatible with and over twice as fast as standard open-source.

Amazon EMR customers have full control over cluster configuration. The ability to customize clusters allows you to optimize for cost and performance based on workload requirements. For example, you can use Amazon Elastic Compute Cloud (Amazon EC2) memory optimized instances to run SQL workloads with low latency, or use the EC2 Graviton2-based instances to improve performance. You can also use EC2 Spot Instances, which are integrated in Amazon EMR so that you can take advantage of unused EC2 capacity in the AWS Cloud to obtain instances at up to a 90% discount compared to On-Demand prices. If you run your applications on Kubernetes, you can use Amazon EMR on Amazon EKS to run your Amazon EMR analytics applications on Amazon Elastic Kubernetes Service (Amazon EKS) clusters.

However, tuning clusters for optimal cost and performance requires engineers to have deep knowledge of the underlying analytics frameworks. Furthermore, the specific compute and memory resources needed to optimally run applications depend on various factors, such as the schedule and complexity of data processing jobs and the volume of data being processed. When these characteristics change over time, you need to reevaluate and reconfigure clusters. In addition, administrators have to secure and monitor the clusters to ensure that they’re compliant with corporate security policies, and adjust security settings each time the cluster is reconfigured. Many customers don’t need this level of customization and control, and want a simpler way to process data using open-source frameworks on the Amazon EMR performance-optimized runtime.

With this in mind, we built EMR Serverless. With EMR Serverless, you can get all the benefits of running Amazon EMR, but with a serverless environment. We had the following goals in mind when we built EMR Serverless:

  • Provide a simpler experience – EMR Serverless is simple to use because you don’t have to configure, optimize, operate, or secure clusters. You don’t have to worry about instance types or cluster sizes, or about applying OS patches. You simply specify the framework and version that you want to use for your application, and submit your data processing jobs. You still get all the benefits that you expect out of Amazon EMR—open-source compatibility, open-source version currency, and performance-optimized runtime—but without the need to manage clusters.
  • No need to guess cluster sizes – EMR Serverless eliminates the need to right-size clusters for varying jobs and data sizes. With EMR Serverless, you create an application using an open-source framework version, and submit jobs to the application. EMR Serverless automatically adds and removes workers at different stages of processing your job. As a result, you don’t have to reconfigure when data volumes change, and you only pay for what your jobs require. You can control costs by specifying the minimum and maximum number of concurrent workers, and the VCPU and memory per worker.
  • Retain Amazon EMR’s performance-optimized runtime and open-source currency – EMR Serverless includes the Amazon EMR performance-optimized runtime for Apache Spark, Hive, and Presto. The Amazon EMR runtime is API-compatible and over twice as fast as standard open-source, so your jobs run faster and incur less compute costs.
  • Seamless integration with EMR Studio – EMR Serverless includes EMR Studio, which provides fully managed serverless Jupyter Notebooks and familiar open-source tools such as Spark UI and Tez UI to help you develop, visualize, and debug your applications.
  • Automatic and fine-grained scaling – EMR Serverless automatically scales up workers at each stage of processing your job and scales them down when they’re not required. You’re charged for aggregate vCPU, memory, and storage resources used from the time a worker starts running until it stops, rounded up to the nearest second with a 1-minute minimum. For example, your job may require 10 workers for the first 10 minutes of processing the job, and 50 workers for the next 5 minutes. With fine-grained automatic scaling, you only incur cost for 10 workers for 10 minutes and 50 workers for 5 minutes. As a result, you don’t have to pay for underutilized resources.
  • Resilience to Availability Zone failures – EMR Serverless is a Regional service. When you submit jobs to an EMR Serverless application, it can run in any Availability Zone in the Region. A job is run in a single Availability Zone to avoid performance implications of network traffic across Availability Zones. In case an Availability Zone is impaired, a job submitted to your EMR Serverless application is automatically run in a different (healthy) Availability Zone. When using resources in a private VPC, EMR Serverless recommends you specify the private VPC configuration for multiple Availability Zones so that EMR Serverless can automatically select a healthy Availability Zone.
  • Enable shared applications – When you submit jobs to an EMR Serverless application, you can specify the AWS Identity and Access Management (IAM) role that must be used by the job to access AWS resources such as Amazon Simple Storage Service (Amazon S3) objects. As a result, different IAM principals can run jobs on a single EMR Serverless application, and each job can only access the AWS resources that the IAM principal is allowed to access. This enables you to set up scenarios where a single application with a pre-initialized pool of workers is made available to multiple tenants wherein each tenant can submit jobs using a different IAM role but use the common pool of pre-initialized workers to immediately process requests.
  • Enable interactive applications – Interactive applications that allow data scientists and analysts to run interactive SQL queries for data exploration require a fast response time to user requests. For such interactive applications, EMR Serverless allows you to pre-initialize a pool of workers. You can start your EMR Serverless application and pre-initialize the pool of workers as soon as a user starts the application, and stop the application to stop workers when no interactive users are active. If processing user requests requires more workers than what have been pre-initialized, EMR Serverless automatically adds more workers up to the maximum concurrent limits that you specify. Therefore, by controlling the number of workers to pre-initialize and the maximum concurrent workers, you can optimize user experience and cost for your interactive applications.
  • Make it easy to switch from one deployment model to another – The same Amazon EMR releases are provided for applications using EMR clusters, Amazon EMR on EKS, and EMR Serverless. When you build an application using an Amazon EMR release (for example a Spark job using Amazon EMR release 6.4), you can choose to run it on an EMR cluster, Amazon EMR on EKS, or EMR Serverless without having to rewrite the application. This allows you to build applications for a given framework version, and retain the flexibility to change the deployment model based on future operational needs.

Core concepts

In this section, we discuss the core concepts in EMR Serverless: applications, jobs, workers, and pre-initialized workers.

Application

With EMR Serverless, you can create one or more applications that use open-source analytics frameworks. To create an application, you specify the open-source framework that you want to use (for example, Apache Spark or Apache Hive), the Amazon EMR release for the open-source framework version (for example, Amazon EMR release 6.4, which corresponds to Apache Spark 3.1.2), and a name for your application. After you create an application, you can submit data processing jobs or interactive requests to your application.

The following are a few examples where you may want to create multiple applications:

  • To use different open-source frameworks (for example, Hive or Spark)
  • To use different versions of open-source frameworks for different use cases (for example, use a newer version of Spark for a new application without having to upgrade older applications)
  • To perform A/B testing when upgrading from one version to another (for example, migrating from Spark 2.4 to Spark 3.1)
  • To maintain separate logical environments for test and production scenarios
  • To provide separate logical environments for different teams with independent cost controls and usage tracking
  • To logically separate different line-of-business applications (for example, finance vs. marketing)

Job

A job is a request submitted to an EMR Serverless application that is asynchronously run and tracked through completion. You can run multiple jobs concurrently in an application.

Workers

An EMR Serverless application internally uses workers to run your jobs. By default, each application uses workers with 4 VCPU, 30 memory, and 20 GB of local storage per worker. You have the ability to customize this configuration.

Pre-initialized workers

EMR Serverless provides an optional feature to pre-initialize workers when your application starts up, so that the workers are ready to process requests immediately when a job is submitted to the application. Pre-initialized workers allow you to maintain a warm pool of workers for the application so that it can provide a sub-second response to start processing requests.

Common usage patterns applied to EMR Serverless

Now let’s examine some common usage scenarios and how EMR Serverless provides you a simple solution.

Pattern #1: Data pipelines

Data pipelines are the backbone of your analytics workloads. A common pattern with data pipelines is to start a cluster, run a job, and stop the cluster when the job is complete. Because data is separated from compute, the inputs and outputs for each job are persisted separately from the cluster (for example, in Amazon S3). These steps are frequently automated using workflow orchestration applications such as Apache Airflow. You can also use AWS services such as AWS Step Functions and AWS Managed Workflows for Apache Airflow (Amazon MWAA) to create such workflows.

Although automating these steps isn’t complex, data engineers have to spend time determining the appropriate EC2 instance and cluster size. They have to determine the Availability Zone where the cluster is run, and handle failover. They have to test their applications when adopting OS updates. When data sizes change over time, they have to resize clusters, or use features like Amazon EMR managed scaling that automatically resize clusters. EMR Serverless provides a simpler solution by eliminating the need for you to handle these scenarios. You simply choose the open-source framework and version for your application, and submit jobs. You don’t have to worry about instance selection, cluster sizes, cluster startup, cluster resize, stopping nodes, Availability Zone failover, or OS updates.

Pattern #2: Shared clusters

Another common pattern is for teams to use a shared long-running cluster to run multiple jobs. In this case, engineers implement queues in Apache YARN for different workloads on a common cluster, and set up rules to automatically scale the cluster up or down based on overall workload. With Amazon EMR on EC2 clusters, you can use Amazon EMR managed scaling, a feature that automatically scales clusters up or down depending on the workload. With EMR Serverless, workers are assigned to each job when required, so your jobs get the resources they need. Moreover, because you only pay for the workers that your jobs require, you don’t incur cost for over-provisioned resources. Finally, because each job can specify the IAM role that should be used to access AWS resources when running the job, you don’t have to set up complex configurations to manage queues and permissions.

Pattern #3: Interactive workloads

A third pattern of use is when teams keep a cluster of instances available to support interactive analysis. In this case, the cluster is set up and initialized with applications that wait for interactive user requests. Applications are pre-initialized so that they can immediately start processing user requests and provide an interactive user experience. EMR Serverless enables this scenario without requiring you to manage clusters. You can specify the number of workers that you want to pre-initialize when you start an EMR Serverless application. Subsequently, when users submit requests, the pre-initialized workers can be used to immediately process user requests. If processing the user requests requires more workers than what you have chosen to pre-initialize, EMR Serverless automatically adds more workers (up to the maximum concurrent limit that you specify). When the requests are processed, EMR Serverless automatically reverts back to maintaining the pre-initialized workers that you specified. You can control when the pre-initialized workers start by controlling when to start and stop your EMR Serverless application. For example, you can start your application when users begin interactive analysis and turn it off when there are no user requests and the application remains idle.

Demo

Conclusion

In this post, we discussed the core concepts and common usage patterns of EMR Serverless, and showed you a quick demo. EMR Serverless is in Preview, in which you can run workloads using Spark 3.1.2 and Hive 2.0 using the API, AWS Command Line Interface (AWS CLI), and SDK. Sign up for now, and for more information, see EMR Serverless documentation.


About the Authors

Damon Cortesi is a Principal Developer Advocate with Amazon Web Services.

Mehul Y. Shah is the GM for Amazon EMR.

Abhishek Sinha is a Principal Product Manager at Amazon Web Services.