All posts by Tanzir Musabbir

Build a self-service environment for each line of business using Amazon EMR and AWS Service Catalog

Post Syndicated from Tanzir Musabbir original https://aws.amazon.com/blogs/big-data/build-a-self-service-environment-for-each-line-of-business-using-amazon-emr-and-aws-service-catalog/

Enterprises often want to centralize governance and compliance requirements, and provide a common set of policies on how Amazon EMR instances should be set up. You can use AWS Service Catalog to centrally manage commonly deployed Amazon EMR cluster configurations, and this helps you achieve consistent governance and meet your compliance requirements, while at the same time enabling your end users to quickly deploy only the approved EMR cluster configurations on a self-service basis.

In this post, we demonstrate how enterprise administrators can use AWS Service Catalog to create and manage catalogs that data engineers and data scientists use to quickly discover and deploy clusters in a self-service environment. With AWS Service Catalog, you can control which EMR release versions are available, the cluster configuration, and access permissions by individual, group, department, or cost center.

The following are a few key AWS Service Catalog concepts:

  • An AWS Service Catalog product is a blueprint for building the AWS resources that you want available for deployment. You create your products by importing AWS CloudFormation templates.
  • A portfolio is a collection of products. With AWS Service Catalog, you can create a customized portfolio for each type of user in your organization and selectively grant access to the appropriate portfolio.
  • A provisioned product is a collection of resources that result from instantiating an AWS CloudFormation template.

Use cases

You can use AWS Service Catalog to provide Amazon EMR as a self-serve Extract, Transform, Load (ETL) platform at scale while hiding all the security and network configurations from end users.

As an administrator in AWS Service Catalog, you can create one or more Service Catalog products that define different configurations to be used for EMR clusters. In those products, you can define the security and network configurations for the EMR cluster, define auto-scaling rules, instance configurations, and purchase options, or preconfigure EMR to run different EMR step jobs. As a user in Service Catalog, you can browse through different EMR templates exposed as Service Catalog products and provision a product based on your requirements. By following this approach, you can make your EMR usage self-serviceable, reduce the EMR learning curve for your users, and ensure adherence to security standards and best practices.

The following image illustrates how the interactions look between Amazon EMR administrators and end-users when using AWS Service Catalog to provision EMR clusters.

The use cases in this post have three AWS Identity and Access Management (IAM) users with different access permissions:

  • emr-admin: This user is the administrator and has access to all the resources. This user creates EMR clusters for their end-users based on their requirements.
  • emr-data-engineer: The data engineer uses Spark and Hive most of the time. They run different ETL scripts on Hive and Spark to process, transform, and enrich their datasets.
  • emr-data-analyst: This user is very familiar with SQL and mostly uses Hue to submit queries to Hive.

You can solve several Amazon EMR operational use cases using AWS Service Catalog. The following sections discuss three different use cases. Later in this post, you walk through each of the use cases with a solution.

Use case 1: Ensuring least privilege and appropriate access

The administrator wants to enforce a few organizational standards. First, no EMR cluster should use the default EMR_EC2_DefaultRole. Instead, the administrator wants a role that has limited access to Amazon Simple Storage Service (Amazon S3) and wants that role assigned automatically every time an EMR cluster is launched. Second, end users sometimes forget to add appropriate tags to their resources, which makes it hard for the administrator to identify those resources and allocate costs appropriately. The administrator therefore wants a mechanism that assigns tags to EMR clusters automatically when they launch.

Use case 2: Providing Amazon EMR as a self-serve ETL platform with Spark and Hive

Data engineers use Spark and Hive applications, and they prefer to have a platform where they just submit their jobs without spending time creating the cluster. They also want to try out different Amazon EMR versions to see how their jobs run on different Spark or Hive versions. They don’t want to spend time learning AWS or Amazon EMR. Additionally, the administrator doesn’t want to give full Amazon EMR access to all users.

Use case 3: Automatically scaling the Hive cluster for analysts

Data analysts have strong SQL backgrounds, so they typically use Hue to submit their Hive queries. They run queries against a large dataset, so they want to have an EMR cluster that can scale when needed. They also don’t have access to the Amazon EMR console and don’t know how to configure automatic scaling for Amazon EMR.

Solution overview

With AWS Service Catalog, you can self-serve your Amazon EMR users, enforce best practices and compliance, and speed up the adoption process.

At a high level, the solution includes the following steps:

  1. Configuring the AWS environment to run this solution.
  2. Creating a CloudFormation template.
  3. Setting up AWS Service Catalog products and portfolios.
  4. Managing access to AWS Service Catalog and provisioning products.
  5. Demonstrating the self-service Amazon EMR platform for users.
  6. Enforcing best practices and compliance through AWS Service Catalog.
  7. Executing ETL workloads on Amazon EMR using AWS Service Catalog.
  8. Optionally, setting up AWS Service Catalog and launching Amazon EMR products through the AWS Command Line Interface (AWS CLI).

The following section looks at the CloudFormation template, which you use to set up the AWS environment to run this solution.

Setting up the AWS environment

To set up this solution, you need to create a few AWS resources. The CloudFormation template provided in this post creates all the required AWS resources. This template requires you to pass the following parameters during the launch:

  • A password for your test users.
  • An Amazon Elastic Compute Cloud (Amazon EC2) key pair.
  • The latest AMI ID for the EC2 helper instance. This instance configures the environment and sets up the required files and templates for this solution.

This template is designed only to show how you can use Amazon EMR with AWS Service Catalog. This setup isn’t intended for production use without modification.

To launch the CloudFormation stack, choose Launch Stack:
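Alternatively, you can launch the same stack from the AWS CLI. The following call is only a sketch: the template URL and the parameter key names (UserPassword, KeyName, LatestAmiId) are assumptions, so check the template behind the Launch Stack button for the exact names before running it.

# Hypothetical CLI launch; the template URL and parameter key names are assumptions.
aws cloudformation create-stack \
  --stack-name Blog-EMR-Service-Catalog \
  --template-url https://<template-bucket>.s3.amazonaws.com/blog-emr-service-catalog.template \
  --capabilities CAPABILITY_NAMED_IAM \
  --parameters ParameterKey=UserPassword,ParameterValue=<test-user-password> \
               ParameterKey=KeyName,ParameterValue=<ec2-key-pair> \
               ParameterKey=LatestAmiId,ParameterValue=<latest-ami-id>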

Launching this stack creates several AWS resources. The following resources shown in the AWS CloudFormation output are the ones you need in the next step:

  • ConsoleLoginURL – URL you use to switch between multiple users
  • EMRSCBlogBucket – Name of the S3 bucket to store blog-related files
  • UserPassword – Password to use for all the test users
  • DataAdminUsername – IAM user name for the administrator user
  • DataEngineerUsername – IAM user name for the data engineer user
  • DataAnalystUsername – IAM user name for the data analyst user
  • HiveScriptURL – Amazon S3 path for the Hive script
  • HiveETLInputParameter – Path for the Hive input parameter
  • HiveETLOutputParameter – Path for the Hive output parameter
  • SparkScriptURL – Amazon S3 path for the Spark script
  • SparkETLInputParameter – Path for the Spark input parameter
  • SparkETLOutputParameter – Path for the Spark output parameter

When the CloudFormation template is complete, record the outputs listed on the Outputs tab on the AWS CloudFormation console. See the following screenshot.

(Optional) Configuring the AWS CLI

The AWS CLI is a unified tool to manage your AWS services. In the optional step, you use the AWS CLI to create AWS Service Catalog products and portfolios. Installation of AWS CLI isn’t required for this solution. For instructions on configuring the AWS CLI in your environment, see Configuring the AWS CLI.

Provisioning EMR clusters through AWS Service Catalog

You can create AWS Service Catalog products from the existing CloudFormation template and use those products to provision a variety of EMR clusters. You can create an EMR cluster and consume the cluster’s services without having access to the cluster, which improves the Amazon EMR adoption process.

The following CloudFormation template creates an EMR cluster. This template takes two parameters:

  • Cluster size – You select how many core nodes you want to have in the EMR cluster
  • Compute type – Based on the compute type you choose, the template selects the corresponding EC2 instance type

As an account administrator, you can define the internal configuration for the EMR cluster. End users don't need to know the security groups, subnet ID, key pair, or other details, and they don't need to access the EMR cluster or spend time setting it up. As an administrator, you define a template for the cluster; enforce compliance, versions, applications, and automatic scaling rules through the CloudFormation template; and expose this template as a product through AWS Service Catalog.
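To make that concrete, the following AWS CLI call is a rough, hand-written equivalent of the kind of cluster such a product provisions. It is a sketch only: the instance profile name, subnet ID, key pair, and tag values are placeholders, not values from the solution's actual template.

# Rough CLI equivalent of what the product's CloudFormation template provisions.
# Blog-EMR-LimitedS3-Profile, the subnet ID, the key pair, and the tag values are placeholders.
# --instance-count 3 gives one master and two core nodes (the product's Cluster size parameter).
aws emr create-cluster \
  --name "Sample EMR" \
  --release-label emr-5.29.0 \
  --applications Name=Spark Name=Hive Name=Hue \
  --instance-type m5.xlarge \
  --instance-count 3 \
  --ec2-attributes InstanceProfile=Blog-EMR-LimitedS3-Profile,SubnetId=subnet-xxxxxxxx,KeyName=<key-pair> \
  --service-role EMR_DefaultRole \
  --tags Team=DataEngineering CostCenter=1234

In the actual product, the Compute type parameter maps to the instance type and the Cluster size parameter sets the number of core nodes, so those are the only two inputs end users see.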

The following section walks you through the solution for each use case.

Use cases walkthrough

The CloudFormation template already configured AWS Service Catalog portfolios and products. You can review these on the AWS Service Catalog console.

  1. Use the ConsoleLoginURL from the AWS CloudFormation console Outputs tab and sign in as an emr-admin user.
  2. On the AWS Service Catalog console, you can see two portfolios for engineers and analysts. In each of those portfolios, you can see two products.

The Data Analysts Stack contains products for the analyst and is assigned to the user emr-data-analyst. The Data Engineering Stack contains products for engineers and is assigned to the emr-data-engineer user. Upon logging in, they can see their respective products and portfolios.

Use case 1: Ensuring least privilege and appropriate access

The cluster administrator creates a least-privilege IAM role for their users and associates that role through the AWS Service Catalog product. Similarly, the administrator also assigns appropriate tags to each product. When data engineers or analysts launch an EMR cluster using any of their assigned products, the cluster has least-privilege access and its resources are tagged automatically. To confirm this access is in place, complete the following steps:

  1. Sign in to the AWS Management Console as either emr-data-engineer user or emr-data-analyst user.

Your console looks slightly different because end users don't manage the products; they just use them to launch clusters or run jobs on a cluster.

  2. Choose Default EMR and provision this product by choosing Launch Product.
  3. For the name of the provisioned product, enter SampleEMR.

The next screen shows a list of allowed parameters your administrator thinks you may need.

  4. Leave all parameters as default.
  5. For the cluster name, enter Sample EMR.
  6. Review all the information and launch the product.

It takes a few minutes to spin up the cluster. When the cluster is ready, the status changes to Succeeded. The provisioned product page also shows you a list of outputs your product owner wants you to see. For example, using output values, your product owner can share the Master DNS Address, Resource Manager URL, and Hue URL, as shown in the following figure.

To verify that this launched EMR cluster has the expected IAM role and tags, sign in as the emr-admin user and go to the Amazon EMR console to review the service role for EC2 instances and the tags.
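You can also check the same details from the AWS CLI with a describe-cluster query; the cluster ID below is a placeholder.

# Show the instance profile and tags of the provisioned cluster (cluster ID is a placeholder).
aws emr describe-cluster --cluster-id j-XXXXXXXXXXXXX \
  --query '{InstanceProfile: Cluster.Ec2InstanceAttributes.IamInstanceProfile, Tags: Cluster.Tags}'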

Use case 2: Providing Amazon EMR as a self-serve ETL platform with Spark and Hive

For this use case, data engineers have two different ETL scripts:

  • A Spark script that reads Amazon reviews stored in Amazon S3 and converts them into Parquet before writing back to Amazon S3
  • A Hive script that reads the Amazon reviews data from Amazon S3 and finds the top toys based on customer ratings.

The administrator creates a product to self-serve these users; the product defines the job type and the job parameters. End users select the job type and pass the script, input, and output locations.

  1. Sign in as emr-data-engineer.
  2. Select the EMR ETL Engine product.
  3. Choose Launch.

The next page shows the available versions when a product has multiple versions. Because the engineer wants to try out two different Amazon EMR versions, the administrator provided both options through product versions. You can launch the EMR cluster with the required version by selecting your preferred product version.

  4. Enter the name of the product.
  5. For this post, select EMR 5.29.0.
  6. Choose Next.

  7. For JobType, choose Spark.
  8. For JobArtifacts, enter the following value (you can get these values from the AWS CloudFormation output):
s3://blog-emr-sc-<account-id>/scripts/spark_converter.py s3://amazon-reviews-pds/tsv/amazon_reviews_us_Toys_v1_00.tsv.gz s3://blog-emr-sc-<account-id>/spark/
  9. Choose Next.

Based on your configuration, an EMR cluster launches. When the cluster is ready, the Spark job runs.
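Under the hood, the product submits the job to the cluster as an EMR step (you can see it later on the cluster's Steps tab). A hand-written CLI equivalent of that step would look roughly like the following; the cluster ID is a placeholder, and the actual product wires this in through its CloudFormation template rather than the CLI.

# Roughly the Spark step the product adds; the cluster ID is a placeholder.
aws emr add-steps --cluster-id j-XXXXXXXXXXXXX \
  --steps Type=Spark,Name="Spark ETL",ActionOnFailure=CONTINUE,Args=[s3://blog-emr-sc-<account-id>/scripts/spark_converter.py,s3://amazon-reviews-pds/tsv/amazon_reviews_us_Toys_v1_00.tsv.gz,s3://blog-emr-sc-<account-id>/spark/]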

  10. In a different browser, sign in as emr-admin using the ConsoleLoginURL (from the AWS CloudFormation output).

You can see the cluster status, job status, and output path from the Amazon EMR console.

Now, go to the Amazon S3 console to check the output path:

The Parquet files are written inside the Spark folder.

  11. To test the Hive job, go back to the first browser where you already signed in as emr-data-engineer.
  12. Choose Provisioned products list.
  13. Choose the product options menu (right-click) and choose Update provisioned product.

  14. On the next page, you can select a different version or the same version.
  15. In the Parameters section, choose Hive.
  16. In the JobArtifacts field, enter the following Hive parameters:
s3://blog-emr-sc-<account-id>/scripts/hive_converter.sql -d INPUT=s3://amazon-reviews-pds/tsv/ -d OUTPUT=s3://blog-emr-sc-<account-id>/hive/
  17. Choose Update.

If you select the same version, AWS Service Catalog compares the old provisioned product with the updated product and only runs the portion that you changed. For this post, I chose the same Amazon EMR version and only updated the job type and parameters. You can see that the same EMR cluster is still there, but on the Steps tab, a new step is executed for Hive.

  18. On the Amazon S3 console using the second browser, verify that a new folder hive is created with data that represents top toys based on Amazon reviews.

To recap, you saw how to use AWS Service Catalog to provide a product to run your ETL jobs. Your data engineers can focus on their ETL scripts and your platform can self-serve them to run their ETL jobs on the EMR cluster.

Use case 3: Automatically scaling the Hive cluster for data analysts

To automatically scale the Hive cluster for data analysts, complete the following steps:

  1. Use the console login URL from the AWS CloudFormation output to sign in as emr-data-analyst, and go to the AWS Service Catalog console.

You can see a different set of products for this user.

For this use case, your data analysts want to have an automatically scaling EMR cluster with Hive application. The administrator set up the Auto-scaling EMR product with preconfigured rules.

  1. Choose Auto-scaling EMR.
  2. Enter a provisioned product name.
  3. Select Hive Auto-scaling.
  4. Choose Next.
  5. In the Parameters section, leave the options at their default and enter a cluster name.
  6. Launch the product.

The product owner also provided a client URL (for example, Hue URL) through the product output so business analysts can connect to it.

  7. Sign in as emr-admin and validate that this new cluster is configured with the expected automatic scaling rules.
  8. On the Amazon EMR console, choose the cluster.

You can see the configuration on the Hardware tab.
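For reference, an EMR automatic scaling rule of the kind the administrator preconfigures looks roughly like the following when expressed through the CLI; the cluster ID, instance group ID, and thresholds here are illustrative assumptions rather than the solution's actual values. In this solution, the rule lives inside the product's CloudFormation template, so analysts never see it.

# Illustrative scale-out rule on available YARN memory; IDs and thresholds are placeholders.
aws emr put-auto-scaling-policy --cluster-id j-XXXXXXXXXXXXX --instance-group-id ig-XXXXXXXXXXXX \
  --auto-scaling-policy '{
    "Constraints": {"MinCapacity": 2, "MaxCapacity": 10},
    "Rules": [{
      "Name": "ScaleOutOnLowYARNMemory",
      "Action": {"SimpleScalingPolicyConfiguration": {"AdjustmentType": "CHANGE_IN_CAPACITY", "ScalingAdjustment": 1, "CoolDown": 300}},
      "Trigger": {"CloudWatchAlarmDefinition": {"ComparisonOperator": "LESS_THAN", "EvaluationPeriods": 1, "MetricName": "YARNMemoryAvailablePercentage", "Namespace": "AWS/ElasticMapReduce", "Period": 300, "Statistic": "AVERAGE", "Threshold": 15.0, "Unit": "PERCENT"}}
    }]
  }'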

In this use case, you learned how to use AWS Service Catalog to provide business analyst users a preconfigured, automatically scaled EMR cluster.

(Optional) Setting up AWS Service Catalog for Amazon EMR using AWS CLI

In the previous section, I demonstrated the solution using the AWS Service Catalog console. In the following section, I show you how to use AWS Service Catalog from the AWS CLI: you create AWS Service Catalog products and portfolios, assign IAM principals, and launch products.

  1. Create a portfolio named CLI – Stack for the user emr-admin. See the following command:
aws --region us-east-1 servicecatalog create-portfolio --display-name "CLI - Stack" --provider-name "@emr-admin" --description "Sample stack for pre-defined EMR clusters"

You receive a JSON output.

  2. Record the portfolio id port-xxxxxxxx from the output to use later.

The emr-admin user is the provider for this portfolio. The user is created with power user access, so they can see the full AWS Service Catalog console and can manage products and portfolios.

You can associate this portfolio with multiple users. By assigning them to a portfolio, they can use the portfolio, browse through its products, and provision new products. For this use case, you associate the portfolio with emr-admin and with the AWS CLI user name (the name of the user that you used to configure your AWS CLI). Make sure to update the portfolio ID and AWS account ID in the following commands.

  3. Enter the following code:
aws --region us-east-1 servicecatalog associate-principal-with-portfolio --portfolio-id port-xxxxxxxxxx --principal-type IAM --principal-arn arn:aws:iam::xxxxx:user/emr-admin

aws --region us-east-1 servicecatalog associate-principal-with-portfolio --portfolio-id port-xxxxxxxxxx --principal-type IAM --principal-arn arn:aws:iam::xxxxx:user/<aws-cli-user-name>
  4. To verify the portfolio-to-user association, enter the following command with the portfolio ID:
aws --region us-east-1 servicecatalog list-principals-for-portfolio --portfolio-id port-xxxxxxxxx

This lists the associated principals for the portfolio, as shown in the following figure:

The CloudFormation template already copied the Amazon EMR template into your Amazon S3 account at the path s3://blog-emr-sc-<account-id>/products.

  5. To create the product CLI - Sample EMR using that template from Amazon S3, enter the following command:
aws --region us-east-1 servicecatalog create-product --name "CLI - Sample EMR" --owner "@emr-admin" --description "Sample EMR cluster with default" --product-type CLOUD_FORMATION_TEMPLATE --provisioning-artifact-parameters '{"Name": "Initial revision", "Description": "", "Info":{"LoadTemplateFromURL":"https://s3.amazonaws.com/blog-emr-sc-<account-id>/products/sample-cluster.template"},"Type":"CLOUD_FORMATION_TEMPLATE"}'

  6. Record the product ID and provisioning artifact ID from the JSON output.

You now have a product and a portfolio. A portfolio can have one to many products, and each product can have multiple versions.
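If you later want to offer a second Amazon EMR version through this same product (the way the Data Engineering Stack does), you can add it as another provisioning artifact. The version name and template URL below are placeholders.

# Add another version (provisioning artifact) to the product; the name and template URL are placeholders.
aws --region us-east-1 servicecatalog create-provisioning-artifact --product-id prod-xxxxxx \
  --parameters '{"Name": "EMR 5.29.0", "Description": "Same cluster on a newer EMR release", "Info": {"LoadTemplateFromURL": "https://s3.amazonaws.com/blog-emr-sc-<account-id>/products/sample-cluster-5-29.template"}, "Type": "CLOUD_FORMATION_TEMPLATE"}'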

  7. To assign the CLI - Sample EMR product to the portfolio you created in Step 1, enter the following command:
aws --region us-east-1 servicecatalog associate-product-with-portfolio --product-id prod-xxxxxx --portfolio-id port-xxxxxx

A launch constraint specifies the IAM role that AWS Service Catalog assumes when an end-user launches a product. With a launch constraint, you can control end-user access to your AWS resources and limit usage.

The CloudFormation template already created the role Blog-SCLaunchRole; create a launch constraint using that IAM role. Use the portfolio and product IDs that you collected from the previous step and your AWS account ID.

  8. To create the launch constraint, enter the following command:
aws --region us-east-1 servicecatalog create-constraint --type LAUNCH --portfolio-id port-xxxxxx --product-id prod-xxxxxx --parameters '{"RoleArn" : "arn:aws:iam::<account-id>:role/Blog-SCLaunchRole"}'

  9. Record the launch constraint ID to use later.

You now have an AWS Service Catalog product that you can use to provision an EMR cluster. The CloudFormation template that you used to create the CLI - Sample EMR product takes three parameters (ClusterName, ComputeRequirements, ClusterSize).

  10. To pass those three parameters as key-value pairs, enter the following command (use the product ID and provisioning artifact ID that you recorded earlier):
aws --region us-east-1 servicecatalog provision-product --product-id prod-xxxxxx --provisioning-artifact-id pa-xxxxx --provisioned-product-name cli-emr --provisioning-parameters Key=ClusterName,Value=cli-emr-cluster Key=ComputeRequirements,Value=CPU Key=ClusterSize,Value=2

  11. Check the provisioned product's status by using the provisioned product ID:
aws --region us-east-1 servicecatalog describe-provisioned-product --id pp-xxxxx

To recap, in this section you learned how to use the AWS Service Catalog CLI to configure AWS Service Catalog products and portfolios, and how to provision an EMR cluster through an AWS Service Catalog product.

Cleaning up

To clean up the resources you created, complete the following steps:

  1. Terminate the product that you provisioned in the previous step:
aws --region us-east-1 servicecatalog terminate-provisioned-product --provisioned-product-id pp-xxxxx
  2. Disassociate the product CLI - Sample EMR from the portfolio CLI - Stack:
aws --region us-east-1 servicecatalog disassociate-product-from-portfolio --product-id prod-xxxxx --portfolio-id port-xxxxx
  3. Disassociate IAM principals from the portfolio CLI - Stack:
aws --region us-east-1 servicecatalog disassociate-principal-from-portfolio --portfolio-id port-xxxxx --principal-arn arn:aws:iam::xxxxxx:user/emr-admin

aws --region us-east-1 servicecatalog disassociate-principal-from-portfolio --portfolio-id port-xxxxx --principal-arn arn:aws:iam::xxxxxx:user/<aws-cli-user-name> 
  4. Delete the launch constraint created in the previous step:
aws --region us-east-1 servicecatalog delete-constraint --id cons-xxxxx
  5. Delete the product CLI - Sample EMR:
aws --region us-east-1 servicecatalog delete-product --id prod-xxxxx
  6. Delete the portfolio CLI - Stack:
aws --region us-east-1 servicecatalog delete-portfolio --id port-xxxxx

Cleaning up additional resources

You must also clean up the resources you created with the CloudFormation template.

  1. On the AWS Service Catalog console, choose Provisioned products list.
  2. Terminate each product that you provisioned for these use cases.
  3. Check each of the users and their provisioned products to make sure they’re terminated.
  4. On the Amazon S3 console, empty the bucket blog-emr-sc-<account-id>.
  5. If you are using the AWS CLI, delete the objects in the blog-emr-sc-<account-id> bucket with the following command (make sure you’re running this command on the correct bucket):
aws s3 rm s3://blog-emr-sc-<account-id> --recursive
  6. If you ran the optional AWS CLI section, make sure you follow the cleanup process mentioned in that section.
  7. On the AWS CloudFormation console or AWS CLI, delete the stack named Blog-EMR-Service-Catalog.

Next steps

To enhance this solution, you can explore the following options:

  • In this post, I enforced resource tagging through AWS CloudFormation. You can also use the AWS Service Catalog TagOptions library to provide a consistent taxonomy and tagging of AWS Service Catalog resources, as sketched after this list. During a product launch (provisioning), AWS Service Catalog aggregates the associated portfolio and product TagOptions and applies them to the provisioned product.
  • This solution demonstrates the usage of launch constraints and how you can provide limited access to your AWS resources to your users. You can also use template constraints to manage parameters. Template constraints make sure that end-users only have options that you allow them when launching products. This can help you maintain your organization’s compliance requirements.
  • You can integrate AWS Budgets with AWS Service Catalog. By associating AWS Budgets with your products and portfolios, you can track your usage and service costs. You can set a custom budget for each of the portfolios and trigger alerts when your costs exceed your threshold.
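If you want to experiment with the TagOptions library mentioned in the first bullet, the general shape of the CLI calls is shown below; the key, value, and resource IDs are placeholders and are not part of this solution.

# Create a TagOption and attach it to a portfolio; the key, value, and IDs are placeholders.
aws --region us-east-1 servicecatalog create-tag-option --key CostCenter --value cc-1234
aws --region us-east-1 servicecatalog associate-tag-option-with-resource --resource-id port-xxxxxx --tag-option-id tag-xxxxxxxxxxxx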

Summary

In this post, I showed you how to simplify your Amazon EMR provisioning process using AWS Service Catalog, how to make Amazon EMR a self-service platform for your end users, and how to enforce best practices and compliance on your EMR clusters. You also walked through three different use cases and implemented solutions with AWS Service Catalog. Give this solution a try and share your experience with us!

 


About the Author

Tanzir Musabbir is a Data & Analytics Architect with AWS. At AWS, he works with our customers to provide them architectural guidance for running analytics solutions on Amazon EMR, Amazon Athena & AWS Glue. Tanzir is a big Real Madrid fan and he loves to travel in his free time.

Migrate and deploy your Apache Hive metastore on Amazon EMR

Post Syndicated from Tanzir Musabbir original https://aws.amazon.com/blogs/big-data/migrate-and-deploy-your-apache-hive-metastore-on-amazon-emr/

Combining the speed and flexibility of Amazon EMR with the utility and ubiquity of Apache Hive provides you with the best of both worlds. However, getting started with big data projects can feel intimidating. Whether you want to deploy new data on EMR or migrate an existing project, this post provides you with the basics to get started.

Apache Hive is an open-source data warehouse and analytics package that runs on top of an Apache Hadoop cluster. A Hive metastore contains a description of the table and the underlying data making up its foundation, including the partition names and data types. Hive is one of the applications that can run on EMR.

Most of the solutions that this post presents assume that you use Apache Hadoop to manage your metastore, which provides scalability for Hive. If you don't use Hadoop, see the Amazon EMR documentation.

Hive metastore deployment

You can choose one of three configuration patterns for your Hive metastore: embedded, local, or remote. When migrating an on-premises Hadoop cluster to EMR, your migration strategy depends on your existing Hive metastore’s configuration.

Bear in mind a few key facts while considering your set-up. Apache Hive ships with the Derby database, which you can use for embedded metastores. However, Derby can’t scale for production-level workloads.

When running on EMR, Hive records metastore information in a MySQL database on the master node's file system, which is ephemeral storage, creating a local metastore. When a cluster terminates, all cluster nodes shut down, including the master node, which erases your data.

To get around these problems, create an external Hive metastore. This helps ensure that the Hive metadata store can scale with your implementation and that the metastore persists even if the cluster terminates.

There are two options for creating an external Hive metastore for EMR:

Using the AWS Glue Data Catalog as the Hive metastore

The AWS Glue Data Catalog is flexible and reliable, making it a great choice when you're new to building or maintaining a metastore. Because AWS manages the service for you, you invest less time and fewer resources in the process, but you also sacrifice some fine-grained control. The Data Catalog is highly available and fault tolerant, maintains data replicas to avoid failure, and expands hardware depending on usage.

You don’t have to manage the Hive metastore database instance separately, maintain ongoing replication, or scale up the instance. An AWS Glue Data Catalog can supply one EMR cluster or many, as well as supporting Amazon Athena and Amazon Redshift Spectrum. You can also download the source code for the AWS Glue Data Catalog client for Apache Hive Metastore and use that code as a reference implementation for building a compatible client.

The AWS Glue Data Catalog still allows you plenty of control. You can enable encryption on your files, or configure access to allow or forbid certain actions. Bear in mind that the Data Catalog doesn't currently support column statistics, Hive authorizations, or Hive constraints.

An AWS Glue Data Catalog has versions, which means a table can have multiple schema versions. AWS Glue stores that information in the Data Catalog, including the Hive metastore data. Based on the catalog configuration, you can adopt the new schema version or ignore new versions.

When you create an EMR cluster using release version 5.8.0 and later, you can choose a Data Catalog as the Hive metastore. The Data Catalog is not available with earlier releases.

Specify the AWS Glue Data Catalog using the EMR console

When you set up an EMR cluster, choose Advanced Options to enable AWS Glue Data Catalog settings in Step 1. Apache Hive, Presto, and Apache Spark all use the Hive metastore. Within EMR, you have options to use the AWS Glue Data Catalog for any of these applications.

Specify the AWS Glue Data Catalog using the AWS CLI or EMR API

To specify the AWS Glue Data Catalog when you create a cluster in either the AWS CLI or the EMR API, use the hive-site configuration classification. Set the value of the hive.metastore.client.factory.class property to com.amazonaws.glue.catalog.metastore.AWSGlueDataCatalogHiveClientFactory:

[
  {
    "Classification": "hive-site",
    "Properties": {
      "hive.metastore.client.factory.class": "com.amazonaws.glue.catalog.metastore.AWSGlueDataCatalogHiveClientFactory"
    }
  }
]  

When you create an EMR cluster, save the configuration classification to a JSON file and then specify that file when you create the cluster. For more information, see Configuring Applications in the Amazon EMR Release Guide.
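For example, a cluster launch that picks up that classification from a local file might look like the following; the file name glue-hive-configuration.json is just an assumed name for this example.

# Launch a cluster that uses the AWS Glue Data Catalog as its Hive metastore.
# The file name glue-hive-configuration.json is an assumption for this example.
aws emr create-cluster --release-label emr-5.17.0 --instance-type m4.large --instance-count 2 \
--applications Name=Hive --configurations file://glue-hive-configuration.json --use-default-roles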

Using Amazon RDS or Amazon Aurora as the Hive metastore

If you want full control of your Hive metastore and want to integrate with other open-source applications such as Apache Ranger or Apache Atlas, then you can host your Hive metastore on Amazon RDS.

Always keep in mind that your Hive metastore is a single point of failure. Amazon RDS doesn’t automatically replicate databases, so you should enable replication when using Amazon RDS to avoid any data loss in the event of failure.

There are three main steps to set up your Hive metastore using RDS or Aurora:

  1. Create a MySQL or Aurora database.
  2. Configure the hive-site.xml file to point to MySQL or Aurora database.
  3. Specify an external Hive metastore.

Create a MySQL or Aurora database

Begin by setting up either your MySQL database on Amazon RDS or an Amazon Aurora database. Make a note of the URL, username, password, and database name, as you need all this information for the configuration process.

Update your database’s security group to allow JDBC connections between the EMR cluster and a MySQL database port (default: 3306).
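A hand-written example of that security group change follows; both security group IDs are placeholders, and it assumes the database and the EMR cluster instances each sit behind a single security group.

# Allow the EMR cluster's security group to reach the metastore database on port 3306.
# sg-xxxxxxxx is the database's security group; sg-yyyyyyyy is the EMR cluster's. Both are placeholders.
aws ec2 authorize-security-group-ingress --group-id sg-xxxxxxxx \
  --protocol tcp --port 3306 --source-group sg-yyyyyyyy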

Configure EMR for an external Hive metastore

To configure EMR, create a configuration file containing the following Hive site classification information:

  • javax.jdo.option.ConnectionDriverName should point to the MariaDB driver org.mariadb.jdbc.Driver (the preferred driver).
  • javax.jdo.option.ConnectionURL, javax.jdo.option.ConnectionUserName, and javax.jdo.option.ConnectionPassword should all point to the newly created database.
[
    {
      "Classification": "hive-site",
      "Properties": {
        "javax.jdo.option.ConnectionURL": "jdbc:mysql://hostname:3306/hive?createDatabaseIfNotExist=true",
        "javax.jdo.option.ConnectionDriverName": "org.mariadb.jdbc.Driver",
        "javax.jdo.option.ConnectionUserName": "username",
        "javax.jdo.option.ConnectionPassword": "password"
      }
    }
]

Specify an external Hive metastore

After you save your configuration, specify an external Hive metastore. You can do this with either the EMR console or the AWS CLI.

On the EMR console, enter the classification settings created in the previous step as a JSON file from Amazon S3 or as embedded text.

If you are using the AWS CLI, save the classification information as a file named hive-configuration.json and pass the configuration file as a local file or from S3.

  • hive-configuration.json file in a local path:

aws emr create-cluster --release-label emr-5.17.0 --instance-type m4.large --instance-count 2 \
--applications Name=Hive --configurations ./hive-configuration.json --use-default-roles

  • hive-configuration.json file in Amazon S3:

aws emr create-cluster --release-label emr-5.17.0 --instance-type m4.large --instance-count 2 \
--applications Name=Hive --configurations s3://emr-sample/hive-configuration.json --use-default-roles

Hive metastore migration options

When migrating Hadoop-based workloads from on-premises to the cloud, you must migrate your Hive metastore as well. Depending on the migration plan or your requirements, you can migrate a metastore one of two ways:

  • A one-time metastore migration, which moves an existing Hive metastore completely to AWS.
  • An ongoing metastore sync, which migrates the Hive metastore but also keeps a copy on-premises so that the two metastores can sync in real time during the migration phase.

One-time metastore migration

A one-and-done migration option allows you to shift your workspace entirely and never worry about migrating again. This situation is perfect if you plan to run your existing Hive workloads on EMR. The following diagram illustrates this scenario.

Migrating your Hive metastore to AWS Glue Data Catalog

In this case, your goal is to migrate existing Hive metastore from on-premises to an AWS Glue Data Catalog. There are multiple ways to navigate this migration, but the easiest uses an AWS Glue ETL job to extract metadata from your Hive metastore.  You then use AWS Glue jobs to load the metadata and update the AWS Glue Data Catalog. Many scripts to manage this process already exist on GitHub.

Migrating your Hive metastore to Amazon RDS or Amazon Aurora

Instead of using the AWS Glue Data Catalog, you can move your Hive metastore data from an on-premises database to AWS-based storage. Depending on your database source and the desired target in AWS, the process requires different steps; see the AWS documentation for your specific source and target combination.

Ongoing metastore sync

Large-scale migrations benefit from an ongoing sync process, allowing you to keep running your Hive metastore in your data center as well as in the cloud during the migration phase.

The ongoing sync process keeps both Hive metastores accurate and up-to-date with any changes entered during the migration process. Use only one application for updating the Hive metastore. Otherwise, the metastore is out-of-sync.

AWS DMS is a data migration service that is ideal for ongoing replication and custom-built for this need. You can also replicate the external database to Amazon RDS using the binary log file positions of replicated transactions.

Conclusion

This post pointed you to the various existing resources that can make your Hive migration as smooth and easy as possible.

The content of this blog post is part of the EMR Migration Guide, which provides a comprehensive overview of the advantages and disadvantages of each migration approach for Hadoop ecosystems. To read the paper, download the Amazon EMR Migration Guide now.

If you have additional insights or feedback, leave a comment here or reach out on Twitter!

 


About the Author

Tanzir Musabbir is an EMR Specialist Solutions Architect with AWS. He is an early adopter of open source Big Data technologies. At AWS, he works with our customers to provide them architectural guidance for running analytics solutions on Amazon EMR, Amazon Athena & AWS Glue. Tanzir is a big Real Madrid fan and he loves to travel in his free time.

 

 

Launch an edge node for Amazon EMR to run RStudio

Post Syndicated from Tanzir Musabbir original https://aws.amazon.com/blogs/big-data/launch-an-edge-node-for-amazon-emr-to-run-rstudio/

RStudio Server provides a browser-based interface for R and is a popular tool among data scientists. Data scientists use Apache Spark clusters running on Amazon EMR to perform distributed training. In a previous blog post, the author showed how you can install RStudio Server on an Amazon EMR cluster. However, in certain scenarios you might want to install it on a standalone Amazon EC2 instance and connect to a remote Amazon EMR cluster. Benefits of running RStudio on EC2 include the following:

  • Running RStudio Server on an EC2 instance, you can keep your scientific models and model artifacts on the instance. You might have to relaunch your EMR cluster to meet your application requirements. By running RStudio Server separately, you have more flexibility and don’t have to depend entirely on an Amazon EMR cluster.
  • Installing RStudio on the master node of Amazon EMR requires sharing of resources with the applications running on the same node. By running RStudio on a standalone Amazon EC2 instance, you can use resources as you need without having to share the resources with other applications.
  • You might have multiple Amazon EMR clusters in your environment. With RStudio on an edge node, you have the flexibility to connect to any EMR cluster in your environment.

There is one major difference between running RStudio Server on an Amazon EMR cluster vs. running it on a standalone Amazon EC2 instance. In the latter case, the instance needs to be configured as an Amazon EMR client (or edge node). By doing so, you can submit Apache Spark jobs and other Hadoop-based jobs from an instance other than the EMR master node.

In this post, I walk you through a list of steps to configure an Amazon EC2 instance as an Amazon EMR edge node with RStudio Server configured for remote workloads.

Solution overview

In the next few sections, I describe creating an edge node, installing RStudio, and connecting to a remote Spark cluster from R running on the edge node.

At a high level, the solution includes the following steps:

  1. Create an Amazon EC2 instance.
  2. Install RStudio Server and required dependencies on that instance.
  3. Install the Apache Spark and Hadoop client libraries and dependencies on the same instance.
  4. Launch an Amazon EMR cluster with the Apache Spark, Livy, and Hive applications.
  5. Configure the Amazon EC2 instance as an EMR client.
  6. Test EMR client functionality by running sample remote jobs.
  7. Connect to the Spark cluster from R using Sparklyr.
  8. Interact with Hive tables on EMR from RStudio.
  9. Run R models on the data on the EMR cluster.

Let’s take a look at the steps to configure an Amazon EC2 instance as EMR edge node.

Creating an edge node for Amazon EMR

In this exercise, I create a Spark client on the edge node. Because Spark relies on Hadoop libraries, I also install Hadoop on the edge node. To make sure that the client works properly, I install the same Hadoop and Spark versions as those on the EMR cluster. Because most of the libraries also run on JVM, I recommend that you have the same JVM version as the one on the EMR cluster.

After I install Spark and Hadoop on the edge node, I configure the edge node to talk to the remote Amazon EMR cluster. To do that, several configuration files need to be copied from the EMR cluster to the edge node. There are two ways to copy the configuration files from a newly created EMR cluster to the edge node on the EC2 instance: manual and automated. In the next section, I discuss those two approaches.

Manual approach

After the EMR cluster is up and running, you can use the secure transfer tool scp to copy the required configuration files from an EMR master node to a local machine. In my case, I used my laptop.

> mkdir emr-config
> cd emr-config
> scp -i <key> hadoop@<master-node-dns>:/etc/hadoop/conf/*-site.xml .

You can also use the same tool to copy those files from that local machine to the edge node:

> scp -i <key> hdfs-site.xml ec2-user@<edge-node-dns>:/etc/hadoop/conf/.

PC users can use an application like WinSCP to connect and transfer files between an EMR master node and a PC.

Note: Depending on the applications installed on Amazon EMR, you might need to copy other libraries from the cluster. As an example, the open-source distributions of Hadoop and Spark packages that are used in this solution don’t have libraries for EMRFS. So, to use EMRFS, copy the EMRFS libraries to the edge node and update the classpath to include the libraries.

Automated approach (used in this solution)

As you might have noticed in the previous approach, you need to run the copy operation twice:

  1. From the EMR master node to a local machine
  2. From the local machine to the edge node

If you use a bastion host to access EMR, then the copy process also needs to go one extra hop. One way to automate this process is to execute a script as an EMR step, which uploads all the required libraries and configuration files to an Amazon S3 location. A second script on the edge node runs through cfn-init, which downloads files from the S3 location and places them in the right application paths. The following diagram illustrates a sequence of steps that take place during this process.

In this solution, the EMR step (CreateEMRClientDeps) executes the script create-emr-client.sh to copy the configuration files to Amazon S3. The script first creates an archive file awsemrdeps.tgz with all the required libraries. It then uploads that file into a temporary S3 bucket with a prefix ending in /emr-client/. On the edge node, the install-client-and-rstudio.sh script is used to copy the awsemrdeps.tgz file from S3 back to the edge node.
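The handoff itself is simple. As a rough sketch (the exact directory list lives in the two scripts, so the paths below are illustrative only):

# On the EMR master node, create-emr-client.sh (run as the CreateEMRClientDeps step) bundles
# client configuration and pushes it to S3. The directory list here is illustrative.
tar -czf awsemrdeps.tgz -C / etc/hadoop/conf etc/spark/conf
aws s3 cp awsemrdeps.tgz s3://<<s3-temp-upload-path>>/emr-client/awsemrdeps.tgz

# On the edge node, install-client-and-rstudio.sh (run through cfn-init) pulls the bundle back down.
aws s3 cp s3://<<s3-temp-upload-path>>/emr-client/awsemrdeps.tgz /tmp/awsemrdeps.tgz
tar -xzf /tmp/awsemrdeps.tgz -C /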

Let’s take a look at the AWS CloudFormation steps to create an edge node for Amazon EMR and run RStudio on the edge node.

Walkthrough using AWS CloudFormation

Following are the prerequisites to run the AWS CloudFormation template for this solution:

  • Create an Amazon VPC with at least one public subnet and one private subnet.
  • Update the IAM policy so that the user has access to create IAM policies, instance profile, roles, and security groups.
  • Enable VPC endpoints for Amazon S3.
  • Create an EC2 key-pair to connect to EC2 instances.

To set up this entire solution, you need to create a few AWS resources. The attached CloudFormation template creates all those required AWS resources and configures them to create an Amazon EMR edge node and run RStudio on it.

This CloudFormation template requires you to pass the following parameters during launch.

  • EmrSubnet – The subnet where the Amazon EMR cluster is deployed. It can be either a public or private subnet.
  • InstanceType – The Amazon EC2 instance type used for the RStudio Server and edge node, which defaults to m4.xlarge.
  • KeyName – The name of the existing EC2 key pair to access the Amazon EMR cluster and edge node.
  • RStudioServerSubnet – The public subnet where the RStudio Server and edge node are launched.
  • S3RepoPath – The Amazon S3 path where all required files (template, scripts, sample data, and so on) are stored.
  • S3TempUploadPath – The S3 path in your AWS account for housing temporary dependency files and sample data for Hive.
  • VPC – The ID of the virtual private cloud (VPC) where the EMR cluster and edge node are deployed.

Important: This template is designed only to show how you can create an EMR edge node and configure RStudio for remote EMR workloads. This setup isn’t intended for production use without modification. If you try this solution outside of the US-East-1 Region, be sure to download the necessary files from s3://aws-data-analytics-blog/rstudio-edge-node. You then upload the files to the buckets in your AWS Region, edit the script as appropriate, and then run it.

To launch the CloudFormation stack, choose Launch Stack:

The following sample screenshot shows the stack parameters.

Launching this stack creates the following AWS resources.

  • EMRCluster (Amazon EMR cluster) – The EMR cluster to run the Spark and Hive jobs
  • CreateEMRClientDeps (EMR step job) – A job that runs a script to create client dependencies and upload them to S3
  • CreateHiveTables (EMR step job) – A job to copy sample data for Hive and create Hive tables
  • RStudioConfigureWaitCondition (CloudFormation wait condition) – A wait condition that works with the wait handler, and waits for the RStudio Server setup process to complete
  • RStudioEIP (Elastic IP address) – The Elastic IP address for RStudio Server
  • RStudioInstanceProfile (Instance profile) – The instance profile for the RStudio and edge node instance (for this solution, I used the default role EMR_EC2_DefaultRole created during EMR launch)
  • RStudioSecGroup (Amazon EC2 security group) – The security group that controls incoming traffic to the edge node
  • RStudioServerEC2 (Amazon EC2 instance) – The EC2 instance for the edge node and RStudio Server
  • RStudioToEMRSecGroup (Amazon EC2 security group) – The security group that controls traffic between EMR and the edge node
  • RStudioWaitHandle (CloudFormation wait handler) – The wait handler that gets triggered after RStudio Server is launched
  • SecGroupSelfIngress (Amazon EC2 security group ingress rule) – An ingress rule added to RStudioToEMRSecGroup that allows an instance to talk to another instance with the same security group

The CloudFormation template used in this solution configures S3 paths and stores files to their respective locations. The EMR client dependencies archive awsemrdeps.tgz is stored at the <<s3-temp-upload-path>>/emr-client/ location. The sample data file tripdata.csv is stored at <<s3-temp-upload-path>>/ny-taxi/.

The following screenshot shows how the S3 paths are configured after deployment. In this example, I passed an S3 full path, s3://<<my-bucket>>/rstudio-edge-node, which is on my Amazon S3 account.

When the CloudFormation template has successfully completed, the DNS address of RStudio Server is displayed on the Outputs tab, as shown following.

The address shown is the DNS address of the RStudio Server and edge node. A user should be able to connect to this address immediately after enabling FoxyProxy.

Test data and tables

For the source data, I have used New York City Taxi and Limousine Commission (TLC) trip record data. For a description of the data, see this detailed dictionary of the taxi data. The trip data is in comma-separated value (CSV) format with the first row as a header. The following image shows data from the trip dataset.

A second EMR step, CreateHiveTables, is created as part of the CloudFormation template. This step creates two Hive tables that will be later used by R on RStudio to run sample models. Both are external Hive tables—one is stored in HDFS on the EMR cluster and the other in Amazon S3. The goal is to demonstrate how RStudio can consume data with storage that is backed by HDFS and S3.

  • ny_taxi_hdfs – stored in HDFS at /user/ruser/ny_taxi
  • ny_taxi_s3 – stored in Amazon S3 at s3://<s3-temp-upload-path>/ny_taxi

The following section shows a list of steps to test Amazon EMR client functionality, which is optional.

Testing EMR client functionality (optional)

If the EMR client is configured correctly on the edge node, you should be able to submit Spark jobs from the edge node to the EMR cluster. You can apply the following few steps to the edge node to verify this functionality:

  1. Log in to the edge node using the default user ec2-user.
  2. Use the host address from the CloudFormation Outputs tab to connect:
ssh -i <<key-pair>> ec2-user@<<rstudio-server-address>>
  3. The CloudFormation template also creates a new user, called ruser, that you can use to submit Spark jobs or use the RStudio UI. Switch the user from ec2-user to ruser:
[ec2-user@<edge-node> ~]$ sudo -s
[root@<edge-node> ec2-user]# su - ruser
  4. Submit a Spark example job to the remote EMR cluster:
$SPARK_HOME/bin/spark-submit --class org.apache.spark.examples.SparkPi --master yarn $SPARK_HOME/examples/jars/spark-examples_2.11-2.3.1.jar

  5. Check the job status in the terminal and also on the EMR console. The Spark example job should be able to finish successfully. In the terminal, it should display the value of Pi as shown following.

  6. Check the job status in the Resource Manager UI; notice that the Spark Pi job ran as ruser and completed successfully.

  7. Test this setup further by running spark-shell, and retrieve Hive table data from the remote EMR cluster:
[ruser@<edge-node> ~]$ $SPARK_HOME/bin/spark-shell

  8. Check the list of all available Hive tables and their content:
scala> spark.sql("show tables").show

scala> spark.sql("select * from ny_taxi_s3 limit 10").show

Running R and connecting to Apache Spark

In this section, let’s run some tests and models from RStudio consuming data from Amazon EMR. Locate the RStudio Server address on the Outputs tab on the CloudFormation console. The user name is ruser and the password is BigData26.

A successful login redirects you to this welcome window. The large window on the left is the console window, where you write R code.

Create a SparkContext on the R console. No additional configuration is needed, because RStudio is already set up with the required environment variables and files through the AWS CloudFormation stack. In this solution, Sparklyr is used to connect to Spark. Attach the required R packages before creating the SparkContext as follows:

library(sparklyr)
library(dplyr)
sc <- spark_connect(master = "yarn")

When a connection to Spark is established, it creates a “yarn” connection channel (find this in the RStudio UI, on the Connections tab at the right corner). It also shows Hive metadata on the same widget. Because the CloudFormation template created two Hive tables, they appear under the “yarn” connection as shown following.

A YARN application is also placed under ruser. The status of the application is RUNNING as long as the connection is established. You can find the status of that application on the YARN ResourceManager UI also. Notice that the user is ruser and the name of the application is sparklyr.

For more information, check the YARN app log by choosing Log on the widget.

Now, test whether the data for those two Hive tables is accessible. Choose the ny_taxi_hdfs table to sample the data.

Choose the ny_taxi_s3 table to sample the data on S3.

By successfully running these two tests, you can see that RStudio running on an edge node can consume data stored in a remote EMR cluster.

During the development phase, users might want to write some data back to their S3 bucket. So it's a good idea to verify whether a user can write data directly to S3 using R and Spark. To test this, I read the ny_taxi_hdfs Hive table using the spark_read_table API. Then I write the data to Amazon S3 by calling the spark_write_csv API and passing my S3 target path. For this solution, I used s3://tm-blogs-placeholder/write-from-rstudio as my new S3 path.

ny_taxi <- spark_read_table(sc, "ny_taxi_hdfs")
spark_write_csv(ny_taxi,path = "s3://tm-blogs-placeholder/write-from-rstudio")

After the write operation, the S3 location appears as follows.

You can also see Spark write logs in the YARN application log.

Now analyze the data with R and draw some plots. To do so, first check the count of ny_taxi data. It should return 20,000.

ny_taxi <- spark_read_table(sc, "ny_taxi_hdfs")
ny_taxi %>% count

Now, find the number of trips for each rate code type. There are six different rate code types where 1 is the standard rate code and 5 is the negotiated rate. For details, see this detailed dictionary of the taxi data.

library(ggplot2)
trip_by_rate_code_id <- ny_taxi %>%
  mutate(rate_code_id) %>%
  group_by(rate_code_id) %>%
  summarize(n = n()) %>%
  collect()

ggplot(trip_by_rate_code_id, aes(rate_code_id, n)) + 
  geom_bar(stat="Identity") +
  scale_y_continuous(labels = scales::comma) +
  labs(title = "Number of Trips by Rate Code", x = "Rate Code Id", y = "")

Based on the graph, I can say that (except for some passengers who paid a negotiated rate) the rest of the passengers paid the standard rate during their ride.

Now find the average trip duration between two New York areas—Queens and Manhattan. The pu_location_id value represents the taxi pick-up zone, and do_location_id represents the taxi drop-off zone. For this test, I use 129 as the pick-up zone and 82 as the drop-off zone. Taxi zone 129 represents the Jackson Heights area in Queens, and taxi zone 82 represents the Elmhurst area. For details, see this taxi zone lookup table.

trip_duration_tbl <- ny_taxi %>%
  filter(pu_location_id == 129 & do_location_id == 82) %>%
  mutate(pickup_time = hour(from_unixtime(unix_timestamp(lpep_pickup_datetime, "MM/dd/yy HH:mm")))) %>%
  mutate(trip_duration = unix_timestamp(lpep_dropoff_datetime, "MM/dd/yy HH:mm") - unix_timestamp(lpep_pickup_datetime, "MM/dd/yy HH:mm")) %>%
  group_by(pickup_time) %>% 
  summarize(n = n(),
            trip_duration_mean = mean(trip_duration),
            trip_duration_p10 = percentile(trip_duration, 0.10),
            trip_duration_p25 = percentile(trip_duration, 0.25),
            trip_duration_p50 = percentile(trip_duration, 0.50),
            trip_duration_p75 = percentile(trip_duration, 0.75),
            trip_duration_p90 = percentile(trip_duration, 0.90)) %>% 
  collect()
            
ggplot(trip_duration_tbl, aes(x = pickup_time)) +
          geom_line(aes(y = trip_duration_p50, alpha = "Median")) +
          geom_ribbon(aes(ymin = trip_duration_p25, ymax = trip_duration_p75, 
                          alpha = "25–75th percentile")) +
          geom_ribbon(aes(ymin = trip_duration_p10, ymax = trip_duration_p90, 
                          alpha = "10–90th percentile")) +
          scale_y_continuous("Trip Duration (in seconds)") + 
          scale_x_continuous("Pickup Time of the day")

Based on the plot, I can say that on average, each trip duration was about 10–12 minutes. There was a rare peak around 1 a.m. for some days, where the trip duration was more than 30 minutes.

Next steps

The goal of this post is to show, first, how to create an edge node or Amazon EMR client on an Amazon EC2 instance, and second, how other applications (RStudio in this case) can use that edge node or Amazon EMR client to submit workloads remotely. By following the same approach, you can also create an edge node for other Hadoop applications, such as a Hive client, Oozie client, or HBase client. Data scientists can keep enriching their R environment by adding additional packages while keeping it totally isolated from developers' EMR environments. To enhance this solution further and make it production ready, you can explore the following options:

  • Use friendly URLs for Amazon EMR interfaces. For example, instead of thrift://ip-10-0-20-253.ec2.internal:9083 for the hive.metastore.uris value, you can use something like thrift://hive-metastore.dev.example.corp:9083. In the same way, instead of using ip-10-0-20-253.ec2.internal:8032 for the yarn.resourcemanager.address property value, you can use dev.emr.example.corp:8032. The benefit of this approach is that, even if you terminate your EMR cluster and recreate it again (with new IP addresses), you don’t have to change your client node’s configuration. This blog post shows how you can create friendly URLs for Amazon EMR.
  • If you already integrated Microsoft Active Directory into your Amazon EMR cluster, you can do the same with RStudio. That way, you can achieve single sign-on across your data analytics solutions.
  • Enable detailed Amazon CloudWatch logs to monitor your edge node behaviors and trigger alerts for different scenarios (disk space utilization, memory usage, and so on). With this approach, you can proactively notify your data scientists before a possible failure.
  • H2O is one of the popular packages used in R. It's open-source software that allows users to fit thousands of potential models to discover patterns in user data. You can install H2O using CRAN, just like the way Sparklyr was installed in this solution, and run it from RStudio. Alternatively, you can add the H2O package as part of the installation process by placing it in the install-client-and-rstudio.sh script:
install.packages("h2o")
library(h2o)
localH2O = h2o.init()

Common issues

Although it’s hard to cover every possible scenario (because these vary on AWS environments), this section covers some common issues that can occur and ways to fix them.

Issue 1: Clicking the RStudio Server URL returns a There is no Internet connection error.

Solution:  Make sure that you configured FoxyProxy in your browser and that you are connecting to the public IP address of the RStudio EC2 instance. You can get this address from the AWS CloudFormation console on the Outputs tab.

Issue 2: The EMR step job CreateEMRClientDeps fails.

Solution: This EMR step job runs the create-emr-client.sh script, which creates an archive with all required dependencies and uploads it to the S3 location. If the instances running the step don't have write access to S3, this step job fails. In this solution, the default EMR role EMR_EC2_DefaultRole is assigned to the edge node instance also. We assume that EMR_EC2_DefaultRole has write access to the S3 location given through the CloudFormation parameter S3TempUploadPath.

Issue 3: The AWS CloudFormation template Blog-EMR-Edge-Node-With-RStudio times out or fails.

Solution: A script called install-client-and-rstudio.sh runs through cfn-init on the edge node, and it writes logs to the /tmp/edge-node-rstudio-installation.log file. This script contains a sleep clause, where it waits for the awsemrdeps.tgz file to be available on S3. This clause times out after 20 minutes. If the script fails to find that file within that time period, subsequent execution fails. Also, in this solution, RStudio uses http://cran.rstudio.com/ as its repo when installing packages. If the Amazon EC2 instance can’t reach the internet, it can’t download and install those packages, and the template might fail. Make sure that you pick a public subnet or a private subnet with NAT for the edge node.

Issue 4: During Amazon EMR client testing, the Spark sample application fails with a NoClassDefFoundError or UnsupportedOperationException error.

Solution: This blog post uses Amazon EMR 5.16.0. Make sure to use the Hadoop and Spark versions corresponding to the EMR release. If the master node's application version is different from the edge node's application version, the client might fail with a NoClassDefFoundError or UnsupportedOperationException error. Make sure that you always install the same version of Hadoop and Spark in both locations.

Cleaning up

When you’ve finished testing this solution, remember to clean up all those AWS resources that you created using AWS CloudFormation. Use the AWS CloudFormation console or AWS CLI to delete the stack named Blog-EMR-Edge-Node-With-RStudio.

Summary

In this post, I show you how to create a client for Amazon EMR. I also show how you can install RStudio on that client node and connect to Apache Spark clusters running on Amazon EMR. I used Sparklyr to connect to Spark, consumed data from both HDFS and S3, and analyzed the data using R models. Go ahead, give this solution a try and share your experience with us!

 


Additional Reading

If you found this post useful, be sure to check out Running sparklyr – RStudio’s R Interface to Spark on Amazon EMR, Statistical Analysis with Open-Source R and RStudio on Amazon EMR, and Running R on Amazon Athena.

 


About the Author

Tanzir Musabbir is an EMR Specialist Solutions Architect with AWS. He is an early adopter of open source Big Data technologies. At AWS, he works with our customers to provide them architectural guidance for running analytics solutions on Amazon EMR, Amazon Athena & AWS Glue. Tanzir is a big Real Madrid fan and he loves to travel in his free time.