Tag Archives: Amazon EMR

Provisioning the Intuit Data Lake with Amazon EMR, Amazon SageMaker, and AWS Service Catalog

Post Syndicated from Michael Sambol original https://aws.amazon.com/blogs/big-data/provisioning-the-intuit-data-lake-with-amazon-emr-amazon-sagemaker-and-aws-service-catalog/

This post shares Intuit’s learnings and recommendations for running a data lake on AWS. The Intuit Data Lake is built and operated by numerous teams in Intuit Data Platform. Thanks to Tristan Baker (Chief Architect), Neil Lamka (Principal Product Manager), Achal Kumar (Development Manager), Nicholas Audo, and Jimmy Armitage for their feedback and support.

A data lake is a centralized repository for storing structured and unstructured data at any scale. At Intuit, creating such a pile of raw data is easy. However, more interesting challenges present themselves:

  1. How should AWS accounts be organized?
  2. What ingestion methods will be used? How will analysts find the data they need?
  3. Where should data be stored? How should access be managed?
  4. What security measures are needed to protect Intuit’s sensitive data?
  5. Which parts of this ecosystem can be automated?

This post outlines the approach taken by Intuit, though it is important to remember that there are many ways to build a data lake (for example, AWS Lake Formation).

We’ll cover the technologies and processes involved in creating the Intuit Data Lake at a high level, including the overall structure and the automation used in provisioning accounts and resources. Watch this space in the future for more detailed blog posts on specific aspects of the system, from the other teams and engineers who worked together to build the Intuit Data Lake.

Architecture

Account Structure

Data lakes typically follow a hub-and-spoke model, with the hub account containing shared services that control access to data sources. For the purposes of this post, we’ll refer to the hub account as Central Data Lake.

In this pattern, access to Central Data Lake is apportioned to spoke accounts called Processing Accounts. This model maintains separation between end users and allows for division of billing among distinct business units.

 

 

It is common to maintain two ecosystems: pre-production (Pre-Prod) and production (Prod). This allows data lake administrators to silo access to data by preventing connectivity between Pre-Prod and Prod.

To enable experimentation and testing, it may also be advisable to maintain separate VPC-based environments within Pre-Prod accounts, such as dev, qa, and e2e. Processing Account VPCs would then be connected to the corresponding VPC in Central Data Lake.

Note that at first, we connected accounts via VPC Peering. However, as we scaled we quickly approached the hard limit of 125 VPC peering connections, requiring us to migrate to AWS Transit Gateway. As of this writing, we connect multiple new Processing Accounts weekly.

 

 

Central Data Lake

There may be numerous services running in a hub account, but we’ll focus on the aspects that are most relevant to this blog: ingestion, sanitization, storage, and a data catalog.

 

 

Ingestion, Sanitization, and Storage

A key component to Central Data Lake is a uniform ingestion pattern for streaming data. One example is an Apache Kafka cluster running on Amazon EC2. (You can read about how Intuit engineers do this in another AWS blog.) As we deal with hundreds of data sources, we’ve enabled access to ingestion mechanisms via AWS PrivateLink.

Note: Amazon Managed Streaming for Apache Kafka (Amazon MSK) is an alternative for running Apache Kafka on Amazon EC2, but was not available at the start of Intuit’s migration.

In addition to stream processing, another method of ingestion is batch processing, such as jobs running on Amazon EMR. After data is ingested by one of these methods, it can be stored in Amazon S3 for further processing and analysis.

Intuit deals with a large volume of customer data, and each field is carefully considered and classified with a sensitivity level. All sensitive data that enters the lake is encrypted at the source. The ingestion systems retrieve the encrypted data and move it into the lake. Before it is written to S3, the data is sanitized by a proprietary RESTful service. Analysts and engineers operating within the data lake consume this masked data.

Data Catalog

A data catalog is a common way to give end users information about the data and where it lives. One example is a Hive Metastore backed by Amazon Aurora. Another alternative is the AWS Glue Data Catalog.

Processing Accounts

When Processing Accounts are delivered to end users, they include an identical set of resources. We’ll discuss the automation of Processing Accounts below, but the primary components are as follows:

 

 

                           Processing Account structure upon delivery to the customer

 

Data Storage Mechanisms

One reasonable question is whether all data should reside in Central Data Lake, or if it’s acceptable to distribute data across multiple accounts. A data lake might employ a combination of the two approaches, and classify data locations as primary or secondary.

The primary location for data is Central Data Lake, and it arrives there via the ingestion pipelines discussed previously. Processing Accounts can read from the primary source, either directly from the ingestion pipelines or from S3. Processing Accounts can contribute their transformed data back into Central Data Lake (primary), or store it in their own accounts (secondary). The proper storage location depends on the type of data, and who needs to consume it.

One rule worth enforcing is that no cross-account writes should be permitted. In other words, the IAM principal (in most cases, an IAM role assumed by EC2 via an instance profile) must be in the same account as the destination S3 bucket. This is because cross-account delegation is not supported—specifically, S3 bucket policies in Central Data Lake cannot grant Processing Account A access to objects written by a role in Processing Account B.

Another possibility is for EMR to assume different IAM roles via a custom credentials provider (see this AWS blog), but we chose not to go down this path at Intuit because it would have required many EMR jobs to be rewritten.

 

 

Data Access Patterns

The majority of end users are interested in the data that resides in S3. In Central Data Lake and some Processing Accounts, there may be a set of read-only S3 buckets: any account in the data lake ecosystem can read data from this type of bucket.

To facilitate management of S3 access for read-only buckets, we built a mechanism to control S3 bucket policies, administered entirely via code. Our deployment pipelines use account metadata to dynamically generate the correct S3 bucket policy based on the type of account (Pre-Prod or Prod). These policies are committed back into our code repository for auditability and ease of management.

We employ the same method for managing KMS key policies, as we use KMS with customer managed customer master keys (CMKs) for at-rest encryption in S3.

Here’s an example of a generated S3 bucket policy for a read-only bucket:

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "ProcessingAccountReadOnly",
            "Effect": "Allow",
            "Principal": {
                "AWS": [
                    "arn:aws:iam::111111111111:root",
                    "arn:aws:iam::222222222222:root",
                    "arn:aws:iam::333333333333:root",
                    "arn:aws:iam::444444444444:root",
                    "arn:aws:iam::555555555555:root",
                    ...
                    ...
                    ...
                    "arn:aws:iam::999999999999:root",
                ]
            },
            "Action": [
                "s3:ListBucket",
                "s3:GetObject"
            ],
            "Resource": [
                "arn:aws:s3:::intuit-data-lake-example/*",
                "arn:aws:s3:::intuit-data-lake-example"
            ]
        }
    ]
}

Note that we grant access at the account level, rather than using explicit IAM principal ARNs. Because the reads are cross-account, permissions are also required on the IAM principals in Processing Accounts. Maintaining these policies—with automation, at that level of granularity—is untenable at scale. Furthermore, using specific IAM principal ARNs would create an external dependency on foreign accounts. For example, if a Processing Account deletes an IAM role that is referenced in an S3 bucket policy in Central Data Lake, the bucket policy can no longer be saved, causing interruptions to deployment pipelines.

Security

Security is mission critical for any data lake. We’ll mention a subset of the controls we use, but not dive deep.

Encryption

Encryption can be enforced both in transit and at rest, using multiple methods:

  1. Traffic within the lake should use the latest version of TLS (1.2 as of this writing)
  2. Data can be encrypted with application-level (client-side) encryption
  3. KMS keys can used for at-rest encryption of S3, EBS, and RDS

Ingress and Egress

There’s nothing out of the ordinary in our approach to ingress and egress, but it’s worth mentioning the standard patterns we’ve found important:

Policies restricting ingress and egress are the primary points at which a data lake can guarantee quality (ingress) and prevent loss (egress).

Authorization

Access to the Intuit Data Lake is controlled via IAM roles, meaning no IAM users (with long-term credentials) are created. End users are granted access via an internal service that manages role-based, federated access to AWS accounts. Regular reviews are conducted to remove nonessential users.

Configuration Management

We use an internal fork of Cloud Custodian, which is a suite of preventative, detective, and responsive controls consisting of Amazon CloudWatch Events and AWS Config rules. Some of the violations it reports and (optionally) mitigates include:

  • Unauthorized CIDRs in inbound security group rules
  • Public S3 bucket policies and ACLs
  • IAM user console access
  • Unencrypted S3 buckets, EBS volumes, and RDS instances

Lastly, Amazon GuardDuty is enabled in all Intuit Data Lake accounts and is monitored by Intuit Security.

Automation

If there is one thing we’ve learned building the Intuit Data Lake, it is to automate everything.

There are four areas of automation we’ll discuss in this blog:

  1. Creation of Processing Accounts
  2. Processing Account Orchestration Pipeline
  3. Processing Account Terraform Pipeline
  4. EMR and SageMaker deployment via Service Catalog

Creation of Processing Accounts

The first step in creating a Processing Account is to make a request through an internal tool. This triggers automation that provisions an Intuit-stamped AWS account under the correct business unit.

 

Note: AWS Control Tower’s Account Factory was not available at the start of our journey, but it can be leveraged to provision new AWS accounts in a secured, best practice, self-service way.

Account setup also includes automated VPC creation (with optional VPN), fully automated using Service Catalog. End users simply specify subnet sizes.

It’s worth noting that Intuit leverages Service Catalog for self-service deployment of other common patterns, including ingress security groups, VPC endpoints, and VPC peering. Here’s an example portfolio:

Processing Account Orchestration Pipeline

After account creation and VPC provisioning, the Processing Account Orchestration Pipeline runs. This pipeline executes one-time tasks required for Processing Accounts. These tasks include:

  • Bootstrapping an IAM role for use in further configuration management
  • Creation of KMS keys for S3, EBS, and RDS encryption
  • Creation of variable files for the new account
  • Updating the master configuration file with account metadata
  • Generation of scripts to orchestrate the Terraform pipeline discussed below
  • Sharing Transit Gateways via Resource Access Manager

Processing Account Terraform Pipeline

This pipeline manages the lifecycle of dynamic, frequently-updated resources, including IAM roles, S3 buckets and bucket policies, KMS key policies, security groups, NACLs, and bastion hosts.

There is one pipeline for every Processing Account, and each pipeline deploys a series of layers into the account, using a set of parameterized deployment jobs. A layer is a logical grouping of Terraform modules and AWS resources, providing a way to shrink Terraform state files and reduce blast radius if redeployment of specific resources is required.

EMR and SageMaker Deployment via Service Catalog

AWS Service Catalog facilitates the provisioning of Amazon EMR and Amazon SageMaker, allowing end users to launch EMR clusters and SageMaker notebook instances that work out of the box, with embedded security.

Service Catalog allows data scientists and data engineers to launch EMR clusters in a self-service fashion with user-friendly parameters, and provides them with the following:

  • Bootstrap action to enable connectivity to services in Central Data Lake
  • EC2 instance profile to control S3, KMS, and other granular permissions
  • Security configuration that enables at-rest and in-transit encryption
  • Configuration classifications for optimal EMR performance
  • Encrypted AMI with monitoring and logging enabled
  • Custom Kerberos connection to LDAP

For SageMaker, we use Service Catalog to launch notebook instances with custom lifecycle configurations that set up connections or initialize the following: Hive Metastore, Kerberos, security, Splunk logging, and OpenDNS. You can read more about lifecycle configurations in this AWS blog. Launching a SageMaker notebook instance with best-practice configuration is as easy as follows:

 

 

Conclusion

This post illustrates the building blocks we used in creating the Intuit Data Lake. Our solution isn’t wholly unique, but comprised of common-sense approaches we’ve gleaned from dozens of engineers across Intuit, representing decades of experience. These practices have enabled us to push petabytes of data into the lake, and serve hundreds of Processing Accounts with varying needs. We are still building, but we hope our story helps you in your data lake journey.

The content and opinions in this post are those of the third-party author and AWS is not responsible for the content or accuracy of this post.

 


About the Authors

Michael Sambol is a senior consultant at AWS. He holds an MS in computer science from Georgia Tech. Michael enjoys working out, playing tennis, traveling, and watching Western movies.

 

 

 

 

Ben Covi is a staff software engineer at Intuit. At any given moment, he’s probably losing a game of Catan.

 

 

 

New – Using Step Functions to Orchestrate Amazon EMR workloads

Post Syndicated from Danilo Poccia original https://aws.amazon.com/blogs/aws/new-using-step-functions-to-orchestrate-amazon-emr-workloads/

AWS Step Functions allows you to add serverless workflow automation to your applications. The steps of your workflow can run anywhere, including in AWS Lambda functions, on Amazon Elastic Compute Cloud (EC2), or on-premises. To simplify building workflows, Step Functions is directly integrated with multiple AWS Services: Amazon ECS, AWS Fargate, Amazon DynamoDB, Amazon Simple Notification Service (SNS), Amazon Simple Queue Service (SQS), AWS Batch, AWS Glue, Amazon SageMaker, and (to run nested workflows) with Step Functions itself.

Starting today, Step Functions connects to Amazon EMR, enabling you to create data processing and analysis workflows with minimal code, saving time, and optimizing cluster utilization. For example, building data processing pipelines for machine learning is time consuming and hard. With this new integration, you have a simple way to orchestrate workflow capabilities, including parallel executions and dependencies from the result of a previous step, and handle failures and exceptions when running data processing jobs.

Specifically, a Step Functions state machine can now:

  • Create or terminate an EMR cluster, including the possibility to change the cluster termination protection. In this way, you can reuse an existing EMR cluster for your workflow, or create one on-demand during execution of a workflow.
  • Add or cancel an EMR step for your cluster. Each EMR step is a unit of work that contains instructions to manipulate data for processing by software installed on the cluster, including tools such as Apache Spark, Hive, or Presto.
  • Modify the size of an EMR cluster instance fleet or group, allowing you to manage scaling programmatically depending on the requirements of each step of your workflow. For example, you may increase the size of an instance group before adding a compute-intensive step, and reduce the size just after it has completed.

When you create or terminate a cluster or add an EMR step to a cluster, you can use synchronous integrations to move to the next step of your workflow only when the corresponding activity has completed on the EMR cluster.

Reading the configuration or the state of your EMR clusters is not part of the Step Functions service integration. In case you need that, the EMR List* and Describe* APIs can be accessed using Lambda functions as tasks.

Building a Workflow with EMR and Step Functions
On the Step Functions console, I create a new state machine. The console renders it visually, so that is much easier to understand:

To create the state machine, I use the following definition using the Amazon States Language (ASL):

{
  "StartAt": "Should_Create_Cluster",
  "States": {
    "Should_Create_Cluster": {
      "Type": "Choice",
      "Choices": [
        {
          "Variable": "$.CreateCluster",
          "BooleanEquals": true,
          "Next": "Create_A_Cluster"
        },
        {
          "Variable": "$.CreateCluster",
          "BooleanEquals": false,
          "Next": "Enable_Termination_Protection"
        }
      ],
      "Default": "Create_A_Cluster"
    },
    "Create_A_Cluster": {
      "Type": "Task",
      "Resource": "arn:aws:states:::elasticmapreduce:createCluster.sync",
      "Parameters": {
        "Name": "WorkflowCluster",
        "VisibleToAllUsers": true,
        "ReleaseLabel": "emr-5.28.0",
        "Applications": [{ "Name": "Hive" }],
        "ServiceRole": "EMR_DefaultRole",
        "JobFlowRole": "EMR_EC2_DefaultRole",
        "LogUri": "s3://aws-logs-123412341234-eu-west-1/elasticmapreduce/",
        "Instances": {
          "KeepJobFlowAliveWhenNoSteps": true,
          "InstanceFleets": [
            {
              "InstanceFleetType": "MASTER",
              "TargetOnDemandCapacity": 1,
              "InstanceTypeConfigs": [
                {
                  "InstanceType": "m4.xlarge"
                }
              ]
            },
            {
              "InstanceFleetType": "CORE",
              "TargetOnDemandCapacity": 1,
              "InstanceTypeConfigs": [
                {
                  "InstanceType": "m4.xlarge"
                }
              ]
            }
          ]
        }
      },
      "ResultPath": "$.CreateClusterResult",
      "Next": "Merge_Results"
    },
    "Merge_Results": {
      "Type": "Pass",
      "Parameters": {
        "CreateCluster.$": "$.CreateCluster",
        "TerminateCluster.$": "$.TerminateCluster",
        "ClusterId.$": "$.CreateClusterResult.ClusterId"
      },
      "Next": "Enable_Termination_Protection"
    },
    "Enable_Termination_Protection": {
      "Type": "Task",
      "Resource": "arn:aws:states:::elasticmapreduce:setClusterTerminationProtection",
      "Parameters": {
        "ClusterId.$": "$.ClusterId",
        "TerminationProtected": true
      },
      "ResultPath": null,
      "Next": "Add_Steps_Parallel"
    },
    "Add_Steps_Parallel": {
      "Type": "Parallel",
      "Branches": [
        {
          "StartAt": "Step_One",
          "States": {
            "Step_One": {
              "Type": "Task",
              "Resource": "arn:aws:states:::elasticmapreduce:addStep.sync",
              "Parameters": {
                "ClusterId.$": "$.ClusterId",
                "Step": {
                  "Name": "The first step",
                  "ActionOnFailure": "CONTINUE",
                  "HadoopJarStep": {
                    "Jar": "command-runner.jar",
                    "Args": [
                      "hive-script",
                      "--run-hive-script",
                      "--args",
                      "-f",
                      "s3://eu-west-1.elasticmapreduce.samples/cloudfront/code/Hive_CloudFront.q",
                      "-d",
                      "INPUT=s3://eu-west-1.elasticmapreduce.samples",
                      "-d",
                      "OUTPUT=s3://MY-BUCKET/MyHiveQueryResults/"
                    ]
                  }
                }
              },
              "End": true
            }
          }
        },
        {
          "StartAt": "Wait_10_Seconds",
          "States": {
            "Wait_10_Seconds": {
              "Type": "Wait",
              "Seconds": 10,
              "Next": "Step_Two (async)"
            },
            "Step_Two (async)": {
              "Type": "Task",
              "Resource": "arn:aws:states:::elasticmapreduce:addStep",
              "Parameters": {
                "ClusterId.$": "$.ClusterId",
                "Step": {
                  "Name": "The second step",
                  "ActionOnFailure": "CONTINUE",
                  "HadoopJarStep": {
                    "Jar": "command-runner.jar",
                    "Args": [
                      "hive-script",
                      "--run-hive-script",
                      "--args",
                      "-f",
                      "s3://eu-west-1.elasticmapreduce.samples/cloudfront/code/Hive_CloudFront.q",
                      "-d",
                      "INPUT=s3://eu-west-1.elasticmapreduce.samples",
                      "-d",
                      "OUTPUT=s3://MY-BUCKET/MyHiveQueryResults/"
                    ]
                  }
                }
              },
              "ResultPath": "$.AddStepsResult",
              "Next": "Wait_Another_10_Seconds"
            },
            "Wait_Another_10_Seconds": {
              "Type": "Wait",
              "Seconds": 10,
              "Next": "Cancel_Step_Two"
            },
            "Cancel_Step_Two": {
              "Type": "Task",
              "Resource": "arn:aws:states:::elasticmapreduce:cancelStep",
              "Parameters": {
                "ClusterId.$": "$.ClusterId",
                "StepId.$": "$.AddStepsResult.StepId"
              },
              "End": true
            }
          }
        }
      ],
      "ResultPath": null,
      "Next": "Step_Three"
    },
    "Step_Three": {
      "Type": "Task",
      "Resource": "arn:aws:states:::elasticmapreduce:addStep.sync",
      "Parameters": {
        "ClusterId.$": "$.ClusterId",
        "Step": {
          "Name": "The third step",
          "ActionOnFailure": "CONTINUE",
          "HadoopJarStep": {
            "Jar": "command-runner.jar",
            "Args": [
              "hive-script",
              "--run-hive-script",
              "--args",
              "-f",
              "s3://eu-west-1.elasticmapreduce.samples/cloudfront/code/Hive_CloudFront.q",
              "-d",
              "INPUT=s3://eu-west-1.elasticmapreduce.samples",
              "-d",
              "OUTPUT=s3://MY-BUCKET/MyHiveQueryResults/"
            ]
          }
        }
      },
      "ResultPath": null,
      "Next": "Disable_Termination_Protection"
    },
    "Disable_Termination_Protection": {
      "Type": "Task",
      "Resource": "arn:aws:states:::elasticmapreduce:setClusterTerminationProtection",
      "Parameters": {
        "ClusterId.$": "$.ClusterId",
        "TerminationProtected": false
      },
      "ResultPath": null,
      "Next": "Should_Terminate_Cluster"
    },
    "Should_Terminate_Cluster": {
      "Type": "Choice",
      "Choices": [
        {
          "Variable": "$.TerminateCluster",
          "BooleanEquals": true,
          "Next": "Terminate_Cluster"
        },
        {
          "Variable": "$.TerminateCluster",
          "BooleanEquals": false,
          "Next": "Wrapping_Up"
        }
      ],
      "Default": "Wrapping_Up"
    },
    "Terminate_Cluster": {
      "Type": "Task",
      "Resource": "arn:aws:states:::elasticmapreduce:terminateCluster.sync",
      "Parameters": {
        "ClusterId.$": "$.ClusterId"
      },
      "Next": "Wrapping_Up"
    },
    "Wrapping_Up": {
      "Type": "Pass",
      "End": true
    }
  }
}

I let the Step Functions console create a new AWS Identity and Access Management (IAM) role for the executions of this state machine. The role automatically includes all permissions required to access EMR.

This state machine can either use an existing EMR cluster, or create a new one. I can use the following input to create a new cluster that is terminated at the end of the workflow:

{
"CreateCluster": true,
"TerminateCluster": true
}

To use an existing cluster, I need to provide input in the cluster ID, using this syntax:

{
"CreateCluster": false,
"TerminateCluster": false,
"ClusterId": "j-..."
}

Let’s see how that works. As the workflow starts, the Should_Create_Cluster Choice state looks into the input to decide if it should enter the Create_A_Cluster state or not. There, I use a synchronous call (elasticmapreduce:createCluster.sync) to wait for the new EMR cluster to reach the WAITING state before progressing to the next workflow state. The AWS Step Functions console shows the resource that is being created with a link to the EMR console:

After that, the Merge_Results Pass state merges the input state with the cluster ID of the newly created cluster to pass it to the next step in the workflow.

Before starting to process any data, I use the Enable_Termination_Protection state (elasticmapreduce:setClusterTerminationProtection) to help ensure that the EC2 instances in my EMR cluster are not shut down by an accident or error.

Now I am ready to do something with the EMR cluster. I have three EMR steps in the workflow. For the sake of simplicity, these steps are all based on this Hive tutorial. For each step, I use Hive’s SQL-like interface to run a query on some sample CloudFront logs and write the results to Amazon Simple Storage Service (S3). In a production use case, you’d probably have a combination of EMR tools processing and analyzing your data in parallel (two or more steps running at the same time) or with some dependencies (the output of one step is required by another step). Let’s try to do something similar.

First I execute Step_One and Step_Two inside a Parallel state:

  • Step_One is running the EMR step synchronously as a job (elasticmapreduce:addStep.sync). That means that the execution waits for the EMR step to be completed (or cancelled) before moving on to the next step in the workflow. You can optionally add a timeout to monitor that the execution of the EMR step happens within an expected time frame.
  • Step_Two is adding an EMR step asynchronously (elasticmapreduce:addStep). In this case, the workflow moves to the next step as soon as EMR replies that the request has been received. After a few seconds, to try another integration, I cancel Step_Two (elasticmapreduce:cancelStep). This integration can be really useful in production use cases. For example, you can cancel an EMR step if you get an error from another step running in parallel that would make it useless to continue with the execution of this step.

After those two steps have both completed and produce their results, I execute Step_Three as a job, similarly to what I did for Step_One. When Step_Three has completed, I enter the Disable_Termination_Protection step, because I am done using the cluster for this workflow.

Depending on the input state, the Should_Terminate_Cluster Choice state is going to enter the Terminate_Cluster state (elasticmapreduce:terminateCluster.sync) and wait for the EMR cluster to terminate, or go straight to the Wrapping_Up state and leave the cluster running.

Finally I have a state for Wrapping_Up. I am not doing much in this final state actually, but you can’t end a workflow from a Choice state.

In the EMR console I see the status of my cluster and of the EMR steps:

Using the AWS Command Line Interface (CLI), I find the results of my query in the S3 bucket configured as output for the EMR steps:

aws s3 ls s3://MY-BUCKET/MyHiveQueryResults/
...

Based on my input, the EMR cluster is still running at the end of this workflow execution. I follow the resource link in the Create_A_Cluster step to go to the EMR console and terminate it. In case you are following along with this demo, be careful to not leave your EMR cluster running if you don’t need it.

Available Now
Step Functions integration with EMR is available in all regions. There is no additional cost for using this feature on top of the usual Step Functions and EMR pricing.

You can now use Step Functions to quickly build complex workflows for executing EMR jobs. A workflow can include parallel executions, dependencies, and exception handling. Step Functions makes it easy to retry failed jobs and terminate workflows after critical errors, because you can specify what happens when something goes wrong. Let me know what are you going to use this feature for!

Danilo

Amazon EMR introduces EMR runtime for Apache Spark

Post Syndicated from Joseph Marques original https://aws.amazon.com/blogs/big-data/amazon-emr-introduces-emr-runtime-for-apache-spark/

Amazon EMR is happy to announce Amazon EMR runtime for Apache Spark, a performance-optimized runtime environment for Apache Spark that is active by default on Amazon EMR clusters. EMR runtime for Spark is up to 32 times faster than EMR 5.16, with 100% API compatibility with open-source Spark. This means that your workloads run faster, saving you compute costs without making any changes to your applications.

Amazon EMR has been adding Spark runtime improvements since EMR 5.24, and discussed them in Optimizing Spark Performance. EMR 5.28 features several new improvements.

To measure these improvements, we compared EMR 5.16 (with open source Apache Spark version 2.4) with EMR 5.28 (with EMR runtime for Apache Spark compatible with Apache Spark version 2.4). We used TPC-DS benchmark queries with 3 TB scale and ran them on a six-node c4.8xlarge EMR cluster with data in Amazon S3. We measured performance improvements as the geometric mean of improvement in total query execution time, and the total query execution time across all queries. The results showed considerable improvement—that the geometric mean was 2.4 times faster and the total query runtime was 3.2 times faster.

The following graph shows performance improvements measured as total runtime for 104 TPC-DS queries. EMR 5.28 has the better (lower) runtime.

The following graph shows performance improvements measured as the geometric mean for 104 TPC-DS queries. EMR 5.28 has the better (lower) geomean.

In breaking down the per-query improvements, you can observe the highest performance gains in long-running queries.

The following graph shows performance improvements in EMR 5.28 compared to EMR 5.16 for long-running queries (running for more than 130 seconds in EMR 5.16). In this comparison, the higher numbers are better.

The following graph shows performance improvements in EMR 5.28 compared to EMR 5.16 for short-running queries (running for less than 130 seconds). Again, the higher numbers are better.

Queries running for more than 130 seconds are up to 32 times faster as seen in query 72. Queries running for less than 130 seconds are up to 6 times faster, with an average improvement of 2 times faster across the board.

Customers use Spark for a wide array of analytics use cases ranging from large-scale transformations to streaming, data science, and machine learning. They choose to run Spark on EMR because EMR provides the latest, stable, open-source community innovations, performant storage with Amazon S3, and the unique cost savings capabilities of Spot Instances and Auto Scaling. It also provides ease of use with managed EMR Notebooks, notebook-scoped libraries, Git integration, and easy debugging and monitoring with off-cluster Spark History Services. Combined with the runtime improvements, and fine-grained access control using AWS Lake Formation, Amazon EMR presents an excellent choice for customers running Apache Spark.

With each of these performance optimizations to Apache Spark, you benefit from better query performance. Stay tuned for additional updates that improve Apache Spark performance in Amazon EMR. To keep up to date, subscribe to the Big Data blog’s RSS feed to learn about more Apache Spark optimizations, configuration best practices, and tuning advice.

 


About the Authors

Joseph Marques is a principal engineer for EMR at Amazon Web Services.

 

 

 

 

Peter Gvozdjak is a senior engineering manager for EMR at Amazon Web Services.

 

 

 

 

New – Insert, Update, Delete Data on S3 with Amazon EMR and Apache Hudi

Post Syndicated from Danilo Poccia original https://aws.amazon.com/blogs/aws/new-insert-update-delete-data-on-s3-with-amazon-emr-and-apache-hudi/

Storing your data in Amazon S3 provides lots of benefits in terms of scale, reliability, and cost effectiveness. On top of that, you can leverage Amazon EMR to process and analyze your data using open source tools like Apache Spark, Hive, and Presto. As powerful as these tools are, it can still be challenging to deal with use cases where you need to do incremental data processing, and record-level insert, update, and delete.

Talking with customers, we found that there are use cases that need to handle incremental changes to individual records, for example:

  • Complying with data privacy regulations, where their users choose to exercise their right to be forgotten, or change their consent as to how their data can be used.
  • Working with streaming data, when you have to handle specific data insertion and update events.
  • Using change data capture (CDC) architectures to track and ingest database change logs from enterprise data warehouses or operational data stores.
  • Reinstating late arriving data, or analyzing data as of a specific point in time.

Starting today, EMR release 5.28.0 includes Apache Hudi (incubating), so that you no longer need to build custom solutions to perform record-level insert, update, and delete operations. Hudi development started in Uber in 2016 to address inefficiencies across ingest and ETL pipelines. In the recent months the EMR team has worked closely with the Apache Hudi community, contributing patches that include updating Hudi to Spark 2.4.4 (HUDI-12), supporting Spark Avro (HUDI-91), adding support for AWS Glue Data Catalog (HUDI-306), as well as multiple bug fixes.

Using Hudi, you can perform record-level inserts, updates, and deletes on S3 allowing you to comply with data privacy laws, consume real time streams and change data captures, reinstate late arriving data and track history and rollbacks in an open, vendor neutral format. You create datasets and tables and Hudi manages the underlying data format. Hudi uses Apache Parquet, and Apache Avro for data storage, and includes built-in integrations with Spark, Hive, and Presto, enabling you to query Hudi datasets using the same tools that you use today with near real-time access to fresh data.

When launching an EMR cluster, the libraries and tools for Hudi are installed and configured automatically any time at least one of the following components is selected: Hive, Spark, or Presto. You can use Spark to create new Hudi datasets, and insert, update, and delete data. Each Hudi dataset is registered in your cluster’s configured metastore (including the AWS Glue Data Catalog), and appears as a table that can be queried using Spark, Hive, and Presto.

Hudi supports two storage types that define how data is written, indexed, and read from S3:

  • Copy on Write – data is stored in columnar format (Parquet) and updates create a new version of the files during writes. This storage type is best used for read-heavy workloads, because the latest version of the dataset is always available in efficient columnar files.
  • Merge on Read – data is stored with a combination of columnar (Parquet) and row-based (Avro) formats; updates are logged to row-based “delta files” and compacted later creating a new version of the columnar files. This storage type is best used for write-heavy workloads, because new commits are written quickly as delta files, but reading the data set requires merging the compacted columnar files with the delta files.

Let’s do a quick overview of how you can set up and use Hudi datasets in an EMR cluster.

Using Apache Hudi with Amazon EMR
I start creating a cluster from the EMR console. In the advanced options I select EMR release 5.28.0 (the first including Hudi) and the following applications: Spark, Hive, and Tez. In the hardware options, I add 3 task nodes to ensure I have enough capacity to run both Spark and Hive.

When the cluster is ready, I use the key pair I selected in the security options to SSH into the master node and access the Spark Shell. I use the following command to start the Spark Shell to use it with Hudi:

$ spark-shell --conf "spark.serializer=org.apache.spark.serializer.KryoSerializer"
              --conf "spark.sql.hive.convertMetastoreParquet=false"
              --jars /usr/lib/hudi/hudi-spark-bundle.jar,/usr/lib/spark/external/lib/spark-avro.jar

There, I use the following Scala code to import some sample ELB logs in a Hudi dataset using the Copy on Write storage type:

import org.apache.spark.sql.SaveMode
import org.apache.spark.sql.functions._
import org.apache.hudi.DataSourceWriteOptions
import org.apache.hudi.config.HoodieWriteConfig
import org.apache.hudi.hive.MultiPartKeysValueExtractor

//Set up various input values as variables
val inputDataPath = "s3://athena-examples-us-west-2/elb/parquet/year=2015/month=1/day=1/"
val hudiTableName = "elb_logs_hudi_cow"
val hudiTablePath = "s3://MY-BUCKET/PATH/" + hudiTableName

// Set up our Hudi Data Source Options
val hudiOptions = Map[String,String](
    DataSourceWriteOptions.RECORDKEY_FIELD_OPT_KEY -> "request_ip",
    DataSourceWriteOptions.PARTITIONPATH_FIELD_OPT_KEY -> "request_verb", 
    HoodieWriteConfig.TABLE_NAME -> hudiTableName, 
    DataSourceWriteOptions.OPERATION_OPT_KEY ->
        DataSourceWriteOptions.INSERT_OPERATION_OPT_VAL, 
    DataSourceWriteOptions.PRECOMBINE_FIELD_OPT_KEY -> "request_timestamp", 
    DataSourceWriteOptions.HIVE_SYNC_ENABLED_OPT_KEY -> "true", 
    DataSourceWriteOptions.HIVE_TABLE_OPT_KEY -> hudiTableName, 
    DataSourceWriteOptions.HIVE_PARTITION_FIELDS_OPT_KEY -> "request_verb", 
    DataSourceWriteOptions.HIVE_ASSUME_DATE_PARTITION_OPT_KEY -> "false", 
    DataSourceWriteOptions.HIVE_PARTITION_EXTRACTOR_CLASS_OPT_KEY ->
        classOf[MultiPartKeysValueExtractor].getName)

// Read data from S3 and create a DataFrame with Partition and Record Key
val inputDF = spark.read.format("parquet").load(inputDataPath)

// Write data into the Hudi dataset
inputDF.write
       .format("org.apache.hudi")
       .options(hudiOptions)
       .mode(SaveMode.Overwrite)
       .save(hudiTablePath)

In the Spark Shell, I can now count the records in the Hudi dataset:

scala> inputDF2.count()
res1: Long = 10491958

In the options, I used the integration with the Hive metastore configured for the cluster, so that the table is created in the default database. In this way, I can use Hive to query the data in the Hudi dataset:

hive> use default;
hive> select count(*) from elb_logs_hudi_cow;
...
OK
10491958
...

I can now update or delete a single record in the dataset. In the Spark Shell, I prepare some variables to find the record I want to update, and a SQL statement to select the value of the column I want to change:

val requestIpToUpdate = "243.80.62.181"
val sqlStatement = s"SELECT elb_name FROM elb_logs_hudi_cow WHERE request_ip = '$requestIpToUpdate'"

I execute the SQL statement to see the current value of the column:

scala> spark.sql(sqlStatement).show()
+------------+                                                                  
|    elb_name|
+------------+
|elb_demo_003|
+------------+

Then, I select and update the record:

// Create a DataFrame with a single record and update column value
val updateDF = inputDF.filter(col("request_ip") === requestIpToUpdate)
                      .withColumn("elb_name", lit("elb_demo_001"))

Now I update the Hudi dataset with a syntax similar to the one I used to create it. But this time, the DataFrame I am writing contains only one record:

// Write the DataFrame as an update to existing Hudi dataset
updateDF.write
        .format("org.apache.hudi")
        .options(hudiOptions)
        .option(DataSourceWriteOptions.OPERATION_OPT_KEY,
                DataSourceWriteOptions.UPSERT_OPERATION_OPT_VAL)
        .mode(SaveMode.Append)
        .save(hudiTablePath)

In the Spark Shell, I check the result of the update:

scala> spark.sql(sqlStatement).show()
+------------+                                                                  
|    elb_name|
+------------+
|elb_demo_001|
+------------+

Now I want to delete the same record. To delete it, I pass the EmptyHoodieRecordPayload payload in the write options:

// Write the DataFrame with an EmptyHoodieRecordPayload for deleting a record
updateDF.write
        .format("org.apache.hudi")
        .options(hudiOptions)
        .option(DataSourceWriteOptions.OPERATION_OPT_KEY,
                DataSourceWriteOptions.UPSERT_OPERATION_OPT_VAL)
        .option(DataSourceWriteOptions.PAYLOAD_CLASS_OPT_KEY,
                "org.apache.hudi.EmptyHoodieRecordPayload")
        .mode(SaveMode.Append)
        .save(hudiTablePath)

In the Spark Shell, I see that the record is no longer available:

scala> spark.sql(sqlStatement).show()
+--------+                                                                      
|elb_name|
+--------+
+--------+

How are all those updates and deletes managed by Hudi? Let’s use the Hudi Command Line Interface (CLI) to connect to the dataset and see now those changes are interpreted as commits:

This dataset is a Copy on Write dataset, that means that each time there is an update to a record, the file that contains that record will be rewritten to contain the updated values. You can see how many records have been written for each commit. The bottom line of the table describes the initial creation of the dataset, above there is the single record update, and at the top the single record delete.

With Hudi, you can roll back to each commit. For example, I can roll back the delete operation with:

hudi:elb_logs_hudi_cow->commit rollback --commit 20191104121031

In the Spark Shell, the record is now back to where it was, just after the update:

scala> spark.sql(sqlStatement).show()
+------------+                                                                  
|    elb_name|
+------------+
|elb_demo_001|
+------------+

Copy on Write is the default storage type. I can repeat the steps above to create and update a Merge on Read dataset type by adding this to our hudiOptions:

DataSourceWriteOptions.STORAGE_TYPE_OPT_KEY -> "MERGE_ON_READ"

If you update a Merge on Read dataset and look at the commits with the Hudi CLI, you can see how different Merge on Read is compared to Copy on Write. With Merge on Read, you are only writing the updated rows and not whole files as with Copy on Write. This is why Merge on Read is helpful for use cases that require more writes, or update/delete heavy workload, with a fewer number of reads. Delta commits are written to disk as Avro records (row-based storage), and compacted data is written as Parquet files (columnar storage). To avoid creating too many delta files, Hudi will automatically compact your dataset so that your reads are as performant as possible.

When a Merge On Read dataset is created, two Hive tables are created:

  • The first table matches the name of the dataset.
  • The second table has the characters _rt appended to its name; the _rt postfix stands for real-time.

When queried, the first table return the data that has been compacted, and will not show the latest delta commits. Using this table provides the best performance, but omits the freshest data. Querying the real-time table will merge the compacted data with the delta commits on read, hence this dataset being called “Merge on Read”. This will result in the freshest data being available, but incurs a performance overhead, and is not as performant as querying the compacted data. In this way, data engineers and analysts have the flexibility to choose between performance and data freshness.

Available Now
This new feature is available now in all regions with EMR 5.28.0. There is no additional cost in using Hudi with EMR. You can learn more about Hudi in the EMR documentation. This new tool can simplify the way you process, update and delete data in S3. Let me know which use cases are you going to use it for!

Danilo

Secure your data on Amazon EMR using native EBS and per bucket S3 encryption options

Post Syndicated from Duncan Chan original https://aws.amazon.com/blogs/big-data/secure-your-data-on-amazon-emr-using-native-ebs-and-per-bucket-s3-encryption-options/

Data encryption is an effective solution to bolster data security. You can make sure that only authorized users or applications read your sensitive data by encrypting your data and managing access to the encryption key. One of the main reasons that customers from regulated industries such as healthcare and finance choose Amazon EMR is because it provides them with a compliant environment to store and access data securely.

This post provides a detailed walkthrough of two new encryption options to help you secure your EMR cluster that handles sensitive data. The first option is native EBS encryption to encrypt volumes attached to EMR clusters. The second option is an Amazon S3 encryption that allows you to use different encryption modes and customer master keys (CMKs) for individual S3 buckets with Amazon EMR.

Local disk encryption on Amazon EMR

Previously you could only choose Linux Unified Key Setup (LUKS) for at-rest encryption. You now have a choice of using LUKS or native EBS encryption to encrypt EBS volumes attached to an EMR cluster. EBS encryption provides the following benefits:

  • End-to-end encryption – When you enable EBS encryption for Amazon EMR, all data on EBS volumes, including intermediate disk spills from applications and Disk I/O between the nodes and EBS volumes, are encrypted. The snapshots that you take of an encrypted EBS volume are also encrypted and you can move them between AWS Regions as needed.
  • Amazon EMR root volumes encryption – There is no need to create a custom Amazon Linux Image for encrypting root volumes.
  • Easy auditing for encryption When you use LUKS encryption, though your EBS volumes are encrypted along with any instance store volumes, you still see EBS with Not Encrypted status when you use an Amazon EC2 API or the EC2 console to check on the encryption status. This is because the API doesn’t look into the EMR cluster to check the disk status; your auditors would need to SSH into the cluster to check for disk encrypted compliance. However, with EBS encryption, you can check the encryptions status from the EC2 console or through an EC2 API call.
  • Transparent Encryption – EBS encryption is transparent to any applications running on Amazon EMR and doesn’t require you to modify any code.

Amazon EBS encryption integrates with AWS KMS to provide the encryption keys that protect your data. To use this feature, you have to use a CMK in your account and Region. A CMK gives you control to create and manage the key, including enabling and disabling the key, controlling access, rotating the key, and deleting it. For more information, see Customer Master Keys.

Enabling EBS encryption on Amazon EMR

To enable EBS encryption on Amazon EMR, complete the following steps:

  1. Create your CMK in AWS KMS.
    You can do this either through the AWS KMS console, AWS CLI, or the AWS KMS CreateKey API. Create keys in the same Region as your EMR cluster. For more information, see Creating Keys.
  2. Give the Amazon EMR service role and EC2 instance profile permission to use your CMK on your behalf.
    If you are using the EMR_DefaultRole, add the policy with the following steps:

    • Open the AWS KMS console.
    • Choose your AWS Region.
    • Choose the key ID or alias of the CMK you created.
    • On the key details page, under Key Users, choose Add.
    • Choose the Amazon EMR service role.The name of the default role is EMR_DefaultRole.
    • Choose Attach.
    • Choose the Amazon EC2 instance profile.The name of the default role for the instance profile is EMR_EC2_DefaultRole.
    • Choose Attach.
      If you are using a customized policy, add the following code to the service role to allow Amazon EMR to create and use the CMK, with the resource being the CMK ARN:

      { 
      "Version": "2012-10-17", 
      "Statement": [ 
         { 
         "Sid": "EmrDiskEncryptionPolicy", 
         "Effect": "Allow", 
         "Action": [ 
            "kms:Encrypt", 
            "kms:Decrypt", 
            "kms:ReEncrypt*", 
            "kms:CreateGrant", 
            "kms:GenerateDataKeyWithoutPlaintext", 
            "kms:DescribeKey" 
            ], 
         "Resource": [ 
            " arn:aws:kms:region:account-id:key/xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx " " 
            ] 
         } 
      ] 
      } 

       

  3. Create and configure the Amazon EMR Security configuration template.Do this either through the console or using CLI or SDK, with the following steps:
    • Open the Amazon EMR console.
    • Choose Security Configuration.
    • Under Local disk encryption, choose Enable at-rest encryption for local disks
    • For Key provider type, choose AWS KMS.
    • For AWS KMS customer master key, choose the key ARN of your CMK.This post uses the key ARN ebsEncryption_emr_default_role.
    • Select Encrypt EBS volumes with EBS encryption.

Default encryption with EC2 vs. Amazon EMR EBS encryption

EC2 has a similar feature called default encryption. With this feature, all EBS volumes in your account are encrypted without exception using a single CMK that you specify per Region. With EBS encryption from Amazon EMR, you can use different a KMS key per EMR cluster to secure your EBS volumes. You can use both EBS encryption provided by Amazon EMR and default encryption provided by EC2.

For this post, EBS encryption provided by Amazon EMR takes precedent, and you encrypt the EBS volumes attached to the cluster with the CMK that you selected in the security configuration.

S3 encryption

Amazon S3 encryption also works with Amazon EMR File System (EMRFS) objects read from and written to S3. You can use either server-side encryption (SSE) or client-side encryption (CSE) mode to encrypt objects in S3 buckets. The following table summarizes the different encryption modes available for S3 encryption in Amazon EMR.

Encryption locationKey storageKey management
SSE-S3Server side on S3S3S3
SSE-KMSServer side on S3KMS

Choose the AWS managed CMK for Amazon S3 with the alias aws/s3, or create a custom CMK.

 

CSE-KMSClient side on the EMR clusterKMSA custom CMK that you create.
CSE-CustomClient side on the EMR clusterYouYour own key provider.

The encryption choice you make depends on your specific workload requirements. Though SSE-S3 is the most straightforward option that allows you to fully delegate the encryption of S3 objects to Amazon S3 by selecting a check box, SSE-KMS or CSE-KMS are better options that give you granular control over CMKs in KMS by using policies. With AWS KMS, you can see when, where, and by whom your customer managed keys (CMK) were used, because AWS CloudTrail logs API calls for key access and key management. These logs provide you with full audit capabilities for your keys. For more information, see Encryption at Rest for EMRFS Data in Amazon S3.

Encrypting your S3 buckets with different encryption modes and keys

With S3 encryption on Amazon EMR, all the encryption modes use a single CMK by default to encrypt objects in S3. If you have highly sensitive content in specific S3 buckets, you may want to manage the encryption of these buckets separately by using different CMKs or encryption modes for individual buckets. You can accomplish this using the per bucket encryption overrides option in Amazon EMR. To do so, complete the following steps:

  1. Open the Amazon EMR console.
  2. Choose Security Configuration.
  3. Under S3 encryption, select Enable at-rest encryption for EMRFS data in Amazon S3.
  4. For Default encryption mode, choose your encryption mode.This post uses SSE-KMS.
  5. For AWS KMS customer master key, choose your key.The key you provide here encrypts all S3 buckets used with Amazon EMR. This post uses ebsEncryption_emr_default_role.
  6. Choose Per bucket encryption overrides.You can set different encryption modes for different buckets.
  7. For S3 bucket, add your S3 bucket that you want to encrypt differently.
  8. For Encryption mode, choose an encryption mode.
  9. For Encryption materials, enter your CMK.

If you have already enabled default encryption for S3 buckets directly in Amazon S3, you can also choose to bypass the S3 encryption options in the security configuration setting in Amazon EMR. This allows Amazon EMR to delegate encrypting objects in the buckets to Amazon S3, which uses the encryption key specified in the bucket policy to encrypt objects before persisting it on S3.

Summary

This post walked through the native EBS and S3 encryption options available with Amazon EMR to encrypt and secure your data. Please share your feedback on how these optimizations benefit your real-world workloads.

 


About the Author

Duncan Chan is a software development engineer for Amazon EMR. He enjoys learning and working on big data technologies. When he is not working, he will be playing with his dogs.

 

 

 

Orchestrate big data workflows with Apache Airflow, Genie, and Amazon EMR: Part 2

Post Syndicated from Francisco Oliveira original https://aws.amazon.com/blogs/big-data/orchestrate-big-data-workflows-with-apache-airflow-genie-and-amazon-emr-part-2/

Large enterprises running big data ETL workflows on AWS operate at a scale that services many internal end-users and runs thousands of concurrent pipelines. This, together with a continuous need to update and extend the big data platform to keep up with new frameworks and the latest releases of big data processing frameworks, requires an efficient architecture and organizational structure that both simplifies management of the big data platform and promotes easy access to big data applications.

In Part 1 of this post series, you learned how to use Apache Airflow, Genie, and Amazon EMR to manage big data workflows.

This post guides you through deploying the AWS CloudFormation templates, configuring Genie, and running an example workflow authored in Apache Airflow.

Prerequisites

For this walkthrough, you should have the following prerequisites:

Solution overview

This solution uses an AWS CloudFormation template to create the necessary resources.

Users access the Apache Airflow Web UI and the Genie Web UI via SSH tunnel to the bastion host.

The Apache Airflow deployment uses Amazon ElastiCache for Redis as a Celery backend, Amazon EFS as a mount point to store DAGs, and Amazon RDS PostgreSQL for database services.

Genie uses Apache Zookeeper for leader election, an Amazon S3 bucket to store configurations (binaries, application dependencies, cluster metadata), and Amazon RDS PostgreSQL for database services. Genie submits jobs to an Amazon EMR cluster.

The architecture in this post is for demo purposes. In a production environment, the Apache Airflow and the Genie instances should be part of an Auto Scaling Group. For more information, see Deployment on the Genie Reference Guide.

The following diagram illustrates the solution architecture.

Creating and storing admin passwords in AWS Systems Manager Parameter Store

This solution uses AWS Systems Manager Parameter Store to store the passwords used in the configuration scripts. With AWS Systems Manager Parameter Store, you can create secure string parameters, which are parameters that have a plaintext parameter name and an encrypted parameter value. Parameter Store uses AWS KMS to encrypt and decrypt the parameter values of secure string parameters.

Before deploying the AWS CloudFormation templates, execute the following AWS CLI commands. These commands create AWS Systems Manager Parameter Store parameters to store the passwords for the RDS master user, the Airflow DB administrator, and the Genie DB administrator.

aws ssm put-parameter --name "/GenieStack/RDS/Settings" --type SecureString --value "ch4ng1ng-s3cr3t" --region Your-AWS-Region

aws ssm put-parameter --name "/GenieStack/RDS/AirflowSettings" --type SecureString --value "ch4ng1ng-s3cr3t" --region Your-AWS-Region

aws ssm put-parameter --name "/GenieStack/RDS/GenieSettings" --type SecureString --value "ch4ng1ng-s3cr3t" --region Your-AWS-Region

Creating an Amazon S3 Bucket for the solution and uploading the solution artifacts to S3

This solution uses Amazon S3 to store all artifacts used in the solution. Before deploying the AWS CloudFormation templates, create an Amazon S3 bucket and download the artifacts required by the solution from this link.

Unzip the artifacts required by the solution and upload the airflow and genie directories to the Amazon S3 bucket you just created. Keep a record of the Amazon S3 root path because you add it as a parameter to the AWS CloudFormation template later.

As an example, the following screenshot uses the root location geniestackbucket.

Use the value of the Amazon S3 Bucket you created for the AWS CloudFormation parameters GenieS3BucketLocation and AirflowBucketLocation.

Deploying the AWS CloudFormation stack

To launch the entire solution, choose Launch Stack.

The following table explains the parameters that the template requires. You can accept the default values for any parameters not in the table. For the full list of parameters, see the AWS CloudFormation template.

ParameterValue
Location of the configuration artifactsGenieS3BucketLocationThe S3 bucket with Genie artifacts and Genie’s installation scripts. For example: geniestackbucket.
AirflowBucketLocationThe S3 bucket with the Airflow artifacts. For example: geniestackbucket.
NetworkingSSHLocationThe IP address range to SSH to the Genie, Apache Zookeeper, and Apache Airflow EC2 instances.
SecurityBastionKeyNameAn existing EC2 key pair to enable SSH access to the bastion host instance.
AirflowKeyNameAn existing EC2 key pair to enable SSH access to the Apache Airflow instance.
ZKKeyNameAn existing EC2 key pair to enable SSH access to the Apache Zookeeper instance.
GenieKeyNameAn existing EC2 key pair to enable SSH access to the Genie.
EMRKeyNameAn existing Amazon EC2 key pair to enable SSH access to the Amazon EMR cluster.
LoggingemrLogUriThe S3 location to store Amazon EMR cluster Logs. For example: s3://replace-with-your-bucket-name/emrlogs/

Post-deployment steps

To access the Apache Airflow and Genie Web Interfaces, set up an SSH and configure a SOCKS proxy for your browser. Complete the following steps:

  1. On the AWS CloudFormation console, choose the stack you created.
  2. Choose the Outputs
  3. Find the public DNS of the bastion host instance.The following screenshot shows the instance this post uses.
  4. Set up an SSH tunnel to the master node using dynamic port forwarding.
    Instead of using the master public DNS name of your cluster and the username hadoop, which the walkthrough references, use the public DNS of the bastion host instance and replace the user hadoop for the user ec2-user.
  1. Configure the proxy settings to view websites hosted on the master node.
    You do not need to modify any of the steps in the walkthrough.

This process configures a SOCKS proxy management tool that allows you to automatically filter URLs based on text patterns and limit the proxy settings to domains that match the form of the Amazon EC2 instance’s public DNS name.

Accessing the Web UI for Apache Airflow and Genie

To access the Web UI for Apache Airflow and Genie, complete the following steps:

  1. On the CloudFormation console, choose the stack you created.
  2. Choose the Outputs
  3. Find the URLs for the Apache Airflow and Genie Web UI.The following screenshot shows the URLs that this post uses.
  1. Open two tabs in your web browser. You will use the tabs for the Apache Airflow UI and the Genie UI.
  2. For the Foxy Proxy you configured previously, click the icon Foxy Proxy added to the top right section of your browser and choose Use proxies based on their predefined patterns and priorities.The following screenshot shows the proxy options.
  1. Enter the URL for the Apache Airflow Web UI and for the Genie Web UI on their respective tabs.

You are now ready to run a workflow in this solution.

Preparing application resources

The first step as a platform admin engineer is to prepare the binaries and configurations of the big data applications that the platform supports. In this post, the Amazon EMR clusters use release 5.26.0. Because Amazon EMR release 5.26.0 has Hadoop 2.8.5 and Spark 2.4.3 installed, those are the applications you want to support in the big data platform. If you decide to use a different EMR release, prepare binaries and configurations for those versions. The following sections guide you through the steps to prepare binaries should you wish to use a different EMR release version.

To prepare a Genie application resource, create a YAML file with fields that are sent to Genie in a request to create an application resource.

This file defines metadata information about the application, such as the application name, type, version, tags, the location on S3 of the setup script, and the location of the application binaries. For more information, see Create an Application in the Genie REST API Guide.

Tag structure for application resources

This post uses the following tags for application resources:

  • type – The application type, such as Spark, Hadoop, Hive, Sqoop, or Presto.
  • version – The version of the application, such as 2.8.5 for Hadoop.

The next section shows how the tags are defined in the YAML file for an application resource. You can add an arbitrary number of tags to associate with Genie resources. Genie also maintains their own tags in addition to the ones the platform admins define, which you can see in the field ID and field name of the files.

Preparing the Hadoop 2.8.5 application resource

This post provides an automated creation of the YAML file. The following code shows the resulting file details:

id: hadoop-2.8.5
name: hadoop
user: hadoop
status: ACTIVE
description: Hadoop 2.8.5 Application
setupFile: s3://Your_Bucket_Name/genie/applications/hadoop-2.8.5/setup.sh
configs: []
version: 2.8.5
type: hadoop
tags:
  - type:hadoop
  - version:2.8.5
dependencies:
  - s3://Your_Bucket_Name/genie/applications/hadoop-2.8.5/hadoop-2.8.5.tar.gz

The file is also available directly at s3://Your_Bucket_Name/genie/applications/hadoop-2.8.5/hadoop-2.8.5.yml.

NOTE: The following steps are for reference only, should you be completing this manually, rather than using the automation option provided.

The S3 objects referenced by the setupFile and dependencies labels are available in your S3 bucket. For your reference, the steps to prepare the artifacts used by properties setupFile and dependencies are as follows:

  1. Download hadoop-2.8.5.tar.gz from https://www.apache.org/dist/hadoop/core/hadoop-2.8.5/.
  2. Upload hadoop-2.8.5.tar.gz to s3://Your_Bucket_Name/genie/applications/hadoop-2.8.5/.

Preparing the Spark 2.4.3 application resource

This post provides an automated creation of the YAML file. The following code shows the resulting file details:

id: spark-2.4.3
name: spark
user: hadoop
status: ACTIVE
description: Spark 2.4.3 Application
setupFile: s3://Your_Bucket_Name/genie/applications/spark-2.4.3/setup.sh
configs: []
version: 2.4.3
type: spark
tags:
  - type:spark
  - version:2.4.3
  - version:2.4
dependencies:
  - s3://Your_Bucket_Name/genie/applications/spark-2.4.3/spark-2.4.3-bin-hadoop2.7.tgz

The file is also available directly at s3://Your_Bucket_Name/genie/applications/spark-2.4.3/spark-2.4.3.yml.

NOTE: The following steps are for reference only, should you be completing this manually, rather than using the automation option provided.

The objects in setupFile and dependencies are available in your S3 bucket. For your reference, the steps to prepare the artifacts used by properties setupFile and dependencies are as follows:

  1. Download spark-2.4.3-bin-hadoop2.7.tgz from https://archive.apache.org/dist/spark/spark-2.4.3/ .
  2. Upload spark-2.4.3-bin-hadoop2.7.tgz to s3://Your_Bucket_Name/genie/applications/spark-2.4.3/ .

Because spark-2.4.3-bin-hadoop2.7.tgz uses Hadoop 2.7 and not Hadoop 2.8.3, you need to extract the EMRFS libraries for Hadoop 2.7 from an EMR cluster running Hadoop 2.7 (release 5.11.3). This is already available in your S3 Bucket. For reference, the steps to extract the EMRFS libraries are as follows:

  1. Deploy an EMR cluster with release 5.11.3.
  2. Run the following command:
aws s3 cp /usr/share/aws/emr/emrfs/lib/emrfs-hadoop-assembly-2.20.0.jar s3://Your_Bucket_Name/genie/applications/spark-2.4.3/hadoop-2.7/aws/emr/emrfs/lib/

Preparing a command resource

The next step as a platform admin engineer is to prepare the Genie commands that the platform supports.

In this post, the workflows use Apache Spark. This section shows the steps to prepare a command resource of type Apache Spark.

To prepare a Genie command resource, create a YAML file with fields that are sent to Genie in a request to create a command resource.

This file defines metadata information about the command, such as the command name, type, version, tags, the location on S3 of the setup script, and the parameters to use during command execution. For more information, see Create a Command in the Genie REST API Guide.

Tag structure for command resources

This post uses the following tag structure for command resources:

  • type – The command type, for example, spark-submit.
  • version – The version of the command, for example, 2.4.3 for Spark.

The next section shows how the tags are defined in the YAML file for a command resource. Genie also maintains their own tags in addition to the ones the platform admins define, which you can see in the field ID and field name of the files.

Preparing the spark-submit command resource

This post provides an automated creation of the YAML file. The following code shows the resulting file details:

id: spark-2.4.3_spark-submit
name: Spark Submit 
user: hadoop 
description: Spark Submit Command 
status: ACTIVE 
setupFile: s3://Your_Bucket_Name/genie/commands/spark-2.4.3_spark-submit/setup.sh
configs: [] 
executable: ${SPARK_HOME}/bin/spark-submit --master yarn --deploy-mode client 
version: 2.4.3 
tags:
  - type:spark-submit
  - version:2.4.3
checkDelay: 5000

The file is also available at s3://Your_Bucket_Name/genie/commands/spark-2.4.3_spark-submit/spark-2.4.3_spark-submit.yml.

The objects in setupFile are available in your S3 bucket.

Preparing cluster resources

This post also automated the step to prepare cluster resources; it follows a similar process as described previously but applied to cluster resources.

During the startup of the Amazon EMR cluster, a custom script creates a YAML file with the metadata details about the cluster and uploads the file to S3. For more information, see Create a Cluster in the Genie REST API Guide.

The script also extracts all Amazon EMR libraries and uploads them to S3. The next section discusses the process of registering clusters with Genie.

The script is available at s3://Your_Bucket_Name/genie/scripts/genie_register_cluster.sh.

Tag structure for cluster resources

This post uses the following tag structure for cluster resources:

  • cluster.release – The Amazon EMR release name. For example, emr-5.26.0.
  • cluster.id – The Amazon EMR cluster ID. For example, j-xxxxxxxx.
  • cluster.name – The Amazon EMR cluster name.
  • cluster.role – The role associated with this cluster. For this post, the role is batch. Other possible roles would be ad hoc or Presto, for example.

You can add new tags for a cluster resource or change the values of existing tags by editing s3://Your_Bucket_Name/genie/scripts/genie_register_cluster.sh.

You could also use other combinations of tags, such as a tag to identify the application lifecycle environment or required custom jars.

Genie also maintains their own tags in addition to the ones the platform admins define, which you can see in the field ID and field name of the files. If multiple clusters share the same tag, by default, Genie distributes jobs across clusters associated with the same tag randomly. For more information, see Cluster Load Balancing in the Genie Reference Guide.

Registering resources with Genie

Up to this point, all the configuration activities mentioned in the previous sections were already prepared for you.

The following sections show how to register resources with Genie. In this section you will be connecting to the bastion via SSH to run configuration commands.

Registering application resources

To register the application resources you prepared in the previous section, SSH into the bastion host and run the following command:

python /tmp/genie_assets/scripts/genie_register_application_resources.py Replace-With-Your-Bucket-Name Your-AWS-Region http://replace-with-your-genie-server-url:8080

To see the resource information, navigate to the Genie Web UI and choose the Applications tab. See the following screenshot. The screenshot shows two application resources, one for Apache Spark (version 2.4.3) and the other for Apache Hadoop (version 2.8.5).

Registering commands and associate commands with applications

The next step is to register the Genie command resources with specific applications. For this post, because spark-submit depends on Apache Hadoop and Apache Spark, associate the spark-submit command with both applications.

The order you define for the applications in file genie_register_command_resources_and_associate_applications.py is important. Because Apache Spark depends on Apache Hadoop, the file first references Apache Hadoop and then Apache Spark. See the following code:

commands = [{'command_name' : 'spark-2.4.3_spark-submit', 'applications' : ['hadoop-2.8.5', 'spark-2.4.3']}]

To register the command resources and associate them with the application resources registered in the previous step, SSH into the bastion host and run the following command:

python /tmp/genie_assets/scripts/genie_register_command_resources_and_associate_applications.py Replace-With-Your-Bucket-Name Your-AWS-Region http://replace-with-your-genie-server-url:8080

To see the command you registered plus the applications it is linked to, navigate to the Genie Web UI and choose the Commands tab.

The following screenshot shows the command details and the applications it is linked to.

Registering Amazon EMR clusters

As previously mentioned, the Amazon EMR cluster deployed in this solution registers the cluster when the cluster starts via an Amazon EMR step. You can access the script that Amazon EMR clusters use at s3://Your_Bucket_Name/genie/scripts/genie_register_cluster.sh. The script also automates deregistering the cluster from Genie when the cluster terminates.

In the Genie Web UI, choose the Clusters tab. This page shows you the current cluster resources. You can also find the location of the configuration files that uploaded to the cluster S3 location during the registration step.

The following screenshot shows the cluster details and the location of configuration files (yarn-site.xml, core-site.xml, mapred-site.xml).

Linking commands to clusters

You have now registered all applications, commands, and clusters, and associated commands with the applications on which they depend. The final step is to link a command to a specific Amazon EMR cluster that is configured to run it.

Complete the following steps:

  1. SSH into the bastion host.
  2. Open /tmp/genie_assets/scripts/genie_link_commands_to_clusters.py with your preferred text editor.
  3. Look for the following lines in the code:# Change cluster_name below
    clusters = [{'cluster_name' : 'j-xxxxxxxx', 'commands' :
    ['spark-2.4.3_spark-submit']}]
  1. Replace j-xxxxxxxx in the file with the cluster_name.
    To see the name of the cluster, navigate to the Genie Web UI and choose Clusters.
  2. To link the command to a specific Amazon EMR cluster run the following command:
    python /tmp/genie_assets/scripts/genie_link_commands_to_clusters.py Replace-With-Your-Bucket-Name Your-AWS-Region http://replace-with-your-genie-server-url:8080

The command is now linked to a cluster.

In the Genie Web UI, choose the Commands tab. This page shows you the current command resources. Select spark-2.4.3_spark_submit and see the clusters associated with the command.

The following screenshot shows the command details and the clusters it is linked to.

You have configured Genie with all resources; it can now receive job requests.

Running an Apache Airflow workflow

It is out of the scope of this post to provide a detailed description of the workflow code and dataset. This section provides details of how Apache Airflow submits jobs to Genie via a GenieOperator that this post provides.

The GenieOperator for Apache Airflow

The GenieOperator allows the data engineer to define the combination of tags to identify the commands and the clusters in which the tasks should run.

In the following code example, the cluster tag is ‘emr.cluster.role:batch‘ and the command tags are ‘type:spark-submit‘ and ‘version:2.4.3‘.

spark_transform_to_parquet_movies = GenieOperator(
    task_id='transform_to_parquet_movies',
    cluster_tags=['emr.cluster.role:batch'],
    command_tags=['type:spark-submit', 'version:2.4.3'],
    command_arguments="transform_to_parquet.py s3://{}/airflow/demo/input/csv/{}  s3://{}/demo/output/parquet/{}/".format(bucket,'movies.csv',bucket,'movies'), dependencies="s3://{}/airflow/dag_artifacts/transforms/transform_to_parquet.py".format(bucket),
    description='Demo Spark Submit Job',
    job_name="Genie Demo Spark Submit Job",
    tags=['transform_to_parquet_movies'],
    xcom_vars=dict(),
    retries=3,
    dag=extraction_dag)

The property command_arguments defines the arguments to the spark-submit command, and dependencies defines the location of the code for the Apache Spark Application (PySpark).

You can find the code for the GenieOperator in the following location: s3://Your_Bucket_Name/airflow/plugins/genie_plugin.py.

One of the arguments to the DAG is the Genie connection ID (genie_conn_id). This connection was created during the automated setup of the Apache Airflow Instance. To see this and other existing connections, complete the following steps:

  1. In the Apache Airflow Web UI, choose the Admin
  2. Choose Connections.

The following screenshot shows the connection details.

The Airflow variable s3_location_genie_demo reference in the DAG was set during the installation process. To see all configured Apache Airflow variables, complete the following steps:

  1. In the Apache Airflow Web UI, choose the Admin
  2. Choose Variables.

The following screenshot shows the Variables page.

Triggering the workflow

You can now trigger the execution of the movie_lens_transfomer_to_parquet DAG. Complete the following steps:

  1. In the Apache Airflow Web UI, choose the DAGs
  2. Next to your DAG, change Off to On.

The following screenshot shows the DAGs page.

For this example DAG, this post uses a small subset of the movielens dataset. This dataset is a popular open source dataset, which you can use in exploring data science algorithms. Each dataset file is a comma-separated values (CSV) file with a single header row. All files are available in your solution S3 bucket under s3://Your_Bucket_Name/airflow/demo/input/csv .

movie_lens_transfomer_to_parquet is a simple workflow that triggers a Spark job that converts input files from CSV to Parquet.

The following screenshot shows a graphical representation of the DAG.

In this example DAG, after transform_to_parquet_movies concludes, you can potentially execute four tasks in parallel. Because the DAG concurrency is set to 3, as seen in the following code example, only three tasks can run at the same time.

# Initialize the DAG
# Concurrency --> Number of tasks allowed to run concurrently
extraction_dag = DAG(dag_name,
          default_args=dag_default_args,
          start_date=start_date,
          schedule_interval=schedule_interval,
          concurrency=3,
          max_active_runs=1
          )

Visiting the Genie job UI

The GenieOperator for Apache Airflow submitted the jobs to Genie. To see job details, in the Genie Web UI, choose the Jobs tab. You can see details such as the jobs submitted, their arguments, the cluster it is running, and the job status.

The following screenshot shows the Jobs page.

You can now experiment with this architecture by provisioning a new Amazon EMR cluster, registering it with a new value (for example, “production”) for Genie tag “emr.cluster.role”, linking the cluster to a command resource, and updating the tag combination in the GenieOperator used by some of the tasks in the DAG.

Cleaning up

To avoid incurring future charges, delete the resources and the S3 bucket created for this post.

Conclusion

This post showed how to deploy an AWS CloudFormation template that sets up a demo environment for Genie, Apache Airflow, and Amazon EMR. It also demonstrated how to configure Genie and use the GenieOperator for Apache Airflow.

 


About the Authors

Francisco Oliveira is a senior big data solutions architect with AWS. He focuses on building big data solutions with open source technology and AWS. In his free time, he likes to try new sports, travel and explore national parks.

 

 

 

Jelez Raditchkov is a practice manager with AWS.

 

 

 

 

Prasad Alle is a Senior Big Data Consultant with AWS Professional Services. He spends his time leading and building scalable, reliable Big data, Machine learning, Artificial Intelligence and IoT solutions for AWS Enterprise and Strategic customers. His interests extend to various technologies such as Advanced Edge Computing, Machine learning at Edge. In his spare time, he enjoys spending time with his family.

Orchestrate big data workflows with Apache Airflow, Genie, and Amazon EMR: Part 1

Post Syndicated from Francisco Oliveira original https://aws.amazon.com/blogs/big-data/orchestrate-big-data-workflows-with-apache-airflow-genie-and-amazon-emr-part-1/

Large enterprises running big data ETL workflows on AWS operate at a scale that services many internal end-users and runs thousands of concurrent pipelines. This, together with a continuous need to update and extend the big data platform to keep up with new frameworks and the latest releases of big data processing frameworks, requires an efficient architecture and organizational structure that both simplifies management of the big data platform and promotes easy access to big data applications.

This post introduces an architecture that helps centralized platform teams maintain a big data platform to service thousands of concurrent ETL workflows, and simplifies the operational tasks required to accomplish that.

Architecture components

At high level, the architecture uses two open source technologies with Amazon EMR to provide a big data platform for ETL workflow authoring, orchestration, and execution. Genie provides a centralized REST API for concurrent big data job submission, dynamic job routing, central configuration management, and abstraction of the Amazon EMR clusters. Apache Airflow provides a platform for job orchestration that allows you to programmatically author, schedule, and monitor complex data pipelines. Amazon EMR provides a managed cluster platform that can run and scale Apache Hadoop, Apache Spark, and other big data frameworks.

The following diagram illustrates the architecture.

Apache Airflow

Apache Airflow is an open source tool for authoring and orchestrating big data workflows.

With Apache Airflow, data engineers define direct acyclic graphs (DAGs). DAGs describe how to run a workflow and are written in Python. Workflows are designed as a DAG that groups tasks that are executed independently. The DAG keeps track of the relationships and dependencies between tasks.

Operators define a template to define a single task in the workflow. Airflow provides operators for common tasks, and you can also define custom operators. This post discusses the custom operator (GenieOperator) to submit tasks to Genie.

A task is a parameterized instance of an operator. After an operator is instantiated, it’s referred to as a task. A task instance represents a specific run of a task. A task instance has an associated DAG, task, and point in time.

You can run DAGs and tasks on demand or schedule them to run at a specific time defined as a cron expression in the DAG.

For additional details on Apache Airflow, see Concepts in the Apache Airflow documentation.

Genie

Genie is an open source tool by Netflix that provides configuration-management capabilities and dynamic routing of jobs by abstracting access to the underlining Amazon EMR clusters.

Genie provides a REST API to submit jobs from big data applications such as Apache Hadoop MapReduce or Apache Spark. Genie manages the metadata of the underlining clusters and the commands and applications that run in the clusters.

Genie abstracts access to the processing clusters by associating one or more tags with the clusters. You can also associate tags with the metadata details for the applications and commands that the big data platform supports. As Genie receives job submissions for specific tags, it uses a combination of the cluster/command tag to route each job to the correct EMR cluster dynamically.

Genie’s data model

Genie provides a data model to capture the metadata associated with resources in your big data environment.

An application resource is a reusable set of binaries, configuration files, and setup files to install and configure applications supported by the big data platform on the Genie node that submits the jobs to the clusters. When Genie receives a job, the Genie node downloads all dependencies, configuration files, and setup files associated with the applications and stores it in a job working directory. Applications are linked to commands because they represent the binaries and configurations needed before a command runs.

Command resources represent the parameters when using the command line to submit work to a cluster and which applications need to be available on the PATH to run the command. Command resources glue metadata components together. For example, a command resource representing a Hive command would include a hive-site.xml and be associated with a set of application resources that provide the Hive and Hadoop binaries needed to run the command. Moreover, a command resource is linked to the clusters it can run on.

A cluster resource identifies the details of an execution cluster, including connection details, cluster status, tags, and additional properties. A cluster resource can register with Genie during startup and deregister during termination automatically. Clusters are linked to one or more commands that can run in it. After a command is linked to a cluster, Genie can start submitting jobs to the cluster.

Lastly, there are three job resource types: job request, job, and job execution. A job request resource represents the request submission with details to run a job. Based on the parameters submitted in the request, a job resource is created. The job resource captures details such as the command, cluster, and applications associated with the job. Additionally, information on status, start time, and end time is also available on the job resource. A job execution resource provides administrative details so you can understand where the job ran.

For more information, see Data Model on the Genie Reference Guide.

Amazon EMR and Amazon S3

Amazon EMR is a managed cluster platform that simplifies running big data frameworks, such as Apache Hadoop and Apache Spark, on AWS to process and analyze vast amounts of data. For more information, see Overview of Amazon EMR Architecture and Overview of Amazon EMR.

Data is stored in Amazon S3, an object storage service with scalable performance, ease-of-use features, and native encryption and access control capabilities. For more details on S3, see Amazon S3 as the Data Lake Storage Platform.

Architecture deep dive

Two main actors interact with this architecture: platform admin engineers and data engineers.

Platform admin engineers have administrator access to all components. They can add or remove clusters, and configure the applications and the commands that the platform supports.

Data engineers focus on writing big data applications with their preferred frameworks (Apache Spark, Apache Hadoop MR, Apache Sqoop, Apache Hive, Apache Pig, and Presto) and authoring python scripts to represent DAGs.

At high level, the team of platform admin engineers prepares the supported big data applications and its dependencies and registers them with Genie. The team of platform admin engineers launches Amazon EMR clusters that register with Genie during startup.

The team of platform admin engineers associates each Genie metadata resource (applications, commands, and clusters) with Genie tags. For example, you can associate a cluster resource with a tag named environment and the value can be “Production Environment”, “Test Environment”, or “Development Environment”.

Data engineers author workflows as Airflow DAGs and use a custom Airflow Operator—GenieOperator—to submit tasks to Genie. They can use a combination of tags to identify the type of tasks they are running plus where the tasks should run. For example, you might need to run Apache Spark 2.4.3 tasks in the environment identified by the “Production Environment” tag. To do this, set the cluster and command tags in the Airflow GenieOperator as the following code:

(cluster_tags=['emr.cluster.environment:production'],command_tags=['type:spark-submit','ver:2.4.3'])

The following diagram illustrates this architecture.

The workflow, as it corresponds to the numbers in this diagram are as follows:

  1. A platform admin engineer prepares the binaries and dependencies of the supported applications (Spark-2.4.5, Spark-2.1.0, Hive-2.3.5, etc.). The platform admin engineer also prepares commands (spark-submit, hive). The platform admin engineer registers applications and commands with Genie. Moreover, the platform admin engineer associates commands with applications and links commands to a set of clusters after step 2 (below) is concluded.
  2. Amazon EMR cluster(s) register with Genie during startup.
  3. A data engineer authors Airflow DAGs and uses the Genie tag to reference the environment, application, command or any combination of the above. In the workflow code, the data engineer uses the GenieOperator. The GenieOperator submits jobs to Genie.
  4. A schedule triggers workflow execution or a data engineer manually triggers the workflow execution. The jobs that compose the workflow are submitted to Genie for execution with a set of Genie tags that specify where the job should be run.
  5. The Genie node, working as the client gateway, will set up a working directory with all binaries and dependencies. Genie dynamically routes the jobs to the cluster(s) associated with the provided Genie tags. The Amazon EMR clusters run the jobs.

For details on the authorization and authentication mechanisms supported by Apache Airflow and Genie see Security in the Apache Airflow documentation and Security in the Genie documentation.  This architecture pattern does not expose SSH access to the Amazon EMR clusters. For details on providing different levels of access to data in Amazon S3 through EMR File System (EMRFS), see Configure IAM Roles for EMRFS Requests to Amazon S3.

Use cases enabled by this architecture

The following use cases demonstrate the capabilities this architecture provides.

Managing upgrades and deployments with no downtime and adopting the latest open source release

In a large organization, teams that use the data platform use heterogeneous frameworks and different versions. You can use this architecture to support upgrades with no downtime and offer the latest version of open source frameworks in a short amount of time.

Genie and Amazon EMR are the key components to enable this use case. As the Amazon EMR service team strives to add the latest version of the open source frameworks running on Amazon EMR in a short release cycle, you can keep up with your internal teams’ needs of the latest features of their preferred open source framework.

When a new version of the open source framework is available, you need to test it, add the new supported version and its dependencies to Genie, and move tags in the old cluster to the new one. The new cluster takes new job submissions, and the old cluster concludes jobs it is still running.

Moreover, because Genie centralizes the location of application binaries and its dependencies, upgrading binaries and dependencies in Genie also upgrades any upstream client automatically. Using Genie removes the need for upgrading all upstream clients.

Managing a centralized configuration, job and cluster status, and logging

In a universe of thousands of jobs and multiple clusters, you need to identify where a specific job is running and access logging details quickly. This architecture gives you visibility into jobs running on the data platform, logging of jobs, clusters, and their configurations.

Having programmatic access to the big data platform

This architecture enables a single point of job submissions by using Genie’s REST API. Access to the underlying cluster is abstracted through a set of APIs that enable administration tasks plus submitting jobs to the clusters. A REST API call submits jobs into Genie asynchronously. If accepted, a job-id is returned that you can use to get job status and outputs programmatically via API or web UI. A Genie node sets up the working directory and runs the job on a separate process.

You can also integrate this architecture with continuous integration and continuous delivery (CI/CD) pipelines for big data application and Apache Airflow DAGs.

Enabling scalable client gateways and concurrent job submissions

The Genie node acts as a client gateway (edge node) and can scale horizontally to make sure the client gateway resources used to submit jobs to the data platform meet demand. Moreover, Genie allows the submission of concurrent jobs.

When to use this architecture

This architecture is recommended for organizations that use multiple large, multi-tenant processing clusters instead of transient clusters. It is out of the scope of this post to address when organizations should consider always-on clusters versus transient clusters (you can use an EMR Airflow Operator to spin up Amazon EMR clusters that register with Genie, run a job, and tear them down). You should use Reserved Instances with this architecture. For more information, see Using Reserved Instances.

This architecture is especially recommended for organizations that choose to have a central platform team to administer and maintain a big data platform that supports many internal teams that require thousands of jobs to run concurrently.

This architecture might not make sense for organizations that are not at as large or don’t expect to grow to that scale. The benefits of cluster abstraction and centralized configuration management are ideal in bringing structured access to a potentially chaotic environment of thousands of concurrent workflows and hundreds of teams.

This architecture is also recommended for organizations that support a high percentage of multi-hour or overlapping workflows and heterogeneous frameworks (Apache Spark, Apache Hive, Apache Pig, Apache Hadoop MapReduce, Apache Sqoop, or Presto).

If your organization relies solely on Apache Spark and is aligned with the recommendations discussed previously, this architecture might still apply. For organizations that don’t have the scale to justify the need for centralized REST API for job submission, cluster abstraction, dynamic job routing, or centralized configuration management, Apache Livy plus Amazon EMR might be the appropriate option. Genie has its own scalable infrastructure that acts as the edge client. This means that Genie does not compete with Amazon EMR master instance resources, whereas Apache Livy does.

If the majority of your organization’s workflows are a few short-lived jobs, opting for a serverless processing layer, serverless ad hoc querying layer, or using dedicated transient Amazon EMR clusters per workflow might be more appropriate. If the majority of your organization’s workflows are composed of thousands of short-lived jobs, the architecture still applies because it removes the need to spin up and down clusters.

This architecture is recommended for organizations that require full control of the processing platform to optimize component performance. Moreover, this architecture is recommended for organizations that need to enforce centralized governance on their workflows via CI/CD pipelines.

It is out of the scope of this post to evaluate different orchestration options or the benefits of adopting Airflow as the orchestration layer. When considering adopting an architecture, also consider the existing skillset and time to adopt tooling. The open source nature of Genie may allow you to integrate other orchestration tools. Evaluating that route might be an option if you wish to adopt this architecture with another orchestration tool.

Conclusion

This post presented how to use Apache Airflow, Genie, and Amazon EMR to manage big data workflows. The post described the architecture components, the use cases the architecture supports, and when to use it. The second part of this post deploys a demo environment and walks you through the steps to configure Genie and use the GenieOperator for Apache Airflow.

 


About the Author

Francisco Oliveira is a senior big data solutions architect with AWS. He focuses on building big data solutions with open source technology and AWS. In his free time, he likes to try new sports, travel and explore national parks.

 

 

 

Jelez Raditchkov is a practice manager with AWS.

Install Python libraries on a running cluster with EMR Notebooks

Post Syndicated from Parag Chaudhari original https://aws.amazon.com/blogs/big-data/install-python-libraries-on-a-running-cluster-with-emr-notebooks/

Last year, AWS introduced EMR Notebooks, a managed notebook environment based on the open-source Jupyter notebook application.

This post discusses installing notebook-scoped libraries on a running cluster directly via an EMR Notebook. Before this feature, you had to rely on bootstrap actions or use custom AMI to install additional libraries that are not pre-packaged with the EMR AMI when you provision the cluster. This post also discusses how to use the pre-installed Python libraries available locally within EMR Notebooks to analyze and plot your results. This capability is useful in scenarios in which you don’t have access to a PyPI repository but need to analyze and visualize a dataset.

Benefits of using notebook-scoped libraries with EMR Notebooks

Notebook-scoped libraries provide you the following benefits:

  • Runtime installation – You can import your favorite Python libraries from PyPI repositories and install them on your remote cluster on the fly when you need them. These libraries are instantly available to your Spark runtime environment. There is no need to restart the notebook session or recreate your cluster.
  • Dependency isolation – The libraries you install using EMR Notebooks are isolated to your notebook session and don’t interfere with bootstrapped cluster libraries or libraries installed from other notebook sessions. These notebook-scoped libraries take precedence over bootstrapped libraries. Multiple notebook users can import their preferred version of the library and use it without dependency clashes on the same cluster.
  • Portable library environment – The library package installation happens from your notebook file. This allows you to recreate the library environment when you switch the notebook to a different cluster by re-executing the notebook code. At the end of the notebook session, the libraries you install through EMR Notebooks are automatically removed from the hosting EMR cluster.

Prerequisites

To use this feature in EMR Notebooks, you need a notebook attached to a cluster running EMR release 5.26.0 or later. The cluster should have access to the public or private PyPI repository from which you want to import the libraries. For more information, see Creating a Notebook.

There are different ways to configure your VPC networking to allow clusters inside the VPC to connect to an external repository. For more information, see Scenarios and Examples in the Amazon VPC User Guide.

Using notebook-scoped libraries

This post demonstrates the notebook-scoped libraries feature of EMR Notebooks by analyzing the publicly available Amazon customer reviews dataset for books. For more information, see Amazon Customer Reviews Dataset on the Registry of Open Data for AWS.

Open your notebook and make sure the kernel is set to PySpark. Run the following command from the notebook cell:

print("Welcome to my EMR Notebook!")

You get the following output:

Output shows newly created spark session.

You can examine the current notebook session configuration by running the following command:

%%info

You get the following output:

Output shows spark session properties which include Python version and properties to enable this new feature.

The notebook session is configured for Python 3 by default (through spark.pyspark.python). If you prefer to use Python 2, reconfigure your notebook session by running the following command from your notebook cell:

%%configure -f { "conf":{ "spark.pyspark.python": "python", 
                "spark.pyspark.virtualenv.enabled": "true", 
                "spark.pyspark.virtualenv.type":"native", 
                "spark.pyspark.virtualenv.bin.path":"/usr/bin/virtualenv" }}

You can also verify the Python version used in your current notebook session by running the following code:

import sys
sys.version

You get the following output:

Output shows Python version 3.6.8

Before starting your analysis, check the libraries that are already available on the cluster. You can do this using the list_packages() PySpark API, which lists all the Python libraries on the cluster. Run the following code:

sc.list_packages()

You get an output similar to the following code, which shows all the available Python 3-compatible packages on your cluster:

Output shows list of Python packages along with their versions.

Load the Amazon customer reviews data for books into a Spark DataFrame with the following code:

df = spark.read.parquet('s3://amazon-reviews-pds/parquet/product_category=Books/*.parquet')

You are now ready to explore the data. Determine the schema and number of available columns in your dataset with the following code:

# Total columns
print(f'Total Columns: {len(df.dtypes)}')
df.printSchema()

The following code is the output:

Output shows that Spark DataFrame has 15 columns. It also shows the schema of the Spark DataFrame.

This dataset has a total of 15 columns. You can also check the total rows in your dataset by running the following code:

# Total row
print(f'Total Rows: {df.count():,}')

You get the following output:

Output shows that Spark DataFrame has more than 20 million rows.

Check the total number of books with the following code:

# Total number of books
num_of_books = df.select('product_id').distinct().count()
print(f'Number of Books: {num_of_books:,}')

You get the following output:

Output shows that there are more than 3 million books.

You can also analyze the number of book reviews by year and find the distribution of customer ratings. To do this, import the Pandas library version 0.25.1 and the latest Matplotlib library from the public PyPI repository. Install them on the cluster attached to your notebook using the install_pypi_package API. See the following code:

sc.install_pypi_package("pandas==0.25.1") #Install pandas version 0.25.1 
sc.install_pypi_package("matplotlib", "https://pypi.org/simple") #Install matplotlib from given PyPI repository

You get the following output:

Output shows that “pandas” and “matplotlib” packages are successfully installed.

The install_pypi_package PySpark API installs your libraries along with any associated dependencies. By default, it installs the latest version of the library that is compatible with the Python version you are using. You can also install a specific version of the library by specifying the library version from the previous Pandas example.

Verify that your imported packages successfully installed by running the following code:

sc.list_packages()

You get the following output:

Output shows list of Python packages along with their versions.

You can also analyze the trend for the number of reviews provided across multiple years. Use ‘toPandas()’ to convert the Spark data frame to a Pandas data frame, which you can visualize with Matplotlib. See the following code:

# Number of reviews across years
num_of_reviews_by_year = df.groupBy('year').count().orderBy('year').toPandas()

import matplotlib.pyplot as plt
plt.clf()
num_of_reviews_by_year.plot(kind='area', x='year',y='count', rot=70, color='#bc5090', legend=None, figsize=(8,6))
plt.xticks(num_of_reviews_by_year.year)
plt.xlim(1995, 2015)
plt.title('Number of reviews across years')
plt.xlabel('Year')
plt.ylabel('Number of Reviews')

The preceding commands render the plot on the attached EMR cluster. To visualize the plot within your notebook, use %matplot magic. See the following code:

%matplot plt

The following graph shows that the number of reviews provided by customers increased exponentially from 1995 to 2015. Interestingly, 2001, 2002, and 2015 are outliers, when the number of reviews dropped from the previous years.

A line chart showing number of reviews across years.

You can analyze the distribution of star ratings and visualize it using a pie chart. See the following code:

# Distribution of overall star ratings
product_ratings_dist = df.groupBy('star_rating').count().orderBy('count').toPandas()

plt.clf()
labels = [f"Star Rating: {rating}" for rating in product_ratings_dist['star_rating']]
reviews = [num_reviews for num_reviews in product_ratings_dist['count']]
colors = ['#00876c', '#89c079', '#fff392', '#fc9e5a', '#de425b']
fig, ax = plt.subplots(figsize=(8,5))
w,a,b = ax.pie(reviews, autopct='%1.1f%%', colors=colors)
plt.title('Distribution of star ratings for books')
ax.legend(w, labels, title="Star Ratings", loc="center left", bbox_to_anchor=(1, 0, 0.5, 1))

Print the pie chart using %matplot magic and visualize it from your notebook with the following code:

%matplot plt

The following pie chart shows that 80% of users gave a rating of 4 or higher. Approximately 10% of users rated their books 2 or lower. In general, customers are happy about their book purchases from Amazon.

Output shows pie chart depicting distribution of star ratings.

Lastly, use the ‘uninstall_package’ Pyspark API to uninstall the Pandas library that you installed using the install_package API. This is useful in scenarios in which you want to use a different version of a library that you previously installed using EMR Notebooks. See the following code:

sc.uninstall_package('pandas')

You get the following output:

Output shows that “pandas” package is successfully uninstalled.

Next, run the following code:

sc.list_packages()

You get the following output:

Output shows list of Python packages along with their versions.

After closing your notebook, the Pandas and Matplot libraries that you installed on the cluster using the install_pypi_package API are garbage and collected out of the cluster.

Using local Python libraries in EMR Notebooks

The notebook-scoped libraries discussed previously require your EMR cluster to have access to a PyPI repository. If you cannot connect your EMR cluster to a repository, use the Python libraries pre-packaged with EMR Notebooks to analyze and visualize your results locally within the notebook. Unlike the notebook-scoped libraries, these local libraries are only available to the Python kernel and are not available to the Spark environment on the cluster. To use these local libraries, export your results from your Spark driver on the cluster to your notebook and use the notebook magic to plot your results locally. Because you are using the notebook and not the cluster to analyze and render your plots, the dataset that you export to the notebook has to be small (recommend less than 100 MB).

To see the list of local libraries, run the following command from the notebook cell:

%%local
conda list

You get a list of all the libraries available in the notebook. Because the list is rather long, this post doesn’t include them.

For this analysis, find out the top 10 children’s books from your book reviews dataset and analyze the star rating distribution for these children’s books.

You can identify the children’s books by using customers’ written reviews with the following code:

kids_books = (
df
.where("lower(review_body) LIKE '%child%' OR lower(review_body) LIKE '%kid%' OR lower(review_body) LIKE '%infant%'OR lower(review_body) LIKE '%Baby%'")
.select("customer_id", "product_id", "star_rating", "product_title", "year")
)

Plot the top 10 children’s books by number of customer reviews with the following code:

top_10_book_titles = kids_books.groupBy('product_title') \
                       .count().orderBy('count', ascending=False) \
                       .limit(10)
top_10_book_titles.show(10, False)

You get the following output:

Output shows list of book titles with their corresponding review count.

Analyze the customer rating distribution for these books with the following code:

top_10 = kids_books.groupBy('product_title', 'star_rating') \
           .count().join(top_10_book_titles, ['product_title'], 'leftsemi') \
           .orderBy('count', ascending=False) 
top_10.show(truncate=False)

You get the following output:

Output shows list of book titles along with start ratings and review counts.

To plot these results locally within your notebook, export the data from the Spark driver and cache it in your local notebook as a Pandas DataFrame. To achieve this, first register a temporary table with the following code:

top_10.createOrReplaceTempView("top_10_kids_books")

Use the local SQL magic to extract the data from this table with the following code:

%%sql -o top_10 -n -1
SELECT product_title, star_rating, count from top_10_kids_books
GROUP BY product_title, star_rating, count
ORDER BY count Desc

For more information about these magic commands, see the GitHub repo.

After you execute the code, you get a user-interface to interactively plot your results. The following pie chart shows the distribution of ratings:

Output shows pie chart depicting distribution of star ratings.

You can also plot more complex charts by using local Matplot and seaborn libraries available with EMR Notebooks. See the following code:

%%local 
%matplotlib inline
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
sns.set()
plt.clf()
top_10['book_name'] = top_10['product_title'].str.slice(0,30)
colormap = sns.color_palette("hls", 8)
pivot_df = top_10.pivot(index= 'book_name', columns='star_rating', values='count')
pivot_df.plot.barh(stacked=True, color = colormap, figsize=(15,11))
plt.title('Top 10 children books',fontsize=16)
plt.xlabel('Number of reviews',fontsize=14)
plt.ylabel('Book',fontsize=14)
plt.subplots_adjust(bottom=0.2)

You get the following output:

Output shows stacked bar chart for top 10 children books with star ratings and review counts.

Summary

This post showed how to use the notebook-scoped libraries feature of EMR Notebooks to import and install your favorite Python libraries at runtime on your EMR cluster, and use these libraries to enhance your data analysis and visualize your results in rich graphical plots. The post also demonstrated how to use the pre-packaged local Python libraries available in EMR Notebook to analyze and plot your results.

 


About the Author

Parag Chaudhari is a software development engineer at AWS.

 

 

 

Secure your Amazon EMR cluster from unintentional network exposure with Block Public Access configuration

Post Syndicated from Vignesh Rajamani original https://aws.amazon.com/blogs/big-data/secure-your-amazon-emr-cluster-from-unintentional-network-exposure-with-block-public-access-configuration/

AWS security groups act as a network firewall that allows you to control access to your cluster to only whitelisted IP addresses. Proper management of security groups rules is critical to protect your application and data on the cluster. Amazon EMR strongly recommends creating restrictive security group rules that include the necessary network ports, protocols, and IP addresses based on your application requirements.

While AWS account administrators can protect cloud network security in different ways, a new feature helps them prevent account users from launching clusters with misconfigured security group rules. Misconfiguration can open a broad range of cluster ports to unrestricted traffic from the public internet and expose cluster resources to outside threats.

This post discusses a new account level feature called Block Public Access (BPA) configuration that helps administrators enforce a common public access rule across all of their EMR clusters in a region.

Overview of Block Public Access configuration

BPA configuration is an account-level configuration that helps you centrally manage public network access to EMR clusters in a region. You can enable this configuration in a region and block your account users from launching clusters that allow unrestricted inbound traffic from the public IP address ( source set to 0.0.0.0/0 for IPv4 and ::/0 for IPv6) through its ports. Your applications may require specific ports to be open to the internet. In that case, configure these ports (or port ranges) in the BPA configuration as exceptions to allow public access before you launch clusters.

When account users launch clusters in the region where you have enabled BPA configuration, EMR will check the port rules defined in this configuration and  compare it with inbound traffic rules specified in the security groups associated with the clusters. If these security groups have inbound rules that open ports to the public IP address but you did not configure these ports as exception in BPA configuration, then EMR will fail the cluster creation and send an exception to the user.

 Enabling BPA configuration from the AWS Management Console

To enable BPA configuration, you need permission to call PutBlockPublicAccessConfiguration API.

  • Log in to the AWS Management Console. From the console, navigate to the Amazon EMR
  • From the navigation panel, choose Block Public Access.
  • Choose Change and select On to enable BPA.

By default, all ports are blocked except port 22 for SSH traffic. To allow more ports for public access, add them as exceptions.

  • Choose Add a port range.

Before launching your cluster, define these exceptions. The port number or range should be the only ones in the security group rules that have an inbound source IP address of 0.0.0.0/0 for IPv4 and ::/0 for IPv6.

  • Enter a port number or the range of ports for public access.
  • Choose Save Changes.

Block public access section with the "Change" hyperlink circled in red.

Block public access settings, under Exceptions section +Add a port range circled in red.

For information about configuring BPA using the AWS CLI, see Configure Block Public Access.

Summary

In this post ,we discussed a new account level feature on Amazon EMR called Block Public Access  (BPA) configuration that helps administrators manage public access to their EMR clusters. You can enable BPA configuration today and prevent your EMR cluster in a region from being unintentionally exposed to public network.

 


About the Author

Vignesh Rajamani is a senior product manager for EMR at AWS.

 

Implement perimeter security in EMR using Apache Knox

Post Syndicated from Varun Rao Bhamidimarri original https://aws.amazon.com/blogs/big-data/implement-perimeter-security-in-emr-using-apache-knox/

Perimeter security helps secure Apache Hadoop cluster resources to users accessing from outside the cluster. It enables a single access point for all REST and HTTP interactions with Apache Hadoop clusters and simplifies client interaction with the cluster. For example, client applications must acquire Kerberos tickets using Kinit or SPNEGO before interacting with services on Kerberos enabled clusters. In this post, we walk through setup of Apache Knox to enable perimeter security for EMR clusters.

It provides the following benefits:

  • Simplify authentication of various Hadoop services and UIs
  • Hide service-specific URL’s/Ports by acting as a Proxy
  • Enable SSL termination at the perimeter
  • Ease management of published endpoints across multiple clusters

Overview

Apache Knox

Apache Knox provides a gateway to access Hadoop clusters using REST API endpoints. It simplifies client’s interaction with services on the Hadoop cluster by integrating with enterprise identity management solutions and hiding cluster deployment details.

In this post, we run the following setup:

  • Create a virtual private cloud (VPC) based on the Amazon VPC
  • Provision an Amazon EC2 Windows instance for Active Directory domain controller.
  • Create an Amazon EMR security configuration for Kerberos and cross-realm trust.
  • Set up Knox on EMR master node and enable LDAP authentication

Visually, we are creating the following resources:

Figure 1: Provisioned infrastructure from CloudFormation

Prerequisites and assumptions

Before getting started, the following prerequisites must be met:

IMPORTANT: The templates use hardcoded user name and passwords, and open security groups. They are not intended for production use without modification.

NOTE:

  • Single VPC has been used to simplify networking
  • CloudFormationtemplates use hardcoded user names and passwords and open security groups for simplicity.

Implementation

Single-click solution deployment

If you don’t want to set up each component individually, you can use the single-step AWS CloudFormation template. The single-step template is a master template that uses nested stacks (additional templates) to launch and configure all the resources for the solution in one operation.

To launch the entire solution, click on the Launch Stack button below that directs you to the console. Do not change to a different Region because the template is designed to work only in US-EAST-1 Region.

This template requires several parameters that you must provide. See the table below, noting the parameters marked with *, for which you have to provide values. The remaining parameters have default values and should not be edited.

For this parameterUse this
1Domain Controller NameDC1
2Active Directory domainawsknox.com
3Domain NetBIOS nameAWSKNOX (NetBIOS name of the domain (up to 15 characters).
4Domain admin userUser name for the account to be added as Domain administrator. (awsadmin)
5Domain admin password *Password for the domain admin user. Must be at least eight characters containing letters, numbers, and symbols – for example, CheckSum123
6Key pair name *Name of an existing EC2 key pair to enable access to the domain controller instance.
7Instance typeInstance type for the domain controller EC2 instance.
8LDAP Bind user nameLDAP Bind user name.
Default value is: CN=awsadmin,CN=Users,DC=awsknox,DC=com
9EMR Kerberos realmEMR Kerberos realm name. This is usually the VPC’s domain name in upper case letters Eg: EC2.INTERNAL
10Cross-realm trust password *Password for cross-realm trust Eg: CheckSum123
11Trusted Active Directory DomainThe Active Directory domain that you want to trust. This is same as Active Directory in name, but in upper case letters. Default value is “AWSKNOX.COM”
12Instance typeInstance type for the domain controller EC2 instance. Default: m4.xlarge
13Instance countNumber of core instances of EMR cluster. Default: 2
14Allowed IP addressThe client IP address that can reach your cluster. Specify an IP address range in CIDR notation (for example, 203.0.113.5/32). By default, only the VPC CIDR (10.0.0.0/16) can reach the cluster. Be sure to add your client IP range so that you can connect to the cluster using SSH.
15EMR applicationsComma-separated list of applications to install on the cluster. By default it selects “Hadoop,” “Spark,” “Ganglia,” “Hive” and “HBase”
16LDAP search baseLDAP search base: Only value is : “CN=Users,DC=awshadoop,DC=com”
17LDAP search attributeProvide LDAP user search attribute. Only value is : “sAMAccountName”
18LDAP user object classProvide LDAP user object class value. Only value is : “person”
19LDAP group search baseProvide LDAP group search base value. Only value is : “dc=awshadoop, dc=com”
20LDAP group object classProvide LDAP group object class. Only value is “group”
21LDAP member attributeProvide LDAP member attribute. Only value is : “member”
22EMRLogDir *Provide an Amazon S3 bucket where the EMRLogs are stored. Also provide “s3://” as prefix.
23S3 BucketAmazon S3 bucket where the artifacts are stored. In this case, all the artifacts are stored in “aws-bigdata-blog” public S3 bucket. Do not change this value.

Deploying each component individually

If you used the CloudFormation Template in the single-step solution, you can skip this section and start from the Access the Cluster section. This section describes how to use AWS CloudFormation templates to perform each step separately in the solution.

1.     Create and configure an Amazon VPC

In this step, we set up an Amazon VPC, a public subnet, an internet gateway, a route table, and a security group.

In order for you to establish a cross-realm trust between an Amazon EMR Kerberos realm and an Active Directory domain, your Amazon VPC must meet the following requirements:

  • The subnet used for the Amazon EMR cluster must have a CIDR block of fewer than nine digits (for example, 10.0.1.0/24).
  • Both DNS resolution and DNS hostnames must be enabled (set to “yes”).
  • The Active Directory domain controller must be the DNS server for instances in the Amazon VPC (this is configured in the next step).

To launch directly through the console, choose Launch Stack.

2.     Launch and configure an Active Directory domain controller

In this step, you use an AWS CloudFormation template to automatically launch and configure a new Active Directory domain controller and cross-realm trust.

Next, launch a windows EC2 instance and install and configure an Active Directory domain controller. In addition to launching and configuring an Active Directory domain controller and cross realm trust, this AWS CloudFormation template also sets the domain controller as the DNS server (name server) for your Amazon VPC.

To launch directly through the console, choose Launch Stack.

3.     Launch and configure EMR cluster with Apache Knox

To launch a Kerberized Amazon EMR cluster, first we must create a security configuration containing the cross-realm trust configuration. For more details on this, please refer to the blog post, Use Kerberos Authentication to integerate Amazon EMR with Microsoft Active Directory.

In addition to the steps that are described in the above blog, this adds an additional step to the EMR cluster, which creates a Kerberos principal for Knox.

The CloudFormation script also updates the below parameters in core-site.xml, hive-site.xml, hcatalog-webchat-site.xml and oozie-site.xml files. You can see these in “create_emr.py” script. Once the EMR cluster is created, it also runs a shell script as an EMR step. This shell script downloads and installs Knox software on EMR master machine. It also creates a Knox topology file with the name: emr-cluster-top.

To launch directly through the console, choose Launch Stack.

Accessing the cluster

API access to Hadoop Services

One of the main reasons to use Apache Knox is the isolate the Hadoop cluster from direct connectivity by users. Below, we demonstrate how you can interact with several Hadoop services like WebHDFS, WebHCat, Oozie, HBase, Hive, and Yarn applications going through the Knox endpoint using REST API calls. The REST calls can be called on the EMR cluster or outside of the EMR cluster. However, in a production environment, EMR cluster’s security groups should be set to only allow traffic on Knox’s port number to block traffic to all other applications.

For the purposes of this blog, we make the REST calls on the EMR cluster by SSH’ing to master node on the EMR cluster using the LDAP credentials:

ssh [email protected]<EMR-Master-Machine-Public-DNS>

Replace <EMR-Master-Machine-Public-DNS> with the value from the CloudFormation outputs to the EMR cluster’s master node. Find this CloudFormation Output value from the stack you deployed in Step 3 above.

You are prompted for the ‘awsadmin’ LDAP password. Please use the password you selected during the CloudFormation stack creation.

NOTE: In order to connect, your client machine’s IP should fall within the CIDR range specified by “Allowed IP address in the CloudFormation parameters. If you are not able to connect to the master node, check the master instance’s security group for the EMR cluster has a rule to allow traffic from your client. Otherwise, your organizations firewall may be blocking your traffic.

Demonstrating access to the WebHDFS service API:

Here we will invoke the LISTSTATUS operation on WebHDFS via the knox gateway. In our setup, knox is running on port number 8449. The below command will return a directory listing of the root directory of HDFS.

curl -ku awsadmin 'https://localhost:8449/gateway/emr-cluster-top/webhdfs/v1/?op=LISTSTATUS'

You can use both “localhost” or the private DNS of the EMR master node.

You are prompted for the password. This is the same “Domain admin password” that was passed as the parameter into the CloudFormation stack.

Demonstrating access Resource Manager service API:

The Resource manager REST API provides information about the Hadoop cluster status, applications that are running on the cluster etc. We can use the below command to get the cluster information.

curl -ikv -u awsadmin -X GET 'https://localhost:8449/gateway/emr-cluster-top/resourcemanager/v1/cluster'

You are prompted for the password. This is the same “Domain admin password” that was passed as the parameter into the CloudFormation stack.

Demonstrating connecting to Hive using Beeline through Apache Knox:

We can use Beeline, a JDBC client tool to connect to HiveServer2. Here we will connect to Beeline via Knox.

Use the following command to connect to hive shell

$hive

Use the following syntax to connect to Hive from beeline

!connect jdbc:hive2://<EMR-Master-Machine-Public-DNS>:8449/;transportMode=http;httpPath=gateway/emr-cluster-top/hive;ssl=true;sslTrustStore=/home/knox/knox/data/security/keystores/gateway.jks;trustStorePassword=CheckSum123

NOTE: You must update the <EMR-Master-Machine-Public-DNS> with the public DNS name of the EMR master node.

Demonstrating submitting an Spark job using Apache Livy through Apache Knox

You can use the following command to submit a spark job to an EMR cluster. In this example, we run SparkPi program that is available in spark-examples.jar.

curl -i -k -u awsadmin -X POST --data '{"file": "s3://aws-bigdata-blog/artifacts/aws-blog-emr-knox/spark-examples.jar", "className": "org.apache.spark.examples.SparkPi", "args": ["100"]}' -H "Content-Type: application/json" https://localhost:8449/gateway/emr-cluster-top/livy/v1/batches

You can use both “localhost” or the private DNS of EMR master node.

Securely accessing Hadoop Web UIs

In addition to providing API access to Hadoop clusters, Knox also provides proxying service for Hadoop UIs. Below is a table of available UIs:

Application NameApplication URL
1Resource Managerhttps://<EMRClusterURL>:8449/gateway/emr-cluster-top/yarn/
2Gangliahttps://<EMRClusterURL>:8449/gateway/emr-cluster-top/ganglia/
3Apache HBasehttps://<EMRClusterURL>:8449/gateway/emr-cluster-top/hbase/webui/master-status
4WebHDFShttps://<EMRClusterURL>:8449/gateway/emr-cluster-top/hdfs/
5Spark Historyhttps://<EMRClusterURL>:8449/gateway/emr-cluster-top/sparkhistory/

On the first visit of any UI above, you are presented with a drop-down for login credentials. Enter the login user awsadmin and the password you specified as a parameter to your CloudFormation template.

You can now browse the UI as you were directly connected to the cluster. Below is a sample of the Yarn UI:

And the scheduler information in the Yarn UI:

Ganglia:

Spark History UI:

Lastly, HBase UI. The entire URL to the “master-status” page must be provided

Troubleshooting

It’s always clear when there is an error interacting with Apache Knox. Below are a few troubleshooting steps.

I cannot connect to the UI. I do not get any error codes.

  • Apache Knox may not be running. Check that its running by logging into the master node of your cluster and running “ps -ef | grep knox”. There should be a process running.
ps -ef | grep knox
Knox 114022 1 0 Aug24 ? 00:04:21 /usr/lib/jvm/java/bin/java -Djava.library.path=/home/knox/knox/ext/native -jar /home/knox/knox/bin/gateway.jar

If the process is not running, start the process by running “/home/knox/knox/bin/gateway.sh start” as the Knox user (sudo su – knox).

  • Your browser may not have connectivity to the cluster. Even though you may be able to SSH to the cluster, a firewall rule or security group rule may be preventing traffic on the port number that Knox is running on. You can route traffic through SSH by building an SSH tunnel and enable port forwarding.

I get an HTTP 400, 404 or 503 code when accessing a UI:

  • Ensure that the URL you are entering is correct. If you do not enter the correct path, then Knox provides an HTTP 404.
  • There is an issue with the routing rules within Apache Knox and it does not know how to route the requests. The logs for Knox are at INFO level by default and is available in /home/knox/knox/logs/. If you want to change the logging level, change the following lines in /home/knox/knox/conf/gateway-log4j.properties:log4j.logger.org.apache.knox.gateway=INFO
    #log4j.logger.org.apache.knox.gateway=DEBUGto#log4j.logger.org.apache.knox.gateway=INFO
    log4j.logger.org.apache.knox.gateway=DEBUGThe logs will provide a lot more information such as how Knox is rewriting URL’s. This could provide insight whether Knox is translating URL’s correctly.You can use the below “ldap”, “knoxcli” and “curl” commands to verify that the setup is correct. Run these commands as “knox” user.
  • To verify search base, search attribute and search class, run the below ldap command
    ldapsearch -h <Active-Directory-Domain-Private-IP-Address> -p 389 -x -D 'CN=awsadmin,CN=Users,DC=awsknox,DC=com' -w 'CheckSum123' -b 'CN=Users,DC=awsknox,DC=com' -z 5 '(objectClass=person)' sAMAccountName

  • Replace “<Active-Directory-Domain-Private-IP-Address>” with the private IP address of the Active Directory EC2 instance. You can get this IP address from the output of second CloudFormation template.
  • To verify the values for server host, port, username, and password, run the below ldap command.
    ldapwhoami -h <Active-Directory-Domain-Private-IP-Address> -p 389 -x -D 'CN=awsadmin,CN=Users,DC=awsknox,DC=com' -w 'CheckSum123'

  • Replace “<Active-Directory-Domain-Private-IP-Address>” with the private IP address of the Active Directory EC2 instance. You can get this IP address from the output of second CloudFormation template.
  • It should display the below output:

  • To verify the System LDAP bind successful or not:
    /home/knox/knox/bin/knoxcli.sh user-auth-test --cluster emr-cluster-top --u awsadmin --p 'CheckSum123'

  • Here “emr-cluster-top” is the topology file that defines the applications that are available and the endpoints that Knox should connect to service the application.
  • The output from the command should return the below output:

“System LDAP Bind successful!”

  • To verify LDAP authentication successful or not, run the below command.
    /home/knox/knox/bin/knoxcli.sh user-auth-test --cluster emr-cluster-top --u awsadmin --p 'CheckSum123'

  • Here “emr-cluster-top” is the topology file name that we created.
  • The output the command should return the below output:

“LDAP authentication successful!”

  • Verify if WebHDFS is reachable directly using the service
  • First, we must get a valid Kerberos TGT, for that we must use the kinit command as below:
    kinit -kt /mnt/var/lib/bigtop_keytabs/knox.keytab knox/<EMR-Master-Machine-Private-DNS>@EC2.INTERNAL
    curl --negotiate -u : http://<EMR-Master-Machine-Private-DNS>:50070/webhdfs/v1/?op=GETHOMEDIRECTORY

  • For example: EMR-Master-Machine-Private-DNS appears in this format: ip-xx-xx-xx-xx.ec2.internal
  • It should return a JSON object containing a “Path” variable of the user’s home directory.

Cleanup

Delete the CloudFormation stack to clean up all the resources created for this setup. If you used the nested stack, CloudFormation deletes all resources in one operation. If you deployed the templates individually, delete them in the reverse order of creation, deleting the VPC stack last.

Conclusion

In this post, we went through the setup, configuration, and validation of Perimeter security for EMR clusters using Apache Knox. This helps simplify Authentication for various Hadoop services. In our next post, we will show you how to integrate Apache Knox and Apache Ranger to enable authorization and audits.

Stay tuned!

 


Related

 


About the Author


Varun Rao is a enterprise solutions architect
. He works with enterprise customers in their journey to the cloud with focus of data strategy and security. In his spare time, he tries to keep up with his 4-year old.

 

 

 

Mert Hocanin is a big data architect with AWS, covering several products, including EMR, Athena and Managed Blockchain. Prior to working in AWS, he has worked on Amazon.com’s retail business as a Senior Software Development Engineer, building a data lake to process vast amounts of data from all over the company for reporting purposes. When not building and designing data lakes, Mert enjoys traveling and food.

 

 

 

Photo of Srikanth KodaliSrikanth Kodali is a Sr. IOT Data analytics architect at Amazon Web Services. He works with AWS customers to provide guidance and technical assistance on building IoT data and analytics solutions, helping them improve the value of their solutions when using AWS.

 

 

 

Run Spark applications with Docker using Amazon EMR 6.0.0 (Beta)

Post Syndicated from Paul Codding original https://aws.amazon.com/blogs/big-data/run-spark-applications-with-docker-using-amazon-emr-6-0-0-beta/

The Amazon EMR team is excited to announce the public beta release of EMR 6.0.0 with Spark 2.4.3, Hadoop 3.1.0, Amazon Linux 2, and Amazon Corretto 8. With this beta release, Spark users can use Docker images from Docker Hub and Amazon Elastic Container Registry (Amazon ECR) to define environment and library dependencies. Using Docker, users can easily define their dependencies and use them for individual jobs, avoiding the need to install dependencies on individual cluster hosts.

This post shows you how to use Docker with the EMR release 6.0.0 Beta. You’ll learn how to launch an EMR release 6.0.0 Beta cluster and run Spark jobs using Docker containers from both Docker Hub and Amazon ECR.

Hadoop 3 Docker support

EMR 6.0.0 (Beta) includes Hadoop 3.1.0, which allows the YARN NodeManager to launch containers either directly on the host machine of the cluster, or inside a Docker container. Docker containers provide a custom execution environment in which the application’s code runs isolated from the execution environment of the YARN NodeManager and other applications.

These containers can include special libraries needed by the application, and even provide different versions of native tools and libraries such as R, Python, Python libraries. This allows you to easily define the libraries and runtime dependencies that your applications need, using familiar Docker tooling.

Clusters running the EMR 6.0.0 (Beta) release are configured by default to allow YARN applications such as Spark to run using Docker containers. To customize this, use the configuration for Docker support defined in the yarn-site.xml and container-executor.cfg files available in the /etc/hadoop/conf directory. For details on each configuration option and how it is used, see Launching Applications Using Docker Containers.

You can choose to use Docker when submitting a job. On job submission, the following variables are used to specify the Docker runtime and Docker image used:

    • YARN_CONTAINER_RUNTIME_TYPE=docker
    • YARN_CONTAINER_RUNTIME_DOCKER_IMAGE={DOCKER_IMAGE_NAME}

When you use Docker containers to execute your YARN applications, YARN downloads the Docker image specified when you submit your job. For YARN to resolve this Docker image, it must be configured with a Docker registry. Options to configure a Docker registry differ based on how you chose to deploy EMR (using either a public or private subnet).

Docker registries

A Docker registry is a storage and distribution system for Docker images. For EMR 6.0.0 (Beta), the following Docker registries can be configured:

  • Docker Hub: A public Docker registry containing over 100,000 popular Docker images.
  • Amazon ECR: A fully-managed Docker container registry that allows you to create your own custom images and host them in a highly available and scalable architecture.

Deployment considerations

Docker registries require network access from each host in the cluster, as each host downloads images from the Docker registry when your YARN application is running on the cluster. How you choose to deploy your EMR cluster (launching it into a public or private subnet) may limit your choice of Docker registry due to network connectivity requirements.

Public subnet

With EMR public subnet clusters, nodes running YARN NodeManager can directly access any registry available over the internet, such as Docker Hub, as shown in the following diagram.

Private Subnet

With EMR private subnet clusters, nodes running YARN NodeManager don’t have direct access to the internet.  Docker images can be hosted in the ECR and accessed through AWS PrivateLink, as shown in the following diagram.

For details on how to use AWS PrivateLink to allow access to ECR in a private subnet scenario, see Setting up AWS PrivateLink for Amazon ECS, and Amazon ECR.

Configuring Docker registries

Docker must be configured to trust the specific registry used to resolve Docker images. The default trust registries are local (private) and centos (on public Docker Hub). You can override docker.trusted.registries in /etc/hadoop/conf/container-executor.cfg to use other public repositories or ECR. To override this configuration, use the EMR Classification API with the container-executor classification key.

The following example shows how to configure the cluster to trust both a public repository (your-public-repo) and an ECR registry (123456789123.dkr.ecr.us-east-1.amazonaws.com). When using ECR, replace this endpoint with your specific ECR endpoint.  When using Docker Hub, please replace this repository name with your actual repository name.

[
  {
    "Classification": "container-executor",
    "Configurations": [
        {
            "Classification": "docker",
            "Properties": {
                "docker.trusted.registries": "local,centos, your-public-repo,123456789123.dkr.ecr.us-east-1.amazonaws.com",
                "docker.privileged-containers.registries": "local,centos, your-public-repo,123456789123.dkr.ecr.us-east-1.amazonaws.com"
            }
        }
    ]
  }
]

To launch an EMR 6.0.0 (Beta) cluster with this configuration using the AWS Command Line Interface (AWS CLI), create a file named container-executor.json with the contents of the preceding JSON configuration.  Then, use the following commands to launch the cluster:

$ export KEYPAIR=<Name of your Amazon EC2 key-pair>
$ export SUBNET_ID=<ID of the subnet to which to deploy the cluster>
$ export INSTANCE_TYPE=<Name of the instance type to use>
$ export REGION=<Region to which to deploy the cluster deployed>

$ aws emr create-cluster \
    --name "EMR-6-Beta Cluster" \
    --region $REGION \
    --release-label emr-6.0.0-beta \
    --applications Name=Hadoop Name=Spark \
    --service-role EMR_DefaultRole \
    --ec2-attributes KeyName=$KEYPAIR,InstanceProfile=EMR_EC2_DefaultRole,SubnetId=$SUBNET_ID \
    --instance-groups InstanceGroupType=MASTER,InstanceCount=1,InstanceType=$INSTANCE_TYPE InstanceGroupType=CORE,InstanceCount=2,InstanceType=$INSTANCE_TYPE \
    --configuration file://container-executor.json

Using ECR

If you’re new to ECR, follow the instructions in Getting Started with Amazon ECR and verify you have access to ECR from each instance in your EMR cluster.

To access ECR using the docker command, you must first generate credentials. To make sure that YARN can access images from ECR, pass a reference to those generated credentials using the container environment variable YARN_CONTAINER_RUNTIME_DOCKER_CLIENT_CONFIG.

Run the following command on one of the core nodes to get the login line for your ECR account.

$ aws ecr get-login --region us-east-1 --no-include-email

The get-login command generates the correct Docker CLI command to run to create credentials. Copy and run the output from get-login.

$ sudo docker login -u AWS -p <password> https://<account-id>.dkr.ecr.us-east-1.amazonaws.com

This command generates a config.json file in the /root/.docker folder.  Copy this file to HDFS so that jobs submitted to the cluster can use it to authenticate to ECR.

Execute the commands below to copy the config.json file to your home directory.

$ mkdir -p ~/.docker
$ sudo cp /root/.docker/config.json ~/.docker/config.json
$ sudo chmod 644 ~/.docker/config.json

Execute the commands below to put the config.json in HDFS so it may be used by jobs running on the cluster.

$ hadoop fs -put ~/.docker/config.json /user/hadoop/

At this point, YARN can access ECR as a Docker image registry and pull containers during job execution.

Using Spark with Docker

With EMR 6.0.0 (Beta), Spark applications can use Docker containers to define their library dependencies, instead of requiring dependencies to be installed on the individual Amazon EC2 instances in the cluster. This integration requires configuration of the Docker registry, and definition of additional parameters when submitting a Spark application.

When the application is submitted, YARN invokes Docker to pull the specified Docker image and run the Spark application inside of a Docker container. This allows you to easily define and isolate dependencies. It reduces the time spent bootstrapping or preparing instances in the EMR cluster with the libraries needed for job execution.

When using Spark with Docker, make sure that you consider the following:

  • The docker package and CLI are only installed on core and task nodes.
  • The spark-submit command should always be run from a master instance on the EMR cluster.
  • The Docker registries used to resolve Docker images must be defined using the Classification API with the container-executor classification key to define additional parameters when launching the cluster:
    • docker.trusted.registries
    • docker.privileged-containers.registries
  • To execute a Spark application in a Docker container, the following configuration options are necessary:
    • YARN_CONTAINER_RUNTIME_TYPE=docker
    • YARN_CONTAINER_RUNTIME_DOCKER_IMAGE={DOCKER_IMAGE_NAME}
  • When using ECR to retrieve Docker images, you must configure the cluster to authenticate itself. To do so, you must use the following configuration option:
    • YARN_CONTAINER_RUNTIME_DOCKER_CLIENT_CONFIG={DOCKER_CLIENT_CONFIG_PATH_ON_HDFS}
  • Mount the /etc/passwd file into the container so that the user running the job can be identified in the Docker container.
    • YARN_CONTAINER_RUNTIME_DOCKER_MOUNTS=/etc/passwd:/etc/passwd:ro
  • Any Docker image used with Spark must have Java installed in the Docker image.

Creating a Docker image

Docker images are created using a Dockerfile, which defines the packages and configuration to include in the image.  The following two example Dockerfiles use PySpark and SparkR.

PySpark Dockerfile

Docker images created from this Dockerfile include Python 3 and the numpy Python package.  This Dockerfile uses Amazon Linux 2 and the Amazon Corretto JDK 8.

FROM amazoncorretto:8

RUN yum -y update
RUN yum -y install yum-utils
RUN yum -y groupinstall development

RUN yum list python3*
RUN yum -y install python3 python3-dev python3-pip python3-virtualenv

RUN python -V
RUN python3 -V

ENV PYSPARK_DRIVER_PYTHON python3
ENV PYSPARK_PYTHON python3

RUN pip3 install --upgrade pip
RUN pip3 install numpy panda

RUN python3 -c "import numpy as np"

SparkR Dockerfile

Docker images created from this Dockerfile include R and the randomForest CRAN package. This Dockerfile includes Amazon Linux 2 and the Amazon Corretto JDK 8.

FROM amazoncorretto:8

RUN java -version

RUN yum -y update
RUN amazon-linux-extras enable R3.4

RUN yum -y install R R-devel openssl-devel
RUN yum -y install curl

#setup R configs
RUN echo "r <- getOption('repos'); r['CRAN'] <- 'http://cran.us.r-project.org'; options(repos = r);" > ~/.Rprofile

RUN Rscript -e "install.packages('randomForest')"

For more information on Dockerfile syntax, see the Dockerfile reference documentation.

Using Docker images from ECR

Amazon Elastic Container Registry (ECR) is a fully-managed Docker container registry that makes it easy for developers to store, manage, and deploy Docker container images. When using ECR, the cluster must be configured to trust your instance of ECR, and you must configure authentication in order for the cluster to use Docker images from ECR.

In this example, our cluster must be created with the following additional configuration, to ensure the ECR registry is trusted. Please replace the 123456789123.dkr.ecr.us-east-1.amazonaws.com endpoint with your ECR endpoint.

[
  {
    "Classification": "container-executor",
    "Configurations": [
        {
            "Classification": "docker",
            "Properties": {
                "docker.trusted.registries": "local,centos,123456789123.dkr.ecr.us-east-1.amazonaws.com",
                "docker.privileged-containers.registries": "local,centos, 123456789123.dkr.ecr.us-east-1.amazonaws.com"
            }
        }
    ]
  }
]

Using PySpark with ECR

This example uses the PySpark Dockerfile.  It will be tagged and upload to ECR. Once uploaded, you will run the PySpark job and reference the Docker image from ECR.

After you launch the cluster, use SSH to connect to a core node and run the following commands to build the local Docker image from the PySpark Dockerfile example.

First, create a directory and a Dockerfile for our example.

$ mkdir pyspark

$ vi pyspark/Dockerfile

Paste the contents of the PySpark Dockerfile and run the following commands to build a Docker image.

$ sudo docker build -t local/pyspark-example pyspark/

Create the emr-docker-examples ECR repository for our examples.

$ aws ecr create-repository --repository-name emr-docker-examples

Tag and upload the locally built image to ECR, replacing 123456789123.dkr.ecr.us-east-1.amazonaws.com with your ECR endpoint.

$ sudo docker tag local/pyspark-example 123456789123.dkr.ecr.us-east-1.amazonaws.com/emr-docker-examples:pyspark-example
$ sudo docker push 123456789123.dkr.ecr.us-east-1.amazonaws.com/emr-docker-examples:pyspark-example

Use SSH to connect to the master node and prepare a Python script with the filename main.py. Paste the following content into the main.py file and save it.

from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("docker-numpy").getOrCreate()
sc = spark.sparkContext

import numpy as np
a = np.arange(15).reshape(3, 5)
print(a)

To submit the job, reference the name of the Docker. Define the additional configuration parameters to make sure that the job execution uses Docker as the runtime. When using ECR, the YARN_CONTAINER_RUNTIME_DOCKER_CLIENT_CONFIG must reference the config.json file containing the credentials used to authenticate to ECR.

$ DOCKER_IMAGE_NAME=123456789123.dkr.ecr.us-east-1.amazonaws.com/emr-docker-examples:pyspark-example
$ DOCKER_CLIENT_CONFIG=hdfs:///user/hadoop/config.json
$ spark-submit --master yarn \
--deploy-mode cluster \
--conf spark.executorEnv.YARN_CONTAINER_RUNTIME_TYPE=docker \
--conf spark.executorEnv.YARN_CONTAINER_RUNTIME_DOCKER_IMAGE=$DOCKER_IMAGE_NAME \
--conf spark.executorEnv.YARN_CONTAINER_RUNTIME_DOCKER_CLIENT_CONFIG=$DOCKER_CLIENT_CONFIG \
--conf spark.executorEnv.YARN_CONTAINER_RUNTIME_DOCKER_MOUNTS=/etc/passwd:/etc/passwd:ro \
--conf spark.yarn.appMasterEnv.YARN_CONTAINER_RUNTIME_TYPE=docker \
--conf spark.yarn.appMasterEnv.YARN_CONTAINER_RUNTIME_DOCKER_IMAGE=$DOCKER_IMAGE_NAME \
--conf spark.yarn.appMasterEnv.YARN_CONTAINER_RUNTIME_DOCKER_CLIENT_CONFIG=$DOCKER_CLIENT_CONFIG \
--conf spark.yarn.appMasterEnv.YARN_CONTAINER_RUNTIME_DOCKER_MOUNTS=/etc/passwd:/etc/passwd:ro \
--num-executors 2 \
main.py -v

When the job has completed, take note of the YARN application ID, and use the following command to obtain the output of the PySpark job.

$ yarn logs --applicationId application_id | grep -C2 '\[\['
LogLength:55
LogContents:
[[ 0  1  2  3  4]
 [ 5  6  7  8  9]
 [10 11 12 13 14]]

Using SparkR with ECR

This example uses the SparkR Dockerfile. It will be tagged and upload to ECR. Once uploaded, you will run the SparkR job and reference the Docker image from ECR.

After you launch the cluster, use SSH to connect to a core node and run the following commands to build the local Docker image from the SparkR Dockerfile example.

First, create a directory and the Dockerfile for this example.

$ mkdir sparkr

$ vi sparkr/Dockerfile

Paste the contents of the SparkR Dockerfile and run the following commands to build a Docker image.

$ sudo docker build -t local/sparkr-example sparkr/

Tag and upload the locally built image to ECR, replacing 123456789123.dkr.ecr.us-east-1.amazonaws.com with your ECR endpoint.

$ sudo docker tag local/sparkr-example 123456789123.dkr.ecr.us-east-1.amazonaws.com/emr-docker-examples:sparkr-example
$ sudo docker push 123456789123.dkr.ecr.us-east-1.amazonaws.com/emr-docker-examples:sparkr-example

Use SSH to connect to the master node and prepare an R script with name sparkR.R. Paste the following contents into the sparkR.R file.

library(SparkR)
sparkR.session(appName = "R with Spark example", sparkConfig = list(spark.some.config.option = "some-value"))

sqlContext <- sparkRSQL.init(spark.sparkContext)
library(randomForest)
# check release notes of randomForest
rfNews()

sparkR.session.stop()

To submit the job, reference the name of the Docker. Define the additional configuration parameters to make sure that the job execution uses Docker as the runtime. When using ECR, the YARN_CONTAINER_RUNTIME_DOCKER_CLIENT_CONFIG must reference the config.json file containing the credentials used to authenticate to ECR.

$ DOCKER_IMAGE_NAME=123456789123.dkr.ecr.us-east-1.amazonaws.com/emr-docker-examples:sparkr-example
$ DOCKER_CLIENT_CONFIG=hdfs:///user/hadoop/config.json
$ spark-submit --master yarn \
--deploy-mode cluster \
--conf spark.executorEnv.YARN_CONTAINER_RUNTIME_TYPE=docker \
--conf spark.executorEnv.YARN_CONTAINER_RUNTIME_DOCKER_IMAGE=$DOCKER_IMAGE_NAME \
--conf spark.executorEnv.YARN_CONTAINER_RUNTIME_DOCKER_CLIENT_CONFIG=$DOCKER_CLIENT_CONFIG \
--conf spark.executorEnv.YARN_CONTAINER_RUNTIME_DOCKER_MOUNTS=/etc/passwd:/etc/passwd:ro \
--conf spark.yarn.appMasterEnv.YARN_CONTAINER_RUNTIME_TYPE=docker \
--conf spark.yarn.appMasterEnv.YARN_CONTAINER_RUNTIME_DOCKER_IMAGE=$DOCKER_IMAGE_NAME \
--conf spark.yarn.appMasterEnv.YARN_CONTAINER_RUNTIME_DOCKER_CLIENT_CONFIG=$DOCKER_CLIENT_CONFIG \
--conf spark.yarn.appMasterEnv.YARN_CONTAINER_RUNTIME_DOCKER_MOUNTS=/etc/passwd:/etc/passwd:ro \
sparkR.R

When the job has completed, note the YARN application ID, and use the following command to obtain the output of the SparkR job. This example includes testing to make sure that the randomForest library, version installed, and release notes are available.

$ yarn logs --applicationId application_id | grep -B4 -A10 "Type rfNews"
randomForest 4.6-14
Type rfNews() to see new features/changes/bug fixes.
Wishlist (formerly TODO):

* Implement the new scheme of handling classwt in classification.

* Use more compact storage of proximity matrix.

* Allow case weights by using the weights in sampling?

========================================================================
Changes in 4.6-14:

Using a Docker image from Docker Hub

To use Docker Hub, you must deploy your cluster to a public subnet, and configure it to use Docker Hub as a trusted registry. In this example, the cluster needs the following additional configuration to to make sure that the your-public-repo repository on Docker Hub is trusted. When using Docker Hub, please replace this repository name with your actual repository.

[
  {
    "Classification": "container-executor",
    "Configurations": [
        {
            "Classification": "docker",
            "Properties": {
                "docker.trusted.registries": "local,centos,your-public-repo ",
                "docker.privileged-containers.registries": "local,centos,your-public-repo"
            }
        }
    ]
  }
]

Beta limitations

EMR 6.0.0 (Beta) focuses on helping you get value from using Docker with Spark to simplify dependency management. You can also use EMR 6.0.0 (Beta) to get familiar with Amazon Linux 2, and Amazon Corretto JDK 8.

The EMR 6.0.0 (Beta) supports the following applications:

  • Spark 2.4.3
  • Livy 0.6.0
  • ZooKeeper 3.4.14
  • Hadoop 3.1.0

This beta release is supported in the following Regions:

  • US East (N. Virginia)
  • US West (Oregon)

The following EMR features are currently not available with this beta release:

  • Cluster integration with AWS Lake Formation
  • Native encryption of Amazon EBS volumes attached to an EMR cluster

Conclusion

In this post, you learned how to use an EMR 6.0.0 (Beta) cluster to run Spark jobs in Docker containers and integrate with both Docker Hub and ECR. You’ve seen examples of both PySpark and SparkR Dockerfiles.

The EMR team looks forward to hearing about how you’ve used this integration to simplify dependency management in your projects. If you have questions or suggestions, feel free to leave a comment.


About the Authors

Paul Codding is a senior product manager for EMR at Amazon Web Services.

 

 

 

 

Ajay Jadhav is a software development engineer for EMR at Amazon Web Services.

 

 

 

 

Rentao Wu is a software development engineer for EMR at Amazon Web Services.

 

 

 

 

Stephen Wu is a software development engineer for EMR at Amazon Web Services.

 

 

 

 

Migrate and deploy your Apache Hive metastore on Amazon EMR

Post Syndicated from Tanzir Musabbir original https://aws.amazon.com/blogs/big-data/migrate-and-deploy-your-apache-hive-metastore-on-amazon-emr/

Combining the speed and flexibility of Amazon EMR with the utility and ubiquity of Apache Hive provides you with the best of both worlds. However, getting started with big data projects can feel intimidating. Whether you want to deploy new data on EMR or migrate an existing project, this post provides you with the basics to get started.

Apache Hive is an open-source data warehouse and analytics package that runs on top of an Apache Hadoop cluster. A Hive metastore contains a description of the table and the underlying data making up its foundation, including the partition names and data types. Hive is one of the applications that can run on EMR.

Most of the solutions that this post presents assume that you use Apache Hadoop to manage your metastore, which provides scalability for Hive. If you don’t use Hadoop, see documentation for Amazon EMR.

Hive metastore deployment

You can choose one of three configuration patterns for your Hive metastore: embedded, local, or remote. When migrating an on-premises Hadoop cluster to EMR, your migration strategy depends on your existing Hive metastore’s configuration.

Bear in mind a few key facts while considering your set-up. Apache Hive ships with the Derby database, which you can use for embedded metastores. However, Derby can’t scale for production-level workloads.

When running off EMR, Hive records metastore information in a MySQL database on the master node’s file system as ephemeral storage, creating a local metastore. When a cluster terminates, all cluster nodes shut down, including that master node, which erases your data.

To get around these problems, create an external Hive metastore. This helps ensure that the Hive metadata store can scale with your implementation and that the metastore persists even if the cluster terminates.

There are two options for creating an external Hive metastore for EMR:

Using the AWS Glue Data Catalog as the Hive metastore

The AWS Glue Data Catalog is flexible and reliable, making it a great choice when you’re new to building or maintaining a metastore. Because AWS manages the service for you, it means investing less time and resources to the process, but it also sacrifices some fine control. The Data Catalog is highly available, fault-tolerant, maintains data replicas to avoid failure, and expands hardware depending on usage.

You don’t have to manage the Hive metastore database instance separately, maintain ongoing replication, or scale up the instance. An AWS Glue Data Catalog can supply one EMR cluster or many, as well as supporting Amazon Athena and Amazon Redshift Spectrum. You can also download the source code for the AWS Glue Data Catalog client for Apache Hive Metastore and use that code as a reference implementation for building a compatible client.

AWS Glue Data Catalog still allows you plenty of control. You can enable encryption on your files, or configure action access to allow or forbid certain processes. Bear in mind that the Data Catalog doesn’t currently support column statistics, Hive authorizations, or Hive constraints.

An AWS Glue Data Catalog has versions, which means a table can have multiple schema versions. AWS Glue stores that information in the Data Catalog, including the Hive metastore data. Based on the catalog configuration, you can adopt the new schema version or ignore new versions.

When you create an EMR cluster using release version 5.8.0 and later, you can choose a Data Catalog as the Hive metastore. The Data Catalog is not available with earlier releases.

Specify the AWS Glue Data Catalog using the EMR console

When you set up an EMR cluster, choose Advanced Options to enable AWS Glue Data Catalog settings in Step 1. Apache Hive, Presto, and Apache Spark all use the Hive metastore. Within EMR, you have options to use the AWS Glue Data Catalog for any of these applications.

Specify the AWS Glue Data Catalog using the AWS CLI or EMR API

To specify the AWS Glue Data Catalog when you create a cluster in either the AWS CLI or the EMR API, use the hive-site configuration classification. Set the value of hive.metastore.client.factory.class property to com.amazonaws.glue.catalog.metastore.AWSGlueDataCatalogHiveClientFactory.

[
  {
    "Classification": "hive-site",
    "Properties": {
      "hive.metastore.client.factory.class": "com.amazonaws.glue.catalog.metastore.AWSGlueDataCatalogHiveClientFactory"
    }
  }
]  

When you create an EMR cluster, save the configuration classification to a JSON file and then specify that file when you create the cluster. For more information, see Configuring Applications in the Amazon EMR Release Guide.

Using Amazon RDS or Amazon Aurora as the Hive metastore

If you want full control of your Hive metastore and want to integrate with other open-source applications such as Apache Ranger or Apache Atlas, then you can host your Hive metastore on Amazon RDS.

Always keep in mind that your Hive metastore is a single point of failure. Amazon RDS doesn’t automatically replicate databases, so you should enable replication when using Amazon RDS to avoid any data loss in the event of failure.

There are three main steps to set up your Hive metastore using RDS or Aurora:

  1. Create a MySQL or Aurora database.
  2. Configure the hive-site.xml file to point to MySQL or Aurora database.
  3. Specify an external Hive metastore.

Create a MySQL or Aurora database

Begin by setting up either your MySQL database on Amazon RDS or an Amazon Aurora database. Make a note of the URL, username, password, and database name, as you need all this information for the configuration process.

Update your database’s security group to allow JDBC connections between the EMR cluster and a MySQL database port (default: 3306).

Configure EMR for an external Hive metastore

To configure EMR, create a configuration file containing the following Hive site classification information:

  • jdo.option.ConnectionDriverName should reflect to driver org.mariadb.jdbc.Driver (preferred driver).
  • jdo.option.ConnectionURL, javax.jdo.option.ConnectionUserName and javax.jdo.option.ConnectionPassword should all point to the newly created database.
[
    {
      "Classification": "hive-site",
      "Properties": {
        "javax.jdo.option.ConnectionURL": "jdbc:mysql:\/\/hostname:3306\/hive?createDatabaseIfNotExist=true",
        "javax.jdo.option.ConnectionDriverName": "org.mariadb.jdbc.Driver",
        "javax.jdo.option.ConnectionUserName": "username",
        "javax.jdo.option.ConnectionPassword": "password"
      }
    }
]

Specify an external Hive metastore

After you save your configuration, specify an external Hive metastore. You can do this with either the EMR console or the AWS CLI.

On the EMR console, enter the classification settings created in the previous step as JSON file from S3 or embedded text.

If you are using the AWS CLI, save the classification information as a file named hive-configuration.json and pass the configuration file as a local file or from S3.

  • Hive-configuration.json file in local path:

aws emr create-cluster --release-label emr-5.17.0 --instance-type m4.large --instance-count 2 \
--applications Name=Hive --configurations ./hive-configuration.json --use-default-roles

  • Hive-configuration.json file in Amazon S3:

aws emr create-cluster --release-label emr-5.17.0 --instance-type m4.large --instance-count 2 \
--applications Name=Hive --configurations s3://emr-sample/hive-configuration.json --use-default-roles

Hive metastore migration options

When migrating Hadoop-based workloads from on-premises to the cloud, you must migrate your Hive metastore as well. Depending on the migration plan or your requirements, you can migrate a metastore one of two ways:

  • A one-time metastore migration, which moves an existing Hive metastore completely to AWS.
  • An ongoing metastore sync, which migrates the Hive metastore but also keeps a copy on-premises so that the two metastores can sync in real time during the migration phase.

One-time metastore migration

A one-and-done migration option allows you to shift your workspace entirely and never worry about migrating again. This situation is perfect if you plan to run your existing Hive workloads on EMR. The following diagram illustrates this scenario.

Migrating your Hive metastore to AWS Glue Data Catalog

In this case, your goal is to migrate existing Hive metastore from on-premises to an AWS Glue Data Catalog. There are multiple ways to navigate this migration, but the easiest uses an AWS Glue ETL job to extract metadata from your Hive metastore.  You then use AWS Glue jobs to load the metadata and update the AWS Glue Data Catalog. Many scripts to manage this process already exist on GitHub.

Migrating your Hive metastore to Amazon RDS or Amazon Aurora

Instead of using the AWS Glue Data Catalog, you can move your Hive metastore data from an on-premises database to AWS based storage. Depending on your database source and the desired target in AWS, the process requires different steps. For more information, see the following topics:

Ongoing metastore sync

Large-scale migrations benefit from an ongoing sync process, allowing you to keep running your Hive metastore in your data center as well as in the cloud during the migration phase.

The ongoing sync process keeps both Hive metastores accurate and up-to-date with any changes entered during the migration process. Use only one application for updating the Hive metastore. Otherwise, the metastore is out-of-sync.

AWS DMS is a data migration service ideal for on-going replication and custom-built for this need. You can also replicate the external database to Amazon RDS using the binary log file positions of replicated transactions.

Conclusion

This post pointed you to the various existing resources that can make your Hive migration as smooth and easy as possible.

The content of this blog post is part of the EMR Migration guide, which provides a comprehensive overview of advantages and disadvantages of each migration approach of Hadoop ecosystems. To read the paper, download the Amazon EMR Migration Guide now.

If you have additional insights or feedback, leave a comment here or reach out on Twitter!

 


About the Author

Tanzir Musabbir is an EMR Specialist Solutions Architect with AWS. He is an early adopter of open source Big Data technologies. At AWS, he works with our customers to provide them architectural guidance for running analytics solutions on Amazon EMR, Amazon Athena & AWS Glue. Tanzir is a big Real Madrid fan and he loves to travel in his free time.

 

 

Modifying your cluster on the fly with Amazon EMR reconfiguration

Post Syndicated from Brandon Scheller original https://aws.amazon.com/blogs/big-data/modifying-your-cluster-on-the-fly-with-amazon-emr-reconfiguration/

If you are a developer or data scientist using long-running Amazon EMR clusters, you face fast-changing workloads. These changes often require different application configurations to run optimally on your cluster.

With the reconfiguration feature, you can now change configurations on running EMR clusters. Starting with EMR release emr-5.21.0, this feature allows you to modify configurations without creating a new cluster or manually connecting by SSH into each node.

In this post, I go over the following topics:

  • Using reconfiguration
  • Instance group states, configuration versions, and events
  • Reconfiguration example use cases
  • Reconfiguration benefits

Using reconfiguration

The following tasks are updated in EMR release emr-5.21.0:

  • Submitting a reconfiguration
  • Modifying configurations
  • Defining configuration levels

Submitting a reconfiguration

You can submit a recognition through the EMR console, SDK, or AWS CLI. For more information, see submitting a reconfiguration and additional information.

Modifying configurations

When submitting a reconfiguration, you must include all of the configurations you want to apply to the cluster. The update only applies those items, removing all others. As you modify configurations, the EMR console also tracks your previous cluster configurations for you.

Defining configuration levels

Define cluster-level and instance-group-level configurations for your applications. Supply cluster-level configurations as you create a cluster. These configurations are then automatically applied to all your instance groups, even those added after the cluster’s up and running. After the configuration starts, you can’t modify your cluster-level configurations. But you can supplement or override those configurations on the instance-group level through reconfiguration requests. Whenever you submit a reconfiguration request for an instance group, these new instance-group-level configurations take precedence over inherited cluster-level configurations.

To better understand how cluster-level and instance-group-level configurations work together on an instance group, look at a simple demonstration in the EMR console:

Under the Configuration tab, select an instance group in the Filter drop-down list. Navigate to the desired instance group’s configuration table. The Source column of the configuration table indicates the level of your configurations.

This cluster starts with the cluster-level configuration set:

[
  {
    "Classification": "core-site",
    "Properties": {
      "Key-A": "Value-1",
      "Key-B": "Value-2"
    }
  }
]

As you can see in the console, the instance group ig-Y4E3MN8C4YBP automatically inherited the cluster-level configuration set. Now, reconfigure the instance group as follows:

[
  {
    "Classification": "core-site",
    "Properties": {
      "Key-A": "Value-a",
      "Key-C": "Value-3"
    }
  }
]

Once the request goes through, the value of configuration “Key-A” gets overridden by the instance-group-level configuration and changes from “Value-1” to “Value-a.”  In contrast, the value of configuration “Key-B” remains unchanged. Meanwhile, your request introduces the new, supplemental configuration, “Key-C.” The configuration table in your console always displays these kinds of subtle changes.

For more information about how to customize cluster-level and instance-group-level configurations, see Supplying a Configuration when Creating a Cluster.

Instance group states, configuration versions, and events

The states of your reconfiguration requests appear in instance group state transitions, configuration version increases, and CloudWatch events. Understand how each works to keep from losing track of any reconfiguration request:

  • Instance group states: After an instance group receives a reconfiguration request, it transitions from the RUNNING state to the RECONFIGURING state. The RECONFIGURING state indicates the start of the reconfiguration process. After the process completes and the new configurations have taken effect, the instance group returns to the RUNNING state. Then, you can verify your configurations either via your application’s Web UI or application-specific commands.
  • Configuration versions: Every reconfiguration request you submit establishes a new configuration set, distinguished by a new version number. Configuration versions start from 0 and increase by 1 for each new configuration set that you submit. Each instance group keeps its respective configuration version number. Version numbers increase depending on the number of times that you reconfigure the different instance groups.
  • Events:EMR posts a state for each reconfiguration request as an Amazon CloudWatch event. These events list the exact times when the request is submitted, the reconfiguration operation starts, and when it completes. For easy tracking, each request is posted together with its associated configuration version. For example, the following event flow shows how EMR executes a typical reconfiguration request in an instance group:

For a complete list of EMR events and instance group state transitions for reconfiguration operation, see the EMR Management Guide.

Reconfiguration example use cases

Here are some use case examples of reconfiguration operations:

  • Reconfiguring HDFS blocksize
  • Configuring capacity-scheduler queues

Reconfiguring HDFS blocksize

You may deal with fluctuating workloads. These changes can call for new application configurations throughout the lifetime of a cluster.

For example, suppose that you’ve recently seen growth in the workload and filesize for your long-running cluster. You’d like to account for this growth without replacing your current cluster.

To increase your HDFS block size for better performance, take advantage of the new reconfiguration feature. HDFS NameNode tracks each data block in your cluster. Increasing this block size could increase HDFS performance by reducing the number of blocks watched by NameNode. In addition, this feature improves job performance by reducing the number of required mappers.

To increase the HDFS block size from the default of 128 GB to 256 GB, submit a reconfiguration request to the master instance group, which runs the same node:

$ aws emr modify-instance-groups --cli-input-json file://reconfiguration.json

Here’s the example reconfiguration.json file.

reconfiguration.json:
{
  "ClusterId": "j-MyClusterID",
  "InstanceGroups": [
  {
    "InstanceGroupId": "ig-MyMasterId",
    "Configurations": [
    {
      "Classification": "hdfs-site",
      "Properties": {
        "dfs.blocksize": "256m"
      },
    "Configurations": []
    }]
  }]
}

The EMR reconfiguration process then modifies the “dfs.blocksize” parameter to the provided “256 m” value within the hdfs-size.xml file. The reconfiguration process also automatically restarts NameNode, to pick up the new configuration. Any new blocks added to the cluster automatically use the new default blocksize of 256 MB. If you’d like any existing blocks to pick up this default, follow these steps:

  1. Copy the blocks to a new location.
  2. Delete the originals.
  3. Copy the blocks back to their original location.

The restored blocks pick up the new default block size. NameNode is inactive during the short restart period.

Configuring capacity-scheduler queues

Do you want to change cluster resources sharing strategies among different Hadoop jobs? Modify YARN CapacityScheduler configurations on a running cluster? Add new queues on a large shared cluster that you manage with another organization? Alter the capacity allocation between different queues to meet your changing workloads?

Using the EMR reconfiguration feature, you can make changes by submitting a reconfiguration request to the master node. New configurations take effect on your queues in a few minutes. You don’t have to go through the hassle of logging into the master node, directly updating the configuration file, or manually refresh queues.

EMR clusters come with a single queue by default. To create two additional queues, alpha, and beta, and allocate each 30% of the total resource capacity of your cluster to each of them. Here’s a sample command that submits a reconfiguration request to accomplish the desired change:

$ aws emr modify-instance-groups --cli-input-json file://reconfiguration.json

Here’s the example reconfiguration.json file.

reconfiguration.json:
{
   "ClusterId":"j-MyClusterID",
   "InstanceGroups":[
      {
         "InstanceGroupId":"ig-MyMasterId",
         "Configurations":[
            {
               "Classification":"capacity-scheduler",
               "Properties":{
                  "yarn.scheduler.capacity.root.queues":"default,alpha,beta",
                  "yarn.scheduler.capacity.root.default.capacity":"40",
                  "yarn.scheduler.capacity.root.default.accessible-node-labels.CORE.capacity":"40",
                  "yarn.scheduler.capacity.root.alpha.capacity":"30",
                  "yarn.scheduler.capacity.root.alpha.accessible-node-labels":"*",
                  "yarn.scheduler.capacity.root.alpha.accessible-node-labels.CORE.capacity":"30",
                  "yarn.scheduler.capacity.root.beta.capacity":"30",
                  "yarn.scheduler.capacity.root.beta.accessible-node-labels":"*",
                  "yarn.scheduler.capacity.root.beta.accessible-node-labels.CORE.capacity":"30"
               },
               "Configurations":[]
            }
         ]
      }
   ]
}

Access to the “*” label was given to both queues so that each can access labeled core nodes. Additionally, the sum of capacities for all queues must be equal to 100. The capacity of the default queue decreases to 40%.

Finally, the capacity for each queue’s access to the core label matches the capacity of the queue itself. That means that the core partition splits between queues at the same ratio as the rest of the cluster.

After completing this step, go to the YARN ResourceManager Web UI to verify that your modifications have taken place.

EMR reconfiguration benefits

The following are EMR reconfiguration benefits:

  • Rolling reconfiguration process
  • Reconfiguration failure and reversion

Rolling reconfiguration process

One key benefit of EMR reconfiguration is a rolling reconfiguration process. From the documentation:

“Amazon EMR follows a ‘rolling’ process to reconfigure the instances in the Task and Core instance groups. Only 10 percent of the instances in an instance group are modified and restarted at a time. This process takes longer to finish but reduces the chance of potential application failure in a running cluster.”

Rolling reconfiguration protects against any HDFS downtime by allowing 90% of core nodes to stay running during reconfiguration. YARN on EMR additionally has NodeManager recovery enabled. NodeManager recovers containers running after the reconfiguration restart.

Because containers are always active, some MapReduce jobs can continue to run successfully during the reconfiguration process. However, not all applications can recover after a restart. For example, Spark on YARN (the EMR default) may encounter executor issues and job failure after NodeManager restart.

Test applications with the type of reconfiguration that you plan to do in a safe environment before reconfiguring in production.

Finally, the rolling reconfiguration process might result in a temporary mismatch of your instance group’s configuration state. While mismatched, some instances may have old configurations while others may have the newly requested ones. When reconfiguring your cluster, consider any possible side effects of this situation.

Reconfiguration failure and reversion

EMR can also recover your instance group from a reconfiguration failure.

To make your new configuration take effect, EMR restarts your reconfigured applications and ensures that they are running before declaring the reconfiguration operation complete.

However, if any application fails to restart successfully on any node, the reconfiguration operation fails and the instance group remains in the RECONFIGURING state. Such failures might result from problematic configuration values. For example, an invalid address for `yarn.resourcemanager.scheduler.address` can cause the YARN ResourceManager to fail to restart.

In such situations, EMR automatically triggers a configuration reversion. Reversion re-applies the previous working configuration set on the instance group. Reversion brings the instance group state back to the RUNNING state as soon as the reversion completes. Your instance group thus returns to a functioning state and maintains the availability of your applications on the cluster. Rolling reconfiguration continues throughout the process.

If applications still fail to start after the previous working configurations have been re-applied, EMR places the instance group in te ARRESTED state rather than make further reconfiguration attempts. To release the instance group from the ARRESTED state, submit a new reconfiguration request.

Summary

In this post, I showed you the basics of how to configure instance groups on running clusters using the new EMR cluster reconfiguration feature. I walked through the extra semantics of submitting reconfiguration requests, important configuration level concepts, and ways of reconfiguration tracking methods. I provided some real-world reconfiguration examples and covered two useful features of reconfiguration.

Try the new cluster reconfiguration feature and share your experience with us in the comments below!

 


About the Authors

Brandon Scheller is a software development engineer for Amazon EMR. His passion lies in developing and advancing the applications of the Hadoop ecosystem and working with the open source community. He enjoys mountaineering in the Cascades with his free time.

 

 

 

Junyang Li is a software development engineer for Amazon EMR. She works on cutting-edge features of EMR and is also involved in open source projects. Besides work, she enjoys arts and crafts, exercising and traveling.

 

 

 

 

Performance updates to Apache Spark in Amazon EMR 5.24 – Up to 13x better performance compared to Amazon EMR 5.16

Post Syndicated from Paul Codding original https://aws.amazon.com/blogs/big-data/performance-updates-to-apache-spark-in-amazon-emr-5-24-up-to-13x-better-performance-compared-to-amazon-emr-5-16/

Amazon EMR release 5.24.0 includes several optimizations in Spark that improve query performance. To evaluate the performance improvements, we used TPC-DS benchmark queries with 3-TB scale and ran them on a 6-node c4.8xlarge EMR cluster with data in Amazon S3. We observed up to 13X better query performance on EMR 5.24 compared to EMR 5.16 when operating with a similar configuration.

Customers use Spark for a wide array of analytics use cases ranging from large-scale transformations to streaming, data science, and machine learning. They choose to run Spark on EMR because EMR provides the latest, stable open source community innovations, performant storage with Amazon S3, and the unique cost savings capabilities of Spot Instances and Auto Scaling.

Each monthly EMR release offers the latest open source packages, alongside new features such as multiple master nodes, and cluster reconfiguration. The team also adds performance improvements with each release.

Each of those optimizations helps you run faster and reduce costs. With EMR 5.24, we have made several new optimizations and are detailing three critical ones in this post.

Setup

To get started with EMR, sign into the console, launch a cluster, and process data.

To replicate the setup for the benchmarking queries, use the following configuration:

  • Applications installed on the cluster: Ganglia, Hive, Spark, Hadoop (installed by default).
  • EMR release: EMR 5.24.0
  • Cluster configuration
    • Master instance group: 1 c4.8xlarge instance with 512 GiB of GP2 EBS storage (4 volumes of 128 GiB each)
    • Core instance group: 5 c4.8xlarge instances with 512 GiB of GP2 EBS storage (4 volumes of 128 GiB each)
ClassificationProperties
yarn-siteyarn.nodemanager.resource.memory-mb : 53248
yarn.scheduler.maximum-allocation-vcores : 36
spark-defaultsspark.executor.memory : 4743m
spark.driver.memory : 2g
spark.sql.optimizer.distinctBeforeIntersect.enabled : true
spark.sql.dynamicPartitionPruning.enabled : true
spark.sql.optimizer.flattenScalarSubqueriesWithAggregates.enabled : true
spark.executor.cores : 4
spark.executor.memoryOverhead : 890m

Results observed using TPC-DS benchmarks

The following two graphs compare the total aggregate runtime and geometric mean for all queries in the TPC-DS 3TB query dataset between the EMR releases.

The per-query runtime improvement between EMR 5.16 and EMR 5.24 is also illustrated in the following chart. The horizontal axis shows each of the queries in the TPC-DS 3 TB benchmark. The vertical axis shows the orders of magnitude of performance improvement seen in EMR 5.24.0 relative to EMR 5.16.0 as measured by query execution time. The largest performance improvements can be seen in 26 of the queries. In each of these queries, the performance was at least 2X better than EMR 5.16.

Performance optimizations in EMR 5.24

While AWS made several incremental performance improvements aggregating to the overall speedup, this post describes three major improvements in EMR 5.24 that affect the most common customer workloads:

  • Dynamic partition pruning
  • Flatten scalar subqueries
  • DISTINCT before INTERSECT

Dynamic partition pruning

Dynamic partition pruning improves job performance by selecting specific partitions within a table that must be read and processed for a query. By reducing the amount of data read and processed, queries run faster. The open source version of Spark (2.4.2) only supports pushing down static predicates that can be resolved at plan time. Examples of static predicate push down include the following:

partition_col = 5

partition_col IN (1,3,5)

partition_col BETWEEN 1 AND 3

partition_col = 1 + 3

With dynamic partition pruning turned on, Spark on EMR infers the partitions that must be read at runtime. Dynamic partition pruning is disabled by default, and can be enabled by setting the Spark property spark.sql.dynamicPartitionPruning.enabled from within Spark or when creating clusters. For more information, see Configure Spark.

Here’s an example that joins two tables and relies on dynamic partition pruning to improve performance. The store_sales table contains total sales data partitioned by region, and store_regions table contains a mapping of regions for each country. In this representative query, you want to only get data from a specific country.

SELECT ss.quarter, ss.region, ss.store, ss.total_sales 
FROM store_sales ss, store_regions sr
WHERE ss.region sr.region AND sr.country = ’North America’

Without dynamic partition pruning, this query reads all regions, before filtering out the subset of regions that match the results of the subquery. With dynamic partition pruning, only the partitions for the regions returned in the subquery are read and processed. This saves time and resources by both reading less data from storage, and processing fewer records.

The following graph shows the performance improvements to Queries 72, 80, 17, and 25 from the TPC-DS suite that we tested with 3-TB data.

Flatten scalar subqueries

This optimization can improve query performance where multiple conditions must be applied to rows from a specific table. The optimization prevents the table from being read multiple times for each condition. This optimization detects such cases, and optimizes the query to read the table only one time.

Flatten scalar subqueries is disabled by default and can be enabled by setting the Spark property spark.sql.optimizer.flattenScalarSubqueriesWithAggregates.enabled from within Spark or when creating clusters.

To give an example of how this works, use the same total_sales table from the previous optimization. In this example, you want to group stores by their average sales when their average sales are in between specific ranges.

SELECT (SELECT avg(total_sales) FROM store_sales 
WHERE total_sales BETWEEN 5000000 AND 10000000) AS group1, 
(SELECT avg(total_sales) FROM store_sales 
WHERE total_sales BETWEEN 10000000 AND 15000000) AS group2, 
(SELECT avg(total_sales) FROM store_sales 
WHERE total_sales BETWEEN 15000000 AND 20000000) AS group3  

With this optimization disabled, the total_sales table is read for each sub query. With the optimization enabled, the query is rewritten as follows to apply each of the conditions to the rows returned by reading the table only one time.

SELECT c1 AS group1, c2 AS group2, c3 AS group3 
FROM (SELECT avg (IF(total_sales BETWEEN 5000000 AND 10000000, total_sales, null)) AS c1, 
avg (IF(total_sales BETWEEN 10000000 AND 15000000, total_sales, null)) AS c2, 
avg (IF(total_sales BETWEEN 15000000 AND 20000000, total_sales, null)) AS c3 FROM store_sales);  

This optimization saves time and resources by both reading less data from storage, and processing fewer records.

To illustrate, take the example of Q9 from the TPCDS suite. The query runs 2.9x faster in version 5.24 compared to 5.16, when the relevant Spark property is switched on.

DISTINCT before INTERSECT

When producing the intersection of two collections, the result of that intersection is a set of unique values found in each collection. When dealing with large collections, many duplicate records must be both processed, and shuffled between hosts to finally calculate the intersection. This optimization eliminates duplicate values in each collection before computing the intersection, improving performance by reducing the amount of data shuffled between hosts.

This optimization is disabled by default and can be enabled by setting the Spark property spark.sql.optimizer.distinctBeforeIntersect.enabled from within Spark or when creating clusters.

For example (simplified from TPC-DS query14), you want to find all of the brands that are sold both in store and catalog sale channels. In this example, the store_sales table contains sale made through the store channel, the catalog_sales table contains sale made through catalog, and the item table contains each unique product’s formulation (e.g. brand, manufactuer).

(SELECT item.brand ss_brand FROM store_sales, item
WHERE store_sales.item_id = item.item_id)
INTERSECT
(SELECT item.brand cs_brand FROM catalog_sales, item 
WHERE catalog_sales.item_id = item.item_id) 

With this optimization disabled, the first SELECT statement produces 2,600,000 records (same number of records as store_sales) with only 1,200 unique brands. The second SELECT statement produces 1,500,000 records (same number of records as catalog_sales) with 300 unique brands. This results in all 4,100,000 rows being fed into the intersect operation to produce the 200 brands that exist in both results.

With the optimization enabled, a distinct operation is performed on each collection before being fed into the intersect operator, resulting in only 1,200 + 300 records being fed into the intersect operator. This optimization saves time and resources by shuffling less data between hosts.

Summary

With each of these performance optimizations to Apache Spark, you benefit from better query performance on EMR 5.24 compared to EMR 5.16. We look forward to feedback on how these optimizations benefit your real world workloads.

Stay tuned as we roll out additional updates to improve Apache Spark performance in EMR. To keep up-to-date, subscribe to the Big Data blog’s RSS feed to learn about more great Apache Spark optimizations, configuration best practices, and tuning advice. Be sure not to miss other great optimizations like using S3 Select with Spark, and the EMRFS S3-Optimized Committer from previous EMR releases.

 


About the Author

Paul Codding is a senior product manager for EMR at Amazon Web Services.

 

 

 

 

Peter Gvozdjak is a senior engineering manager for EMR at Amazon Web Services.

 

 

 

 

Joseph Marques is a principal engineer for EMR at Amazon Web Services.

 

 

 

 

Yuzhou Sun is a software development engineer for EMR at Amazon Web Services.

 

 

 

 

Atul Payapilly is a software development engineer for EMR at Amazon Web Services.

 

 

 

 

Surya Vadan Akivikolanu is a software development engineer for EMR at Amazon Web Services.

 

 

 

 

Amazon EMR Migration Guide

Post Syndicated from Nikki Rouda original https://aws.amazon.com/blogs/big-data/amazon-emr-migration-guide/

Businesses worldwide are discovering the power of new big data processing and analytics frameworks like Apache Hadoop and Apache Spark, but they are also discovering some of the challenges of operating these technologies in on-premises data lake environments. They may also have concerns about the future of their current distribution vendor.

To address this, we’ve introduced the Amazon EMR Migration Guide (first published June 2019.) This paper is a comprehensive guide to offer sound technical advice to help customers in planning how to move from on-premises big data deployments to EMR.

Common problems of on-premises big data environments include a lack of agility, excessive costs, and administrative headaches, as IT organizations wrestle with the effort of provisioning resources, handling uneven workloads at large scale, and keeping up with the pace of rapidly changing, community-driven, open-source software innovation. Many big data initiatives suffer from the delay and burden of evaluating, selecting, purchasing, receiving, deploying, integrating, provisioning, patching, maintaining, upgrading, and supporting the underlying hardware and software infrastructure.

A subtler, if equally critical, problem is the way companies’ data center deployments of Apache Hadoop and Apache Spark directly tie together the compute and storage resources in the same servers, creating an inflexible model where they must scale in lock step. This means that almost any on-premises environment pays for high amounts of under-used disk capacity, processing power, or system memory, as each workload has different requirements for these components. Typical workloads run on different types of clusters, at differing frequencies and times of day. These big data workloads should be freed to run whenever and however is most efficient, while still accessing the same shared underlying storage or data lake. See Figure 1. below for an illustration.

How can smart businesses find success with their big data initiatives? Migrating big data (and machine learning) to the cloud offers many advantages. Cloud infrastructure service providers, such as AWS offer a broad choice of on-demand and elastic compute resources, resilient and inexpensive persistent storage, and managed services that provide up-to-date, familiar environments to develop and operate big data applications. Data engineers, developers, data scientists, and IT personnel can focus their efforts on preparing data and extracting valuable insights.

Services like Amazon EMR, AWS Glue, and Amazon S3 enable you to decouple and scale your compute and storage independently, while providing an integrated, well-managed, highly resilient environment, immediately reducing so many of the problems of on-premises approaches. This approach leads to faster, more agile, easier to use, and more cost-efficient big data and data lake initiatives.

However, the conventional wisdom of traditional on-premises Apache Hadoop and Apache Spark isn’t always the best strategy in cloud-based deployments. A simple lift and shift approach to running cluster nodes in the cloud is conceptually easy but suboptimal in practice. Different design decisions go a long way towards maximizing your gains as you migrate big data to a cloud architecture.

This guide provides the best practices for:

  • Migrating data, applications, and catalogs
  • Using persistent and transient resources
  • Configuring security policies, access controls, and audit logs
  • Estimating and minimizing costs, while maximizing value
  • Leveraging the AWS Cloud for high availability (HA) and disaster recovery (DR)
  • Automating common administrative tasks

Although not intended as a replacement for professional services, this guide covers a wide range of common questions, and scenarios as you migrate your big data and data lake initiatives to the cloud.

When starting your journey for migrating your big data platform to the cloud, you must first decide how to approach migration. One approach is to re-architect your platform to maximize the benefits of the cloud. The other approach is known as lift and shift, is to take your existing architecture and complete a straight migration to the cloud. A final option is a hybrid approach, where you blend a lift and shift with re-architecture. This decision is not straightforward as there are advantages and disadvantages of both approaches.

A lift and shift approach is usually simpler with less ambiguity and risk. Additionally, this approach is better when you are working against tight deadlines, such as when your lease is expiring for a data center. However, the disadvantage to a lift and shift is that it is not always the most cost effective, and the existing architecture may not readily map to a solution in the cloud.

A re-architecture unlocks many advantages, including optimization of costs and efficiencies. With re-architecture, you move to the latest and greatest software, have better integration with native cloud tools, and lower operational burden by leveraging native cloud products and services.

This paper provides advantages and disadvantages of each migration approach from the perspective of the Apache Spark and Hadoop ecosystems. To read the paper, download the Amazon EMR Migration Guide now.

For a more general resource on deciding which approach is ideal for your workflow, see An E-Book of Cloud Best Practices for Your Enterprise, which outlines the best practices for performing migrations to the cloud at a higher level.

 


About the Author

Nikki Rouda is the principal product marketing manager for data lakes and big data at AWS. Nikki has spent 20+ years helping enterprises in 40+ countries develop and implement solutions to their analytics and IT infrastructure challenges. Nikki holds an MBA from the University of Cambridge and an ScB in geophysics and math from Brown University.

 

 

 

Optimizing downstream data processing with Amazon Kinesis Data Firehose and Amazon EMR running Apache Spark

Post Syndicated from Srikanth Kodali original https://aws.amazon.com/blogs/big-data/optimizing-downstream-data-processing-with-amazon-kinesis-data-firehose-and-amazon-emr-running-apache-spark/

For most organizations, working with ever-increasing volumes of data and incorporating new data sources can be a challenge.  Often, AWS customers have messages coming from various connected devices and sensors that must be efficiently ingested and processed before further analysis.  Amazon S3 is a natural landing spot for data of all types.  However, the way data is stored in Amazon S3 can make a significant difference in the efficiency and cost of downstream data processing.  Specifically, Apache Spark can be over-burdened with file operations if it is processing a large number of small files versus fewer larger files.  Each of these files has its own overhead of a few milliseconds for opening, reading metadata information, and closing. This overhead of file operations on these large numbers of files results in slow processing. This blog post shows how to use Amazon Kinesis Data Firehose to merge many small messages into larger messages for delivery to Amazon S3.  This results in faster processing with Amazon EMR running Spark.

Like Amazon Kinesis Data Streams, Kinesis Data Firehose accepts a maximum incoming message size of 1 MB.  If a single message is greater than 1 MB, it can be compressed before placing it on the stream.  However, at large volumes, a message or file size of 1 MB or less is usually too small.  Although there is no right answer for file size, 1 MB for many datasets would just yield too many files and file operations.

This post also shows how to read the compressed files using Apache Spark that are in Amazon S3, which does not have a proper file name extension and store back in Amazon S3 in parquet format.

Solution overview

The steps we follow in this blog post are:

  1. Create a virtual private cloud (VPC) and an Amazon S3 bucket.
  2. Provision a Kinesis data stream, and an AWS Lambda function to process the messages from the Kinesis data stream.
  3. Provision Kinesis Data Firehose to deliver messages to Amazon S3 sent from the Lambda function in step 2. This step also provisions an Amazon EMR cluster to process the data in Amazon S3.
  4. Generate test data with custom code running on an Amazon EC2
  5. Run a sample Spark program from the Amazon EMR cluster’s master instance to read the files from Amazon S3, convert them into parquet format and write back to an Amazon S3 destination.

The following diagram explains how the services work together:

The AWS Lambda function in the diagram reads the messages, append additional data to them, and compress them with gzip before sending to Amazon Kinesis Data Firehose. The reason for this is most customers need some enrichment to the data before arriving to Amazon S3.

Amazon Kinesis Data Firehose can buffer incoming messages into larger records before delivering them to your Amazon S3 bucket. It does so according to two conditions, buffer size (up to 128 MB) and buffer interval (up to 900 seconds). Record delivery is triggered once either of these conditions has been satisfied.

An Apache Spark job reads the messages from Amazon S3, and stores them in parquet format. With parquet, data is stored in a columnar format that provides more efficient scanning and enables ad hoc querying or further processing by services like Amazon Athena.

Considerations

The maximum size of a record sent to Kinesis Data Firehose is 1,000 KB. If your message size is greater than this value, compressing the message before it is sent to Kinesis Data Firehose is the best approach. Kinesis Data Firehose  also offers compression of messages after they are written to the Kinesis Data Firehose data stream. Unfortunately, this does not overcome the message size limitation, because this compression happens after the message is written. When Kinesis Data Firehose delivers a previously compressed message to Amazon S3 it is written as an object without a file extension. For example, if a message is compressed with gzip before it is written to Kinesis Data Firehose, it is delivered to Amazon S3 without the .gz extension. This is problematic if you are using Apache Spark for downstream processing because a “.gz” extension is required.

We will see how to overcome this issue by reading the files using the Amazon S3 API operations later in this blog.

Prerequisites and assumptions

To follow the steps outlined in this blog post, you need the following:

  • An AWS account that provides access to AWS services.
  • An AWS Identity and Access Management (IAM) user with an access key and secret access key to configure the AWS CLI.
  • The templates and code are intended to work in the US East (N. Virginia) Region only.

Additionally, be aware of the following:

  • We configure all services in the same VPC to simplify networking considerations.
  • Important: The AWS CloudFormation templates and the sample code that we provide use hardcoded user names and passwords and open security groups. These are for testing purposes only. They aren’t intended for production use without any modifications.

Implementing the solution

You can use this downloadable template for single-click deployment. This template is launched in the US East (N. Virginia) Region by default. Do not change to a different Region. The template is designed to work only in the US East (N. Virginia) Region. To launch directly through the console, choose the Launch Stack button.

This template takes the following parameters. Some of the parameters have default values, and you can’t edit these. These predefined names are hardcoded in the code. For some of the parameters, you must provide the values. The following table provides additional details.

For this parameterProvide this
StackNameProvide the stack name.

ClientIP

 

The IP address range of the client that is allowed to connect to the cluster using SSH.
FirehoseDeliveryStreamNameThe name of the Amazon Firehose delivery stream. Default value is set to “AWSBlogs-LambdaToFireHose”.
InstanceTypeThe EC2 instance type.
KeyNameThe name of an existing EC2 key pair to enable access to login.
KinesisStreamNameThe name of the Amazon Kinesis Stream. Default value is set to “AWS-Blog-BaseKinesisStream”
RegionAWS Region – By default it is us-east-1 — US East (N. Virginia). Do not change this as the scripts are developed to work in this Region only.

EMRClusterName

 

A name for the EMR cluster.
S3BucketNameThe name of the bucket that is created in your account. Provide some unique name to this bucket. This bucket is used for storing the messages and output from the Spark code.

After you specify the template details, choose Next. On the options page, choose Next again. On the Review page, select the check box for I acknowledge that AWS CloudFormation might create IAM resources with custom names and for I acknowledge that AWS CloudFormation might require the following capability: CAPABILITY_AUTO_EXPAND. And then click on the Create button.

If you use this one-step solution, you can skip to Step 7: Generate test dataset and load into Kinesis Data Streams.

To create each component individually, use the following steps.

1. Use the AWS CloudFormation template to configure Amazon VPC and create an Amazon S3 bucket

In this step, we set up a VPC, public subnet, internet gateway, route table, and a security group. The security group has two inbound access rules. The first inbound rule allows access to the TCP port 22 (SSH) from the provided client IP CIDR range and the second inbound rule allows access to any TCP port from any host with in the same security group. We use this VPC and subnet for all other services that are created in the next steps. In addition to these resources, we will also create a standard Amazon S3 bucket with a provided bucket name to store the incoming data and processed data. You can use this downloadable AWS CloudFormation template to set up the previous components. To launch directly through the console, choose Launch Stack.

This template takes the following parameters. The following table provides additional details.

For this parameterDo this
StackNameProvide the stack name.
S3BucketNameProvide a unique Amazon S3 bucket. This bucket is created in your account.
ClientIpProvide a CIDR IP address range that is added to inbound rule of the security group. You can get your current IP address from “checkip.amazon.com” web url.

After you specify the template details, choose Next. On the Review page, choose Create.

When the stack launch is complete, it should return outputs similar to the following.

KeyValue
StackNameName
VPCIDVpc-xxxxxxx
SubnetIDsubnet-xxxxxxxx
SecurityGroupsg-xxxxxxxxxx
S3BucketDomain<S3_BUCKET_NAME>.s3.amazonaws.com
S3BucketARNarn:aws:s3:::<S3_BUCKET_NAME>

Make a note of the output, because you use this information in the next step. You can view the stack outputs on the AWS Management Console or by using the following AWS CLI command:

$ aws cloudformation describe-stacks --stack-name <stack_name> --region us-east-1 --query 'Stacks[0].Outputs'

2.  Use the AWS CloudFormation template to create necessary IAM Roles

In this step, we set up two AWS IAM roles. One of the IAM roles will be used by an AWS Lambda function to allow access to Amazon S3 service, Amazon Kinesis Data Firehose, Amazon CloudWatch Logs, and Amazon EC2 instances.  The second IAM role is used by the Amazon Kinesis Data Firehose service to access Amazon S3 service. You can use this downloadable CloudFormation template to set up the previous components. To launch directly through the console, choose Launch Stack.

This template takes the following parameters. The following table provides additional details.

For this parameterDo this
StackNameProvide the stack name.

After you specify the template details, choose Next. On the options page, choose Next again. On the Review page, select the check box for I acknowledge that AWS CloudFormation might create IAM resources with custom names. Choose Create.

When the stack launch is complete, it should return outputs similar to the following.

KeyValue
LambdaRoleArnarn:aws:iam::<ACCOUNT_NUMBER>:role/small-files-lamdarole
FirehoseRoleArnarn:aws:iam::<ACCOUNT_NUMBER>:role/small-files-firehoserole

When the stack launch is complete, it returns the output with information about the resources that were created. Make a note of the output, because you use this information in the next step. You can view the stack outputs on the AWS Management Console or by using the following AWS CLI command:

$ aws cloudformation describe-stacks --stack-name <stack_name> --region us-east-1 --query 'Stacks[0].Outputs'

3. Use an AWS CloudFormation template to configure the Amazon Kinesis Data Firehose data stream

In this step, we set up Amazon Kinesis Data Firehose with Amazon S3 as destination for the incoming messages. We select the Uncompressed option for compression format, buffering options with 128 MB size and interval seconds of 300. You can use this downloadable AWS CloudFormation template to set up the previous components. To launch directly through the console, choose Launch Stack.

This template takes the following parameters. The following table provides additional details.

For this parameterDo this
StackNameProvide the stack name.
FirehoseDeliveryStreamNameProvide the name of the Amazon Kinesis Data Firehose delivery stream. The default value is set to “AWSBlogs-LambdaToFirehose”
RoleProvide the Kinesis Data Firehose IAM role ARN that was created as part of step 2.
S3BucketARNSelect the S3BucketARN. You can get this from the step 1 AWS CloudFormation output.

After you specify the template details, choose Next. On the options page, choose Next again. On the Review page, choose Create.

4. Use an AWS CloudFormation template to create a Kinesis data stream and a Lambda function

In this step, we set up a Kinesis data stream and an AWS Lambda function. We can use the AWS Lambda function to process incoming messages in a Kinesis data stream. An event source mapping is also created as part of this template. This adds a trigger to the AWS Lambda function for the Kinesis data stream source. For more information about creating event source mapping, see Creating an Event Source Mapping. This Kinesis data stream is created with 10 shards and the Lambda function is created with a Java 8 runtime. We allocate memory size of 1920 MB and timeout of 300 seconds. You can use this downloadable AWS CloudFormation template to set up the previous components. To launch directly through the console, choose Launch Stack.

This template takes the following parameters. The following table provides details.

For this parameterDo this
StackNameProvide the stack name.
KinesisStreamNameProvide the name of the Amazon Kinesis stream. Default value is set to ‘AWS-Blog-BaseKinesisStream’
RoleProvide the IAM Role created for Lambda function as part of the second AWS CloudFormation template. Get the value from the output of second AWS CloudFormation template.
S3BucketProvide the existing Amazon S3 bucket name that was created using first AWS CloudFormation template. Do not use the domain name. Provide the bucket name only.
RegionSelect the AWS Region. By default it is us-east-1 — US East (N. Virginia).

After you specify the template details, choose Next. On the options page, choose Next again. On the Review page, choose Create.

5. Use an AWS CloudFormation template to configure the Amazon EMR cluster

In this step, we set up an Amazon EMR 5.16.0 cluster with “Spark”, “Ganglia” and “Hive” applications. We create this cluster with one master and two core nodes, and use an r4.xlarge instance type. The template uses an AWS Glue metastore for the Amazon EMR hive metastore. This Amazon EMR cluster is used to process the messages in Amazon S3 bucket that are created by the Amazon Kinesis Data Firehose data stream. You can use this downloadable AWS CloudFormation template to set up the previous components. To launch directly through the console, choose Launch Stack.

This template takes the following parameters. The following table provides additional details.

For this parameterDo this
EMRClusterNameProvide the name for the EMR cluster.
ClusterSecurityGroupSelect the security group ID that was created as part of the first AWS CloudFormation template.
ClusterSubnetIDSelect the subnet ID that was created as part of the first AWS CloudFormation template.
AllowedCIDRProvide the IP address range of the client that is allowed to connect to the cluster.
KeyNameProvide the name of an existing EC2 key pair to access the Amazon EMR cluster.

After you specify the template details, choose Next. On the options page, choose Next again. On the Review page, choose Create.

When the stack launch is complete, it should return outputs similar to the following.

KeyValue
EMRClusterMasterssh [email protected] -i <KEY_PAIR_NAME>.pem

Make a note of the output, because you use this information in the next step. You can view the stack outputs on the AWS Management Console or by using the following AWS CLI command:

$ aws cloudformation describe-stacks --stack-name <stack_name> --region us-east-1 --query 'Stacks[0].Outputs'

6. Use an AWS CloudFormation template to create an Amazon EC2 Instance to generate test data

In this step, we set up an Amazon EC2 instance and install open-jdk version 1.8. The AWS CloudFormation script that creates this EC2 instance runs two additional steps. First, it downloads and installs open-jdk version 1.8. Second, it downloads a Java program jar file on to the EC2 instance’s ec2-user home directory. We use this Java program to generate test data messages with an approximate size of ~900 KB. We then send them to the Kinesis data stream that was created as part of the previous steps. The Java jar file name is: “sample-kinesis-producer-1.0-SNAPSHOT-jar-with-dependencies.jar”.

You can use this downloadable AWS CloudFormation template to set up the previous components. To launch directly through the console, choose Launch Stack.

This template takes the following parameters. The following table provides additional details.

For this parameterDo this
EC2SecurityGroupSelect the security group ID that was created from the first AWS CloudFormation template.
EC2SubnetSelect the subnet that was created from the first AWS CloudFormation template.
InstanceTypeSelect the provided instance type. By default, it selects r4.4xlarge instance.
KeyNameName of an existing EC2 key pair to enable SSH access to the EC2 instance.

After you specify the template details, choose Next. On the options page, choose Next again. On the Review page select “I acknowledge that AWS CloudFormation might create IAM resources with custom names” option and, click Create button.

When the stack launch is complete, it should return outputs similar to the following.

KeyValue
EC2Instancessh [email protected]<Public-IP> -i <KEY_PAIR_NAME>.pem

Make a note of the output, because you use this information in the next step. You can view the stack outputs on the AWS Management Console or by using the following AWS CLI command:

$ aws cloudformation describe-stacks --stack-name <stack_name> --region us-east-1 --query 'Stacks[0].Outputs'

7. Generate the test dataset and load into Kinesis Data Streams

After all of the previous AWS CloudFormation stacks are created successfully, log in to the EC2 instance that was created as part of the step 6. Use the “ssh” command as shown in the CloudFormation stack template output. This template copies the “sample-kinesis-producer-1.0-SNAPSHOT-jar-with-dependencies.jar” file, which we use to generate the test data and send to Amazon Kinesis Data Streams. You can find the code corresponding to this sample Kinesis producer in this Git repository.

Make sure your EC2 instance’s security group allows ssh port 22 (Inbound) from your IP address. If not, update your security group inbound access.

Run the following commands to generate some test data.

$ cd;

 

$ ls -ltra sample-kinesis-producer-1.0-SNAPSHOT-jar-with-dependencies.jar

-rwxr-xr-x 1 ec2-user ec2-user 27536802 Oct 29 21:19 sample-kinesis-producer-1.0-SNAPSHOT-jar-with-dependencies.jar

 

$java -Xms1024m -Xmx25600m -XX:+UseG1GC -cp sample-kinesis-producer-1.0-SNAPSHOT-jar-with-dependencies.jar com.optimize.downstream.entry.Main 10000

 

This java program uses PutRecords API method that allows many records to be sent with a single HTTP request. For more information on this you can check this AWS blog. Once you run the above java program, you will see the below output that shows messages are in the process of sending to Kinesis Data Stream.

“Starting producer and consumer.....
Inserting a message into blocking queue before sending to Kinesis Firehose and Message number is : 0
Producer Thread # 9 is going to sleep mode for 500 ms.
Inserting a message into blocking queue before sending to Kinesis Firehose and Message number is : 1
Inserting a message into blocking queue before sending to Kinesis Firehose and Message number is : 2
Inserting a message into blocking queue before sending to Kinesis Firehose and Message number is : 3
::
::
Record sent to Kinesis Stream. Record size is ::: 5042850 KB
Sending a record to Kinesis Stream with 5 messages grouped together.
Record sent to Kinesis Stream. Record size is ::: 5042726 KB
Sending a record to Kinesis Stream with 5 messages grouped together.
Record sent to Kinesis Stream. Record size is ::: 5042729 KB”

When running the sample Kinesis producer jar, notice that the number of messages is 10,000. This program generates the test data messages and is not a replacement for your load testing tool. This is created to demonstrate the use case presented in this post.

After all of the messages generated and sent to Amazon Kinesis Data Streams, program will exit gracefully.

The sample JSON input message format is shown as follows:

   "processedDate":"2018/10/30 19:05:19",
   "currentDate":"2018/10/30 19:05:07",
   "hashDeviceId":"0c2745e4-c2d6-4d43-8339-9c2401e80e92",
   "deviceId":"94581b5f-a117-484a-8e3c-4fcc2dbd53b7",
   "accelerometerSensorList":[  
      {  
         "accelerometer_Y":8,
         "gravitySensor_X":5,
         "accelerometer_X":9,
         "gravitySensor_Z":4,
         "accelerometer_Z":1,
         "gravitySensor_Y":5,
         "linearAccelerationSensor_Z":3,
         "linearAccelerationSensor_Y":9,
         "linearAccelerationSensor_X":9
      },
      {  
         "accelerometer_Y":1,
         "gravitySensor_X":3,
         "accelerometer_X":5,
         "gravitySensor_Z":5,
         "accelerometer_Z":7,
         "gravitySensor_Y":9,
         "linearAccelerationSensor_Z":6,
         "linearAccelerationSensor_Y":5,
         "linearAccelerationSensor_X":3
      },
 {
   …
 },
 {
   …
 },
 :
 :
   ],
   "tempSensorList":[  
      {  
         "kelvin":585.4928040286752,
         "celsius":43.329574923775425,
         "fahrenheit":50.13864584530086
      },
      {  
         "kelvin":349.95625855125814,
         "celsius":95.68423052685313,
         "fahrenheit":7.854854574219985
      },
 {
   …
 },
 {
   …
 },
 :
 :
 
   ],
   "illuminancesSensorList":[  
      {  
         "illuminance":44.65135784368194
      },
      {  
         "illuminance":98.15404017082403
      },
 {
   …
 },
 {
   …
 },
 :
 :
   ],
   "gpsSensorList":[  
      {  
         "altitude":4.38273213294682,
         "heading":7.416314616289915,
         "latitude":5.759723677991661,
         "longitude":1.4732885894731842
      },
      {  
         "altitude":9.816473807569487,
         "heading":5.118919157684835,
         "latitude":3.581361614110458,
         "longitude":1.3699272610616127
      },
 {
   …
 },
 {
   …
 },
 :
 :
   }

 

Log in to the Kinesis Data Streams console, then choose the Kinesis data stream that was created as part of the step 4.  Choose the Monitor tab to see the graphs. Run the data generation utility for at least 15 mins to generate enough data.

8. Processing Kinesis Data Streams messages using AWS Lambda

As part of the previously-described setup, we also use an AWS Lambda function (name:LambdaForProcessingKinesisRecords) to process the messages from the Kinesis data stream. This Lambda function reads each message content and appends “additional data.” This demonstrates that the incoming message from Kinesis data stream is read, and appended with some additional information to make the message size more than 1 MB.  Several customers have a use case like this to enrich the incoming messages by adding additional information. After the AWS Lambda function appends additional data to incoming messages, it sends them to Amazon Kinesis Data Firehose. Because Kinesis Data Firehose accepts only messages that are less than 1 MB, we must compress the messages before sending to it. In the Lambda function, we are compressing the message using gzip compression before sending it to Kinesis Data Firehose. In addition to compressing each message, we are also appending a new line character (“/n”) to each message after compressing it to separate the messages.

We set the buffer size to 128 MB and duration of the buffer is 900 seconds while creating the Kinesis Data Firehose. This helps merge the incoming compressed messages into larger messages and sends to the provided Amazon S3 bucket.

The AWS Lambda function appends the following content to the original message in Kinesis Data Streams after reading it.

"testAdditonalDataList": [
  {
    "dimesnion_X": 9,
    "dimesnion_Y": 2,
    "dimesnion_Z": 2
  },
  {
    "dimesnion_X": 3,
    "dimesnion_Y": 10,
    "dimesnion_Z": 5
  }
  {
    …
  },
  {
    …
  },
  :
  :
]

If we do not compress the message before sending to Kinesis Data Firehose, it throws this error message in the Amazon CloudWatch Logs.

Here is the code snippet where we are compressing the message in the AWS Lambda function. The complete code can be found in this Git repository.

private String sendToFireHose(String mergedJsonString)
{
    PutRecordResult res = null;
    try {
        //To Firehose -
        System.out.println("MESSAGE SIZE BEFORE COMPRESSION IS : " + mergedJsonString.toString().getBytes(charset).length);
        System.out.println("MESSAGE SIZE AFTER GZIP COMPRESSION IS : " + compressMessage(mergedJsonString.toString().getBytes(charset)).length);
        PutRecordRequest req = new PutRecordRequest()
                .withDeliveryStreamName(firehoseStreamName);

        // Without compression - Send to Firehose
        //Record record = new Record().withData(ByteBuffer.wrap((mergedJsonString.toString() + "\r\n").getBytes()));

        // With compression - send to Firehose
        Record record = new Record().withData(ByteBuffer.wrap(compressMessage((mergedJsonString.toString() + "\r\n").getBytes())));
        req.setRecord(record);
        res = kinesisFirehoseClient.putRecord(req);
    }
    catch (IOException ie) {
        ie.printStackTrace();
    }
    return res.getRecordId();
}

You can check the provided bucket to see if the messages are flowing into the bucket. The Amazon S3 bucket should show something similar to the following example:

You see the files generated from Kinesis Data Firehose that do not have any extension. By default, Kinesis Data Firehose does not provide any extension to the files that are generated in Amazon S3 bucket unless you select a compression option. But in our use case, since the size of the uncompressed input message is greater than 1 MB, we are compressing it before sending to Kinesis Data Firehose. As the message is already compressed, we are not selecting any compression option in Kinesis Data Firehose, as it double-compresses the message and the downstream Spark application cannot process this.

9. Reading and converting the data into parquet format using Apache Spark program with Amazon EMR

As we noted down from the previous screen shot, Kinesis Data Firehose by default does not generate any file extensions to the files that are written into Amazon S3 bucket. This creates a problem while reading the files using Apache Spark. Apache Spark, by default, checks for a valid file name extension if the file is compressed. In this case for gzip compression, it looks for <filename>.gz to successfully read it.

To overcome this issue, we can use Amazon S3 API operations, particularly AmazonS3Client class, to list all the Amazon S3 keys and use Spark’s parallelize method to read the contents of the files. After reading the file content, we can uncompress it using GZipInputStream class. You can find the code snippet below. The complete code can be found in the Git repository.

val allLinesRDD = spark.sparkContext.parallelize(s3ObjectKeys).flatMap
{ key => Source.fromInputStream
  (
   new GZipInputStream(s3Client.getObject(bucketName, key).getObjectContent:   InputStream)
  ).getLines 
}

var finalDF = spark.read.json(allLinesRDD).toDF()

Once the Amazon EMR cluster creation is completed successfully, login to the Amazon EMR master machine using the following command. You can get the “ssh” login command from the AWS CloudFormation stack 5 (step 5) outputs parameter “EMRClusterMaster”.

  • ssh [email protected] -i <KEYPAIR_NAME>.pem
  • Make sure the security port 22 is opened to connect to the Amazon EMR master machine.

Run the Spark program using the following Spark submit command.

spark-submit --class com.optimize.downstream.process.ProcessFilesFromS3AndConvertToParquet --master yarn --deploy-mode client s3://aws-bigdata-blog/artifacts/aws-blog-optimize-downstream-data-processing/appjars/spark-process-1.0-SNAPSHOT-jar-with-dependencies.jar <S3_BUCKET_NAME> fromfirehose/<YEAR>/ output-emr-parquet/

Change the S3_BUCKET_NAME and YEAR values from the previous Spark command.

Argument #PropertyValue
1–classcom.optimize.downstream.process.ProcessFilesFromS3AndConvertToParquet
2–masteryarn
3–deploy-modeclient
4s3://aws-bigdata-blog/artifacts/aws-blog-avoid-small-files/appjars/spark-process-1.0-SNAPSHOT-jar-with-dependencies.jar
5S3_BUCKET_NAMEThe Amazon S3 bucket name that was created as part of the AWS CloudFormation template. The source files are created in this bucket.
6<INPUT S3 LOCATION>“fromfirehose/<YYYY>/”. The input files are created in this Amazon S3 key location under the bucket that was created. “YYYY” represents the current year. For example, “fromfirehose/2018/”
7<OUTPUT S3 LOCATION>Provide an output directory name that will be created under the above provided Amazon S3 bucket. For example: “output-emr-parquet/”

 

When the program finishes running, you can check the Amazon S3 output location to see the files that are written in parquet format.

Cleaning up after the migration

After completing and testing this solution, clean up the resources by stopping your tasks and deleting the AWS CloudFormation stacks. The stack deletion fails if you have any files in the created Amazon S3 bucket. Make sure that you cleaned up the Amazon S3 bucket that was created before deleting the AWS CloudFormation templates.

Conclusion

In this post, we described the process of avoiding small file creation in Amazon S3 by sending the incoming messages to Amazon Kinesis Data Firehose. We also went through the process of reading and storing the data in parquet format using Apache Spark with an Amazon EMR cluster.

 


About the Author

Photo of Srikanth KodaliSrikanth Kodali is a Sr. IOT Data analytics architect at Amazon Web Services. He works with AWS customers to provide guidance and technical assistance on building IoT data and analytics solutions, helping them improve the value of their solutions when using AWS.

 

 

Trigger cross-region replication of pre-existing objects using Amazon S3 inventory, Amazon EMR, and Amazon Athena

Post Syndicated from Michael Sambol original https://aws.amazon.com/blogs/big-data/trigger-cross-region-replication-of-pre-existing-objects-using-amazon-s3-inventory-amazon-emr-and-amazon-athena/

In Amazon Simple Storage Service (Amazon S3), you can use cross-region replication (CRR) to copy objects automatically and asynchronously across buckets in different AWS Regions. CRR is a bucket-level configuration, and it can help you meet compliance requirements and minimize latency by keeping copies of your data in different Regions. CRR replicates all objects in the source bucket, or optionally a subset, controlled by prefix and tags.

Objects that exist before you enable CRR (pre-existing objects) are not replicated. Similarly, objects might fail to replicate (failed objects) if permissions aren’t in place, either on the IAM role used for replication or the bucket policy (if the buckets are in different AWS accounts).

In our work with customers, we have seen situations where large numbers of objects aren’t replicated for the previously mentioned reasons. In this post, we show you how to trigger cross-region replication for pre-existing and failed objects.

Methodology

At a high level, our strategy is to perform a copy-in-place operation on pre-existing and failed objects. This operation uses the Amazon S3 API to copy the objects over the top of themselves, preserving tags, access control lists (ACLs), metadata, and encryption keys. The operation also resets the Replication_Status flag on the objects. This triggers cross-region replication, which then copies the objects to the destination bucket.

To accomplish this, we use the following:

  • Amazon S3 inventory to identify objects to copy in place. These objects don’t have a replication status, or they have a status of FAILED.
  • Amazon Athena and AWS Glue to expose the S3 inventory files as a table.
  • Amazon EMR to execute an Apache Spark job that queries the AWS Glue table and performs the copy-in-place operation.

Object filtering

To reduce the size of the problem (we’ve seen buckets with billions of objects!) and eliminate S3 List operations, we use Amazon S3 inventory. S3 inventory is enabled at the bucket level, and it provides a report of S3 objects. The inventory files contain the objects’ replication status: PENDING, COMPLETED, FAILED, or REPLICA. Pre-existing objects do not have a replication status in the inventory.

Interactive analysis

To simplify working with the files that are created by S3 inventory, we create a table in the AWS Glue Data Catalog. You can query this table using Amazon Athena and analyze the objects.  You can also use this table in the Spark job running on Amazon EMR to identify the objects to copy in place.

Copy-in-place execution

We use a Spark job running on Amazon EMR to perform concurrent copy-in-place operations of the S3 objects. This step allows the number of simultaneous copy operations to be scaled up. This improves performance on a large number of objects compared to doing the copy operations consecutively with a single-threaded application.

Account setup

For the purpose of this example, we created three S3 buckets. The buckets are specific to our demonstration. If you’re following along, you need to create your own buckets (with different names).

We’re using a source bucket named crr-preexisting-demo-source and a destination bucket named crr-preexisting-demo-destination. The source bucket contains the pre-existing objects and the objects with the replication status of FAILED. We store the S3 inventory files in a third bucket named crr-preexisting-demo-inventory.

The following diagram illustrates the basic setup.

You can use any bucket to store the inventory, but the bucket policy must include the following statement (change Resource and aws:SourceAccount to match yours).

{
    "Version": "2012-10-17",
    "Id": "S3InventoryPolicy",
    "Statement": [
        {
            "Sid": "S3InventoryStatement",
            "Effect": "Allow",
            "Principal": {
                "Service": "s3.amazonaws.com"
            },
            "Action": "s3:PutObject",
            "Resource": "arn:aws:s3:::crr-preexisting-demo-inventory/*",
            "Condition": {
                "StringEquals": {
                    "s3:x-amz-acl": "bucket-owner-full-control",
                    "aws:SourceAccount": "111111111111"
                }
            }
        }
    ]
}

In our example, we uploaded six objects to crr-preexisting-demo-source. We added three objects (preexisting-*.txt) before CRR was enabled. We also added three objects (failed-*.txt) after permissions were removed from the CRR IAM role, causing CRR to fail.

Enable S3 inventory

You need to enable S3 inventory on the source bucket. You can do this on the Amazon S3 console as follows:

On the Management tab for the source bucket, choose Inventory.

Choose Add new, and complete the settings as shown, choosing the CSV format and selecting the Replication status check box. For detailed instructions for creating an inventory, see How Do I Configure Amazon S3 Inventory? in the Amazon S3 Console User Guide.

After enabling S3 inventory, you need to wait for the inventory files to be delivered. It can take up to 48 hours to deliver the first report. If you’re following the demo, ensure that the inventory report is delivered before proceeding.

Here’s what our example inventory file looks like:

You can also look on the S3 console on the objects’ Overview tab. The pre-existing objects do not have a replication status, but the failed objects show the following:

Register the table in the AWS Glue Data Catalog using Amazon Athena

To be able to query the inventory files using SQL, first you need to create an external table in the AWS Glue Data Catalog. Open the Amazon Athena console at https://console.aws.amazon.com/athena/home.

On the Query Editor tab, run the following SQL statement. This statement registers the external table in the AWS Glue Data Catalog.

CREATE EXTERNAL TABLE IF NOT EXISTS
crr_preexisting_demo (
    `bucket` string,
    key string,
    replication_status string
)
PARTITIONED BY (dt string)
ROW FORMAT DELIMITED
    FIELDS TERMINATED BY ','
    ESCAPED BY '\\'
    LINES TERMINATED BY '\n'
STORED AS INPUTFORMAT 'org.apache.hadoop.hive.ql.io.SymlinkTextInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.IgnoreKeyTextOutputFormat'
LOCATION 's3://crr-preexisting-demo-inventory/crr-preexisting-demo-source/crr-preexisting-demo/hive';

After creating the table, you need to make the AWS Glue Data Catalog aware of any existing data and partitions by adding partition metadata to the table. To do this, you use the Metastore Consistency Check utility to scan for and add partition metadata to the AWS Glue Data Catalog.

MSCK REPAIR TABLE crr_preexisting_demo;

To learn more about why this is required, see the documentation on MSCK REPAIR TABLE and data partitioning in the Amazon Athena User Guide.

Now that the table and partitions are registered in the Data Catalog, you can query the inventory files with Amazon Athena.

SELECT * FROM crr_preexisting_demo where dt='2019-02-24-04-00';

The results of the query are as follows.

The query returns all rows in the S3 inventory for a specific delivery date. You’re now ready to launch an EMR cluster to copy in place the pre-existing and failed objects.

Note: If your goal is to fix FAILED objects, make sure that you correct what caused the failure (IAM permissions or S3 bucket policies) before proceeding to the next step.

Create an EMR cluster to copy objects

To parallelize the copy-in-place operations, run a Spark job on Amazon EMR. To facilitate EMR cluster creation and EMR step submission, we wrote a bash script (available in this GitHub repository).

To run the script, clone the GitHub repo. Then launch the EMR cluster as follows:

$ git clone https://github.com/aws-samples/amazon-s3-crr-preexisting-objects
$ ./launch emr.sh

Note: Running the bash script results in AWS charges. By default, it creates two Amazon EC2 instances, one m4.xlarge and one m4.2xlarge. Auto-termination is enabled so when the cluster is finished with the in-place copies, it terminates.

The script performs the following tasks:

  1. Creates the default EMR roles (EMR_EC2_DefaultRole and EMR_DefaultRole).
  2. Uploads the files used for bootstrap actions and steps to Amazon S3 (we use crr-preexisting-demo-inventory to store these files).
  3. Creates an EMR cluster with Apache Spark installed using the create-cluster

After the cluster is provisioned:

  1. A bootstrap action installs boto3 and awscli.
  2. Two steps execute, copying the Spark application to the master node and then running the application.

The following are highlights from the Spark application. You can find the complete code for this example in the amazon-s3-crr-preexisting-objects repo on GitHub.

Here we select records from the table registered with the AWS Glue Data Catalog, filtering for objects with a replication_status of "FAILED" or “”.

query = """
        SELECT bucket, key
        FROM {}
        WHERE dt = '{}'
        AND (replication_status = '""'
        OR replication_status = '"FAILED"')
        """.format(inventory_table, inventory_date)

print('Query: {}'.format(query))

crr_failed = spark.sql(query)

We call the copy_object function for each key returned by the previous query.

def copy_object(self, bucket, key, copy_acls):
        dest_bucket = self._s3.Bucket(bucket)
        dest_obj = dest_bucket.Object(key)

        src_bucket = self._s3.Bucket(bucket)
        src_obj = src_bucket.Object(key)

        # Get the S3 Object's Storage Class, Metadata, 
        # and Server Side Encryption
        storage_class, metadata, sse_type, last_modified = \
            self._get_object_attributes(src_obj)

        # Update the Metadata so the copy will work
        metadata['forcedreplication'] = runtime

        # Get and copy the current ACL
        if copy_acls:
            src_acl = src_obj.Acl()
            src_acl.load()
            dest_acl = {
                'Grants': src_acl.grants,
                'Owner': src_acl.owner
            }

        params = {
            'CopySource': {
                'Bucket': bucket,
                'Key': key
            },
            'MetadataDirective': 'REPLACE',
            'TaggingDirective': 'COPY',
            'Metadata': metadata,
            'StorageClass': storage_class
        }

        # Set Server Side Encryption
        if sse_type == 'AES256':
            params['ServerSideEncryption'] = 'AES256'
        elif sse_type == 'aws:kms':
            kms_key = src_obj.ssekms_key_id
            params['ServerSideEncryption'] = 'aws:kms'
            params['SSEKMSKeyId'] = kms_key

        # Copy the S3 Object over the top of itself, 
        # with the Storage Class, updated Metadata, 
        # and Server Side Encryption
        result = dest_obj.copy_from(**params)

        # Put the ACL back on the Object
        if copy_acls:
            dest_obj.Acl().put(AccessControlPolicy=dest_acl)

        return {
            'CopyInPlace': 'TRUE',
            'LastModified': str(result['CopyObjectResult']['LastModified'])
        }

Note: The Spark application adds a forcedreplication key to the objects’ metadata. It does this because Amazon S3 doesn’t allow you to copy in place without changing the object or its metadata.

Verify the success of the EMR job by running a query in Amazon Athena

The Spark application outputs its results to S3. You can create another external table with Amazon Athena and register it with the AWS Glue Data Catalog. You can then query the table with Athena to ensure that the copy-in-place operation was successful.

CREATE EXTERNAL TABLE IF NOT EXISTS
crr_preexisting_demo_results (
  `bucket` string,
  key string,
  replication_status string,
  last_modified string
)
ROW FORMAT DELIMITED
  FIELDS TERMINATED BY ','
  LINES TERMINATED BY '\n'
  STORED AS TEXTFILE
LOCATION 's3://crr-preexisting-demo-inventory/results';

SELECT * FROM crr_preexisting_demo_results;

The results appear as follows on the console.

Although this shows that the copy-in-place operation was successful, CRR still needs to replicate the objects. Subsequent inventory files show the objects’ replication status as COMPLETED. You can also verify on the console that preexisting-*.txt and failed-*.txt are COMPLETED.

It is worth noting that because CRR requires versioned buckets, the copy-in-place operation produces another version of the objects. You can use S3 lifecycle policies to manage noncurrent versions.

Conclusion

In this post, we showed how to use Amazon S3 inventory, Amazon Athena, the AWS Glue Data Catalog, and Amazon EMR to perform copy-in-place operations on pre-existing and failed objects at scale.

Note: Amazon S3 batch operations is an alternative for copying objects. The difference is that S3 batch operations will not check each object’s existing properties and set object ACLs, storage class, and encryption on an object-by-object basis. For more information, see Introduction to Amazon S3 Batch Operations in the Amazon S3 Console User Guide.

 


About the Authors

Michael Sambol is a senior consultant at AWS. He holds an MS in computer science from Georgia Tech. Michael enjoys working out, playing tennis, traveling, and watching Western movies.

 

 

 

 

Chauncy McCaughey is a senior data architect at AWS. His current side project is using statistical analysis of driving habits and traffic patterns to understand how he always ends up in the slow lane.

 

 

 

EMR Notebooks: A managed analytics environment based on Jupyter notebooks

Post Syndicated from Vignesh Rajamani original https://aws.amazon.com/blogs/big-data/emr-notebooks-a-managed-analytics-environment-based-on-jupyter-notebooks/

Notebooks are increasingly becoming the standard tool for interactively developing big data applications. It’s easy to see why. Their flexible architecture allows you to experiment with data in multiple languages, test code interactively, and visualize large datasets. To help scientists and developers easily access notebook tools, we launched Amazon EMR Notebooks, a managed notebook environment that is based on the popular open-source Jupyter notebook application. EMR Notebooks support Spark Magic kernels, which allows you to submit jobs remotely on your EMR cluster using languages like PySpark, Spark SQL, Spark R, and Scala. The kernels submit your Spark code through Apache Livy, which is a REST server for Spark running on your cluster.

EMR Notebooks is designed to make it easy for you to experiment and build applications with Apache Spark. In this blog post, I first cover some of the benefits that EMR Notebooks offers. Then I introduce you to some of its capabilities such as detaching and attaching a notebook to different EMR clusters, monitoring Spark activity from within the notebook, using tags to control user permissions, and setting up user-impersonation to track notebook users and their actions. To learn about creating and using EMR Notebooks, you can visit Using Amazon EMR Notebooks or follow along with the AWS Online Tech Talks webinar.

Benefits of EMR Notebooks

One of the useful features of EMR Notebooks is the separation of the notebook environment from your underlying cluster infrastructure. The separation makes it easy for you to execute notebook code against transient clusters without worrying about deploying or configuring your notebook infrastructure every time you bring up a new cluster. You can create multiple serverless notebooks from the AWS Management Console for EMR and access the notebook UI without spending time setting up SSH access or configuring your browser for port-forwarding. Each notebook you create is launched instantly with its own Spark context. This capability enables you to attach multiple notebooks to a single shared cluster and submit parallel jobs without fear of job conflicts in a multi-tenant environment. This way you make efficient use of your clusters.

You can also connect EMR Notebooks to an EMR cluster as small as a one node. This gives you a budget-friendly sandbox environment to develop your Spark application.

Finally, with EMR Notebooks, you don’t have to spend time to manually configure your notebook to store files persistently. Your notebook files are saved automatically to a chosen Amazon S3 bucket periodically so you don’t have to worry about losing your work if your cluster is shut down. You can retrieve your saved notebooks from the console or download it locally from your S3 bucket in Jupyter “ipynb” format.

Detaching and attaching EMR Notebooks to different clusters

With EMR Notebooks, you can detach an active notebook from a cluster and attach it to a different cluster and promptly resume your work. This capability can be useful in scenarios where you want to move your notebook from a sandbox development cluster to a production environment or attach to a different cluster with appropriate CPU or memory resources and library packages required to execute your notebook against large datasets. To detach an active notebook:

First select the notebook name and then choose Stop.

Wait for the notebook status shown next to the notebook name to change from Ready to Stopped and then choose Change Cluster.

After you stop the notebook, you can choose to attach it to a different cluster in the same VPC or create a new cluster. EMR Notebooks automatically attaches the notebook to the cluster and re-starts the notebook.

Monitoring and debugging Spark jobs

EMR Notebooks supports a built-in Jupyter notebook widget called SparkMonitor that allows you to monitor the status of all your Spark jobs launched from the notebook without connecting to the Spark web UI server.

A widget appears and automatically integrates within the cell structure of your notebook and displays detailed status about the job submitted from each cell of your notebook, providing you with real-time progress of the different stages of the job. For any failed jobs, this widget also offers embedded links to the container logs in Amazon S3, allowing you to get to the relevant logs and debug your jobs.

Additionally, if you have configured your cluster to accept SSH connections, then you can access the Spark application web UI and Hadoop jobs history server from within the notebook. This allows you to view the event timelines, visualize Directed Acyclic Graphs (DAG) of each job, and view the detailed system and runtime information to inspect and debug your code. These web UIs are available automatically the first time you start to run your Spark code from a notebook.

Writing tag-based policies to control user permissions for notebooks and users

EMR Notebooks are by default shared resources that anyone from your organization with access to your AWS account can open, edit, or even delete. If you want more control over your notebook, you can use tags to label your notebook and write IAM policies that control access for other users. To get you started, when you create a notebook, a default tag with a key string, creatorUserId, is set to the value of the IAM User ID of the user who created the notebook.

You can use this tag to limit allowed actions on the notebook to only the creator of the notebook. For example, the permissions policy statement below, when attached to a role or user, enables the IAM user to view, start, stop, edit, or delete only those notebooks that they have created. This policy statement uses the default tag that is applied by EMR Notebooks when you create the notebook.

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Action": [
                "elasticmapreduce:DescribeEditor",
                "elasticmapreduce:StartEditor",
                "elasticmapreduce:StopEditor",
                "elasticmapreduce:DeleteEditor",
                "elasticmapreduce:OpenEditorInConsole"
            ],
            "Effect": "Allow",
            "Resource": "*",
            "Condition": {
                "StringEquals": {
                    "elasticmapreduce:ResourceTag/creatorUserId": "${aws:userId}"
                }
            }
        }
    ]

You can write policies that enforce the creation of tags before starting a notebook. For example, the policy below requires that the user not change or delete the creatorUserID tag that is added by default. The variable ${aws:userId}, specifies the currently active user’s User ID, which is the default value of the tag.

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Action": [
                "elasticmapreduce:CreateEditor"
            ],
            "Effect": "Allow",
            "Resource": "*",
            "Condition": {
                "StringEquals": {
                    "elasticmapreduce:RequestTag/creatorUserId": "${aws:userid}"
                }
            }
        }
    ]
}

You can use the notebook tags along with EMR cluster tags to control notebook user access to your cluster. Tagging your notebook and clusters, in addition to securing your resources, also allows you to categorize, track, and allocate your EMR cluster costs across your different line of businesses. For example, the policy below allows a user to create a notebook only if the notebook has a tag with a key string “department” with its value set to “Analytics” and only if the notebook is attached to the EMR cluster that has a tag with the key string “cost-center” and value set to “12345.”

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Action": [
                "elasticmapreduce:StartEditor"
            ],
            "Effect": "Allow",
            "Resource": "arn:aws:elasticmapreduce:*:123456789012:editor/*",
            "Condition": {
                "StringEquals": {
                    "elasticmapreduce:ResourceTag/department": [
                        "Analytics"
                    ]
                }
            }
        },
        {
            "Action": [
                "elasticmapreduce:StartEditor"
            ],
            "Effect": "Allow",
            "Resource": "arn:aws:elasticmapreduce:*:123456789012:cluster/*",
            "Condition": {
                "StringEquals": {
                    "elasticmapreduce:ResourceTag/cost-center": [
                        "12345"
                    ]
                }
            }
        }
    ]
}

You can learn more about notebook and cluster tags by visiting Using Tags to Control User Permissions and Tag Clusters. Also, visit Using Cost Allocation Tags to learn more using tags to generate cost allocation report in the AWS Billing and Cost Management Console.

Tracking notebook users by enforcing user-impersonation

EMR Notebooks enables multiple users to execute their notebooks’ code concurrently in a shared EMR cluster, improving cluster utilization. By default, all Spark jobs spawned by these different users from their notebook run as the same user (the livy user) on your EMR cluster. If your corporate policy requires you to set up an audit trail and track individual notebook user actions on shared clusters, you can use the user-impersonation feature of EMR Notebooks. This capability allows you to discriminate and audit all notebook users by associating the jobs they executed from their notebook with the user’s IAM identity. To use this feature, you should enable Livy user-impersonation by configuring the core-site and livy-conf configuration classifications when you create and launch an EMR cluster as follows:

[
    {
        "Classification": "core-site",
        "Properties": {
          "hadoop.proxyuser.livy.groups": "*",
          "hadoop.proxyuser.livy.hosts": "*"
        }
    },
    {
        "Classification": "livy-conf",
        "Properties": {
          "livy.impersonation.enabled": "true"
        }
    }
]

Visit Configuring Applications to learn more about configuring application. After this feature is enabled, EMR Notebooks creates HDFS user directories on the master node for each user identity. This means all the Spark jobs from the notebook run as the IAM user instead of the indistinct user livy. For example, if user NB_User1 runs code from the notebook editor, then a user directory named user_NB_User1 is created on the master node and all Spark jobs run as user_NB_User1. You can then use a service like AWS CloudTrail to audit the record of actions by the user NbUser1 by creating a trail. To learn more about setting up audit trails, see logging Amazon EMR API calls in AWS CloudTrail.

Conclusion

In this post, I highlighted some of the capabilities of EMR Notebooks such as the ability to change clusters, monitor Spark jobs from each notebook cell, control user permission, and categorize resource costs. You can do this by using notebook and cluster tags and setting up user-impersonation to track notebook user actions.

Oh by the way, there is no additional charge for using EMR Notebooks and you only pay for the use of the EMR cluster as-usual!

If you have questions or suggestions, feel free to leave a comment.

 


About the Authors

Vignesh Rajamani is a senior product manager for EMR at AWS.

 

 

 

 

Nikki Rouda is the principal product marketing manager for data lakes and big data at AWS. Nikki has spent 20+ years helping enterprises in 40+ countries develop and implement solutions to their analytics and IT infrastructure challenges. Nikki holds an MBA from the University of Cambridge and an ScB in geophysics and math from Brown University.

 

 

 

Test data quality at scale with Deequ

Post Syndicated from Dustin Lange original https://aws.amazon.com/blogs/big-data/test-data-quality-at-scale-with-deequ/

You generally write unit tests for your code, but do you also test your data? Incorrect or malformed data can have a large impact on production systems. Examples of data quality issues are:

  • Missing values can lead to failures in production system that require non-null values (NullPointerException).
  • Changes in the distribution of data can lead to unexpected outputs of machine learning models.
  • Aggregations of incorrect data can lead to wrong business decisions.

In this blog post, we introduce Deequ, an open source tool developed and used at Amazon. Deequ allows you to calculate data quality metrics on your dataset, define and verify data quality constraints, and be informed about changes in the data distribution. Instead of implementing checks and verification algorithms on your own, you can focus on describing how your data should look. Deequ supports you by suggesting checks for you. Deequ is implemented on top of Apache Spark and is designed to scale with large datasets (think billions of rows) that typically live in a distributed filesystem or a data warehouse.

Deequ at Amazon

Deequ is being used internally at Amazon for verifying the quality of many large production datasets. Dataset producers can add and edit data quality constraints. The system computes data quality metrics on a regular basis (with every new version of a dataset), verifies constraints defined by dataset producers, and publishes datasets to consumers in case of success. In error cases, dataset publication can be stopped, and producers are notified to take action. Data quality issues do not propagate to consumer data pipelines, reducing their blast radius.

Overview of Deequ

To use Deequ, let’s look at its main components (also shown in Figure 1).

  • Metrics Computation — Deequ computes data quality metrics, that is, statistics such as completeness, maximum, or correlation. Deequ uses Spark to read from sources such as Amazon S3, and to compute metrics through an optimized set of aggregation queries. You have direct access to the raw metrics computed on the data.
  • Constraint Verification — As a user, you focus on defining a set of data quality constraints to be verified. Deequ takes care of deriving the required set of metrics to be computed on the data. Deequ generates a data quality report, which contains the result of the constraint verification.
  • Constraint Suggestion — You can choose to define your own custom data quality constraints, or use the automated constraint suggestion methods that profile the data to infer useful constraints.

Figure 1: Overview of Deequ components

Setup: Launch the Spark cluster

This section shows the steps to use Deequ on your own data. First, set up Spark and Deequ on an Amazon EMR cluster. Then, load a sample dataset provided by AWS, run some analysis, and then run data tests.

Deequ is built on top of Apache Spark to support fast, distributed calculations on large datasets. Deequ depends on Spark version 2.2.0 or later. As a first step, create a cluster with Spark on Amazon EMR. Amazon EMR takes care of the configuration of Spark for you. Also, you canuse the EMR File System (EMRFS) to directly access data in Amazon S3. For testing, you can also install Spark on a single machine in standalone mode.

Connect to the Amazon EMR master node using SSH. Load the latest Deequ JAR from Maven Repository. To load the JAR of version 1.0.1, use the following:

wget http://repo1.maven.org/maven2/com/amazon/deequ/deequ/1.0.1/deequ-1.0.1.jar

Launch Spark Shell and use the spark.jars argument for referencing the Deequ JAR file:

spark-shell --conf spark.jars=deequ-1.0.1.jar

For more information about how to set up Spark, see the Spark Quick Start guide, and the overview of Spark configuration options.

Load data

As a running example, we use a customer review dataset provided by Amazon on Amazon S3. Let’s load the dataset containing reviews for the category “Electronics” in Spark. Make sure to enter the code in the Spark shell:

val dataset = spark.read.parquet("s3://amazon-reviews-pds/parquet/product_category=Electronics/")

You can see the following selected attributes if you run dataset.printSchema() in the Spark shell:

root
|-- marketplace: string (nullable = true)
|-- customer_id: string (nullable = true)
|-- review_id: string (nullable = true)
|-- product_title: string (nullable = true)
|-- star_rating: integer (nullable = true)
|-- helpful_votes: integer (nullable = true)
|-- total_votes: integer (nullable = true)
|-- vine: string (nullable = true)
|-- year: integer (nullable = true)

Data analysis

Before we define checks on the data, we want to calculate some statistics on the dataset; we call them metrics. Deequ supports the following metrics (they are defined in this Deequ package):

Metric

Description

Usage Example

ApproxCountDistinctApproximate number of distinct value, computed with HyperLogLogPlusPlus sketches.ApproxCountDistinct("review_id")
ApproxQuantileApproximate quantile of a distribution.ApproxQuantile("star_rating", quantile = 0.5)
ApproxQuantilesApproximate quantiles of a distribution.ApproxQuantiles("star_rating", quantiles = Seq(0.1, 0.5, 0.9))
CompletenessFraction of non-null values in a column.Completeness("review_id")
ComplianceFraction of rows that comply with the given column constraint.Compliance("top star_rating", "star_rating >= 4.0")
CorrelationPearson correlation coefficient, measures the linear correlation between two columns. The result is in the range [-1, 1], where 1 means positive linear correlation, -1 means negative linear correlation, and 0 means no correlation.Correlation("total_votes", "star_rating")
CountDistinctNumber of distinct values.CountDistinct("review_id")
DataTypeDistribution of data types such as Boolean, Fractional, Integral, and String. The resulting histogram allows filtering by relative or absolute fractions.DataType("year")
DistinctnessFraction of distinct values of a column over the number of all values of a column. Distinct values occur at least once. Example: [a, a, b] contains two distinct values a and b, so distinctness is 2/3.Distinctness("review_id")
EntropyEntropy is a measure of the level of information contained in an event (value in a column) when considering all possible events (values in a column). It is measured in nats (natural units of information). Entropy is estimated using observed value counts as the negative sum of (value_count/total_count) * log(value_count/total_count). Example: [a, b, b, c, c] has three distinct values with counts [1, 2, 2]. Entropy is then (-1/5*log(1/5)-2/5*log(2/5)-2/5*log(2/5)) = 1.055.Entropy("star_rating")
MaximumMaximum value.Maximum("star_rating")
MeanMean value; null values are excluded.Mean("star_rating")
MinimumMinimum value.Minimum("star_rating")
MutualInformationMutual information describes how much information about one column (one random variable) can be inferred from another column (another random variable). If the two columns are independent, mutual information is zero. If one column is a function of the other column, mutual information is the entropy of the column. Mutual information is symmetric and nonnegative.MutualInformation(Seq("total_votes", "star_rating"))
PatternMatchFraction of rows that comply with a given regular experssion.PatternMatch("marketplace", pattern = raw"\w{2}".r)
SizeNumber of rows in a DataFrame.Size()
SumSum of all values of a column.Sum("total_votes")
UniqueValueRatioFraction of unique values over the number of all distinct values of a column. Unique values occur exactly once; distinct values occur at least once. Example: [a, a, b] contains one unique value b, and two distinct values a and b, so the unique value ratio is 1/2.UniqueValueRatio("star_rating")
UniquenessFraction of unique values over the number of all values of a column. Unique values occur exactly once. Example: [a, a, b] contains one unique value b, so uniqueness is 1/3.Uniqueness("star_rating")

In the following example, we show how to use the AnalysisRunner to define the metrics you are interested in. You can run the following code in the Spark shell by either just pasting it in the shell or by saving it in a local file on the master node and loading it in the Spark shell with the following command:

:load PATH_TO_FILE

import com.amazon.deequ.analyzers.runners.{AnalysisRunner, AnalyzerContext}
import com.amazon.deequ.analyzers.runners.AnalyzerContext.successMetricsAsDataFrame
import com.amazon.deequ.analyzers.{Compliance, Correlation, Size, Completeness, Mean, ApproxCountDistinct}

val analysisResult: AnalyzerContext = { AnalysisRunner
  // data to run the analysis on
  .onData(dataset)
  // define analyzers that compute metrics
  .addAnalyzer(Size())
  .addAnalyzer(Completeness("review_id"))
  .addAnalyzer(ApproxCountDistinct("review_id"))
  .addAnalyzer(Mean("star_rating"))
  .addAnalyzer(Compliance("top star_rating", "star_rating >= 4.0"))
  .addAnalyzer(Correlation("total_votes", "star_rating"))
  .addAnalyzer(Correlation("total_votes", "helpful_votes"))
  // compute metrics
  .run()
}

// retrieve successfully computed metrics as a Spark data frame
val metrics = successMetricsAsDataFrame(spark, analysisResult)

The resulting data frame contains the calculated metrics (call metrics.show() in the Spark shell):

nameinstancevalue
ApproxCountDistinctreview_id3010972
Completenessreview_id1
Compliancetop star_rating0.74941
Correlationhelpful_votes,total_votes0.99365
Correlationtotal_votes,star_rating-0.03451
Meanstar_rating4.03614
Size*3120938

We can learn that:

  • review_id has no missing values and approximately 3,010,972 unique values.
  • 74.9 % of reviews have a star_rating of 4 or higher.
  • total_votes and star_rating are not correlated.
  • helpful_votes and total_votes are strongly correlated.
  • The average star_rating is 4.0.
  • The dataset contains 3,120,938 reviews.

Define and Run Tests for Data

After analyzing and understanding the data, we want to verify that the properties we have derived also hold for new versions of the dataset. By defining assertions on the data distribution as part of a data pipeline, we can ensure that every processed dataset is of high quality, and that any application consuming the data can rely on it.

For writing tests on data, we start with the VerificationSuite and add Checks on attributes of the data. In this example, we test for the following properties of our data:

  • There are at least 3 million rows in total.
  • review_id is never NULL.
  • review_id is unique.
  • star_rating has a minimum of 1.0 and a maximum of 5.0.
  • marketplace only contains “US”, “UK”, “DE”, “JP”, or “FR”.
  • year does not contain negative values.

This is the code that reflects the previous statements. For information about all available checks, see this GitHub repository. You can run this directly in the Spark shell as previously explained:

import com.amazon.deequ.{VerificationSuite, VerificationResult}
import com.amazon.deequ.VerificationResult.checkResultsAsDataFrame
import com.amazon.deequ.checks.{Check, CheckLevel}

val verificationResult: VerificationResult = { VerificationSuite()
  // data to run the verification on
  .onData(dataset)
  // define a data quality check
  .addCheck(
    Check(CheckLevel.Error, "Review Check") 
      .hasSize(_ >= 3000000) // at least 3 million rows
      .hasMin("star_rating", _ == 1.0) // min is 1.0
      .hasMax("star_rating", _ == 5.0) // max is 5.0
      .isComplete("review_id") // should never be NULL
      .isUnique("review_id") // should not contain duplicates
      .isComplete("marketplace") // should never be NULL
      // contains only the listed values
      .isContainedIn("marketplace", Array("US", "UK", "DE", "JP", "FR"))
      .isNonNegative("year")) // should not contain negative values
  // compute metrics and verify check conditions
  .run()
}

// convert check results to a Spark data frame
val resultDataFrame = checkResultsAsDataFrame(spark, verificationResult)

After calling run, Deequ translates your test description into a series of Spark jobs, which are executed to compute metrics on the data. Afterwards, it invokes your assertion functions (e.g., _ == 1.0 for the minimum star-rating check) on these metrics to see if the constraints hold on the data.

Call resultDataFrame.show(truncate=false) in the Spark shell to inspect the result. The resulting table shows the verification result for every test, for example:

constraintconstraint_statusconstraint_message
SizeConstraint(Size(None))Success
MinimumConstraint(Minimum(star_rating,None))Success
MaximumConstraint(Maximum(star_rating,None))Success
CompletenessConstraint(Completeness(review_id,None))Success
UniquenessConstraint(Uniqueness(List(review_id)))FailureValue: 0.9926566948782706 does not meet the constraint requirement!
CompletenessConstraint(Completeness(marketplace,None))Success
ComplianceConstraint(Compliance(marketplace contained in US,UK,DE,JP,FR,marketplace IS NULL OR marketplace IN (‘US’,’UK’,’DE’,’JP’,’FR’),None))Success
ComplianceConstraint(Compliance(year is non-negative,COALESCE(year, 0.0) >= 0,None))Success

Interestingly, the review_id column is not unique, which resulted in a failure of the check on uniqueness.

We can also look at all the metrics that Deequ computed for this check:

VerificationResult.successMetricsAsDataFrame(spark, verificationResult).show(truncate=False)

Result:

nameinstancevalue
Completenessreview_id1
Completenessmarketplace1
Compliancemarketplace contained in US,UK,DE,JP,FR1
Complianceyear is non-negative1
Maximumstar_rating5
Minimumstar_rating1
Size*3120938
Uniquenessreview_id0.99266

Automated Constraint Suggestion

If you own a large number of datasets or if your dataset has many columns, it may be challenging for you to manually define appropriate constraints. Deequ can automatically suggest useful constraints based on the data distribution. Deequ first runs a data profiling method and then applies a set of rules on the result. For more information about how to run a data profiling method, see this GitHub repository.

import com.amazon.deequ.suggestions.{ConstraintSuggestionRunner, Rules}
import spark.implicits._ // for toDS method

// We ask deequ to compute constraint suggestions for us on the data
val suggestionResult = { ConstraintSuggestionRunner()
  // data to suggest constraints for
  .onData(dataset)
  // default set of rules for constraint suggestion
  .addConstraintRules(Rules.DEFAULT)
  // run data profiling and constraint suggestion
  .run()
}

// We can now investigate the constraints that Deequ suggested. 
val suggestionDataFrame = suggestionResult.constraintSuggestions.flatMap { 
  case (column, suggestions) => 
    suggestions.map { constraint =>
      (column, constraint.description, constraint.codeForConstraint)
    } 
}.toSeq.toDS()

The result contains a list of constraints with descriptions and Scala code, so that you can directly apply it in your data quality checks. Call suggestionDataFrame.show(truncate=false) in the Spark shell to inspect the suggested constraints; here we show a subset:

columnconstraintscala code
customer_id‘customer_id’ is not null.isComplete("customer_id")
customer_id‘customer_id’ has type Integral.hasDataType("customer_id", ConstrainableDataTypes.Integral)
customer_id‘customer_id’ has no negative values.isNonNegative("customer_id")
helpful_votes‘helpful_votes’ is not null.isComplete("helpful_votes")
helpful_votes‘helpful_votes’ has no negative values.isNonNegative("helpful_votes")
marketplace‘marketplace’ has value range ‘US’, ‘UK’, ‘DE’, ‘JP’, ‘FR’.isContainedIn("marketplace", Array("US", "UK", "DE", "JP", "FR"))
product_title‘product_title’ is not null.isComplete("product_title")
star_rating‘star_rating’ is not null.isComplete("star_rating")
star_rating‘star_rating’ has no negative values.isNonNegative("star_rating")
vine‘vine’ has value range ‘N’, ‘Y’.isContainedIn("vine", Array("N", "Y"))

Note that the constraint suggestion is based on heuristic rules and assumes that the data it is shown is correct, which might not be the case. We recommend to review the suggestions before applying them in production.

More Examples on GitHub

You can find examples of more advanced features at Deequ’s GitHub page:

  • Deequ not only provides data quality checks with fixed thresholds. Learn how to use anomaly detection on data quality metrics to apply tests on metrics that change over time.
  • Deequ offers support for storing and loading metrics. Learn how to use the MetricsRepository for this use case.
  • If your dataset grows over time or is partitioned, you can use Deequ’s incremental metrics computation capability. For each partition, Deequ stores a state for each computed metric. To compute metrics for the union of partitions, Deequ can use these states to efficiently derive overall metrics without reloading the data.

Additional Resources

Learn more about the inner workings of Deequ in our VLDB 2018 paper “Automating large-scale data quality verification.

Conclusion

This blog post showed you how to use Deequ for calculating data quality metrics, verifying data quality metrics, and profiling data to automate the configuration of data quality checks. Deequ is available for you now to build your own data quality management pipeline.

 


About the Authors

Dustin Lange is an Applied Science Manager at Amazon Search in Berlin. Dustin’s team develops algorithms for improving the search customer experience through machine learning and data quality tracking. He completed his PhD in similarity search in databases in 2013 and started his Amazon career as an Applied Scientist in forecasting the same year.

 

 

Sebastian Schelter is a Senior Applied Scientist at Amazon Search, working on problems at the intersection of data management and machine learning. He holds a Ph.D. in large-scale data processing from TU Berlin and is an elected member of the Apache Software Foundation, where he currently serves as a mentor for the Apache Incubator.

 

 

Philipp Schmidt is an ML Engineer at Amazon Search. After his graduation from TU Berlin he worked at the University of Potsdam and in several startups in Berlin. At Amazon he is working on enabling data quality tracking for large scale datasets and refining the customer shopping experience through machine learning.

 

 

Tammo Rukat is an Applied Scientist at Amazon Search in Berlin. He holds a PhD in statistical machine learning from the University of Oxford. At Amazon he makes use of the abundance and complexity of the company’s large-scale noisy datasets to contribute to a more intelligent customer experience.

 

 

 

 

Optimize Amazon EMR costs with idle checks and automatic resource termination using advanced Amazon CloudWatch metrics and AWS Lambda

Post Syndicated from Praveen Krishnamoorthy Ravikumar original https://aws.amazon.com/blogs/big-data/optimize-amazon-emr-costs-with-idle-checks-and-automatic-resource-termination-using-advanced-amazon-cloudwatch-metrics-and-aws-lambda/

Many customers use Amazon EMR to run big data workloads, such as Apache Spark and Apache Hive queries, in their development environment. Data analysts and data scientists frequently use these types of clusters, known as analytics EMR clusters. Users often forget to terminate the clusters after their work is done. This leads to idle running of the clusters and in turn, adds up unnecessary costs.

To avoid this overhead, you must track the idleness of the EMR cluster and terminate it if it is running idle for long hours. There is the Amazon EMR native IsIdle Amazon CloudWatch metric, which determines the idleness of the cluster by checking whether there’s a YARN job running. However, you should consider additional metrics, such as SSH users connected or Presto jobs running, to determine whether the cluster is idle. Also, when you execute any Spark jobs in Apache Zeppelin, the IsIdle metric remains active (1) for long hours, even after the job is finished executing. In such cases, the IsIdle metric is not ideal in deciding the inactivity of a cluster.

In this blog post, we propose a solution to cut down this overhead cost. We implemented a bash script to be installed in the master node of the EMR cluster, and the script is scheduled to run every 5 minutes. The script monitors the clusters and sends a CUSTOM metric EMR-INUSE (0=inactive; 1=active) to CloudWatch every 5 minutes. If CloudWatch receives 0 (inactive) for some predefined set of data points, it triggers an alarm, which in turn executes an AWS Lambda function that terminates the cluster.

Prerequisites

You must have the following before you can create and deploy this framework:

Note: This solution is designed as an additional feature. It can be applied to any existing EMR clusters by executing the scheduler script (explained later in the post) as an EMR step. If you want to implement this solution as a mandatory feature for your future clusters, you can include the EMR step as part of your cluster deployment. You can also apply this solution to EMR clusters that are spun up through AWS CloudFormation, the AWS CLI, and even the AWS Management Console.

Key components

The following are the key components of the solution.

Analytics EMR cluster

Amazon EMR provides a managed Apache Hadoop framework that lets you easily process large amounts of data across dynamically scalable Amazon EC2 instances. Data scientists use analytics EMR clusters for data analysis, machine learning using notebook applications (such as Apache Zeppelin or JupyterHub), and running big data workloads based on Apache Spark, Presto, etc.

Scheduler script

The schedule_script.sh is the shell script to be executed as an Amazon EMR step. When executed, it copies the monitoring script from the Amazon S3 artifacts folder and schedules the monitoring script to run every 5 minutes. The S3 location of the monitoring script should be passed as an argument.

Monitoring script

The pushShutDownMetrin.sh script is a monitoring script that is implemented using shell commands. It should be installed in the master node of the EMR cluster as an Amazon EMR step. The script is scheduled to run every 5 minutes and sends the cluster activity status to CloudWatch.  

JupyterHub API token script

The jupyterhub_addAdminToken.sh script is a shell script to be executed as an Amazon EMR step if JupyterHub is enabled on the cluster. In our design, the monitoring script uses REST APIs provided by JupyterHub to check whether the application is in use.

To send the request to JupyterHub, you must pass an API token along with the request. By default, the application does not generate API tokens. This script generates the API token and assigns it to the admin user, which is then picked up by the jupyterhub module in the monitoring script to track the activity of the application.

Custom CloudWatch metric

All Amazon EMR clusters send data for several metrics to CloudWatch. Metrics are updated every 5 minutes, automatically collected, and pushed to CloudWatch. For this use case, we created the Amazon EMR metric EMR-INUSE. This metric represents the active status of the cluster based on the module checks implemented in the monitoring script. The metric is set to 0 when the cluster is inactive and 1 when active.

Amazon CloudWatch

CloudWatch is a monitoring service that you can use to set high-resolution alarms to take automated actions. In this case, CloudWatch triggers an alarm if it receives 0 continuously for the configured number of hours.

AWS Lambda

Lambda is a serverless technology that lets you run code without provisioning or managing servers. With Lambda, you can run code for virtually any type of application or backend service—all with zero administration. You can set up your code to automatically trigger from other AWS services. In this case, the triggered CloudWatch alarm mentioned earlier signals Lambda to terminate the cluster.

Architectural diagram

The following diagram illustrates the sequence of events when the solution is enabled, showing what happens to the EMR cluster that is spun up via AWS CloudFormation.

 

The diagram shows the following steps:

  1. The AWS CloudFormation stack is launched to spin up an EMR cluster.
  2. The Amazon EMR step is executed (installs the pushShutDownMetric.sh and then schedules it as a cron job to run every 5 minutes).
  3. If the EMR cluster is active (executing jobs), the master node sets the EMR-INUSE metric to 1 and sends it to CloudWatch.
  4. If the EMR cluster is inactive, the master node sets the EMR-INUSE metric to 0 and sends it to CloudWatch.
  5. On receiving 0 for a predefined number of data points, CloudWatch triggers a CloudWatch alarm.
  6. The CloudWatch alarm sends notification to AWS Lambda to terminate the cluster.
  7. AWS Lambda executes the Lambda function.
  8. The Lambda function then deletes all the stack resources associated with the cluster.
  9. Finally, the EMR cluster is terminated, and the Stack ID is removed from AWS CloudFormation.

Modules in the monitoring script

Following are the different activity checks that are implemented in the monitoring script (pushShutDownMetric.sh). The script is designed in a modular fashion so that you can easily include new modules without modifying the core functionality.

ActiveSSHCheck

The ActiveSSHCheck module checks whether there are any active SSH connections to the master node. If there is an active SSH connection, and it’s idle for less than 10 minutes, the function sets the EMR-INUSE metric to 1 and pushes it to CloudWatch.

YARNCheck

Apache Hadoop YARN is the resource manager of the EMR Hadoop ecosystem. All the Spark Submits and Hive queries reach YARN initially. It then schedules and processes these jobs. The YARNCheck module checks whether there are any running jobs in YARN or jobs completed within last 5 minutes. If it finds any, the function sets the EMR-INUSE metric to 1 and pushes it to CloudWatch.The checks are performed by calling REST APIs exposed by YARN.

The API to fetch the running jobs is http://localhost:8088/ws/v1/cluster/apps?state=RUNNING.

The API to fetch the completed jobs is

http://localhost:8088/ws/v1/cluster/apps?state=FINISHED.

PRESTOCheck

Presto is an open-source distributed query engine for running interactive analytic queries. It is included in EMR release version 5.0.0 and later.The PRESTOCheck module checks whether there are any running Presto queries or if any queries have been completed within last 5 minutes. If there are some, the function sets the EMR-INUSE metric to 1 and pushes it to CloudWatch. These checks are performed by calling REST APIs exposed by the Presto server.

The API to fetch the Presto jobs is http://localhost:8889/v1/query.

ZeppelinCheck

Amazon EMR users use Apache Zeppelin as a notebook for interactive data exploration. The ZeppelinCheck module checks whether there are any jobs running or if any have been completed within the last 5 minutes. If so, the function sets the EMR-INUSE metric to 1 and pushes it to CloudWatch. These checks are performed by calling the REST APIs exposed by Zeppelin.

The API to fetch the list of notebook IDs is http://localhost:8890/api/notebook.

The API to fetch the status of each cell inside each notebook ID is http://localhost:8890/api/notebook/job/$notebookID.

JupyterHubCheck

Jupyter Notebook is an open-source web application that you can use to create and share documents that contain live code, equations, visualizations, and narrative text. JupyterHub allows you to host multiple instances of a single-user Jupyter notebook server.The JupyterHubCheck module checks whether any Jupyter notebook is currently in use.

The function uses REST APIs exposed by JupyterHub to fetch the list of Jupyter notebook users and gathers the data about individual notebook servers. From the response, it extracts the last activity time of the servers and checks whether any server was used in the last 5 minutes. If so, the function sets the EMR-INUSE metric to 1 and pushes it to CloudWatch. The jupyterhub_addAdminToken.sh script needs to be executed as an EMR step before enabling the scheduler script.

The API to fetch the list of notebook users is https://localhost:9443/hub/api/users -H "Authorization: token $admin_token".

The API to fetch individual server information is https://localhost:9443/hub/api/users/$user -H "Authorization: token $admin_token.

If any one of these checks fails, the cluster is considered to be inactive, and the monitoring script sets the EMR-INUSE metric to 0 and pushes it to CloudWatch.

Note:

The scheduler script schedules the monitoring script (pushShutDownMetric.sh) to run every 5 minutes. Internal cron jobs that run for a very few minutes are not considered in calibrating the EMR-INUSE metric.

Deploying each component

Follow the steps in this section to deploy each component of the proposed design.

Step 1. Create the Lambda function and SNS subscription

The Lambda function and the SNS subscription are the core components of the design. You must set up these components initially, and they are common for every cluster. The following are the AWS resources to be created for these components:

  • Execution role for the Lambda function
  • Terminate Idle EMR Lambda function
  • SNS topic and Lambda subscription

For one-step deployment, use this AWS CloudFormation template to launch and configure the resources in a single go.

The following parameters are available in the template.

ParameterDefaultDescription
s3Bucketemr-shutdown-blogartifactsThe name of the S3 bucket that contains the Lambda file
s3KeyEMRTerminate.zipThe Amazon S3 key of the Lambda file

For manual deployment, follow these steps on the AWS Management Console.

Execution role for the Lambda function

  1. Open the AWS Identity and Access Management (IAM) consoleand choose PoliciesCreate policy.
  2. Choose the JSON tab, paste the following policy text, and then choose Review policy.
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Action": [
                "s3:ListAllMyBuckets",
                "s3:HeadBucket",
                "s3:ListObjects",
                "s3:GetObject",
                "cloudformation:ListStacks",
                "cloudformation:DeleteStack",
                "cloudformation:DescribeStacks",
                "cloudformation:ListStackResources",
                "elasticmapreduce:TerminateJobFlows"
            ],
            "Resource": "*",
            "Effect": "Allow",
            "Sid": "GenericAccess"
        },
        {
            "Action": [
                "logs:CreateLogGroup",
                "logs:CreateLogStream",
                "logs:PutLogEvents"
            ],
            "Resource": "arn:aws:logs:*:*:*",
            "Effect": "Allow",
            "Sid": "LogAccess"
        }
    ]
}
  1. For Name, enter TerminateEMRPolicy and choose Create policy.
  2. Choose RolesCreate role.
  3. Under Choose the service that will use this role, choose Lambda, and then choose Next: Permissions.
  4. For Attach permissions policies, choose the arrow next to Filter policies and choose Customer managed in the drop-down list.
  5. Attach the TerminateEMRPolicy policy that you just created, and choose Review.
  6. For Role name, enter TerminateEMRLambdaRole and then choose Create role.

Terminate idle EMR – Lambda function

I created a deployment package to use with this function.

  1. Open the Lambda consoleand choose Create function.
  2. Choose Author from scratch, and provide the details as shown in the following screenshot:
  • Name: lambdaTerminateEMR
  • Runtime: Python 2.7
  • Role: Choose an existing role
  • Existing role: TerminateEMRLambdaRole

  1. Choose Create function.
  2. In the Function code section, for Code entry type, choose Upload a file from Amazon S3, and for Runtime, choose Python 2.7.

The Lambda function S3 link URL is

s3://emr-shutdown-blogartifacts/EMRTerminate.zip.

Link to the function: https://s3.amazonaws.com/emr-shutdown-blogartifacts/EMRTerminate.zip

This Lambda function is triggered by a CloudWatch alarm. It parses the input event, retrieves the JobFlowId, and deletes the AWS CloudFormation stack of the corresponding JobFlowId.

SNS topic and Lambda subscription

For setting the CloudWatch alarm in the further stages, you must create an Amazon SNS topic that notifies the preceding Lambda function to execute. Follow these steps to create an SNS topic and configure the Lambda endpoint.

  1. Navigate to the Amazon SNS console, and choose Create topic.
  2. Enter the Topic name and Display name, and choose Create topic.

  1. The topic is created and displayed in the Topics
  2. Select the topic and choose Actions, Subscribe to topic.

  1. In the Create subscription, choose the AWS Lambda Choose lambdaterminateEMR as the endpoint, and choose Create subscription.

Step 2. Execute the JupyterHub API token script as an EMR step

This step is required only when JupyterHub is enabled in the cluster.

Navigate to the EMR cluster to be monitored, and execute the scheduler script as an EMR step.

Command: s3://emr-shutdown-blogartifacts/jupyterhub_addAdminToken.sh

This script generates an API token and assigns it to the admin user. It is then picked up by the jupyterhub module in the monitoring script to track the activity of the application.

Step 3. Execute the scheduler script as an EMR step

Navigate to the EMR cluster to be monitored and execute the scheduler script as an EMR step.

Note:

Ensure that termination protection is disabled in the cluster. The termination protection flag causes the Lambda function to fail.

Command: s3://emr-shutdown-blogartifacts/schedule_script.sh

Parameter: s3://emr-shutdown-blogartifacts/pushShutDownMetrin.sh

The step function copies the pushShutDownMetric.sh script to the master node and schedules it to run every 5 minutes.

The schedule_script.sh is at https://s3.amazonaws.com/emr-shutdown-blogartifacts/schedule_script.sh.

The pushShutDownMetrin.sh is at https://s3.amazonaws.com/emr-shutdown-blogartifacts/pushShutDownMetrin.sh.

Step 4. Create a CloudWatch alarm

For single-step deployment, use this AWS CloudFormation template to launch and configure the resources in a single go.

The following parameters are available in the template.

ParameterDefaultDescription
AlarmNameTerminateIDLE-EMRAlarmThe name for the alarm.
EMRJobFlowIDRequires inputThe Jobflowid of the cluster.
EvaluationPeriodRequires inputThe idle timeout value—input should be in data points (1 data point equals 5 minutes). For example, to terminate the cluster if it is idle for 20 minutes, the input should be 4.
SNSSubscribeTopicRequires inputThe Amazon Resource Name (ARN) of the SNS topic to be triggered on the alarm.

 

The AWS CloudFormation CLI command is as follows:

aws cloudformation create-stack --stack-name EMRAlarmStack \
      --template-body s3://emr-shutdown-blogartifacts/Cloudformation/alarm.json \
      --parameters AlarmName=TerminateIDLE-EMRAlarm,EMRJobFlowID=<Input>,                 EvaluationPeriod=4,SNSSubscribeTopic=<Input>

For manual deployment, follow these steps to create the alarm.

  1. Open the Amazon CloudWatch console and choose Alarms.
  2. Choose Create Alarm.
  3. On the Select Metric page, under Custom Metrics, choose EMRShutdown/Cluster-Metric.

  1. Choose the isEMRUsed metric of the EMR JobFlowId, and then choose Next.

  1. Define the alarm as required. In this case, the alarm is set to send notification to the SNS topic shutDownEMRTest when CloudWatch receives the IsEMRUsed metric as 0 for every data point in the last 2 hours.

  1. Choose Create Alarm.

Summary

In this post, we focused on building a framework to cut down the additional cost that you might incur due to the idle running of an EMR cluster. The modules implemented in the shell script, the tracking of the execution status of the Spark scripts, and the Hive/Presto queries using the lightweight REST API calls make this approach an efficient solution.

If you have questions or suggestions, please comment below.

 


About the Author

Praveen Krishnamoorthy Ravikumar is an associate big data consultant with Amazon Web Services.