All posts by Brandon Scheller

Modifying your cluster on the fly with Amazon EMR reconfiguration

Post Syndicated from Brandon Scheller original https://aws.amazon.com/blogs/big-data/modifying-your-cluster-on-the-fly-with-amazon-emr-reconfiguration/

If you are a developer or data scientist using long-running Amazon EMR clusters, you face fast-changing workloads. These changes often require different application configurations to run optimally on your cluster.

With the reconfiguration feature, you can now change configurations on running EMR clusters. Starting with EMR release emr-5.21.0, this feature allows you to modify configurations without creating a new cluster or manually connecting by SSH into each node.

In this post, I go over the following topics:

  • Using reconfiguration
  • Instance group states, configuration versions, and events
  • Reconfiguration example use cases
  • Reconfiguration benefits

Using reconfiguration

Using the reconfiguration feature involves the following tasks:

  • Submitting a reconfiguration
  • Modifying configurations
  • Defining configuration levels

Submitting a reconfiguration

You can submit a reconfiguration through the EMR console, SDK, or AWS CLI. For more information, see Submitting a Reconfiguration and Additional Information in the Amazon EMR documentation.
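
For example, from the AWS CLI you can submit a reconfiguration with the modify-instance-groups command, passing a JSON file that lists the target instance groups and the new configuration sets to apply (complete examples of this file appear later in this post):

$ aws emr modify-instance-groups --cli-input-json file://reconfiguration.json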

Modifying configurations

When submitting a reconfiguration, you must include all of the configurations that you want to apply. The new configuration set replaces the previous one, so any configurations that you leave out are removed. As you modify configurations, the EMR console also tracks your previous cluster configurations for you.

Defining configuration levels

Define cluster-level and instance-group-level configurations for your applications. You supply cluster-level configurations when you create a cluster. These configurations are then automatically applied to all of your instance groups, even those added after the cluster is up and running. After the cluster starts, you can’t modify your cluster-level configurations, but you can supplement or override them at the instance-group level through reconfiguration requests. Whenever you submit a reconfiguration request for an instance group, these new instance-group-level configurations take precedence over the inherited cluster-level configurations.

To better understand how cluster-level and instance-group-level configurations work together on an instance group, look at a simple demonstration in the EMR console:

Under the Configuration tab, select an instance group in the Filter drop-down list. Navigate to the desired instance group’s configuration table. The Source column of the configuration table indicates the level of your configurations.

This cluster starts with the cluster-level configuration set:

[
  {
    "Classification": "core-site",
    "Properties": {
      "Key-A": "Value-1",
      "Key-B": "Value-2"
    }
  }
]

As you can see in the console, the instance group ig-Y4E3MN8C4YBP automatically inherited the cluster-level configuration set. Now, reconfigure the instance group as follows:

[
  {
    "Classification": "core-site",
    "Properties": {
      "Key-A": "Value-a",
      "Key-C": "Value-3"
    }
  }
]

Once the request goes through, the value of configuration “Key-A” gets overridden by the instance-group-level configuration and changes from “Value-1” to “Value-a.” In contrast, the value of configuration “Key-B” remains unchanged. Meanwhile, your request introduces the new, supplemental configuration, “Key-C.” The configuration table in your console always displays these kinds of subtle changes.
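
Put together, the effective configuration set on the instance group after this reconfiguration is equivalent to the following:

[
  {
    "Classification": "core-site",
    "Properties": {
      "Key-A": "Value-a",
      "Key-B": "Value-2",
      "Key-C": "Value-3"
    }
  }
]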

For more information about how to customize cluster-level and instance-group-level configurations, see Supplying a Configuration when Creating a Cluster.

Instance group states, configuration versions, and events

The states of your reconfiguration requests appear in instance group state transitions, configuration version increases, and CloudWatch events. Understand how each works to keep from losing track of any reconfiguration request:

  • Instance group states: After an instance group receives a reconfiguration request, it transitions from the RUNNING state to the RECONFIGURING state. The RECONFIGURING state indicates the start of the reconfiguration process. After the process completes and the new configurations have taken effect, the instance group returns to the RUNNING state. Then, you can verify your configurations either via your application’s Web UI or application-specific commands.
  • Configuration versions: Every reconfiguration request you submit establishes a new configuration set, distinguished by a new version number. Configuration versions start from 0 and increase by 1 for each new configuration set that you submit. Each instance group keeps its respective configuration version number. Version numbers increase depending on the number of times that you reconfigure the different instance groups.
  • Events: EMR posts the state of each reconfiguration request as an Amazon CloudWatch event. These events list the exact times when the request is submitted, when the reconfiguration operation starts, and when it completes. For easy tracking, each request is posted together with its associated configuration version, so you can follow the event flow of a typical reconfiguration request for an instance group. You can also track states and configuration versions from the AWS CLI, as sketched below.
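
For example, a quick way to check an instance group’s current state without the console is the list-instance-groups command. This is a sketch; depending on the EMR release, the output also reports the configuration version and the last successfully applied configuration set for each instance group:

$ aws emr list-instance-groups --cluster-id j-MyClusterID

Look for the instance group’s State field (for example, RUNNING or RECONFIGURING) to confirm where a reconfiguration request stands.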

For a complete list of EMR events and instance group state transitions for reconfiguration operation, see the EMR Management Guide.

Reconfiguration example use cases

Here are some use case examples of reconfiguration operations:

  • Reconfiguring HDFS blocksize
  • Configuring capacity-scheduler queues

Reconfiguring HDFS blocksize

You may deal with fluctuating workloads. These changes can call for new application configurations throughout the lifetime of a cluster.

For example, suppose that you’ve recently seen growth in the workload and filesize for your long-running cluster. You’d like to account for this growth without replacing your current cluster.

To increase your HDFS block size for better performance, take advantage of the new reconfiguration feature. The HDFS NameNode tracks each data block in your cluster. Increasing the block size can improve HDFS performance by reducing the number of blocks the NameNode has to track. A larger block size can also improve job performance by reducing the number of mappers required.

To increase the HDFS block size from the default of 128 MB to 256 MB, submit a reconfiguration request to the master instance group, which runs the NameNode:

$ aws emr modify-instance-groups --cli-input-json file://reconfiguration.json

Here’s the example reconfiguration.json file.

reconfiguration.json:
{
  "ClusterId": "j-MyClusterID",
  "InstanceGroups": [
    {
      "InstanceGroupId": "ig-MyMasterId",
      "Configurations": [
        {
          "Classification": "hdfs-site",
          "Properties": {
            "dfs.blocksize": "256m"
          },
          "Configurations": []
        }
      ]
    }
  ]
}

The EMR reconfiguration process then modifies the “dfs.blocksize” parameter to the provided “256m” value within the hdfs-site.xml file. The reconfiguration process also automatically restarts the NameNode so that it picks up the new configuration. Any new blocks added to the cluster automatically use the new default block size of 256 MB. If you’d like any existing blocks to pick up this default, follow these steps:

  1. Copy the blocks to a new location.
  2. Delete the originals.
  3. Copy the blocks back to their original location.

The restored blocks pick up the new default block size. NameNode is inactive during the short restart period.
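
A minimal sketch of those steps with standard hdfs dfs commands, assuming a hypothetical /user/hadoop/mydata directory (adapt the paths to your own data):

$ hdfs getconf -confKey dfs.blocksize                           # confirm the new default block size
$ hdfs dfs -cp /user/hadoop/mydata /user/hadoop/mydata-rewrite  # copies are written with the new 256 MB block size
$ hdfs dfs -rm -r -skipTrash /user/hadoop/mydata                # delete the originals
$ hdfs dfs -mv /user/hadoop/mydata-rewrite /user/hadoop/mydata  # move the copies back to the original location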

Configuring capacity-scheduler queues

Do you want to change cluster resource sharing strategies among different Hadoop jobs? Modify YARN CapacityScheduler configurations on a running cluster? Add new queues on a large shared cluster that you manage with another organization? Alter the capacity allocation between different queues to meet your changing workloads?

Using the EMR reconfiguration feature, you can make these changes by submitting a reconfiguration request to the master node. New configurations take effect on your queues in a few minutes. You don’t have to go through the hassle of logging in to the master node, directly updating the configuration file, or manually refreshing queues.

EMR clusters come with a single queue by default. Suppose that you want to create two additional queues, alpha and beta, and allocate 30% of the total resource capacity of your cluster to each of them. Here’s a sample command that submits a reconfiguration request to accomplish the desired change:

$ aws emr modify-instance-groups --cli-input-json file://reconfiguration.json

Here’s the example reconfiguration.json file.

reconfiguration.json:
{
   "ClusterId":"j-MyClusterID",
   "InstanceGroups":[
      {
         "InstanceGroupId":"ig-MyMasterId",
         "Configurations":[
            {
               "Classification":"capacity-scheduler",
               "Properties":{
                  "yarn.scheduler.capacity.root.queues":"default,alpha,beta",
                  "yarn.scheduler.capacity.root.default.capacity":"40",
                  "yarn.scheduler.capacity.root.default.accessible-node-labels.CORE.capacity":"40",
                  "yarn.scheduler.capacity.root.alpha.capacity":"30",
                  "yarn.scheduler.capacity.root.alpha.accessible-node-labels":"*",
                  "yarn.scheduler.capacity.root.alpha.accessible-node-labels.CORE.capacity":"30",
                  "yarn.scheduler.capacity.root.beta.capacity":"30",
                  "yarn.scheduler.capacity.root.beta.accessible-node-labels":"*",
                  "yarn.scheduler.capacity.root.beta.accessible-node-labels.CORE.capacity":"30"
               },
               "Configurations":[]
            }
         ]
      }
   ]
}

Access to the “*” label is given to both queues so that each can access labeled core nodes. Because the sum of capacities for all queues must equal 100, the capacity of the default queue decreases to 40%.

Finally, the capacity for each queue’s access to the core label matches the capacity of the queue itself. That means that the core partition splits between queues at the same ratio as the rest of the cluster.

After completing this step, go to the YARN ResourceManager Web UI to verify that your modifications have taken place.
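
If you prefer the command line, you can run a quick check from the master node instead. This is a sketch using standard Hadoop tooling; it assumes the default ResourceManager web port of 8088 and the queue names from the example above:

$ yarn queue -status alpha
$ curl -s http://localhost:8088/ws/v1/cluster/scheduler

The first command prints the state and configured capacity of the alpha queue, and the second returns the full scheduler hierarchy, including all queues, as JSON.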

EMR reconfiguration benefits

The following are EMR reconfiguration benefits:

  • Rolling reconfiguration process
  • Reconfiguration failure and reversion

Rolling reconfiguration process

One key benefit of EMR reconfiguration is a rolling reconfiguration process. From the documentation:

“Amazon EMR follows a ‘rolling’ process to reconfigure the instances in the Task and Core instance groups. Only 10 percent of the instances in an instance group are modified and restarted at a time. This process takes longer to finish but reduces the chance of potential application failure in a running cluster.”

Rolling reconfiguration protects against HDFS downtime by keeping 90 percent of core nodes running during reconfiguration. YARN on EMR also has NodeManager recovery enabled, so containers that were running are recovered after the NodeManager restarts during reconfiguration.

Because running containers are preserved, some MapReduce jobs can continue to run successfully during the reconfiguration process. However, not all applications can recover after a restart. For example, Spark on YARN (the EMR default) may encounter executor issues and job failures after a NodeManager restart.

Test applications with the type of reconfiguration that you plan to do in a safe environment before reconfiguring in production.

Finally, the rolling reconfiguration process might result in a temporary mismatch of your instance group’s configuration state. While mismatched, some instances may have old configurations while others may have the newly requested ones. When reconfiguring your cluster, consider any possible side effects of this situation.

Reconfiguration failure and reversion

EMR can also recover your instance group from a reconfiguration failure.

To make your new configuration take effect, EMR restarts your reconfigured applications and ensures that they are running before declaring the reconfiguration operation complete.

However, if any application fails to restart successfully on any node, the reconfiguration operation fails and the instance group remains in the RECONFIGURING state. Such failures might result from problematic configuration values. For example, an invalid address for `yarn.resourcemanager.scheduler.address` can cause the YARN ResourceManager to fail to restart.

In such situations, EMR automatically triggers a configuration reversion. The reversion re-applies the previous working configuration set on the instance group, and the instance group returns to the RUNNING state as soon as the reversion completes. Your instance group thus returns to a functioning state and maintains the availability of your applications on the cluster. Rolling reconfiguration continues to be used throughout the process.

If applications still fail to start after the previous working configurations have been re-applied, EMR places the instance group in the ARRESTED state rather than making further reconfiguration attempts. To release the instance group from the ARRESTED state, submit a new reconfiguration request.

Summary

In this post, I showed you the basics of how to configure instance groups on running clusters using the new EMR cluster reconfiguration feature. I walked through the semantics of submitting reconfiguration requests, the important configuration-level concepts, and the ways to track a reconfiguration. I also provided some real-world reconfiguration examples and covered two useful features of reconfiguration.

Try the new cluster reconfiguration feature and share your experience with us in the comments below!



About the Authors

Brandon Scheller is a software development engineer for Amazon EMR. His passion lies in developing and advancing the applications of the Hadoop ecosystem and working with the open source community. He enjoys mountaineering in the Cascades in his free time.


Junyang Li is a software development engineer for Amazon EMR. She works on cutting-edge features of EMR and is also involved in open source projects. Besides work, she enjoys arts and crafts, exercising and traveling.


Best Practices for resizing and automatic scaling in Amazon EMR

Post Syndicated from Brandon Scheller original https://aws.amazon.com/blogs/big-data/best-practices-for-resizing-and-automatic-scaling-in-amazon-emr/

You can increase your savings by taking advantage of the dynamic scaling feature set available in Amazon EMR. The ability to scale the number of nodes in your cluster up and down on the fly is among the major features that make Amazon EMR elastic. You can take advantage of scaling in EMR by resizing your cluster down when you have little or no workload. You can also scale your cluster up to add processing power when the job gets too slow. This allows you to spend just enough to cover the cost of your job and little more.

Knowing the complex logic behind this feature can help you take advantage of it to save on cluster costs. In this post, I detail how EMR clusters resize, and I present some best practices for getting the maximum benefit and resulting cost savings for your own cluster through this feature.

EMR scaling is more complex than simply adding or removing nodes from the cluster. One common misconception is that scaling in Amazon EMR works exactly like Amazon EC2 scaling. With EC2 scaling, you can add/remove nodes almost instantly and without worry, but EMR has more complexity to it, especially when scaling a cluster down. This is because important data or jobs could be running on your nodes.

To prevent data loss, Amazon EMR scaling ensures that your node has no running Apache Hadoop tasks or unique data that could be lost before removing your node. It is worth considering this decommissioning delay when resizing your EMR cluster. By understanding and accounting for how this process works, you can avoid issues that have plagued others, such as slow cluster resizes and inefficient automatic scaling policies.

When an EMR cluster is scaled down, two different decommission processes are triggered on the nodes that will be terminated. The first process is the decommissioning of Hadoop YARN, the Hadoop resource manager. Hadoop tasks that are submitted to Amazon EMR generally run through YARN, so EMR must ensure that any running YARN tasks are complete before removing the node. If for some reason a YARN task is stuck, there is a configurable timeout to ensure that the decommissioning still finishes. When this timeout is reached, the YARN task is terminated and rescheduled to a different node so that it can finish.

The second decommission process is that of the Hadoop Distributed File System or HDFS. HDFS stores data in blocks that are spread through the EMR cluster on any nodes that are running HDFS. When an HDFS node is decommissioning, it must replicate those data blocks to other HDFS nodes so that they are not lost when the node is terminated.
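
If you want to watch HDFS decommissioning as it happens, one option is the standard HDFS admin report, run from the master node (a sketch; the exact output format varies by Hadoop version):

$ hdfs dfsadmin -report

DataNodes being removed show a decommission status of “Decommission in progress” until their blocks have been replicated elsewhere, after which they report “Decommissioned.”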

So how can you use this knowledge in Amazon EMR?

Tips for resizing clusters

The following are some issues to consider when resizing your clusters.

EMR clusters can use two types of nodes for Hadoop tasks: core nodes and task nodes. Core nodes host persistent data by running the HDFS DataNode process and run Hadoop tasks through YARN’s resource manager. Task nodes only run Hadoop tasks through YARN and DO NOT store data in HDFS.

When scaling down task nodes on a running cluster, expect a short delay while any Hadoop tasks running on those nodes complete. This allows you to get the best usage of your task nodes by not losing task progress through interruption. However, if your job allows for this interruption, you can adjust the one-hour default timeout on the resize by adjusting the yarn.resourcemanager.nodemanager-graceful-decommission-timeout-secs property (in EMR 5.14) in yarn-site.xml. When this process times out, your task node is shut down regardless of any running tasks. This process is usually relatively quick, which makes it fast to scale down task nodes.
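
If you do want a shorter timeout, a sketch of the corresponding yarn-site configuration classification looks like the following; the 600-second value is only an illustration, not a recommendation:

[
  {
    "Classification": "yarn-site",
    "Properties": {
      "yarn.resourcemanager.nodemanager-graceful-decommission-timeout-secs": "600"
    }
  }
]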

When you’re scaling down core nodes, Amazon EMR must also wait for HDFS to decommission to protect your data. HDFS can take a relatively long time to decommission. This is because HDFS block replication is throttled by design through configurations located in hdfs-site.xml. This in turn means that HDFS decommissioning is throttled. This protects your cluster from a spiked workload if a node goes down, but it slows down decommissioning. When scaling down a large number of core nodes, consider adjusting these configurations beforehand so that you can scale down more quickly.

For example, consider this exercise with HDFS and resizing speed.

The following HDFS configurations, located in hdfs-site.xml, have the most significant impact on throttling block replication:

  • dfs.datanode.balance.bandwidthPerSec: Bandwidth for each node’s replication
  • dfs.namenode.replication.max-streams: Max streams running for block replication
  • dfs.namenode.replication.max-streams-hard-limit: Hard limit on max streams
  • dfs.datanode.balance.max.concurrent.moves: Number of threads used by the block balancer for pending moves
  • dfs.namenode.replication.work.multiplier.per.iteration: Used to determine the number of blocks to begin transferring immediately during each replication interval

(Beware when modifying: Changing these configurations improperly, especially on a cluster with high load, can seriously degrade cluster performance.)

Cluster resizing speed exercise

Modifying these configurations can speed up the decommissioning time significantly. Try the following exercise to see this difference for yourself.

  1. Create an EMR cluster with the following hardware configuration:
  • Master: 1 node – m3.xlarge
  • Core: 6 nodes – m3.xlarge
  2. Connect to the master node of your cluster using SSH (Secure Shell).

For more information, see Connect to the Master Node Using SSH in the Amazon EMR documentation.

  3. Load data into HDFS by using the following jobs:
$ hadoop jar /usr/lib/hadoop-mapreduce/hadoop-mapreduce-examples.jar teragen 1000000000 /user/hadoop/data1/
$ s3-dist-cp --src s3://aws-bigdata-blog/artifacts/ClusterResize/smallfiles25k/ --dest  hdfs:///user/hadoop/data2/
  4. Edit your hdfs-site.xml configs:
$ sudo vim /etc/hadoop/conf/hdfs-site.xml

Then paste the following properties into the hdfs-site.xml configuration.

Disclaimer: These values are relatively high for example purposes and should not necessarily be used in production. Be sure to test config values for production clusters under load before modifying them.

<property>
  <name>dfs.datanode.balance.bandwidthPerSec</name>
  <value>100m</value>
</property>

<property>
  <name>dfs.namenode.replication.max-streams</name>
  <value>100</value>
</property>

<property>
  <name>dfs.namenode.replication.max-streams-hard-limit</name>
  <value>200</value>
</property>

<property>
  <name>dfs.datanode.balance.max.concurrent.moves</name>
  <value>500</value>
</property>

<property>
  <name>dfs.namenode.replication.work.multiplier.per.iteration</name>
  <value>30</value>
</property>
  5. Resize your EMR cluster from six to five core nodes, and look in the EMR events tab to see how long the resize took.
  6. Repeat the previous steps without modifying the configurations, and check the difference in resize time.

While performing this exercise, I saw resizing time lower from 45+ minutes (without config changes) down to about 6 minutes (with modified hdfs-site configs). This exercise demonstrates how much HDFS is throttled under default configurations. Although removing these throttles is dangerous and performance using them should be tested first, they can significantly speed up decommissioning time and therefore resizing.

The following are some additional tips for resizing clusters:

  • Shrink resizing timeouts. You can configure EMR nodes in two ways: instance groups or instance fleets. For more information, see Create a Cluster with Instance Fleets or Uniform Instance Groups. EMR has implemented shrink resize timeouts when nodes are configured in instance fleets. This timeout prevents an instance fleet from attempting to resize forever if something goes wrong during the resize. It currently defaults to one day, so keep it in mind when you are resizing an instance fleet down.

If an instance fleet shrink request takes longer than one day, it finishes and pauses at however many instances are currently running. On the other hand, instance groups have no default shrink resize timeout. However, both types have the one-hour YARN timeout described earlier in the yarn.resourcemanager.nodemanager-graceful-decommission-timeout-secs property (in EMR 5.14) in yarn-site.xml.

  • Watch out for high-frequency HDFS writes when resizing core nodes. If HDFS is receiving a lot of writes, it modifies a large number of blocks that require replication. This replication can interfere with the block replication from any decommissioning core nodes and significantly slow down the resizing process.

Setting up policies for automatic scaling

Although manual scaling is useful, a majority of the time cluster resizes are executed dynamically through Amazon EMR automatic scaling. Generally, the details of the automatic scaling policy must be tailored to the specific Hadoop job, so I won’t go into detail there. Instead, I provide some general guidelines for setting up your cluster’s auto scaling policies.

The following are some considerations when setting up your auto scaling policy.

Metrics for scaling

Choose the right metrics for your node types to trigger scaling. For example, scaling core nodes solely on the YARNMemoryAvailablePercentage metric doesn’t make sense, because you would be increasing or decreasing HDFS total size when you really only need more processing power. Scaling task nodes on HDFSUtilization also doesn’t make sense, because high HDFS utilization calls for more HDFS storage space, which task nodes don’t provide. A common automatic scaling metric for core nodes is HDFSUtilization. Common automatic scaling metrics for task nodes include ContainerPendingRatio and YARNMemoryAvailablePercentage.

Note: Keep in mind that Amazon EMR currently requires HDFS, so you must have at least one core node in your cluster. Core nodes can also provide CPU and memory resources. But if you don’t need to scale HDFS, and you just need more CPU or memory resources for your job, we recommend that you use task nodes for that purpose.

Scaling core nodes

As described earlier, one of the two EMR node types in your cluster is the core node. Core nodes run HDFS, so they have a longer decommissioning delay. This means that they are slow to scale and should not be aggressively scaled. Only adding and removing a few core nodes at a time will help you avoid scaling issues. Unless you need the HDFS storage, scaling task nodes is usually a better option. If you find that you have to scale large numbers of core nodes, consider changing hdfs-site.xml configurations to allow faster decommission time and faster scale down.

Scaling task nodes

Task nodes don’t run HDFS, which makes them perfect for aggressively scaling with a dynamic job. When your Hadoop task has spikes of work between periods of downtime, this is the node type that you want to use.

You can set up task nodes with a very aggressive auto scaling policy, and they can be scaled up or down easily. If you don’t need HDFS space, you can use task nodes in your cluster.

Using Spot Instances

Automatic scaling is a perfect time to use EMR Spot Instance types. The tendency of Spot Instances to disappear and reappear makes them perfect for task nodes. Because these task nodes are already used to scale in and out aggressively, Spot Instances can have very little disadvantage here. However, for time-sensitive Hadoop tasks, On-Demand Instances might be prioritized for the guaranteed availability.

Scale-in vs. scale-out policies for core nodes

Don’t fall into the trap of making your scale-in policy the exact opposite of your scale-out policy, especially for core nodes. Many times, scaling in results in additional delays for decommissioning. Take this into account and allow your scale-in policy to be more forgiving than your scale-out policy. This means longer cooldowns and higher metric requirements to trigger resizes.

You can think of scale-out policies as easily triggered with a low cooldown and small node increments. Scale-in policies should be hard to trigger, with larger cooldowns and node increments.
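
As an illustration, the following sketch pairs an easily triggered scale-out rule with a more conservative scale-in rule, applied to an instance group with the put-auto-scaling-policy command. The instance group ID, thresholds, cooldowns, and capacity limits are placeholder values to adapt, not recommendations:

$ aws emr put-auto-scaling-policy --cluster-id j-MyClusterID \
    --instance-group-id ig-MyInstanceGroupId --auto-scaling-policy file://policy.json

policy.json:
{
  "Constraints": { "MinCapacity": 1, "MaxCapacity": 20 },
  "Rules": [
    {
      "Name": "ScaleOutOnLowAvailableMemory",
      "Action": {
        "SimpleScalingPolicyConfiguration": {
          "AdjustmentType": "CHANGE_IN_CAPACITY",
          "ScalingAdjustment": 2,
          "CoolDown": 300
        }
      },
      "Trigger": {
        "CloudWatchAlarmDefinition": {
          "ComparisonOperator": "LESS_THAN",
          "EvaluationPeriods": 1,
          "MetricName": "YARNMemoryAvailablePercentage",
          "Namespace": "AWS/ElasticMapReduce",
          "Period": 300,
          "Statistic": "AVERAGE",
          "Threshold": 15.0,
          "Unit": "PERCENT"
        }
      }
    },
    {
      "Name": "ScaleInOnHighAvailableMemory",
      "Action": {
        "SimpleScalingPolicyConfiguration": {
          "AdjustmentType": "CHANGE_IN_CAPACITY",
          "ScalingAdjustment": -1,
          "CoolDown": 900
        }
      },
      "Trigger": {
        "CloudWatchAlarmDefinition": {
          "ComparisonOperator": "GREATER_THAN",
          "EvaluationPeriods": 3,
          "MetricName": "YARNMemoryAvailablePercentage",
          "Namespace": "AWS/ElasticMapReduce",
          "Period": 300,
          "Statistic": "AVERAGE",
          "Threshold": 75.0,
          "Unit": "PERCENT"
        }
      }
    }
  ]
}

Note the asymmetry: scaling out adds two nodes after a single five-minute breach and cools down for five minutes, while scaling in removes only one node after fifteen sustained minutes and then waits fifteen minutes before resizing again.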

Minimum nodes for core node auto scaling

One last thing to consider when scaling core nodes is the yarn.app.mapreduce.am.labels property located in yarn-site.xml. In Amazon EMR, yarn.app.mapreduce.am.labels is set to “CORE” by default, which means that the application master always runs on core nodes and not task nodes. This is to help prevent application failure in a scenario where Spot Instances are used for the task nodes.

This means that when setting a minimum number of core nodes, you should choose a number that is greater than or at least equal to the number of simultaneous application masters that you plan to have running on your cluster. If you want the application master to also run on task nodes, you should modify this property to include “TASK.” However, as a best practice, don’t set the yarn.app.mapreduce.am.labels property to TASK if Spot Instances are used for task nodes.

Aggregating data using S3DistCp

Before wrapping up this post, I have one last piece of information to share about cluster resizing. When resizing core nodes, you might notice that HDFS decommissioning takes a very long time. Often this is the result of storing many small files in your cluster’s HDFS. Having many small files within HDFS (files smaller than the HDFS block size of 128 MB) adds lots of metadata overhead and can cause slowdowns in both decommissioning and Hadoop tasks.

Keeping your small files to a minimum by aggregating your data can help your cluster and jobs run more smoothly. For information about how to aggregate files, see the post Seven Tips for Using S3DistCp on Amazon EMR to Move Data Efficiently Between HDFS and Amazon S3.
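
As a sketch of what such aggregation can look like, the following S3DistCp invocation concatenates small HDFS files into larger ones; the paths and the --groupBy regular expression are hypothetical, and --targetSize is specified in MiB:

$ s3-dist-cp --src hdfs:///user/hadoop/smallfiles/ \
             --dest hdfs:///user/hadoop/aggregated/ \
             --groupBy '.*(part).*' \
             --targetSize 128

Files whose names produce the same --groupBy capture group are concatenated until each output file approaches the 128 MiB target, which lines up with the default HDFS block size.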

Summary

In this post, you read about how Amazon EMR resizing logic works to protect your data and Hadoop tasks. I also provided some additional considerations for EMR resizing and automatic scaling. Keeping these practices in mind can help you maximize cluster savings by allowing you to use only the required cluster resources.

If you have questions or suggestions, please leave a comment below.



Additional Reading

If you found this post useful, be sure to check out Seven Tips for Using S3DistCp on Amazon EMR to Move Data Efficiently Between HDFS and Amazon S3 and Dynamically Scale Applications on Amazon EMR with Auto Scaling.



About the Author

Brandon Scheller is a software development engineer for Amazon EMR. His passion lies in developing and advancing the applications of the Hadoop ecosystem and working with the open source community. He enjoys mountaineering in the Cascades in his free time.