Tag Archives: Apache HBase

Enhancing data durability in Amazon EMR HBase on Amazon S3 with the Amazon EMR WAL feature

Post Syndicated from Suthan Phillips original https://aws.amazon.com/blogs/big-data/enhancing-data-durability-in-amazon-emr-hbase-on-amazon-s3-with-the-amazon-emr-wal-feature/

Apache HBase, an open source NoSQL database, enables quick access to massive datasets. Amazon EMR, from version 5.2.0, lets you use HBase on Amazon Simple Storage Service (Amazon S3). This combines HBase’s speed with the durability advantages of Amazon S3. Also, it helps achieve the data lake architecture benefits such as the ability to scale storage and compute requirements separately. We see our customers choosing Amazon S3 over Hadoop Distributed File Systems (HDFS) when they want to achieve greater durability, availability, and simplified storage management. Amazon EMR continually improves HBase on Amazon S3, focusing on performance, availability, and reliability.

Despite these durability benefits of HBase on Amazon S3 architecture, a critical concern remains regarding data recovery when the Write-Ahead Log (WAL) is lost. Within the EMR framework, HBase data attains durability when it’s flushed, or written, to Amazon S3. This flushing process is triggered by reaching specific size and time thresholds or through manual initiation. Until data is successfully flushed to S3, it persists within the WAL, which is stored in HDFS. In this post, we dive deep into the new Amazon EMR WAL feature to help you understand how it works, how it enhances durability, and why it’s needed. We explore several scenarios that are well-suited for this feature.

HBase WAL overview

Each RegionServer in HBase is responsible for managing data from multiple tables. These tables are horizontally partitioned into regions, where each region represents a contiguous range of row keys. A RegionServer can host multiple such regions, potentially from different tables. At the RegionServer level, there is a single, shared WAL that records all write operations across all regions and tables in a sequential, append-only manner. This shared WAL makes sure durability is maintained by persisting each mutation before applying it to in-memory structures, enabling recovery in case of unexpected failures. Within each region, the memory structure of the MemStore is further divided by column families, which are the fundamental units of physical storage and I/O in HBase. Each column family maintains:

  • Its own MemStore, which holds recently written data in memory for fast access and buffering before it flushes to disk.
  • A set of HFiles, which are immutable data files stored on HDFS (or Amazon S3 in HBase on S3 mode) that hold the persistent, flushed data.

Although all column families within a region are served by the same RegionServer process, they operate independently in terms of memory buffering, flushing, and compaction. However, they still share the same WAL and RegionServer-level resources, which introduces a degree of coordination, hence they operate semi-independently within the broader region context. This architecture is shown in the following diagram.

Architecture diagram of HBase Region Server showing WAL, two regions with in-memory Memstores and persistent HFiles

Understanding the HBase write process: WAL, MemStore, and HFiles

The HBase write path initiates when a client issues a write request, typically through an RPC call directed to the appropriate RegionServer that hosts the target region. Upon receiving the request, the RegionServer identifies the correct HBase region based on the row key and forwards the KeyValue pair accordingly. The write operation follows a two-step process. First, the data is appended to the WAL, which promotes durability by recording every change before it’s committed to memory. The WAL resides on HDFS by default and exists independently on each RegionServer. Its primary purpose is to provide a recovery mechanism in the event of a failure, particularly for edits that have not yet been flushed to disk. When the WAL append is successful, the data is written to the MemStore, an in-memory store for each column family within the region. The MemStore accumulates updates until it reaches a predefined size threshold, controlled by the hbase.hregion.memstore.flush.size parameter (default is 128 MB). When this threshold is exceeded, a flush is triggered.Flushing is handled asynchronously by a background thread in the RegionServer. The thread writes the contents of the MemStore to a new HFile, which is then persisted to long-term storage. In Amazon EMR, the location of this HFile depends on the deployment mode: for HBase on Amazon S3, HFiles are stored in Amazon S3, but for HBase on HDFS, they’re stored in HDFS.This workflow is shown in the following diagram.

HBase write process workflow showing data path through WAL, Memstore, and HFile with AWS services

A region server serves multiple regions, and they all share a common WAL. The WAL records all data changes, storing them in local HDFS. Puts and deletes are initially logged to the WAL by the region server before being recorded in the MemStore for the affected store. Scan and get operations in HBase don’t require the use of the WAL. In the event of a region server crash or unavailability before MemStore flushing, the WAL is crucial for replaying data changes, which promotes data integrity. Because this log by default resides on a replicated filesystem, it enables an alternate server to access and replay the log, requiring nothing from the physically failed server for a complete recovery. When a RegionServer fails abruptly, HBase initiates an automated recovery process orchestrated by the HMaster. First, the ZooKeeper session timeout detects the RegionServer failure, notifying the HMaster. The HMaster then identifies all regions previously hosted on the failed RegionServer and marks them as unassigned. The WAL files from the failed RegionServer are split by region, and these split WAL files are distributed to the new RegionServers that will host the reassigned regions. Each new RegionServer replays its assigned WAL segments to recover the MemStore state that existed before the failure, preventing data loss. When WAL replay is complete, the regions become operational on their new RegionServers, and the recovery process concludes.

HBase recovery workflow from RegionServer failure through WAL splitting to regions online

The effectiveness of the HDFS WAL model relies on the successful completion of the write request in the WAL and the subsequent data replication in HDFS. In cases where some nodes are terminated, HDFS can still recover from the WAL files, allowing HBase to autonomously heal by replaying data from the WALs and rebalancing the regions. However, if all CORE nodes are simultaneously terminated, achieving complete cluster recovery is a challenge because the data to replay from the WAL is lost. The issue arises when WALs are lost due to CORE node shutdown (for example, all three replicas of a file block). In this scenario, HBase enters a loop attempting to replay data from the WALs. Unfortunately, the absence of available blocks in this case causes the HBase server crash procedure to fail and retry indefinitely.

Amazon EMR WAL

To address the mentioned challenge of HDFS WAL and to provide data durability in HBase, Amazon EMR introduces a new EMR WAL feature starting from versions emr-7.0 and emr-6.15. This feature facilitates the recovery of data that hasn’t been flushed to Amazon S3 (HFile). Using this feature provides thorough backup for your HBase clusters. Behind the scenes, the RegionServer writes WAL data to EMR WAL, which is a service outside the EMR cluster. With this feature enabled, concerns about loss of WAL data in HDFS are alleviated. Also, in the event of cluster or Availability Zone failure issues, you can create a new cluster, directing it to the same Amazon S3 root directory and EMR WAL workspace. This enables the automatic recovery of data in the WAL in the order of minutes. Recovery of unflushed data is supported for a duration of 30 days, after which remaining unflushed data is deleted. This workflow is shown in the following diagram.

Detailed sequence diagram of HBase write operations with EMR WAL service integration and S3 storage

Key benefits

Upon enabling EMR WAL, the WALs are located external to the EMR cluster. The key benefits are:

  • High availability – You can remain confident about data integrity even in the face of Availability Zone failures. Their HFiles are stored in Amazon S3, and the WALs are externally stored in EMR WAL. This setup enables cluster recovery and WAL replay in the same or a different Availability Zone within the region. However, for true high availability with zero downtime, relying solely on EMR WAL is not sufficient because recovery still involves brief interruptions. To provide seamless failover and uninterrupted service, HBase replication across multiple Availability Zones is essential along with EMR WAL, providing robust zero-downtime high availability.
  • Data durability improvement – Customers no longer need to concern themselves with potential data loss in scenarios involving WAL data corruption in HDFS or the removal of all replicas in HDFS due to instance terminations.

The following flow diagram compares the sequence of events with and without EMR WAL enabled.

HBase write process and failure recovery workflow with EMR WAL and S3 integration

Key EMR WAL features

In this section, we explore the key enhancements introduced in the EMR WAL service across recent Amazon EMR versions. From grouping multiple HBase regions into a single EMR WAL to advanced configuration options, these new capabilities address specific usage scenarios.

Grouping multiple HBase regions into a single Amazon EMR WAL

In Amazon EMR versions up to 7.2, a separate EMR WAL is created for each region, which can become expensive due to the EMR-WAL-WALHours pricing model, especially when the HBase cluster contains many regions. To address this, starting from Amazon EMR 7.3, we introduced the EMR WAL grouping feature, which enables consolidating multiple HBase regions per EMR WAL, offering significant cost savings (over 99% cost savings in our sample evaluation) and improved operational efficiency. By default, each HBase RegionServer has two Amazon EMR WALs. If you have many regions per RegionServer and want to increase throughput, you can customize the number of WALs per RegionServer by configuring the hbase.wal.regiongrouping.numgroups property. For instance, to set 10 EMR WALs per HBase RegionServer, you can use the following configuration:

[
  {
    "Classification": "hbase-site",
    "Properties": {
      "hbase.wal.regiongrouping.numgroups": "10"
    }
  }
]

The two HBase system tables hbase:meta and hbase:master (masterstore) don’t participate in the WAL grouping mechanisms.

In a performance test using m5.8xlarge instances with 1,000 regions per RegionServer, we observed a significant increase in throughput as the number of WALs grew from 1 to 20 per RegionServer (from 1,570 to 3,384 operations per sec). This led to a 54% improvement in average latency (from 40.5 ms to 18.8 ms) and a 72% reduction in 95th percentile latency (from 231 ms to 64 ms). However, beyond 20 WALs, we noted diminishing returns, with only slight performance improvements between 20 and 50 WALs, and average latency stabilized around 18.7ms. Based on these results, we recommend maintaining a lower region density (around 10 regions per WAL) for optimal performance. Nonetheless, it’s crucial to fine-tune this configuration according to your specific workload characteristics and performance requirements and conduct tests in your lower environment to validate the best setup.

Configurable maximum record size in EMR WAL

Until Amazon EMR version 7.4, the EMR WAL had a record size limit of 4 MB, which was insufficient for some customers. Starting from EMR 7.5, the maximum record size in EMR WAL is configurable through the emr.wal.max.payload.size property. The default value is set to 1 GB. The following is an example of how to set the maximum record size to 2 GB:

[
  {
    "Classification": "hbase-site",
    "Properties": {
      "emr.wal.max.payload.size": "2147483648"
    }
  }
]

AWS PrivateLink support

EMR WAL supports AWS PrivateLink, if you want to keep your connection within the AWS network. To set it up, create a virtual private cloud (VPC) endpoint using the AWS Management Console or AWS Command Line Interface (AWS CLI) and select the service labeled com.amazonaws.region.emrwal.prod. Make sure your VPC endpoint uses the same security groups as the EMR cluster. You have two DNS configuration options: enabling private DNS, which uses the standard endpoint format and automatically routes traffic privately, or using the provided VPC endpoint-specific DNS name for more explicit control. Regardless of the DNS option chosen, both methods mean that traffic remains within the AWS network, enhancing security. To implement this in the EMR cluster, update your cluster configuration to use the PrivateLink endpoint, as shown in the following code sample (for private DNS):

[
    {
        "Classification": "hbase-site",
        "Properties": {
            "emr.wal.client.endpoint": "https://prod.emrwal.region.amazonaws.com"
        }
    }
]

For more details, refer to Access Amazon EMR WAL through AWS PrivateLink in the Amazon EMR documentation.

Encryption options for WAL in Amazon EMR

Amazon EMR automatically encrypts data in transit in the EMR WAL service. You can enable server-side encryption (SSE) for WAL (data at rest) with two key management options:

  • SSE-EMR-WAL: Amazon EMR manages the encryption keys
  • SSE-KMS-WAL: You use an AWS Key Management Service (AWS KMS) key for encryption policies

EMR WAL cross-cluster replication

From EMR 7.5, EMR WAL supports cross-cluster replay, allowing clusters in an active-passive HBase replication setup to use EMR WAL.

For more details on the setup, refer to EMR WAL cross-cluster replication in the Amazon EMR documentation.

EMR WAL enhancement: Minimizing CPU load from HBase sync threads

Starting from EMR 7.9, we’ve implemented code optimizations in EMR WAL to address the high CPU utilization caused by sync threads used by HBase processes to write WAL edits, leading to improved CPU efficiency.

Sample use cases benefitting from this feature

Based on our customer interactions and feedback, this feature can help in the following scenarios.

Continuity during service disruptions

If your business demands disaster recovery with no data loss for an HBase on an S3 cluster due to unexpected service disruptions, such as an Availability Zone failure, the newly introduced feature means you don’t have to rely on a persistent event store solution using Amazon Managed Streaming for Apache Kafka (Amazon MSK) or Amazon Kinesis. Without EMR WAL, you had to set up a complex event-streaming pipeline to retain the most recently ingested data and enable replay from the point of failure. This new feature eliminates that dependency by storing Hbase WALs in the EMR WAL service.

Note: During an Availability Zone (AZ) failure or service-level issue, make sure to fully terminate the original Hbase cluster before launching a new one that points to the same S3 root directory. Running two active Hbase clusters that access the same S3 root can lead to data corruption.

Upgrading to the latest EMR releases or cluster rotations

Without EMR WAL, moving to the latest EMR version or managing cluster rotations with HBase on Amazon S3 necessitated manual interruptions for data flushing to S3. With the new feature, the requirement for data flushing is eliminated. However, during cluster termination and the subsequent launch of a new HBase cluster, there is an inevitable service downtime, during which data producers or ingestion pipelines must handle write disruptions or buffer incoming data until the system is fully restored. Also, the downstream services should account for temporary unavailability, which can be mitigated using a read replica cluster.

Overcoming HDFS challenges during HBase auto scaling

Without EMR WAL feature, having HDFS for your WAL files was a requirement. When implementing custom auto scaling for your HBase clusters, it sometimes resulted in WAL data corruption due to issues linked to HDFS. This is because, to prevent data loss, data blocks had to be moved to different HDFS nodes when one HDFS node was being decommissioned. When nodes continued to be terminated swiftly during scale-down process without allowing sufficient time for graceful decommissioning, it could result in WAL data corruption issues, primarily attributed to missing blocks.

Addressing HDFS disk space issues due to old WALs

When a WAL file is no longer required for recovery, indicating that HBase has made sure all data within the WAL file has been flushed, it’s transferred to the oldWALs folder for archival purposes. The log remains in this location until all other references to the WAL file are completed. In HBase use cases with high write activity, some customers have expressed concerns about the oldWALs directory (/usr/hbase/oldWALs) expanding and occupying excessive disk space and eventually causing disk space issues. With the complete relocation of these WALs to an external EMR WAL service, you will no longer encounter this issue.

Assessing HBase in Amazon EMR clusters with and without EMR WAL for fault tolerance

We conducted a data durability test employing two scripts. The first was for installing YCSB, creating a pre-split table, and loading 8 million records on the master node. The second was for terminating a core node every 90 seconds after a 3-minute wait, totaling five terminations. Two EMR clusters with eight core nodes each were created, one configured with EMR WAL enabled and the other as a standard EMR HBase cluster with the WAL stored in HDFS. After completion of EMR steps, a count was run on the HBase table. In the EMR cluster with EMR WAL enabled, all records were successfully inserted without corruption. In the cluster not using EMR WALs, regions in HBase remained “OPENING” if the node hosting the meta was terminated. For other core node terminations, inserts failed, resulting in a lower record count during validation.

Understanding when EMR WAL read charges apply in HBase

In HBase, standard table read operations such as Get and Scan don’t access WALs. Therefore, EMR WAL read (GiB) charges are only incurred during operations that involve reading from WALs, such as:

  • Restoring data from EMR WALs in a newly launched cluster
  • Replaying WALs to recover data on a crashed RegionServer
  • Performing HBase replication, which involves reading WALs to replicate data across clusters

In a normal scenario, you’re billed only for the following two components related to EMR WAL usage:

  • EMR-WAL-WALHours – Represents the hourly cost of WAL storage, calculated based on the number of WALs maintained. You can use the EMRWALCount metric in Amazon CloudWatch to monitor the number of WALs and track associated usage over time.
  • EMR-WAL-WriteRequestGiB – This reflects the volume of data written to the WAL service, charged by the amount of data written in GiB.

For further details on pricing, refer to Amazon EMR pricing and Amazon EMR Release Guide.

To monitor and analyze EMR WAL related costs in the AWS Cost and Usage Reports (CUR), look under product_servicecode = ‘ElasticMapReduce’, where you’ll find the following product_usagetype entries associated with WAL usage:

  • USE1-EMR-WAL-ReadRequestGiB
  • USE1-EMR-WAL-WALHours
  • USE1-EMR-WAL-WriteRequestGiB

The prefix USE1 indicates the Region (in this case, us-east-1) and will vary depending on where your EMR cluster is deployed.

Summary

This new EMR WAL feature allows you to improve durability of your Amazon EMR HBase on S3 clusters, addressing critical workload scenarios by eliminating the need for streaming solutions for Availability Zone level service disruptions, streamlining processes for upgrading or rotating clusters, preventing data corruption during HBase auto scaling or node termination events, and resolving disk space issues associated with old WALs. Because many of the EMR WAL features are added on the latest releases of Amazon EMR, we recommend that customers use Amazon EMR version 7.9 or later to fully benefit from these improvements.


About the authors

Suthan Phillips is a Senior Analytics Architect at AWS, where he helps customers design and optimize scalable, high-performance data solutions that drive business insights. He combines architectural guidance on system design and scalability with best practices to ensure efficient, secure implementation across data processing and experience layers. Outside of work, Suthan enjoys swimming, hiking and exploring the Pacific Northwest.

Implement Amazon EMR HBase Graceful Scaling

Post Syndicated from Yu-Ting Su original https://aws.amazon.com/blogs/big-data/implement-amazon-emr-hbase-graceful-scaling/

Apache HBase is a massively scalable, distributed big data store in the Apache Hadoop ecosystem. We can use Amazon EMR with HBase on top of Amazon Simple Storage Service (Amazon S3) for random, strictly consistent real-time access for tables with Apache Kylin. It ingests data through spark jobs and queries the HTables through Apache Kylin cubes. The HBase cluster uses HBase write-ahead logs (WAL) instead of Amazon EMR WAL.

A time goes by, companies may want to scale in long-running Amazon EMR HBase clusters because of issues such as Amazon Elastic Compute Cloud (Amazon EC2) scheduling events and budget concerns. Another issue is that companies may use Spot Instances and auto scaling for task nodes for short-term parallel computation power, like MapReduce tasks and spark executors. Amazon EMR also runs HBase region servers on task nodes for Amazon EMR on S3 clusters. Spot interruptions will lead to an unexpected shutdown on HBase region servers. For an Amazon EMR HBase cluster without enabling write-ahead logs (WAL) for Amazon EMR feature, an unexpected shutdown on HBase region servers will cause WAL splits with server recovery process, and it will bring extra load to the cluster and sometimes makes HTables inconsistent.

For these reasons, administrators look for a way to scale-in Amazon EMR HBase cluster gracefully and stop all HBase region servers on the task nodes.

This post demonstrates how to gracefully decommission target region servers programmatically. The scripts do the following tasks. The script also tests successfully in Amazon EMR 7.3.0, Amazon EMR 6.15.0, and 5.36.2.

  • Automatically move the HRegions through a script
  • Raise the decommission priority
  • Decommission HBase region servers gracefully
  • Prevent Amazon EMR provisioning region servers on task nodes by Amazon EMR software configurations
  • Prevent Amazon EMR provisioning region servers on task nodes by Amazon EMR steps

Overview of solution

For graceful scaling in, the script uses HBase built-in graceful_stop.sh to move regions to other region servers to avoid WAL splits when decommissioning nodes. The script uses HDFS CLI and web interface to make sure there are no missing and corrupted HDFS block during the scaling events. To prevent Amazon EMR provisions HBase region servers on task nodes, administrators need to specify software configurations per instance groups when launching a cluster. For existing clusters, administrators can either use a step to terminate HBase region servers on task nodes, or reconfigure the task instance group’s HBase storagerootdir.

Solution

For a running Amazon EMR cluster, administrators can use AWS Command Line Interface (AWS CLI) to issue a modify-instance-groups with EC2InstanceIdsToTerminate to terminate specified instances immediately. But terminating an instance in this way can cause a data loss and unpredictable cluster behavior when HDFS blocks have not enough copies or there are ongoing tasks on those decommissioned nodes. To avoid these risks, administrators can send a modify-instance-groups with a new instance request count without a specific instance ID that administrators want to terminate. This command triggers a graceful decommission process on the Amazon EMR side. However, Amazon EMR only supports graceful decommission for YARN and HDFS. Amazon EMR doesn’t support graceful decommission for HBase.

Hence, administrators can try method 1, as described later in this post, to raise the decommission priority of the decommission targets as the first step. In case tweaking the decommissions priority didn’t work, move forward to the second approach, method 2. Method 2 is to stop the resizing request, and move the HRegions manually before terminating the target core nodes. Note that Amazon EMR is a managed service. Amazon EMR service will terminate the EC2 instance after anyone stops it or detach its Amazon Elastic Block Store (Amazon EBS) volumes. Therefore, don’t try to detach EBS volumes on the decommission targets and attach them to new nodes.

Method 1: Decommission HBase region servers through resizing

To decommission Hadoop nodes, administrators can add decommission targets to HDFS’s and YARN’s exclude list, which were dfs.hosts.exclude and yarn.nodes.exclude.xml. However, Amazon EMR disallows manual update to these files. The reason is that the Amazon EMR service daemon, master instance controller, is the only valid process to update these two files on master nodes. Manual updates to these two files will be reset.

Thus, one of the most accessible ways to raise a core node’s decommission priority according to Amazon EMR is having less instance controller heartbeat.

As the first step, pass move_regions to the following script on Amazon S3, blog_HBase_graceful_decommission.sh, as an Amazon EMR step to move HRegions to other region servers and shutdown processes of region server and instance controller. Please also provide targetRS and S3Path to blog_HBase_graceful_decommission.sh. targetRS represents to the private DNS of the decommission target region server. S3Path represents the location of the region migration script.

This step needs to be run in off-peak hours. After all HRegions on the target region server are moved to other nodes, splitting WAL activities after stopping the HBase region server will generate a very low workload to the cluster because it serves 0 regions.

For more information , refer to blog_HBase_graceful_decommission.sh.

Taking a closer look at the move_regions option in blog_HBase_graceful_decommission.sh, this script disables the region balancer and moves the regions to other region servers. The script retrieves Secure Shell (SSH) credentials from AWS Secrets Manager to access worker nodes.

In addition, the script included some AWS CLI operations. Please make sure the instance profile, EMR_EC2_DefaultRole, can operate the following APIs and have SecretsManagaerReadWrite permission.

Amazon EMR APIs:

  • describe-cluster
  • list-instances
  • modify-instance-groups

Amazon S3 APIs:

  • cp

Secrets Manager APIs:

  • get-secret-value

In Amazon EMR 5.x, HBase on Amazon S3 will make the master node also work as a region server hosting hbase:meta regions. This script will get stuck when trying to move non-hbase:meta HRegions to the master. To automate the script, the parameter, maxthreads, is increased to move regions through multiple threads. By moving regions in a while loop, one of the threads got a runtime error because it tries to move non-hbase:meta HRegions to the master node. Other threads can keep on moving HRegions to other region servers. After the only stuck thread timed out after 300 seconds, it moves forward to the next run. After six retries, manual actions will be required, such as using a move action through the HBase shell for the remaining regions’ movement or resubmitting the step.

The following is the syntax to use the script to invoke the move_regions function through blog_HBase_graceful_decommission.sh as an Amazon EMR step:

Step type: Custom JAR
Name: Move HRegions
JAR location :s3://<region>.elasticmapreduce/libs/script-runner/script-runner.jar
Main class :None
Arguments :s3://yourbucket/your/step/location/blog_HBase_graceful_decommission.sh move_regions <your-secret-id> <targetRS: target_region_server_private_DNS> <S3Path: S3 location>
Action on failure:Continue

Here’s an Amazon EMR step example to move regions:

Step type: Custom JAR
Name: Move HRegions
JAR location :s3://us-west-2.elasticmapreduce/libs/script-runner/script-runner.jar
Main class :None
Arguments :s3://yourbucket/your/step/location/blog_HBase_graceful_decommission.sh move_regions your-secret-id ip-172-0-0-1.us-west-2.compute.internal s3://yourbucket/yourpath/
Action on failure:Continue

In the HBase web UI, the target region server will serve 0 regions after the evacuation, as shown in the following screenshot.

After that, the stop_RS_IC function in the script stopped the HBase region server and instance controller process on the decommission target after making sure that there is no running YARN container on that node.

Note that the script is for Amazon EMR 5.30.0 and later release versions. For Amazon EMR 4.x-5.29.0 release versions, stop_RS_IC in the script needs to be updated by referring to How do I restart a service in Amazon EMR? In the AWS Knowledge Center. Also, in Amazon EMR versions earlier than 5.30.0, Amazon EMR uses a service nanny to watch the status of other processes. If a service nanny automatically restarts the instance controller, please stop the service nanny using the stop_RS_IC function before stopping the instance controller on that node. Here’s an example:

if [ "\$runningContainers" -eq 0 ]; then
        echo "0 container is running on \${targetRS}" | tee -a /tmp/graceful_stop.log;
        echo "Shutdown IC" | tee -a /tmp/graceful_stop.log;
        sudo /etc/init.d/service-nanny stop | tee -a /tmp/graceful_stop.log;
        sudo /etc/init.d/instance-controller stop | tee -a /tmp/graceful_stop.log;
        sudo /etc/init.d/instance-controller status | tee -a /tmp/graceful_stop.log;
else
        echo "Still have \${runningContainers} containers running on \${targetRS}" | tee -a /tmp/graceful_stop.log;
     	echo "Not to shutdown IC" | tee -a /tmp/graceful_stop.log;
fi

After the step is successfully completed, scale in and define (current core node amount is −1) as the desired target node amount using the Amazon EMR console. Amazon EMR might pick up the target core node to decommission it because the instance controller isn’t running on that node. There can be a few minutes of delay for Amazon EMR to detect the heartbeat loss of that target node through polling the instance controller. Thus, make sure the workload is very low and there will be no container to the target node for a while.

Stopping the instance controller merely increases the decommissioning priority. But method 1 doesn’t guarantee that the target core node will be picked up as the decommissioning target by Amazon EMR. If Amazon EMR doesn’t pick up the decommission target as the decommissioning victim after using method 1, administrators can stop the resize activity using the AWS Management Console. Then, proceed to method 2.

Method 2: Manually decommission the target core nodes

Administrators can terminate the node using the EC2InstanceIdsToTerminate option in the modify-instance-groups API. But this action will directly terminate the EC2 instance and will risk losing HDFS blocks. To mitigate the risk of having a data loss, administrators can use the following steps in off-peak hours with zero or very few running jobs.

First, run the move_hregions function through blog_HBase_graceful_decommission.sh as an Amazon EMR step in method 1. The function moves HRegions to other region servers and stopped the HBase region server as well as the instance controller process.

Then, run the terminate_ec2 function in blog_HBase_graceful_decommission.sh as an Amazon EMR step. To run this function successfully, please provide the target instance group ID and target instance ID to the script. This function merely terminates one node at a time by specifying the EC2InstanceIdsToTerminate option in the modify-instance-groups API. This makes sure that the core nodes are not terminated back-to-back and lowered the risks of missing HDFS blocks. It inspects HDFS and makes sure all HDFS blocks had at least two copies. If an HDFS block have only one copy, the script will exit with an error message similar to, “Some HDFS blocks have only 1 copy. Please increase HDFS replication factor through the following command for existing HDFS blocks.”

$ hdfs dfs -setrep -R -w 2 <the-file-or-directory-you-want-to-modify>

To make sure all upcoming HDFS blocks have at least two copies, reconfigure the core instance group with the following software configuration:

[{
    "classification": "hdfs-site",
    "properties": {
        "dfs.replication": "2"
    },
    "configurations": []
}]

In addition, the terminateEC2 function compares the metadata of the replicating blocks before and after terminating the core node using hdfs dfsadmin -report. This makes sure no under-replicating, corrupted, or missing HDFS block increased.

The terminateEC2 function tracked decommission status. The script will complete after the decommission completes. It can take some time to recover HDFS blocks. The elapsed time depends on several factors such as the total number of blocks, I/O, bandwidth, HDFS handler amount, and name node resources. If there are many HDFS blocks to be recovered, it may take a few hours to complete. Before running the script, please make sure that the instance profile, EMR_EC2_DefaultRole, have permission of elasticmapreduce:ModifyInstanceGroups.

The following is the syntax to use the script to invoke the terminate_ec2 function through blog_HBase_graceful_decommission.sh as an Amazon EMR step:

Step type: Custom JAR
Name: Terminate EC2
JAR location :s3://<region>.elasticmapreduce/libs/script-runner/script-runner.jar
Main class :None
Arguments :s3://yourbucket/your/step/location/blog_HBase_graceful_decommission.sh terminate_ec2 <your-secret-id> <instance_groupID> <target_EC2_Instance_ID>
Action on failure:Continue

Here’s an Amazon EMR step example to move regions:

Step type: Custom JAR
Name: Terminate EC2
JAR location :s3://us-west-2.elasticmapreduce/libs/script-runner/script-runner.jar
Main class :None
Arguments :s3://yourbucket/your/step/location/blog_HBase_graceful_decommission.sh terminate_ec2 your-secret-id ig-ABCDEFGH12345 i-1234567890abcdef
Action on failure:Continue

While invoking terminate_ec2, the script checks HDFS Name Node Web UI for the decommission target to understand how many blocks need to be recovered on other nodes after submitting the decommission request. Here are the steps:

  1. On the Amazon EMR console, version 6.x, find HDFS NameNode web UI. For example, enter http://<master-node-public-DNS>:9870
  2. On the top menu bar, choose Datanodes
  3. In the In operation section, check the on-service data nodes and the total number of data blocks on the nodes, as shown in the following screenshot.
  4. To view the HDFS decommissioning progress, go to Overview, as shown in the following screenshot.

On the Datanodes page, the decommission target node will not have a green checkmark, and the node will be in the Decommissioning section, as shown in the following screenshot.

The step’s STDOUT also reveals the decommission status:

Hostname: ip-172-31-4-197.us-west-2.compute.internal
Decommission Status : Decommission in progress

The decommission target will transit from Decommissioning to Decommissioned in the HDFS NameNode web UI, as shown in the following screenshot.

The decommissioned target will appear in the Dead datanodes section in the step’s STDOUT after the process is completed:

Dead datanodes (1):
Name: 172.31.4.197:50010 (ip-172-31-4-197.us-west-2.compute.internal)
Hostname: ip-172-31-4-197.us-west-2.compute.internal
Decommission Status : Decommissioned
Configured Capacity: 62245027840 (57.97 GB)
DFS Used: 394412032 (376.14 MB)
Non DFS Used: 0 (0 B)
DFS Remaining: 61179640063 (56.98 GB)
DFS Used%: 0.63%
DFS Remaining%: 98.29%
Configured Cache Capacity: 0 (0 B)
Cache Used: 0 (0 B)
Cache Remaining: 0 (0 B)
Cache Used%: 100.00%
Cache Remaining%: 0.00%
Xceivers: 0
Last contact: Tue Jan 14 06:09:17 UTC 2025

After the target node is decommissioned, the hdfs dfsadmin report will be displayed in the last section in the step’s STDOUT . There should be no difference between rep_blocks_${beforeDate} and rep_blocks_${afterDate} as described in the script. It means no additional amount of under-replicated, missing, or corrupt blocks after the decommission. In HBase web UI, the decommissioned region server will be moved to dead region servers. The dead region server records will be reset after restarting HMaster during routine maintenance.

After the Amazon EMR step is completed without errors, please repeat the preceding steps to decommission the next target core node because administrators may have more than one core nodes to decommission.

After administrators complete all decommission tasks, administrators can manually enable the HBase balancer through the HBase shell again:

$ echo "balance_switch true" | sudo -u hbase hbase shell
## To make sure balance_switch is enabled, submit the same command again. The output should say it’s already in “true” status.
$ echo "balance_switch true" | sudo -u hbase hbase shell

Prevent Amazon EMR from provisioning HBase region servers on task nodes

For new clusters, configure HBase settings for master and core groups only and keep the HBase settings empty when launching an Amazon EMR HBase on an S3 cluster. This prevents provisioning HBase region servers on task nodes.

For example, define configurations for applications other than HBase settings in the software configuration textbox in the Software settings section on the Amazon EMR console, as shown in the following screenshot.

Image 007

Then, configure HBase settings in Node configuration – optional for each instance group in the Cluster configuration – required section, as shown in the following screenshot.

Image 008

For master and core instance groups, HBase configurations will be like the following screenshot.

Image 009

Here’s a json formatted example:

[
    {
        "Classification": "hbase",
        "Properties": {
            "hbase.emr.storageMode": "s3"
         }
    },
    {
        "Classification": "hbase-site",
        "Properties": {
            "hbase.rootdir": "s3://my/HBase/on/S3/RootDir/"
        }
    }
]

For task instance groups, there will be no HBase configuration, as shown in the following screenshot.

Image 010

Here’s a json formatted example:

[]

Here’s an example in AWS CLI:

$ aws emr create-cluster \
--applications Name=Hadoop Name=HBase Name=ZooKeeper \
... (skip) \
--instance-groups '[ {"InstanceCount":1,"EbsConfiguration":{"EbsBlockDeviceConfigs":[{"VolumeSpecification":{"SizeInGB":32,"VolumeType":"gp2"},"VolumesPerInstance":2}]},"InstanceGroupType":"MASTER","InstanceType":"m5.xlarge","Configurations":[{"Classification":"hbase","Properties":{"hbase.emr.storageMode":"s3"}},{"Classification":"hbase-site","Properties":{"hbase.rootdir":"s3://my/HBase/on/S3/RootDir/"}}],"Name":"Master - 1"},\
{"InstanceCount":1,"EbsConfiguration":{"EbsBlockDeviceConfigs":[{"VolumeSpecification":{"SizeInGB":32,"VolumeType":"gp2"},"VolumesPerInstance":2}]},"InstanceGroupType":"TASK","InstanceType":"m5.xlarge","Name":"Task - 3"},\
{"InstanceCount":2,"EbsConfiguration":{"EbsBlockDeviceConfigs":[{"VolumeSpecification":{"SizeInGB":64,"VolumeType":"gp2"},"VolumesPerInstance":1}],"EbsOptimized":true},"InstanceGroupType":"CORE","InstanceType":"m5.2xlarge","Configurations":[{"Classification":"hbase","Properties":{"hbase.emr.storageMode":"s3"}},{"Classification":"hbase-site","Properties":{"hbase.rootdir":"s3://my/HBase/on/S3/RootDir/"}}],"Name":"Core - 2"}]' --configurations '[{"Classification":"hdfs-site","Properties":{"dfs.replication":"2"}}]' \
--auto-scaling-role Amazon EMR_AutoScaling_DefaultRole \
... (skip) \
--scale-down-behavior TERMINATE_AT_TASK_COMPLETION --region us-west-2

Stop decommission the HBase region servers on task nodes

For an existing Amazon EMR HBase on an S3 cluster, pass stop_and_check_task_rs to blog_HBase_graceful_decommission.sh as an Amazon EMR step to stop HBase region servers on nodes in a task instance group. The script requirs a task instance group ID and an S3 location to place sharing scripts for task nodes.

The following is the syntax to pass stop_and_check_task_rs to blog_HBase_graceful_decommission.sh as an Amazon EMR step:

Step type: Custom JAR
Name: Stop Hbase Region servers on Task Nodes
JAR location: s3://<region>.elasticmapreduce/libs/script-runner/script-runner.jar
Arguments: s3://yourbucket/your/step/location/blog_HBase_graceful_decommission.sh stop_and_check_task_rs <your-secret-id> <instance_groupID> <S3Path: S3 location>
Action on failure:Continue

Here’s an Amazon EMR step example to stop HBase regions on nodes in a task group:

Step type: Custom JAR
Name: Stop Hbase Region servers on Task Nodes
JAR location :s3://us-west-2.elasticmapreduce/libs/script-runner/script-runner.jar
Main class :None
Arguments :s3://yourbucket/your/step/location/ blog_HBase_graceful_decommission.sh your-secret-id stop_and_check_task_rs ig-ABCDEFGH12345 s3://yourbucket/yourpath/
Action on failure:Continue

This step above not only stops HBase region servers on existing task nodes. To avoid provisioning HBase region servers on new task nodes, the script also reconfigures and scales in the task group. Here are the steps:

  1. Using the move_regions function, in blog_HBase_graceful_decommission.sh, move HRegions on the task group to other nodes and stop region servers on those task nodes.

After making sure that the HBase region servers are stopped at these task nodes, the script reconfigures the task instance group. The reconfiguration details are to let HBase rootdir point to a non-existing location. These settings only apply to the task group. Here’s an example:

[
    {
        "Classification": "hbase-site",
        "Properties": {
            "hbase.rootdir": "hdfs://non/existing/location"
        }
    },
    {
        "Classification": "hbase",
        "Properties": {
            "hbase.emr.storageMode": "hdfs"
        }
    }
]

When the task group’s state returns to RUNNING, the script scales in these task nodes to 0. New task nodes in the upcoming scaling out events will not run HBase region servers.

Conclusion

These scaling steps demonstrate how to handle Amazon EMR HBase scaling gracefully. The functions in the script can help administrators to resolve problems when companies want to gracefully scale the Amazon EMR HBase on S3 clusters without Amazon EMR WAL.

If you have a similar request to scale in an Amazon EMR HBase on an S3 cluster gracefully because the cluster doesn’t enable Amazon EMR WAL, you can refer to this post. Please test the steps in the testing environment for verifications first. After you confirm the steps can meet your production requirements, you can proceed and apply the steps to production environment.


About the Authors

Image 011Yu-Ting Su is a Sr. Hadoop Systems Engineer at Amazon Web Services (AWS). Her expertise is in Amazon EMR and Amazon OpenSearch Service. She’s passionate about distributing computation and helping people to bring their ideas to life.

Image 012Hsing-Han Wang is a Cloud Support Engineer at Amazon Web Services (AWS). He focuses on Amazon EMR and AWS Lambda. Outside of work, he enjoys hiking and jogging, and he is also an Eorzean.

Image 013Cheng Wang is a Technical Account Manager at AWS who has over 10 years of industry experience, focusing on enterprise service support, data analysis, and business intelligence solutions.

Chris Li is an Enterprise Support manager at AWS. He leads a team of Technical Account Managers to solve complex customer problems and implement well-structured solutions.