All posts by Jigar Mistry

Dynamically scale up storage on Amazon EMR clusters

Post Syndicated from Jigar Mistry original https://aws.amazon.com/blogs/big-data/dynamically-scale-up-storage-on-amazon-emr-clusters/

In a managed Apache Hadoop environment—like an Amazon EMR cluster—when the storage capacity on your cluster fills up, there is no convenient solution to deal with it. This situation occurs because you set up Amazon Elastic Block Store (Amazon EBS) volumes and configure mount points when the cluster is launched, so it’s difficult to modify the storage capacity after the cluster is running. The feasible solutions usually involve adding more nodes to your cluster, backing up your data to a data lake, and then launching a new cluster with a higher storage capacity. Or, if the data that occupies the storage is expendable, removing the excess data is usually the way to go.

To help you deal with this issue in a manageable way in Amazon EMR, I will show you how to dynamically scale up the storage using the elastic volumes feature of Amazon EBS. With this feature, you can increase the volume size, adjust the performance, or change the volume type while the volume is in use. You can continue to use your EMR cluster to run big data applications while the changes take effect.

How HDFS and YARN use disk space on an Amazon EMR cluster

When you create an Amazon EMR cluster, by default, HDFS (Hadoop Distributed File System) and YARN (Yet Another Resource Negotiator) are configured to use the local disk storage available on all the core/task nodes. You configure this inside the yarn-site.xml and hdfs-site.xml configuration files.

Specifically, for HDFS, the dfs.datanode.data.dir parameter is configured to use local storage, where it stores the data blocks for the files maintained by the NameNode daemon. And for YARN, the yarn.nodemanager.local-dirs parameter is configured to store the intermediate files needed for the NodeManager daemon to run the YARN containers.

For example, when the cluster is running a MapReduce job, the map tasks store their output files inside the directories defined by yarn.nodemanager.local-dirs. Additionally, the yarn.nodemanager.log-dirs parameter is configured to store the container logs on the core and task nodes after the YARN application has finished.

General best practices for avoiding storage issues

As you plan to run your jobs on an Amazon EMR cluster, the following are some helpful tips to avoid exceeding the available storage on your cluster.

Estimate your future storage needs

Plan ahead regarding the storage needs of your jobs. When you launch a cluster using the default storage configuration, it might not be sufficient for your workloads, and you might have issues while your jobs are running. It is a good practice to estimate how much intermediate storage your jobs will need. And based on that, you can customize the storage configuration when launching a new cluster.

Store passive data in a data lake

Try to design your workloads in such a way that all your passive data is stored in a data lake like Amazon Simple Storage Service (Amazon S3). This way, you use your cluster only for data processing, performing miscellaneous computation tasks, and storing the results back to the data lake for persistent storage. This approach minimizes the storage requirements on the running cluster.

Plan for more capacity

If your use case dictates that the input or output data should be stored locally on the cluster (HDFS or local storage), you should plan the size of your cluster accordingly. For example, if you are using HDFS, you can create a cluster with a higher number of core nodes that will be enough to store your data. Or, you can customize the core instance group to have more EBS storage capacity than what the default configuration provides.

Possible issues if the storage reaches its maximum capacity

As the EMR cluster is being used to run different data processing applications, at some point, the storage capacity on the cluster might be used up. In that case, the following are some of the issues that can affect your cluster.

Issues from a YARN perspective

If the directories that are defined by the yarn.nodemanager.local-dirs or yarn.nodemanager.log-dirs parameters are filled up to at least 90 percent of the total storage of the volume, the NodeManager marks that particular disk as unhealthy. This action then causes the NodeManager to mark the node that has those disks as unhealthy also. So, if the node is unhealthy, the ResourceManager will not assign any containers to the node.

Additionally, if the termination protection feature is turned off on your EMR cluster, the EMR service eventually terminates the node from your cluster.

Issues from an HDFS perspective

If the HDFS usage on the cluster increases, it corresponds to an increase in the usage of local storage on EBS volumes also. In EMR, the HDFS data directories are configured under the same mount point as the YARN local and log directories. So, if the usage of the mount point exceeds the storage threshold (90 percent) due to HDFS, that again causes YARN to mark that disk as unhealthy, and the ResourceManager blacklists the node.

Dynamically resize the storage space on core and task nodes

To scale up the storage of core and task nodes on your cluster, use the following bootstrap action script:

s3://aws-bigdata-blog/artifacts/resize_storage/resize_storage.sh

Additionally, the EC2 instance profile of your cluster must have the ec2:ModifyVolume permissions to be able to resize the volume.

The script runs on all the nodes of an EMR cluster. It configures a cron job on the nodes and performs a disk utilization check every 2 minutes. On the master node, it performs the check on the root volume and the volumes that are storing the logs of various master daemons. On core and task nodes, it performs the check on volumes that YARN and HDFS use and determines whether there is a need to scale up the storage.

When it determines that a volume has exceeded 90 percent of its usage, the size of the volume is expanded by the percentage specified by the “--scaling-factor” parameter. During the resize process, the partition of the volume is expanded, and the file system is extended to reflect the updated capacity. All of this happens without affecting the applications that are running on the cluster.

Consider the following caveats before using this solution:

  • You can scale up the storage capacity of nodes in an EMR cluster only if the cluster uses EBS volumes as its storage backend. Certain EC2 instance types use only instance store volumes or both instance store and EBS volumes. You can’t resize the storage capacity on clusters that use such EC2 instance types.
  • While you are deciding on the scaling factor option of the script, plan ahead to increase the volume so that the updated configuration will last for quite some time. The scaling up of storage has to wait at least 6 hours before applying further modifications to the same volume.

Conclusion

In this post, I explained how HDFS and YARN use the local storage on Amazon EMR cluster nodes. I covered how you can scale up the storage on an EMR cluster using the elastic volumes feature of Amazon EBS. You can use this feature to increase volume size, adjust performance, or change volume type while the volume is in use. You can then continue to use your EMR cluster to run big data applications while the changes are being applied.

If you have any questions or suggestions, please leave a comment below.

 


About the Author

Jigar Mistry is a Hadoop Systems Engineer with Amazon Web Services. He works with customers to provide them architectural guidance and technical support for processing large datasets in the cloud using open-source applications. In his spare time, he enjoys going for camping and exploring different restaurants in the Seattle area.

 

Turbocharge your Apache Hive queries on Amazon EMR using LLAP

Post Syndicated from Jigar Mistry original https://aws.amazon.com/blogs/big-data/turbocharge-your-apache-hive-queries-on-amazon-emr-using-llap/

Apache Hive is one of the most popular tools for analyzing large datasets stored in a Hadoop cluster using SQL. Data analysts and scientists use Hive to query, summarize, explore, and analyze big data.

With the introduction of Hive LLAP (Low Latency Analytical Processing), the notion of Hive being just a batch processing tool has changed. LLAP uses long-lived daemons with intelligent in-memory caching to circumvent batch-oriented latency and provide sub-second query response times.

This post provides an overview of Hive LLAP, including its architecture and common use cases for boosting query performance. You will learn how to install and configure Hive LLAP on an Amazon EMR cluster and run queries on LLAP daemons.

What is Hive LLAP?

Hive LLAP was introduced in Apache Hive 2.0, which provides very fast processing of queries. It uses persistent daemons that are deployed on a Hadoop YARN cluster using Apache Slider. These daemons are long-running and provide functionality such as I/O with DataNode, in-memory caching, query processing, and fine-grained access control. And since the daemons are always running in the cluster, it saves substantial overhead of launching new YARN containers for every new Hive session, thereby avoiding long startup times.

When Hive is configured in hybrid execution mode, small and short queries execute directly on LLAP daemons. Heavy lifting (like large shuffles in the reduce stage) is performed in YARN containers that belong to the application. Resources (CPU, memory, etc.) are obtained in a traditional fashion using YARN. After the resources are obtained, the execution engine can decide which resources are to be allocated to LLAP, or it can launch Apache Tez processors in separate YARN containers. You can also configure Hive to run all the processing workloads on LLAP daemons for querying small datasets at lightning fast speeds.

LLAP daemons are launched under YARN management to ensure that the nodes don’t get overloaded with the compute resources of these daemons. You can use scheduling queues to make sure that there is enough compute capacity for other YARN applications to run.

Why use Hive LLAP?

With many options available in the market (Presto, Spark SQL, etc.) for doing interactive SQL  over data that is stored in Amazon S3 and HDFS, there are several reasons why using Hive and LLAP might be a good choice:

  • For those who are heavily invested in the Hive ecosystem and have external BI tools that connect to Hive over JDBC/ODBC connections, LLAP plugs in to their existing architecture without a steep learning curve.
  • It’s compatible with existing Hive SQL and other Hive tools, like HiveServer2, and JDBC drivers for Hive.
  • It has native support for security features with authentication and authorization (SQL standards-based authorization) using HiveServer2.
  • LLAP daemons are aware about of the columns and records that are being processed which enables you to enforce fine-grained access control.
  • It can use Hive’s vectorization capabilities to speed up queries, and Hive has better support for Parquet file format when vectorization is enabled.
  • It can take advantage of a number of Hive optimizations like merging multiple small files for query results, automatically determining the number of reducers for joins and groupbys, etc.
  • It’s optional and modular so it can be turned on or off depending on the compute and resource requirements of the cluster. This lets you to run other YARN applications concurrently without reserving a cluster specifically for LLAP.

How do you install Hive LLAP in Amazon EMR?

To install and configure LLAP on an EMR cluster, use the following bootstrap action (BA):

s3://aws-bigdata-blog/artifacts/Turbocharge_Apache_Hive_on_EMR/configure-Hive-LLAP.sh

This BA downloads and installs Apache Slider on the cluster and configures LLAP so that it works with EMR Hive. For LLAP to work, the EMR cluster must have Hive, Tez, and Apache Zookeeper installed.

You can pass the following arguments to the BA.

ArgumentDefinitionDefault value
--instancesNumber of instances of LLAP daemonNumber of core/task nodes of the cluster
--cacheCache size per instance20% of physical memory of the node
--executorsNumber of executors per instanceNumber of CPU cores of the node
--iothreadsNumber of IO threads per instanceNumber of CPU cores of the node
--sizeContainer size per instance50% of physical memory of the node
--xmxWorking memory size50% of container size
--log-levelLog levels for the LLAP instanceINFO

LLAP example

This section describes how you can try the faster Hive queries with LLAP using the TPC-DS testbench for Hive on Amazon EMR.

Use the following AWS command line interface (AWS CLI) command to launch a 1+3 nodes m4.xlarge EMR 5.6.0 cluster with the bootstrap action to install LLAP:

aws emr create-cluster --release-label emr-5.6.0 \
--applications Name=Hadoop Name=Hive Name=Hue Name=ZooKeeper Name=Tez \
--bootstrap-actions '[{"Path":"s3://aws-bigdata-blog/artifacts/Turbocharge_Apache_Hive_on_EMR/configure-Hive-LLAP.sh","Name":"Custom action"}]' \ 
--ec2-attributes '{"KeyName":"<YOUR-KEY-PAIR>","InstanceProfile":"EMR_EC2_DefaultRole","SubnetId":"subnet-xxxxxxxx","EmrManagedSlaveSecurityGroup":"sg-xxxxxxxx","EmrManagedMasterSecurityGroup":"sg-xxxxxxxx"}' 
--service-role EMR_DefaultRole \
--enable-debugging \
--log-uri 's3n://<YOUR-BUCKET/' --name 'test-hive-llap' \
--instance-groups '[{"InstanceCount":1,"EbsConfiguration":{"EbsBlockDeviceConfigs":[{"VolumeSpecification":{"SizeInGB":32,"VolumeType":"gp2"},"VolumesPerInstance":1}],"EbsOptimized":true},"InstanceGroupType":"MASTER","InstanceType":"m4.xlarge","Name":"Master - 1"},{"InstanceCount":3,"EbsConfiguration":{"EbsBlockDeviceConfigs":[{"VolumeSpecification":{"SizeInGB":32,"VolumeType":"gp2"},"VolumesPerInstance":1}],"EbsOptimized":true},"InstanceGroupType":"CORE","InstanceType":"m4.xlarge","Name":"Core - 2"}]' 
--region us-east-1

After the cluster is launched, log in to the master node using SSH, and do the following:

  1. Open the hive-tpcds folder:
    cd /home/hadoop/hive-tpcds/
  2. Start Hive CLI using the testbench configuration, create the required tables, and run the sample query:

    hive –i testbench.settings
    hive> source create_tables.sql;
    hive> source query55.sql;

    This sample query runs on a 40 GB dataset that is stored on Amazon S3. The dataset is generated using the data generation tool in the TPC-DS testbench for Hive.It results in output like the following:
  3. This screenshot shows that the query finished in about 47 seconds for LLAP mode. Now, to compare this to the execution time without LLAP, you can run the same workload using only Tez containers:
    hive> set hive.llap.execution.mode=none;
    hive> source query55.sql;


    This query finished in about 80 seconds.

The difference in query execution time is almost 1.7 times when using just YARN containers in contrast to running the query on LLAP daemons. And with every rerun of the query, you notice that the execution time substantially decreases by the virtue of in-memory caching by LLAP daemons.

Conclusion

In this post, I introduced Hive LLAP as a way to boost Hive query performance. I discussed its architecture and described several use cases for the component. I showed how you can install and configure Hive LLAP on an Amazon EMR cluster and how you can run queries on LLAP daemons.

If you have questions about using Hive LLAP on Amazon EMR or would like to share your use cases, please leave a comment below.


Additional Reading

Learn how to to automatically partition Hive external tables with AWS.


About the Author

Jigar Mistry is a Hadoop Systems Engineer with Amazon Web Services. He works with customers to provide them architectural guidance and technical support for processing large datasets in the cloud using open-source applications. In his spare time, he enjoys going for camping and exploring different restaurants in the Seattle area.