Tag Archives: Apache HBase

Join Us at the 10th Annual Hadoop Summit / DataWorks Summit, San Jose (Jun 13-15)

Post Syndicated from mikesefanov original https://yahooeng.tumblr.com/post/160966148886



We’re excited to co-host the 10th Annual Hadoop Summit, the leading conference for the Apache Hadoop community, taking place on June 13 – 15 at the San Jose Convention Center. In the last few years, the Hadoop Summit has expanded to cover all things data beyond just Apache Hadoop – such as data science, cloud and operations, IoT and applications – and has been aptly renamed the DataWorks Summit. The three-day program is bursting at the seams! Here are just a few of the reasons why you cannot miss this must-attend event:

  • Familiarize yourself with the cutting edge in Apache project developments from the committers
  • Learn from your peers and industry experts about innovative and real-world use cases, development and administration tips and tricks, success stories and best practices to leverage all your data – on-premise and in the cloud – to drive predictive analytics, distributed deep-learning and artificial intelligence initiatives
  • Attend one of our more than 170 technical deep dive breakout sessions from nearly 200 speakers across eight tracks
  • Check out our keynotes, meetups, trainings, technical crash courses, birds-of-a-feather sessions, Women in Big Data and more
  • Attend the community showcase where you can network with sponsors and industry experts, including a host of startups and large companies like Microsoft, IBM, Oracle, HP, Dell EMC and Teradata

Similar to previous years, we look forward to continuing Yahoo’s decade-long tradition of thought leadership at this year’s summit. Join us for an in-depth look at Yahoo’s Hadoop culture and for the latest in technologies such as Apache Tez, HBase, Hive, Data Highway Rainbow, Mail Data Warehouse and Distributed Deep Learning at the breakout sessions below. Or, stop by Yahoo kiosk #700 at the community showcase.

Also, as a co-host of the event, Yahoo is pleased to offer a 20% discount for the summit with the code MSPO20. Register here for Hadoop Summit, San Jose, California!

DAY 1. TUESDAY June 13, 2017

12:20 – 1:00 P.M. TensorFlowOnSpark – Scalable TensorFlow Learning On Spark Clusters

Andy Feng – VP Architecture, Big Data and Machine Learning

Lee Yang – Sr. Principal Engineer

In this talk, we will introduce a new framework, TensorFlowOnSpark, for scalable TensorFlow learning, that was open sourced in Q1 2017. This new framework enables easy experimentation for algorithm designs, and supports scalable training & inferencing on Spark clusters. It supports all TensorFlow functionalities including synchronous & asynchronous learning, model & data parallelism, and TensorBoard. It provides architectural flexibility for data ingestion to TensorFlow and network protocols for server-to-server communication. With a few lines of code changes, an existing TensorFlow algorithm can be transformed into a scalable application.

2:10 – 2:50 P.M. Handling Kernel Upgrades at Scale – The Dirty Cow Story

Samy Gawande – Sr. Operations Engineer

Savitha Ravikrishnan – Site Reliability Engineer

Apache Hadoop at Yahoo is a massive platform with 36 different clusters spread across YARN, Apache HBase, and Apache Storm deployments, totaling 60,000 servers made up of 100s of different hardware configurations accumulated over generations, presenting unique operational challenges and a variety of unforeseen corner cases. In this talk, we will share methods, tips and tricks to deal with large scale kernel upgrade on heterogeneous platforms within tight timeframes with 100% uptime and no service or data loss through the Dirty COW use case (privilege escalation vulnerability found in the Linux Kernel in late 2016).

5:00 – 5:40 P.M. Data Highway Rainbow –  Petabyte Scale Event Collection, Transport, and Delivery at Yahoo

Nilam Sharma – Sr. Software Engineer

Huibing Yin – Sr. Software Engineer

This talk presents the architecture and features of Data Highway Rainbow, Yahoo’s hosted multi-tenant infrastructure which offers event collection, transport and aggregated delivery as a service. Data Highway supports collection from multiple data centers & aggregated delivery in primary Yahoo data centers which provide a big data computing cluster. From a delivery perspective, Data Highway supports endpoints/sinks such as HDFS, Storm and Kafka; with Storm & Kafka endpoints tailored towards latency sensitive consumers.

DAY 2. WEDNESDAY June 14, 2017

9:05 – 9:15 A.M. Yahoo General Session – Shaping Data Platform for Lasting Value

Sumeet Singh  – Sr. Director, Products

With a long history of open innovation with Hadoop, Yahoo continues to invest in and expand the platform capabilities by pushing the boundaries of what the platform can accomplish for the entire organization. In the last 11 years (yes, it is that old!), the Hadoop platform has shown no signs of giving up or giving in. In this talk, we explore what makes the shared multi-tenant Hadoop platform so special at Yahoo.

12:20 – 1:00 P.M. CaffeOnSpark Update – Recent Enhancements and Use Cases

Mridul Jain – Sr. Principal Engineer

Jun Shi – Principal Engineer

By combining salient features from deep learning framework Caffe and big-data frameworks Apache Spark and Apache Hadoop, CaffeOnSpark enables distributed deep learning on a cluster of GPU and CPU servers. We released CaffeOnSpark as an open source project in early 2016, and shared its architecture design and basic usage at Hadoop Summit 2016. In this talk, we will update audiences about the recent development of CaffeOnSpark. We will highlight new features and capabilities: unified data layer which multi-label datasets, distributed LSTM training, interleave testing with training, monitoring/profiling framework, and docker deployment.

12:20 – 1:00 P.M. Tez Shuffle Handler – Shuffling at Scale with Apache Hadoop

Jon Eagles – Principal Engineer  

Kuhu Shukla – Software Engineer

In this talk we introduce a new Shuffle Handler for Tez, a YARN Auxiliary Service, that addresses the shortcomings and performance bottlenecks of the legacy MapReduce Shuffle Handler, the default shuffle service in Apache Tez. The Apache Tez Shuffle Handler adds composite fetch which has support for multi-partition fetch to mitigate performance slow down and provides deletion APIs to reduce disk usage for long running Tez sessions. As an emerging technology we will outline future roadmap for the Apache Tez Shuffle Handler and provide performance evaluation results from real world jobs at scale.

2:10 – 2:50 P.M. Achieving HBase Multi-Tenancy with RegionServer Groups and Favored Nodes

Thiruvel Thirumoolan – Principal Engineer

Francis Liu – Sr. Principal Engineer

At Yahoo! HBase has been running as a hosted multi-tenant service since 2013. In a single HBase cluster we have around 30 tenants running various types of workloads (ie batch, near real-time, ad-hoc, etc). We will walk through multi-tenancy features explaining our motivation, how they work as well as our experiences running these multi-tenant clusters. These features will be available in Apache HBase 2.0.

2:10 – 2:50 P.M. Data Driving Yahoo Mail Growth and Evolution with a 50 PB Hadoop Warehouse

Nick Huang – Director, Data Engineering, Yahoo Mail  

Saurabh Dixit – Sr. Principal Engineer, Yahoo Mail

Since 2014, the Yahoo Mail Data Engineering team took on the task of revamping the Mail data warehouse and analytics infrastructure in order to drive the continued growth and evolution of Yahoo Mail. Along the way we have built a 50 PB Hadoop warehouse, and surrounding analytics and machine learning programs that have transformed the way data plays in Yahoo Mail. In this session we will share our experience from this 3 year journey, from the system architecture, analytics systems built, to the learnings from development and drive for adoption.

DAY3. THURSDAY June 15, 2017

2:10 – 2:50 P.M. OracleStore – A Highly Performant RawStore Implementation for Hive Metastore

Chris Drome – Sr. Principal Engineer  

Jin Sun – Principal Engineer

Today, Yahoo uses Hive in many different spaces, from ETL pipelines to adhoc user queries. Increasingly, we are investigating the practicality of applying Hive to real-time queries, such as those generated by interactive BI reporting systems. In order for Hive to succeed in this space, it must be performant in all aspects of query execution, from query compilation to job execution. One such component is the interaction with the underlying database at the core of the Metastore. As an alternative to ObjectStore, we created OracleStore as a proof-of-concept. Freed of the restrictions imposed by DataNucleus, we were able to design a more performant database schema that better met our needs. Then, we implemented OracleStore with specific goals built-in from the start, such as ensuring the deduplication of data. In this talk we will discuss the details behind OracleStore and the gains that were realized with this alternative implementation. These include a reduction of 97%+ in the storage footprint of multiple tables, as well as query performance that is 13x faster than ObjectStore with DirectSQL and 46x faster than ObjectStore without DirectSQL.

3:00 P.M. – 3:40 P.M. Bullet – A Real Time Data Query Engine

Akshai Sarma – Sr. Software Engineer

Michael Natkovich – Director, Engineering

Bullet is an open sourced, lightweight, pluggable querying system for streaming data without a persistence layer implemented on top of Storm. It allows you to filter, project, and aggregate on data in transit. It includes a UI and WS. Instead of running queries on a finite set of data that arrived and was persisted or running a static query defined at the startup of the stream, our queries can be executed against an arbitrary set of data arriving after the query is submitted. In other words, it is a look-forward system. Bullet is a multi-tenant system that scales independently of the data consumed and the number of simultaneous queries. Bullet is pluggable into any streaming data source. It can be configured to read from systems such as Storm, Kafka, Spark, Flume, etc. Bullet leverages Sketches to perform its aggregate operations such as distinct, count distinct, sum, count, min, max, and average.

3:00 P.M. – 3:40 P.M. Yahoo – Moving Beyond Running 100% of Apache Pig Jobs on Apache Tez

Rohini Palaniswamy – Sr. Principal Engineer

Last year at Yahoo, we spent great effort in scaling, stabilizing and making Pig on Tez production ready and by the end of the year retired running Pig jobs on Mapreduce. This talk will detail the performance and resource utilization improvements Yahoo achieved after migrating all Pig jobs to run on Tez. After successful migration and the improved performance we shifted our focus to addressing some of the bottlenecks we identified and new optimization ideas that we came up with to make it go even faster. We will go over the new features and work done in Tez to make that happen like custom YARN ShuffleHandler, reworking DAG scheduling order, serialization changes, etc. We will also cover exciting new features that were added to Pig for performance such as bloom join and byte code generation.

4:10 P.M. – 4:50 P.M. Leveraging Docker for Hadoop Build Automation and Big Data Stack Provisioning

Evans Ye,  Software Engineer

Apache Bigtop as an open source Hadoop distribution, focuses on developing packaging, testing and deployment solutions that help infrastructure engineers to build up their own customized big data platform as easy as possible. However, packages deployed in production require a solid CI testing framework to ensure its quality. Numbers of Hadoop component must be ensured to work perfectly together as well. In this presentation, we’ll talk about how Bigtop deliver its containerized CI framework which can be directly replicated by Bigtop users. The core revolution here are the newly developed Docker Provisioner that leveraged Docker for Hadoop deployment and Docker Sandbox for developer to quickly start a big data stack. The content of this talk includes the containerized CI framework, technical detail of Docker Provisioner and Docker Sandbox, a hierarchy of docker images we designed, and several components we developed such as Bigtop Toolchain to achieve build automation.

Register here for Hadoop Summit, San Jose, California with a 20% discount code MSPO20

Questions? Feel free to reach out to us at [email protected] Hope to see you there!

Tips for Migrating to Apache HBase on Amazon S3 from HDFS

Post Syndicated from Bruno Faria original https://aws.amazon.com/blogs/big-data/tips-for-migrating-to-apache-hbase-on-amazon-s3-from-hdfs/

Starting with Amazon EMR 5.2.0, you have the option to run Apache HBase on Amazon S3. Running HBase on S3 gives you several added benefits, including lower costs, data durability, and easier scalability.

HBase provides several options that you can use to migrate and back up HBase tables. The steps to migrate to HBase on S3 are similar to the steps for HBase on the Apache Hadoop Distributed File System (HDFS). However, the migration can be easier if you are aware of some minor differences and a few “gotchas.”

In this post, I describe how to use some of the common HBase migration options to get started with HBase on S3.

HBase migration options

Selecting the right migration method and tools is an important step in ensuring a successful HBase table migration. However, choosing the right ones is not always an easy task.

The following HBase feature and utilities help you migrate to HBase on S3:

  • snapshots
  • Export and Import
  • CopyTable

The following diagram summarizes the steps for each option.

Various factors determine the HBase migration method that you use. For example, EMR offers HBase version 1.2.3 as the earliest version that you can run on S3. Therefore, the HBase version that you’re migrating from can be an important factor in helping you decide. For more information about HBase versions and compatibility, see the HBase version number and compatibility documentation in the Apache HBase Reference Guide.

If you’re migrating from an older version of HBase (for example, HBase 0.94), you should test your application to make sure it’s compatible with newer HBase API versions. You don’t want to spend several hours migrating a large table only to find out that your application and API have issues with a different HBase version.

The good news is that HBase provides utilities that you can use to migrate only part of a table. This lets you test your existing HBase applications without having to fully migrate entire HBase tables. For example, you can use the Export, Import, or CopyTable utilities to migrate a small part of your table to HBase on S3. After you confirm that your application works with newer HBase versions, you can proceed with migrating the entire table using HBase snapshots.

Option 1: Migrate to HBase on S3 using snapshots

You can create table backups easily by using HBase snapshots. HBase also provides the ExportSnapshot utility, which lets you export snapshots to a different location, like S3. In this section, I discuss how you can combine snapshots with ExportSnapshot to migrate tables to HBase on S3.

For details about how you can use HBase snapshots to perform table backups, see Using HBase Snapshots in the Amazon EMR Release Guide and HBase Snapshots in the Apache HBase Reference Guide. These resources provide additional settings and configurations that you can use with snapshots and ExportSnapshot.

The following example shows how to use snapshots to migrate HBase tables to HBase on S3.

Note: Earlier HBase versions, like HBase 0.94, have a different snapshot structure than HBase 1.x, which is what you’re migrating to. If you’re migrating from HBase 0.94 using snapshots, you get a TableInfoMissingException error when you try to restore the table. For details about migrating from HBase 0.94 using snapshots, see the Migrating from HBase 0.94 section.

  1. From the source HBase cluster, create a snapshot of your table:
    $ echo "snapshot '<table_name>', '<snapshot_name>'" | hbase shell

  2. Export the snapshot to an S3 bucket:
    $ hbase org.apache.hadoop.hbase.snapshot.ExportSnapshot -snapshot <snapshot_name> -copy-to s3://<HBase_on_S3_root_dir>/

    For the -copy-to parameter in the ExportSnapshot utility, specify the S3 location that you are using for the HBase root directory of your EMR cluster. If your cluster is already up and running, you can find its S3 hbase.rootdir value by viewing the cluster’s Configurations in the EMR console, or by using the AWS CLI. Here’s the command to find that value:

    $ aws emr describe-cluster – cluster-id <cluster_id> | grep hbase.rootdir

  3. Launch an EMR cluster that uses the S3 storage option with HBase (skip this step if you already have one up and running). For detailed steps, see Creating a Cluster with HBase Using the Console in the Amazon EMR Release Guide. When launching the cluster, ensure that the HBase root directory is set to the same S3 location as your exported snapshots (that is, the location used in the -copy-to parameter in the previous step).
  4. Restore or clone the HBase table from that snapshot.
    • To restore the table and keep the same table name as the source table, use restore_snapshot:
      $ echo "restore_snapshot '<SNAPSHOT_NAME>'"| hbase shell

    • To restore the table into a different table name, use clone_snapshot:
      $ echo "clone_snapshot '<snapshot_name>', '<table_name>'" | hbase shell

Migrating from HBase 0.94 using snapshots

If you’re migrating from HBase version 0.94 using the snapshot method, you get an error if you try to restore from the snapshot. This is because the structure of a snapshot in HBase 0.94 is different from the snapshot structure in HBase 1.x.

The following steps show how to fix an HBase 0.94 snapshot so that it can be restored to an HBase on S3 table.

  1. Complete steps 1—3 in the previous example to create and export a snapshot.
  2. From your destination cluster, follow these steps to repair the snapshot:
    • Use s3-dist-cp to copy the snapshot data (archive) directory into a new directory. The archive directory contains your snapshot data. Depending on your table size, it might be large. Use s3-dist-cp to make this step faster:
      $ s3-dist-cp – src s3://<HBase_on_S3_root_dir>/.archive/<table_name> – dest s3://<HBase_on_S3_root_dir>/archive/data/default/<table_name>

    • Create and fix the snapshot descriptor file:
      $ hdfs dfs -mkdir s3://<HBase_on_S3_root_dir>/.hbase-snapshot/<snapshot_name>/.tabledesc
      $ hdfs dfs -mv s3://<HBase_on_S3_root_dir>/.hbase-snapshot/<snapshot_name>/.tableinfo.<*> s3://<HBase_on_S3_root_dir>/.hbase-snapshot/<snapshot_name>/.tabledesc

  3. Restore the snapshot:
    $ echo "restore_snapshot '<snapshot_name>'" | hbase shell

Option 2: Migrate to HBase on S3 using Export and Import

As I discussed in the earlier sections, HBase snapshots and ExportSnapshot are great options for migrating tables. But sometimes you want to migrate only part of a table, so you need a different tool. In this section, I describe how to use the HBase Export and Import utilities.

The steps to migrate a table to HBase on S3 using Export and Import is not much different from the steps provided in the HBase documentation. In those docs, you can also find detailed information, including how you can use them to migrate part of a table.

The following steps show how you can use Export and Import to migrate a table to HBase on S3.

  1. From your source cluster, export the HBase table:
    $ hbase org.apache.hadoop.hbase.mapreduce.Export <table_name> s3://<table_s3_backup>/<location>/

  2. In the destination cluster, create the target table into which to import data. Ensure that the column families in the target table are identical to the exported/source table’s column families.
  3. From the destination cluster, import the table using the Import utility:
    $ hbase org.apache.hadoop.hbase.mapreduce.Import '<table_name>' s3://<table_s3_backup>/<location>/

HBase snapshots are usually the recommended method to migrate HBase tables. However, the Export and Import utilities can be useful for test use cases in which you migrate only a small part of your table and test your application. It’s also handy if you’re migrating from an HBase cluster that does not have the HBase snapshots feature.

Option 3: Migrate to HBase on S3 using CopyTable

Similar to the Export and Import utilities, CopyTable is an HBase utility that you can use to copy part of HBase tables. However, keep in mind that CopyTable doesn’t work if you’re copying or migrating tables between HBase versions that are not wire compatible (for example, copying from HBase 0.94 to HBase 1.x).

For more information and examples, see CopyTable in the HBase documentation.


In this post, I demonstrated how you can use common HBase backup utilities to migrate your tables easily to HBase on S3. By using HBase snapshots, you can migrate entire tables to HBase on S3. To test HBase on S3 by migrating or copying only part of your tables, you can use the HBase Export, Import, or CopyTable utilities.

If you have questions or suggestions, please comment below.


About the Author

Bruno Faria is an EMR Solution Architect with AWS. He works with our customers to provide them architectural guidance for running complex applications on Amazon EMR. In his spare time, he enjoys spending time with his family and learning about new big data solutions.



Low-Latency Access on Trillions of Records: FINRA’s Architecture Using Apache HBase on Amazon EMR with Amazon S3







Month in Review: November 2016

Post Syndicated from Derek Young original https://aws.amazon.com/blogs/big-data/month-in-review-november-2016/

Another month of big data solutions on the Big Data Blog.

Take a look at our summaries below and learn, comment, and share. Thank you for reading!

Use Apache Flink on Amazon EMR
It is even easier to run Flink on AWS as it is now natively supported in Amazon EMR 5.1.0. EMR supports running Flink-on-YARN so you can create either a long-running cluster that accepts multiple jobs or a short-running Flink session in a transient cluster that helps reduce your costs by only charging you for the time that you use.

Scale Your Amazon Kinesis Stream Capacity with UpdateShardCount
With the new Amazon Kinesis Streams UpdateShardCount API operation, you can automatically scale your stream shard capacity by using Amazon CloudWatch alarms, Amazon SNS, and AWS Lambda. In this post, walk through an example of how you can automatically scale your shards using a few lines of code.

Build a Community of Analysts with Amazon QuickSight
In this post, learn how Amazon QuickSight can be used to share dashboards, analyses, and stories. Although fictitious, CoffeeCo, like many companies, benefits from distributing information to people who understand its context and can act on the insights that it contains. 

Dynamically Scale Applications on Amazon EMR with Auto Scaling
With new support for Auto Scaling in Amazon EMR releases 4.x and 5.x, customers can now add (scale out) and remove (scale in) nodes on a cluster more easily. Scaling actions are triggered automatically by Amazon CloudWatch metrics provided by EMR at 5 minute intervals, including several YARN metrics related to memory utilization, applications pending, and HDFS utilization.

Low-Latency Access on Trillions of Records: FINRA’s Architecture Using Apache HBase on Amazon EMR with Amazon S3
By migrating to HBase on EMR using S3 for storage, FINRA has lowered its costs by 60%, decreased operational complexity, increased durability and availability, and have created a more scalable architecture.

Introducing the Data Lake Solution on AWS
Learn why a data lake on AWS can increase the flexibility and agility of your analytics.

Analyzing Data in S3 using Amazon Athena
Learn how to use Athena on logs from Elastic Load Balancers, generated as text files in a pre-defined format. We show you how to create a table, partition the data in a format used by Athena, convert it to Parquet, and compare query performance.

Want to learn more about Big Data or Streaming Data? Check out our Big Data and Streaming data educational pages.

Leave a comment below to let us know what big data topics you’d like to see next on the AWS Big Data Blog.