Tag Archives: Amazon EMR

Enhancing data durability in Amazon EMR HBase on Amazon S3 with the Amazon EMR WAL feature

2025-06-02 Suthan Phillips

Post Syndicated from Suthan Phillips original https://aws.amazon.com/blogs/big-data/enhancing-data-durability-in-amazon-emr-hbase-on-amazon-s3-with-the-amazon-emr-wal-feature/

Apache HBase, an open source NoSQL database, enables quick access to massive datasets. Amazon EMR, from version 5.2.0, lets you use HBase on Amazon Simple Storage Service (Amazon S3). This combines HBase’s speed with the durability advantages of Amazon S3. Also, it helps achieve the data lake architecture benefits such as the ability to scale storage and compute requirements separately. We see our customers choosing Amazon S3 over Hadoop Distributed File Systems (HDFS) when they want to achieve greater durability, availability, and simplified storage management. Amazon EMR continually improves HBase on Amazon S3, focusing on performance, availability, and reliability.

Despite these durability benefits of HBase on Amazon S3 architecture, a critical concern remains regarding data recovery when the Write-Ahead Log (WAL) is lost. Within the EMR framework, HBase data attains durability when it’s flushed, or written, to Amazon S3. This flushing process is triggered by reaching specific size and time thresholds or through manual initiation. Until data is successfully flushed to S3, it persists within the WAL, which is stored in HDFS. In this post, we dive deep into the new Amazon EMR WAL feature to help you understand how it works, how it enhances durability, and why it’s needed. We explore several scenarios that are well-suited for this feature.

HBase WAL overview

Each RegionServer in HBase is responsible for managing data from multiple tables. These tables are horizontally partitioned into regions, where each region represents a contiguous range of row keys. A RegionServer can host multiple such regions, potentially from different tables. At the RegionServer level, there is a single, shared WAL that records all write operations across all regions and tables in a sequential, append-only manner. This shared WAL makes sure durability is maintained by persisting each mutation before applying it to in-memory structures, enabling recovery in case of unexpected failures. Within each region, the memory structure of the MemStore is further divided by column families, which are the fundamental units of physical storage and I/O in HBase. Each column family maintains:

Its own MemStore, which holds recently written data in memory for fast access and buffering before it flushes to disk.
A set of HFiles, which are immutable data files stored on HDFS (or Amazon S3 in HBase on S3 mode) that hold the persistent, flushed data.

Although all column families within a region are served by the same RegionServer process, they operate independently in terms of memory buffering, flushing, and compaction. However, they still share the same WAL and RegionServer-level resources, which introduces a degree of coordination, hence they operate semi-independently within the broader region context. This architecture is shown in the following diagram.

Understanding the HBase write process: WAL, MemStore, and HFiles

The HBase write path initiates when a client issues a write request, typically through an RPC call directed to the appropriate RegionServer that hosts the target region. Upon receiving the request, the RegionServer identifies the correct HBase region based on the row key and forwards the KeyValue pair accordingly. The write operation follows a two-step process. First, the data is appended to the WAL, which promotes durability by recording every change before it’s committed to memory. The WAL resides on HDFS by default and exists independently on each RegionServer. Its primary purpose is to provide a recovery mechanism in the event of a failure, particularly for edits that have not yet been flushed to disk. When the WAL append is successful, the data is written to the MemStore, an in-memory store for each column family within the region. The MemStore accumulates updates until it reaches a predefined size threshold, controlled by the hbase.hregion.memstore.flush.size parameter (default is 128 MB). When this threshold is exceeded, a flush is triggered.Flushing is handled asynchronously by a background thread in the RegionServer. The thread writes the contents of the MemStore to a new HFile, which is then persisted to long-term storage. In Amazon EMR, the location of this HFile depends on the deployment mode: for HBase on Amazon S3, HFiles are stored in Amazon S3, but for HBase on HDFS, they’re stored in HDFS.This workflow is shown in the following diagram.

A region server serves multiple regions, and they all share a common WAL. The WAL records all data changes, storing them in local HDFS. Puts and deletes are initially logged to the WAL by the region server before being recorded in the MemStore for the affected store. Scan and get operations in HBase don’t require the use of the WAL. In the event of a region server crash or unavailability before MemStore flushing, the WAL is crucial for replaying data changes, which promotes data integrity. Because this log by default resides on a replicated filesystem, it enables an alternate server to access and replay the log, requiring nothing from the physically failed server for a complete recovery. When a RegionServer fails abruptly, HBase initiates an automated recovery process orchestrated by the HMaster. First, the ZooKeeper session timeout detects the RegionServer failure, notifying the HMaster. The HMaster then identifies all regions previously hosted on the failed RegionServer and marks them as unassigned. The WAL files from the failed RegionServer are split by region, and these split WAL files are distributed to the new RegionServers that will host the reassigned regions. Each new RegionServer replays its assigned WAL segments to recover the MemStore state that existed before the failure, preventing data loss. When WAL replay is complete, the regions become operational on their new RegionServers, and the recovery process concludes.

HBase recovery workflow from RegionServer failure through WAL splitting to regions online

The effectiveness of the HDFS WAL model relies on the successful completion of the write request in the WAL and the subsequent data replication in HDFS. In cases where some nodes are terminated, HDFS can still recover from the WAL files, allowing HBase to autonomously heal by replaying data from the WALs and rebalancing the regions. However, if all CORE nodes are simultaneously terminated, achieving complete cluster recovery is a challenge because the data to replay from the WAL is lost. The issue arises when WALs are lost due to CORE node shutdown (for example, all three replicas of a file block). In this scenario, HBase enters a loop attempting to replay data from the WALs. Unfortunately, the absence of available blocks in this case causes the HBase server crash procedure to fail and retry indefinitely.

Amazon EMR WAL

To address the mentioned challenge of HDFS WAL and to provide data durability in HBase, Amazon EMR introduces a new EMR WAL feature starting from versions emr-7.0 and emr-6.15. This feature facilitates the recovery of data that hasn’t been flushed to Amazon S3 (HFile). Using this feature provides thorough backup for your HBase clusters. Behind the scenes, the RegionServer writes WAL data to EMR WAL, which is a service outside the EMR cluster. With this feature enabled, concerns about loss of WAL data in HDFS are alleviated. Also, in the event of cluster or Availability Zone failure issues, you can create a new cluster, directing it to the same Amazon S3 root directory and EMR WAL workspace. This enables the automatic recovery of data in the WAL in the order of minutes. Recovery of unflushed data is supported for a duration of 30 days, after which remaining unflushed data is deleted. This workflow is shown in the following diagram.

Key benefits

Upon enabling EMR WAL, the WALs are located external to the EMR cluster. The key benefits are:

High availability – You can remain confident about data integrity even in the face of Availability Zone failures. Their HFiles are stored in Amazon S3, and the WALs are externally stored in EMR WAL. This setup enables cluster recovery and WAL replay in the same or a different Availability Zone within the region. However, for true high availability with zero downtime, relying solely on EMR WAL is not sufficient because recovery still involves brief interruptions. To provide seamless failover and uninterrupted service, HBase replication across multiple Availability Zones is essential along with EMR WAL, providing robust zero-downtime high availability.
Data durability improvement – Customers no longer need to concern themselves with potential data loss in scenarios involving WAL data corruption in HDFS or the removal of all replicas in HDFS due to instance terminations.

The following flow diagram compares the sequence of events with and without EMR WAL enabled.

Key EMR WAL features

In this section, we explore the key enhancements introduced in the EMR WAL service across recent Amazon EMR versions. From grouping multiple HBase regions into a single EMR WAL to advanced configuration options, these new capabilities address specific usage scenarios.

Grouping multiple HBase regions into a single Amazon EMR WAL

In Amazon EMR versions up to 7.2, a separate EMR WAL is created for each region, which can become expensive due to the EMR-WAL-WALHours pricing model, especially when the HBase cluster contains many regions. To address this, starting from Amazon EMR 7.3, we introduced the EMR WAL grouping feature, which enables consolidating multiple HBase regions per EMR WAL, offering significant cost savings (over 99% cost savings in our sample evaluation) and improved operational efficiency. By default, each HBase RegionServer has two Amazon EMR WALs. If you have many regions per RegionServer and want to increase throughput, you can customize the number of WALs per RegionServer by configuring the hbase.wal.regiongrouping.numgroups property. For instance, to set 10 EMR WALs per HBase RegionServer, you can use the following configuration:

[
  {
    "Classification": "hbase-site",
    "Properties": {
      "hbase.wal.regiongrouping.numgroups": "10"
    }
  }
]

The two HBase system tables hbase:meta and hbase:master (masterstore) don’t participate in the WAL grouping mechanisms.

In a performance test using m5.8xlarge instances with 1,000 regions per RegionServer, we observed a significant increase in throughput as the number of WALs grew from 1 to 20 per RegionServer (from 1,570 to 3,384 operations per sec). This led to a 54% improvement in average latency (from 40.5 ms to 18.8 ms) and a 72% reduction in 95th percentile latency (from 231 ms to 64 ms). However, beyond 20 WALs, we noted diminishing returns, with only slight performance improvements between 20 and 50 WALs, and average latency stabilized around 18.7ms. Based on these results, we recommend maintaining a lower region density (around 10 regions per WAL) for optimal performance. Nonetheless, it’s crucial to fine-tune this configuration according to your specific workload characteristics and performance requirements and conduct tests in your lower environment to validate the best setup.

Configurable maximum record size in EMR WAL

Until Amazon EMR version 7.4, the EMR WAL had a record size limit of 4 MB, which was insufficient for some customers. Starting from EMR 7.5, the maximum record size in EMR WAL is configurable through the emr.wal.max.payload.size property. The default value is set to 1 GB. The following is an example of how to set the maximum record size to 2 GB:

[
  {
    "Classification": "hbase-site",
    "Properties": {
      "emr.wal.max.payload.size": "2147483648"
    }
  }
]

AWS PrivateLink support

EMR WAL supports AWS PrivateLink, if you want to keep your connection within the AWS network. To set it up, create a virtual private cloud (VPC) endpoint using the AWS Management Console or AWS Command Line Interface (AWS CLI) and select the service labeled com.amazonaws.region.emrwal.prod. Make sure your VPC endpoint uses the same security groups as the EMR cluster. You have two DNS configuration options: enabling private DNS, which uses the standard endpoint format and automatically routes traffic privately, or using the provided VPC endpoint-specific DNS name for more explicit control. Regardless of the DNS option chosen, both methods mean that traffic remains within the AWS network, enhancing security. To implement this in the EMR cluster, update your cluster configuration to use the PrivateLink endpoint, as shown in the following code sample (for private DNS):

[
    {
        "Classification": "hbase-site",
        "Properties": {
            "emr.wal.client.endpoint": "https://prod.emrwal.region.amazonaws.com"
        }
    }
]

For more details, refer to Access Amazon EMR WAL through AWS PrivateLink in the Amazon EMR documentation.

Encryption options for WAL in Amazon EMR

Amazon EMR automatically encrypts data in transit in the EMR WAL service. You can enable server-side encryption (SSE) for WAL (data at rest) with two key management options:

SSE-EMR-WAL: Amazon EMR manages the encryption keys
SSE-KMS-WAL: You use an AWS Key Management Service (AWS KMS) key for encryption policies

EMR WAL cross-cluster replication

From EMR 7.5, EMR WAL supports cross-cluster replay, allowing clusters in an active-passive HBase replication setup to use EMR WAL.

For more details on the setup, refer to EMR WAL cross-cluster replication in the Amazon EMR documentation.

EMR WAL enhancement: Minimizing CPU load from HBase sync threads

Starting from EMR 7.9, we’ve implemented code optimizations in EMR WAL to address the high CPU utilization caused by sync threads used by HBase processes to write WAL edits, leading to improved CPU efficiency.

Sample use cases benefitting from this feature

Based on our customer interactions and feedback, this feature can help in the following scenarios.

Continuity during service disruptions

If your business demands disaster recovery with no data loss for an HBase on an S3 cluster due to unexpected service disruptions, such as an Availability Zone failure, the newly introduced feature means you don’t have to rely on a persistent event store solution using Amazon Managed Streaming for Apache Kafka (Amazon MSK) or Amazon Kinesis. Without EMR WAL, you had to set up a complex event-streaming pipeline to retain the most recently ingested data and enable replay from the point of failure. This new feature eliminates that dependency by storing Hbase WALs in the EMR WAL service.

Note: During an Availability Zone (AZ) failure or service-level issue, make sure to fully terminate the original Hbase cluster before launching a new one that points to the same S3 root directory. Running two active Hbase clusters that access the same S3 root can lead to data corruption.

Upgrading to the latest EMR releases or cluster rotations

Without EMR WAL, moving to the latest EMR version or managing cluster rotations with HBase on Amazon S3 necessitated manual interruptions for data flushing to S3. With the new feature, the requirement for data flushing is eliminated. However, during cluster termination and the subsequent launch of a new HBase cluster, there is an inevitable service downtime, during which data producers or ingestion pipelines must handle write disruptions or buffer incoming data until the system is fully restored. Also, the downstream services should account for temporary unavailability, which can be mitigated using a read replica cluster.

Overcoming HDFS challenges during HBase auto scaling

Without EMR WAL feature, having HDFS for your WAL files was a requirement. When implementing custom auto scaling for your HBase clusters, it sometimes resulted in WAL data corruption due to issues linked to HDFS. This is because, to prevent data loss, data blocks had to be moved to different HDFS nodes when one HDFS node was being decommissioned. When nodes continued to be terminated swiftly during scale-down process without allowing sufficient time for graceful decommissioning, it could result in WAL data corruption issues, primarily attributed to missing blocks.

Addressing HDFS disk space issues due to old WALs

When a WAL file is no longer required for recovery, indicating that HBase has made sure all data within the WAL file has been flushed, it’s transferred to the oldWALs folder for archival purposes. The log remains in this location until all other references to the WAL file are completed. In HBase use cases with high write activity, some customers have expressed concerns about the oldWALs directory (/usr/hbase/oldWALs) expanding and occupying excessive disk space and eventually causing disk space issues. With the complete relocation of these WALs to an external EMR WAL service, you will no longer encounter this issue.

Assessing HBase in Amazon EMR clusters with and without EMR WAL for fault tolerance

We conducted a data durability test employing two scripts. The first was for installing YCSB, creating a pre-split table, and loading 8 million records on the master node. The second was for terminating a core node every 90 seconds after a 3-minute wait, totaling five terminations. Two EMR clusters with eight core nodes each were created, one configured with EMR WAL enabled and the other as a standard EMR HBase cluster with the WAL stored in HDFS. After completion of EMR steps, a count was run on the HBase table. In the EMR cluster with EMR WAL enabled, all records were successfully inserted without corruption. In the cluster not using EMR WALs, regions in HBase remained “OPENING” if the node hosting the meta was terminated. For other core node terminations, inserts failed, resulting in a lower record count during validation.

Understanding when EMR WAL read charges apply in HBase

In HBase, standard table read operations such as Get and Scan don’t access WALs. Therefore, EMR WAL read (GiB) charges are only incurred during operations that involve reading from WALs, such as:

Restoring data from EMR WALs in a newly launched cluster
Replaying WALs to recover data on a crashed RegionServer
Performing HBase replication, which involves reading WALs to replicate data across clusters

In a normal scenario, you’re billed only for the following two components related to EMR WAL usage:

EMR-WAL-WALHours – Represents the hourly cost of WAL storage, calculated based on the number of WALs maintained. You can use the EMRWALCount metric in Amazon CloudWatch to monitor the number of WALs and track associated usage over time.
EMR-WAL-WriteRequestGiB – This reflects the volume of data written to the WAL service, charged by the amount of data written in GiB.

For further details on pricing, refer to Amazon EMR pricing and Amazon EMR Release Guide.

To monitor and analyze EMR WAL related costs in the AWS Cost and Usage Reports (CUR), look under product_servicecode = ‘ElasticMapReduce’, where you’ll find the following product_usagetype entries associated with WAL usage:

USE1-EMR-WAL-ReadRequestGiB
USE1-EMR-WAL-WALHours
USE1-EMR-WAL-WriteRequestGiB

The prefix USE1 indicates the Region (in this case, us-east-1) and will vary depending on where your EMR cluster is deployed.

Summary

This new EMR WAL feature allows you to improve durability of your Amazon EMR HBase on S3 clusters, addressing critical workload scenarios by eliminating the need for streaming solutions for Availability Zone level service disruptions, streamlining processes for upgrading or rotating clusters, preventing data corruption during HBase auto scaling or node termination events, and resolving disk space issues associated with old WALs. Because many of the EMR WAL features are added on the latest releases of Amazon EMR, we recommend that customers use Amazon EMR version 7.9 or later to fully benefit from these improvements.

About the authors

Suthan Phillips is a Senior Analytics Architect at AWS, where he helps customers design and optimize scalable, high-performance data solutions that drive business insights. He combines architectural guidance on system design and scalability with best practices to ensure efficient, secure implementation across data processing and experience layers. Outside of work, Suthan enjoys swimming, hiking and exploring the Pacific Northwest.

Enhance AI-assisted development with Amazon ECS, Amazon EKS and AWS Serverless MCP server

2025-05-29 Elizabeth Fuentes

Post Syndicated from Elizabeth Fuentes original https://aws.amazon.com/blogs/aws/enhance-ai-assisted-development-with-amazon-ecs-amazon-eks-and-aws-serverless-mcp-server/

Today, we’re introducing specialized Model Context Protocol (MCP) servers for Amazon Elastic Container Service (Amazon ECS), Amazon Elastic Kubernetes Service (Amazon EKS), and AWS Serverless, now available in the AWS Labs GitHub repository. These open source solutions extend AI development assistants capabilities with real-time, contextual responses that go beyond their pre-trained knowledge. While Large Language Models (LLM) within AI assistants rely on public documentation, MCP servers deliver current context and service-specific guidance to help you prevent common deployment errors and provide more accurate service interactions.

You can use these open source solutions to develop applications faster, using up-to-date knowledge of Amazon Web Services (AWS) capabilities and configurations during the build and deployment process. Whether you’re writing code in your integrated development environment (IDE), or debugging production issues, these MCP servers support AI code assistants with deep understanding of Amazon ECS, Amazon EKS, and AWS Serverless capabilities, accelerating the journey from code to production. They work with popular AI-enabled IDEs, including Amazon Q Developer on the command line (CLI), to help you build and deploy applications using natural language commands.

The Amazon ECS MCP Server containerizes and deploys applications to Amazon ECS within minutes by configuring all relevant AWS resources, including load balancers, networking, auto-scaling, monitoring, Amazon ECS task definitions, and services. Using natural language instructions, you can manage cluster operations, implement auto-scaling strategies, and use real-time troubleshooting capabilities to identify and resolve deployment issues quickly.
For Kubernetes environments, the Amazon EKS MCP Server provides AI assistants with up-to-date, contextual information about your specific EKS environment. It offers access to the latest EKS features, knowledge base, and cluster state information. This gives AI code assistants more accurate, tailored guidance throughout the application lifecycle, from initial setup to production deployment.
The AWS Serverless MCP Server enhances the serverless development experience by providing AI coding assistants with comprehensive knowledge of serverless patterns, best practices, and AWS services. Using AWS Serverless Application Model Command Line Interface (AWS SAM CLI) integration, you can handle events and deploy infrastructure while implementing proven architectural patterns. This integration streamlines function lifecycles, service integrations, and operational requirements throughout your application development process. The server also provides contextual guidance for infrastructure as code decisions, AWS Lambda specific best practices, and event schemas for AWS Lambda event source mappings.

Let’s see it in action
If this is your first time using AWS MCP servers, visit the Installation and Setup guide in the AWS Labs GitHub repository to installation instructions. Once installed, add the following MCP server configuration to your local setup:

Install Amazon Q for command line and add the conﬁguration to ~/.aws/amazonq/mcp.json. If you’re already an Amazon Q CLI user, add only the configuration.

{
  "mcpServers": {
    "awslabs.aws-serverless-mcp":  {
      "command": "uvx",
      "timeout": 60,
      "args": ["awslabs.aws_serverless_mcp_server@latest"],
    },
    "awslabs.ecs-mcp-server": {
      "disabled": false,
      "command": "uv",
      "timeout": 60,
      "args": ["awslabs.ecs-mcp-server@latest"],
    },
    "awslabs.eks-mcp-server": {
      "disabled": false,
      "timeout": 60,
      "command": "uv",
      "args": ["awslabs.eks-mcp-server@latest"],
    }
  }
}

For this demo I’m going to use the Amazon Q CLI to create an application that understands video using 02_using_converse_api.ipynb from Amazon Nova model cookbook repository as sample code. To do this, I send the following prompt:

I want to create a backend application that automatically extracts metadata and understands the content of images and videos uploaded to an S3 bucket and stores that information in a database. I'd like to use a serverless system for processing. Could you generate everything I need, including the code and commands or steps to set up the necessary infrastructure, for it to work from start to finish? - Use 02_using_converse_api.ipynb as example code for the image and video understanding.

Amazon Q CLI identifies the necessary tools, including the MCP serverawslabs.aws-serverless-mcp-server. Through a single interaction, the AWS Serverless MCP server determines all requirements and best practices for building a robust architecture.

I ask to Amazon Q CLI that build and test the application, but encountered an error. Amazon Q CLI quickly resolved the issue using available tools. I verified success by checking the record created in the Amazon DynamoDB table and testing the application with the dog2.jpeg file.

To enhance video processing capabilities, I decided to migrate my media analysis application to a containerized architecture. I used this prompt:

I'd like you to create a simple application like the media analysis one, but instead of being serverless, it should be containerized. Please help me build it in a new CDK stack.

Amazon Q Developer begins building the application. I took advantage of this time to grab a coffee. When I returned to my desk, coffee in hand, I was pleasantly surprised to find the application ready. To ensure everything was up to current standards, I simply asked:

please review the code and all app using the awslabsecs_mcp_server tools

Amazon Q Developer CLI gives me a summary with all the improvements and a conclusion.

I ask it to make all the necessary changes, once ready I ask Amazon Q developer CLI to deploy it in my account, all using natural language.

After a few minutes, I review that I have a complete containerized application from the S3 bucket to all the necessary networking.

I ask Amazon Q developer CLI to test the app send it the-sea.mp4 video file and received a timed out error, so Amazon Q CLI decides to use the fetch_task_logs from awslabsecs_mcp_server tool to review the logs, identify the error and then fix it.

After a new deployment, I try it again, and the application successfully processed the video file

I can see the records in my Amazon DynamoDB table.

To test the Amazon EKS MCP server, I have code for a web app in the auction-website-main folder and I want to build a web robust app, for that I asked Amazon Q CLI to help me with this prompt:

Create a web application using the existing code in the auction-website-main folder. This application will grow, so I would like to create it in a new EKS cluster

Once the Docker file is created, Amazon Q CLI identifies generate_app_manifests from awslabseks_mcp_server as a reliable tool to create a Kubernetes manifests for the application.

Then create a new EKS cluster using the manage_eks_staks tool.

Once the app is ready, the Amazon Q CLI deploys it and gives me a summary of what it created.

I can see the cluster status in the console.

After a few minutes and resolving a couple of issues using the search_eks_troubleshoot_guide tool the application is ready to use.

Now I have a Kitties marketplace web app, deployed on Amazon EKS using only natural language commands through Amazon Q CLI.

Get started today
Visit the AWS Labs GitHub repository to start using these AWS MCP servers and enhance your AI-powered developmen there. The repository includes implementation guides, example configurations, and additional specialized servers to run AWS Lambda function, which transforms your existing AWS Lambda functions into AI-accessible tools without code modifications, and Amazon Bedrock Knowledge Bases Retrieval MCP server, which provides seamless access to your Amazon Bedrock knowledge bases. Other AWS specialized servers in the repository include documentation, example configurations, and implementation guides to begin building applications with greater speed and reliability.

To learn more about MCP Servers for AWS Serverless and Containers and how they can transform your AI-assisted application development, visit the Introducing AWS Serverless MCP Server: AI-powered development for modern applications, Automating AI-assisted container deployments with the Amazon ECS MCP Server, and Accelerating application development with the Amazon EKS MCP server deep-dive blogs.

— Eli

Scalable analytics and centralized governance for Apache Iceberg tables using Amazon S3 Tables and Amazon Redshift

2025-05-22 Satesh Sonti

Post Syndicated from Satesh Sonti original https://aws.amazon.com/blogs/big-data/scalable-analytics-and-centralized-governance-for-apache-iceberg-tables-using-amazon-s3-tables-and-amazon-redshift/

Amazon Redshift supports querying data stored in Apache Iceberg tables managed by Amazon S3 Tables, which we previously covered as part of getting started blog post. While this blog post helps you to get started using Amazon Redshift with Amazon S3 Tables, there are additional steps you need to consider when working with your data in production environments, including who has access to your data and with what level of permissions.

In this post, we’ll build on the first post in this series to show you how to set up an Apache Iceberg data lake catalog using Amazon S3 Tables and provide different levels of access control to your data. Through this example, you’ll set up fine-grained access controls for multiple users and see how this works using Amazon Redshift. We’ll also review an example with simultaneously using data that resides both in Amazon Redshift and Amazon S3 Tables, enabling a unified analytics experience.

Solution overview

In this solution, we show how to query a dataset stored in Amazon S3 Tables for further analysis using data managed in Amazon Redshift. Specifically, we go through the steps shown in the following figure to load a dataset into Amazon S3 Tables, grant appropriate permissions, and finally execute queries to analyze our dataset for trends and insights.

Solution Architecture

In this post, you walk through the following steps:

Creating an Amazon S3 Table bucket: In AWS Management Console for Amazon S3, create an Amazon S3 Table bucket and integrate with other AWS analytics services
Creating an S3 Table and loading data: Run spark SQL in Amazon EMR to create a namespace and an S3 Table and load diabetic patients’ visit data
Granting permissions: Granting fine-grained access controls in AWS Lake Formation
Running SQL analytics: Querying S3 Tables using the auto mounted S3 Table catalog.

This post uses data from a healthcare use case to analyze information about diabetic patients and identify the frequency of age groups admitted to the hospital. You’ll use the preceding steps to perform this analysis.

Prerequisites

To begin, you need to add an Amazon Redshift service-linked role—AWSServiceRoleForRedshift—as a read-only administrator in Lake Formation. You can run following AWS Command Line Interface (AWS CLI) command to add the role.

Replace <account_number> with your account number and replace <region> with the AWS Region that you are using. You can run this command from AWS CloudShell or through AWS CLI configured in your environment.

aws lakeformation put-data-lake-settings \
        --region <region> \
        --data-lake-settings \
 '{
   "DataLakeAdmins": [{"DataLakePrincipalIdentifier":"arn:aws:iam::<account_number>:role/Admin"}],
   "ReadOnlyAdmins":[{"DataLakePrincipalIdentifier":"arn:aws:iam:: <account_number>:role/aws-service-role/redshift.amazonaws.com/AWSServiceRoleForRedshift"}],
   "CreateDatabaseDefaultPermissions":[],
   "CreateTableDefaultPermissions":[],
   "Parameters":{"CROSS_ACCOUNT_VERSION":"4","SET_CONTEXT":"TRUE"}
  }'

You also need to create or use an existing Amazon Elastic Compute Cloud (Amazon EC2) key pair that will be used for SSH connections to cluster instances. For more information, see Amazon EC2 key pairs.

The examples in this post require the following AWS services and features:

The CloudFormation template that follows creates the following resources:

An Amazon EMR 7.6.0 cluster with Apache Iceberg packages
An Amazon Redshift Serverless instance
An AWS Identity and Access Management (IAM) instance profile, service role, and security groups
IAM roles with required policies
Two IAM users: nurse and analyst

Download the CloudFormation template, or you can use the Launch Stack button to automatically download it to your AWS environment. Note that network routes are directed to 255.255.255.255/32 for security reasons. Replace the routes with your organization’s IP addresses. Also enter your IP or VPN range for Jupyter Notebook access in the SourceCidrForNotebook parameter in CloudFormation.

Download the diabetic encounters and patient datasets and upload it into your S3 bucket. These files are from a publicly available open dataset.

This sample dataset is used to highlight this use case, the techniques covered can be adapted to your workflows. The following are more details about this dataset:

diabetic_encounters_s3.csv: Contains information about patient visits for diabetic treatment.

encounter_id: Unique number to refer to an encounter with a patient who has diabetes.
patient_nbr: Unique number to identify a patient.
num_procedures: Number of medical procedures administered.
num_medications: Number of medications provided during the visit
insulin: Insulin level observed. Valid values are steady, up, and no.
time_in_hospital: Duration of time in hospital in days.
readmitted: Readmitted to hospital within 30 days or after 30 days.

diabetic_patients_rs.csv: Contains patient information such as age group, gender, race, and number of visits.

patient_nbr: Unique number to identify a patient
race: Patient’s race
gender: Patient’s gender
age_grp: Patient’s age group. Valid values are 0-10, 10-20, 20-30, and so on
number_outpatient: Number of outpatient visits
number_emergency: Number of emergency room visits
number_inpatient: Number of inpatient visits

Now that you’ve set up the prerequisites, you’re ready to connect Amazon Redshift to query Apache Iceberg data stored in Amazon S3 Tables.

Create an S3 Table bucket

Before you can use Amazon Redshift to query the data in an Amazon S3 Table, you must create an Amazon S3 Table.

Sign in to the AWS Management Console and go to Amazon S3.
Go to Amazon S3 Table buckets. This is an option in the Amazon S3 console.
In the Table buckets view, there’s a section that describes Integration with AWS analytics services. Choose Enable Integration if you haven’t previously set this up. This sets up the integration with AWS analytics services, including Amazon Redshift, Amazon EMR, and Amazon Athena.
Wait a few seconds for the status to change to Enabled.
Choose Create table bucket and enter a bucket name. You can use any name that follows the naming conventions. In this example, we used the bucket name patient-encounter. When you’re finished, choose Create table bucket.
After the S3 Table bucket is created, you’ll be redirected to the Table buckets list. Copy the Amazon Resource Name (ARN) of the table bucket you just created to use in the next section.

Now that your S3 Table bucket is set up, you can load data.

Create S3 Table and load data

The CloudFormation template in the prerequisites created an Apache Spark cluster using Amazon EMR. You’ll use the Amazon EMR cluster to load data into Amazon S3 Tables.

Connect to the Apache Spark primary node using SSH or through Jupyter Notebooks. Note that an Amazon EMR cluster was launched when you deployed the CloudFormation template.

Enter the following command to launch the Spark shell and initialize a Spark session for Iceberg that connects to your S3 Table bucket. Replace <Region>, <accountID> and <bucketname><bucket arn> with the information your region, account and bucket name.

spark-shell \
  --packages "org.apache.iceberg:iceberg-spark-runtime-3.5_2.12:1.4.1,software.amazon.awssdk:bundle:2.20.160,software.amazon.awssdk:url-connection-client:2.20.160" \
  --master "local[*]" \
  --conf "spark.sql.extensions=org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions" \
  --conf "spark.sql.defaultCatalog=spark_catalog" \
   --conf "spark.sql.catalog.spark_catalog=org.apache.iceberg.spark.SparkCatalog" \
  --conf "spark.sql.catalog.spark_catalog.type=rest" \
  --conf "spark.sql.catalog.spark_catalog.uri=https://s3tables.<Region>.amazonaws.com/iceberg" \
  --conf "spark.sql.catalog.spark_catalog.warehouse=arn:aws:s3tables:<Region>:<accountID>:bucket/<bucketname>" \
  --conf "spark.sql.catalog.spark_catalog.rest.sigv4-enabled=true" \
  --conf "spark.sql.catalog.spark_catalog.rest.signing-name=s3tables" \
  --conf "spark.sql.catalog.spark_catalog.rest.signing-region=<Region>" \
  --conf "spark.sql.catalog.spark_catalog.io-impl=org.apache.iceberg.aws.s3.S3FileIO" \
  --conf "spark.hadoop.fs.s3a.aws.credentials.provider=org.apache.hadoop.fs.s3a.SimpleAWSCredentialProvider" \
  --conf "spark.sql.catalog.spark_catalog.rest-metrics-reporting-enabled=false"

See Accessing Amazon S3 Tables with Amazon EMR for upgrades to software.amazon.s3tables package versions.

Next, create a namespace that will link your S3 Table bucket with your Amazon Redshift Serverless workgroup. We chose encounters as the namespace for this example, but you can use a different name. Use the following SparkSQL command:
```
spark.sql("CREATE NAMESPACE IF NOT EXISTS s3tablesbucket.encounters")
```

Create an Apache Iceberg table with name diabetic_encounters.

spark.sql( 
""" CREATE TABLE IF NOT EXISTS s3tablesbucket.encounters.`diabetic_encounters` ( 
encounter_id INT, 
patient_nbr INT,
num_procedures INT,
num_medications INT,
insulin STRING,
time_in_hospital INT,
readmitted STRING 
) 
USING iceberg """
)

Load csv into the S3 Table encounters.diabetic_encounters. Replace <diabetic_encounters_s3.csv file location> with the Amazon S3 file path of the diabetic_encounters_s3.csv file you uploaded earlier.

val df = spark.read.format("csv").option("header", "true").option("inferSchema", "true").load("<diabetic_encounters_s3.csv file location> ")

df.writeTo("s3tablesbucket.encounters.diabetic_encounters").using("Iceberg").tableProperty ("format-version", "2").createOrReplace()

Query the data to validate it using Spark shell.

spark.sql(""" SELECT * FROM s3tablesbucket.encounters.diabetic_encounters """).show()

Grant permissions

In this section, you grant fine-grained access control to the two IAM users created as part of the prerequisites.

nurse: Grant access to all columns in the diabetic_encounters table
analyst: Grant access to only {encounter_id, patient_nbr, readmitted} columns

First, grant access to the diabetic_encounters table for nurse user.

In AWS Lake Formation, Choose Data Permissions.
On the Grant Permissions page, under Principals, select IAM users and roles.
Select the IAM user nurse.
For Catalogs, select <accoundID>:s3tablescatalog/patient-encounter.
For Databases, select encounter
Scroll down. For Tables, select diabetic_encounters.
For Table permissions, select Select.
For Data permissions, select All data access.
Choose Grant. This will grant select access on all the columns in diabetic_encounters to the nurse

Now grant access to the diabetic_encounters table for the analyst user.

Repeat the same steps that you followed for nurse user up to step 7 in the previous section.
For Data permissions, select Column-based access. Select Include columns and select the encounter_id, patient_nbr, and readmitted columns
Choose Grant. This will grant select access on the encounter_id, patient_nbr, and readmitted columns in diabetic_encounters to the analyst

Run SQL analytics

In this section, you will access the data in the diabetic_encounters S3 Table using nurse and analyst to learn how fine-grain access control works. You will also combine data from the S3 Table data with a local table in Amazon Redshift using a single query.

In the Amazon Redshift Query Editor V2, connect to serverless:rs-demo-wg, an Amazon Redshift Serverless instance created by the CloudFormation template.
Select Database user name and password as the connection method and connect using super user awsuser. Provide the password you gave as an input parameter to the CloudFormation stack.
Run the following commands to create the IAM users nurse and analyst in Amazon Redshift.
```
CREATE USER IAM:nurse password disable;
CREATE USER IAM:analyst password disable;
```
Amazon Redshift automatically mounts the Data Catalog as an external database named awsdatacatalog to simplify accessing your tables in Data Catalog. You can grant usage access to this database for the IAM users:
```
GRANT USAGE ON DATABASE awsdatacatalog to "IAM:nurse";
GRANT USAGE ON DATABASE awsdatacatalog to "IAM:analyst";
```

For the next steps, you must first sign in to the AWS Console as the nurse IAM user. You can find the IAM user’s password in the AWS Secrets Manager console and retrieving the value from the secret ending with iam-users-credentials. See Get a secret value using the AWS console for more information.

After you’ve signed in to the console, navigate to the Amazon Redshift Query Editor V2.
Sign in to your Amazon Redshift cluster using the IAM:nurse. You can do this by connecting to serverless:rs-demo-wg as Federated user. This applies the permission provided in Lake Formation for accessing your data in Amazon S3 Tables:

Run following SQL to query S3 Table diabetic_encounters.

SELECT * FROM patient-encounter@s3tablescatalog"."encounters"."diabetic_encounters";

This returns all the data in the S3 Table for diabetic_encounters across every column in the table, as shown in the following figure:

Diabetic Encounters Output

Recall that you also created an IAM user called analyst that only has access to the encounter_id, patient_nbr, and readmitted columns. Let’s verify that analyst user can only access those columns.

Sign in to the AWS console as the analyst IAM user and open the Amazon Redshift Query Editor v2 using the same steps as above. Run the same query as before:
```
SELECT * FROM patient-encounter@s3tablescatalog"."encounters"."diabetic_encounters";
```

This time, you should only the encounter_id, patient_nbr, and readmitted columns:

Diabetic Encounters Output restricted

Now that you’ve seen how you can access data in Amazon S3 Tables from Amazon Redshift while setting the levels of access required for your users, let’s see how we can join data in S3 Tables to tables that already exist in Amazon Redshift.

Combine data from an S3 Table and a local table in Amazon Redshift

For this section, you’ll load data into your local Amazon Redshift cluster. After this is complete, you can analyze the datasets in both Amazon Redshift and S3 Tables.

First, as the analytics federated user, sign in to your Amazon Redshift cluster using Amazon Redshift Query Editor v2.

Use the following SQL command to create a table that contains patient information.:

CREATE TABLE public.patient_info (
    patient_nbr integer ENCODE az64,
    race character varying(256) ENCODE lzo,
    gender character varying(256) ENCODE lzo,
    age_grp character varying(256) ENCODE lzo,
    number_outpatient integer ENCODE az64,
    number_emergency integer ENCODE az64,
    number_inpatient integer ENCODE az64);

Copy patient information from the file csv that’s stored in your Amazon S3 object bucket. Replace <diabetic_patients_rs.csv file S3 location> with the location of the file in your S3 bucket.
```
COPY dev.public.patient_info FROM 's3://<diabetic_patients_rs.csv file S3 location>' 
IAM_ROLE default 
FORMAT AS CSV DELIMITER ',' 
IGNOREHEADER 1;
```
Use the following query to review the sample data to verify that the command was successful. This will show information from 10 patients, as shown in the following figure.
```
SELECT * FROM public.patient_info limit 10;
```

Now combine data from the Amazon S3 Table diabetic_encounters and the Amazon Redshift patient_info. In this example, the query fetches information about what age group was most frequently readmitted to the hospital within 30 days of an initial hospital visit:

SELECT
    age_grp,
    count(*) readmission_count
FROM
    "patient-encounter@s3tablescatalog"."encounters"."diabetic_encounters" a
JOIN public.patient_info b ON b.patient_nbr = a.patient_nbr
WHERE
    a.readmitted='<30'
GROUP BY age_grp
ORDER BY readmission_count DESC
LIMIT 1;

This query returns results showing an age group and the number of re-admissions, as shown in the following figure.

Redamissions Output

Cleanup

To clean up your resources, delete the stack you deployed using AWS CloudFormation. For instructions, see Deleting a stack on the AWS CloudFormation console.

Conclusion

In this post, you walked through an end-to-end process for setting up security and governance controls for Apache Iceberg data stored in Amazon S3 Tables and accessing it from Amazon Redshift. This includes creating S3 Tables, loading data into them, registering the tables in a data lake catalog, setting up access controls, and querying the data using Amazon Redshift. You also learned how to combine data from Amazon S3 Tables and local Amazon Redshift tables stored in Redshift Managed Storage in a single query, enabling a seamless, unified analytics experience. Try out these features and see Working with Amazon S3 Tables and table buckets for more details. We welcome your feedback in the comments section.

About the Authors

Satesh Sonti is a Sr. Analytics Specialist Solutions Architect based out of Atlanta, specializing in building enterprise data platforms, data warehousing, and analytics solutions. He has over 19 years of experience in building data assets and leading complex data platform programs for banking and insurance clients across the globe.

Jonathan Katz is a Principal Product Manager – Technical on the Amazon Redshift team and is based in New York. He is a Core Team member of the open source PostgreSQL project and an active open source contributor, including PostgreSQL and the pgvector project.

Access Amazon Redshift Managed Storage tables through Apache Spark on AWS Glue and Amazon EMR using Amazon SageMaker Lakehouse

2025-05-15 Noritaka Sekiyama

Post Syndicated from Noritaka Sekiyama original https://aws.amazon.com/blogs/big-data/access-amazon-redshift-managed-storage-tables-through-apache-spark-on-aws-glue-and-amazon-emr-using-amazon-sagemaker-lakehouse/

Data environments in data-driven organizations are changing to meet the growing demands for analytics, including business intelligence (BI) dashboarding, one-time querying, data science, machine learning (ML), and generative AI. These organizations have a huge demand for lakehouse solutions that combine the best of data warehouses and data lakes to simplify data management with easy access to all data from their preferred engines.

Amazon SageMaker Lakehouse unifies all your data across Amazon Simple Storage Service (Amazon S3) data lakes and Amazon Redshift data warehouses, helping you build powerful analytics and artificial intelligence and machine learning (AI/ML) applications on a single copy of data. SageMaker Lakehouse gives you the flexibility to access and query your data in place with all Apache Iceberg compatible tools and engines. It secures your data in the lakehouse by defining fine-grained permissions, which are consistently applied across all analytics and ML tools and engines. You can bring data from operational databases and applications into your lakehouse in near real time through zero-ETL integrations. It accesses and queries data in-place with federated query capabilities across third-party data sources through Amazon Athena.

With SageMaker Lakehouse, you can access tables stored in Amazon Redshift managed storage (RMS) through Iceberg APIs, using the Iceberg REST catalog backed by AWS Glue Data Catalog. This expands your data integration workload across data lakes and data warehouses, enabling seamless access to diverse data sources.

Amazon SageMaker Unified Studio, Amazon EMR 7.5.0 and higher, and AWS Glue 5.0 natively support SageMaker Lakehouse. This post describes how to integrate data on RMS tables through Apache Spark using SageMaker Unified Studio, Amazon EMR 7.5.0 and higher, and AWS Glue 5.0.

How to access RMS tables through Apache Spark on AWS Glue and Amazon EMR

With SageMaker Lakehouse, RMS tables are accessible through the Apache Iceberg REST catalog. Open source engines such as Apache Spark are compatible with Apache Iceberg, and they can interact with RMS tables by conﬁguring this Iceberg REST catalog. You can learn more in Connecting to the Data Catalog using AWS Glue Iceberg REST extension endpoint.

Note that the Iceberg REST extensions endpoint is used when you access RMS tables. This endpoint is accessible through the Apache Iceberg AWS Glue Data Catalog extensions, which comes preinstalled on AWS Glue 5.0 and Amazon EMR 7.5.0 or higher. The extension library enables access to RMS tables using the Amazon Redshift connector for Apache Spark.

To access RMS backed catalog databases from Spark, each RMS database requires its own Spark session catalog configuration. Here are the required Spark configurations:

Spark config key	Value
`spark.sql.catalog.{catalog_name}`	`org.apache.iceberg.spark.SparkCatalog`
`spark.sql.catalog.{catalog_name}.type`	`glue`
`spark.sql.catalog.{catalog_name}.glue.id`	`{account_id}:{rms_catalog_name}/{database_name}`
`spark.sql.catalog.{catalog_name}.client.region`	`{aws_region}`
`spark.sql.extensions`	`org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions`

Configuration parameters:

{catalog_name}: Your chosen name for referencing the RMS catalog database in your application code
{rms_catalog_name}: The RMS catalog name as shown in the AWS Lake Formation catalogs section
{database_name}: The RMS database name
{aws_region}: The AWS Region where the RMS catalog is located

For a deeper understanding of how the Amazon Redshift hierarchy (databases, schemas, and tables) is mapped to the AWS Glue multilevel catalogs, you can refer to the Bringing Amazon Redshift data into the AWS Glue Data Catalog documentation.

In the following section, we demonstrate how to access RMS tables through Apache Spark using SageMaker Unified Studio JupyterLab notebooks with the AWS Glue 5.0 runtime and Amazon EMR Serverless.

Although we can bring existing Amazon Redshift tables into the AWS Glue Data catalog by creating a Lakehouse Redshift catalog from an existing Redshift namespace and provide access to a SageMaker Unified Studio project, in the following example, you’ll create a managed Amazon Redshift Lakehouse catalog directly from SageMaker Unified Studio and work with that.

Prerequisites

To follow these instructions, you must have the following prerequisites:

An AWS account
An SageMaker Unified Studio domain

Create a SageMaker Unified Studio project

Complete the following steps to create a SageMaker Unified Studio project:

Sign in to SageMaker Unified Studio.
Choose Select a project on the top menu and choose Create project.
For Project name, enter demo.
For Project profile, choose All capabilities.
Choose Continue.

Leave the default values and choose Continue.
Review the configurations and choose Create project.

You need to wait for the project to be created. Project creation can take about 5 minutes. When the project status changes to Active, select the project name to access the project’s home page.

Make note of the Project role ARN because you’ll need it for next steps.

You’ve successfully created the project and noted the project role ARN. The next step is to configure a Lakehouse catalog for your RMS.

Configure a Lakehouse catalog for your RMS

Complete the following steps to configure a Lakehouse catalog for your RMS:

In the navigation pane, choose Data.
Choose the + (plus) sign.
Select Create Lakehouse catalog to create a new catalog and choose Next.

For Lakehouse catalog name, enter rms-catalog-demo.
Choose Add catalog.

Wait for the catalog to be created.

In SageMaker Unified Studio, choose Data in the left navigation pane, then select the three vertical dots next to Redshift (Lakehouse) and choose Refresh to make sure the Amazon Redshift compute is active.

Create a new table in the RMS Lakehouse catalog:

In SageMaker Unified Studio, on the top menu, under Build, choose Query Editor.
On the top right, choose Select data source.
For CONNECTIONS, choose Redshift (Lakehouse).
For DATABASES, choose dev@rms-catalog-demo.
For SCHEMAS, choose public.
Choose Choose.

In the query cell, enter and execute the following query to create a new schema:

create schema "dev@rms-catalog-demo".salesdb

In a new cell, enter and execute the following query to create a new table:

create table salesdb.store_sales (ss_sold_timestamp timestamp, ss_item text, ss_sales_price float);

In a new cell, enter and execute the following query to populate the table with sample data:

insert into salesdb.store_sales values ('2024-12-01T09:00:00Z', 'Product 1', 100.0),
('2024-12-01T11:00:00Z', 'Product 2', 500.0),
('2024-12-01T15:00:00Z', 'Product 3', 20.0),
('2024-12-01T17:00:00Z', 'Product 4', 1000.0),
('2024-12-01T18:00:00Z', 'Product 5', 30.0),
('2024-12-02T10:00:00Z', 'Product 6', 5000.0),
('2024-12-02T16:00:00Z', 'Product 7', 5.0);

In a new cell, enter and run the following query to verify the table contents:

select * from salesdb.store_sales;

(Optional) Create an Amazon EMR Serverless application

IMPORTANT: This section is only required if you plan to test also using Amazon EMR Serverless. If you intend to use AWS Glue exclusively, you can skip this section entirely.

Navigate to the project page. In the left navigation pane, select Compute, then select the Data processing Choose Add compute.

Choose Create new compute resources, then choose Next.

Select EMR Serverless.

Specify emr_serverless_application as Compute name, select Compatibility as Permission mode, and choose Add compute.

Monitor the deployment progress. Wait for the Amazon EMR Serverless application to complete its deployment. This process can take a minute.

Access Amazon Redshift Managed Storage tables through Apache Spark

In this section, we demonstrate how to query tables stored in RMS using a SageMaker Unified Studio notebook.

In the navigation pane, choose Data
Under Lakehouse, select the down arrow next to rms-catalog-demo
Under dev, select the down arrow next salesdb, choose store_sales, and choose the three dots

SageMaker Lakehouse oﬀers multiple analysis options: Query with Athena, Query with Redshift, and Open in Jupyter Lab notebook.

Choose Open in Jupyter Lab notebook
On the Launcher tab, choose Python 3 (ipykernel)

In SageMaker Unified Studio JupyterLab, you can specify different compute types for each notebook cell. Although this example demonstrates using AWS Glue compute (project.spark.compatibility), the same code can be executed using Amazon EMR Serverless by selecting the appropriate compute in the cell settings. The following table shows the connection type and compute values to specify when running PySpark code or Spark SQL code with different engines:

Compute option	Pyspark code		Spark SQL
Compute option	Connection type	Compute	Connection type	Compute
AWS Glue	Pyspark	`project.spark.compatibility`	SQL	`project.spark.compatibility`
Amazon EMR Serverless	Pyspark	`emr-s.emr_serverless_application`	SQL	`emr-s.emr_serverless_application`

In the notebook cell’s top left corner, set Connection Type to PySpark and select spark.compatibility (AWS Glue 5.0) as Compute
Execute the following code to initialize the SparkSession and configure rmscatalog as the session catalog for accessing the dev database under the rms-catalog-demo RMS catalog:

from pyspark.sql import SparkSession

catalog_name = "rmscatalog"
#Change <your_account_id> with your AWS account ID
rms_catalog_id = "<your_account_id>:rms-catalog-demo/dev"

#Change with your AWS region
aws_region="us-east-2"

spark = SparkSession.builder.appName('rms_demo') \
    .config(f'spark.sql.catalog.{catalog_name}', 'org.apache.iceberg.spark.SparkCatalog') \
    .config(f'spark.sql.catalog.{catalog_name}.type', 'glue') \
    .config(f'spark.sql.catalog.{catalog_name}.glue.id', rms_catalog_id) \
    .config(f'spark.sql.catalog.{catalog_name}.client.region', aws_region) \
    .config('spark.sql.extensions','org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions') \
    .getOrCreate()

Create a new cell and switch the connection type from PySpark to SQL to execute Spark SQL commands directly
Enter the following SQL statement to view all tables under salesdb (RMS schema) within rmscatalog:

SHOW TABLES IN rmscatalog.salesdb

In a new SQL cell, enter the following DESCRIBE EXTENDED statement to view detailed information about the store_sales table in the salesdb schema:

DESCRIBE EXTENDED rmscatalog.salesdb.store_sales

In the output, you’ll observe that the Provider is set to iceberg. This indicates that the table is recognized as an Iceberg table, despite being stored in Amazon Redshift managed storage.

In a new SQL cell, enter the following SELECT statement to view the content of the table

SELECT * FROM rmscatalog.salesdb.store_sales

Throughout this example, we demonstrated how to create a table in Amazon Redshift Serverless and seamlessly query it as an Iceberg table using Apache Spark within a SageMaker Unified Studio notebook.

Clean up

To avoid incurring future charges, clean up all created resources:

Delete the created SageMaker Unified Studio project. This step will automatically delete Amazon EMR compute (for example, the Amazon EMR Serverless application) that was provisioned from the project:
1. Inside SageMaker Studio, navigate to the demo project’s Project overview section.
2. Choose Actions, then select Delete project.
3. Type confirm and choose Delete project.

Delete the created Lakehouse catalog:
1. Navigate to the AWS Lake Formation page in the Catalogs section.
2. Select the rms-catalog-demo catalog, choose Actions, then select Delete.
3. In the confirmation window type rms-catalog-demo and then choose Drop.

Conclusion

In this post, we demonstrated how to use Apache Spark to interact with Amazon Redshift Managed Storage tables through Amazon SageMaker Lakehouse using the Iceberg REST catalog. This integration provides a unified view of your data across Amazon S3 data lakes and Amazon Redshift data warehouses, so you can build powerful analytics and AI/ML applications while maintaining a single copy of your data.

For additional workloads and implementations, visit Simplify data access for your enterprise using Amazon SageMaker Lakehouse.

About the Authors

Noritaka Sekiyama is a Principal Big Data Architect with Amazon Web Services (AWS) Analytics services. He’s responsible for building software artifacts to help customers. In his spare time, he enjoys cycling on his road bike.

Stefano Sandonà is a Senior Big Data Specialist Solution Architect at Amazon Web Services (AWS). Passionate about data, distributed systems, and security, he helps customers worldwide architect high-performance, efficient, and secure data solutions.

Derek Liu is a Senior Solutions Architect based out of Vancouver, BC. He enjoys helping customers solve big data challenges through Amazon Web Services (AWS) analytic services.

Raj Ramasubbu is a Senior Analytics Specialist Solutions Architect focused on big data and analytics and AI/ML with Amazon Web Services (AWS). He helps customers architect and build highly scalable, performant, and secure cloud-based solutions on AWS. Raj provided technical expertise and leadership in building data engineering, big data analytics, business intelligence, and data science solutions for over 18 years prior to joining AWS. He helped customers in various industry verticals like healthcare, medical devices, life science, retail, asset management, car insurance, residential REIT, agriculture, title insurance, supply chain, document management, and real estate.

Angel Conde Manjon is a Sr. EMEA Data & AI PSA, based in Madrid. He has previously worked on research related to data analytics and AI in diverse European research projects. In his current role, Angel helps partners develop businesses centered on data and AI.

Appendix: Sample script for Lake Formation FGAC enabled Spark cluster

If you want to access RMS tables from Lake Formation FGAC enabled Spark cluster on AWS Glue or Amazon EMR, refer to the following code example:

from pyspark.sql import SparkSession

catalog_name = "rmscatalog"
rms_catalog_name = "123456789012:rms-catalog-demo/dev"
account_id = "123456789012"
region = "us-east-2"

spark = SparkSession.builder.appName('rms_demo') \
.config('spark.sql.defaultCatalog', catalog_name) \
.config(f'spark.sql.catalog.{catalog_name}', 'org.apache.iceberg.spark.SparkCatalog') \
.config(f'spark.sql.catalog.{catalog_name}.type', 'glue') \
.config(f'spark.sql.catalog.{catalog_name}.glue.id', rms_catalog_name) \
.config(f'spark.sql.catalog.{catalog_name}.client.region', region) \
.config(f'spark.sql.catalog.{catalog_name}.glue.account-id', account_id) \
.config(f'spark.sql.catalog.{catalog_name}.glue.catalog-arn',f'arn:aws:glue:{region}:{rms_catalog_name}') \
.config('spark.sql.extensions','org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions') \
.getOrCreate()

Petabyte-scale data migration made simple: AppsFlyer’s best practice journey with Amazon EMR Serverless

2025-05-12 Roy Ninio

Post Syndicated from Roy Ninio original https://aws.amazon.com/blogs/big-data/petabyte-scale-data-migration-made-simple-appsflyers-best-practice-journey-with-amazon-emr-serverless/

This post is co-written with Roy Ninio from Appsflyer.

Organizations worldwide aim to harness the power of data to drive smarter, more informed decision-making by embedding data at the core of their processes. Using data-driven insights enables you to respond more effectively to unexpected challenges, foster innovation, and deliver enhanced experiences to your customers. In fact, data has transformed how organizations drive decision-making, but historically, managing the infrastructure to support it posed significant challenges and required specific skill sets and dedicated personnel. The complexity of setting up, scaling, and maintaining large-scale data systems impacted agility and pace of innovation. This reliance on experts and intricate setups often diverted resources from innovation, slowed time-to-market, and hindered the ability to respond to changes in industry demands.

AppsFlyer is a leading analytics and attribution company designed to help businesses measure and optimize their marketing efforts across mobile, web, and connected devices. With a focus on privacy-first innovation, AppsFlyer empowers organizations to make data-driven decisions while respecting user privacy and compliance regulations. AppsFlyer provides tools for tracking user acquisition, engagement, and retention, delivering actionable insights to enhance ROI and streamline marketing strategies.

In this post, we share how AppsFlyer successfully migrated their massive data infrastructure from self-managed Hadoop clusters to Amazon EMR Serverless, detailing their best practices, challenges to overcome, and lessons learned that can help guide other organizations in similar transformations.

Why AppsFlyer embraced a serverless approach for big data

AppsFlyer manages one of the largest-scale data infrastructures in the industry, processing 100 PB of data daily, handling millions of events per second, and running thousands of jobs across nearly 100 self-managed Hadoop clusters. The AppsFlyer architecture is comprised of many data engineering open source technologies, including but not limited to Apache Spark, Apache Kafka, Apache Iceberg, and Apache Airflow. Although this setup has powered operations for years, the growing complexity of scaling resources to meet fluctuating demands, coupled with the operational overhead of maintaining clusters, prompted AppsFlyer to rethink their big data processing strategy.

EMR Serverless is a modern, scalable solution that alleviates the need for manual cluster management while dynamically adjusting resources to match real-time workload requirements. With EMR Serverless, scaling up or down happens within seconds, minimizing idle time and interruptions like spot terminations.

This shift has freed engineering teams to focus on innovation, improved resilience and high availability, and future-proofed the architecture to support their ever-increasing demands. By only paying for compute and memory resources used during runtime, AppsFlyer also optimized costs and minimized charges for idle resources, marking a significant step forward in efficiency and scalability.

Solution overview

AppsFlyer’s previous architecture was built around self-managed Hadoop clusters running on Amazon Elastic Compute Cloud (Amazon EC2) and handled the scale and complexity of the data workflows. Although this setup supported operational needs, it required substantial manual effort to maintain, scale, and optimize.

AppsFlyer orchestrated over 100,000 daily workflows with Airflow, managing both streaming and batch operations. Streaming pipelines used Spark Streaming to ingest real-time data from Kafka, writing raw datasets to an Amazon Simple Storage Service (Amazon S3) data lake while simultaneously loading them into BigQuery and Google Cloud Storage to build logical data layers. Batch jobs then processed this raw data, transforming it into actionable datasets for internal teams, dashboards, and analytics workflows. Additionally, some processed outputs were ingested into external data sources, enabling seamless delivery of AppsFlyer insights to customers across the web.

For analytics and fast queries, real-time data streams were ingested into ClickHouse and Druid to power dashboards. Additionally, Iceberg tables were created from Delta Lake raw data and made accessible through Amazon Athena for further data exploration and analytics.

With the migration to EMR Serverless, AppsFlyer replaced its self-managed Hadoop clusters, bringing significant improvements to scalability, cost-efficiency, and operational simplicity.

Spark-based workflows, including streaming and batch jobs, were migrated to run on EMR Serverless and take advantage of the elasticity of EMR Serverless, dynamically scaling to meet workload demands.

This transition has significantly reduced operational overhead, alleviating the need for manual cluster management, so teams can focus more on data processing and less on infrastructure.

The following diagram illustrates the solution architecture.

This post reviews the main challenges and lessons learned by the team at AppsFlyer from this migration.

Challenges and lessons learned

Migrating a large-scale organization like AppsFlyer, with dozens of teams, from Hadoop to EMR Serverless was a significant challenge—especially because many R&D teams had limited or no prior experience managing infrastructure. To provide a smooth transition, AppsFlyer’s Data Infrastructure (DataInfra) team developed a comprehensive migration strategy that empowered the R&D teams to seamlessly migrate their pipelines.

In this section, we discuss how AppsFlyer approached the challenge and achieved success for the entire organization.

Centralized preparation by the DataInfra team

To provide a seamless transition to EMR Serverless, the DataInfra team took the lead in centralizing preparation efforts:

Clear ownership – Taking full responsibility for the migration, the team planned, guided, and supported R&D teams throughout the process.
Structured migration guide – A detailed, step-by-step guide was created to streamline the transition from Hadoop, breaking down the complexities and making it accessible to teams with limited infrastructure experience.

Building a strong support network

To make sure the R&D teams had the resources they needed, AppsFlyer established a robust support environment:

Data community – The primary resource for answering technical questions. It encouraged knowledge sharing across teams and was spearheaded by the DataInfra team.
Slack support channel – A dedicated channel where the DataInfra team actively responded to questions and guided teams through the migration process. This real-time support significantly reduced bottlenecks and helped teams resolve issues quickly.

Infrastructure templates with best practices

Recognizing the complexity of the team’s migration, the DataInfra team had standardized templates to help teams start quickly and efficiently:

Infrastructure as code (IaC) templates – They developed Terraform templates with best practices for building applications on EMR Serverless. These templates included code examples and real production workflows already migrated to EMR Serverless. Teams could quickly bootstrap their projects by using these ready-made templates.
Cross-account access solutions – Operating across multiple AWS accounts required managing secure access between EMR Serverless accounts (where jobs run) and data storage accounts (where datasets reside). To streamline this, a step-by-step module was developed for setting up cross-account access using Assume Role permissions. Additionally, a dedicated repository was created, so teams can define and automate role and policy creation, providing seamless and scalable access management.

Airflow integration

As AppsFlyer’s primary workflow scheduler, Airflow plays a critical role, making it essential to provide a seamless transition for its users.

AppsFlyer developed a dedicated Airflow operator for executing Spark jobs on EMR Serverless, carefully designed to replicate the functionality of the existing Hadoop-based Spark operator. In addition, a Python package was made available across all Airflow clusters with the relevant operators. This approach minimized code changes, allowing teams to transition seamlessly with minimal modifications.

Solving common permission challenges

To streamline permissions management, AppsFlyer developed targeted solutions for frequent use cases:

Comprehensive documentation – Provided detailed instructions for handling permissions for services like Athena, BigQuery, Vault, GIT, Kafka, and many more.
Standardized Spark defaults configuration for teams to apply to their applications – Included built-in solutions for collecting lineage from Spark jobs running on EMR Serverless, providing accountability and traceability.

Continuous engagement with R&D teams

To promote progress and maintain alignment across teams, AppsFlyer introduced the following measures:

Weekly meetings – Weekly status meetings to review the status of each team’s migration efforts. Teams shared updates, challenges, and commitments, fostering transparency and collaboration.
Assistance – Proactive assistance was provided for issues raised during meetings to minimize delays. This made sure that the teams were on track and had the support they needed to meet their commitments.

By implementing these strategies, AppsFlyer transformed the migration process from a daunting challenge into a structured and well-supported journey. Key outcomes included:

Empowered teams – R&D teams with minimal infrastructure experience were able to confidently migrate their pipelines.
Standardized practices – Infrastructure templates and predefined solutions provided consistency and best practices across the organization.
Reduced downtime – The custom Airflow operator and detailed documentation minimized disruptions to existing workflows.
Cross-account compatibility – With seamless cross-account access, teams could run jobs and access data efficiently.
Improved collaboration – The data community and Slack support channel fostered a sense of collaboration and shared responsibility across teams.

Migrating an entire organization’s data workflows to EMR Serverless is a complex task, but by investing in preparation, templates, and support, AppsFlyer successfully streamlined the process for all R&D teams in the company.

This approach can serve as a model for organizations undertaking similar migrations.

Spark application code management and deployment

For AppsFlyer data engineers, developing and deploying Spark applications is a core daily responsibility. The Data Platform team focuses on identifying and implementing the right set of tools and safeguards that would not only simplify the migration to EMR Serverless, but also streamline ongoing operations.

There are two different approaches available for running Spark code on EMR Serverless: custom container images and JARs or Python files. At the beginning of the exploration, custom images looked promising because it allows greater customization than JARs, which should allow the DataInfra team smoother migration for existing workloads. After deeper research, it was realized that custom images have great power, but come with a cost that in large scale would need to be evaluated. Custom images presented the following challenges:

Custom images are supported as of version 6.9.0, but some of AppsFlyer’s workloads used earlier versions.
EMR Serverless resources run from the moment EMR Serverless begins downloading the image until workers are stopped. This means a payment is done for aggregate vCPU, memory, and storage resources during the image download phase.
They required a different continuous integration and delivery (CI/CD) approach than compiling a JAR or Python file, leading to operational work that should be minimized as much as possible.

AppsFlyer decided to go all in with JARs and allow only in unique cases, where the customization required the use of custom images. Eventually, it was realized that using non-custom images was suitable for AppsFlyer use cases.

CI/CD perspective

From a CI/CD perspective, AppsFlyer’s DataInfra team decided to align with AppsFlyer’s GitOps vision, making sure that both infrastructure and application code are version-controlled, built, and deployed using Git operations.

The following diagram illustrates the GitOps approach AppsFlyer adopted.

JARs continuous integration

For CI, the process in charge of building the application artifacts, several options have been explored. The following key considerations drove the exploration process:

Use Amazon S3 as the native JAR source for EMR Serverless
Support different versions for the same job
Support staging and production environments
Allow hotfixes, patches, and rollbacks

Using AppsFlyer’s current external package repository led to challenges, because it required them to build a custom delivery into Amazon S3 or a complex runtime ability to fetch the code externally.

Using Amazon S3 directly also had several alternative approaches:

Buckets – Use single vs. separated buckets for staging and production
Versions – Use Amazon S3 native object versioning vs. uploading a new file
Hotfix – Override the same job’s JAR file vs. uploading a new one

Finally, the decision was to go with immutable builds for consistent deployment across the environments.

Each Spark job git repository pushes to the main branch, triggers a CI process to validate the semantic versioning (semver) assignment, compiles the JAR artifact, and uploads it to Amazon S3. Each artifact is uploaded to three different paths according to the version of the JAR, and also include a version tag for the S3 object:

<BucketName>/<SparkJobName>/<major>"."<minor>"."<patch>/app.jar
<BucketName>/<SparkJobName>/<major>"."<minor>"/app.jar
<BucketName>/<SparkJobName>/<major>/app.jar

AppsFlyer can now have deep granularity and assign each EMR Serverless job to a pinpointed version. Some jobs can run with the latest major version, and other stability and SLA sensitive jobs require a lock to a specific patch version.

EMR Serverless continuous deployment

Uploading the files to Amazon S3 was the final step in the CI process, which then leads to a different CD process.

CD is done by changing the infrastructure code, which is Terraform based, to point to the new JAR that was uploaded to Amazon S3. Then the staging or production application can start using the newly uploaded code and the process can be considered deployed.

Spark application rollbacks

If they need an application rollback, AppsFlyer points the EMR Serverless job IaC configuration from the current impaired JAR version to the previous stable JAR version in the relevant Amazon S3 path.

AppsFlyer believes that every automation impacting production, like CD, requires a breaking glass mechanism for an emergency situation. In such cases, AppsFlyer can manually override the needed S3 object (JAR file) while still using Amazon S3 versions in order to have better visibility and manual version control.

Single-job vs. multi-job applications

When using EMR Serverless, one important architectural decision is whether to create a separate application for each Spark job or use an automatic scaling application shared across multiple Spark jobs. The following table summarizes these considerations.

Aspect	Single-Job Application	Multi-Job Application
Logical Nature	Dedicated application for each job.	Shared application for multiple jobs.
Shared Configurations	Limited shared configurations; each application is independently configured.	Allows shared configurations through spark-defaults, including executors, memory settings, and JARs.
Isolation	Maximum isolation; each job runs independently.	Maintains job-level isolation through distinct IAM roles despite sharing the application.
Flexibility	Flexible for unique configurations or resource requirements.	Reduces overhead by reusing configurations and using automatic scaling.
Overhead	Higher setup and management overhead due to multiple applications.	Lower administrative overhead but requires careful resource contention management.
Use Cases	Suitable for jobs with unique requirements or strict isolation needs.	Ideal for related workloads that benefit from shared settings and dynamic scaling.

By balancing these considerations, AppsFlyer tailored its EMR Serverless usage to efficiently meet the demands of diverse Spark workloads across their teams.

Airflow operator: Simplifying the transition to EMR Serverless

Before the migration to EMR Serverless, AppsFlyer’s teams relied on a custom Airflow Spark operator created by the DataInfra team.

This operator, packaged as a Python library, was integrated into the Airflow environment and became a key component of the data workflows.

It provided essential capabilities, including:

Retries and alerts – Built-in retry logic and PagerDuty alert integration
AWS role-based access – Automatic fetching of AWS permissions based on role names
Custom defaults – Setting Spark configurations and package defaults tailored for each job
State management – Job state tracking

This operator streamlined running Spark jobs on Hadoop and was highly tailored to AppsFlyer’s requirements.

When moving to EMR Serverless, the team chose to build a custom Airflow operator to align with their existing Spark-based workflows. They already had dozens of Directed Acyclic Graphs (DAGs) in production, so with this approach, they could maintain their familiar interface, including custom handling for retries, alerting, and configurations—all without requiring broad changes across the board.

This abstraction provided a smoother migration by preserving the same development patterns and minimizing the migration efforts of adapting to the native operator semantics.

The DataInfra team developed a dedicated, custom, EMR Serverless operator to support the following goals:

Seamless migration – The operator was designed to closely mimic the interface of the existing Spark operator on Hadoop. This made sure that teams could migrate with minimal code changes.
Feature parity – They added the features missing from the native operator:
- Built-in retry logic.
- PagerDuty integration for alerts.
- Automatic role-based permission fetching.
- Default Spark configurations and package support for each job.
Simplified integration – It’s packaged as a Python library available in Airflow clusters. Teams could use the operator just like they did with the previous Spark operator.

The custom operator abstracts some of the underlying configurations required to submit jobs to EMR Serverless, aligning with AppsFlyer’s internal best practices and adding essential features.

The following is from an example DAG using the operator:

return SparkBatchJobEmrServerlessOperator(
    task_id=task_id,  # Unique task identifier in the DAG

    jar_file=jar_file,  # Path to the Spark job JAR file on S3
    main_class="<main class path>",

    spark_conf=spark_conf,

    app_id=default_args["<emr_serverless_application_id>"],  # EMR Serverless app ID
    execution_role=default_args["<job_execution_role_arn>"],  # IAM role for job execution

    polling_interval_sec=120,  # How often to poll for job status
    execution_timeout=timedelta(hours=1),  # Max allowed runtime

    retries=5,  # Retry attempts for failed jobs
    app_args=[],  # Arguments to pass to the Spark job

    depends_on_past=True,  # Ensure sequential task execution

    tags={'owner': '<team_tag>'},  # Metadata for ownership
    aws_assume_role="<my_aws_role>",  # Role for cross-account access

    alerting_policy=ALERT_POLICY_CRITICAL.with_slack_channel(sc),  # Alerting integration
    owner="<team_owner>",

    dag=dag  # DAG this task belongs to
)

Cross-account permissions on AWS: Simplifying EMRs workflows

AppsFlyer operates across multiple AWS accounts, creating a need for secure and efficient cross-account access. EMR Serverless jobs are executed in the production account, and the data they process resides in a separate data account. To enable seamless operation, Assume Role permissions are used to verify that EMR Serverless jobs running in the production account can access the data and services in the data account. The following diagram illustrates this architecture.

Below is a diagram demonstrating the cross-account permissions AppsFlyer adopted:

Role management strategy

To manage cross-account access efficiently, three distinct roles were created and maintained:

EMR role – Used for executing and managing EMR Serverless applications in the production account. Integrated directly into Airflow workers to make it available for the DAGs on the dedicated team Airflow cluster.
Execution role – Assigned to the Spark job running on EMR Serverless. Passed by the EMR role in the DAG code to provide seamless integration.
Data role – Resides in the data account and is assumed by the execution role to access data stored in Amazon S3 and other AWS services.

To enforce access boundaries, each role and policy is tagged with team-specific identifiers.
This makes sure that teams can only access their own data and roles, minimizing unauthorized access to other teams’ resources.

Simplifying Airflow migration

A streamlined process to make cross-account permissions transparent for teams migrating their workloads to EMR Serverless was developed:

The EMR role is embedded into Airflow workers, making it available for DAGs in the dedicated Airflow cluster for each team:

{
   "Version":"2012-10-17",
   "Statement":[
      "..."{
         "Effect":"Allow",
         "Action":"iam:PassRole",
         "Resource":"arn:aws:iam::account-id:role/execution-role",
         "Condition":{
            "StringEquals":{
               "iam:ResourceTag/Team":"team-tag"
            }
         }
      }
   ]
}

The EMR role automatically passes the execution role to the job within the DAG code:

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": "sts:AssumeRole",
      "Resource": "arn:aws:iam::data-account-id:role/data-role",
      "Condition": {
        "StringEquals": {
          "iam:ResourceTag/Team": "team-tag"
        }
      }
    }
  ]
}

The execution role assumes the data role dynamically during job execution to access the required data and services in the data account:

Allows the Execution Role in the Production account to assume the Data Role.

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": {
        "AWS": "arn:aws:iam::production-account-id:role/execution-role"
      },
      "Action": "sts:AssumeRole"
    }
  ]
}

Policies, trust relationships, and role definitions are managed in a dedicated GitLab repository. GitLab CI/CD pipelines automate the creation and integration of roles and policies, providing consistency and reducing manual overhead.

Benefits of AppsFlyer’s approach

This approach offered the following benefits:

Seamless access – Teams no longer need to handle cross-account permissions manually because these are automated through preconfigured roles and policies, providing seamless and secure access to resources across accounts.
Scalable and secure – Role-based and tag-based permissions provide security and scalability across multiple teams and accounts. By using roles and tags, it alleviates the need to create separate hardcoded policies for each team or account. Instead, they can define generalized policies that scale automatically as new resources, accounts, or teams are added.
Automated management – GitLab CI/CD streamlines the deployment and integration of policies and roles, reducing manual effort while enhancing consistency. It also minimizes human errors, improves change transparency, and simplifies version management.
Flexibility for teams – Teams have the flexibility to use their own or native EMR Serverless operators while maintaining secure access to data.

By implementing a robust, automated cross-account permissions system, AppsFlyer has enabled secure and efficient access to data and services across multiple AWS accounts. This makes sure that teams can focus on their workloads without worrying about infrastructure complexities, accelerating their migration to EMR Serverless.

Integrating lineage into EMR Serverless

AppsFlyer developed a robust solution for column-level lineage collection to provide comprehensive visibility into data transformations across pipelines. Lineage data is stored in Amazon S3 and subsequently ingested into DataHub, AppsFlyer’s lineage and metadata management environment.

Currently, AppsFlyer collects column-level lineage from a variety of sources, including Amazon Athena, BigQuery, Spark, and more.

This section focuses on how AppsFlyer collects Spark column-level lineage specifically within the EMR Serverless infrastructure.

Collecting Spark lineage with Spline

To capture lineage from Spark jobs, AppsFlyer uses Spline, an open source tool designed for automated tracking of data lineage and pipeline structures.

AppsFlyer modified Spline’s default behavior to output a customized Spline object that aligns with AppsFlyer’s specific requirements. AppsFlyer adapted the Spline integration into both legacy and modern environments. In the pre-migration phase, they injected the Spline agent into Spark jobs through their customized Airflow Spark operator. In the post-migration phase, they integrated Spline directly into EMR Serverless applications.

The lineage workflow consists of the following steps:

As Spark jobs execute, Spline captures detailed metadata about the queries and transformations performed.
The captured metadata is exported as Spline object files to a dedicated S3 bucket.
These Spline objects are processed into column-level lineage objects customized to fit AppsFlyer’s data architecture and requirements.
The processed lineage data is ingested into DataHub, providing a centralized and interactive view of data dependencies.

The following figure is an example of a lineage diagram from DataHub.

Challenges and how AppsFlyer addressed them

AppsFlyer encountered the following challenges:

Supporting different EMR Serverless applications – Each EMR Serverless application has its own Spark and Scala version requirements.
Diverse operator usage – Teams often use custom or native EMR Serverless operators, making uniform Spline integration challenging.
Confirming universal adoption – They need to make sure Spark jobs across multiple accounts use the Spline agent for lineage tracking.

AppsFlyer addressed these challenges with the following solutions:

Version-specific Spline agents – AppsFlyer created a dedicated Spline agent for each EMR Serverless application version to match its Spark and Scala versions. For example, EMR Serverless application version 7.0.1 and Spline.7.0.1.
Spark defaults integration – They integrated the Spline agent into EMR Serverless application Spark defaults to verify lineage collection for jobs executed on the application—no job-specific modifications needed.
Automation for compliance – This process consists of the following steps:
- Detect a newly created EMR Serverless application across accounts.
- Verify that Spline is properly defined in the application’s Spark defaults.
- Send a PagerDuty alert to the dedicated team if misconfigurations are detected.

Example integration with Terraform

To automate Spline integration, AppsFlyer used Terraform and local-exec to define Spark defaults for EMR Serverless applications. With Amazon EMR, you can set unified Spark configuration properties through spark-defaults, which are then applied to Spark jobs.

This configuration makes sure the Spline agent is automatically applied to every Spark job without requiring modifications to the Airflow operator or the job itself.

This robust lineage integration provides the following benefits:

Full visibility – Automatic lineage tracking provides detailed insights into data transformations
Seamless scalability – Version-specific Spline agents provide compatibility with EMR Serverless applications
Proactive monitoring – Automated compliance checks verify that lineage tracking is consistently enabled across accounts
Enhanced governance – Ingesting lineage data into DataHub provides traceability, supports audits, and fosters a deeper understanding of data dependencies

By integrating Spline with EMR Serverless applications, AppsFlyer has provided comprehensive and automated lineage tracking, so teams can understand their data pipelines better while meeting compliance requirements. This scalable approach aligns with AppsFlyer’s commitment to maintaining transparency and reliability throughout their data landscape.

Monitoring and observability

When embarking on a large migration, and as a day-to-day best-practice process, monitoring and observability are key parts of being able to run workloads successfully for stability, debugging, and cost.

AppsFlyer’s DataInfra team set several KPIs for monitoring and observability in EMR Serverless:

Monitor infrastructure-level metrics and logs:
- EMR Serverless resource usage, including cost
- EMR Serverless API usage
Monitor Spark application-level metrics and logs:
- stdout and stderr logs
- Spark engine metrics
Centralized observability over the existing environments, Datadog

Metrics

Using EMR Serverless native metrics, AppsFlyer’s DataInfra team set up several dashboards to support tracking both the migration and the day-to-day usage of EMR Serverless across the company. The following are the main metrics that were monitored:

Service quota usage metrics:
- vCPU usage tracking (ResourceCount with vCPU dimension)
- API usage tracking (API actual usage vs. API limits)
Application status metrics:
- RunningJobs, SuccessJobs, FailedJobs, PendingJobs, CancelledJobs
Resource limits tracking:
- MaxCPUAllowed vs. CPUAllocated
- MaxMemoryAllowed vs. MemoryAllocated
- MaxStorageAllowed vs. StorageAllocated
Worker-level metrics:
- WorkerCpuAllocated vs. WorkerCpuUsed
- WorkerMemoryAllocated vs. WorkerMemoryUsed
- WorkerEphemeralStorageAllocated vs. WorkerEphemeralStorageUsed
Capacity allocation tracking:
- Metrics filtered by CapacityAllocationType (PreInitCapacity vs. OnDemandCapacity)
- ResourceCount
Worker type distribution:
- Metrics filtered by WorkerType (SPARK_DRIVER vs. SPARK_EXECUTORS)
Job success rates over time:
- SuccessJobs vs. FailedJobs ratio
- SubmitedJobs vs. PendingJobs

The following screenshot shows an example of the tracked metrics.

Logs

For logs management, AppsFlyer’s DataInfra team explored several options:

Native Amazon S3 and custom shipper (using AWS Lambda) to Datadog, the central system for logging being used at AppsFlyer
Native Amazon CloudWatch Logs with subscriptions to Lambda, Amazon Kinesis Data Streams, and Amazon Data Firehose, shipped to Datadog
Apache Log4j log shipper, which ships logs directly from the Spark application to Datadog

Streamlining EMR Serverless log shipping to Datadog

Because AppsFlyer decided to keep their logs in an external logging environment, the DataInfra team aimed to reduce the number of components involved in the shipping process and minimize maintenance overhead. Instead of managing a Lambda based log shipper, they developed a custom Spark plugin that seamlessly exports logs from EMR Serverless to Datadog.

Companies already storing logs in Amazon S3 or CloudWatch Logs can take advantage of EMR Serverless native support for those environments. However, for teams needing a direct, real-time integration with Datadog, this approach alleviates the need for extra infrastructure, providing a more efficient and maintainable logging solution.

The custom Spark plugin offers the following capabilities:

Automated log export – Streams logs from EMR Serverless to Datadog
Fewer extra components – Alleviates the need for Lambda based log shippers
Secure API key management – Uses Vault instead of hardcoding credentials
Customizable logging – Supports custom Log4j settings and log levels
Full integration with Spark – Works on both driver and executor nodes

How the plugin works

In this section, we walk through the components of how the plugin works and provide a pseudocode overview:

Driver plugin – LoggerDriverPlugin runs on the Spark driver to configure logging. The plugin fetches EMR job metadata, calls Vault to retrieve the Datadog API key, and configures logging settings.

initialize() {
  if (user provided log4j.xml) {
     Use custom log configuration
  } else {
     Fetch EMR job metadata (application name, job ID, tags)
     Retrieve Datadog API key from Vault
     Apply default logging settings
  }
}

Executor plugin – LoggerExecutorPlugin provides consistent logging across executor nodes. It inherits the driver’s log configuration and makes sure the executors use consistent logging

initialize() {
   fetch logging config from Driver
   apply log settings (log4j, log levels)
}

Main plugin – LoggerSparkPlugin registers the driver and executor plugins in Spark. It serves as the entry point for Spark and applies custom logging settings dynamically.

function registerPlugin() {
  return (driverPlugin, executorPlugin);
}

Vault authentication – AuthUtils handles Vault authentication using AWS Identity and Access Management (IAM) and retrieves the Datadog API key securely.

loginToVault(role, vaultAddress) {
    create AWS signed request
    authenticate with Vault
    return vault token
}

getDatadogApiKey(vaultToken, secretPath) {
    fetch API key from Vault
    return key
}

Set up the plugin

To set up the plugin, complete the following steps:

Add the following dependencies to your project:

<dependency>
  <groupId>com.AppsFlyer.datacom</groupId>
  <artifactId>emr-serverless-logger-plugin</artifactId>
  <version><!-- insert version here --></version>
</dependency>

Configure the Spark plugin. The following code enables the custom Spark plugin and assigns the Vault role to access the Datadog API key:

--conf "spark.plugins=com.AppsFlyer.datacom.emr.plugin.LoggerSparkPlugin"

--conf "spark.datacom.emr.plugin.vaultAuthRole=your_vault_role"

Use a custom or default Log4j configuration:

--conf "spark.datacom.emr.plugin.location=classpath:my_custom_log4j.xml"

Set the environment variables for different log levels. This adjusts the logging for specific packages.

--conf "spark.emr-serverless.driverEnv.ROOT_LOG_LEVEL=WARN"

--conf "spark.executorEnv.ROOT_LOG_LEVEL=WARN"

--conf "spark.emr-serverless.driverEnv.LOG_LEVEL=DEBUG"

--conf "spark.executorEnv.LOG_LEVEL=DEBUG"

Configure the Vault and Datadog API key and verify secure Datadog API key retrieval.

By adopting this plugin, AppsFlyer was able to significantly simplify log shipping, reducing the number of moving parts while maintaining real-time log visibility in Datadog. This approach provides reliability, security, and ease of maintenance, making it an ideal solution for teams using EMR Serverless with Datadog.

Summary

Through their migration to EMR Serverless, AppsFlyer achieved a significant transformation in team autonomy and operational efficiency. Individual teams now have greater freedom to choose and build their own resources without depending on a central infrastructure team, and can work more independently and innovatively. The minimization of spot interruptions, which were common in their previous self-managed Hadoop clusters, has substantially improved stability and agility in their operations. Thanks to this autonomy and reliability, combined with the automatic scaling capabilities of EMR Serverless, the AppsFlyer teams can focus more on data processing and innovation rather than infrastructure management. The result is a more efficient, flexible, and self-sufficient development environment where teams can better respond to their specific needs while maintaining high performance standards.

Ruli Weisbach, AppsFlyer EVP of R&D, says,

“EMR-Serverless is a game changer for AppsFlyer; we are able to save significantly our cost with remarkably lower management overhead and maximal elasticity.”

If the AppsFlyer approach sparked your interest and you are thinking about implementing a similar solution in your organization, refer to the following resources:

Migrating to EMR Serverless can transform your organization’s data processing capabilities, offering a fully managed, cloud-based experience that automatically scales resources and eases the operational complexity of traditional cluster management, while enabling advanced analytics and machine learning workloads with greater cost-efficiency.

About the authors

Roy Ninio is an AI Platform Lead with deep expertise in scalable data platform and cloud-native architectures. At AppsFlyer, Roy led the design of a high-performance Data Lake handling PB of daily events, driven the adoption of EMR Serverless for dynamic big data processing, and architected lineage and governance systems across platforms.

Avichay Marciano is a Sr. Analytics Solutions Architect at Amazon Web Services. He has over a decade of experience in building large-scale data platforms using Apache Spark, modern data lake architectures, and OpenSearch. He is passionate about data-intensive systems, analytics at scale, and it’s intersection with machine learning.

Eitav Arditti is AWS Senior Solutions Architect with 15 years in AdTech industry, specializing in Serverless, Containers, Platform engineering, and Edge technologies. Designs cost-efficient, large-scale AWS architectures that leverage the cloud-native and edge computing to deliver scalable, reliable solutions for business growth.

Yonatan Dolan is a Principal Analytics Specialist at Amazon Web Services. Yonatan is an Apache Iceberg evangelist, helping customers design scalable, open data lakehouse architectures and adopt modern analytics solutions across industries.

Amazon SageMaker Lakehouse now supports attribute-based access control

2025-04-24 Sandeep Adwankar

Post Syndicated from Sandeep Adwankar original https://aws.amazon.com/blogs/big-data/amazon-sagemaker-lakehouse-now-supports-attribute-based-access-control/

Amazon SageMaker Lakehouse now supports attribute-based access control (ABAC) with AWS Lake Formation, using AWS Identity and Access Management (IAM) principals and session tags to simplify data access, grant creation, and maintenance. With ABAC, you can manage business attributes associated with user identities and enable organizations to create dynamic access control policies that adapt to the specific context.

SageMaker Lakehouse is a unified, open, and secure data lakehouse that now supports ABAC to provide unified access to general purpose Amazon S3 buckets, Amazon S3 Tables, Amazon Redshift data warehouses, and data sources such as Amazon DynamoDB or PostgreSQL. You can then query, analyze, and join the data using Redshift, Amazon Athena, Amazon EMR, and AWS Glue. You can secure and centrally manage your data in the lakehouse by defining fine-grained permissions with Lake Formation that are consistently applied across all analytics and machine learning(ML) tools and engines. In addition to its support for role-based and tag-based access control, Lake Formation extends support to attribute-based access to simplify data access management for SageMaker Lakehouse, with the following benefits:

Flexibility – ABAC policies are flexible and can be updated to meet changing business needs. Instead of creating new rigid roles, ABAC systems allow access rules to be modified by simply changing user or resource attributes.
Efficiency – Managing a smaller number of roles and policies is more straightforward than managing a large number of roles, reducing administrative overhead.
Scalability – ABAC systems are more scalable for larger enterprises because they can handle a large number of users and resources without requiring a large number of roles.

Attribute-based access control overview

Previously, within SageMaker Lakehouse, Lake Formation granted access to resources based on the identity of a requesting user. Our customers were requesting the capability to express the full complexity required for access control rules in organizations. ABAC allows for more flexible and nuanced access policies that can better reflect real-world needs. Organizations can now grant permissions on a resource based on user attribute and is context-driven. This allows administrators to grant permissions on a resource with conditions that specify user attribute keys and values. IAM principals with matching IAM or session tag key-value pairs will gain access to the resource.

Instead of creating a separate role for each team member’s access to a specific project, you can set up ABAC policies to grant access based on attributes like membership and user role, reducing the number of roles required. For instance, without ABAC, a company with an account manager role that covers five different geographical territories needs to create five different IAM roles and grant data access for only the specific territory for which the IAM role is meant. With ABAC, they can simply add those territory attributes as keys/values to the principal tag and provide data access grants based on those attributes. If the value of the attribute for a user changes, access to the dataset will automatically be invalidated.

With ABAC, you can use attributes such as department or country and use IAM or sessions tags to determine access to data, making it more straightforward to create and maintain data access grants. Administrators can define fine-grained access permissions with ABAC to limit access to databases, tables, rows, columns, or table cells.

In this post, we demonstrate how to get started with ABAC in SageMaker Lakehouse and use with various analytics services.

Solution overview

To illustrate the solution, we are going to consider a fictional company called Example Retail Corp. Example Retail’s leadership is interested in analyzing sales data in Amazon S3 to determine in-demand products, understand customer behavior, and identify trends, for better decision-making and increased profitability. The sales department sets up a team for sales analysis with the following data access requirements:

All data analysts in the Sales department in the US get access to only sales-specific data in only US regions
All BI analysts in the Sales department have full access to data in only US regions
All scientists in the Sales department get access to only sales-specific data across all regions
Anyone outside of Sales department have no access to sales data

For this post, we consider the database salesdb, which contains the store_sales table that has store sales details. The table store_sales has the following schema.

To demonstrate the product sales analysis use case, we will consider the following personas from the Example Retail Corp:

Ava is a data administrator in Example Retail Corp who is responsible for supporting team members with specific data permission policies
Alice is a data analyst who should be able to access sales specific US store data to perform product sales analysis
Bob is a BI analyst who should be able to access all data from US store sales to generate reports
Charlie is a data scientist who should be able to access sales specific across all regions to explore and find patterns for trend analysis

Ava decides to use SageMaker Lakehouse to unify data across various data sources while setting up fine-grained access control using ABAC. Alice is excited about this decision as she can now build daily reports using her expertise with Athena. Bob now knows that he can quickly build Amazon QuickSight dashboards with queries that are optimized using Redshift’s cost-based optimizer. Charlie, being an open source Apache Spark contributor, is excited that he can build Spark based processing with Amazon EMR to build ML forecasting models.

Ava defines the user attributes as static IAM tags that could also include attributes stored in the identity provider (IdP) or as session tags dynamically to represent the user metadata. These tags are assigned to IAM users or roles and can be used to define or restrict access to specific resources or data. For more details, refer to Tags for AWS Identity and Access Management resources and Pass session tags in AWS STS.

For this post, Ava assigns users with static IAM tags to represent the user attributes, including their department membership, Region assignment, and current role relationship. The following table summarizes the tags that represent user attributes and user assignment.

User	Persona	Attributes	Access
Alice	Data Analyst	Department=`sales` Region=`US` Role=`Analyst`	Sales specific data in US and no access to customer data
Bob	BI Analyst	Department=`sales` Region=`US` Role=`BIAnalyst`	All data in US
Charlie	Data Scientist	Department=`sales` Region=`ALL` Role=`Scientist`	Sales specific data in All regions and no access to customer data

Ava then defines access control policies in Lake Formation that grant or restrict access to certain resources based on predefined criteria (user attributes defined using IAM tags) being satisfied. This allows for flexible and context-aware security policies where access privileges can be adjusted dynamically by modifying the user attribute assignment without changing the policy rules. The following table summarizes the policies in the Sales department.

Access	User Attributes	Policy
All analysts (including Alice) in US get access to sales specific data in US regions	Department=`sales` Region=`US` Role=`Analyst`	Table: `store_sales` (`store_id`, `transaction_date`, `product_name`, `country`, `sales_price`, `quantity` columns) Row filter: `country='US'`
All BI analysts (including Bob) in US get access to all data in US regions	Department=`sales` Region=`US` Role=`BIAnalyst`	Table: `store_sales` (all columns) Row filter: `country='US'`
All scientists (including Charlie) get access to sales-specific data from all regions	Department=`sales` Region=`ALL` Role=`Scientist`	Table: `store_sales` (all rows) Column filter: `store_id`, `transaction_date`, `product_name`, `country`, `sales_price`,`quantity`

The following diagram illustrates the solution architecture.

Implementing this solution consists of the following high-level steps. For Example Retail, Ava as a data Administrator performs these steps:

Define the user attributes and assign them to the principal.
Grant permission on the resources (database and table) to the principal based on user attributes.
Verify the permissions by querying the data using various analytics services.

Prerequisites

To follow the steps in this post, you must complete the following prerequisites:

AWS account with access to the following AWS services:
- Amazon S3
- AWS Lake Formation and AWS Glue Data Catalog
- Amazon Redshift
- Amazon Athena
- Amazon EMR
- AWS Identity and Access Management (IAM)

Set up an admin user for Ava. For instructions, see Create a user with administrative access.
Setup S3 bucket for uploading script.
Set up a data lake admin. For instructions, see Create a data lake administrator.
Create IAM user named Alice and attach permissions for Athena access. For instructions, refer to Data analyst permissions.
Create IAM user Bob and attach permissions for Redshift access.
Create IAM user Charlie and attach permissions for EMR Serverless access.
Create job runtime role: scientist_role and that will be used by Charlie. For instruction refer to: Job runtime roles for Amazon EMR Serverless
Setup EMR Serverless application with Lake Formation enabled. For instruction refer to: Using EMR Serverless with AWS Lake Formation for fine-grained access control
Have an existing AWS Glue database or table and Amazon Simple Storage Service (Amazon) S3 bucket that holds the table data. For this post, we use salesdb as our database, store_sales as our table, and data is stored in an S3 bucket.
- Register the S3 bucket with Lake Formation. For instructions, refer to Registering an Amazon S3 location.
- Revoke IAMAllowedPrincipals group permission on both the database and table to enforce Lake Formation permission for access. For instructions, refer to Revoking permission using the Lake Formation console.

Define attributes for the IAM principals Alice, Bob, Charlie

Ava completes the following steps to define the attributes for the IAM principal:

Log in as an admin user and navigate to the IAM console.
Choose Users under Access management in the navigation pane and search for the user Alice.
Choose the user and choose the Tags tab.
Choose Add new tag and provide the following key pairs:
- Key: Department and value: sales
- Key: Region and value: US
- Key: Role and value: Analyst
Choose Save changes.
Repeat the process for the user Bob and provide the following key pairs:
- Key: Department and value: sales
- Key: Region and value: US
- Key: Role and value: BIAnalyst
Repeat the process for the user Charlie and IAM role scientist_role and provide the following key pairs:
- Key: Department and value: sales
- Key: Region and value: ALL
- Key: Role and value: Scientist

Grant permissions to Alice, Bob, Charlie using ABAC

Ava now grants database and table permissions to users with ABAC.

Grant database permissions

Complete the following steps:

Ava logs in as data lake admin and navigate to the Lake Formation console.
In the navigation pane, under Permissions, choose Data lake permissions.
Choose Grant.
On the Grant permissions page, choose Principals by attribute.
Specify the following attributes:
- Key: Department and value: sales
- Key: Role and value: Analyst,Scientist
Review the resulting policy expression.
For Permission scope, select This account.
Next, choose the catalog resources to grant access:
- For Catalogs, enter the account ID.
- For Databases, enter salesdb.
For Database permissions, select Describe.
Choose Grant.

Ava now verifies the database permission by navigating to the Databases tab under the Data Catalog and searching for salesdb. Select salesdb and choose View under Actions.

Grant table permissions to Alice

Complete the following steps to create a data filter to view sales specific columns in store_sales records whose country=US:

On the Lake Formation console, choose Data filters under Data Catalog in the navigation pane.
Choose Create new filter.
Provide the data filter name as us_sales_salesonlydata.
For Target catalog, enter the account ID.
For Target database, choose salesdb.
For Target table, choose store_sales.
For column-level access, choose Include columns: store_id, item_code, transaction_date, product_name, country, sales_price, and quantity.
For Row-level access, choose Filter rows and enter the row filter country='US'.
Choose Create data filter.

On the Grant permissions page, choose Principals by attribute.
Specify the attributes:
- Key: Department and value: sales
- Key: Role as value: Analyst
- Key: Region and value: US
Review the resulting policy expression.
For Permission scope, select This account.
Choose the catalog resources to grant access:
- Catalogs: Account ID
- Databases: salesdb
- Table: store_sales
- Data filters: us_sales
For Data filter permissions, select Select.
Choose Grant.

Grant table permissions to Bob

Complete the following steps to create a data filter to view only store_sales records whose country=US:

On the Lake Formation console, choose Data filters under Data Catalog in the navigation pane.
Choose Create new filter.
Provide the data filter name as us_sales.
For Target catalog, enter the account ID.
For Target database, choose salesdb.
For Target table, choose store_sales.
Leave Column-level access as Access to all columns.
For Row-level access, enter the row filter country='US'.
Choose Create data filter.

Complete the following steps to grant table permissions to Bob:

On the Grant permissions page, choose Principals by attribute.
Specify the attributes:
- Key: Department and value: sales
- Key: Role as value: BIAnalyst
- Key: Region and value: US
Review the resulting policy expression.
For Permission scope, select This account.
Choose the catalog resources to grant access:
- Catalogs: Account ID
- Databases: salesdb
- Table: store_sales
For Data filter permissions, select Select.
Choose Grant.

Grant table permissions to Charlie

Complete the following steps to grant table permissions to Charlie:

On the Grant permissions page, choose Principals by attribute.
Specify the attributes:
1. Key: Department and value: sales
2. Key: Role as value: Scientist
3. Key: Region and value: ALL
Review the resulting policy expression.
For Permission scope, select This account
Choose the catalog resources to grant access:
1. Catalogs: Account ID
2. Databases: salesdb
3. Table: store_sales
For Table permissions, select Select.
For Data permissions, specify the following columns: store_id, transaction_date, product_name, country, sales_price, and quantity.
Choose Grant.

Alice now verifies the table permission by navigating to the Tables tab under the Data Catalog and searching for store_sales. Select store_sales and choose View under Actions. The following screenshots show the details for both sets of permissions.

Data Analyst uses Athena for building daily sales reports

Alice, the data analyst logs in to the Athena console and run the following query:

select * from "salesdb"."store_sales" limit 5

Alice has the user attributes as Department=sales, Role=Analyst, Region=US, and this attribute combination allows her access to US sales data to specific sales only column, without access to customer data as shown in the following screenshot.

BI Analyst uses Redshift for building sales dashboards

Bob, the BI Analyst, logs in to the Redshift console and run the following query:

select * from "salesdb"."store_sales" limit 10

Bob has the user attributes Department=sales, Role=BIAnalyst, Region=US, and this attribute combination allows him access to all columns including customer data for US sales data.

Data Scientist uses Amazon EMR to process sales data

Finally, Charlie logs in to the EMR console and submit the EMR job with runtime role as scientist_role. Charlie uses the script sales_analysis.py that is uploaded to s3 bucket created for the script. He chooses the EMR Serverless application created with Lake Formation enabled.

Charlie submits batch job runs by choosing the following values:

Name: sales_analysis_Charlie
Runtime_role: scientist_role
Script location: <s3_script_path>/sales_analysis.py
For spark properties, provide key as spark.emr-serverless.lakeformation.enabled and value as true.
Additional configurations: Under Metastore configuration select Use AWS Glue Data Catalog as metastore. Charlie keeps rest of the configuration as default.

Once the job run is completed, Charlie can view the output by selecting stdout under Driver log files.

Charlie uses scientist_role as job runtime role with the attributes Department=sales, Role=Scientist, Region=ALL, and this attribute combination allows him access to select columns of all sales data.

Clean up

Complete the following steps to delete the resources you created to avoid unexpected costs:

Delete the IAM users created.
Delete the AWS Glue database and table resources created for the post, if any.
Delete the Athena, Redshift and EMR resources created for the post.

Conclusion

In this post, we showcased how you can use SageMaker Lakehouse attribute-based access control, using IAM principals and session tags to simplify data access, grant creation, and maintenance. With attribute-based access control, you can manage permissions using dynamic business attributes associated with user identities and secure your data in the lakehouse by defining fine-grained permissions in the Lake Formation that are enforced across analytics and ML tools and engines.

For more information, refer to documentation. We encourage you to try out the SageMaker Lakehouse with ABAC and share your feedback with us.

About the authors

Sandeep Adwankar is a Senior Product Manager at AWS. Based in the California Bay Area, he works with customers around the globe to translate business and technical requirements into products that enable customers to improve how they manage, secure, and access data.

Srividya Parthasarathy is a Senior Big Data Architect on the AWS Lake Formation team. She enjoys building data mesh solutions and sharing them with the community.

Read and write Apache Iceberg tables using AWS Lake Formation hybrid access mode

2025-04-21 Aarthi Srinivasan

Post Syndicated from Aarthi Srinivasan original https://aws.amazon.com/blogs/big-data/read-and-write-apache-iceberg-tables-using-aws-lake-formation-hybrid-access-mode/

Enterprises are adopting Apache Iceberg table format for its multitude of benefits. The change data capture (CDC), ACID compliance, and schema evolution features cater to representing big datasets that receive new records at a fast pace. In an earlier blog post, we discussed how to implement fine-grained access control in Amazon EMR Serverless using AWS Lake Formation for reads. Lake Formation helps you centrally manage and scale fine-grained data access permissions and share data with confidence within and outside your organization.

In this post, we demonstrate how to use Lake Formation for read access while continuing to use AWS Identity and Access Management (IAM) policy-based permissions for write workloads that update the schema and upsert (insert and update combined) data records into the Iceberg tables. The bimodal permissions are needed to support existing data pipelines that use only IAM and Amazon Simple Storage Service (Amazon) S3 bucket policy-based permissions and to support table operations that are not yet available in the analytics engines. The two-way permission is achieved by registering the Amazon S3 data location of the Iceberg table with Lake Formation in hybrid access mode. Lake Formation hybrid access mode allows you to onboard new users with Lake Formation permissions to access AWS Glue Data Catalog tables with minimal interruptions to existing IAM policy-based users. With this solution, organizations can use the Lake Formation permissions to scale the access of their existing Iceberg tables in Amazon S3 to new readers. You can extend the methodology to other open table formats, such as Linux Foundation Delta Lake tables and Apache Hudi tables.

Key use cases for Lake Formation hybrid access mode

Lake Formation hybrid access mode is useful in the following use cases:

Avoiding data replication – Hybrid access mode helps onboard new users with Lake Formation permissions on existing Data Catalog tables. For example, you can enable a subset of data access (coarse vs. fine-grained access) for various user personas, such as data scientists and data analysts, without making multiple copies of the data. This also helps maintain a single source of truth for production and business insights.
Minimal interruption to existing IAM policy-based user access – With hybrid access mode, you can add new Lake Formation managed users with minimal disruptions to your existing IAM and Data Catalog policy-based user access. Both access methods can coexist for the same catalog table, but each user can have only one mode of permissions.
Transactional table writes – Certain write operations like insert, update, and delete are not supported by Amazon EMR for Lake Formation managed Iceberg tables. Refer to Considerations and limitations for additional details. Although you could use Lake Formation permissions for Iceberg table read operations, you could manage the write operations as the table owners with IAM policy-based access.

Solution overview

An example Enterprise Corp has a large number of Iceberg tables based on Amazon S3. They are currently managing the Iceberg tables manually with IAM policy, Data Catalog resource policy, and S3 bucket policy-based access in their organization. They want to share their transactional data of Iceberg tables across different teams, such as data analysts and data scientists, asking for read access across a few lines of business. While maintaining the ownership of the table’s updates to their single team, they want to provide restricted read access to certain columns of their tables. This is achieved by using the hybrid access mode feature of Lake Formation.

In this post, we illustrate the scenario with a data engineer team and a new data analyst team. The data engineering team owns the extract, transform, and load (ETL) application that will process the raw data to create and maintain the Iceberg tables. The data analyst team will query the tables to gather business insights from those tables. The ETL application will use IAM role-based access to the Iceberg table, and the data analyst gets Lake Formation permissions to query the same tables.

The solution can be visually represented in the following diagram.

Solution Overview

For ease of illustration, we use only one AWS account in this post. Enterprise use cases typically have multiple accounts or cross-account access requirements. The setup of the Iceberg tables, Lake Formation permissions, and IAM based permissions are similar for multiple and cross-account scenarios.

The high-level steps involved in the permissions setup are as follows:

Make sure that IAMAllowedPrincipals has Super access to the database and tables in Lake Formation. IAMAllowedPrincipals is a virtual group that represents any IAM principal permissions. Super access to this virtual group is required to make sure that IAM policy-based permissions to any IAM principal continues to work.
Register the data location with Lake Formation in hybrid access mode.
Grant DATA LOCATION permission to the IAM role that manages the table with IAM policy-based permissions. Without the DATA LOCATION permission, write workloads will fail. Test the access to the table by writing new records to the table as the IAM role.
Add SELECT table permissions to the Data-Analyst role in Lake Formation.
Opt-in the Data-Analyst to the Iceberg table, making the Lake Formation permissions effective for the analyst.
Test access to the table as the Data-Analyst by running SELECT queries in Athena.
Test the table write operations by adding new records to the table as ETL-application-role using EMR Serverless.
Read the latest update, again, as Data-Analyst.

Prerequisites

You should have the following prerequisites:

An AWS account with a Lake Formation administrator configured. Refer to Data lake administrator permissions and Set up AWS Lake Formation. You can also refer to Simplify data access for your enterprise using Amazon SageMaker Lakehouse for the Lake Formation admin setup in your AWS account. For ease of demonstration, we have used an IAM admin role added as a Lake Formation administrator.
An S3 bucket to host the sample Iceberg table data and metadata.
An IAM role to register your Iceberg table Amazon S3 location with Lake Formation. Follow the policy and trust policy details for a user-defined role creation from Requirements for roles used to register locations.

An IAM role named ETL-application-role, which will be the runtime role to execute jobs in EMR Serverless. The minimum policy required is shown in the following code snippet. Replace the Amazon S3 data location of the Iceberg table, database name, and AWS Key Management Service (AWS KMS) key ID with your own. For additional details on the role setup, refer to Job runtime roles for Amazon EMR Serverless. This role can insert, update, and delete data in the table.

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "IcebergDataAccessInS3",
            "Effect": "Allow",
            "Action": [
                "s3:ListBucket",
                "s3:GetBucketLocation",
                "s3:ListAllMyBuckets",
                "s3:Get*",
                "s3:Put*",
                "s3:Delete*"
            ],
            "Resource": [
                "arn:aws:s3:::your-iceberg-data-bucket-name",
                "arn:aws:s3:::your-iceberg-data-bucket-name/*"
            ]
        },
        {
            "Sid": "GlueCatalogApiPermissions",
            "Effect": "Allow",
            "Action": [
                "glue:*"
            ],
            "Resource": [
                "arn:aws:glue:your-Region:account-id:catalog",
                "arn:aws:glue:your-Region:account-id:database/iceberg-database-name",
                "arn:aws:glue:your-Region:account-id:database/default",
                "arn:aws:glue:your-Region:account-id:table/*/*"
            ]
        },
        {
            "Sid": "KmsKeyPermissions",
            "Effect": "Allow",
            "Action": [
                "kms:Encrypt",
                "kms:Decrypt",
                "kms:ReEncrypt*",
                "kms:GenerateDataKey",
                "kms:DescribeKey",
                "kms:ListKeys",
                "kms:ListAliases"
            ],
            "Resource": [
                "arn:aws:kms:your-Region:account-id:key/your-key-id"
            ]
        }
    ]
}

Add the following trust policy to the role:

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Principal": {
                "Service": "emr-serverless.amazonaws.com"
            },
            "Action": "sts:AssumeRole"
        }
    ]
}

An IAM role called Data-Analyst, to represent the data analyst access. Use the following policy to create the role. Also attach the AWS managed policy arn:aws:iam::aws:policy/AmazonAthenaFullAccess to the role, to allow querying the Iceberg table using Amazon Athena. Refer to Data engineer permissions for additional details about this role.

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "LFBasicUser",
            "Effect": "Allow",
            "Action": [
                "glue:GetCatalog",
                "glue:GetCatalogs",
                "glue:GetTable",
                "glue:GetTables",
                "glue:GetTableVersion",
                "glue:GetTableVersions",
                "glue:GetDatabase",
                "glue:GetDatabases",
                "glue:GetPartition",
                "glue:GetPartitions",
                "lakeformation:GetDataAccess"
            ],
            "Resource": "*"
        },
        {
            "Sid": "AthenaResultsBucket",
            "Effect": "Allow",
            "Action": [
                "s3:ListBucket",
                "s3:GetBucketLocation",
                "s3:Put*",
                "s3:Get*",
                "s3:Delete*"
            ],
            "Resource": [
                "arn:aws:s3:::your-bucket-name-prefix",
                "arn:aws:s3:::your-bucket-name-prefix/*"
            ]
        }
    ]
}

Add the following trust policy to the role:

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Principal": {
                "AWS": "arn:aws:iam::<your_account_id>:root"
            },
            "Action": "sts:AssumeRole"
        }
    ]
}

Create the Iceberg table

Complete the following steps to create the Iceberg table:

Sign in to the Lake Formation console as the admin role.
In the navigation pane under Data Catalog, choose Databases.
From the Create dropdown menu, create a database named iceberg_db. You can leave the Amazon S3 location property empty for the database.

On the Athena console, run the following provided queries. The queries perform the following operations:

Create a table called customer_csv, pointing to the customer dataset in the public S3 bucket.
Create an Iceberg table called customer_iceberg, pointing to your S3 bucket location that will host the Iceberg table data and metadata.

Insert data from the CSV table to the Iceberg table.

CREATE EXTERNAL TABLE `iceberg_db`.`customer_csv`(
  `c_customer_sk` int,
  `c_customer_id` string,
  `c_current_cdemo_sk` int,
  `c_current_hdemo_sk` int,
  `c_current_addr_sk` int,
  `c_first_shipto_date_sk` int,
  `c_first_sales_date_sk` int,
  `c_salutation` string,
  `c_first_name` string,
  `c_last_name` string,
  `c_preferred_cust_flag` string,
  `c_birth_day` int,
  `c_birth_month` int,
  `c_birth_year` int,
  `c_birth_country` string,
  `c_login` string,
  `c_email_address` string,
  `c_last_review_date` string)
ROW FORMAT DELIMITED
  FIELDS TERMINATED BY '|'
STORED AS INPUTFORMAT
  'org.apache.hadoop.mapred.TextInputFormat'
OUTPUTFORMAT
  'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION
  ' s3://redshift-downloads/TPC-DS/2.13/10GB/customer/'
TBLPROPERTIES (
  'classification'='csv');   

 SELECT * FROM customer_csv LIMIT 5; //verifies table data  

CREATE TABLE IF NOT EXISTS iceberg_db.customer_iceberg (
        c_customer_sk             int,
        c_customer_id             string,
        c_current_cdemo_sk        int,
        c_current_hdemo_sk        int,
        c_current_addr_sk         int,
        c_first_shipto_date_sk    int,
        c_first_sales_date_sk     int,
        c_salutation              string,
        c_first_name              string,
        c_last_name               string,
        c_preferred_cust_flag     string,
        c_birth_day               int,
        c_birth_month             int,
        c_birth_year              int,
        c_birth_country           string,
        c_login                   string,
        c_email_address           string,
        c_last_review_date        string
    )
LOCATION 's3://your-iceberg-data-bucket-name/path/'
TBLPROPERTIES ( 'table_type' = 'ICEBERG' );

INSERT INTO customer_iceberg
SELECT *
FROM customer_csv;  

SELECT * FROM customer_iceberg LIMIT 5; //verifies table data

Set up the Iceberg table as a hybrid access mode resource

Complete the following steps to set up the Iceberg table’s Amazon S3 data location as hybrid access mode in Lake Formation:

Register your table location with Lake Formation:
1. Sign in to the Lake Formation console as data lake administrator.
2. In the navigation pane, choose Data lake Locations.
3. For Amazon S3 path, provide the S3 prefix of your Iceberg table location that holds both the data and metadata of the table.
4. For IAM role, provide the user-defined role that has permissions to your Iceberg table’s Amazon S3 location and that you created according to the prerequisites. For more details, refer to Registering an Amazon S3 location.
5. For Permission mode, select Hybrid access mode.
6. Choose Register location to register your Iceberg table Amazon S3 location with Lake Formation.

Add data location permission to ETL-application-role:
1. In the navigation pane, choose Data locations.
2. For IAM users and roles, choose ETL-application-role.
3. For Storage location, provide the S3 prefix of your Iceberg table.
4. Choose Grant.

Data location permission is required for write operations to the Iceberg table location only if the Iceberg table’s S3 prefix is a child location of the database’s Amazon S3 location property.

Grant Super access on the Iceberg database and table to IAMAllowedPrincipals:
1. In the navigation pane, choose Data permissions.
2. Choose IAM users and roles and choose IAMAllowedPrincipals.
3. For LF-Tags or catalog resources, choose Named Data Catalog resources.
4. Under Databases, select the name of your Iceberg table’s database.
5. Under Database permissions, select Super.
6. Choose Grant.
7. Repeat the preceding steps and for Tables – optional, choose the Iceberg table.
8. Under Table permissions, select Super.
9. Choose Grant.

Add database and table permissions to the Data-Analyst role:
1. Repeat the steps in Step 3 to grant permissions for the Data-Analyst role, once for database-level permission and once for table-level permission.
2. Select Describe permissions for the Iceberg database.
3. Select Select permissions for the Iceberg table.
4. Under Hybrid access mode, select Make Lake Formation permissions effective immediately.
5. Choose Grant.

The following screenshots show the database permissions for Data-Analyst.

The following screenshots show the table permissions for Data-Analyst.

Verify Lake Formation permissions on the Iceberg table and database to both Data-Analyst and IAMAllowedPrincipals:
1. In the navigation pane, choose Data permissions.
2. Filter by Table= customer_iceberg.
  You should see IAMAllowedPrincipals with All permission and Data-Analyst with Select permission.
3. Similarly, verify permissions for the database by filtering database=iceberg_db.

You should see IAMAllowedPrincipals with All permission and Data-Analyst with Describe permission.

Verify Lake Formation opt-in for Data-Analyst:
1. In the navigation pane, choose Hybrid access mode.

You should see Data-Analyst opted-in for both database and table level permissions.

Query the table as the Data-Analyst role in Athena

While you are logged in to the AWS Management Console as admin, set up the Athena query results bucket:

On the console navigation bar, choose your user name.
Choose Switch role to switch to the Data-Analyst role.
Enter your account ID, IAM role name (Data-Analyst), and choose Switch Role.
Now that you’re logged in as the Data-Analyst role, open the Athena console and set up the Athena query results bucket.
Run the following query to read the Iceberg table. This verifies the Select permission granted to the Data-Analyst role in Lake Formation.

SELECT * FROM "iceberg_db"."customer_iceberg"
WHERE c_customer_sk = 247

Upsert data as ETL-application-role using Amazon EMR

To upsert data to Lake Formation enabled Iceberg tables, we will use Amazon EMR Studio, which is an integrated development environment (IDE) that makes it straightforward for data scientists and data engineers to develop, visualize, and debug data engineering and data science applications written in R, Python, Scala, and PySpark. EMR Studio will be our web-based IDE to run our notebooks, and we will use EMR Serverless as the compute engine. EMR Serverless is a deployment option for Amazon EMR that provides a serverless runtime environment. For the steps to run an interactive notebook, see Submit a job run or interactive workload.

Sign out of the AWS console as Data-Analyst and log back or switch the user to admin.
On the Amazon EMR console, choose EMR Serverless in the navigation pane.
Choose Get started.
For first-time users, Amazon EMR allows creation of an EMR Studio without a virtual private cloud (VPC). Create an EMR Serverless application as follows:
1. Provide a name for the EMR Serverless application, such as DemoHybridAccess.
2. Under Application setup, choose Use default settings for interactive workloads.
3. Choose Create and start application.

The next step is to create an EMR Studio.

On the Amazon EMR console, choose Studio under EMR Studio in the navigation pane.
Choose Create Studio.
Select Interactive workloads.
You should see a default pre-populated section. Keep these default settings and choose Create Studio and launch Workspace.

After the workspace is launched, attach the EMR Serverless application created earlier and select ETL-application-role as the runtime role under Compute.

Download the notebook Iceberg-hybridaccess_final.ipynb and upload it to EMR Studio workspace.

This notebook configures the metastore properties to work with Iceberg tables. (For more details, see Using Apache Iceberg with EMR Serverless.) Then it performs insert, update, and delete operations in the Iceberg table. It also verifies if the operations are successful by reading the newly added data.

Select PySpark as the kernel and execute each cell in the notebook by choosing the run icon.

Refer to Submit a job run or interactive workload for further details about how to run an interactive notebook.

The following screenshot shows that the Iceberg table insert operation completed successfully.

The following screenshot illustrates running the update statement on the Iceberg table in the notebook.

The following screenshot shows that the Iceberg table delete operation completed successfully.

Query the table again as Data-Analyst using Athena

Complete the following steps:

Switch your role to Data-Analyst on the AWS console.
Run the following query on the Iceberg table and read the row that was updated by the EMR cluster:
```
SELECT * FROM "iceberg_db"."customer_iceberg"
WHERE c_customer_sk = 247
```

The following screenshot shows the results. As we can see, ‘c_first_name’ column is updated with new value.

Clean up

To avoid incurring costs, clean up the resources you used for this post:

Revoke the Lake Formation permissions and hybrid access mode opt-in granted to the Data-Analyst role and IAMAllowedPrincipals.
Revoke the registration of the S3 bucket to Lake Formation.
Delete the Athena query results from your S3 bucket.
Delete the EMR Serverless resources.
Delete Data-Analyst role and ETL-application-role from IAM.

Conclusion

In this post, we demonstrated how to scale the adoption and use of Iceberg tables using Lake Formation permissions for read workloads, while maintaining full control over table schema and data updates through IAM policy-based permissions for the table owners. The methodology also applies to other open table formats and standard Data Catalog tables, but the Apache Spark configuration for each open table format will vary.

Hybrid access mode in Lake Formation is an option you could use to adopt Lake Formation permissions gradually and scale those use cases that support Lake Formation permissions while using IAM based permissions for the use cases that don’t. We encourage you to try out this setup in your environment. Please share your feedback and any additional topics you would like to see in the comments section.

About the Authors

Aarthi Srinivasan is a Senior Big Data Architect with AWS Lake Formation. She collaborates with the service team to enhance product features, works with AWS customers and partners to architect lake house solutions, and establishes best practices.

Parul Saxena is a Senior Big Data Specialist Solutions Architect in AWS. She helps customers and partners build highly optimized, scalable, and secure solutions. She specializes in Amazon EMR, Amazon Athena, and AWS Lake Formation, providing architectural guidance for complex big data workloads and assisting organizations in modernizing their architectures and migrating analytics workloads to AWS.

Build a data lakehouse in a hybrid Environment using Amazon EMR Serverless, Apache DolphinScheduler, and TiDB

2025-03-20 Shiyang Wei

Post Syndicated from Shiyang Wei original https://aws.amazon.com/blogs/big-data/build-a-data-lakehouse-in-a-hybrid-environment-using-amazon-emr-serverless-apache-dolphinscheduler-and-tidb/

While helping our customers build systems on AWS, we found out that a large number of enterprise customers who pay great attention to data security and compliance, such as B2C FinTech enterprises, build data-sensitive applications on premises and use other applications on AWS to take advantage AWS managed services. Using AWS managed services can greatly simplify daily operation and maintenance, as well as help you achieve optimized resource utilization and performance.

This post discusses a decoupled approach of building a serverless data lakehouse using AWS Cloud-centered services, including Amazon EMR Serverless, Amazon Athena, Amazon Simple Storage Service (Amazon S3), Apache DolphinScheduler (an open source data job scheduler) as well as PingCAP TiDB, a third-party data warehouse product that can be deployed either on premises or on the cloud or through a software as a service (SaaS).

Solution overview

For our use case, an enterprise data warehouse with business data is hosted on an on-premises TiDB platform, an AWS Global Partner that is also available on AWS through AWS Marketplace.

The data is then processed by an Amazon EMR Serverless Job to achieve data lakehouse tiering logic. Different tiering data are stored in separate S3 buckets or separate S3 prefixes under the same S3 bucket. Typically, there are four layers in terms of data warehouse design.

Operational data store layer (ODS) – This layer stores raw data of the data warehouse
Data warehouse stage layer (DWS) – This layer is a temporary staging area within the data warehousing architecture where data from various sources is loaded, cleaned, transformed, and prepared before being loaded into the data warehouse database layer;
Data warehouse database layer (DWD) – This layer is the central repository in a data warehousing environment where data from various sources is integrated, transformed, and stored in a structured format for analytical purposes;
Analytical data store (ADS) – This layer is a subset of the data warehousing that is specifically designed and optimized for a particular business function, department, or analytical purpose.

For this post, we only use ODS and ADS layers to demonstrate the technical feasibility.

The schema of this data is managed through the AWS Glue Data Catalog, and can be queried using Athena. The EMR Serverless Jobs are orchestrated using Apache DolphinScheduler deployed in cluster mode on Amazon Elasctic Compute Cloud (Amazon EC2) instances, with meta data stored in an Amazon Relational Database Service (Amazon RDS) for MySQL instance.

Using DolphinScheduler as the data lakehouse job orchestrator offers the following advantages:

Its distributed architecture allows for better scalability, and the visual DAG designer makes workflow creation more intuitive for team members with varying technical expertise
It provides more granular task-level controls and supports a wider range of task types out-of-the-box, including Spark, Flink, and machine learning (ML) workflows, without requiring additional plugin installations;
Its multi-tenancy feature enables better resource isolation and access control across different teams within an organization.

However, DolphinScheduler requires more initial setup and maintenance effort, making it more suitable for organizations with strong DevOps capabilities and a desire for complete control over their workflow infrastructure.

The following diagram illustrates the solution architecture.

Prerequisites

You need to create an AWS account and set up an AWS Identity and Access Management (IAM) user as a prerequisite for the following implementation. Complete the following steps:

For AWS account signing up, please follow up the actions guided per page link.

Create an AWS account.
Sign in to the account using the root user for the first time.
One the IAM console, create an IAM user with AdministratorAccess Policy.
Use this IAM user to log in AWS Management Console rather the root user.
On the IAM console, choose Users in the navigation pane.
Navigate to your user, and on the Security credentials tab, create an access key.
Store the access key and secret key in a secure place and use them for further API access of the resources of this AWS account.

Set up DolphinScheduler, IAM configuration, and the TiDB Cloud table

In this section, we walk through the steps to install DolphinScheduler, complete additional IAM configurations to enable the EMR Serverless job, and provision the TiDB Cloud table.

Install DolphinScheduler on an EC2 instance with an RDS for MySQL instance storing DolphinScheduler metadata. The production deployment mode of DolphinScheduler is cluster mode. In this blog, we use pseudo cluster mode which has the same installation steps as cluster mode, and could achieve resource economy. We name the EC2 instance ds-pseudo.

Make sure the inbound rule of the security group attached to the EC2 instance allows port 12345’s TCP traffic. Then complete the following steps:

sudo dnf install java-1.8.0-amazon-corretto
java -version

Switch to dir /usr/local/src:
```
cd /usr/local/src
```

Install Apache Zookeeper:

wget https://archive.apache.org/dist/zookeeper/zookeeper-3.8.0/apache-zookeeper-3.8.0-bin.tar.gz
tar -zxvf apache-zookeeper-3.8.0-bin.tar.gz
cd apache-zookeeper-3.8.0-bin/conf
cp zoo_sample.cfg zoo.cfg
cd ..
nohup bin/zkServer.sh start-foreground &> nohup_zk.out &
bin/zkServer.sh status

Check the Python version:
```
python3 --version
```
The version should be 3.9 or above. It is recommended that you use Amazon Linux 2023 or later as the Amazon EC2 operating system (OS); Python version 3.9 meets the requirement. For detail information, refer to Python in AL2023.

Install Dolphinscheduler

Download the dolphinscheduler package:

cd /usr/local/src
wget https://dlcdn.apache.org/dolphinscheduler/3.1.9/apache-dolphinscheduler-3.1.9-bin.tar.gz
tar -zxvf apache-dolphinscheduler-3.1.9-bin.tar.gz
mv apache-dolphinscheduler-3.1.9-bin apache-dolphinscheduler

Download the mysql connector package:

wget https://downloads.mysql.com/archives/get/p/3/file/mysql-connector-j-8.0.31.tar.gz
tar -zxvf mysql-connector-j-8.0.31.tar.gz

Copy specific mysql connector JAR file to the following destinations:

cp mysql-connector-j-8.0.31/mysql-connector-j-8.0.31.jar ./apache-dolphinscheduler/api-server/libs/
cp mysql-connector-j-8.0.31/mysql-connector-j-8.0.31.jar ./apache-dolphinscheduler/alert-server/libs/
cp mysql-connector-j-8.0.31/mysql-connector-j-8.0.31.jar ./apache-dolphinscheduler/master-server/libs/
cp mysql-connector-j-8.0.31/mysql-connector-j-8.0.31.jar ./apache-dolphinscheduler/worker-server/libs/
cp mysql-connector-j-8.0.31/mysql-connector-j-8.0.31.jar ./apache-dolphinscheduler/tools/libs/

Add the user dolphinscheduler, and make sure the directory apache-dolphinscheduler and the files under it are owned by the user dolphinscheduler:

useradd dolphinscheduler
echo "dolphinscheduler" | passwd --stdin dolphinscheduler
sed -i '$adolphinscheduler ALL=(ALL) NOPASSWD: NOPASSWD: ALL' /etc/sudoers
sed -i 's/Defaults   requirett/#Defaults requirett/g' /etc/sudoers
chown -R dolphinscheduler:dolphinscheduler apache-dolphinscheduler

Install the mysql client:

sudo dnf update -y 
sudo dnf install mariadb105

On the Amazon RDS console, provision an RDS for MySQL instance with the following configurations:
1. For Database Creation Method, select Standard create.
2. For Engine options, choose MySQL.
3. For Edition: choose MySQL 8.0.35.
4. For Templates: select Dev/Test.
5. For Availability and durability, select Single DB instance.
6. For Credentials management, select Self-managed.
7. For Connectivity, select Connect to an EC2 compute resource, and choose the EC2 instance created earlier.
8. For Database Authentication: choose Password Authentication.
Navigate to the ds- mysql database details page, and under Connectivity & security, copy the RDS for MySQL endpoint.

Configure the intance:

mysql -h <RDS for mysql Endpoint> -u admin -p
mysql> CREATE DATABASE dolphinscheduler DEFAULT CHARACTER SET utf8 DEFAULT COLLATE utf8_general_ci;
mysql> exit;

Configure the dolphinscheduler configuration file:
```
cd /usr/local/src/apache-dolphinscheduler/
```

Revise dolphinscheduler_env.sh:

vim bin/env/dolphinscheduler_env.sh
export DATABASE=${DATABASE:-mysql}
export SPRING_PROFILES_ACTIVE=${DATABASE}
export SPRING_DATASOURCE_URL="jdbc:mysql://ds-mysql.cq**********.us-east-1.rds.amazonaws.com/dolphinscheduler?useUnicode=true&amp;characterEncoding=UTF-8&amp;useSSL=false"
export SPRING_DATASOURCE_USERNAME="admin"
export SPRING_DATASOURCE_PASSWORD="<your password>"

On the Amazon EC2 console, navigate to the instance details page and copy the private IP address.

Revise install_env.sh:

vim bin/env/install_env.sh
ips=${ips:-"<private ip address of ds-pseudo EC2 instance>"}
masters=${masters:-"<private ip address of ds-pseudo EC2 instance>"}
workers=${workers:-" private ip address of ds-pseudo EC2 instance:default"}
alertServer=${alertServer:-" private ip address of ds-pseudo EC2 instance "}
apiServers=${apiServers:-" private ip address of ds-pseudo EC2 instance "}
installPath=${installPath:-"~/dolphinscheduler"}
export JAVA_HOME=${JAVA_HOME:-/usr/lib/jvm/jre-1.8.0-openjdk}
export PYTHON_HOME=${PYTHON_HOME:-/bin/python3}

Configure the dolphinscheduler configuration file:

cd /usr/local/src/apache-dolphinscheduler/
bash tools/bin/upgrade-schema.sh

Install DolphinScheduler:

cd /usr/local/src/apache-dolphinscheduler/
su dolphinscheduler
bash ./bin/install.sh

Start DolphinScheduler after installation:

cd /usr/local/src/apache-dolphinscheduler/
su dolphinscheduler
bash ./bin/start-all.sh

Open the DolphinScheduler console:

http://<ec2 ip address>:12345/dolphinscheduler/ui/login

After input the initial username and password, press Login button to enter into the dashboard shown as below.

initial user/password admin/dolphinscheduler123

Configure IAM role to enable the EMR serverless job

The EMR serverless job role needs to have permission to access a specific S3 bucket to read job scripts and potentially write results, and also have permission to access AWS Glue to read the Data Catalog which stores the tables’ meta data. For detailed guidance, please refer to Grant permission to use EMR Serverless or EMR Serverless Samples.

The following screenshot shows the IAM role configured with the trust policy attached.

The IAM role should have the following permissions policies attached, as shown in the following screenshot.

Provision the TiDB Cloud table

To provision the TiDB Cloud table, complete the following steps:
1. Register for TiDB Cloud.
2. Create a serverless cluster, as shown in the following screenshot. For this post, we name the cluster Cluster0.

Choose Cluster0, then choose SQL Editor to create a database named test:

create table testtable (id varchar(255));
insert into testtable values (1);
insert into testtable values (2);
insert into testtable values (3);

Synchronize data between on-premises TiDB and AWS

In this section, we discuss how to synchronize historical data as well as incremental data between TiDB and AWS.

Use TiDB Dumpling to sync historical data from TiDB to Amazon S3

Use the commands in this section to dump data stored in TiDB as CSV files into a S3 bucket. For full details on how to achieve a data sync from on-premises TiDB to Amazon S3, see Export data to Amazon S3 cloud storage. For this post, we use TiDB tool Dumpling. Complete the following steps:

Run the following command to install TiUP:

curl --proto '=https' --tlsv1.2 -sSf https://tiup-mirrors.pingcap.com/install.sh | sh

cd /root
source .bash_profile

tiup --version

Run the following command to install Dumpling:
```
tiup install dumpling
```

Run the following command to achieve target database table dumpling to the specific S3 bucket.

tiup dumpling -u <prefix.root> -P 4000 -h <tidb serverless endpoint/host> -r 200000 -o "s3://<specific s3 bucket>" --sql "select * from <target database>.<target table>" --ca "/etc/pki/tls/certs/ca-bundle.crt" --password <tidb serverless password>

To acquire the TiDB serverless connection information, navigate to the TiDB Cloud console and choose Connect.

You can collect the specific connection information of test database from the following screenshot.

Yan can view the data stored in the S3 bucket on the Amazon S3 console.

You can use Amazon S3 Select to query the data and get results similar to the following screenshot, confirming that the data has been ingested into testtable.

Use TiDB Dumpling with a self-managed checkpoint to sync incremental data from TiDB to Amazon S3

To achieve incremental data synchronization using TiDB Dumpling, it’s essential to self-manage the check point of the target synchronized data. One recommended way is to store the ID of the final ingested record into a certain media (such as Amazon ElastiCache for Redis, Amazon DynamoDB) to achieve a self-managing checkpoint when running the shell/Python job that trigges TiDB Dumpling. The prerequisite for implementing this is that the target table has a monotonically increasing id field as its primary key.

You can use the following TiDB Dumpling command to filter the exported data:

tiup dumpling -u <prefix.root> -P 4000 -h <tidb serverless endpoint/host> -r 200000 -o "s3://<specific s3 bucket>" --sql "select * from <target database>.<target table> where id > 2" --ca "/etc/pki/tls/certs/ca-bundle.crt" --password <tidb serverless password>

Use the TiDB CDC connector to sync incremental data from TiDB to Amazon S3

The advantage of using TiDB CDC connector to achieve incremental data synchronization from TiDB to Amazon S3 is that there is built-in change data capture (CDC) mechanism, and because the backend engine is Flink, the performance is fast. However, there is one trade-off: you need to create several Flink tables to map the ODS tables on AWS.

For instructions to implement the TiDB CDC connector, refer to TiDB CDC.

Use an EMR serverless job to sync historical and incremental data from a Data Catalog table to the TiDB table

Data usually flows from on premises to the AWS Cloud. However, in some cases, the data might flow from the AWS Cloud to your on-premises database.

After landing on AWS, the data will be wrapped up and managed by the Data Catalog by created Athena tables with the specific tables’ schema. The table DDL script is as follows:

CREATE EXTERNAL TABLE IF NOT EXISTS `testtable`(
  `id` string
) 
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
LOCATION 's3://<bucket_name>/<prefix_name>/';

The screenshot below showcases the DDL running result using Athena console.

The data stored in testtable table is queried using select * from testable SQL. The query result is shown as follows:

In this case, an EMR serverless spark job can accomplish the work of synchronizing data from an AWS Glue table to your on premises table.

If the Spark job is written in Scala, the sample code is as below:

package com.example
import org.apache.spark.sql.{DataFrame, SparkSession}

object Main  {

  def main(args: Array[String]): Unit = {

    val spark = SparkSession.builder()
      .appName("<specific app name>")
      .enableHiveSupport()
      .getOrCreate()

    spark.sql("show databases").show()
    spark.sql("use default")
    var df=spark.sql("select * from testtable")

    df.write
      .format("jdbc")
      .option("driver","com.mysql.cj.jdbc.Driver")
      .option("url", "jdbc:mysql://<tidbcloud_endpoint>:4000/namespace")
      .option("dbtable", "<table_name>")
      .option("user", "<user_name>")
      .option("password", "<password_string>")
      .save()

    spark.close()
  }
}

You can acquire the TiDB serverless endpoint connection information on the TiDB console by choosing Connect, as shown earlier in this post.

After you have wrapped the Scala code as JAR file using SBT, you can submit the job to EMR Serverless with the following AWS Command Line Interface (AWS CLI) command:

export applicationId=00fev6mdk***

export job_role_arn=arn:aws:iam::<aws account id>:role/emr-serverless-job-role

aws emr-serverless start-job-run \
    --application-id $applicationId \
    --execution-role-arn $job_role_arn \
    --job-driver '{
        "sparkSubmit": {
            "entryPoint": "<s3 object url for the wrapped jar file>",
            "sparkSubmitParameters": "--conf spark.hadoop.hive.metastore.client.factory.class=com.amazonaws.glue.catalog.metastore.AWSGlueDataCatalogHiveClientFactory --conf spark.driver.cores=1 --conf spark.driver.memory=3g --conf spark.executor.cores=4 --conf spark.executor.memory=3g --jars s3://spark-sql-test-nov23rd/mysql-connector-j-8.2.0.jar"
        }
    }'

If the Spark job is written in PySpark, the sample code is as follows:

import os
import sys
import pyspark.sql.functions as F
from pyspark.sql import SparkSession

if __name__ == "__main__":

    spark = SparkSession\
        .builder\
        .appName("app1")\
        .enableHiveSupport()\
        .getOrCreate()

    df=spark.sql(f"select * from {str(sys.argv[1])}")

    df.write.format("jdbc").options(
        driver="com.mysql.cj.jdbc.Driver",
        url="jdbc:mysql://tidbcloud_endpoint:4000/namespace ",
        dbtable="table_name",
        user="use_name",
        password="password_string").save()

    spark.stop()

You can submit the job to EMR Serverless using the following AWS CLI command:

export applicationId=00fev6mdk***

export job_role_arn=arn:aws:iam::<aws account id>:role/emr-serverless-job-role

aws emr-serverless start-job-run \
    --application-id $applicationId \
    --execution-role-arn $job_role_arn \
    --job-driver '{
        "sparkSubmit": {
            "entryPoint": "<s3 object url for the python script file>",
            "entryPointArguments": ["testspark"],
            "sparkSubmitParameters": "--conf spark.hadoop.hive.metastore.client.factory.class=com.amazonaws.glue.catalog.metastore.AWSGlueDataCatalogHiveClientFactory --conf spark.driver.cores=1 --conf spark.driver.memory=3g --conf spark.executor.cores=4 --conf spark.executor.memory=3g --jars s3://spark-sql-test-nov23rd/mysql-connector-j-8.2.0.jar"
        }
    }'

The preceding PySpark code and AWS CLI command achieves outbound parameter input as well: the table name (specifically testspark) is ingested into the SQL sentence when submitting the job.

EMR Serverless job pperation essentials

An EMR Serverless application is a resource pool concept. An application holds a certain capacity of compute, memory, and storage resources for jobs running on it to use. You can configure the resource capacity using AWS CLI or the console. Because it’s a resource pool, EMR Serverless application creation is usually a one-time action with the initial capacity and maximum capacity being configured.

An EMR Serverless job is a working unit that actually processes the compute task. In order for a job to work, you need to set the EMR Serverless application ID, the execution IAM role (discussed previously), and the specific application configuration (the resources the job is planning to use). Although you can create the EMR Serverless job on the console, it’s recommended to create the EMR Serverless job using the AWS CLI for further integration with the scheduler and scripts.

For more details on EMR Serverless application creation and EMR Serverless job provisioning, refer to EMR Serverless Hive query or EMR Serverless PySpark job

DolphinScheduler integration and job orchestration

DolphinScheduler is a modern data orchestration platform. It’s agile to create high- performance workflows with low code. It also provides a powerful UI, dedicated to solving complex task dependencies in the data pipeline and providing various types of jobs out of the box.

DolphinScheduler is developed and maintained by WhaleOps, and available in AWS Marketplace as WhaleStudio.

DolphinScheduler has been natively integrated with Hadoop: DolphinScheduler cluster mode is by default recommended to be deployed on a Hadoop cluster (usually on HDFS data nodes), and the HQL scripts uploaded to DolphinScheduler Resource Manager are stored by default on HDFS, and can be orchestrated using the following native Hive shell command:

Hive -f example.sql

Moreover, for specific case in which the orchestration DAGs are quite complicated, each DAG consists of several jobs (for example, more than 300), and almost all the jobs are HQL scripts stored in DolphinScheduler Resource Manager.

Complete the steps listed in this section to achieve a seamless integration between DolphinScheduler and EMR Serverless.

Switch the storage layer of DolphinScheduler Resource Center from HDFS to Amazon S3

Edit the common.properties files under directories /usr/local/src/apache-dolphinscheduler/api-server/ and directory /usr/local/src/apache-dolphinscheduler/worker-server/conf. The following code snippet shows the part of the file that needs to be revised:

# resource storage type: HDFS, S3, OSS, NONE
#resource.storage.type=NONE
resource.storage.type=S3
# resource store on HDFS/S3 path, resource file will store to this base path, self configuration, please make sure the directory exists on hdfs and have read write permissions. "/dolphinscheduler" is recommended
resource.storage.upload.base.path=/dolphinscheduler

# The AWS access key. if resource.storage.type=S3 or use EMR-Task, This configuration is required
resource.aws.access.key.id=AKIA************
# The AWS secret access key. if resource.storage.type=S3 or use EMR-Task, This configuration is required
resource.aws.secret.access.key=lAm8R2TQzt*************
# The AWS Region to use. if resource.storage.type=S3 or use EMR-Task, This configuration is required
resource.aws.region=us-east-1
# The name of the bucket. You need to create them by yourself. Otherwise, the system cannot start. All buckets in Amazon S3 share a single namespace; ensure the bucket is given a unique name.
resource.aws.s3.bucket.name=dolphinscheduler-shiyang
# You need to set this parameter when private cloud s3. If S3 uses public cloud, you only need to set resource.aws.region or set to the endpoint of a public cloud such as S3.cn-north-1.amazonaws.com.cn
resource.aws.s3.endpoint=s3.us-east-1.amazonaws.com

After editing and saving the two files, restart the api-server and worker-server by running the following commands, under folder path /usr/local/src/apache-dolphinscheduler/

bash ./bin/stop-all.sh
bash ./bin/start-all.sh
bash ./bin/status-all.sh

You can validate whether switching the storage layer to Amazon S3 was successful by uploading a script using DolphinScheduler Resource Center Console, check if the file appears in relevant S3 bucket folder.

Before verifying that Amazon S3 is now the storage location of DolphinScheduler, you need to create a tenant on the DolphinScheduler console and bundle the admin user with the tenant, as illustrated in the following screenshots:

After that, you can create a folder on the DolphinScheduler console, and check whether the folder is visible on the Amazon S3 console.

Make sure the job scripts uploaded from Amazon S3 are available in the DolphinScheduler Resource Center

After accomplishing the first task, you can upload the scripts from the DolphinScheduler Resource Center console, and confirm that the scripts are stored in Amazon S3. However, in practice, you need to migrate all scripts directly to Amazon S3. You can find and modify the scripts stored in Amazon S3 using DolphinScheduler Resource Center console. To do so, you can revise the metadata table t_ds_resources by inserting all the scripts’ metadata. The table schema of table t_ds_resources is shown in the following screenshot.

The insert command is as follows:

insert into t_ds_resources values(6, 'count.java', ' count.java','',1,1,0,'2024-11-09 04:46:44', '2024-11-09 04:46:44', -1, 'count.java',0);

Now there are two records in the table t_ds_resoruces.

You can access relevant records on the DolphinScheduler console.

The following screenshot shows the files on the Amazon S3 console.

Make the DolphinScheduler DAG orchestrator aware of the jobs’ status so the DAG can move forward or take relevant actions

As mentioned earlier, DolphinScheduler is natively integrated with the Hadoop ecosystem, and the HQL scripts can be orchestrated by the DolphinScheduler DAG orchestrator via Hive -f xxx.sql command. As a result, when the scripts changed to shell scripts or Python scripts (EMR Severless jobs needs to be orchestrated via shell scripts or Python scripts rather than the simple Hive command), the DAG orchestrator can start the job, but can’t get the real time status of the job, and therefore can’t continue the workflow to further steps. Because the DAGs in this case are very complicated, it’s not feasible to amend the DAGs; instead we follow a lift-and-shift strategy.

We use the following scripts to capture jobs’ status and take appropriate actions.

Persist the application ID list with the following code:

var=$(cat applicationlist.txt|grep appid1)
applicationId=${var#* }
echo $applicationId

Enable the DolphinScheduler step status auto-check using a Linux shell:

app_state
{
  response2=$(aws emr-serverless get-application --application-id $applicationId)
  application=$(echo $response1 | jq -r '.application')
  state=$(echo $application | jq -r '.state')
  echo $state
}

job_state
{
  response4=$(aws emr-serverless get-job-run --application-id $applicationId --job-run-id $JOB_RUN_ID)
  jobRun=$(echo $response4 | jq -r '.jobRun')
  JOB_RUN_ID=$(echo $jobRun | jq -r '.jobRunId')
  JOB_STATE=$(echo $jobRun | jq -r '.state')
  echo $JOB_STATE
}

state=$(job_state)

while [ $state != "SUCCESS" ]; do
  case $state in
    RUNNING)
         state=$(job_state)
         ;;
    SCHEDULED)
         state=$(job_state)
         ;;
    PENDING)
         state=$(job_state)
         ;;
    FAILED)
         break
         ;;
   esac
done

if [ $state == "FAILED" ]
then
  false
else
  true
fi

Clean up

To clean up your resources, we recommend using APIs through the following steps:

Delete the EC2 instance:
1. Find the instance using the following command:
```
aws ec2 describe-instances 
```
2. Delete the instance using the following command:
```
aws ec2 terminate-instances –instance-ids <specific instance id>
```
Delete the RDS instance:
1. Find the instance using the following command:
```
aws rds describe-db-instances
```
2. Delete the instance using the following command:
```
aws rds delete-db-instances –db-instance-identifier <speficic rds instance id>
```
Delete the EMR Serverless application
1. Find the EMR Serverless application using the following command:
```
aws emr-serverless list-applications 
```
2. Delete the EMR Serverless application using the following command:
```
 aws emr-serverless delete-application –application-id <specific application id>
```

Conclusion

In this post, we discussed how EMR Serverless, as AWS managed serverless big data compute engine, integrates with popular OSS products like TiDB and DolphinScheduler. We discussed how to achieve data synchronization between TiDB and the AWS Cloud, and how to use DolphineScheduler to orchestrate EMR Serverless jobs.

Try out the solution with your own use case, and share your feedback in the comments.

About the Author

Shiyang Wei is Senior Solutions Architect at Amazon Web Services. He is specializing in cloud system architecture and solution design for the financial industry. Particularly, he focused on big data and machine learning applications in finance, as well as the impact of regulatory compliance on cloud architecture design in the financial sector. He has over 10 years of experience in data domain development and architectural design.

Implement Amazon EMR HBase Graceful Scaling

2025-03-18 Yu-Ting Su

Post Syndicated from Yu-Ting Su original https://aws.amazon.com/blogs/big-data/implement-amazon-emr-hbase-graceful-scaling/

Apache HBase is a massively scalable, distributed big data store in the Apache Hadoop ecosystem. We can use Amazon EMR with HBase on top of Amazon Simple Storage Service (Amazon S3) for random, strictly consistent real-time access for tables with Apache Kylin. It ingests data through spark jobs and queries the HTables through Apache Kylin cubes. The HBase cluster uses HBase write-ahead logs (WAL) instead of Amazon EMR WAL.

A time goes by, companies may want to scale in long-running Amazon EMR HBase clusters because of issues such as Amazon Elastic Compute Cloud (Amazon EC2) scheduling events and budget concerns. Another issue is that companies may use Spot Instances and auto scaling for task nodes for short-term parallel computation power, like MapReduce tasks and spark executors. Amazon EMR also runs HBase region servers on task nodes for Amazon EMR on S3 clusters. Spot interruptions will lead to an unexpected shutdown on HBase region servers. For an Amazon EMR HBase cluster without enabling write-ahead logs (WAL) for Amazon EMR feature, an unexpected shutdown on HBase region servers will cause WAL splits with server recovery process, and it will bring extra load to the cluster and sometimes makes HTables inconsistent.

For these reasons, administrators look for a way to scale-in Amazon EMR HBase cluster gracefully and stop all HBase region servers on the task nodes.

This post demonstrates how to gracefully decommission target region servers programmatically. The scripts do the following tasks. The script also tests successfully in Amazon EMR 7.3.0, Amazon EMR 6.15.0, and 5.36.2.

Automatically move the HRegions through a script
Raise the decommission priority
Decommission HBase region servers gracefully
Prevent Amazon EMR provisioning region servers on task nodes by Amazon EMR software configurations
Prevent Amazon EMR provisioning region servers on task nodes by Amazon EMR steps

Overview of solution

For graceful scaling in, the script uses HBase built-in graceful_stop.sh to move regions to other region servers to avoid WAL splits when decommissioning nodes. The script uses HDFS CLI and web interface to make sure there are no missing and corrupted HDFS block during the scaling events. To prevent Amazon EMR provisions HBase region servers on task nodes, administrators need to specify software configurations per instance groups when launching a cluster. For existing clusters, administrators can either use a step to terminate HBase region servers on task nodes, or reconfigure the task instance group’s HBase storagerootdir.

Solution

For a running Amazon EMR cluster, administrators can use AWS Command Line Interface (AWS CLI) to issue a modify-instance-groups with EC2InstanceIdsToTerminate to terminate specified instances immediately. But terminating an instance in this way can cause a data loss and unpredictable cluster behavior when HDFS blocks have not enough copies or there are ongoing tasks on those decommissioned nodes. To avoid these risks, administrators can send a modify-instance-groups with a new instance request count without a specific instance ID that administrators want to terminate. This command triggers a graceful decommission process on the Amazon EMR side. However, Amazon EMR only supports graceful decommission for YARN and HDFS. Amazon EMR doesn’t support graceful decommission for HBase.

Hence, administrators can try method 1, as described later in this post, to raise the decommission priority of the decommission targets as the first step. In case tweaking the decommissions priority didn’t work, move forward to the second approach, method 2. Method 2 is to stop the resizing request, and move the HRegions manually before terminating the target core nodes. Note that Amazon EMR is a managed service. Amazon EMR service will terminate the EC2 instance after anyone stops it or detach its Amazon Elastic Block Store (Amazon EBS) volumes. Therefore, don’t try to detach EBS volumes on the decommission targets and attach them to new nodes.

Method 1: Decommission HBase region servers through resizing

To decommission Hadoop nodes, administrators can add decommission targets to HDFS’s and YARN’s exclude list, which were dfs.hosts.exclude and yarn.nodes.exclude.xml. However, Amazon EMR disallows manual update to these files. The reason is that the Amazon EMR service daemon, master instance controller, is the only valid process to update these two files on master nodes. Manual updates to these two files will be reset.

Thus, one of the most accessible ways to raise a core node’s decommission priority according to Amazon EMR is having less instance controller heartbeat.

As the first step, pass move_regions to the following script on Amazon S3, blog_HBase_graceful_decommission.sh, as an Amazon EMR step to move HRegions to other region servers and shutdown processes of region server and instance controller. Please also provide targetRS and S3Path to blog_HBase_graceful_decommission.sh. targetRS represents to the private DNS of the decommission target region server. S3Path represents the location of the region migration script.

This step needs to be run in off-peak hours. After all HRegions on the target region server are moved to other nodes, splitting WAL activities after stopping the HBase region server will generate a very low workload to the cluster because it serves 0 regions.

For more information , refer to blog_HBase_graceful_decommission.sh.

Taking a closer look at the move_regions option in blog_HBase_graceful_decommission.sh, this script disables the region balancer and moves the regions to other region servers. The script retrieves Secure Shell (SSH) credentials from AWS Secrets Manager to access worker nodes.

In addition, the script included some AWS CLI operations. Please make sure the instance profile, EMR_EC2_DefaultRole, can operate the following APIs and have SecretsManagaerReadWrite permission.

Amazon EMR APIs:

describe-cluster
list-instances
modify-instance-groups

Amazon S3 APIs:

cp

Secrets Manager APIs:

get-secret-value

In Amazon EMR 5.x, HBase on Amazon S3 will make the master node also work as a region server hosting hbase:meta regions. This script will get stuck when trying to move non-hbase:meta HRegions to the master. To automate the script, the parameter, maxthreads, is increased to move regions through multiple threads. By moving regions in a while loop, one of the threads got a runtime error because it tries to move non-hbase:meta HRegions to the master node. Other threads can keep on moving HRegions to other region servers. After the only stuck thread timed out after 300 seconds, it moves forward to the next run. After six retries, manual actions will be required, such as using a move action through the HBase shell for the remaining regions’ movement or resubmitting the step.

The following is the syntax to use the script to invoke the move_regions function through blog_HBase_graceful_decommission.sh as an Amazon EMR step:

Step type: Custom JAR
Name: Move HRegions
JAR location :s3://<region>.elasticmapreduce/libs/script-runner/script-runner.jar
Main class :None
Arguments :s3://yourbucket/your/step/location/blog_HBase_graceful_decommission.sh move_regions <your-secret-id> <targetRS: target_region_server_private_DNS> <S3Path: S3 location>
Action on failure:Continue

Here’s an Amazon EMR step example to move regions:

Step type: Custom JAR
Name: Move HRegions
JAR location :s3://us-west-2.elasticmapreduce/libs/script-runner/script-runner.jar
Main class :None
Arguments :s3://yourbucket/your/step/location/blog_HBase_graceful_decommission.sh move_regions your-secret-id ip-172-0-0-1.us-west-2.compute.internal s3://yourbucket/yourpath/
Action on failure:Continue

In the HBase web UI, the target region server will serve 0 regions after the evacuation, as shown in the following screenshot.

After that, the stop_RS_IC function in the script stopped the HBase region server and instance controller process on the decommission target after making sure that there is no running YARN container on that node.

Note that the script is for Amazon EMR 5.30.0 and later release versions. For Amazon EMR 4.x-5.29.0 release versions, stop_RS_IC in the script needs to be updated by referring to How do I restart a service in Amazon EMR? In the AWS Knowledge Center. Also, in Amazon EMR versions earlier than 5.30.0, Amazon EMR uses a service nanny to watch the status of other processes. If a service nanny automatically restarts the instance controller, please stop the service nanny using the stop_RS_IC function before stopping the instance controller on that node. Here’s an example:

if [ "\$runningContainers" -eq 0 ]; then
        echo "0 container is running on \${targetRS}" | tee -a /tmp/graceful_stop.log;
        echo "Shutdown IC" | tee -a /tmp/graceful_stop.log;
        sudo /etc/init.d/service-nanny stop | tee -a /tmp/graceful_stop.log;
        sudo /etc/init.d/instance-controller stop | tee -a /tmp/graceful_stop.log;
        sudo /etc/init.d/instance-controller status | tee -a /tmp/graceful_stop.log;
else
        echo "Still have \${runningContainers} containers running on \${targetRS}" | tee -a /tmp/graceful_stop.log;
     	echo "Not to shutdown IC" | tee -a /tmp/graceful_stop.log;
fi

After the step is successfully completed, scale in and define (current core node amount is −1) as the desired target node amount using the Amazon EMR console. Amazon EMR might pick up the target core node to decommission it because the instance controller isn’t running on that node. There can be a few minutes of delay for Amazon EMR to detect the heartbeat loss of that target node through polling the instance controller. Thus, make sure the workload is very low and there will be no container to the target node for a while.

Stopping the instance controller merely increases the decommissioning priority. But method 1 doesn’t guarantee that the target core node will be picked up as the decommissioning target by Amazon EMR. If Amazon EMR doesn’t pick up the decommission target as the decommissioning victim after using method 1, administrators can stop the resize activity using the AWS Management Console. Then, proceed to method 2.

Method 2: Manually decommission the target core nodes

Administrators can terminate the node using the EC2InstanceIdsToTerminate option in the modify-instance-groups API. But this action will directly terminate the EC2 instance and will risk losing HDFS blocks. To mitigate the risk of having a data loss, administrators can use the following steps in off-peak hours with zero or very few running jobs.

First, run the move_hregions function through blog_HBase_graceful_decommission.sh as an Amazon EMR step in method 1. The function moves HRegions to other region servers and stopped the HBase region server as well as the instance controller process.

Then, run the terminate_ec2 function in blog_HBase_graceful_decommission.sh as an Amazon EMR step. To run this function successfully, please provide the target instance group ID and target instance ID to the script. This function merely terminates one node at a time by specifying the EC2InstanceIdsToTerminate option in the modify-instance-groups API. This makes sure that the core nodes are not terminated back-to-back and lowered the risks of missing HDFS blocks. It inspects HDFS and makes sure all HDFS blocks had at least two copies. If an HDFS block have only one copy, the script will exit with an error message similar to, “Some HDFS blocks have only 1 copy. Please increase HDFS replication factor through the following command for existing HDFS blocks.”

$ hdfs dfs -setrep -R -w 2 <the-file-or-directory-you-want-to-modify>

To make sure all upcoming HDFS blocks have at least two copies, reconfigure the core instance group with the following software configuration:

[{
    "classification": "hdfs-site",
    "properties": {
        "dfs.replication": "2"
    },
    "configurations": []
}]

In addition, the terminateEC2 function compares the metadata of the replicating blocks before and after terminating the core node using hdfs dfsadmin -report. This makes sure no under-replicating, corrupted, or missing HDFS block increased.

The terminateEC2 function tracked decommission status. The script will complete after the decommission completes. It can take some time to recover HDFS blocks. The elapsed time depends on several factors such as the total number of blocks, I/O, bandwidth, HDFS handler amount, and name node resources. If there are many HDFS blocks to be recovered, it may take a few hours to complete. Before running the script, please make sure that the instance profile, EMR_EC2_DefaultRole, have permission of elasticmapreduce:ModifyInstanceGroups.

The following is the syntax to use the script to invoke the terminate_ec2 function through blog_HBase_graceful_decommission.sh as an Amazon EMR step:

Step type: Custom JAR
Name: Terminate EC2
JAR location :s3://<region>.elasticmapreduce/libs/script-runner/script-runner.jar
Main class :None
Arguments :s3://yourbucket/your/step/location/blog_HBase_graceful_decommission.sh terminate_ec2 <your-secret-id> <instance_groupID> <target_EC2_Instance_ID>
Action on failure:Continue

Here’s an Amazon EMR step example to move regions:

Step type: Custom JAR
Name: Terminate EC2
JAR location :s3://us-west-2.elasticmapreduce/libs/script-runner/script-runner.jar
Main class :None
Arguments :s3://yourbucket/your/step/location/blog_HBase_graceful_decommission.sh terminate_ec2 your-secret-id ig-ABCDEFGH12345 i-1234567890abcdef
Action on failure:Continue

While invoking terminate_ec2, the script checks HDFS Name Node Web UI for the decommission target to understand how many blocks need to be recovered on other nodes after submitting the decommission request. Here are the steps:

On the Amazon EMR console, version 6.x, find HDFS NameNode web UI. For example, enter http://<master-node-public-DNS>:9870
On the top menu bar, choose Datanodes
In the In operation section, check the on-service data nodes and the total number of data blocks on the nodes, as shown in the following screenshot.
To view the HDFS decommissioning progress, go to Overview, as shown in the following screenshot.

On the Datanodes page, the decommission target node will not have a green checkmark, and the node will be in the Decommissioning section, as shown in the following screenshot.

The step’s STDOUT also reveals the decommission status:

Hostname: ip-172-31-4-197.us-west-2.compute.internal
Decommission Status : Decommission in progress

The decommission target will transit from Decommissioning to Decommissioned in the HDFS NameNode web UI, as shown in the following screenshot.

The decommissioned target will appear in the Dead datanodes section in the step’s STDOUT after the process is completed:

Dead datanodes (1):
Name: 172.31.4.197:50010 (ip-172-31-4-197.us-west-2.compute.internal)
Hostname: ip-172-31-4-197.us-west-2.compute.internal
Decommission Status : Decommissioned
Configured Capacity: 62245027840 (57.97 GB)
DFS Used: 394412032 (376.14 MB)
Non DFS Used: 0 (0 B)
DFS Remaining: 61179640063 (56.98 GB)
DFS Used%: 0.63%
DFS Remaining%: 98.29%
Configured Cache Capacity: 0 (0 B)
Cache Used: 0 (0 B)
Cache Remaining: 0 (0 B)
Cache Used%: 100.00%
Cache Remaining%: 0.00%
Xceivers: 0
Last contact: Tue Jan 14 06:09:17 UTC 2025

After the target node is decommissioned, the hdfs dfsadmin report will be displayed in the last section in the step’s STDOUT . There should be no difference between rep_blocks_${beforeDate} and rep_blocks_${afterDate} as described in the script. It means no additional amount of under-replicated, missing, or corrupt blocks after the decommission. In HBase web UI, the decommissioned region server will be moved to dead region servers. The dead region server records will be reset after restarting HMaster during routine maintenance.

After the Amazon EMR step is completed without errors, please repeat the preceding steps to decommission the next target core node because administrators may have more than one core nodes to decommission.

After administrators complete all decommission tasks, administrators can manually enable the HBase balancer through the HBase shell again:

$ echo "balance_switch true" | sudo -u hbase hbase shell
## To make sure balance_switch is enabled, submit the same command again. The output should say it’s already in “true” status.
$ echo "balance_switch true" | sudo -u hbase hbase shell

Prevent Amazon EMR from provisioning HBase region servers on task nodes

For new clusters, configure HBase settings for master and core groups only and keep the HBase settings empty when launching an Amazon EMR HBase on an S3 cluster. This prevents provisioning HBase region servers on task nodes.

For example, define configurations for applications other than HBase settings in the software configuration textbox in the Software settings section on the Amazon EMR console, as shown in the following screenshot.

Then, configure HBase settings in Node configuration – optional for each instance group in the Cluster configuration – required section, as shown in the following screenshot.

For master and core instance groups, HBase configurations will be like the following screenshot.

Here’s a json formatted example:

[
    {
        "Classification": "hbase",
        "Properties": {
            "hbase.emr.storageMode": "s3"
         }
    },
    {
        "Classification": "hbase-site",
        "Properties": {
            "hbase.rootdir": "s3://my/HBase/on/S3/RootDir/"
        }
    }
]

For task instance groups, there will be no HBase configuration, as shown in the following screenshot.

Here’s a json formatted example:

[]

Here’s an example in AWS CLI:

$ aws emr create-cluster \
--applications Name=Hadoop Name=HBase Name=ZooKeeper \
... (skip) \
--instance-groups '[ {"InstanceCount":1,"EbsConfiguration":{"EbsBlockDeviceConfigs":[{"VolumeSpecification":{"SizeInGB":32,"VolumeType":"gp2"},"VolumesPerInstance":2}]},"InstanceGroupType":"MASTER","InstanceType":"m5.xlarge","Configurations":[{"Classification":"hbase","Properties":{"hbase.emr.storageMode":"s3"}},{"Classification":"hbase-site","Properties":{"hbase.rootdir":"s3://my/HBase/on/S3/RootDir/"}}],"Name":"Master - 1"},\
{"InstanceCount":1,"EbsConfiguration":{"EbsBlockDeviceConfigs":[{"VolumeSpecification":{"SizeInGB":32,"VolumeType":"gp2"},"VolumesPerInstance":2}]},"InstanceGroupType":"TASK","InstanceType":"m5.xlarge","Name":"Task - 3"},\
{"InstanceCount":2,"EbsConfiguration":{"EbsBlockDeviceConfigs":[{"VolumeSpecification":{"SizeInGB":64,"VolumeType":"gp2"},"VolumesPerInstance":1}],"EbsOptimized":true},"InstanceGroupType":"CORE","InstanceType":"m5.2xlarge","Configurations":[{"Classification":"hbase","Properties":{"hbase.emr.storageMode":"s3"}},{"Classification":"hbase-site","Properties":{"hbase.rootdir":"s3://my/HBase/on/S3/RootDir/"}}],"Name":"Core - 2"}]' --configurations '[{"Classification":"hdfs-site","Properties":{"dfs.replication":"2"}}]' \
--auto-scaling-role Amazon EMR_AutoScaling_DefaultRole \
... (skip) \
--scale-down-behavior TERMINATE_AT_TASK_COMPLETION --region us-west-2

Stop decommission the HBase region servers on task nodes

For an existing Amazon EMR HBase on an S3 cluster, pass stop_and_check_task_rs to blog_HBase_graceful_decommission.sh as an Amazon EMR step to stop HBase region servers on nodes in a task instance group. The script requirs a task instance group ID and an S3 location to place sharing scripts for task nodes.

The following is the syntax to pass stop_and_check_task_rs to blog_HBase_graceful_decommission.sh as an Amazon EMR step:

Step type: Custom JAR
Name: Stop Hbase Region servers on Task Nodes
JAR location: s3://<region>.elasticmapreduce/libs/script-runner/script-runner.jar
Arguments: s3://yourbucket/your/step/location/blog_HBase_graceful_decommission.sh stop_and_check_task_rs <your-secret-id> <instance_groupID> <S3Path: S3 location>
Action on failure:Continue

Here’s an Amazon EMR step example to stop HBase regions on nodes in a task group:

Step type: Custom JAR
Name: Stop Hbase Region servers on Task Nodes
JAR location :s3://us-west-2.elasticmapreduce/libs/script-runner/script-runner.jar
Main class :None
Arguments :s3://yourbucket/your/step/location/ blog_HBase_graceful_decommission.sh your-secret-id stop_and_check_task_rs ig-ABCDEFGH12345 s3://yourbucket/yourpath/
Action on failure:Continue

This step above not only stops HBase region servers on existing task nodes. To avoid provisioning HBase region servers on new task nodes, the script also reconfigures and scales in the task group. Here are the steps:

Using the move_regions function, in blog_HBase_graceful_decommission.sh, move HRegions on the task group to other nodes and stop region servers on those task nodes.

After making sure that the HBase region servers are stopped at these task nodes, the script reconfigures the task instance group. The reconfiguration details are to let HBase rootdir point to a non-existing location. These settings only apply to the task group. Here’s an example:

[
    {
        "Classification": "hbase-site",
        "Properties": {
            "hbase.rootdir": "hdfs://non/existing/location"
        }
    },
    {
        "Classification": "hbase",
        "Properties": {
            "hbase.emr.storageMode": "hdfs"
        }
    }
]

When the task group’s state returns to RUNNING, the script scales in these task nodes to 0. New task nodes in the upcoming scaling out events will not run HBase region servers.

Conclusion

These scaling steps demonstrate how to handle Amazon EMR HBase scaling gracefully. The functions in the script can help administrators to resolve problems when companies want to gracefully scale the Amazon EMR HBase on S3 clusters without Amazon EMR WAL.

If you have a similar request to scale in an Amazon EMR HBase on an S3 cluster gracefully because the cluster doesn’t enable Amazon EMR WAL, you can refer to this post. Please test the steps in the testing environment for verifications first. After you confirm the steps can meet your production requirements, you can proceed and apply the steps to production environment.

About the Authors

Yu-Ting Su is a Sr. Hadoop Systems Engineer at Amazon Web Services (AWS). Her expertise is in Amazon EMR and Amazon OpenSearch Service. She’s passionate about distributing computation and helping people to bring their ideas to life.

Hsing-Han Wang is a Cloud Support Engineer at Amazon Web Services (AWS). He focuses on Amazon EMR and AWS Lambda. Outside of work, he enjoys hiking and jogging, and he is also an Eorzean.

Cheng Wang is a Technical Account Manager at AWS who has over 10 years of industry experience, focusing on enterprise service support, data analysis, and business intelligence solutions.

Chris Li is an Enterprise Support manager at AWS. He leads a team of Technical Account Managers to solve complex customer problems and implement well-structured solutions.

Architect fault-tolerant applications with instance fleets on Amazon EMR on EC2

2025-03-14 Deepmala Agarwal

Post Syndicated from Deepmala Agarwal original https://aws.amazon.com/blogs/big-data/architect-fault-tolerant-applications-with-instance-fleets-on-amazon-emr-on-ec2/

Organizations rely on Amazon EMR on EC2 clusters to process large-scale data workloads using frameworks like Apache Spark, Apache Hive, and Trino. Events such as TV advertisements or unplanned promotions might lead to an increase in demand of compute capacity, making effective capacity planning necessary to make sure your workloads don’t hit capacity limits or job failures.

A common scenario is to run daily Spark jobs on Amazon EMR using consistent Amazon Elastic Compute Cloud (Amazon EC2) instance types (for example, a single instance size and family for the cluster). Although this might work well to sustain the baseline, spikes can trigger auto scaling, which narrows the chances of capacity availability when trying to stop and relaunch a larger EMR cluster, because the specific on-demand instance pool might lack capacity to meet the demand.

In this post, we show how to optimize capacity by analyzing EMR workloads and implementing strategies tailored to your workload patterns. We walk through assessing the historical compute usage of a workload and use a combination of strategies to reduce the likelihood of InsufficientCapacityExceptions (ICE) when Amazon EMR launches specific EC2 instance types. We implement flexible instance fleet strategies to reduce dependency on specific instance types and use Amazon EC2 On-Demand Capacity Reservation (ODCRs) for predictable, steady-state workloads. Following this approach can help prevent job failures due to capacity limits while optimizing your cluster for cost and performance.

Solution overview

Instance fleets in Amazon EMR offer a flexible and robust way to manage EC2 instances within your cluster. This feature allows you to specify target capacities for On-Demand and Spot Instances, select up to five EC2 instance types per fleet (or 30 when using the AWS Command Line Interface [AWS CLI] and API with an allocation strategy), and use multiple subnets across different Availability Zones. Importantly, instance fleets support the use of ODCRs, enabling you to align your EMR clusters with pre-purchased EC2 capacity. You can configure your instance fleet to prefer or require capacity reservations, making sure that your EMR clusters use your reserved capacity efficiently.

EMR workload patterns typically fall into two categories: stable and variable (spiky). In the following sections, we explore how to optimize for each pattern using various options available with instance fleets, starting with stable workloads and then addressing variable workloads.

Stable workloads are workloads with a predictable pattern of resource utilization over time; for example, a pharmaceutical provider needs to process 21 TB of research data, patient records, and other information daily. The workload is consistent and needs to run reliably every day on long-running persistent clusters. For critical business operations requiring high reliability and guaranteed capacity, we recommend reserving the baseline capacity as part of your capacity planning. We demonstrate the following steps:

Use AWS Cost and Usage Reports (AWS CUR) to estimate the baseline of existing workloads.
Reserve the baseline capacity using ODCR.
Configure Amazon EMR to use the targeted ODCR.

Spiky workloads are defined by unpredictable and often significant fluctuations in processing demands. These surges can be triggered by various factors (such as batch processing, real-time data streaming, or seasonal business fluctuations) that trigger Amazon EMR to request more capacity to match the demand. We address the resource allocation by using instance and Availability Zone flexibility, with the following steps:

Introduce EC2 instance flexibility with EMR instance fleets.
Achieve resiliency through intelligent subnet selection with EMR instance fleets.
Use managed scaling to automatically manage scaling in and out.

Stable workloads

In this section, we demonstrate how to define your baseline, configure AWS Identity and Access Management (IAM) permissions, create an ODCR, and associate your reservations to a capacity group and configure Amazon EMR to use targeted ODCRs. You can opt for a mixed ODCR strategy—for example, one ODCR with a short period of duration that supports the launch of your EMR cluster, and another ODCR with a longer period of duration that supports your task nodes based on the baseline capacity reservation.

Estimate the baseline

Make sure to activate the AWS generated cost allocation tag aws:elasticmapreduce:job-flow-id. This enables the field resource_tags_aws_elasticmapreduce_job_flow_id in the AWS CUR to be populated with the EMR cluster ID and is used by the SQL queries in the solution. To activate the cost allocation tag from the AWS Billing Console, complete the following steps:

On the AWS Billing and Cost Management console, choose Cost allocation tags in the navigation pane.
Under AWS generated cost allocation tags, choose the aws:elasticmapreduce:job-flow-id tag.
Choose Activate.

It can take up to 24 hours for tags to activate. For more information, see here.

After the tags are activated, you can use AWS CUR and perform the following query on Amazon Athena to find the compute resources used by the EMR cluster ID vs. the timeline of usage. For more details, see Querying Cost and Usage Reports using Amazon Athena. Update the following query with your CUR table name, EMR cluster ID, desired timestamps, and AWS account ID, and run the query on Athena:

SELECT bill_payer_account_id as Payer,
    product_product_family as PFamily,
    product_product_name as PName,
    resource_tags_aws_elasticmapreduce_job_flow_id,
    line_item_usage_account_id as LinkedAccount,
    line_item_usage_start_date as UsageDate,
    bill_billing_period_start_date as BillingDate,
    SPLIT_PART(line_item_usage_type, ':', 2) AS InstanceType,
    line_item_availability_zone AS AvailabilityZone,
    COUNT(line_item_resource_id) as ResourceIDCount
FROM <YOUR_CUR_TABLE_NAME>
WHERE (
        line_item_usage_start_date BETWEEN TIMESTAMP 'YYYY-MM-DD 00:00:00'
        AND TIMESTAMP 'YYYY-MM-DD 23:59:59' 
    )
    AND line_item_operation LIKE '%%RunInstance%%'
    AND line_item_line_item_type LIKE '%%Usage%%'
    AND product_product_family NOT IN ('Data Transfer')
    AND resource_tags_aws_elasticmapreduce_job_flow_id LIKE '%%<emr-cluster-id>%%'
    AND line_item_usage_account_id IN (
        '<aws_account_id>'
)
GROUP BY 1,2,3,4,5,6,7,8,9

As an example, the preceding query filters instances usage per hour for a given account and EMR cluster for the period of 6 months, to generate the following figure. You can export the results in CSV format and analyze the data. Now that you have a visual representation of your workloads’ baseline and bursts, you can define the strategy and configuration of your EMR cluster.

Create an ODCR to reserve the baseline capacity

ODCRs can be either open or targeted:

With an open ODCR, new instances and existing instances that have matching attributes (such as operating system or instance type) will run using the capacity reservation attributes first.
With a targeted ODCR, instances must match the attributes of the ODCR specification and the ODCR is specifically targeted at launch. This approach is recommended if you have multiple concurrent EMR clusters consuming capacity from the shared On-Demand pool of EC2 instances. EMR clusters larger than the targeted ODCR quantity will fall back to On-Demand Instances that are in the same Availability Zone.

In this example, we use a targeted ODCR with an EMR instance fleet in the us-east-1a Availability Zone. The following diagram illustrates the workflow.

Complete the following steps:

Use the create-capacity-reservation AWS CLI command to create the ODCR and make a note of the CapacityReservationArn value in the output:

aws ec2 create-capacity-reservation \
     --availability-zone <Input Your Availability Zone> \
     --instance-type r8g.2xlarge \
     --instance-match-criteria targeted \
     --instance-platform Linux/UNIX \
     --instance-count <enter the number of instances out of your baseline estimation>

We get the following output:

{
     "CapacityReservation": {
         "CapacityReservationId": "cr-0123456f9907xxxxx",
         "OwnerId": "XXXX",
         "CapacityReservationArn": "arn:aws:ec2:us-east-1:XXXX:capacity-reservation/cr-0123456f9907xxxxx",
         "InstanceType": "r8g.2xlarge",
         "InstancePlatform": "Linux/UNIX",
         "AvailabilityZone": "us-east-1a"

 ....
     }
 }

You can use Amazon CloudWatch to monitor ODCR usage and trigger an alert for unused capacity. For more details, see Monitor Capacity Reservations usage with CloudWatch metrics.

Create a resource group named EMRSparkSteadyStateGroup and make a note of GroupArn values in the output:

aws resource-groups create-group --name EMRSparkSteadyStateGroup \
--configuration '{"Type":"AWS::EC2::CapacityReservationPool"}' '{"Type":"AWS::ResourceGroups::Generic", "Parameters":[{"Name":"allowed-resource-types","Values":["AWS::EC2::CapacityReservation"]}]}'

We get the following output:

"Group": {
         "GroupArn": "arn:aws:resource-groups:us-east-1:XXXX:group/EMRSparkSteadyStateGroup",
         "Name": "EMRSparkSteadyStateGroup"
     }, ...

Use the following code to associate the capacity reservation to the resource group. You can have multiple capacity reservations associated to a resource group.

aws resource-groups group-resources --group EMRSparkSteadyStateGroup \
 --resource-arns arn:aws:ec2:us-east-1:XXXX:capacity-reservation/cr-0123456f9907xxxxx

As a best practice for effective management and cleanup, Create a tag Purpose=EMR-Spark-Steady-State for the newly created ODCR and the resource group.

# Tag your Capacity Reservation
 aws ec2 create-tags \
 --resources cr-0123456f9907xxxxx \
 --tags Key=Purpose,Value=EMR-Spark-Steady-State
# Tag your Resource Group
 aws resource-groups tag \
 --arn "arn:aws:resource-groups:us-east-1:XXXX:group/EMRSparkSteadyStateGroup" --tags Purpose=EMR-Spark-Steady-State

Implement Amazon EMR with ODCR

Complete the following steps to create an EMR cluster tagged with the specific targeted ODCR:

Add required permissions to the EMR service role before using capacity reservations. With these permissions, you can lock down the resource with the specific Amazon Resource Name (ARN) of the group name to be created with the following code:

{
     "Version": "2012-10-17",
     "Statement": [
         {
             "Effect": "Allow",
             "Resource": "*",
             "Action": [
                 "ec2:CreateFleet",
                 "ec2:RunInstances",
                 "ec2:CreateLaunchTemplate",
                 "ec2:CreateLaunchTemplateVersion",
                 "ec2:DeleteLaunchTemplateVersions",
                 "ec2:DescribeCapacityReservations",
                 "ec2:DescribeLaunchTemplateVersions",
                 "resource-groups:ListGroupResources"
             ]
         }
     ]
 }

Configure the EMR cluster to use ODCR with instance fleets. We use the CapacityReservationOptions parameter to configure the EMR cluster, as shown in the following example:

  {
 ...
     "LaunchSpecifications": {
       "OnDemandSpecification": {
         "AllocationStrategy": "LOWEST_PRICE",
         "CapacityReservationOptions": {
           "UsageStrategy": "USE_CAPACITY_RESERVATIONS_FIRST",
           "CapacityReservationResourceGroupArn": "arn:aws:resource-groups:us-east-1:xxxxxx:group/EMRSparkSteadyStateGroup"
         }
       }
     }
   }

The following step-by-step breakdown illustrates the Amazon EMR decision-making process when prioritizing targeted capacity reservations, from core node provisioning through task node allocation:

Cluster provisioning initiation:
- The user chooses to override the lowest-price allocation strategy.
- The user specifies targeted capacity reservations in the launch request.
Core node provisioning:
- Amazon EMR evaluates all EC2 instance capacity pools with targeted capacity reservations, and selects the pool with the lowest price that has sufficient capacity for all requested core nodes.
- If no pool with targeted reservations has sufficient capacity, Amazon EMR reevaluates all specified EC2 instance capacity pools and selects the lowest-priced pool with sufficient capacity for core nodes. Available open capacity reservations are applied automatically.
Availability Zone selection:
- After the core capacity is acquired, Amazon EMR locks in the Availability Zone for your cluster.
Primary and task node provisioning:
- Amazon EMR evaluates EC2 instance capacity pools within that Availability Zone for primary and task fleets. First, Amazon EMR evaluates all the pools with targeted ODCRs specified in the request, ordered by lowest price by default.
- From the ordered list, Amazon EMR launches as much capacity as possible from the unused targeted ODCRs of each instance pool until the request is fulfilled.
- If the unused targeted ODCRs don’t fulfill the request yet, Amazon EMR continues to launch the remaining capacity into On-Demand pools, in the lowest-price order by default.

For more details about the allocation strategy, refer to Allocation strategy for instance fleets or Amazon EMR Support for Targeted ODCR.

Spiky workloads

Spiky workloads are defined by unpredictable and often significant fluctuations in processing demands, triggered by factors such as infrequent but resource-intensive periodic batch processing jobs. For example, a geographic information system processes location data from millions of users in real time to provide up-to-date traffic information, calculate routes, and suggest points of interest. User location data is constantly being generated, but the volume can spike dramatically during rush hour or special events, as illustrated in the following figure. This graph shows the number of used resources (Amazon EC2) by hour; it varies from 1 when the cluster scales in, waiting for jobs, to spikes of 1,000 nodes.

If you’re running spiky workloads with limited flexibility in instance type, family, and Availability Zone, you might face ICE errors when the available capacity can’t meet the cluster’s scaling requirements. To address this, we explore a set of best practices for EMR cluster creation to maximize availability and balance price-performance. Although spiky workloads present a unique challenge in resource management, configuring EMR instance fleets offers a powerful solution. By using diverse instance types, prioritized allocation strategies, Availability Zone flexibility, and managed scaling, organizations can create a robust, cost-effective infrastructure capable of handling unpredictable workload patterns. This configuration offers the following benefits:

Improved availability – By diversifying instance types and using multiple Availability Zones, the cluster mitigates insufficient capacity issues
Cost savings – Allocation strategies reduce costs while minimizing interruptions
Resilience for spiky workloads – Prioritizing instance generations provides seamless scaling under varying demands
Optimized performance – Managed scaling dynamically adjusts resources to meet workload demands efficiently

Introduce EC2 instance flexibility and instance fleets with a prioritized allocation strategy

Amazon EMR supports instance flexibility with instance fleet deployment. Instance fleets give you a wider variety of options and intelligence around instance provisioning. You can now provide a list of up to 30 instance types with corresponding weighted capacities and spot bid prices (including spot blocks) using the AWS CLI or AWS CloudFormation. Amazon EMR will automatically provision On-Demand and Spot capacity across these instance types when creating your cluster. This can make it more straightforward and more cost-effective to quickly obtain and maintain your desired capacity for your clusters. In August 2024, Amazon EMR introduced the prioritized allocation strategy to enhance instance flexibility with instance fleets. This feature allows you to specify priority levels for your instance types, enabling Amazon EMR to allocate capacity to the highest-priority instances first. This strategy helps improve cost savings and reduces the time required to launch clusters, even in scenarios with limited capacity. For more details, see Amazon EMR support prioritized and capacity-optimized-prioritized allocation strategies for EC2 instances. To maximize cost-efficiency and availability for spiky workloads, combine the price-performance advantages of new-generation instances with the broader availability of previous-generation instances. For workloads with strict latency requirements, fix the instance size to maintain consistent performance. This approach takes advantage of the strengths of both instance generations, providing flexibility and reliability decreasing the likelihood of capacity constraints. For On-Demand nodes, choose the prioritized allocation strategy, so the cluster tries to use newer-generation instances first. While configuring the instance fleet, arrange instances in a prioritized order reflecting price-performance and availability trade-offs, for example:

Primary node – m8g.12xlarge > m8g.16xlarge > m7g.12xlarge > m7g.16xlarge
Core node – r8g.8xlarge > r8g.12xlarge > r7g.8xlarge > r6g.16xlarge > r5.16xlarge
Task Node – r8g.8xlarge > r8g.12xlarge > r7g.8xlarge > r6g.16xlarge > r5.16xlarge

For Spot Instances, make sure the capacity-optimized prioritized allocation strategy is selected to reduce interruptions. See the following CloudFormation template snippet as an example:

...
       "Properties": {
         "Instances": {
          "MasterInstanceFleet": {
            "Name": "cfnMaster",
            "InstanceTypeConfigs": [
               {
                 "BidPrice": "10.50",
                 "InstanceType": "m5.xlarge",
                 "Priority": "1",
 ...
             "LaunchSpecifications": {
               "SpotSpecification": {
                 "TimeoutAction": "SWITCH_TO_ON_DEMAND",
                 "TimeoutDurationMinutes": 20,
                 "AllocationStrategy": "CAPACITY_OPTIMIZED_PRIORITIZED"
               },
               "OnDemandSpecification": {
                "AllocationStrategy": "PRIORITIZED"
               }
 ...

Select subnets with EMR instance fleets

When creating a cluster, specify multiple EC2 subnets within a virtual private cloud (VPC), each corresponding to a different Availability Zone. Amazon EMR provides multiple subnet (Availability Zone) options by employing subnet filtering at cluster launch, and selects one of the subnets that has adequate available IP addresses to successfully launch all instance fleets. If Amazon EMR can’t find a subnet with sufficient IP addresses to launch the whole cluster, it will prioritize the subnet that can at least launch the core and primary instance fleets.

Use managed scaling

Managed scaling is another powerful feature of Amazon EMR that automatically adjusts the number of instances in your cluster based on workload demands. This makes sure that your cluster scales up during periods of high demand to meet processing requirements and scales down during idle times to save costs. With managed scaling, you can set minimum and maximum scaling limits, giving you control over costs while benefiting from an optimized and efficient cluster performance.

The following workflow illustrates Amazon EMR configured with instance fleets and managed scaling.

The workflow consists of the following steps:

The user defines the EMR instance configurations and instance types, along with their launch priority.
The user selects subnets for the Amazon EMR configuration to provide Availability Zone flexibility.
Amazon EMR calls the Amazon EC2 Fleet API to provision instances based on the allocation strategy.
The EMR instance fleet is launched.
The cycle is repeated for scaling operations within the launched Availability Zone, providing optimized performance and scalability.

Conclusion

In this post, we demonstrated how to optimize capacity by analyzing EMR workloads and implementing strategies tailored to your workload patterns. As you implement any of the preceding strategies, remember to continuously monitor your cluster’s performance and adjust configurations based on your specific workload patterns and business needs. With the right approach, the challenges of spiky workloads can be transformed into opportunities for optimized performance and cost savings.

To effectively manage workloads with both baseline demands and unexpected spikes, consider implementing a hybrid approach in Amazon EMR. Use ODCRs for consistent baseline capacity and configure instance fleets with a strategic mix of ODCR, On-Demand, and Spot Instances prioritizing ODCR usage.

Try these strategies with your own use case, and leave your questions in the comments.

About the Authors

Deepmala Agarwal works as an AWS Data Specialist Solutions Architect. She is passionate about helping customers build out scalable, distributed, and data-driven solutions on AWS. When not at work, Deepmala likes spending time with family, walking, listening to music, watching movies, and cooking!

Suba Palanisamy is a Senior Technical Account Manager, helping customers achieve operational excellence on AWS. Suba is passionate about all things data and analytics. She enjoys traveling with her family and playing board games.

Flavio Torres is a Principal Technical Account Manager at AWS. Flavio helps Enterprise Support customers design, deploy, and scale resilient cloud applications. Outside of work, he enjoys hiking and barbecuing.

Design patterns for implementing Hive Metastore for Amazon EMR on EKS

2025-02-28 Avinash Desireddy

Post Syndicated from Avinash Desireddy original https://aws.amazon.com/blogs/big-data/design-patterns-for-implementing-hive-metastore-for-amazon-emr-on-eks/

In modern data architectures, the need to manage and query vast datasets efficiently, consistently, and accurately is paramount. For organizations that deal with big data processing, managing metadata becomes a critical concern. This is where Hive Metastore (HMS) can serve as a central metadata store, playing a crucial role in these modern data architectures.

HMS is a central repository of metadata for Apache Hive tables and other data lake table formats (for example, Apache Iceberg), providing clients (such as Apache Hive, Apache Spark, and Trino) access to this information using the Metastore Service API. Over time, HMS has become a foundational component for data lakes, integrating with a diverse ecosystem of open source and proprietary tools.

In non-containerized environments, there was typically only one approach to implementing HMS—running it as a service in an Apache Hadoop cluster. With the advent of containerization in data lakes through technologies such as Docker and Kubernetes, multiple options for implementing HMS have emerged. These options offer greater flexibility, allowing organizations to tailor HMS deployment to their specific needs and infrastructure.

In this post, we will explore the architecture patterns and demonstrate their implementation using Amazon EMR on EKS with Spark Operator job submission type, guiding you through the complexities to help you choose the best approach for your use case.

Solution overview

Prior to Hive 3.0, HMS was tightly integrated with Hive and other Hadoop ecosystem components. Hive 3.0 introduced a Standalone Hive Metastore. This new version of HMS functions as an independent service, decoupled from other Hive and Hadoop components such as HiveServer2. This separation enables various applications, such as Apache Spark, to interact directly with HMS without requiring a full Hive and Hadoop environment installation. You can learn more about other components of Apache Hive on the Design page.

In this post, we will use a Standalone Hive Metastore to illustrate the architecture and implementation details of various design patterns. Any reference to HMS refers to a Standalone Hive Metastore.

The HMS broadly consists of two main components:

Backend database: The database is a persistent data store that holds all the metadata, such as table schemas, partitions, and data locations.
Metastore service API: The Metastore service API is a stateless service that manages the core functionality of the HMS. It handles read and write operations to the backend database.

Containerization and Kubernetes offers various architecture and implementation options for HMS, including, running:

HMS as a sidecar container
Cluster dedicated HMS
External HMS

In this post, we’ll use Apache Spark as the data processing framework to demonstrate these three architectural patterns. However, these patterns aren’t limited to Spark and can be applied to any data processing framework, such as Hive or Trino, that relies on HMS for managing metadata and accessing catalog information.

Note that in a Spark application, the driver is responsible for querying the metastore to fetch table schemas and locations, then distributes this information to the executors. Executors process the data using the locations provided by the driver, never needing to query the metastore directly. Hence, in the three patterns described in the following sections, only the driver communicates with the HMS, not the executors.

HMS as sidecar container

In this pattern, HMS runs as a sidecar container within the same pod as the data processing framework, such as Apache Spark. This approach uses Kubernetes multi-container pod functionality, allowing both HMS and the data processing framework to operate together in the same pod. The following figure illustrates this architecture, where the HMS container is part of Spark driver pod.

HMS as sidecar container

This pattern is suited for small-scale deployments where simplicity is the priority. Because HMS is co-located with the Spark driver, it reduces network overhead and provides a straightforward setup. However, it’s important to note that in this approach HMS operates exclusively within the scope of the parent application and isn’t accessible by other applications. Additionally, row conflicts might arise when multiple jobs attempt to insert data into the same table simultaneously. To address this, you should make sure that no two jobs are writing to the same table simultaneously.

Consider this approach if you prefer a basic architecture. It’s ideal for organizations where a single team manages both the data processing framework (for example, Apache Spark) and HMS, and there’s no need for other applications to use HMS.

Cluster dedicated HMS

In this pattern, HMS runs in multiple pods managed through a Kubernetes deployment, typically within a dedicated namespace in the same data processing EKS cluster. The following figure illustrates this setup, with HMS decoupled from Spark driver pods and other workloads.

Cluster dedicated HMS

This pattern works well for medium-scale deployments where moderate isolation is enough, and compute and data needs can be handled within a few clusters. It provides a balance between resource efficiency and isolation, making it ideal for use cases where scaling metadata services independently is important, but complete decoupling isn’t necessary. Additionally, this pattern works well when a single team manages both the data processing frameworks and HMS, ensuring streamlined operations and alignment with organizational responsibilities.

By decoupling HMS from Spark driver pods, it can serve multiple clients, such as Apache Spark and Trino, while sharing cluster resources. However, this approach might lead to resource contention during periods of high demand, which can be mitigated by enforcing tenant isolation on HMS pods.

External HMS

In this architecture pattern, HMS is deployed in its own EKS cluster deployed using Kubernetes deployment and exposed as a Kubernetes Service using AWS Load Balancer Controller, separate from the data processing clusters. The following figure illustrates this setup, where HMS is configured as an external service, separate from the data processing clusters.

External HMS

This pattern suits scenarios where you want a centralized metastore service shared across multiple data processing clusters. HMS allows different data teams to manage their own data processing clusters while relying on the shared metastore for metadata management. By deploying HMS in a dedicated EKS cluster, this pattern provides maximum isolation, independent scaling, and the flexibility to operate and managed as its own independent service.

While this approach offers clear separation of concerns and the ability to scale independently, it also introduces higher operational complexity and potentially increased costs because of the need to manage an additional cluster. Consider this pattern if you have strict compliance requirements, need to ensure complete isolation for metadata services, or want to provide a unified metadata catalog service for multiple data teams. It works well in organizations where different teams manage their own data processing frameworks and rely on a shared metadata store for data processing needs. Additionally, the separation enables specialized teams to focus on their respective areas.

Deploy the solution

In the remainder of this post, you will explore the implementation details for each of the three architecture patterns, using EMR on EKS with Spark Operator job submission type as an example to demonstrate their implementation. Note that this implementation hasn’t been tested with other EMR on EKS Spark job submission types. You will begin by deploying the common components that serve as the foundation for all the architecture patterns. Next, you’ll deploy the components specific to each pattern. Finally, you’ll execute Spark jobs to connect to the HMS implementation unique to each pattern and verify the successful execution and retrieval of data and metadata.

To streamline the setup process, we’ve automated the deployment of common infrastructure components so you can focus on the essential aspects of each HMS architecture. We’ll provide detailed information to help you understand each step, simplifying the setup while preserving the learning experience.

Scenario

To showcase the patterns, you will create three clusters:

Two EMR on EKS clusters: analytics-cluster and datascience-cluster
An EKS cluster: hivemetastore-cluster

Both analytics-cluster and datascience-cluster serve as data processing clusters that run Spark workloads, while the hivemetastore-cluster hosts the HMS.

You will use analytics-cluster to illustrate the HMS as sidecar and cluster dedicated pattern. You will use all three clusters to demonstrate the external HMS pattern.

Source code

You can find the codebase in the AWS Samples GitHub repository.

Prerequisites

Before you deploy this solution, make sure that the following prerequisites are in place:

Access to a valid AWS account
The AWS Command Line Interface (AWS CLI) is installed on your local machine
Git, Docker, eksctl, kubectl, Helm, and jq utilities are installed on your local machine
Permission to create AWS resources
Familiarity with Kubernetes, Apache Spark, Amazon Elastic Kubernetes Service (Amazon EKS), and EMR on EKS

Set up common infrastructure

Begin by setting up the infrastructure components that are common to all three architectures.

Clone the repository to your local machine and set the two environment variables. Replace <AWS_REGION> with the AWS Region where you want to deploy these resources.

git clone https://github.com/aws-samples/sample-emr-eks-hive-metastore-patterns.git
cd sample-emr-eks-hive-metastore-patterns
export REPO_DIR=$(pwd)
export AWS_REGION=<AWS_REGION>

Execute the following script to create the shared infrastructure.

cd ${REPO_DIR}/setup
./setup.sh

To verify successful infrastructure deployment, navigate to the AWS Management Console for AWS CloudFormation, select your stack, and check the Events, Resources, and Outputs tabs for completion status, details, and list of resources created.

You have completed the setup of the common components that serve as the foundation for all architectures. You will now deploy the components specific to each architecture and execute Apache Spark jobs to validate the successful implementation.

HMS in a sidecar container

To implement HMS using the sidecar container pattern, the Spark application requires setting both sidecar and catalog properties in the job configuration file.

Execute the following script to configure the analytics-cluster for sidecar pattern. For this post, we stored the HMS database credentials into a Kubernetes Secret object. We recommend using Kubernetes External Secrets Operator to fetch HMS database credentials from AWS Secrets Manager.

cd ${REPO_DIR}/hms-sidecar
./configure-hms-sidecar.sh analytics-cluster

Review the Spark job manifest file spark-hms-sidecar-job.yaml. This file was created by substituting variables in the spark-hms-sidecar-job.tpl template in the previous step. The following samples highlight key sections of the manifest file.

spec:
  driver:
  ...
    sidecars:
      # Hive Metastore Sidecar container
      - name: hive-metastore
        image: ${AWS_ACCOUNT_ID}.dkr.ecr.${AWS_REGION}.amazonaws.com/hive-metastore:latest
        env:
          # These settings configure metastore to use Amazon Postgres RDS as backend database, connecting to it via jdbc URL.
          - name: HIVE_METASTORE_DB_TYPE
            value: postgres
          - name: HIVE_METASTORE_DB_DRIVER
            value: org.postgresql.Driver
          - name: HIVE_METASTORE_DB_URL
            value: jdbc:postgresql://${HMS_RDS_PROXY_ENDPOINT}:5432/hivemetastore
          # The warehouse location is specified as an S3 bucket
          - name: HIVE_METASTORE_WAREHOUSE_LOC
            value: s3a://${S3_BUCKET_NAME}/warehouse
          - name: AWS_REGION
            value: ${AWS_REGION}
          # The database username and password are passed via environment variables. The password is retrieved from a Kubernetes secret
          - name: HIVE_METASTORE_DB_USER
            value: hive_metastore_user
          - name: HIVE_METASTORE_DB_PASSWORD
            valueFrom:
              secretKeyRef:
                name: hms-rds-password
                key: HIVE_METASTORE_DB_PASSWORD

Spark job configuration

spec:  
  sparkConf:
    # Hive Catalog properties
    # Sets spark to use Hive as the catalog for SQL operations
    spark.sql.catalogImplementation: "hive"
    # HMS is exposed on localhost:8080 since the sidecar runs in the same pod. Spark connects to the sidecar via this URI
    spark.hadoop.hive.metastore.uris: "thrift://localhost:9083"
    # The data location is set to S3 bucket
    spark.sql.warehouse.dir: "s3a://${S3_BUCKET_NAME}/warehouse"
    # Retrieve temporary AWS credentials by assuming a role using a web identity token
    spark.hadoop.fs.s3a.aws.credentials.provider: "com.amazonaws.auth.WebIdentityTokenCredentialsProvider"
    # Configure Spark to use S3A filesystem
    spark.hadoop.fs.s3a.impl: "org.apache.hadoop.fs.s3a.S3AFileSystem"

Submit the Spark job and verify the HMS as sidecar container setup

In this pattern, you will submit Spark jobs in analytics-cluster. The Spark jobs will connect to the HMS service running as a sidecar container in the driver pod.

Run the Spark job to verify that the setup was successful.

kubectl apply -f spark-hms-sidecar-job.yaml

Describe the sparkapplication object.

kubectl get sparkapplication -n emr
kubectl describe sparkapplication spark-hms-sidecar-job --namespace emr

List the pods and observe the number of containers attached to the driver pod. Wait until the Status changes from ContainerCreating to Running (should take just a few seconds).

kubectl get pods -n emr

View the driver logs to validate the output.

kubectl logs spark-hms-sidecar-job-driver --namespace emr

If you encounter the following error, wait for a few minutes and rerun the previous command.

Error from server (BadRequest): container "spark-kubernetes-driver" in pod "spark-hms-sidecar-driver" is waiting to start: ContainerCreating

After successful completion of the job, you see the following message in the logs. The tabular output successfully validates the setup of HMS as a sidecar container.

...
24/09/17 21:44:00 INFO metastore: Trying to connect to metastore with URI thrift://localhost:9083
24/09/17 21:44:00 INFO metastore: Opened a connection to metastore, current connections: 1
24/09/17 21:44:00 INFO metastore: Connected to metastore.
...
...
+-------+---+-------+---+
|pattern|id |name   |age|
+-------+---+-------+---+
|Sidecar|1 |Alice   |30 |
|Sidecar|2 |Bob     |25 |
|Sidecar|3 |Charlie |35 |
+-------+---+-------+---+

Cluster dedicated HMS

To implement HMS using a cluster dedicated HMS pattern, the Spark application requires setting up HMS URI and catalog properties in the job configuration file.

Execute the following script to configure the analytics-cluster for cluster dedicated pattern.

cd ${REPO_DIR}/hms-cluster-dedicated
./configure-hms-cluster-dedicated.sh analytics-cluster

Verify the HMS deployment by listing the pods and viewing the logs. No Java exceptions in the logs confirms that the Hive Metastore service is running successfully.

kubectl get pods --namespace hive-metastore
kubectl logs <HMS-PODNAME> --namespace hive-metastore

Review the Spark job manifest file, spark-hms-cluster-dedicated-job.yaml. This file is created by substituting variables in the spark-hms-cluster-dedicated-job.tpl template in the previous step. The following sample highlights key sections of the manifest file.

spec: 
  sparkConf:
    # Hive Catalog properties
    # Sets spark to use Hive as the catalog for SQL operations
    spark.sql.catalogImplementation: "hive"
    # HMS is running in a pod and we can connect to it in the same EKS cluster via this URI
    spark.hadoop.hive.metastore.uris: "thrift://hive-metastore-svc.hive-metastore.svc.cluster.local:9083"
    # The data location is set to S3 bucket
    spark.sql.warehouse.dir: "s3a://${S3_BUCKET_NAME}/warehouse"
    # Retrieve temporary AWS credentials by assuming a role using a web identity token
    spark.hadoop.fs.s3a.aws.credentials.provider: "com.amazonaws.auth.WebIdentityTokenCredentialsProvider"
    # Configure Spark to use S3A filesystem
    spark.hadoop.fs.s3a.impl: "org.apache.hadoop.fs.s3a.S3AFileSystem"

Submit the Spark job and verify the cluster dedicated HMS setup

In this pattern, you will submit Spark jobs in analytics-cluster. The Spark jobs will connect to the HMS service in the same data processing EKS cluster.

Submit the job.

kubectl apply -f spark-hms-cluster-dedicated-job.yaml -n emr

Verify the status.

kubectl get sparkapplication -n emr
kubectl describe sparkapplication spark-hms-cluster-dedicated-job --namespace emr

Describe driver pod and observe the number of containers attached to the driver pod. Wait until the status changes from ContainerCreating to Running (should take just a few seconds).

kubectl get pods -n emr

View the driver logs to validate the output.

kubectl logs spark-hms-cluster-dedicated-job-driver --namespace emr

After successful completion of the job, you should see the following message in the logs. The tabular output successfully validates the setup of cluster dedicated HMS.

...
24/09/17 21:44:00 INFO metastore: Trying to connect to metastore with URI thrift://hive-metastore-svc.hive-metastore.svc.cluster.local:9083
24/09/17 21:44:00 INFO metastore: Opened a connection to metastore, current connections: 1
24/09/17 21:44:00 INFO metastore: Connected to metastore.
...
...
+-----------------+---+-------+---+
|pattern          |id |name   |age|
+-----------------+---+-------+---+
|Cluster Dedicated|1  |Alice  |30 |
|Cluster Dedicated|2  |Bob    |25 |
|Cluster Dedicated|3  |Charlie|35 |
+-----------------+---+-------+---+

External HMS

To implement an external HMS pattern, the Spark application requires setting up an HMS URI for the service endpoint exposed by hivemetastore-cluster.

Execute the following script to configure hivemetastore-cluster for External HMS pattern.

cd ${REPO_DIR}/hms-external
./configure-hms-external.sh

Review the Spark job manifest file spark-hms-external-job.yaml. This file is created by substituting variables in the spark-hms-external-job.tpl template during the setup process. The following sample highlights key sections of the manifest file.

spec:
  sparkConf:
    # Hive Catalog properties
    # Sets spark to use Hive as the catalog for SQL operations
    spark.sql.catalogImplementation: "hive"
    # HMS is running in a cluster and we can connect to it in the EKS cluster via this URI
    spark.hadoop.hive.metastore.uris: "thrift://${HMS_URI_ENDPOINT}:9083"
    # The data location is set to S3 bucket
    spark.sql.warehouse.dir: "s3a://${S3_BUCKET_NAME}/warehouse"
    # Retrieve temporary AWS credentials by assuming a role using a web identity token
    spark.hadoop.fs.s3a.aws.credentials.provider: "com.amazonaws.auth.WebIdentityTokenCredentialsProvider"
    # Configure Spark to use S3A filesystem
    spark.hadoop.fs.s3a.impl: "org.apache.hadoop.fs.s3a.S3AFileSystem"

Submit the Spark job and verify the HMS in a separate EKS cluster setup

To verify the setup, submit Spark jobs in analytics-cluster and datascience-cluster. The Spark jobs will connect to the HMS service in the hivemetastore-cluster.

Use the following steps for analytics-cluster and then for datascience-cluster to verify that both clusters can connect to the HMS on hivemetastore-cluster.

Run the spark job to test the successful setup. Replace <CONTEXT_NAME> with Kubernetes context for analytics-cluster and then for datascience-cluster.

kubectl config use-context <CONTEXT_NAME>
kubectl apply -f spark-hms-external-job.yaml -n emr

Describe the sparkapplication object.

kubectl get sparkapplication spark-hms-external-job -n emr
kubectl describe sparkapplication spark-hms-external-job --namespace emr

List the pods and observe the number of containers attached to the driver pod. Wait until the status changes from ContainerCreating to Running (should take just a few seconds).

kubectl get pods -n emr

View the driver logs to validate the output on the data processing cluster.

kubectl logs spark-hms-external-job-driver --namespace emr

The output should look like the following. The tabular output successfully validates the setup of HMS in a separate EKS cluster.

After successful completion of the job, you should be able to see the below message in the logs.
...
24/09/17 21:44:00 INFO metastore: Trying to connect to metastore with URI  thrift://k8s-hivemeta-hmsexter-xxxxxx.elb.us-east-1.amazonaws.com:9083
24/09/17 21:44:00 INFO metastore: Opened a connection to metastore, current connections: 1
24/09/17 21:44:00 INFO metastore: Connected to metastore.
...
...
+--------+---+-------+---+
|pattern |id |name   |age|
+--------+---+-------+---+
|External|1  |Alice  |30 |
|External|2  |Bob    |25 |
|External|3  |Charlie|35 |
+--------+---+-------+---+

Clean up

To avoid incurring future charges from the resources created in this tutorial, clean up your environment after you’ve completed the steps. You can do this by running the cleanup.sh script, which will safely remove all the resources provisioned during the setup.

cd ${REPO_DIR}/setup
./cleanup.sh

Conclusion

In this post, we’ve explored the design patterns for implementing the Hive Metastore (HMS) with EMR on EKS with Spark Operator, each offering distinct advantages depending on your requirements. Whether you choose to deploy HMS as a sidecar container within the Apache Spark Driver pod, or as a Kubernetes deployment in the data processing EKS cluster, or as an external HMS service in a separate EKS cluster, the key considerations revolve around communication efficiency, scalability, resource isolation, high availability, and security.

We encourage you to experiment with these patterns in your own setups, adapting them to fit your unique workloads and operational needs. By understanding and applying these design patterns, you can optimize your Hive Metastore deployments for performance, scalability, and security in your EMR on EKS environments. Explore further by deploying the solution in your AWS account and share your experiences and insights with the community.

About the Authors

Avinash Desireddy is a Cloud Infrastructure Architect at AWS, passionate about building secure applications and data platforms. He has extensive experience in Kubernetes, DevOps, and enterprise architecture, helping customers containerize applications, streamline deployments, and optimize cloud-native environments.

Suvojit Dasgupta is a Principal Data Architect at AWS. He leads a team of skilled engineers in designing and building scalable data solutions for AWS customers. He specializes in developing and implementing innovative data architectures to address complex business challenges.

Amazon Web Services named a Leader in the 2024 Gartner Magic Quadrant for Data Integration Tools

2025-02-26 William Vambenepe

Post Syndicated from William Vambenepe original https://aws.amazon.com/blogs/big-data/amazon-web-services-named-a-leader-in-the-2024-gartner-magic-quadrant-for-data-integration-tools/

Amazon Web Services (AWS) has been recognized as a Leader in the 2024 Gartner Magic Quadrant for Data Integration Tools. We were positioned in the Challengers Quadrant in 2023.

This recognition, we feel, reflects our ongoing commitment to innovation and excellence in data integration, demonstrating our continued progress in providing comprehensive data management solutions.

The Gartner Magic Quadrant evaluates 20 data integration tool vendors based on two axes—Ability to Execute and Completeness of Vision. This evaluation, we feel, critically examines vendors’ capabilities to address key service needs, including data engineering, operational data integration, modern data architecture delivery, and enabling less-technical data integration across various deployment models.

Discover, prepare, and integrate all your data at any scale

AWS Glue is a fully managed, serverless data integration service that simplifies data preparation and transformation across diverse data sources. With its comprehensive suite of tools, AWS Glue allows users to build and manage data pipelines efficiently, without requiring extensive infrastructure management expertise.

Given the diverse data integration needs of customers, AWS offers a robust data integration system through multiple services including Amazon EMR, Amazon Athena, Amazon Managed Workflows for Apache Airflow (Amazon MWAA), Amazon Managed Streaming for Apache Kafka (MSK), Amazon Kinesis, and others. Many thousands of customers across various industries are using these services to transform, operationalize, and manage their data across data lakes and data warehouses.

We have embarked on a journey to unify the broad range of AWS data processing, analytics, and AI capabilities, starting with the announcement of Amazon SageMaker Unified Studio at re:Invent 2024. This includes the data integration capabilities mentioned above, with support for both structured and unstructured data. With an integrated experience for data workers, SageMaker Unified Studio provides an environment where users can collaborate and build faster. It supports model development, generative AI, data processing, and SQL analytics, all accelerated by Amazon Q Developer—our most capable generative AI assistant for software development. The Unified Studio provides unified access to all data sources, whether stored in data lakes, data warehouses, or third-party and federated sources, with robust governance and enterprise-grade security built-in.

Review the Gartner Magic Quadrant

Access a complimentary copy of the full report to see why Gartner positioned AWS as a Leader, and dive deep into the strengths and cautions of AWS. We believe our recognition as a Leader in the Gartner Magic Quadrant is a testament to delivering innovations for our customers.

Gartner does not endorse any vendor, product or service depicted in its research publications and does not advise technology users to select only those vendors with the highest ratings or other designation. Gartner research publications consist of the opinions of Gartner’s research organization and should not be construed as statements of fact. Gartner disclaims all warranties, expressed or implied, with respect to this research, including any warranties of merchantability or fitness for a particular purpose.

GARTNER is a registered trademark and service mark of Gartner and Magic Quadrant is a registered trademark of Gartner, Inc. and/or its affiliates in the U.S. and internationally and are used herein with permission. All rights reserved. This graphic was published by Gartner, Inc. as part of a larger research document and should be evaluated in the context of the entire document. The Gartner document is available upon request from here.

About the authors

William Vambenepe is Director of Product Management at AWS, where he leads the Product Management, Solutions Engineering, and UX Design team for data processing services (Amazon EMR, AWS Glue, Athena, Amazon MWAA), SageMaker Unified Studio, and SageMaker Catalog. Prior to AWS, William worked at Google (6 years building and growing the Data and Analytics product portfolio for Google Cloud, and 5 years as Product Management Director for Google Search). He had previously held software engineering leadership roles at Oracle and HP. William holds an Engineering degree from Ecole Centrale Paris, a graduate Diploma in Computer Science from Cambridge University, and a Masters in Engineering Management from Stanford University.

Santosh Chandrachood has been with AWS for over 8 years and helped build, launch, and scale a variety of AWS services. Currently, Santosh is Director and service leader for Data Processing, managing Amazon EMR, Athena, AWS Glue, and Managed Workflows for Apache Airflow (Amazon MWAA). Santosh also led AWS Data Integration as the General Manager. Before joining AWS, Santosh lead engineering teams in networking, storage, and data infrastructure areas.

Building and operating data pipelines at scale using CI/CD, Amazon MWAA and Apache Spark on Amazon EMR by Wipro

2025-02-25 Senaka Ariyasinghe

Post Syndicated from Senaka Ariyasinghe original https://aws.amazon.com/blogs/big-data/building-and-operating-data-pipelines-at-scale-using-ci-cd-amazon-mwaa-and-apache-spark-on-amazon-emr-by-wipro/

Businesses of all sizes are challenged with the complexities and constraints posed by traditional extract, transform and load (ETL) tools. These intricate solutions, while powerful, often come with a significant financial burden, particularly for small and medium enterprise customers. Beyond the substantial costs of procurement and licensing, customers must also contend with the expenses associated with installation, maintenance, and upgrades—a perpetual cycle of investment that can strain even the most robust budgets. At Wipro, scalability of data pipelines in addition to automation remains a persistent concern for their customers and they’ve learned through customer engagements that it’s not achievable without considerable effort. As data volumes continue to swell, these tools can struggle to keep pace with the ever-increasing demand, leading to processing delays and disruptions in data delivery—a critical bottleneck in an era when timely insights are paramount.

This blog post discusses how a programmatic data processing framework developed by Wipro can help data engineers overcome obstacles and streamline their organization’s ETL processes. The framework leverages Amazon EMR improved runtime for Apache Spark and integrates with AWS Managed services. This framework is robust and capable of connecting with multiple data sources and targets. By using capabilities from AWS managed services, the framework eliminates the undifferentiated heavy lifting typically associated with infrastructure management in traditional ETL tools, enabling customers to allocate resources more strategically. Furthermore, we will show you how the framework’s inherent scalability ensures that businesses can effortlessly adapt to increasing data volumes, fostering agility and responsiveness in an evolving digital landscape.

Solution overview

The proposed solution helps build a fully automated data processing pipeline that streamlines the entire workflow. It triggers processes when code is pushed to Git, orchestrates and schedules processing of jobs, validates data with the help of defined rules, transforms data prescribed within code, and loads the transformed datasets into a specified target. The primary component of this solution is the robust framework, developed using Amazon EMR runtime for Apache Spark. The framework can be used for any ETL process where input might be fetched from various data sources, transformed, and loaded into specified targets. To enable gaining valuable insights and provide overall job monitoring and automation, the framework is integrated with AWS managed services:

Amazon Managed Workflows for Apache Airflow (Amazon MWAA) for data pipeline orchestration, scheduling, and execution.
Amazon EMR on Amazon Elastic Compute Cloud (Amazon EC2) to execute a data processing job that extracts data from the source, performs data validation and transformation, and loads data into the target.
Amazon CloudWatch and Amazon Simple Notification Service (Amazon SNS) for monitoring of processing jobs and notification on job progress.
Amazon Simple Storage Service (Amazon S3) to store build artifacts.
Amazon EC2 to host and run a Jenkins build server.

Solution walkthrough

solution architecture

The solution architecture is shown in the preceding figure and includes:

Continuous integration and delivery (CI/CD) for data processing
- Data engineers can define the underlying data processing job within a JSON template. A pre-defined template is available on GitHub for you to review syntax. At a high level, this step includes the following objectives:
  - Writing the Spark job configuration to be executed on Amazon EMR.
  - Split the data processing into three phases:
    - Parallelize fetching data from source, validate source data, and prepare the dataset for further processing.
    - Provide flexibility to write business transformation rules defined in JSON, including data validation checks such as duplicate record, null value check, and their removal. It can also include any SQL based transformation written in Apache Spark SQL.
    - Take the transformed data set and load it to the target and perform reconciliation as needed.

It’s important to highlight that each step of the three phases is recorded for auditing, and error reporting, and troubleshooting and security purposes.

- After the data engineer has prepared the configuration file following the prescribed template in step 1 and committed it to the Git repository, it triggers the Jenkins pipeline. Jenkins is an open source continuous integration tool running on an EC2 instance that takes the configuration file, processes it, and builds (compiles the Spark application code) end artifacts—a JAR file and a configuration file (.conf) that’s copied to an S3 bucket and will be used later by Amazon EMR.

CI/CD for data pipeline

The CI/CD for the data pipeline is shown in the following figure.

CICD for the data pipeline

- After the data processing job is written, the data engineers can use a similar code-driven development approach to define the data processing pipeline to schedule, orchestrate, and execute the data processing job. Apache Airflow is a popular open source platform used for developing, scheduling, and monitoring batch-oriented workflows. In this solution, we use Amazon MWAA to execute the data pipeline through a Direct Acyclic Graph (DAG). To make it easier for engineers to build the required DAG in this solution, you can define the data pipeline in simple YAML. A sample of the YAML file is available on GitHub for review.
- When a user commits the YAML file containing the DAG details to the project Git repository, another Jenkins pipeline is triggered.
- The Jenkins pipeline now reads the YAML configuration file and based on the task and dependencies given, it generates the DAG script file, which is copied to the configured S3 bucket.

Airflow DAG execution
- After both the data processing job and data pipeline are configured, Amazon MWAA retrieves the most recent scripts from the S3 bucket to display the latest DAG definition in the Airflow user interface. These DAGs contain at least three tasks and except for creating and terminating an EMR cluster, every task represents an ETL process. Sample DAG code is available in GitHub. The following figure shows the DAG grid view within Amazon MWAA.

- As defined in the schedule specified in the job, Airflow executes the create Amazon EMR task that launches the Amazon EMR cluster on the EC2 instance. After the cluster is created, the ETL processes are submitted to Amazon EMR as steps.
- Amazon EMR executes these steps concurrently (Amazon EMR provides step concurrency levels that define how many steps to process concurrently). After the tasks are finished, the Amazon EMR cluster is terminated to save costs.

ETL processing
- Each step submitted by Airflow to Amazon EMR with a Spark submit command also includes the S3 bucket path of the configuration file passed as an argument.
- Based on the configuration file, the input data is fetched and technical validations are applied. If data mapping has been enabled within the data processing job, then the structured data is prepared based on the given schema. This output is passed to next phase where data transformations and business validations can be applied.
- A set of reconciliation rules are applied to the transformed data to ensure the data quality dimensions. After this step, data is loaded to specified target.

The following figure shows the ETL data processing job as executed by Amazon EMR.

ETL data processing job

Logging, monitoring and notification
- Amazon MWAA provides the logs generated by each task of the DAG within the Airflow UI. Using these logs, you can monitor Apache Airflow task details, delays, and workflow errors.
- Amazon MWAA also frequently pings the Amazon EMR cluster to fetch the latest status of the step being executed and updates the task status accordingly.
- If a step has failed, for example, if the database connection was not established because of high traffic, Amazon MWAA repeats the process.
- Whenever a task has failed, an email notification is sent to the configured recipients with the failure cause and logs using Amazon SNS.

The key capabilities of this solution are:

Full automation: After the user commits the configuration files to Git, the remainder of the process is fully automated from when the CI/CD pipelines deploy the artifacts and DAG code. The DAG code is later executed in Airflow at the scheduled time. The entire ETL job is logged and monitored, and email notifications are sent in case of any failures.

Flexible integration: The application can seamlessly accommodate a new ETL process with minimal effort. To create a new process, prepare a configuration file that contains the source and target details and the necessary transformation logic. An example of how to specify your data transformation is shown in the following sample code.

"data_transformations": [{
"functionName": "cast column date_processed",
"sqlQuery": "Select *, from_unixtime(UNIX_TIMESTAMP(date_processed, 'yyyy-MM-dd HH:mm:ss'), 'dd/MM/yyyy') as dateprocessed from table_details",
"outputDFName": "table_details"
},
{
"functionName": "find the reference data from lookup",
"sqlQuery": "join_query_table_lookup.sql",
"outputDFName": "super_search_table_details"
}]

Fault tolerance: In addition to Apache Spark’s fault-tolerant capabilities, this solution offers the ability to recover data even after the Amazon EMR has been terminated. The application solution has three phases. In the event of a failure in the Apache Spark job, the output of the last successful phase is temporarily stored in Amazon S3. When the job is triggered again through Airflow DAG, the Apache Spark job will resume from the point at which it previously failed, thereby ensuring continuity and minimizing disruptions to the workflow. The following figure shows job reporting in the Amazon MWAA UI.

job reporting in the Amazon MWAA UI.

Scalability: As shown in the following figure, the Amazon EMR cluster is configured to use instance fleet options to scale up or down the number of nodes depending on the size of the data, which makes this application an ideal choice for businesses with growing data needs.

instance fleet options to scale up or down

Customizable: This solution can be customized to meet the needs of specific use cases, allowing you to add your own transformations, validations, and reconciliations according to your unique data management requirements.
Enhanced data flexibility: By incorporating support for multiple file formats, the Apache Spark application and Airflow DAGs gain the ability to seamlessly integrate and process data from various sources. This advantage allows data engineers to work with a wide range of file formats, including JSON, XML, Text, CSV, Parquet, Avro, and so on.
Concurrent execution: Amazon MWAA submits the tasks to Amazon EMR for concurrent execution, using the scalability and performance of distributed computing to process large volumes of data efficiently.
Proactive error notification: Email notifications are sent to configured recipients whenever a task fails, providing timely awareness of failures and facilitating prompt troubleshooting.

Considerations

In our deployments, we have noticed that the average time of a DAG completion is 15–20 minutes containing 18 ETL processes concurrently and dealing with 900 thousand to 1.2 million records each. However, if you want to further optimize the DAG completion time, you can configure the core.dag_concurrency from the Amazon MWAA console as described in Example high performance use case.

Conclusion

The proposed code-driven data processing framework developed by Wipro using Amazon EMR Runtime for Apache Spark and Amazon MWAA demonstrates a solution to address the challenges associated with traditional ETL tools. By harnessing the capabilities from open source frameworks and seamlessly integrating with AWS services, you can build cost-effective, scalable, and automated approaches for your enterprise data processing pipelines.

Now that you have seen how to use Amazon EMR Runtime for Apache Spark with Amazon MWAA , we encourage you to use Amazon MWAA to create a workflow that will run your ETL jobs on Amazon EMR Runtime for Apache Spark.

The configuration file samples and example DAG code referred in this blog post can be found in GitHub.

References

Disclaimer

Sample code, software libraries, command line tools, proofs of concept, templates, or other related technology are provided as AWS Content or third-party content under the AWS Customer Agreement, or the relevant written agreement between you and AWS (whichever applies). You should not use this AWS Content or third-party content in your production accounts, or on production or other critical data. Performance metrics, including the stated DAG completion time, may vary based on the specific deployment environment. You are responsible for testing, securing, and optimizing the AWS Content or third-party content, such as sample code, as appropriate for production grade use based on your specific quality control practices and standards. Deploying AWS Content or Third-Party Content may incur AWS charges for creating or using AWS chargeable resources, such as running Amazon EC2 instances or using Amazon S3 storage.

About the Authors

Deependra Shekhawat is a Senior Energy and Utilities Industry Specialist Solutions Architect based in Sydney, Australia. In his role, Deependra helps Energy companies across APJ region use cloud technologies to drive sustainability and operational efficiency. He specializes in creating robust data foundations and advanced workflows that enable organizations to harness the power of big data, analytics, and machine learning for solving critical industry challenges.

Senaka Ariyasinghe is a Senior Partner Solutions Architect working with Global Systems Integrators at AWS. In his role, Senaka guides AWS partners in the APJ region to design and scale well-architected solutions, focusing on generative AI, machine learning, cloud migrations, and application modernization initiatives.

Sandeep Kushwaha is a Senior Data Scientist at Wipro and has extensive experience in big data and machine learning. With a strong command of Apache Spark, Sandeep has designed and implemented cutting-edge cloud solutions that optimize data processing and drive efficiency. His expertise in using AWS services and best practices, combined with his deep knowledge of data management and automation, has enabled him to lead successful projects that meet complex technical challenges and deliver high-impact results.

Enhance your workload resilience with new Amazon EMR instance fleet features

2025-02-19 Deepmala Agarwal

Post Syndicated from Deepmala Agarwal original https://aws.amazon.com/blogs/big-data/enhance-your-workload-resilience-with-new-amazon-emr-instance-fleet-features/

Big data processing and analytics have emerged as fundamental components of modern data architectures. Organizations worldwide use these capabilities to extract actionable insights and facilitate data-driven decision-making processes. Amazon EMR has long been a cornerstone for big data processing in the cloud. Now, with a suite of exciting new features for EMR instance fleets that enables you to effectively manage your compute, Amazon is taking cloud-based analytics to the next level.

Amazon EMR has introduced new features for instance fleets that address critical challenges in big data operations. This post explores how these innovations improve cluster resilience, scalability, and efficiency, enabling you to build more robust data processing architectures on AWS. This comprehensive post introduces instance fleets, demonstrates using this new allocation strategy, explores how enhanced Availability Zone and subnet selection works, and examines how these features improve cluster’s resilience. This technical exploration will equip you with the knowledge to implement more resilient and efficient EMR clusters for your organization’s big data processing needs.

The current challenges

Organizations using big data operations might face several challenges:

When preferred instance types are unavailable, finding suitable alternatives often delays cluster launches and disrupts workflows
Selecting the optimal Availability Zone for cluster launch is challenging due to constantly changing available compute capacity, especially when considering future scaling needs
Maintaining uninterrupted operation of mission-critical long-running clusters becomes complex as data processing requirements evolve over time
Organizations frequently struggle to scale their operations to meet growing data processing demands, leading to performance bottlenecks and delayed insights

These challenges underscore the need for more advanced, flexible, and intelligent solutions in the realm of big data operations, driving the demand for innovative features in cloud-based data processing platforms.

Introducing improved EMR instance fleets

Amazon EMR, a cloud-based big data platform, allows you to process large datasets using various open source tools such as Apache Spark, Apache Flink, and Trino. To address the aforementioned challenges, Amazon EMR introduced instance fleets, with a robust set of features.

When setting up an EMR cluster, Amazon EMR offers two configuration options for configuring the primary, core, and task nodes: uniform instance groups or instance fleets.

Uniform instance groups offer a streamlined approach to cluster setup, allowing up to 50 instance groups per cluster. An EMR cluster has a primary instance group for primary node, a core instance group with one or more Amazon Elastic Compute Cloud (Amazon EC2) instances, and the option to add up to 48 task instance groups. Both core and task instance groups are flexible, allowing any number of EC2 instances within each group. Both core and task groups offer flexibility in instance count, and each node type (primary, core, or task) consists of instances sharing the same specifications and purchasing model (On-Demand or Spot). However, this approach limits the ability to mix different instance types or purchasing options within a single group.

Instance fleets provide a versatile approach to provisioning EC2 instances, offering unparalleled flexibility in cluster configuration. This setup assigns one instance fleet each for primary and core nodes, with the task instance fleet being optional. It allows you to specify up to five EC2 instance types (or up to 30 when using the Amazon Command Line Interface (AWS CLI) or API with an instance allocation strategy) for each node type in a cluster, providing enhanced instance diversity to optimize cost and performance while increasing the likelihood of fulfilling capacity requirements. Instance fleets automatically manage the mix of instance types to meet specified target capacities for On-Demand and Spot, reducing operational overhead and improving compute availability.

Key benefits of instance fleets include improved cluster resilience to capacity fluctuations, superior management of Spot Instances with the ability to set timeouts and specify actions if Spot capacity can’t be provisioned, and faster cluster provisioning. The feature also allows you to select multiple subnets for different Availability Zones, enabling Amazon EMR to optimally launch clusters and automatically route traffic away from impacted zones during large-scale events. Additionally, instance fleets offer capacity reservation options for On-Demand Instances and support allocation strategies that prioritize instance types based on user-defined criteria, further enhancing the flexibility and efficiency of EMR cluster management.

Achieve resiliency with instance fleets

Now that you have a good understanding of instance fleets, let’s explore how the new instance fleet capabilities help achieve resiliency for your workloads through the following methods:

EC2 instance allocation – Enables precise control over instance type selection and prioritization
Enhanced subnet selection – Optimizes cluster deployment across Availability Zones

EC2 instance allocation

EMR instance fleets now offer newer allocation strategies for both Spot and On-Demand Instances, giving you control over selection and prioritization of instance types and allowing you to optimize for greater flexibility, resilience, and cost-efficiency.

Amazon EMR supports the following allocation strategies for On-Demand Instances:

Prioritized (new) – Allows you to define a priority order for instance types, giving you precise control over instance selection
Lowest-price (existing) – Selects the lowest-priced instance type from the available options

Amazon EMR supports the following allocation strategies for Spot Instances:

Price-capacity optimized (new) – Selects instances with the lowest price while also considering the available capacity
Capacity-optimized-prioritized (new) – Similar to capacity-optimized, but respects instance type priorities that you specify, on a best-effort basis
Capacity-optimized (existing) – Selects instances from the pools with the most available capacity
Lowest-price (existing) – Selects the lowest-priced Spot Instances
Diversified (existing) – Distributes instances across all pools

When using the prioritized On-Demand allocation strategy, Amazon EMR applies the same priority value to both your On-Demand and Spot Instances when you set priorities.

For Spot Instances, Amazon EMR recommends the capacity-optimized allocation strategy. This approach allocates instances from the most available capacity pools, thereby reducing the chance of interruptions and enhancing cluster stability. Amazon EMR also allows you to launch a cluster without an allocation strategy. However, using an allocation strategy is recommended for faster cluster provisioning, more accurate Spot Instance allocation, and fewer Spot Instance interruptions.

Enhanced subnet selection

Amazon EMR on EC2 offers improved reliability and cluster launch experience for instance fleet clusters through the newly launched enhanced subnet selection. With this feature, EMR on EC2 reduces cluster launch failures resulting from an IP address shortage. Previously, the subnet selection for EMR clusters only considered the available IP addresses for the core instance fleet. Amazon EMR now employs subnet filtering at cluster launch and selects one of the subnets that have adequate available IP addresses to successfully launch all instance fleets. If Amazon EMR can’t find a subnet with sufficient IP addresses to launch the whole cluster, it will prioritize the subnet that can at least launch the core and primary instance fleets. In this scenario, Amazon EMR will also publish an Amazon CloudWatch alert event to notify the user. If none of the configured subnets can be used to provision the core and primary fleet, Amazon EMR will fail the cluster launch and provide a critical error event. These CloudWatch events enable you to monitor your clusters and take remedial actions as necessary. This capability is enabled by default when you configure more than one subnet for cluster launch, and you don’t need to make any configuration changes to benefit from it.

Solution overview

Now that you have a comprehensive grasp of the two new features, let’s integrate the elements of instance fleets and look at the implementation flow for each feature.

EC2 instance allocation

The following diagram illustrates the instance fleet lifecycle management architecture.

The workflow consists of the following steps:

Create a cluster configuration with the prioritized allocation strategy, specifying instance types, their priority, and a list of potential subnets.
When you launch an EMR cluster, it evaluates compute capacity and available IPs across the specified subnets. Amazon EMR then selects a single Availability Zone that best meets capacity and instance availability needs for the entire cluster.
Amazon EMR launches the cluster using available instance types in one of the configured Availability Zones based on enhanced subnet selection.
During a scale-up scenario, Amazon EMR adds new instances to the clusters while following the configured compute allocation strategy.
If a specific instance type is unavailable, Amazon EMR will select the next available instance types based on the priority order. This flexibility provides capacity availability for production workloads while maintaining scalability.

The following example code provisions an EMR cluster with a primary and core instance fleet configuration with both Spot and On-Demand Instances, using the Capacity-optimized-prioritized allocation strategy for Spot Instances and the Prioritized strategy for On-Demand Instances:

{
  "AWSTemplateFormatVersion": "2010-09-09",
  "Resources": {
    "myCluster": {
      "Type": "AWS::EMR::Cluster",
      "Properties": {
        "Instances": {
          "MasterInstanceFleet": {
            "Name": "cfnPrimary",
            "InstanceTypeConfigs": [
              {
                "BidPrice": "10.50",
                "InstanceType": "m5.xlarge",
                "Priority": "1",
                "EbsConfiguration": {
                  "EbsBlockDeviceConfigs": [
                    {
                      "VolumeSpecification": {
                        "VolumeType": "gp2",
                        "SizeInGB": 32
                      }
                    }
                  ]
                }
              }
            ],
            "TargetOnDemandCapacity": 1
          },
          "CoreInstanceFleet": {
            "Name": "cfnCore",
            "InstanceTypeConfigs": [
              {
                "BidPrice": "10.50",
                "InstanceType": "m5.xlarge",
                "Priority": "1",
                "WeightedCapacity": "1",
                "EbsConfiguration": {
                  "EbsBlockDeviceConfigs": [
                    {
                      "VolumeSpecification": {
                        "VolumeType": "gp2",
                        "SizeInGB": 32
                      }
                    }
                  ]
                }
              }
            ],
            "LaunchSpecifications": {
              "SpotSpecification": {
                "TimeoutAction": "SWITCH_TO_ON_DEMAND",
                "TimeoutDurationMinutes": 20,
                "AllocationStrategy": "CAPACITY_OPTIMIZED_PRIORITIZED"
              },
              "OnDemandSpecification": {
                "AllocationStrategy": "PRIORITIZED"
              }
            },
            "TargetOnDemandCapacity": "5",
            "TargetSpotCapacity": "0"
          }
        },
        "Name": "blog-test",
        "JobFlowRole": "EMR_EC2_DefaultRole",
        "ServiceRole": "EMR_DefaultRole",
        "ReleaseLabel": "emr-7.2.0"
      }
    }
  }
}

Enhanced subnet selection

To better understand Step 3 in the preceding workflow, let’s explore how enhanced subnet selection works with instance fleet EMR clusters.

For our example, let’s configure an EMR instance fleet as follows:

Primary fleet (1 unit) – r8g.xlarge, r6g.xlarge, r8g.2xlarge
Core fleet (48 units) – r6g.xlarge, r6g.2xlarge, m7g.2xlarge
Task fleet (48 units) – m7g.2xlarge, r6g.xlarge, r6a.4xlarge

For this example, let’s use the lowest price allocation strategy. Next, let’s check the available IP addresses in our subnets using the AWS CLI:

aws ec2 describe-subnets 
--query "sort_by(Subnets, &SubnetId)[*].[SubnetId, AvailableIpAddressCount, AvailabilityZoneId]" 
--output table

We get the following results:

--------------------------------------------------
|                 DescribeSubnets                |
+---------------------------+-------+------------+
|subnet-XXXXXXXXXXXXXXXX1   |  27  |  us-east-1a |
|subnet-XXXXXXXXXXXXXXXX2   |  251 |  us-east-1b |
|subnet-XXXXXXXXXXXXXXXX3   |  11  |  us-east-1a |
-------------------------------------------------

When launching an EMR cluster, Amazon EMR follows a specific subnet filtering process. First, EMR on EC2 evaluates subnets based on the total IP addresses required for all node types: primary, core, and task nodes. If multiple subnets have sufficient IP capacity to accommodate all instance fleets, Amazon EMR selects one based on the cluster’s allocation strategy. However, if no subnet has enough IPs to support all node types, Amazon EMR considers subnets that can at least accommodate the primary and core nodes, again using the allocation strategy to make the final selection. In our case, Amazon EMR selected a subnet in Availability Zone us-east-1b that had 251 available IPs that can support 97 instances to launch the whole cluster, bypassing smaller subnets with only 27 or 11 available IPs because they didn’t meet the minimum IP requirements for the cluster configuration.

Primary fleet (1 unit) – r6g.xlarge
Core fleet (48 units) – m7g.2xlarge
Task fleet (48 units) – r6g.xlarge

The EMR and CloudWatch event for this cluster would be:

Amazon EMR cluster j-X40BEI1Oxxx (Cluster) 
is being created in subnet (subnet-XXXXXXXXXXXXXXXX2) 
in VPC (vpc-XXXXXXXXXXXXXXXX1) in Availability Zone (us-east-1b), 
which was chosen from the specified VPC options.

If Amazon EMR can’t find a subnet with sufficient IP addresses to launch the entire cluster, it will prioritize launching the core and primary instance fleets. If no configured subnet can accommodate even the core and primary fleets, Amazon EMR will fail the cluster launch and provide a critical error event. These CloudWatch events enable you to monitor your clusters and take necessary actions.

Conclusion

The latest enhancements to EMR instance fleets mark a significant advancement in cloud-based big data processing, addressing key challenges in resource allocation, scalability, and reliability. These features, including priority-based instance selection and enhanced subnet selection, provide you with greater control over resource strategies, improved cluster availability, enhanced capacity optimization across Availability Zones, and more efficient fallback mechanisms for production workloads. Instance fleets help you tackle current resource management challenges while laying the groundwork for future scalability.

Get started today by setting up an EMR cluster using the example configuration provided in this post. For additional configuration options and implementation guidance, refer here or reach out to your AWS account team.

About the Authors

Ravi Kumar Singh is a Senior Product Manager Technical-ES (PMT) at Amazon Web Services, specialized in building petabyte-scale data infrastructure and analytics platforms. With a passion for building innovative tools, he helps customers unlock valuable insights from their structured and unstructured data. Ravi’s expertise lies in creating robust data foundations using open source technologies and advanced cloud computing that power advanced artificial intelligence and machine learning use cases. A recognized thought leader in the field, he advances the data and AI ecosystem through pioneering solutions and collaborative industry initiatives. As a strong advocate for customer-centric solutions, Ravi constantly seeks ways to simplify complex data challenges and enhance user experiences. Outside of work, Ravi is an avid technology enthusiast who enjoys exploring emerging trends in data science, cloud computing, and machine learning.

Mandisa Nxumalo is a Cloud Engineer at Amazon Web Services (AWS) with over 5 years experience in topics related to cloud services (databases, automation, and others). Currently, specializing in Big data service Amazon EMR. She is passionate about engaging customers to effectively adopt and utilize data driven approaches to improve their big data workflows. Outside work, Mandisa enjoys hiking mountains, chasing waterfalls and travelling across countries.

Kashif Khan is a Sr. Analytics Specialist Solutions Architect at AWS, specializing in big data services like Amazon EMR, AWS Lake Formation, AWS Glue, Amazon Athena, and Amazon DataZone. With over a decade of experience in the big data domain, he possesses extensive expertise in architecting scalable and robust solutions. His role involves providing architectural guidance and collaborating closely with customers to design tailored solutions using AWS analytics services to unlock the full potential of their data.

Gaurav Sharma is a Specialist Solutions Architect (Analytics) at AWS, supporting US public sector customers on their cloud journey. Outside of work, Gaurav enjoys spending time with his family and reading books.

Hybrid big data analytics with Amazon EMR on AWS Outposts

2025-01-29 Shoukat Ghouse

Post Syndicated from Shoukat Ghouse original https://aws.amazon.com/blogs/big-data/hybrid-big-data-analytics-with-amazon-emr-on-aws-outposts/

Businesses require powerful and flexible tools to manage and analyze vast amounts of information. Amazon EMR has long been the leading solution for processing big data in the cloud. Amazon EMR is the industry-leading big data solution for petabyte-scale data processing, interactive analytics, and machine learning using over 20 open source frameworks such as Apache Hadoop, Hive, and Apache Spark. However, data residency requirements, latency issues, and hybrid architecture needs often challenge purely cloud-based solutions.

Enter Amazon EMR on AWS Outposts—a groundbreaking extension that brings the power of Amazon EMR directly to your on-premises environments. This innovative service merges the scalability, performance (the Amazon EMR runtime for Apache Spark is 4.5 times more performant than Apache Spark 3.5.1), and ease of Amazon EMR with the control and proximity of your data center, empowering enterprises to meet stringent regulatory and operational requirements while unlocking new data processing possibilities.

In this post, we dive into the transformative features of EMR on Outposts, showcasing its flexibility as a native hybrid data analytics service that allows seamless data access and processing both on premises and in the cloud. We also explore how it integrates smoothly with your existing IT infrastructure, providing the flexibility to keep your data where it best fits your needs while performing computations entirely on premises. We examine a hybrid setup where sensitive data remains locally in Amazon S3 on Outposts and public data in an AWS Regional Amazon Simple Storage Service bucket. This configuration allows you to augment your sensitive on-premises data with cloud data while making sure all data processing and compute runs on-premises in AWS Outposts Racks.

Solution overview

Consider a fictional company named Oktank Finance. Oktank aims to build a centralized data lake to store vast amounts of structured and unstructured data, enabling unified access and supporting advanced analytics and big data processing for data-driven insights and innovation. Additionally, Oktank must comply with data residency requirements, making sure that confidential data is stored and processed strictly on premises. Oktank also needs to enrich their datasets with non-confidential and public market data stored in the cloud on Amazon S3, which means they should be able to join datasets across their on-premises and cloud data stores.

Traditionally, Oktank’s big data platforms tightly coupled compute and storage resources, creating an inflexible system where decommissioning compute nodes could lead to data loss. To avoid this situation, Oktank aims to decouple compute from storage, allowing them to scale down compute nodes and repurpose them for other workloads without compromising data integrity and accessibility.

To meet these requirements, Oktank decides to adopt Amazon EMR on Outposts as their big data analytics platform and Amazon S3 on Outposts as their on-premises data store for their data lake. With EMR on Outposts, Oktank can make sure that all compute occurs on premises within their Outposts rack while still being able to query and join the public data stored in Amazon S3 with their confidential data stored in S3 on Outposts, using the same unified data APIs. For data processing, Oktank can choose from a wide variety of applications available on Amazon EMR. In this post, we use Spark as the data processing framework.

This approach makes sure that all data processing and analytics are performed locally within their on-premises environment, allowing Oktank to maintain compliance with data privacy and regulatory requirements. Simultaneously, by avoiding the need to replicate public data to their on-premises data centers, Oktank reduces storage costs and simplifies their end-to-end data pipelines by eliminating additional data movement jobs.

The following diagram illustrates the high-level solution architecture.

As explained earlier, the S3 on Outposts bucket in the architecture holds Oktank’s sensitive data, which stays on the Outpost in Oktank’s data center while the Regional S3 bucket holds the non-sensitive data.

In this post, to achieve high network performance from the Outpost to the Regional S3 bucket and vice-versa, we also use AWS Direct Connect with a virtual private gateway. This is especially beneficial when you need higher query throughput to the Regional S3 bucket by making sure the traffic is routed through your own dedicated network channel to AWS.

The solution involves deploying an EMR cluster on an Outposts rack. A service link connects AWS Outposts to a Region. The service link is a necessary connection between your Outposts and the Region (or home Region). It allows for the management of the Outposts and the exchange of traffic to and from the Region.

You can also access Regional S3 buckets using this service link. However, in this post, we employ an alternate option to enable the EMR cluster to privately access the Regional S3 bucket through the local gateway. This helps optimize data access from the Regional S3 bucket as traffic is routed through Direct Connect.

To enable the EMR cluster to access Amazon S3 privately over Direct Connect, a route is configured in the Outposts subnet (marked as 2 in the architecture diagram) to direct Amazon S3 traffic through the local gateway. Upon reaching the local gateway, the traffic is routed over Direct Connect (private virtual interface) to a virtual private gateway in the Region. The second VPC (5 in diagram), which includes the S3 interface endpoint, is connected to this virtual private gateway. A route is then added to make sure that traffic can return to the EMR cluster. This setup provides more efficient, higher-bandwidth communication between the EMR cluster and Regional S3 buckets.

For big data processing, we use Amazon EMR. Amazon EMR supports access to local S3 on Outposts with the Apache Hadoop S3A connector from Amazon EMR version 7.0.0 onwards. EMR File System (EMRFS) with S3 on Outposts is not supported. We use EMR Studio notebooks for running interactive queries on the data. We also submit Spark jobs as a step on the EMR cluster. We also use the AWS Glue Data Catalog as the external Hive compatible metastore, which serves as the central technical metadata catalog. The Data Catalog is a centralized metadata repository for all your data assets across various data sources. It provides a unified interface to store and query information about data formats, schemas, and sources. Additionally, we use AWS Lake Formation for access controls on the AWS Glue table. You still need to control the raw files access on the S3 on Outposts bucket with AWS Identity and Access Management (IAM) permissions in this architecture. At the time of writing, Lake Formation can’t directly manage access to data on the S3 on Outposts bucket. Access to the actual data files stored in the S3 on Outposts bucket is managed with IAM permissions.

In the following sections, you will implement this architecture for Oktank. We focus on a specific use case for Oktank Finance, where they maintain sensitive customer stockholding data in a local S3 on Outposts bucket. Additionally, they have publicly available stock details stored in a Regional S3 bucket. Their goal is to explore both the datasets within their on-premises Outpost setup. Additionally, they need to enrich the customer stock holdings data by combining it with the publicly available stock details data.

First, we explore how to access both datasets using an EMR cluster. Then, we demonstrate the process of performing joins between the local and public data. We also demonstrate how to use Lake Formation to effectively manage permissions for these tables. We explore two primary scenarios throughout this walkthrough. In the interactive use case, we demonstrate how users can connect to the EMR cluster and run queries interactively using EMR Studio notebooks. This approach allows for real-time data exploration and analysis. Additionally, we show you how to submit batch jobs to Amazon EMR using EMR steps for automated, scheduled data processing. This method is ideal for recurring tasks or large-scale data transformations.

Prerequisites

Complete the following prerequisite steps:

Have an AWS account and a role with administrator access. If you don’t have an account, you can create one.
Have an Outposts rack installed and running.
Create an EC2 key pair. This allows you to connect to the EMR cluster nodes even if Regional connectivity is lost.
Set up Direct Connect. This is required only if you want to deploy the second AWS CloudFormation template as explained in the following section.

Deploy the CloudFormation stacks

In this post, we’ve divided the setup into four CloudFormation templates, each responsible for provisioning a specific component of the architecture. The templates come with default parameters, which you may need to adjust based on your specific configuration requirements.

Stack1 provisions the network infrastructure on Outposts. It also creates the S3 on Outposts bucket and Regional S3 bucket. It copies the sample data to the buckets to simulate the data setup for Oktank. Confidential data for customer stock holdings is copied to the S3 on Outposts bucket, and non-confidential data for stock details is copied to the Regional S3 bucket.

Stack2 provisions the infrastructure to connect to the Regional S3 bucket privately using Direct Connect. It establishes a VPC with private connectivity to both the regional S3 bucket and the Outposts subnet. It also creates an Amazon S3 VPC interface endpoint to allow private access to Amazon S3. It establishes a virtual private gateway for connectivity between the VPC and Outposts subnet. Lastly, it configures a private Amazon Route 53 hosted zone for Amazon S3, enabling private DNS resolution for S3 endpoints within the VPC. You can skip deploying this stack if you don’t need to route traffic using Direct Connect.

Stack3 provisions the EMR cluster infrastructure, AWS Glue database, and AWS Glue tables. The stack creates an AWS Glue database named oktank_outpostblog_temp and three tables under it: stock_details, stockholdings_info, and stockholdings_info_detailed. The table stock_details contains public information for the stocks, and the data location of this table points to the Regional S3 bucket. The tables stockholdings_info and stockholdings_info_detailed contain confidential information, and their data location is in the S3 on Outposts bucket. It also creates a runtime role named outpostblog-runtimeRole1. A runtime role is an IAM role that you associate with an EMR step, and jobs use this role to access AWS resources. With runtime roles for EMR steps, you can specify different IAM roles for the Spark and the Hive jobs, thereby scoping down access at a job level. This allows you to simplify access controls on a single EMR cluster that is shared between multiple tenants, wherein each tenant can be isolated using IAM roles. This stack also grants the required permissions on the runtime role to grant access on the Regional S3 bucket and the S3 on Outposts bucket. The EMR cluster uses a bootstrap action that runs a script to copy sample data to the S3 on Outposts bucket and the Regional S3 bucket for the two tables.

Stack4 provisions the EMR Studio. We will connect to EMR Studio notebook and interact with the data stored across S3 on Outposts and the Regional S3 bucket. This stack outputs the EMR Studio URL, which you can use to connect to EMR Studio.

Run the preceding CloudFormation stacks in sequence with an admin role to create the solution resources.

Access the data and join tables

To verify the solution, complete the following steps:

On the AWS CloudFormation console, navigate to the Outputs tab of Stack4, which deployed the EMR Studio, and choose the EMR Studio URL.

This will open EMR Studio in a new window.

Create a workspace and use the default options.

The workspace will launch in a new tab.

Connect to the EMR cluster using the runtime role (outpostblog-runtimeRole1).

You are now connected to the EMR cluster.

Choose the File Browser tab and open the notebook while choosing the kernel as PySpark.
Run the following query in the notebook to read from the stock details table. This table points to public data stored in the Regional S3 bucket.
```
spark.sql("select * from oktank_outpostblog_temp.stock_details").show(5)
```
Run the following query to read from the confidential data stored in the local S3 on Outposts bucket:
```
spark.sql("select * from oktank_outpostblog_temp.stockholdings_info").show(5)
```

As highlighted earlier, one of the requirements for Oktank is to enrich the preceding data with data from the Regional S3 bucket.

Run the following query to join the preceding two tables:

spark.sql("select customerid,sharesheld,purchasedate, a.stockid, b.stockname,b.category,b.currentprice from oktank_outpostblog_temp.stockholdings_info a inner join oktank_outpostblog_temp.stock_details b on a.stockid=b.stockid order by customerid").show(10)

Control access to tables using Lake Formation

In this post, we also showcase how you can control access to the tables using Lake Formation. To demonstrate, let’s block access to RuntimeRole1 on the stockholdings_info table.

On the Lake Formation console, choose Tables in the navigation pane.
Select the table stockholdings_info and on the Actions menu, choose View to view the current access permissions on this table.
Select IAMAllowedPrincipals from the list of principals and choose Revoke to revoke the permission.
Go back to the EMR Studio notebook and rerun the earlier query.

Oktank’s data access query fails because Lake Formation has denied permission to the runtime role; you will need to adjust the permissions.

To resolve this issue, return to the Lake Formation console, select the stockholdings_info table, and on the Actions menu, choose Grant.
Assign the necessary permissions to the runtime role to make sure it can access the table.
Select IAM users and roles and choose the runtime role (outpostblog-runtimeRole1).
Choose the table stockholdings_info from the list of tables and for Table permissions, select Select.
Select All data access and choose Grant.
Go back to the notebook and rerun the query.

The query now succeeds because we granted access to the runtime role connected to the EMR cluster through the EMR Studio notebook. This demonstrates how Lake Formation allows you to manage permissions on your Data Catalog tables.

The previous steps only restrict access to the table in the catalog, not to the actual data files stored in the S3 on Outposts bucket. To control access to these data files, you need to use IAM permissions. As mentioned earlier, Stack3 in this post handles the IAM permissions for the data. For access control on the Regional S3 bucket with Lake Formation, you don’t need to specifically provide IAM permissions on the specific S3 bucket to the roles. Lake Formation manages the Regional S3 bucket access controls for runtime roles. Refer to Introducing runtime roles for Amazon EMR steps: Use IAM roles and AWS Lake Formation for access control with Amazon EMR for detailed guidance on managing access to a Regional S3 bucket with Lake Formation and EMR runtime roles.

Submit a batch job

Next, let’s submit a batch job as an EMR step on the EMR cluster. Before we do that, let’s confirm there is currently no data in the table stockholdings_info_detailed. Run the following query in the notebook:

spark.sql("select * from oktank_outpostblog_temp.stockholdings_info_detailed").show(10)

You will not see any data in this table. You can now detach the notebook from the cluster.
You will now insert data in this table using a batch job submitted as an EMR step.

On the EMR console, navigate to the cluster EMROutpostBlog and submit a step.
Choose Spark Application for Type.
Select the py script from the scripts folder in your S3 bucket created by the CloudFormation template.
For Permissions, choose the runtime role (outpostblog-RuntimeRole1).
Choose Add step to submit the job.

Wait for the job to complete. The job inserted data into the stockholdings_info_detailed table. You can rerun the earlier query in the notebook to verify the data:

spark.sql("select * from oktank_outpostblog_temp.stockholdings_info_detailed").show(10)

Clean up

To avoid incurring further charges, delete the CloudFormation stacks.

Before deleting Stack4, run the following shell command (with the %%sh magic command) in the EMR Studio notebook to delete the objects from the S3 on Outposts bucket:

aws s3api delete-objects --bucket <replace with value of key S3OutpostBucketAccessPointAlias1 from stack 3 output> --delete "$(aws s3api list-object-versions --bucket <replace with value of key S3OutpostBucketAccessPointAlias1 from stack 3 output> --output=json | jq '{Objects: [.Versions[]|{Key:.Key,VersionId:.VersionId}], Quiet: true}')"

Next, manually delete the EMR workspace from the EMR Studio.
You can now delete the stacks, starting with Stack4, Stack3, Stack2, and finally Stack1.

Conclusion

In this post, we demonstrated how to use Amazon EMR on Outposts as a managed big data processing service in your on-premises setup. We explored how you can set up the cluster to access data stored in an S3 on Outposts bucket on premises and also efficiently access data in the Regional S3 bucket with private networking. We also explored Glue Data Catalog as a serverless external Hive metastore and managed access control to the catalog tables using Lake Formation. We accessed the data interactively using EMR Studio notebooks and processed it as a batch job using EMR steps.

To learn more, visit Amazon EMR on AWS Outposts.

For further reading, refer to the following resources:

About the Authors

Shoukat Ghouse is a Senior Big Data Specialist Solutions Architect at AWS. He helps customers around the world build robust, efficient and scalable data platforms on AWS leveraging AWS analytics services like AWS Glue, AWS Lake Formation, Amazon Athena and Amazon EMR.

Fernando Galves is an Outpost Solutions Architect at AWS, specializing in networking, security, and hybrid cloud architectures. He helps customers design and implement secure hybrid environments using AWS Outposts, focusing on complex networking solutions and seamless integration between on-premises and cloud infrastructure.

Amazon EMR 7.5 runtime for Apache Spark and Iceberg can run Spark workloads 3.6 times faster than Spark 3.5.3 and Iceberg 1.6.1

2024-12-27 Atul Payapilly

Post Syndicated from Atul Payapilly original https://aws.amazon.com/blogs/big-data/amazon-emr-7-5-runtime-for-apache-spark-and-iceberg-can-run-spark-workloads-3-6-times-faster-than-spark-3-5-3-and-iceberg-1-6-1/

The Amazon EMR runtime for Apache Spark offers a high-performance runtime environment while maintaining 100% API compatibility with open source Apache Spark and Apache Iceberg table format. Amazon EMR on EC2, Amazon EMR Serverless, Amazon EMR on Amazon EKS, Amazon EMR on AWS Outposts and AWS Glue all use the optimized runtimes.

In this post, we demonstrate the performance benefits of using the Amazon EMR 7.5 runtime for Spark and Iceberg compared to open source Spark 3.5.3 with Iceberg 1.6.1 tables on the TPC-DS 3TB benchmark v2.13.

Iceberg is a popular open source high-performance format for large analytic tables. Our benchmarks demonstrate that Amazon EMR can run TPC-DS 3 TB workloads 3.6 times faster, reducing the runtime from 1.54 hours to 0.42 hours. Additionally, the cost efficiency improves by 2.9 times, with the total cost decreasing from $16.00 to $5.39 when using Amazon Elastic Compute Cloud (Amazon EC2) On-Demand r5d.4xlarge instances, providing observable gains for data processing tasks.

This is a further 32% increase from the optimizations shipped in Amazon EMR 7.1 covered in a previous post, Amazon EMR 7.1 runtime for Apache Spark and Iceberg can run Spark workloads 2.7 times faster than Apache Spark 3.5.1 and Iceberg 1.5.2. Since then we have continued adding more support for DataSource V2 for eight more existing query optimizations in the EMR runtime for Spark.

In addition to these DataSource V2 specific improvements, we have made more optimizations to Spark operators since Amazon EMR 7.1 that also contribute to the additional speedup.

Benchmark results for Amazon EMR 7.5 compared to4 open source Spark 3.5.3 and Iceberg 1.6.1

To assess the Spark engine’s performance with the Iceberg table format, we performed benchmark tests using the 3 TB TPC-DS dataset, version 2.13 (our results derived from the TPC-DS dataset are not directly comparable to the official TPC-DS results due to setup differences). Benchmark tests for the EMR runtime for Spark and Iceberg were conducted on Amazon EMR 7.5 EC2 clusters vs open source Spark 3.5.3 and Iceberg 1.6.1 on EC2 clusters.

The setup instructions and technical details are available in our GitHub repository. To minimize the influence of external catalogs like AWS Glue and Hive, we used the Hadoop catalog for the Iceberg tables. This uses the underlying file system, specifically Amazon S3, as the catalog. We can define this setup by configuring the property spark.sql.catalog.<catalog_name>.type. The fact tables used the default partitioning by the date column, which have a number of partitions varying from 200–2,100. No precalculated statistics were used for these tables.

We ran a total of 104 SparkSQL queries in three sequential rounds, and the average runtime of each query across these rounds was taken for comparison. The average runtime for the three rounds on Amazon EMR 7.5 with Iceberg enabled was 0.42 hours, demonstrating a 3.6-fold speed increase compared to open source Spark 3.5.3 and Iceberg 1.6.1. The following figure presents the total runtimes in seconds.

EMR vs OSS runtime

The following table summarizes the metrics.

Metric	Amazon EMR 7.5 on EC2	Amazon EMR 7.1 on EC2	Open Source Spark 3.5.3 and Iceberg 1.6.1
Average runtime in seconds	1535.62	2033.17	5546.16
Geometric mean over queries in seconds	8.30046	10.13153	20.40555
Cost*	$5.39	$7.18	$16.00

*Detailed cost estimates are discussed later in this post.

The following chart demonstrates the per-query performance improvement of Amazon EMR 7.5 relative to open source Spark 3.5.3 and Iceberg 1.6.1. The extent of the speedup varies from one query to another, with the fastest up to 9.4 times faster for q93, with Amazon EMR outperforming open source Spark with Iceberg tables. The horizontal axis arranges the TPC-DS 3TB benchmark queries in descending order based on the performance improvement seen with Amazon EMR, and the vertical axis depicts the magnitude of this speedup as a ratio.

EMR vs OSS per query cost

Cost comparison

Our benchmark provides the total runtime and geometric mean data to assess the performance of Spark and Iceberg in a complex, real-world decision support scenario. For additional insights, we also examine the cost aspect. We calculate cost estimates using formulas that account for EC2 On-Demand instances, Amazon Elastic Block Store (Amazon EBS), and Amazon EMR expenses.

Amazon EC2 cost (includes SSD cost) = number of instances * r5d.4xlarge hourly rate * job runtime in hours
- r5d.4xlarge hourly rate = $1.152 per hour in us-east-1
Root Amazon EBS cost = number of instances * Amazon EBS per GB-hourly rate * root EBS volume size * job runtime in hours
Amazon EMR cost = number of instances * r5d.4xlarge Amazon EMR cost * job runtime in hours
- 4xlarge Amazon EMR cost = $0.27 per hour
Total cost = Amazon EC2 cost + root Amazon EBS cost + Amazon EMR cost

The calculations reveal that the Amazon EMR 7.5 benchmark yields a 2.9-fold cost efficiency improvement over open source Spark 3.5.3 and Iceberg 1.6.1 in running the benchmark job.

Metric	Amazon EMR 7.5	Amazon EMR 7.1	Open Source Spark 3.5.1 and Iceberg 1.5.2
Runtime in hours	0.426	0.564	1.540
Number of EC2 instances (Includes primary node)	9	9	9
Amazon EBS Size	20gb	20gb	20gb
Amazon EC2 (Total runtime cost)	$4.35	$5.81	$15.97
Amazon EBS cost	$0.01	$0.01	$0.04
Amazon EMR cost	$1.02	$1.36	$0
Total cost	$5.38	$7.18	$16.01
Cost savings	Amazon EMR 7.5 is 2.9 times better	Amazon EMR 7.1 is 2.2 times better	Baseline

In addition to the time-based metrics discussed so far, data from Spark event logs show that Amazon EMR scanned approximately 3.4 times less data from Amazon S3 and 4.1 times fewer records than the open source version in the TPC-DS 3 TB benchmark. This reduction in Amazon S3 data scanning contributes directly to cost savings for Amazon EMR workloads.

Run open source Spark benchmarks on Iceberg tables

We used separate EC2 clusters, each equipped with nine r5d.4xlarge instances, for testing both open source Spark 3.5.3 and Amazon EMR 7.5 for Iceberg workload. The primary node was equipped with 16 vCPU and 128 GB of memory, and the eight worker nodes together had 128 vCPU and 1024 GB of memory. We conducted tests using the Amazon EMR default settings to showcase the typical user experience and minimally adjusted the settings of Spark and Iceberg to maintain a balanced comparison.

The following table summarizes the Amazon EC2 configurations for the primary node and eight worker nodes of type r5d.4xlarge.

EC2 Instance	vCPU	Memory (GiB)	Instance Storage (GB)	EBS Root Volume (GB)
r5d.4xlarge	16	128	2 x 300 NVMe SSD	20 GB

Prerequisites

The following prerequisites are required to run the benchmarking:

Using the instructions in the emr-spark-benchmark GitHub repo, set up the TPC-DS source data in your S3 bucket and on your local computer.
Build the benchmark application following the steps provided in Steps to build spark-benchmark-assembly application and copy the benchmark application to your S3 bucket. Alternatively, copy spark-benchmark-assembly-3.5.3.jar to your S3 bucket.
Create Iceberg tables from the TPC-DS source data. Follow the instructions on GitHub to create Iceberg tables using the Hadoop catalog. For example, the following code uses an EMR 7.5 cluster with Iceberg enabled to create the tables:

aws emr add-steps 
--cluster-id <cluster-id> --steps Type=Spark,Name="Create Iceberg Tables",
Args=[--class,com.amazonaws.eks.tpcds.CreateIcebergTables,--conf,spark.sql.extensions=org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions,
--conf,spark.sql.catalog.hadoop_catalog=org.apache.iceberg.spark.SparkCatalog,
--conf,spark.sql.catalog.hadoop_catalog.type=hadoop,
--conf,spark.sql.catalog.hadoop_catalog.warehouse=s3://<bucket>/<warehouse_path>/,
--conf,spark.sql.catalog.hadoop_catalog.io-impl=org.apache.iceberg.aws.s3.S3FileIO,
s3://<bucket>/<jar_location>/spark-benchmark-assembly-3.5.3.jar,s3://blogpost-sparkoneks-us-east-1/blog/BLOG_TPCDS-TEST-3T-partitioned/,
/home/hadoop/tpcds-kit/tools,parquet,3000,true,<database_name>,true,true],ActionOnFailure=CONTINUE --region <AWS region>

Note the Hadoop catalog warehouse location and database name from the preceding step. We use the same iceberg tables to run benchmarks with Amazon EMR 7.5 and open source Spark.

This benchmark application is built from the branch tpcds-v2.13_iceberg. If you’re building a new benchmark application, switch to the correct branch after downloading the source code from the GitHub repo.

Create and configure a YARN cluster on Amazon EC2

To compare Iceberg performance between Amazon EMR on Amazon EC2 and open source Spark on Amazon EC2, follow the instructions in the emr-spark-benchmark GitHub repo to create an open source Spark cluster on Amazon EC2 using Flintrock with eight worker nodes.

Based on the cluster selection for this test, the following configurations are used:

Make sure to replace the placeholder <private ip of primary node>, in the yarn-site.xml file, with the primary node’s IP address of your Flintrock cluster.

Run the TPC-DS benchmark with Spark 3.5.3 and Iceberg 1.6.1

Complete the following steps to run the TPC-DS benchmark:

Log in to the open source cluster primary node using flintrock login $CLUSTER_NAME.
Submit your Spark job:
1. Choose the correct Iceberg catalog warehouse location and database that has the created Iceberg tables.
2. The results are created in s3://<YOUR_S3_BUCKET>/benchmark_run.
3. You can track progress in /media/ephemeral0/spark_run.log.

spark-submit \
--master yarn \
--deploy-mode client \
--class com.amazonaws.eks.tpcds.BenchmarkSQL \
--conf spark.driver.cores=4 \
--conf spark.driver.memory=10g \
--conf spark.executor.cores=16 \
--conf spark.executor.memory=100g \
--conf spark.executor.instances=8 \
--conf spark.network.timeout=2000 \
--conf spark.executor.heartbeatInterval=300s \
--conf spark.dynamicAllocation.enabled=false \
--conf spark.shuffle.service.enabled=false \
--conf spark.hadoop.fs.s3a.aws.credentials.provider=com.amazonaws.auth.InstanceProfileCredentialsProvider \
--conf spark.hadoop.fs.s3.impl=org.apache.hadoop.fs.s3a.S3AFileSystem \
--conf spark.jars.packages=org.apache.hadoop:hadoop-aws:3.3.4,org.apache.iceberg:iceberg-spark-runtime-3.5_2.12:1.6.1,org.apache.iceberg:iceberg-aws-bundle:1.6.1 \
--conf spark.sql.extensions=org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions   \
--conf spark.sql.catalog.local=org.apache.iceberg.spark.SparkCatalog    \
--conf spark.sql.catalog.local.type=hadoop  \
--conf spark.sql.catalog.local.warehouse=s3a://<YOUR_S3_BUCKET>/<warehouse_path>/ \
--conf spark.sql.defaultCatalog=local   \
--conf spark.sql.catalog.local.io-impl=org.apache.iceberg.aws.s3.S3FileIO   \
spark-benchmark-assembly-3.5.3.jar   \
s3://<YOUR_S3_BUCKET>/benchmark_run 3000 1 false  \
q1-v2.13,q10-v2.13,q11-v2.13,q12-v2.13,q13-v2.13,q14a-v2.13,q14b-v2.13,q15-v2.13,q16-v2.13,\
q17-v2.13,q18-v2.13,q19-v2.13,q2-v2.13,q20-v2.13,q21-v2.13,q22-v2.13,q23a-v2.13,q23b-v2.13,\
q24a-v2.13,q24b-v2.13,q25-v2.13,q26-v2.13,q27-v2.13,q28-v2.13,q29-v2.13,q3-v2.13,q30-v2.13,\
q31-v2.13,q32-v2.13,q33-v2.13,q34-v2.13,q35-v2.13,q36-v2.13,q37-v2.13,q38-v2.13,q39a-v2.13,\
q39b-v2.13,q4-v2.13,q40-v2.13,q41-v2.13,q42-v2.13,q43-v2.13,q44-v2.13,q45-v2.13,q46-v2.13,\
q47-v2.13,q48-v2.13,q49-v2.13,q5-v2.13,q50-v2.13,q51-v2.13,q52-v2.13,q53-v2.13,q54-v2.13,\
q55-v2.13,q56-v2.13,q57-v2.13,q58-v2.13,q59-v2.13,q6-v2.13,q60-v2.13,q61-v2.13,q62-v2.13,\
q63-v2.13,q64-v2.13,q65-v2.13,q66-v2.13,q67-v2.13,q68-v2.13,q69-v2.13,q7-v2.13,q70-v2.13,\
q71-v2.13,q72-v2.13,q73-v2.13,q74-v2.13,q75-v2.13,q76-v2.13,q77-v2.13,q78-v2.13,q79-v2.13,\
q8-v2.13,q80-v2.13,q81-v2.13,q82-v2.13,q83-v2.13,q84-v2.13,q85-v2.13,q86-v2.13,q87-v2.13,\
q88-v2.13,q89-v2.13,q9-v2.13,q90-v2.13,q91-v2.13,q92-v2.13,q93-v2.13,q94-v2.13,q95-v2.13,\
q96-v2.13,q97-v2.13,q98-v2.13,q99-v2.13,ss_max-v2.13    \
true <database> > /media/ephemeral0/spark_run.log 2>&1 &!

Summarize the results

After the Spark job finishes, retrieve the test result file from the output S3 bucket at s3://<YOUR_S3_BUCKET>/benchmark_run/timestamp=xxxx/summary.csv/xxx.csv. This can be done either through the Amazon S3 console by navigating to the specified bucket location or by using the AWS Command Line Interface (AWS CLI). The Spark benchmark application organizes the data by creating a timestamp folder and placing a summary file within a folder labeled summary.csv. The output CSV files contain four columns without headers:

Query name
Median time
Minimum time
Maximum time

With the data from three separate test runs with one iteration each time, we can calculate the average and geometric mean of the benchmark runtimes.

Run the TPC-DS benchmark with the EMR runtime for Spark

Most of the instructions are similar to Steps to run Spark Benchmarking with a few Iceberg-specific details.

Prerequisites

Complete the following prerequisite steps:

Run aws configure to configure the AWS CLI shell to point to the benchmarking AWS account. Refer to Configure the AWS CLI for instructions.
Upload the benchmark application JAR file to Amazon S3.

Deploy the EMR cluster and run the benchmark job

Complete the following steps to run the benchmark job:

Use the AWS CLI command as shown in Deploy EMR on EC2 Cluster and run benchmark job to spin up an EMR on EC2 cluster. Make sure to enable Iceberg. See Create an Iceberg cluster for more details. Choose the correct Amazon EMR version, root volume size, and same resource configuration as the open source Flintrock setup. Refer to create-cluster for a detailed description of the AWS CLI options.
Store the cluster ID from the response. We need this for the next step.
Submit the benchmark job in Amazon EMR using add-steps from the AWS CLI:
1. Replace <cluster ID> with the cluster ID from Step 2.
2. The benchmark application is at s3://<your-bucket>/spark-benchmark-assembly-3.5.3.jar.
3. Choose the correct Iceberg catalog warehouse location and database that has the created Iceberg tables. This should be the same as the one used for the open source TPC-DS benchmark run.
4. The results will be in s3://<your-bucket>/benchmark_run.

aws emr add-steps   --cluster-id <cluster-id>
--steps Type=Spark,Name="SPARK Iceberg EMR TPCDS Benchmark Job",
Args=[--class,com.amazonaws.eks.tpcds.BenchmarkSQL,
--conf,spark.driver.cores=4,
--conf,spark.driver.memory=10g,
--conf,spark.executor.cores=16,
--conf,spark.executor.memory=100g,
--conf,spark.executor.instances=8,
--conf,spark.network.timeout=2000,
--conf,spark.executor.heartbeatInterval=300s,
--conf,spark.dynamicAllocation.enabled=false,
--conf,spark.shuffle.service.enabled=false,
--conf,spark.sql.iceberg.data-prefetch.enabled=true,
--conf,spark.sql.extensions=org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions,
--conf,spark.sql.catalog.local=org.apache.iceberg.spark.SparkCatalog,
--conf,spark.sql.catalog.local.type=hadoop,
--conf,spark.sql.catalog.local.warehouse=s3://<your-bucket>/<warehouse-path>,
--conf,spark.sql.defaultCatalog=local,
--conf,spark.sql.catalog.local.io-impl=org.apache.iceberg.aws.s3.S3FileIO,
s3://<your-bucket>/spark-benchmark-assembly-3.5.3.jar,
s3://<your-bucket>/benchmark_run,3000,1,false,
'q1-v2.13\,q10-v2.13\,q11-v2.13\,q12-v2.13\,q13-v2.13\,q14a-v2.13\,q14b-v2.13\,q15-v2.13\,q16-v2.13\,q17-v2.13\,q18-v2.13\,q19-v2.13\,q2-v2.13\,q20-v2.13\,q21-v2.13\,q22-v2.13\,q23a-v2.13\,q23b-v2.13\,q24a-v2.13\,q24b-v2.13\,q25-v2.13\,q26-v2.13\,q27-v2.13\,q28-v2.13\,q29-v2.13\,q3-v2.13\,q30-v2.13\,q31-v2.13\,q32-v2.13\,q33-v2.13\,q34-v2.13\,q35-v2.13\,q36-v2.13\,q37-v2.13\,q38-v2.13\,q39a-v2.13\,q39b-v2.13\,q4-v2.13\,q40-v2.13\,q41-v2.13\,q42-v2.13\,q43-v2.13\,q44-v2.13\,q45-v2.13\,q46-v2.13\,q47-v2.13\,q48-v2.13\,q49-v2.13\,q5-v2.13\,q50-v2.13\,q51-v2.13\,q52-v2.13\,q53-v2.13\,q54-v2.13\,q55-v2.13\,q56-v2.13\,q57-v2.13\,q58-v2.13\,q59-v2.13\,q6-v2.13\,q60-v2.13\,q61-v2.13\,q62-v2.13\,q63-v2.13\,q64-v2.13\,q65-v2.13\,q66-v2.13\,q67-v2.13\,q68-v2.13\,q69-v2.13\,q7-v2.13\,q70-v2.13\,q71-v2.13\,q72-v2.13\,q73-v2.13\,q74-v2.13\,q75-v2.13\,q76-v2.13\,q77-v2.13\,q78-v2.13\,q79-v2.13\,q8-v2.13\,q80-v2.13\,q81-v2.13\,q82-v2.13\,q83-v2.13\,q84-v2.13\,q85-v2.13\,q86-v2.13\,q87-v2.13\,q88-v2.13\,q89-v2.13\,q9-v2.13\,q90-v2.13\,q91-v2.13\,q92-v2.13\,q93-v2.13\,q94-v2.13\,q95-v2.13\,q96-v2.13\,q97-v2.13\,q98-v2.13\,q99-v2.13\,ss_max-v2.13',
true,<database>],ActionOnFailure=CONTINUE --region <aws-region>

Summarize the results

After the step is complete, you can see the summarized benchmark result at s3://<YOUR_S3_BUCKET>/benchmark_run/timestamp=xxxx/summary.csv/xxx.csv in the same way as the previous run and compute the average and geometric mean of the query runtimes.

Clean up

To prevent any future charges, delete the resources you created by following the instructions provided in the Cleanup section of the GitHub repository.

Summary

Amazon EMR is consistently enhancing the EMR runtime for Spark when used with Iceberg tables, achieving a performance that is 3.6 times faster than open source Spark 3.5.3 and Iceberg 1.6.1 with EMR 7.5 on TPC-DS 3 TB, v2.13. This is a further increase of 32% from EMR 7.1. We encourage you to keep up to date with the latest Amazon EMR releases to fully benefit from ongoing performance improvements.

To stay informed, subscribe to the AWS Big Data Blog’s RSS feed, where you can find updates on the EMR runtime for Spark and Iceberg, as well as tips on configuration best practices and tuning recommendations.

About the Authors

Atul Felix Payapilly is a software development engineer for Amazon EMR at Amazon Web Services.

Udit Mehrotra is an Engineering Manager for EMR at Amazon Web Services.

Run Apache Spark Structured Streaming jobs at scale on Amazon EMR Serverless

2024-12-10 Anubhav Awasthi

Post Syndicated from Anubhav Awasthi original https://aws.amazon.com/blogs/big-data/run-apache-spark-structured-streaming-jobs-at-scale-on-amazon-emr-serverless/

As data is generated at an unprecedented rate, streaming solutions have become essential for businesses seeking to harness near real-time insights. Streaming data—from social media feeds, IoT devices, e-commerce transactions, and more—requires robust platforms that can process and analyze data as it arrives, enabling immediate decision-making and actions.

This is where Apache Spark Structured Streaming comes into play. It offers a high-level API that simplifies the complexities of streaming data, allowing developers to write streaming jobs as if they were batch jobs, but with the power to process data in near real time. Spark Structured Streaming integrates seamlessly with various data sources, such as Amazon Managed Streaming for Apache Kafka (Amazon MSK) and Amazon Kinesis Data Streams, providing a unified solution that supports complex operations like windowed computations, event-time aggregation, and stateful processing. By using Spark’s fast, in-memory processing capabilities, businesses can run streaming workloads efficiently, scaling up or down as needed, to derive timely insights that drive strategic and critical decisions.

The setup of a computing infrastructure to support such streaming workloads poses its challenges. Here, Amazon EMR Serverless emerges as a pivotal solution for running streaming workloads, enabling the use of the latest open source frameworks like Spark without the need for configuration, optimization, security, or cluster management.

Starting with Amazon EMR 7.1, we introduced a new job --mode on EMR Serverless called Streaming. You can submit a streaming job from the EMR Studio console or the StartJobRun API:

aws emr-serverless start-job-run \
 --application-id APPPLICATION_ID \
 --execution-role-arn JOB_EXECUTION_ROLE \
 --mode 'STREAMING' \
 --job-driver '{
     "sparkSubmit": {
         "entryPoint": "s3://streaming script",
         "entryPointArguments": ["s3://DOC-EXAMPLE-BUCKET-OUTPUT/output"],
         "sparkSubmitParameters": "--conf spark.executor.cores=4
            --conf spark.executor.memory=16g
            --conf spark.driver.cores=4
            --conf spark.driver.memory=16g
            --conf spark.executor.instances=3"
     }
 }'

In this post, we highlight some of the key enhancements introduced for streaming jobs.

Performance

The Amazon EMR runtime for Apache Spark delivers a high-performance runtime environment while maintaining 100% API compatibility with open source Spark. Additionally, we have introduced the following enhancements to provide improved support for streaming jobs.

Amazon Kinesis connector with Enhanced Fan-Out Support

Traditional Spark streaming applications reading from Kinesis Data Streams often face throughput limitations due to shared shard-level read capacity, where multiple consumers compete for the default 2 MBps per shard throughput. This bottleneck becomes particularly challenging in scenarios requiring real-time processing across multiple consuming applications.

To address this challenge, we released the open source Amazon Kinesis Data Streams Connector for Spark Structured Streaming that supports enhanced fan-out for dedicated read throughput. Compatible with both provisioned and on-demand Kinesis Data Streams, enhanced fan-out provides each consumer with dedicated throughput of 2 MBps per shard. This enables streaming jobs to process data concurrently without the constraints of shared throughput, significantly reducing latency and facilitating near real-time processing of large data streams. By eliminating competition between consumers and enhancing parallelism, enhanced fan-out provides faster, more efficient data processing, which boosts the overall performance of streaming jobs on EMR Serverless. Starting with Amazon EMR 7.1, the connector comes pre-packaged on EMR Serverless, so you don’t need to build or download any packages.

The following diagram illustrates the architecture using shared throughput.

The following diagram illustrates the architecture using enhanced fan-out and dedicated throughput.

Refer to Build Spark Structured Streaming applications with the open source connector for Amazon Kinesis Data Streams for additional details on this connector.

Cost optimization

EMR Serverless charges are based on the total vCPU, memory, and storage resources utilized during the time workers are active, from when they are ready to execute tasks until they stop. To optimize costs, it is crucial to scale streaming jobs effectively. We have introduced the following enhancements to improve scaling at both the task level and across multiple tasks.

Fine-Grained Scaling

In practical scenarios, data volumes can be unpredictable and exhibit sudden spikes, necessitating a platform capable of dynamically adjusting to workload changes. EMR Serverless eliminates the risks of over- or under-provisioning resources for your streaming workloads. EMR Serverless scaling uses Spark dynamic allocation to correctly scale the executors according to demand. The scalability of a streaming job is also influenced by its data source to make sure Kinesis shards or Kafka partitions are also scaled accordingly. Each Kinesis shard and Kafka partition corresponds to a single Spark executor core. To achieve optimal throughput, use a one-to-one ratio of Spark executor cores to Kinesis shards or Kafka partitions.

Streaming operates through a sequence of micro-batch processes. In cases of short-running tasks, overly aggressive scaling can lead to resource wastage due to the overhead of allocating executors. To mitigate this, consider modifying spark.dynamicAllocation.executorAllocationRatio. The scaling down process is shuffle aware, avoiding executors holding shuffle data. Although this shuffle data is usually subject to garbage collection, if it’s not being cleared fast enough, the spark.dynamicAllocation.shuffleTracking.timeout setting can be adjusted to determine when executors should be timed out and removed.

Let’s examine fine-grained scaling with an example of a spiky workload where data is periodically ingested, followed by idle intervals. The following graph illustrates an EMR Serverless streaming job processing data from an on-demand Kinesis data stream. Initially, the job handles 100 records per second. As tasks queue, dynamic allocation adds capacity, which is quickly released due to short task durations (adjustable using executorAllocationRatio). When we increase input data to 10,000 records per second, Kinesis adds shards, triggering EMR Serverless to provision more executors. Scaling down happens as executors complete processing and are released after the idle timeout (spark.dynamicAllocation.executorIdleTimeout, default 60 seconds), leaving only the Spark driver running during the idle window. (Full scale-down is source dependent. For example, a provisioned Kinesis data stream with a fixed number of shards may have limitations in fully scaling down even when shards are idle.) This pattern repeats as bursts of 10,000 records per second alternate with idle periods, allowing EMR Serverless to scale resources dynamically. This job uses the following configuration:

--conf spark.dynamicAllocation.shuffleTracking.timeout=300s
--conf spark.dynamicAllocation.executorAllocationRatio=0.7

Resiliency

EMR Serverless ensures resiliency in streaming jobs by leveraging automatic recovery and fault-tolerant architectures

Built-in Availability Zone resiliency

Streaming applications drive critical business operations like fraud detection, real-time analytics, and monitoring systems, making any downtime particularly costly. Infrastructure failures at the Availability Zone level can cause significant disruptions to distributed streaming applications, potentially leading to extended downtime and data processing delays.

Amazon EMR Serverless now addresses this challenge with built-in Availability Zone failover capabilities: jobs are initially provisioned in a randomly selected Availability Zone, and, in the event of an Availability Zone failure, the service automatically retries the job in a healthy Availability Zone, minimizing interruptions to processing. Although this feature greatly enhances application reliability, achieving full resiliency requires input data sources that also support Availability Zone failover. Additionally, if you’re using a custom virtual private cloud (VPC) configuration, it is recommended to configure EMR Serverless to operate across multiple Availability Zones to optimize fault tolerance.

The following diagram illustrates a sample architecture.

Auto retry

Streaming applications are susceptible to various runtime failures caused by transient issues such as network connectivity problems, memory pressure, or resource constraints. Without proper retry mechanisms, these temporary failures can lead to permanently stopping jobs, requiring manual intervention to restart the jobs. This not only increases operational overhead but also risks data loss and processing gaps, especially in continuous data processing scenarios where maintaining data consistency is crucial.

EMR Serverless streamlines this process by automatically retrying failed jobs. Streaming jobs use checkpointing to periodically save the computation state to Amazon Simple Storage Service (Amazon S3), allowing failed jobs to restart from the last checkpoint, minimizing data loss and reprocessing time. Although there is no cap on the total number of retries, a thrash prevention mechanism allows you to configure the number of retry attempts per hour, ranging from 1–10, with the default being set to five attempts per hour.

See the following example code:

aws emr-serverless start-job-run \
 --application-id <APPPLICATION_ID>  \
 --execution-role-arn <JOB_EXECUTION_ROLE> \
 --mode 'STREAMING' \
 --retry-policy '{
    "maxFailedAttemptsPerHour": 5
 }'
 --job-driver '{
    "sparkSubmit": {
         "entryPoint": "/usr/lib/spark/examples/jars/spark-examples-does-not-exist.jar",
         "entryPointArguments": ["1"],
         "sparkSubmitParameters": "--class org.apache.spark.examples.SparkPi"
    }
 }'

Observability

EMR Serverless provides robust log management and enhanced monitoring, enabling users to efficiently troubleshoot issues and optimize the performance of streaming jobs.

Event log rotation and compression

Spark streaming applications continuously process data and generate substantial amounts of event log data. The accumulation of these logs can consume significant disk space, potentially leading to degraded performance or even system failures due to disk space exhaustion.

Log rotation mitigates these risks by periodically archiving old logs and creating new ones, thereby maintaining a manageable size of active log files. Event log rotation is enabled by default for both batch as well as streaming jobs and can’t be disabled. Rotating logs doesn’t affect the logs uploaded to the S3 bucket. However, they will be compressed using zstd standard. You can find rotated event logs under the following S3 folder:

<S3-logUri>/applications/<application-id>/jobs/<job-id>/sparklogs/

The following table summarizes key configurations that govern event log rotation.

Configuration	Value	Comment
spark.eventLog.rotation.enabled	TRUE
spark.eventLog.rotation.interval	300 seconds	Specifies time interval for the log rotation
spark.eventLog.rotation.maxFilesToRetain	2	Specifies how many rotated log files to keep during cleanup
spark.eventLog.rotation.minFileSize	1 MB	Specifies a minimum file size to rotate the log file

Application log rotation and compression

One of the most common errors in Spark streaming applications is the no space left on disk errors, primarily caused by the rapid accumulation of application logs during continuous data processing. These Spark streaming application logs from drivers and executors can grow exponentially, quickly consuming available disk space.

To address this, EMR Serverless has introduced rotation and compression for driver and executor stderr and stdout logs. Log files are refreshed every 15 seconds and can range from 0–128 MB. You can find the latest log files at the following Amazon S3 locations:

<S3-logUri>/applications/<application-id>/jobs/<job-id>/SPARK_DRIVER/stderr.gz
<S3-logUri>/applications/<application-id>/jobs/<job-id>/SPARK_DRIVER/stdout.gz
<S3-logUri>/applications/<application-id>/jobs/<job-id>/SPARK_EXECUTOR/stderr.gz
<S3-logUri>/applications/<application-id>/jobs/<job-id>/SPARK_EXECUTOR/stdout.gz

Rotated application logs are pushed to archive available under the following Amazon S3 locations:

<S3-logUri>/applications/<application-id>/jobs/<job-id>/SPARK_DRIVER/archived/
<S3-logUri>/applications/<application-id>/jobs/<job-id>/SPARK_EXECUTOR/<executor-id>/archived/

Enhanced monitoring

Spark provides comprehensive performance metrics for drivers and executors, including JVM heap memory, garbage collection, and shuffle data, which are valuable for troubleshooting performance and analyzing workloads. Starting with Amazon EMR 7.1, EMR Serverless integrates with Amazon Managed Service for Prometheus, enabling you to monitor, analyze, and optimize your jobs using detailed engine metrics, such as Spark event timelines, stages, tasks, and executors. This integration is available when submitting jobs or creating applications. For setup details, refer to Monitor Spark metrics with Amazon Managed Service for Prometheus. To enable metrics for Structured Streaming queries, set the Spark property --conf spark.sql.streaming.metricsEnabled=true

You can also monitor and debug jobs using the Spark UI. The web UI presents a visual interface with detailed information about your running and completed jobs. You can dive into job-specific metrics and information about event timelines, stages, tasks, and executors for each job.

Service integrations

Organizations often struggle with integrating multiple streaming data sources into their data processing pipelines. Managing different connectors, dealing with varying protocols, and providing compatibility across diverse streaming platforms can be complex and time-consuming.

EMR Serverless supports Kinesis Data Streams, Amazon MSK, and self-managed Apache Kafka clusters as input data sources to read and process data in near real time.

Whereas the Kinesis Data Streams connector is natively available on Amazon EMR, the Kafka connector is an open source connector from the Spark community and is available in a Maven repository.

The following diagram illustrates a sample architecture for each connector.

Refer to Supported streaming connectors to learn more about using these connectors.

Additionally, you can refer to the aws-samples GitHub repo to set up a streaming job reading data from a Kinesis data stream. It uses the Amazon Kinesis Data Generator to generate test data.

Conclusion

Running Spark Structured Streaming on EMR Serverless offers a robust and scalable solution for real-time data processing. By taking advantage of the seamless integration with AWS services like Kinesis Data Streams, you can efficiently handle streaming data with ease. The platform’s advanced monitoring tools and automated resiliency features provide high availability and reliability, minimizing downtime and data loss. Furthermore, the performance optimizations and cost-effective serverless model make it an ideal choice for organizations looking to harness the power of near real-time analytics without the complexities of managing infrastructure.

Try out using Spark Structured Streaming on EMR Serverless for your own use case, and share your questions in the comments.

About the Authors

Anubhav Awasthi is a Sr. Big Data Specialist Solutions Architect at AWS. He works with customers to provide architectural guidance for running analytics solutions on Amazon EMR, Amazon Athena, AWS Glue, and AWS Lake Formation.

Kshitija Dound is an Associate Specialist Solutions Architect at AWS based in New York City, focusing on data and AI. She collaborates with customers to transform their ideas into cloud solutions, using AWS big data and AI services. In her spare time, Kshitija enjoys exploring museums, indulging in art, and embracing NYC’s outdoor scene.

Paul Min is a Solutions Architect at AWS, where he works with customers to advance their mission and accelerate their cloud adoption. He is passionate about helping customers reimagine what’s possible with AWS. Outside of work, Paul enjoys spending time with his wife and golfing.

Amazon EMR streamlines big data processing with simplified Amazon S3 Glacier access

2024-11-28 Giovanni Matteo Fumarola

Post Syndicated from Giovanni Matteo Fumarola original https://aws.amazon.com/blogs/big-data/amazon-emr-streamlines-big-data-processing-with-simplified-amazon-s3-glacier-access/

Amazon S3 Glacier serves several important audit use cases, particularly for organizations that need to retain data for extended periods due to regulatory compliance, legal requirements, or internal policies. S3 Glacier is ideal for long-term data retention and archiving of audit logs, financial records, healthcare information, and other compliance-related data. Its low-cost storage model makes it economically feasible to store vast amounts of historical data for extended periods of time. The data immutability and encryption features of S3 Glacier uphold the integrity and security of stored audit trails, which is crucial for maintaining a reliable chain of evidence. The service supports configurable vault lock policies, allowing organizations to enforce retention rules and prevent unauthorized deletion or modification of audit data. The integration of S3 Glacier with AWS CloudTrail also provides an additional layer of auditing for all API calls made to S3 Glacier, helping organizations monitor and log access to their archived data. These features make S3 Glacier a robust solution for organizations needing to maintain comprehensive, tamper-evident audit trails for extended periods while managing costs effectively.

S3 Glacier offers significant cost savings for data archiving and long-term backup compared to standard Amazon Simple Storage Service (Amazon S3) storage. It provides multiple storage tiers with varying access times and costs, allowing optimization based on specific needs. By implementing S3 Lifecycle policies, you can automatically transition data from more expensive Amazon S3 tiers to cost-effective S3 Glacier storage classes. Its flexible retrieval options enable further cost optimization by choosing slower, less expensive retrieval for non-urgent data. Additionally, Amazon offers discounts for data stored in S3 Glacier over extended periods, making it particularly cost-effective for long-term archival storage. These features allow organizations to substantially reduce storage costs, especially for large volumes of infrequently accessed data, while meeting compliance and regulatory requirements. For more details, see Understanding S3 Glacier storage classes for long-term data storage.

Prior to Amazon EMR 7.2, EMR clusters couldn’t directly read from or write to the S3 Glacier storage classes. This limitation made it challenging to process data stored in S3 Glacier as part of EMR jobs without first transitioning the data to a more readily accessible Amazon S3 storage class.

The inability to directly access S3 Glacier data meant that workflows involving both active data in Amazon S3 and archived data in S3 Glacier were not seamless. Users often had to implement complex workarounds or multi-step processes to include S3 Glacier data in their EMR jobs. Without built-in S3 Glacier support, organizations couldn’t take full advantage of the cost savings in S3 Glacier for large-scale data analysis tasks on historical or infrequently accessed data.

Although S3 Lifecycle policies could move data to S3 Glacier, EMR jobs couldn’t easily incorporate this archived data into their processing without manual intervention or separate data retrieval steps.

The lack of seamless S3 Glacier integration made it challenging to implement a truly unified data lake architecture that could efficiently span across hot, warm, and cold data tiers.These limitations often required users to implement complex data management strategies or accept higher storage costs to keep data readily accessible for Amazon EMR processing. The improvements in Amazon EMR 7.2 aimed to address these issues, providing more flexibility and cost-effectiveness in big data processing across various storage tiers.

In this post, we demonstrate how to set up and use Amazon EMR on EC2 with S3 Glacier for cost-effective data processing.

Solution overview

With the release of Amazon EMR 7.2.0, significant improvements have been made in handling S3 Glacier objects:

Improved S3A protocol support – You can now read restored S3 Glacier objects directly from Amazon S3 locations using the S3A protocol. This enhancement streamlines data access and processing workflows.
Intelligent S3 Glacier file handling – Starting from Amazon EMR 7.2.0+, the S3A connector can differentiate between S3 Glacier and S3 Glacier Deep Archive objects. This capability prevents AmazonS3Exceptions from occurring when attempting to access S3 Glacier objects that have a restore operation in progress.
Selective read operations – The new version intelligently ignores archived S3 Glacier objects that are still in the process of being restored, enhancing operational efficiency.
Customizable S3 Glacier object handling – A new setting, fs.s3a.glacier.read.restored.objects, offers three options for managing S3 Glacier objects:
- READ_ALL (Default) – Amazon EMR processes all objects regardless of their storage class.
- SKIP_ALL_GLACIER – Amazon EMR ignores S3 Glacier-tagged objects, similar to the default behavior of Amazon Athena.
- READ_RESTORED_GLACIER_OBJECTS – Amazon EMR checks the restoration status of S3 Glacier objects. Restored objects are processed like standard S3 objects, and unrestored ones are ignored. This behavior is the same as Athena if you configure the table property as described in Query restored Amazon S3 Glacier objects.

These enhancements provide you with greater flexibility and control over how Amazon EMR interacts with S3 Glacier storage, improving both performance and cost-effectiveness in data processing workflows.

Amazon EMR 7.2.0 and later versions offer improved integration with S3 Glacier storage, enabling cost-effective data analysis on archived data. In this post, we walk through the following steps to set up and test this integration:

Create an S3 bucket. This will serve as the primary storage location for your data.
Load and transition data:
- Upload your dataset to S3.
- Use lifecycle policies to transition the data to the S3 Glacier storage class.
Create an EMR Cluster. Make sure you’re using Amazon EMR version 7.2.0 or higher.
Initiate data restoration by submitting a restore request for the S3 Glacier data before processing.
To configure the Amazon EMR for S3 Glacier integration, set the fs.s3a.glacier.read.restored.objects property to READ_RESTORED_GLACIER_OBJECTS. This enables Amazon EMR to properly handle restored S3 Glacier objects.
Run Spark queries on the restored data through Amazon EMR.

Consider the following best practices:

Plan workflows around S3 Glacier restore times
Monitor costs associated with data restoration and processing
Regularly review and optimize your data lifecycle policies

By implementing this integration, organizations can significantly reduce storage costs while maintaining the ability to analyze historical data when needed. This approach is particularly beneficial for large-scale data lakes and long-term data retention scenarios.

Prerequisites

The setup requires the following prerequisites:

An active AWS account with appropriate permissions.
An AWS Identity and Access Management (IAM) role with necessary permissions for Amazon EMR, Amazon S3, and S3 Glacier operations. For more information, see Configure IAM service roles for Amazon EMR permissions to AWS services and resources.
Access to create and manage S3 buckets. For more information, refer to Creating a bucket.
The ability to create and manage EMR clusters (version 7.2.0 or higher). For more information, see Tutorial: Getting started with Amazon EMR.
- An understanding of S3 Glacier storage classes and retrieval options. For more details, refer to the Amazon S3 Glacier Developer Guide.
Knowledge for creating a cluster with Apache Spark. For more details, check here.
A virtual private cloud (VPC) and subnet configurations for EMR cluster deployment. For more information, refer to Launch clusters into a VPC with Amazon EMR.
An Amazon Elastic Compute Cloud (Amazon EC2) key pair for EMR cluster access. To learn more, see Amazon EC2 key pairs and Amazon EC2 instances.
An allocated budget for AWS resource usage for running the example. For more details, see Managing your costs with AWS Budgets.

Create an S3 bucket

Create an S3 bucket with different S3 Glacier objects as listed in the following code:

aws s3api put-object --bucket reinvent-glacier-demo --key T1/year=2024/month=1/day=1/
aws s3api put-object --bucket reinvent-glacier-demo --key T1/year=2024/month=1/day=2/

aws s3api put-object --bucket reinvent-glacier-demo --key T1/year=2023/month=1/day=1/
aws s3api put-object --bucket reinvent-glacier-demo --key T1/year=2023/month=1/day=2/

aws s3api put-object --bucket reinvent-glacier-demo --key T1/year=2022/month=1/day=1/
aws s3api put-object --bucket reinvent-glacier-demo --key T1/year=2022/month=1/day=2/

aws s3api put-object --bucket reinvent-glacier-demo --key T1/year=2021/month=1/day=1/
aws s3api put-object --bucket reinvent-glacier-demo --key T1/year=2021/month=1/day=2/

For more information, refer to Creating a bucket and Setting an S3 Lifecycle configuration on a bucket.

The following is the list of objects:

ls | sort

glacier_deep_archive_1.txt
glacier_deep_archive_2.txt
glacier_flexible_retrieval_formerly_glacier_1.txt
glacier_flexible_retrieval_formerly_glacier_2.txt
glacier_instant_retrieval_1.txt
glacier_instant_retrieval_2.txt
standard_s3_file_1.txt
standard_s3_file_2.txt

The content of the objects is as follows:

ls ./* | sort | xargs cat

Long-lived archive data accessed less than once a year with retrieval of hours
Long-lived archive data accessed less than once a year with retrieval of hours
Long-lived archive data accessed once a year with retrieval of minutes to hours
Long-lived archive data accessed once a year with retrieval of minutes to hours
Long-lived archive data accessed once a quarter with instant retrieval in milliseconds
Long-lived archive data accessed once a quarter with instant retrieval in milliseconds
standard s3 file 1
standard s3 file 2

S3 Glacier Instant Retrieval objects

For more information about S3 Glacier Instance Retrieval objects, see Appendix A at the end of this post. The objects are listed as follows:

glacier_instant_retrieval_1.txt
glacier_instant_retrieval_2.txt

The objects include the following contents:

Long-lived archive data accessed once a quarter with instant retrieval in milliseconds

To set different storage classes for objects in different folders, use the –storage-class parameter when uploading objects or change the storage class after upload:

aws s3 cp glacier_instant_retrieval_1.txt s3://reinvent-glacier-demo/T1/year=2023/month=1/day=1/ --storage-class GLACIER_IR

aws s3 cp glacier_instant_retrieval_2.txt s3://reinvent-glacier-demo/T1/year=2023/month=1/day=2/ --storage-class GLACIER_IR

S3 Glacier Flexible Retrieval objects

For more information about S3 Glacier Flexible Retrieval objects, see Appendix B at the end of this post. The objects are listed as follows:

glacier_flexible_retrieval_formerly_glacier_1.txt
glacier_flexible_retrieval_formerly_glacier_2.txt

The objects include the following contents:

Long-lived archive data accessed once a year with retrieval of minutes to hours

To set different storage classes for objects in different folders, use the –storage-class parameter when uploading objects or change the storage class after upload:

aws s3 cp glacier_flexible_retrieval_formerly_glacier_1.txt s3://reinvent-glacier-demo/T1/year=2022/month=1/day=1/ --storage-class GLACIER

aws s3 cp glacier_flexible_retrieval_formerly_glacier_2.txt s3://reinvent-glacier-demo/T1/year=2022/month=1/day=2/ --storage-class GLACIER

S3 Glacier Deep Archive objects

For more information about S3 Glacier Deep Archive objects, see Appendix C at the end of this post. The objects are listed as follows:

glacier_deep_archive_1.txt
glacier_deep_archive_2.txt

The objects include the following contents:

Long-lived archive data accessed less than once a year with retrieval of hours

To set different storage classes for objects in different folders, use the –storage-class parameter when uploading objects or change the storage class after upload:

aws s3 cp glacier_deep_archive_1.txt s3://reinvent-glacier-demo/T1/year=2021/month=1/day=1/ --storage-class DEEP_ARCHIVE

aws s3 cp glacier_deep_archive_2.txt s3://reinvent-glacier-demo/T1/year=2021/month=1/day=2/ --storage-class DEEP_ARCHIVE

List the bucket contents

List the bucket contents with the following code:

aws s3 ls s3://reinvent-glacier-demo/T1/ --recursive

2024-11-17 09:10:05          0 T1/year=2021/month=1/day=1/
2024-11-17 10:43:47         79 T1/year=2021/month=1/day=1/glacier_deep_archive_1.txt
2024-11-17 09:10:14          0 T1/year=2021/month=1/day=2/
2024-11-17 10:44:06         79 T1/year=2021/month=1/day=2/glacier_deep_archive_2.txt
2024-11-17 09:09:53          0 T1/year=2022/month=1/day=1/
2024-11-17 10:27:02         80 T1/year=2022/month=1/day=1/glacier_flexible_retrieval_formerly_glacier_1.txt
2024-11-17 09:09:58          0 T1/year=2022/month=1/day=2/
2024-11-17 10:27:21         80 T1/year=2022/month=1/day=2/glacier_flexible_retrieval_formerly_glacier_2.txt
2024-11-17 09:09:43          0 T1/year=2023/month=1/day=1/
2024-11-17 10:10:48         87 T1/year=2023/month=1/day=1/glacier_instant_retrieval_1.txt
2024-11-17 09:09:48          0 T1/year=2023/month=1/day=2/
2024-11-17 10:11:06         87 T1/year=2023/month=1/day=2/glacier_instant_retrieval_2.txt
2024-11-17 09:09:14          0 T1/year=2024/month=1/day=1/
2024-11-17 09:36:59         19 T1/year=2024/month=1/day=1/standard_s3_file_1.txt
2024-11-17 09:09:35          0 T1/year=2024/month=1/day=2/
2024-11-17 09:37:11         19 T1/year=2024/month=1/day=2/standard_s3_file_2.txt

Create an EMR Cluster

Complete the following steps to create an EMR Cluster:

On the Amazon EMR console, choose Clusters in the navigation pane.
Choose Create cluster.
For the cluster type, choose Advanced configuration for more control over cluster settings.
Configure the software options:
- Choose the Amazon EMR release version (make sure it’s 7.2.0 or higher for S3 Glacier integration).
- Choose applications (such as Spark or Hadoop).
Configure the hardware options:
- Choose the instance types for primary, core, and task nodes.
- Choose the number of instances for each node type.
Set the general cluster settings:
- Name your cluster.
- Choose logging options (recommended to enable logging).
- Choose a service role for Amazon EMR.
Configure the security options:
Choose an EC2 key pair for SSH access.
Set up an Amazon EMR role and EC2 instance profile.
To configure networking, choose a VPC and subnet for your cluster.
Optionally, you can add steps to run immediately when the cluster starts.
Review your settings and choose Create cluster to launch your EMR Cluster.

For more information and detailed steps, see Tutorial: Getting started with Amazon EMR.

For additional resources, refer to Plan, configure and launch Amazon EMR clusters, Configure IAM service roles for Amazon EMR permissions to AWS services and resources, and Use security configurations to set up Amazon EMR cluster security.

Make sure that your EMR cluster has the necessary permissions to access Amazon S3 and S3 Glacier, and that it’s configured to work with the storage classes you plan to use in your demonstration.

Perform queries

In this section, we provide code to perform different queries.

Create a table

Use the following code to create a table:

CREATE TABLE default.reinvent_demo_table (
  data STRING,
  year INT,
  month INT,
  day INT
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe'
WITH SERDEPROPERTIES ('serialization.format' = ',', 'field.delim' = ',')
STORED AS TEXTFILE
PARTITIONED BY (year, month, day)
LOCATION 's3a://reinvent-glacier-demo/T1';

ALTER TABLE reinvent_demo_table ADD IF NOT EXISTS
PARTITION (year=2024, month=1, day=1) LOCATION 's3a://reinvent-glacier-demo/T1/year=2024/month=1/day=1/'
PARTITION (year=2024, month=1, day=2) LOCATION 's3a://reinvent-glacier-demo/T1/year=2024/month=1/day=2/'
PARTITION (year=2023, month=1, day=1) LOCATION 's3a://reinvent-glacier-demo/T1/year=2023/month=1/day=1/'
PARTITION (year=2023, month=1, day=2) LOCATION 's3a://reinvent-glacier-demo/T1/year=2023/month=1/day=2/'
PARTITION (year=2022, month=1, day=1) LOCATION 's3a://reinvent-glacier-demo/T1/year=2022/month=1/day=1/'
PARTITION (year=2022, month=1, day=2) LOCATION 's3a://reinvent-glacier-demo/T1/year=2022/month=1/day=2/'
PARTITION (year=2021, month=1, day=1) LOCATION 's3a://reinvent-glacier-demo/T1/year=2021/month=1/day=1/'
PARTITION (year=2021, month=1, day=2) LOCATION 's3a://reinvent-glacier-demo/T1/year=2021/month=1/day=2/';

Queries before restoring S3 Glacier objects

Before you restore the S3 Glacier objects, run the following queries:

·READ_ALL – The following code shows the default behavior:

$ spark-sql --conf spark.hadoop.fs.s3a.glacier.read.restored.objects=READ_ALL
spark-sql (default)> select * from reinvent_demo_table;

This option throws an exception reading the S3 Glacier storage class objects:

24/11/17 11:57:59 WARN TaskSetManager: Lost task 0.2 in stage 0.0 (TID 9)
(ip-172-31-38-56.ec2.internal executor 2): java.nio.file.AccessDeniedException:
s3a://reinvent-glacier-demo/T1/year=2022/month=1/day=1/glacier_flexible_retrieval_formerly_glacier_1.txt:
open s3a://reinvent-glacier-demo/T1/year=2022/month=1/day=1/glacier_flexible_retrieval_formerly_glacier_1.txt
at 0 on s3a://reinvent-glacier-demo/T1/year=2022/month=1/day=1/glacier_flexible_retrieval_formerly_glacier_1.txt:
software.amazon.awssdk.services.s3.model.InvalidObjectStateException:
The operation is not valid for the object's storage class
(Service: S3, Status Code: 403, Request ID: N6P6SXE6T50QATZY,
Extended Request ID: Elg7XerI+xrhI1sFb8TAhFqLrQAd9cWFG2UrKo8jgt73dFG+5UWRT6G7vkI3wWuvsjhMewuE9Gw=):
InvalidObjectState

SKIP_ALL_GLACIER – This option retrieves Amazon S3 Standard and S3 Glacier Instant Retrieval objects:

$ spark-sql --conf spark.hadoop.fs.s3a.glacier.read.restored.objects=SKIP_ALL_GLACIER spark-sql (default)> select * from reinvent_demo_table;

24/11/17 14:28:31 WARN SessionState: METASTORE_FILTER_HOOK will be ignored, since hive.security.authorization.manager is set to instance of HiveAuthorizerFactory.
SLF4J: Failed to load class "org.slf4j.impl.StaticLoggerBinder".
SLF4J: Defaulting to no-operation (NOP) logger implementation
SLF4J: See http://www.slf4j.org/codes.html#StaticLoggerBinder for further details.
Long-lived archive data accessed once a quarter with instant retrieval in milliseconds    2023    1    1
Long-lived archive data accessed once a quarter with instant retrieval in milliseconds    2023    1    2
standard s3 file 2    2024    1    2
standard s3 file 1    2024    1    1
Time taken: 7.104 seconds, Fetched 4 row(s)

READ_RESTORED_GLACIER_OBJECTS – The option retrieves standard Amazon S3 and all restored S3 Glacier objects. The S3 Glacier objects are under retrieval and will show up after they are retrieved.

spark-sql --conf spark.hadoop.fs.s3a.glacier.read.restored.objects=READ_RESTORED_GLACIER_OBJECTS

spark-sql (default)> select * from reinvent_demo_table;

24/11/17 14:31:52 WARN SessionState: METASTORE_FILTER_HOOK will be ignored, since hive.security.authorization.manager is set to instance of HiveAuthorizerFactory.
SLF4J: Failed to load class "org.slf4j.impl.StaticLoggerBinder".
SLF4J: Defaulting to no-operation (NOP) logger implementation
SLF4J: See http://www.slf4j.org/codes.html#StaticLoggerBinder for further details.
standard s3 file 2    2024    1    2
standard s3 file 1    2024    1    1
Long-lived archive data accessed once a quarter with instant retrieval in milliseconds    2023    1    1
Long-lived archive data accessed once a quarter with instant retrieval in milliseconds    2023    1    2
Time taken: 6.533 seconds, Fetched 4 row(s)

Queries after restoring S3 Glacier objects

Perform the following queries after restoring S3 Glacier objects:

READ_ALL – Because all the objects have been restored, all the objects are read (no exception is thrown):

$ spark-sql --conf spark.hadoop.fs.s3a.glacier.read.restored.objects=READ_ALL

spark-sql (default)> select * from reinvent_demo_table;

24/11/18 01:38:37 WARN SessionState: METASTORE_FILTER_HOOK will be ignored, since hive.security.authorization.manager is set to instance of HiveAuthorizerFactory.
SLF4J: Failed to load class "org.slf4j.impl.StaticLoggerBinder".
SLF4J: Defaulting to no-operation (NOP) logger implementation
SLF4J: See http://www.slf4j.org/codes.html#StaticLoggerBinder for further details.
Long-lived archive data accessed once a year with retrieval of minutes to hours    2022    1    2
Long-lived archive data accessed once a year with retrieval of minutes to hours    2022    1    1
Long-lived archive data accessed once a quarter with instant retrieval in milliseconds    2023    1    1
Long-lived archive data accessed once a quarter with instant retrieval in milliseconds    2023    1    2
standard s3 file 2    2024    1    2
Long-lived archive data accessed less than once a year with retrieval of hours    2021    1    1
Long-lived archive data accessed less than once a year with retrieval of hours    2021    1    2
standard s3 file 1    2024    1    1
Time taken: 6.71 seconds, Fetched 8 row(s)

SKIP_ALL_GLACIER – This option retrieves standard Amazon S3 and S3 Glacier Instant Retrieval objects:

$ spark-sql --conf spark.hadoop.fs.s3a.glacier.read.restored.objects=SKIP_ALL_GLACIER

spark-sql (default)> select * from reinvent_demo_table;

24/11/18 01:39:27 WARN SessionState: METASTORE_FILTER_HOOK will be ignored, since hive.security.authorization.manager is set to instance of HiveAuthorizerFactory.
SLF4J: Failed to load class "org.slf4j.impl.StaticLoggerBinder".
SLF4J: Defaulting to no-operation (NOP) logger implementation
SLF4J: See http://www.slf4j.org/codes.html#StaticLoggerBinder for further details.
Long-lived archive data accessed once a quarter with instant retrieval in milliseconds    2023    1    1
standard s3 file 1    2024    1    1
standard s3 file 2    2024    1    2
Long-lived archive data accessed once a quarter with instant retrieval in milliseconds    2023    1    2
Time taken: 6.898 seconds, Fetched 4 row(s)

READ_RESTORED_GLACIER_OBJECTS – The option retrieves standard Amazon S3 and all restored S3 Glacier objects. The S3 Glacier objects are under retrieval and will show up after they are retrieved.

$ spark-sql --conf spark.hadoop.fs.s3a.glacier.read.restored.objects=READ_RESTORED_GLACIER_OBJECTS

spark-sql (default)> select * from reinvent_demo_table;

24/11/18 01:40:55 WARN SessionState: METASTORE_FILTER_HOOK will be ignored, since hive.security.authorization.manager is set to instance of HiveAuthorizerFactory.
SLF4J: Failed to load class "org.slf4j.impl.StaticLoggerBinder".
SLF4J: Defaulting to no-operation (NOP) logger implementation
SLF4J: See http://www.slf4j.org/codes.html#StaticLoggerBinder for further details.
Long-lived archive data accessed once a year with retrieval of minutes to hours    2022    1    1
Long-lived archive data accessed less than once a year with retrieval of hours    2021    1    2
Long-lived archive data accessed once a year with retrieval of minutes to hours    2022    1    2
Long-lived archive data accessed once a quarter with instant retrieval in milliseconds    2023    1    1
standard s3 file 1    2024    1    1
standard s3 file 2    2024    1    2
Long-lived archive data accessed less than once a year with retrieval of hours    2021    1    1
Long-lived archive data accessed once a quarter with instant retrieval in milliseconds    2023    1    2
Time taken: 6.542 seconds, Fetched 8 row(s)

Conclusion

The integration of Amazon EMR with S3 Glacier storage marks a significant advancement in big data analytics and cost-effective data management. By bridging the gap between high-performance computing and long-term, low-cost storage, this integration opens up new possibilities for organizations dealing with vast amounts of historical data.

Key benefits of this solution include:

Cost optimization – You can take advantage of the economical storage options of S3 Glacier while maintaining the ability to perform analytics when needed
Data lifecycle management – You can benefit from a seamless transition of data from active S3 buckets to archival S3 Glacier storage, and back when analysis is required
Performance and flexibility – Amazon EMR is able to work directly with restored S3 Glacier objects, providing efficient processing of historical data without compromising on performance
Compliance and auditing – The integration offers enhanced capabilities for long-term data retention and analysis, which are crucial for industries with strict regulatory requirements
Scalability – The solution scales effortlessly, accommodating growing data volumes without significant cost increases

As data continues to grow exponentially, the Amazon EMR and S3 Glacier integration provides a powerful toolset for organizations to balance performance, cost, and compliance. It enables data-driven decision-making on historical data without the overhead of maintaining it in high-cost, readily accessible storage.

By following the steps outlined in this post, data engineers and analysts can unlock the full potential of their archived data, turning cold storage into a valuable asset for business intelligence and long-term analytics strategies.

As we move forward in the era of big data, solutions like this Amazon EMR and S3 Glacier integration will play a crucial role in shaping how organizations manage, store, and derive value from their ever-growing data assets.

About the Authors

Giovanni Matteo Fumarola is the Senior Manager for EMR Spark and Iceberg group. He is an Apache Hadoop Committer and PMC member. He has been focusing in the big data analytics space since 2013.

Narayanan Venkateswaran is an Engineer in the AWS EMR group. He works on developing Hive in EMR. He has over 17 years of work experience in the industry across several companies including Sun Microsystems, Microsoft, Amazon and Oracle. Narayanan also holds a PhD in databases with focus on horizontal scalability in relational stores.

Karthik Prabhakar is a Senior Analytics Architect for Amazon EMR at AWS. He is an experienced analytics engineer working with AWS customers to provide best practices and technical advice in order to assist their success in their data journey.

Appendix A: S3 Glacier Instant Retrieval

S3 Glacier Instant Retrieval objects store long-lived archive data accessed once a quarter with instant retrieval in milliseconds. These are not distinguished from S3 Standard object, and there is no option to restore them as well. The key difference between S3 Glacier Instant Retrieval and standard S3 object storage lies in their intended use cases, access speeds, and costs:

Intended use cases – Their intended use cases differ as follows:
- S3 Glacier Instant Retrieval – Designed for infrequently accessed, long-lived data where access needs to be almost instantaneous, but lower storage costs are a priority. It’s ideal for backups or archival data that might need to be retrieved occasionally.
- Standard S3 – Designed for frequently accessed, general-purpose data that requires quick access. It’s suited for primary, active data where retrieval speed is essential.
Access speed – The differences in access speed are as follows:
- S3 Glacier Instant Retrieval – Provides millisecond access similar to standard Amazon S3, though it’s optimized for infrequent access, balancing quick retrieval with lower storage costs.
- Standard S3 – Also offers millisecond access but without the same access frequency limitations, supporting workloads where frequent retrieval is expected.
Cost structure – The cost structure is as follows:
- S3 Glacier Instant Retrieval – Lower storage cost compared to standard Amazon S3 but slightly higher retrieval costs. It’s cost-effective for data accessed less frequently.
- Standard S3 – Higher storage cost but lower retrieval cost, making it suitable for data that needs to be frequently accessed.
Durability and availability – Both S3 Glacier Instant Retrieval and standard Amazon S3 maintain the same high durability (99.999999999%) but have different availability SLAs. Standard Amazon S3 generally has a slightly higher availability, whereas S3 Glacier Instant Retrieval is optimized for infrequent access and has a slightly lower availability SLA.

Appendix B: S3 Glacier Flexible Retrieval

S3 Glacier Flexible Retrieval (previously known simply as S3 Glacier) is an Amazon S3 storage class for archival data that is rarely accessed but still needs to be preserved long-term for potential future retrieval at a very low cost. It’s optimized for scenarios where occasional access to data is required but immediate access is not critical. The key differences between S3 Glacier Flexible Retrieval and standard Amazon S3 storage are as follows:

Intended use cases – Best for long-term data storage where data is accessed very infrequently, such as compliance archives, media assets, scientific data, and historical records.
Access options and retrieval speeds – The differences in access and retrieval speed are as follows:
- Expedited – Retrieval in 1–5 minutes for urgent access (higher retrieval costs).
- Standard – Retrieval in 3–5 hours (default and cost-effective option).
- Bulk – Retrieval within 5–12 hours (lowest retrieval cost, suited for batch processing).
Cost structure – The cost structure is as follows:
- Storage cost – Very low compared to other Amazon S3 storage classes, making it suitable for data that doesn’t require frequent access.
- Retrieval cost – Retrieval incurs additional fees, which vary depending on the speed of access required (Expedited, Standard, Bulk).
- Data retrieval pricing – The quicker the retrieval option, the higher the cost per GB.
Durability and availability – Like other Amazon S3 storage classes, S3 Glacier Flexible Retrieval has high durability (99.999999999%). However, it has lower availability SLAs compared to standard Amazon S3 classes due to its archive-focused design.
Lifecycle policies – You can set lifecycle policies to automatically transition objects from other Amazon S3 classes (like S3 Standard or S3 Standard-IA) to S3 Glacier Flexible Retrieval after a certain period of inactivity.

Appendix C: S3 Glacier Deep Archive

S3 Glacier Deep Archive is the lowest-cost storage class of Amazon S3, designed for data that is rarely accessed and intended for long-term retention. It’s the most cost-effective option within Amazon S3 for data that can tolerate longer retrieval times, making it ideal for deep archival storage. It’s a perfect solution for organizations with data that must be retained but not frequently accessed, such as regulatory compliance data, historical archives, and large datasets stored purely for backup. The key differences between S3 Glacier Deep Archive and standard Amazon S3 storage are as follows:

Intended use cases – S3 Glacier Deep Archive is ideal for data that is infrequently accessed and requires long-term retention, such as backups, compliance records, historical data, and archive data for industries with strict data retention regulations (such as finance and healthcare).
Access options and retrieval speeds – The differences in access and retrieval speed are as follows:
- Standard retrieval – Data is typically available within 12 hours, intended for cases where occasional access is required.
- Bulk retrieval – Provides data access within 48 hours, designed for very large datasets and batch retrieval scenarios with the lowest retrieval cost.
Cost structure – The cost structure is as follows:
- Storage cost – S3 Glacier Deep Archive has the lowest storage costs across all Amazon S3 storage classes, making it the most economical choice for long-term, infrequently accessed data.
- Retrieval cost – Retrieval costs are higher than more active storage classes and vary based on retrieval speed (Standard or Bulk).
- Minimum storage duration – Data stored in S3 Glacier Deep Archive is subject to a minimum storage duration of 180 days, which helps maintain low costs for truly archival data.
Durability and availability – It offers the following durability and availability benefits:
- Durability – S3 Glacier Deep Archive has 99.999999999% durability, similar to other Amazon S3 storage classes.
- Availability – This storage class is optimized for data that doesn’t need frequent access, and so has lower availability SLAs compared to active storage classes like S3 Standard.
Lifecycle policies – Amazon S3 allows you to set up lifecycle policies to transition objects from other storage classes (such as S3 Standard or S3 Glacier Flexible Retrieval) to S3 Glacier Deep Archive based on the age or access frequency of the data.

Run high-availability long-running clusters with Amazon EMR instance fleets

2024-11-21 Garima Arora

Post Syndicated from Garima Arora original https://aws.amazon.com/blogs/big-data/run-high-availability-long-running-clusters-with-amazon-emr-instance-fleets/

AWS now supports high availability Amazon EMR on EC2 clusters with instance fleet configuration. With high availability instance fleet clusters, you now get the enhanced resiliency and fault tolerance of high availability architecture, along with the improved flexibility and intelligence in Amazon Elastic Compute Cloud (Amazon EC2) instance selection of instance fleets. Amazon EMR is a cloud big data platform for petabyte-scale data processing, interactive analysis, streaming, and machine learning (ML) using open source frameworks such as Apache Spark, Presto and Trino, and Apache Flink. Customers love the scalability and flexibility that Amazon EMR on EC2 offers. However, like most distributed systems running mission-critical workloads, high availability is a core requirement, especially for those with long-running workloads.

In this post, we demonstrate how to launch a high availability instance fleet cluster using the newly redesigned Amazon EMR console, as well as using an AWS CloudFormation template. We also go over the basic concepts of Hadoop high availability, EMR instance fleets, the benefits and trade-offs of high availability, and best practices for running resilient EMR clusters.

High availability in Hadoop

High availability (HA) provides continuous uptime and fault tolerance for a Hadoop cluster. The core components of Hadoop, like Hadoop Distributed File System (HDFS) NameNode and YARN ResourceManager, are single points of failure in clusters with a single primary node. In the event that any of them crash, the entire cluster goes down. High Availability removes this single point of failure by introducing redundant standby nodes that can quickly take over if the primary node fails.

In a high availability EMR cluster, one node serves as the active NameNode that handles client operations, and others act as standby NameNodes. The standby NameNodes constantly synchronize their state with the active one, enabling seamless failover to maintain service availability. To learn more, see Supported applications in an Amazon EMR Cluster with multiple primary nodes.

Key instance fleet differentiations

Amazon EMR recommends using the instance fleet configuration option for provisioning EC2 instances in EMR clusters because it offers a flexible and robust approach to cluster provisioning. Some key advantages include:

Flexible instance provisioning – Instance fleets provide a powerful and simple way to specify up to five EC2 instance types on the Amazon EMR console, or up to 30 when using the AWS Command Line Interface (AWS CLI) or API with an allocation strategy. This enhanced diversity helps optimize for cost and performance while increasing the likelihood of fulfilling capacity requirements.
Target capacity management – You can specify target capacities for On-Demand and Spot Instances for each fleet. Amazon EMR automatically manages the mix of instances to meet these targets, reducing operational overhead.
Improved availability – By spanning multiple instance types and purchasing options such as On-Demand and Spot, instance fleets are more resilient to capacity fluctuations in specific EC2 instance pools.
Enhanced Spot Instance handling – Instance fleets offer superior management of Spot Instances, including the ability to set timeouts and specify actions if Spot capacity can’t be provisioned.
Reliable cluster launches – You can configure your instance fleet to select multiple subnets for different Availability Zones, allowing Amazon EMR to find the best combination of instances and purchasing options across these zones to launch your cluster in. Amazon EMR will identify the best Availability Zone based on your configuration and available EC2 capacity and launch the cluster.

Prerequisites

Before you launch the high availability EMR instance fleet clusters, make sure you have the following:

Latest Amazon EMR release – We recommend that you use the latest Amazon EMR release to benefit from the highest level of resiliency and stability for your high availability clusters. High availability for instance fleets is supported with Amazon EMR releases 5.36.1, 6.8.1, 6.9.1, 6.10.1, 6.11.1, 6.12.0, and later.
Supported applications – High availability for instance fleets is supported for applications such as Apache Spark, Presto, Trino, and Apache Flink. Refer to Supported applications in an Amazon EMR Cluster with multiple primary nodes for the complete list of supported applications and their failover processes.

Launch a high availability instance fleet cluster using the Amazon EMR console

Complete the following steps on the Amazon EMR console to configure and launch a high availability EMR cluster with instance fleets:

On the Amazon EMR console, create a new cluster.
For Name, enter a name.
For Amazon EMR release, choose the Amazon EMR release that supports high availability clusters with instance fleets. The setting will default to the latest available Amazon EMR release.

Under Cluster configuration, choose the desired instance types for the primary fleet. (You can select up to five when using the Amazon EMR console.)
Select Use high availability to launch the cluster with three primary nodes.

Choose the instance types and target On-Demand and Spot size for the core and task fleet according to your requirements.

Under Allocation strategy, select Apply allocation strategy.
1. 1 We recommend that you select Price-capacity optimized for your allocation strategy for your cluster for faster cluster provisioning, more accurate Spot Instance allocation, and fewer Spot Instance interruptions.
Under Networking, you can choose multiple subnets for different Availability Zones. This allows Amazon EMR to look across those subnets and launch the cluster in an Availability Zone that best suits your instance and purchasing option requirements.

Review your cluster configuration and choose Create cluster.

Amazon EMR will launch your cluster in a few minutes. You can view the cluster details on the Amazon EMR console.

Launch a high availability cluster with AWS CloudFormation

To launch a high availability cluster using AWS CloudFormation, complete the following steps:

Create a CloudFormation template with EMR resource type AWS::EMR::Cluster and JobFlowInstancesConfig property types MasterInstanceFleet, CoreInstanceFleet and (optional) TaskInstanceFleets. To launch a high availability cluster, configure TargetOnDemandCapacity=3, TargetSpotCapacity=0 for the primary instance fleet and weightedCapacity=1 for each instance type configured for the fleet. See the following code:

{
  "AWSTemplateFormatVersion": "2010-09-09",
  "Resources": {
    "cluster": {
      "Type": "AWS::EMR::Cluster",
      "Properties": {
        "Instances": {
          "Ec2SubnetIds": [
            "subnet-003c889b8379f42d1",
            "subnet-0382aadd4de4f5da9",
            "subnet-078fbbb77c92ab099"
          ],
          "MasterInstanceFleet": {
            "Name": "HAPrimaryFleet",
            "TargetOnDemandCapacity": 3,
            "TargetSpotCapacity": 0,
            "InstanceTypeConfigs": [
              {
                "InstanceType": "m5.xlarge",
                "WeightedCapacity": 1
              },
              {
                "InstanceType": "m5.2xlarge",
                "WeightedCapacity": 1
              },
              {
                "InstanceType": "m5.4xlarge",
                "WeightedCapacity": 1
              }
            ]
          },
          "CoreInstanceFleet": {
            "Name": "cfnCore",
            "InstanceTypeConfigs": [
              {
                "InstanceType": "m5.xlarge",
                "WeightedCapacity": 1
              },
              {
                "InstanceType": "m5.2xlarge",
                "WeightedCapacity": 2
              },
              {
                "InstanceType": "m5.4xlarge",
                "WeightedCapacity": 4
              }
            ],
            "LaunchSpecifications": {
              "SpotSpecification": {
                "TimeoutAction": "SWITCH_TO_ON_DEMAND",
                "TimeoutDurationMinutes": 20,
                "AllocationStrategy": "PRICE_CAPACITY_OPTIMIZED"
              }
            },
            "TargetOnDemandCapacity": "4",
            "TargetSpotCapacity": 0
          },
          "TaskInstanceFleets": [
            {
              "Name": "cfnTask",
              "InstanceTypeConfigs": [
                {
                  "InstanceType": "m5.xlarge",
                  "WeightedCapacity": 1
                },
                {
                  "InstanceType": "m5.2xlarge",
                  "WeightedCapacity": 2
                },
                {
                  "InstanceType": "m5.4xlarge",
                  "WeightedCapacity": 4
                }
              ],
              "LaunchSpecifications": {
                "SpotSpecification": {
                  "TimeoutAction": "SWITCH_TO_ON_DEMAND",
                  "TimeoutDurationMinutes": 20,
                  "AllocationStrategy": "PRICE_CAPACITY_OPTIMIZED"
                }
              },
              "TargetOnDemandCapacity": "0",
              "TargetSpotCapacity": 4
            }
          ]
        },
        "Name": "TestHACluster",
        "ServiceRole": "EMR_DefaultRole",
        "JobFlowRole": "EMR_EC2_DefaultRole",
        "ReleaseLabel": "emr-6.15.0",
        "PlacementGroupConfigs": [
          {
            "InstanceRole": "MASTER",
            "PlacementStrategy": "SPREAD"
          }
        ]
      }
    }
  }
}

Make sure to use an Amazon EMR release that supports high availability clusters with instance fleets.

Create a CloudFormation stack with the preceding template:

aws cloudformation create-stack --stack-name HAInstanceFleetCluster --template-body file://cfn-template.json --region us-east-1

Retrieve the cluster ID from the list-clusters response to use in the following steps. You can further filter this list based on filters like cluster status, creation date, and time.

aws emr list-clusters --query "Clusters[?Name=='<YourClusterName>']"

Run the following describe-cluster command:

aws emr describe-cluster --cluster-id j-XXXXXXXXXXX --region us-east-1

If the high availability cluster was launched successfully, the describe-cluster response will return the state of the primary fleet as RUNNING and provisionedOnDemandCapacity as 3. By this point, all three primary nodes have been started successfully.

Primary node failover with High Availability clusters

To fetch information on all EC2 instances for an instance fleet, use the list-instances command:

aws emr list-instances --cluster-id j-XXXXXXXXXXX --instance-fleet-type MASTER --region us-east-1

For high availability clusters, it will return three instances in RUNNING state for the primary fleet and other attributes like public and private DNS names.

The following screenshot shows the instance fleet status on the Amazon EMR console.

Let’s examine two cases for primary node failover.

Case 1: One of the three primary instances is accidentally stopped

When an EC2 instance is accidentally stopped by a user, Amazon EMR detects this and performs a failover for the stopped primary node. Amazon EMR also attempts to launch a new primary node with the same private IP and DNS name to recover back the quorum. During this failover, the cluster remains fully operational, providing true resiliency to single primary node failures.

The following screenshots illustrate the instance fleet details.

This automatic recovery for primary nodes is also reflected in the MultiMasterInstanceGroupNodesRunning or MultiMasterInstanceGroupNodesRunningPercentage Amazon CloudWatch metric emitted by Amazon EMR for your cluster. The following screenshot shows an example of these metrics.

Case 2: One of the three primary instances becomes unhealthy

If Amazon EMR continuously receives failures when trying to connect to a primary instance, it is deemed as unhealthy and Amazon EMR will attempt to replace it. Similar to case 1, Amazon EMR will perform a failover for the stopped primary node and also attempt to launch a new primary node with the same private IP and DNS name to recover the quorum.

If you list the instances for the primary fleet, the response will include information for the EC2 instance that was stopped by the user and the new primary instance that replaced it with the same private IP and DNS name.

The following screenshot shows an example of the CloudWatch metrics.

An instance can have connection failures for multiple reasons, including but not limited to disk space unavailable on the instance, critical cluster daemons like instance controller shut down with errors, high CPU utilization, and more. Amazon EMR is continuously improving its health monitoring criteria to better identify unhealthy nodes on an EMR cluster.

Considerations and best practices

The following are some of the key considerations and best practices for using EMR instance fleets to launch a high availability cluster with multiple primary nodes:

Use the latest EMR release – With the latest EMR releases, you get the highest level of resiliency and stability for your high availability EMR clusters with multiple primary nodes.
Configure subnets for high availability – Amazon EMR can’t replace a failed primary node if the subnet is oversubscribed (there aren’t any available private IP addresses in the subnet). This results in a cluster failure as soon as the second primary node fails. Limited availability of IP addresses in a subnet can also result in cluster launch or scaling failures. To avoid such scenarios, we recommend that you dedicate an entire subnet to an EMR cluster.
Configure core nodes for enhanced data availability – To minimize the risk of local HDFS data loss on your production clusters, we recommend that you set the dfs.replication parameter to 3 and launch at least four core nodes. Setting dfs.replication to 1 on clusters with fewer than four core nodes can lead to data loss if a single core node goes down. For clusters with three or fewer core nodes, set dfs.replication parameter to at least 2 to achieve sufficient HDFS data replication. For more information, see HDFS configuration.
Use an allocation strategy – We recommend enabling an allocation strategy option for your instance fleet cluster to provide faster cluster provisioning, more accurate Spot Instance allocation, and fewer Spot Instance interruptions.
Set alarms for monitoring primary nodes – You should monitor the health and status of primary nodes of your long-running clusters to maintain smooth operations. Configure alarms using CloudWatch metrics such as MultiMasterInstanceGroupNodesRunning, MultiMasterInstanceGroupNodesRunningPercentage, or MultiMasterInstanceGroupNodesRequested.
Integrate with EC2 placement groups – You can also choose to protect primary instances against hardware failures by using a placement group strategy for your primary fleet. This will spread the three primary instances across separate underlying hardware to avoid loss of multiple primary nodes at the same time in the event of a hardware failure. See Amazon EMR integration with EC2 placement groups for more details.

When setting up a high availability instance fleet cluster with Amazon EMR on EC2, it’s important to understand that all EMR nodes, including the three primary nodes, are launched within a single Availability Zone. Although this configuration maintains high availability within that Availability Zone, it also means that the entire cluster can’t tolerate an Availability Zone outage. To mitigate the risk of cluster failures due to Spot Instance reclamation, Amazon EMR launches the primary nodes using On-Demand instances, providing an additional layer of reliability for these critical components of the cluster.

Conclusion

This post demonstrated how you can use high availability with EMR on EC2 instance fleets to enhance the resiliency and reliability of your big data workloads. By using instance fleets with multiple primary nodes, EMR clusters can withstand failures and maintain uninterrupted operations, while providing enhanced instance diversity and better Spot capacity management within a single Availability Zone. You can quickly set up these high availability clusters using the Amazon EMR console or AWS CloudFormation, and monitor their health using CloudWatch metrics.

To learn more about the supported applications and their failover process, see Supported applications in an Amazon EMR Cluster with multiple primary nodes. To get started with this feature and launch a high availability EMR on EC2 cluster, refer to Plan and configure primary nodes.

About the Authors

Garima Arora is a Software Development Engineer for Amazon EMR at Amazon Web Services. She specializes in capacity optimization and helps build services that allow customers to run big data applications and petabyte-scale data analytics faster. When not hard at work, she enjoys reading fiction novels and watching anime.

Ravi Kumar is a Senior Product Manager Technical-ES (PMT) at Amazon Web Services, specialized in building exabyte-scale data infrastructure and analytics platforms. With a passion for building innovative tools, he helps customers unlock valuable insights from their structured and unstructured data. Ravi’s expertise lies in creating robust data foundations using open-source technologies and advanced cloud computing, that powers advanced artificial intelligence and machine learning use cases. A recognized thought leader in the field, he advances the data and AI ecosystem through pioneering solutions and collaborative industry initiatives. As a strong advocate for customer-centric solutions, Ravi constantly seeks ways to simplify complex data challenges and enhance user experiences. Outside of work, Ravi is an avid technology enthusiast who enjoys exploring emerging trends in data science, cloud computing, and machine learning.

Tarun Chanana is a Software Development Manager for Amazon EMR at Amazon Web Services.

Expand data access through Apache Iceberg using Delta Lake UniForm on AWS

2024-11-14 Tomohiro Tanaka

Post Syndicated from Tomohiro Tanaka original https://aws.amazon.com/blogs/big-data/expand-data-access-through-apache-iceberg-using-delta-lake-uniform-on-aws/

The landscape of big data management has been transformed by the rising popularity of open table formats such as Apache Iceberg, Apache Hudi, and Linux Foundation Delta Lake. These formats, designed to address the limitations of traditional data storage systems, have become essential in modern data architectures. As organizations adopt various open table formats to suit their specific needs, the demand for interoperability between these formats has grown significantly. This interoperability is crucial for enabling seamless data access, reducing data silos, and fostering a more flexible and efficient data ecosystem.

Delta Lake UniForm is an open table format extension designed to provide a universal data representation that can be efficiently read by different processing engines. It aims to bridge the gap between various data formats and processing systems, offering a standardized approach to data storage and retrieval. With UniForm, you can read Delta Lake tables as Apache Iceberg tables. This expands data access to broader options of analytics engines.

This post explores how to start using Delta Lake UniForm on Amazon Web Services (AWS). You can learn how to query Delta Lake native tables through UniForm from different data warehouses or engines such as Amazon Redshift as an example of expanding data access to more engines.

How Delta Lake UniForm works

UniForm allows other table format clients such as Apache Iceberg to access Delta Lake tables. Under the hood, UniForm generates Iceberg metadata files (including metadata and manifest files) that are required for Iceberg clients to access the underlying data files in Delta Lake tables. Both Delta Lake and Iceberg metadata files reference the same data files. UniForm generates multiple table format metadata without duplicating the actual data files. When an Iceberg client reads a UniForm table, it first accesses the Iceberg metadata files for the UniForm table, which then allows the Iceberg client to read the underlying data files.

There are two options to use UniForm:

Create a new Delta Lake table with UniForm
Enable UniForm on your existing Delta Lake table

First, to create a new Delta Lake table enabling UniForm, you configure table properties for UniForm in a CREATE TABLE DDL query. The table properties are 'delta.universalFormat.enabledFormats'='iceberg' and 'delta.enableIcebergCompatV2'='true'. When these options are set to the CREATE TABLE query, Iceberg metadata files are generated along with Delta Lake metadata files. In addition to these options, Delta Lake table protocol versions that define supported features by the table such as delta.minReaderVersion and delta.minWriterVersion are required to be set to 2 and 7 or more respectively. For more information about the table protocol versions, refer to What is a table protocol specification? in Delta Lake public document. Appendix 1. Create a new Delta Lake table with UniForm shows an example query to create a new Delta Lake UniForm table.

You can also enable UniForm on an existing Delta Lake table. This option is suitable if you have Delta Lake tables in your environment. Enabling UniForm doesn’t affect your current operations on the Delta Lake tables. To enable UniForm on a Delta Lake table, run REORG TABLE db.existing_delta_lake_table APPLY (UPGRADE UNIFORM(ICEBERG_COMPAT_VERSION=2)). After running this query, Delta Lake automatically generates Iceberg metadata files for the Iceberg client. In the example in this post, you run this option and enable UniForm after you create a Delta Lake table.

For the information about enabling UniForm, refer to Enable Delta Lake UniForm in the Delta Lake public document. Note that the extra package (delta-iceberg) is required to create a UniForm table in AWS Glue Data Catalog. The extra package is also required to generate Iceberg metadata along with Delta Lake metadata for the UniForm table.

Example use case

A fictional company built a data lake with Delta Lake on Amazon Simple Storage Service (Amazon S3) that’s mainly used through Amazon Athena. According to its usage expansion, this company wants to expand data access to cloud-based data warehouses such as Amazon Redshift for flexible analytics use cases.

There are a few challenges to achieve this requirement. Delta Lake isn’t natively supported in Amazon Redshift. For those data warehouses, Delta Lake tables need to be converted to manifest tables, which requires additional operational overhead. You need to run the GENERATE command on Spark or use a crawler in AWS Glue to generate manifest tables, and you need to sync those manifest tables every time the Delta tables are updated.

Delta Lake UniForm can be a solution to meet this requirement. With Delta Lake UniForm, you can make the Delta Table compatible with the other open table formats such as Apache Iceberg, which is natively supported in Amazon Redshift. Users can query those Delta Lake tables as Iceberg tables through UniForm.

The following diagram describes the architectural overview to achieve that requirement.

In this tutorial, you create a Delta Lake table with a synthetic review dataset that includes different products and customer reviews and enable UniForm on that Delta Lake table to make it accessible from Amazon Redshift. Each component works as follows in this scenario:

Amazon EMR (Amazon EMR on EC2 cluster with Apache Spark): An Apache Spark application on an Amazon EMR cluster creates a Delta Lake table and enables UniForm on it. Only Delta Lake client can write to the Delta Lake UniForm table, making Amazon EMR act as a writer.
Amazon Redshift: Amazon Redshift uses Iceberg clients to read records from the Delta Lake UniForm table. It’s limited to reading records from the table and cannot write to it.
Amazon S3 and AWS Glue Data Catalog: These are used to manage the underlying files and the catalog of the Delta Lake UniForm table. The data and metadata files for the table are stored in an S3 bucket. The table is registered in AWS Glue Data Catalog.

Set up resources

In this section, you complete the following resource setup:

Launch an AWS CloudFormation template to configure resources such as S3 buckets, an Amazon Virtual Private Cloud (Amazon VPC) and a subnet, a database for Delta Lake in Data Catalog, AWS Identity and Access Management (IAM) policy and role with required permissions for Amazon EMR Studio, and an EC2 instance profile for Amazon EMR on EC2 cluster
Launch an Amazon EMR on EC2 cluster
Create an Amazon EMR Studio Workspace
Upload a Jupyter Notebook on Amazon EMR Studio Workspace
Launch a CloudFormation template to configure Amazon Redshift Serverless and relevant subnets

Launch a CloudFormation template to configure basic resources

You use a provided CloudFormation template to set up resources to build Delta Lake UniForm environments. The template creates the following resources.

An S3 bucket to store the Delta Lake table data
An S3 bucket to store an Amazon EMR Studio Workspace metadata and configuration files
An IAM role for Amazon EMR Studio
An EC2 instance profile for Amazon EMR on EC2 cluster
VPC and subnet for an Amazon EMR on EC2 cluster
A database for a Delta Lake table in Data Catalog

Complete the following steps to deploy the resources.

Choose Launch stack:

For Stack name, enter delta-lake-uniform-on-aws. For the Parameters, DeltaDatabaseName, PublicSubnetForEMRonEC2, and VpcCIDRForEMRonEC2 are set by default. You can also change the default values. Then, choose Next.
Choose Next.
Choose I acknowledge that AWS CloudFormation might create IAM resources with custom names.
Choose Submit.
After the stack creation is complete, check the Outputs. The resource values are used in the following sections and in the Appendices.

Launch an Amazon EMR on EC2 cluster

Complete the following steps to create an Amazon EMR on EC2 cluster.

Open the Amazon EMR on EC2 console.
Choose Create cluster.
Enter delta-lake-uniform-blog-post in Name and confirm choosing emr-7.3.0 as its release label.
For Application bundle, select Spark 3.5.1, Hadoop 3.3.6 and JupyterEnterpriseGateway 2.6.0.
For AWS Glue Data Catalog settings, enable Use for Spark table metadata.
For Networking, enter the values from the CloudFormation Outputs tab for VpcForEMR and PublicSubnetForEMR into Virtual private cloud (VPC) andSubnet respectively. For EC2 security groups, keep Create ElasticMapReduce-Primary for Primary node, and Create ElasticMapReduce-Core for Core and task nodes. The security groups for the Amazon EMR primary and core nodes are automatically created.
For Cluster logs, enter s3://<DeltaLakeS3Bucket>/emr-cluster-logs as the Amazon S3 location. Replace <DeltaLakeS3Bucket> with the S3 bucket from the CloudFormation stack Outputs tab.
For Software settings, select Load JSON from Amazon S3 and enter s3://aws-blogs-artifacts-public/artifacts/BDB-4538/config.json as the Amazon S3 location.
For Amazon EMR service role in Identity and Access Management (IAM) roles section, choose Create a service role. Then, set default to Security Group. If there are existing security groups for Amazon EMR Primary and Core or Task nodes, set those security groups to Security Group.
For EC2 instance profile for Amazon EMR, choose Choose an existing instance profile, and set EMRInstanceProfileRole to Instance profile.
After reviewing the configuration, choose Create cluster.
After the cluster status is Waiting on the Amazon EMR console, the cluster setup is complete (it approximately takes 10 minutes).

Create an Amazon EMR Studio Workspace

Complete the following steps to create an Amazon EMR Studio Workspace to use Delta Lake UniForm on Amazon EMR on EC2.

Open Amazon EMR Studio console.
Choose Create Studio.
For Setup options, choose Custom.
For Studio settings, enter delta-lake-uniform-studio as the Studio name.
For S3 location for Workspace storage, choose Select existing location as a Workspace storage. The S3 location (s3://aws-emr-studio-<ACCOUNT_ID>-<REGION>-delta-lake-uniform-on-aws) can be obtained from EMRStudioS3Bucket on the CloudFormation Outputs tab. Then, choose EMRStudioRole as the IAM role (you can find the IAM Role name on the CloudFormation Outputs tab).
For Workspace settings, enter delta-lake-workspace as the Workspace name.
In Networking and security, choose the VPC ID and Subnet ID that you created in Launch an AWS CloudFormation template. You can obtain the VPC ID and Subnet ID from the VpcForEMR and PublicSubnetForEMR keys on the CloudFormation Outputs tab respectively.
After reviewing the settings, choose Create Studio and launch Workspace.
After creating the Studio Workspace is complete, you are redirected to Jupyter Notebook.

Upload Jupyter Notebook

Complete the following steps to configure a Jupyter Notebook to use Delta Lake UniForm with Amazon EMR.

Download delta-lake-uniform-on-aws.ipynb.
Choose the arrow icon at the top of the page and upload the Notebook you just downloaded.

Choose and open the notebook (delta-lake-uniform-on-aws.ipynb) you uploaded in the left pane.
After the notebook is opened, choose EMR Compute in the navigation pane.

Attach the Amazon EMR on EC2 cluster you created in the previous section. Choose EMR on EC2 cluster and set the cluster you created previously to EMR on EC2 cluster, then choose Attach.
After attaching the cluster is successful, Cluster is attached to the Workspace is displayed on the console.

Create a workgroup and a namespace for Amazon Redshift Serverless

For this step, you configure a workgroup and a namespace for Amazon Redshift Serverless to run queries on a Delta Lake UniForm table. You also configure two subnets in the same VPC created by the CloudFormation stack delta-lake-uniform-on-aws. To deploy the resources, complete the following steps:

Choose Launch stack:

For Stack name, enter redshift-serverless-for-delta-lake-uniform.
For Parameters, enter the Availability Zone and an IP range for each subnet. The VPC ID is automatically retrieved from the CloudFormation stack you created in Launch an AWS CloudFormation template to configure basic resources. If you change the default subnet, note that at least one subnet needs to be the same subnet you created for the Amazon EMR on EC2 cluster (by default, the subnet for Amazon EMR on EC2 cluster is automatically retrieved during this CloudFormation stack creation). You can check the subnet for the cluster on the CloudFormation Outputs Then, choose Next.
Choose Next again, and then choose Submit.
After the stack creation is complete, check the CloudFormation Outputs Make a note of the two Subnet IDs on the Outputs tab to use later in Run queries from Amazon Redshift against the UniForm table.

Now you’re ready to use Delta Lake UniForm on Amazon EMR.

Enable Delta Lake UniForm

Start by creating a Delta Lake table that contains the customer review dataset. After creating the table, run REORG query to enable UniForm on the Delta Lake table.

Create a Delta Lake table

Complete the following steps to create a Delta Lake table based on a customer review dataset and review the table metadata.

Return to the Jupyter Notebook connected to the Amazon EMR on EC2 cluster and run the following cell to add delta-iceberg.jar to use UniForm and configure the spark extension.

Initialize the SparkSession. The following configuration is necessary to use Iceberg through UniForm. Before running the code, replace <DeltaLakeS3Bucket> with the name of the S3 bucket for Delta Lake, which you can find on the CloudFormation stack Outputs tab.

Create a Spark DataFrame from customer reviews.

Create a Delta Lake table with the customer reviews dataset. This step takes approximately 5 minutes.

Run DESCRIBE EXTENDED {DB_TBL} in the next cell to review the table. The output includes the table schema, location, table properties, and so on.

The Delta Lake table creation is complete. Next, enable UniForm on this Delta Lake table.

Run `REORG` query to enable UniForm

To allow an Iceberg client to access the Delta Lake table you created, enable UniForm on the table. You can also create a new Delta Lake table with UniForm enabled. For more information, see Appendix 1 at the end of this post. To enable UniForm and review the table metadata, complete the following steps.

Run the following query to enable UniForm on the Delta Lake table. To enable UniForm on an existing Delta Lake table, you run REORG query against the table.

Run DESCRIBE EXTENDED {DB_TBL} in the next cell to review the table metadata and compare it from before and after enabling UniForm. The new properties, such as delta.enableIcebergCompatV2=true and delta.universalFormat.enabledFormats=iceberg, are added to the table properties.

Run aws s3 ls s3://<DeltaLakeS3Bucket>/warehouse/ --recursive to confirm if the Iceberg table metadata is created. Replace <DeltaLakeS3Bucket> with the S3 bucket from the CloudFormation Outputs tab. The following screenshot shows the command output of table metadata and data files. You can confirm that Delta Lake UniForm generates both Iceberg metadata and Delta Lake metadata files as indicated by the red rectangles below.

Before querying the Delta Lake UniForm table from an Iceberg client, run the following analytic query for the Delta Lake UniForm table from Amazon EMR on EC2 side, and review the reviews count by each product category.

The query result shows the output of the reviews count by product_category:

Enabling UniForm on the Delta Lake table is complete, and now you can query the Delta Lake table from an Iceberg client. Next, you query the Delta Lake table as an Iceberg table from Amazon Redshift.

Run queries against the UniForm table from Amazon Redshift

In the previous section, you enabled UniForm on your existing Delta Lake table. This allows you to run queries on a Delta Lake table as if it were an Iceberg table from Amazon Redshift. In this section, you run an analytic query on the UniForm table using Amazon Redshift Serverless and add records with a new product category to the UniForm table through the Jupyter Notebook connected to the Amazon EMR on EC2 cluster. Then, you verify the added records with another analytic query from Amazon Redshift. You can confirm that Delta Lake UniForm enables Amazon Redshift to query the Delta Lake table through this section.

Query the UniForm table from Amazon Redshift Serverless

Open Amazon Redshift Serverless console
In Namespaces/Workgroups, select the delta-lake-uniform-namespace that you created using the CloudFormation stack.
Choose Query data on the right top corner to open the Amazon Redshift query editor.
After opening the editor, select the delta-lake-uniform-workgroup workgroup in the left pane.
Choose Create connection.
After you successfully create a connection, you can see the delta_uniform_db database and customer_review table you created in the left pane of the editor.
Copy and paste the following analytic query to the editor and choose Run.

SELECT product_category, count(*) as count_by_product_category 
FROM "awsdatacatalog"."delta_uniform_db"."customer_reviews"
GROUP BY product_category ORDER BY count_by_product_category DESC

The editor shows the same result of the review count by product_category as you obtained from Jupyter Notebook in Run REORG query to enable UniForm.

Add new product category records into Delta Lake UniForm table from Amazon EMR

Go back to the Jupyter Notebook on Amazon EMR Workspace to add new records with a new product category (Books) into the Delta Lake UniForm table. After adding the records, query the UniForm table again from Amazon Redshift Serverless.

On the Jupyter Notebook, go to Add new product category records into the UniForm table and run the following cell to load new records.

Run the following cell and review the five records with Books as the product category. The following screenshot shows the output of this code.

Add the new reviews with Books product category. This takes around 2 minutes.

In the next section, you run a query on the UniForm table from Amazon Redshift Serverless to check if the new records with the Books product category have been added.

Review the added records in Delta Lake UniForm table from Amazon Redshift Serverless

To check if the result output includes the records of Books product category:

On the query editor of Amazon Redshift, run the following query and check if the result output includes the records of Books product category.

SELECT product_category, count(*) as count_by_product_category 
FROM "awsdatacatalog"."delta_uniform_db"."customer_reviews"
GROUP BY product_category ORDER BY count_by_product_category DESC

The following screenshot shows the output of the query you ran in the previous step. You can confirm the new product category Books has been added to the table from Amazon Redshift side.

Now you can query from Amazon Redshift against the Delta Lake table by enabling Delta Lake UniForm.

Clean up resources

To clean up your resources, complete the following steps:

In the Amazon EMR Workspaces console, choose Actions and then Delete to delete the workspace.
Choose Delete to delete the Studio.
In the Amazon EMR on EC2 console, choose Terminate to delete the Amazon EMR on EC2 cluster.
In the Amazon S3 console, choose Empty to delete all objects in the following S3 buckets.
1. The S3 bucket for Amazon EMR Studio such as aws-emr-studio-<ACCOUNT_ID>-<REGION>-delta-lake-uniform-on-aws. Replace <ACCOUNT_ID> and <REGION> with your account ID and the bucket’s region.
2. The S3 bucket for Delta Lake tables such as delta-lake-uniform-on-aws-deltalakes3bucket-abcdefghijk.
After you confirm the two buckets are empty, delete the CloudFormation stack redshift-serverless-for-delta-lake-uniform.
After the first CloudFormation stack has been deleted, delete the CloudFormation stack delta-lake-uniform-on-aws.

Conclusion

Delta Lake UniForm on AWS represents an advancement in addressing the challenges of data interoperability and accessibility in modern big data architectures. By enabling Delta Lake tables to be read as Apache Iceberg tables, UniForm expands data access capabilities, allowing organizations to use a broader range of analytics engines and data warehouses such as Amazon Redshift.

The practical implications of this technology are substantial, offering new possibilities for data analysis and insights across diverse platforms. As organizations continue to navigate the complexities of big data, solutions like Delta Lake UniForm that promote interoperability and reduce data silos will become increasingly valuable.

By adopting these advanced open table formats and using cloud platforms such as AWS, organizations can build more robust and efficient data ecosystems. This approach not only enhances the value of existing data assets but also fosters a more agile and adaptable data strategy, ultimately driving innovation and improving decision-making processes in our data-driven world.

Appendix 1: Create a new Delta Lake table with UniForm

You can create a Delta Lake table with UniForm enabled using the following DDL.

CREATE TABLE IF NOT EXISTS delta_uniform_db.customer_reviews_create (
   marketplace string,
   customer_id string,
   review_id string,
   product_id string,
   product_title string,
   star_rating bigint,
   helpful_votes bigint,
   total_votes bigint,
   insight string,
   review_headline string,
   review_body string,
   review_date timestamp,
   review_year bigint,
   product_category string)
USING delta
TBLPROPERTIES ( 
   'delta.universalFormat.enabledFormats'='iceberg',
   'delta.enableIcebergCompatV2'='true',
   'delta.minReaderVersion'='2',
   'delta.minWriterVersion'='7')

Appendix 2: Run queries from Snowflake against the UniForm table

Delta Lake UniForm also allows you to run queries on a Delta Lake table from Snowflake. In this section, you run the same analytic query on the UniForm table using Snowflake as you previously did using Amazon Redshift Serverless in Run queries from Amazon Redshift against the UniForm table. Then you confirm that the query results from Snowflake match the results obtained from the Amazon Redshift Serverless query.

Configure IAM roles for Snowflake to access AWS Glue Data Catalog and Amazon S3

To query the Delta Lake UniForm table in Data Catalog from Snowflake, the following configurations are required.

IAM roles: Create IAM roles for Snowflake to access Data Catalog and Amazon S3.
Data Catalog integration with Snowflake: Snowflake provides two catalog options for Iceberg tables such as Using Snowflake as the Iceberg catalog and Using an external catalog such as Data Catalog. In this post, you choose AWS Glue Data Catalog as an external catalog. For information about the catalog options, refer to Iceberg catalog options in the Snowflake public documentation.
An external volume creation for Amazon S3: To access the UniForm table from Snowflake, an external volume for Amazon S3 needs to be configured. With this configuration, Snowflake can connect the S3 bucket that you created for Iceberg tables. For information about the external volume, refer to Configure an external volume for Iceberg tables.

Create IAM roles for Snowflake to access AWS Glue Data Catalog and Amazon S3

Create the following two IAM roles for Snowflake to access AWS Glue Data Catalog and Amazon S3.

SnowflakeIcebergGlueCatalogRole: This IAM role is used for Snowflake to access the Delta Lake UniForm table in AWS Glue Data Catalog.
SnowflakeIcebergS3Role: This IAM role is used for Snowflake to access the table’s underlying data in the S3 bucket.

To configure the IAM roles, complete the following steps:

Choose Launch stack:

Enter snowflake-iceberg as the stack name and choose Next.
Select I acknowledge that AWS CloudFormation might create IAM resources with custom names.
Choose Submit.
After the stack creation is complete, check the CloudFormation Outputs tab. Make a note of the names and ARNs of the two IAM roles, which are used in the following section.

Create an AWS Glue Data Catalog Integration

Create a catalog integration for AWS Glue Data Catalog. For more information about the catalog integration for AWS Glue Data Catalog, refer to Configure a catalog integration for AWS Glue in the Snowflake public documentation. To configure the catalog integration, complete the following steps:

Access your Snowflake account and open an empty worksheet (query editor).
Run the following query and create a catalog integration with AWS Glue Data Catalog. Replace <YOUR_ACCOUNT_ID> with the IAM role ARN from the snowflake-iceberg CloudFormation Ouputs tab, and replace <REGION> with the region of AWS Glue Data Catalog.

CREATE CATALOG INTEGRATION glue_catalog_integration
  CATALOG_SOURCE=GLUE
  CATALOG_NAMESPACE='delta_uniform_db'
  TABLE_FORMAT=ICEBERG
  GLUE_AWS_ROLE_ARN='arn:aws:iam::<YOUR_ACCOUNT_ID>:role/SnowflakeIcebergGlueCatalogRole'
  GLUE_CATALOG_ID='<YOUR_ACCOUNT_ID>'
  GLUE_REGION='<REGION>'
  ENABLED=TRUE;

Retrieve GLUE_AWS_IAM_USER_ARN and GLUE_AWS_EXTERNAL_ID by using DESCRIBE CATALOG INTEGRATION glue_catalog_integration in the editor. The output is similar to the following:

+------------------------------------------------------------------------------------------------------------------------------+
| property                 | property_type | property_value                                                 | property_default |
|--------------------------+---------------+----------------------------------------------------------------+------------------|
| ENABLED                  | Boolean       | true                                                           | false            |
| CATALOG_SOURCE           | String        | GLUE                                                           |                  |
| CATALOG_NAMESPACE        | String        | delta_uniform_db                                               |                  |
| TABLE_FORMAT             | String        | ICEBERG                                                        |                  |
| REFRESH_INTERVAL_SECONDS | Integer       | 30                                                             | 30               |
| GLUE_AWS_ROLE_ARN        | String        | arn:aws:iam::123456789012:role/SnowflakeIcebergGlueCatalogRole |                  |
| GLUE_CATALOG_ID          | String        | 123456789012                                                   |                  |
| GLUE_REGION              | String        | us-east-1                                                      |                  |
| GLUE_AWS_IAM_USER_ARN    | String        | arn:aws:iam::123456789012:user/<ID>                            |                  |
| GLUE_AWS_EXTERNAL_ID     | String        | An external ID specified on the IAM Role trust relationships   |                  |
| COMMENT                  | String        |                                                                |                  |
+------------------------------------------------------------------------------------------------------------------------------+

Update the IAM role you created using the CloudFormation stack to enable Snowflake to access AWS Glue Data Catalog using that IAM role. Open Trust Relationships of SnowflakeIcebergGlueCatalogRole on the IAM console, choose Edit and update the trust relationship using the following policy. Replace <GLUE_AWS_IAM_USER_ARN> and <GLUE_AWS_EXTERNAL_ID> with the names you obtained in the previous step.

{
   "Version": "2012-10-17",
   "Statement": [
	  {
		 "Sid": "",
		 "Effect": "Allow",
		 "Principal": {
			"AWS": "<GLUE_AWS_IAM_USER_ARN>"
		 },
		 "Action": "sts:AssumeRole",
		 "Condition": {
			"StringEquals": {
			   "sts:ExternalId": "<GLUE_AWS_EXTERNAL_ID>"
			}
		 }
	  }
   ]
}

You completed setting up the IAM role for Snowflake to access your Data Catalog resources. Next, configure the IAM role for Amazon S3 access.

Register Amazon S3 as an external volume

In this section, you configure an external volume for Amazon S3. Snowflake accesses the UniForm table data files in S3 through the external volume. For the configuration of an external volume for S3, refer to Configure an external volume for Amazon S3 in Snowflake public documentation. To configure the external volume, complete the following steps:

In the query editor, run the following query to create an external volume for the Delta Lake S3 bucket. Replace <DeltaLakeS3Bucket> with the name of the S3 bucket that you created in Launch a CloudFormation template to configure basic resources from the CloudFormation Outputs tab. Replace <ACCOUNT_ID> with your AWS account ID.

CREATE OR REPLACE EXTERNAL VOLUME delta_lake_uniform_s3
   STORAGE_LOCATIONS =
	  (
		 (
			NAME = 'delta-lake-uniform-on-aws'
			STORAGE_PROVIDER = 'S3'
			STORAGE_BASE_URL = 's3://<DeltaLakeS3Bucket>'
			STORAGE_AWS_ROLE_ARN = 'arn:aws:iam::<ACCOUNT_ID>:role/SnowflakeIcebergS3Role'
		 )
	  );

Retrieve STORAGE_AWS_IAM_USER_ARN and STORAGE_AWS_EXTERNAL_ID by running DESCRIBE EXTERNAL VOLUME delta_lake_uniform_s3 in the editor. The output is similar to the following:

+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| parent_property   | property           | property_type | property_value                                                                                                                       | property_default |
|-------------------+--------------------+---------------+--------------------------------------------------------------------------------------------------------------------------------------+------------------|
|                   | ALLOW_WRITES       | Boolean       | true                                                                                                                                 | true             |
| STORAGE_LOCATIONS | STORAGE_LOCATION_1 | String        | {"NAME":"uniform_s3_location","STORAGE_PROVIDER":"S3","STORAGE_BASE_URL":"s3://<DeltaLakeS3Bucket>","STORAGE_ALLOWED_LOCATIONS":["s3 |                  |
|                   |                    |               | ://<DeltaLakeS3Bucket>/*"],"STORAGE_REGION":"us-east-1","PRIVILEGES_VERIFIED":true,"STORAGE_AWS_ROLE_ARN":"arn:aws:iam::123456789012 |                  |
|                   |                    |               | :role/SnowflakeIcebergS3Role","STORAGE_AWS_IAM_USER_ARN":"arn:aws:iam::123456789012:user/<ID>","STORAGE_AWS_EXTERNAL_ID":"<External  |                  |
|                   |                    |               | ID>",... }                                                                                                                           |                  |
| STORAGE_LOCATIONS | ACTIVE             | String        | uniform_s3_location                                                                                                                  |                  |
+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+

Update the IAM role you created using the CloudFormation template (in Create IAM roles for Snowflake to access AWS Glue Data Catalog and Amazon S3) to enable Snowflake to use this IAM role. Open Trust Relationships of SnowflakeIcebergS3Role on the IAM console, choose Edit, and update the trust relationship with the following policy. Replace <STORAGE_AWS_IAM_USER_ARN> and <STORAGE_AWS_EXTERNAL_ID> with the values from the previous step.

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "",
      "Effect": "Allow",
      "Principal": {
        "AWS": "<STORAGE_AWS_IAM_USER_ARN&>"
      },
      "Action": "sts:AssumeRole",
      "Condition": {
        "StringEquals": {
          "sts:ExternalId": "<STORAGE_AWS_EXTERNAL_ID>"
        }
      }
    }
  ]
}

The next step is to create an Iceberg table to run queries from Snowflake.

Create an Iceberg table in Snowflake

In this section, you create an Iceberg table in Snowflake. The table is an entry point for Snowflake to access the Delta Lake UniForm table in AWS Glue Data Catalog. To create the table, complete the following steps:

(Optional) If you don’t have a database in Snowflake, run CREATE DATABASE <DATABASE_NAME>, replacing <DATABASE_NAME> with a unique database name for the Iceberg table.
Run the following query in the Snowflake query editor. In this case, the database delta_uniform_snow_db is chosen for the table. Configure the following parameters:
1. EXTERNAL_VOLUME: created by CREATE OR REPLACE EXTERNAL VOLUME query in the previous section, such as delta_lake_uniform_s3.
2. CATALOG: created by the CREATE CATALOG INTEGRATION query in the previous section, such as glue_catalog_integration.
3. CATALOG_TABLE_NAME: the name of Delta Lake UniForm table you created in Data Catalog such as customer_reviews.

The complete query is below:

CREATE OR REPLACE ICEBERG TABLE customer_reviews_snow
  EXTERNAL_VOLUME='delta_lake_uniform_s3'
  CATALOG='glue_catalog_integration'
  CATALOG_TABLE_NAME='customer_reviews';

After the table creation is complete, you’re ready to query the UniForm table in AWS Glue Data Catalog from Snowflake.

Query the UniForm table from Snowflake

In this step, you query the UniForm table from Snowflake. Paste and run the following analytic query in the Snowflake query editor.

SELECT product_category, count(*) as count_by_product_category 
FROM customer_reviews_snow 
GROUP BY product_category ORDER BY count_by_product_category DESC

The query result shows the same output as you saw in Review the added records in Delta Lake UniForm table from Amazon Redshift Serverless section.

+--------------------------------------------------+
| PRODUCT_CATEGORY     | COUNT_BY_PRODUCT_CATEGORY |
|----------------------+---------------------------|
| Office_Products      | 9673711                   |
| Books                | 9672664                   |
| Apparel              | 6448747                   |
| Computers            | 3224215                   |
| Beauty_Personal_Care | 3223599                   |
+--------------------------------------------------+

Now you can query from Snowflake against the Delta Lake table by enabling Delta Lake UniForm.

About the Authors

Tomohiro Tanaka is a Senior Cloud Support Engineer at Amazon Web Services. He’s passionate about helping customers use Apache Iceberg for their data lakes on AWS. In his free time, he enjoys a coffee break with his colleagues and making coffee at home.

Noritaka Sekiyama is a Principal Big Data Architect on the AWS Glue team. He works based in Tokyo, Japan. He is responsible for building software artifacts to help customers. In his spare time, he enjoys cycling with his road bike.