All posts by Deepmala Agarwal

Architect fault-tolerant applications with instance fleets on Amazon EMR on EC2

2025-03-14 Deepmala Agarwal

Post Syndicated from Deepmala Agarwal original https://aws.amazon.com/blogs/big-data/architect-fault-tolerant-applications-with-instance-fleets-on-amazon-emr-on-ec2/

Organizations rely on Amazon EMR on EC2 clusters to process large-scale data workloads using frameworks like Apache Spark, Apache Hive, and Trino. Events such as TV advertisements or unplanned promotions might increase the demand for compute capacity, so effective capacity planning is necessary to keep your workloads from hitting capacity limits and failing jobs.

A common scenario is to run daily Spark jobs on Amazon EMR using consistent Amazon Elastic Compute Cloud (Amazon EC2) instance types (for example, a single instance size and family for the cluster). Although this might work well to sustain the baseline, spikes can trigger auto scaling or force you to stop the cluster and relaunch a larger one. Either way, relying on a single instance type narrows the chances of getting capacity, because that specific On-Demand instance pool might lack the capacity to meet the demand.

In this post, we show how to optimize capacity by analyzing EMR workloads and implementing strategies tailored to your workload patterns. We walk through assessing the historical compute usage of a workload and use a combination of strategies to reduce the likelihood of InsufficientCapacityExceptions (ICE) when Amazon EMR launches specific EC2 instance types. We implement flexible instance fleet strategies to reduce dependency on specific instance types and use Amazon EC2 On-Demand Capacity Reservations (ODCRs) for predictable, steady-state workloads. Following this approach can help prevent job failures due to capacity limits while optimizing your cluster for cost and performance.

Solution overview

Instance fleets in Amazon EMR offer a flexible and robust way to manage EC2 instances within your cluster. This feature allows you to specify target capacities for On-Demand and Spot Instances, select up to five EC2 instance types per fleet (or 30 when using the AWS Command Line Interface [AWS CLI] and API with an allocation strategy), and use multiple subnets across different Availability Zones. Importantly, instance fleets support the use of ODCRs, enabling you to align your EMR clusters with pre-purchased EC2 capacity. You can configure your instance fleet to prefer or require capacity reservations, making sure that your EMR clusters use your reserved capacity efficiently.

EMR workload patterns typically fall into two categories: stable and variable (spiky). In the following sections, we explore how to optimize for each pattern using various options available with instance fleets, starting with stable workloads and then addressing variable workloads.

Stable workloads are workloads with a predictable pattern of resource utilization over time; for example, a pharmaceutical provider needs to process 21 TB of research data, patient records, and other information daily. The workload is consistent and needs to run reliably every day on long-running persistent clusters. For critical business operations requiring high reliability and guaranteed capacity, we recommend reserving the baseline capacity as part of your capacity planning. We demonstrate the following steps:

Use AWS Cost and Usage Reports (AWS CUR) to estimate the baseline of existing workloads.
Reserve the baseline capacity using ODCR.
Configure Amazon EMR to use the targeted ODCR.

Spiky workloads are defined by unpredictable and often significant fluctuations in processing demands. These surges can be triggered by various factors (such as batch processing, real-time data streaming, or seasonal business fluctuations) that trigger Amazon EMR to request more capacity to match the demand. We address the resource allocation by using instance and Availability Zone flexibility, with the following steps:

Introduce EC2 instance flexibility with EMR instance fleets.
Achieve resiliency through intelligent subnet selection with EMR instance fleets.
Use managed scaling to automatically manage scaling in and out.

Stable workloads

In this section, we demonstrate how to define your baseline, configure AWS Identity and Access Management (IAM) permissions, create an ODCR, associate your reservations with a resource group, and configure Amazon EMR to use targeted ODCRs. You can opt for a mixed ODCR strategy: for example, one short-duration ODCR that supports the launch of your EMR cluster, and another longer-duration ODCR that supports your task nodes based on the baseline capacity reservation.

Estimate the baseline

Make sure to activate the AWS generated cost allocation tag aws:elasticmapreduce:job-flow-id. This enables the field resource_tags_aws_elasticmapreduce_job_flow_id in the AWS CUR to be populated with the EMR cluster ID and is used by the SQL queries in the solution. To activate the cost allocation tag from the AWS Billing Console, complete the following steps:

On the AWS Billing and Cost Management console, choose Cost allocation tags in the navigation pane.
Under AWS generated cost allocation tags, choose the aws:elasticmapreduce:job-flow-id tag.
Choose Activate.

It can take up to 24 hours for tags to activate. For more information, see here.

After the tags are activated, you can use AWS CUR and perform the following query on Amazon Athena to find the compute resources used by the EMR cluster ID vs. the timeline of usage. For more details, see Querying Cost and Usage Reports using Amazon Athena. Update the following query with your CUR table name, EMR cluster ID, desired timestamps, and AWS account ID, and run the query on Athena:

SELECT bill_payer_account_id as Payer,
    product_product_family as PFamily,
    product_product_name as PName,
    resource_tags_aws_elasticmapreduce_job_flow_id,
    line_item_usage_account_id as LinkedAccount,
    line_item_usage_start_date as UsageDate,
    bill_billing_period_start_date as BillingDate,
    SPLIT_PART(line_item_usage_type, ':', 2) AS InstanceType,
    line_item_availability_zone AS AvailabilityZone,
    COUNT(line_item_resource_id) as ResourceIDCount
FROM <YOUR_CUR_TABLE_NAME>
WHERE (
        line_item_usage_start_date BETWEEN TIMESTAMP 'YYYY-MM-DD 00:00:00'
        AND TIMESTAMP 'YYYY-MM-DD 23:59:59' 
    )
    AND line_item_operation LIKE '%RunInstance%'
    AND line_item_line_item_type LIKE '%Usage%'
    AND product_product_family NOT IN ('Data Transfer')
    AND resource_tags_aws_elasticmapreduce_job_flow_id LIKE '%<emr-cluster-id>%'
    AND line_item_usage_account_id IN (
        '<aws_account_id>'
    )
GROUP BY 1,2,3,4,5,6,7,8,9

As an example, the preceding query returns hourly instance usage for a given account and EMR cluster; running it over a 6-month period generates the following figure. You can export the results in CSV format and analyze the data. Now that you have a visual representation of your workloads’ baseline and bursts, you can define the strategy and configuration of your EMR cluster.
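One way to turn the exported CSV into a baseline number is to take a low percentile of the hourly instance counts per instance type. The following is a minimal sketch, not part of the original solution. It assumes the export keeps the query's UsageDate, InstanceType, and ResourceIDCount column names and contains only the EC2 rows you want to count. The function name and the 10th-percentile default are illustrative choices, not AWS recommendations.

```python
import csv
import io
from collections import defaultdict


def estimate_baseline(csv_text, quantile=0.10):
    """Return {instance_type: baseline_count} from exported query results.

    Assumes the CSV has UsageDate, InstanceType, and ResourceIDCount columns.
    The default quantile is an arbitrary illustration.
    """
    # Sum counts per (instance type, hour); a type can appear in several rows
    # for the same hour (for example, one row per Availability Zone).
    hourly = defaultdict(lambda: defaultdict(int))
    for row in csv.DictReader(io.StringIO(csv_text)):
        hourly[row["InstanceType"]][row["UsageDate"]] += int(row["ResourceIDCount"])

    baseline = {}
    for instance_type, by_hour in hourly.items():
        counts = sorted(by_hour.values())
        # Simple rank-based index for the requested quantile.
        index = int(quantile * (len(counts) - 1))
        baseline[instance_type] = counts[index]
    return baseline


# Made-up sample rows, for illustration only.
sample = """UsageDate,InstanceType,ResourceIDCount
2025-01-01 00:00:00,r8g.2xlarge,10
2025-01-01 01:00:00,r8g.2xlarge,12
2025-01-01 02:00:00,r8g.2xlarge,40
2025-01-01 03:00:00,r8g.2xlarge,11
"""
print(estimate_baseline(sample))  # {'r8g.2xlarge': 10}
```

Hours in which the cluster ran no instances don't appear in the query results, so this sketch describes the baseline only for hours with recorded usage.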

Create an ODCR to reserve the baseline capacity

ODCRs can be either open or targeted:

With an open ODCR, new and existing instances that have matching attributes (such as operating system or instance type) automatically run in the reserved capacity first.
With a targeted ODCR, instances must match the attributes of the ODCR specification and must explicitly target the ODCR at launch. This approach is recommended if you have multiple concurrent EMR clusters consuming capacity from the shared On-Demand pool of EC2 instances. EMR clusters larger than the targeted ODCR quantity will fall back to On-Demand Instances that are in the same Availability Zone.

In this example, we use a targeted ODCR with an EMR instance fleet in the us-east-1a Availability Zone. The following diagram illustrates the workflow.

Complete the following steps:

Use the create-capacity-reservation AWS CLI command to create the ODCR and make a note of the CapacityReservationArn value in the output:

aws ec2 create-capacity-reservation \
     --availability-zone <Input Your Availability Zone> \
     --instance-type r8g.2xlarge \
     --instance-match-criteria targeted \
     --instance-platform Linux/UNIX \
     --instance-count <enter the number of instances out of your baseline estimation>

We get the following output:

{
     "CapacityReservation": {
         "CapacityReservationId": "cr-0123456f9907xxxxx",
         "OwnerId": "XXXX",
         "CapacityReservationArn": "arn:aws:ec2:us-east-1:XXXX:capacity-reservation/cr-0123456f9907xxxxx",
         "InstanceType": "r8g.2xlarge",
         "InstancePlatform": "Linux/UNIX",
         "AvailabilityZone": "us-east-1a"

 ....
     }
 }

You can use Amazon CloudWatch to monitor ODCR usage and trigger an alert for unused capacity. For more details, see Monitor Capacity Reservations usage with CloudWatch metrics.
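As a rough illustration of such an alert, the following sketch builds the arguments for a CloudWatch alarm that fires when reserved instances stay unused. It relies on the AvailableInstanceCount metric in the AWS/EC2CapacityReservations namespace. The function name, alarm name, threshold, and evaluation window are hypothetical values to adapt to your own reservation.

```python
def unused_capacity_alarm(reservation_id, threshold=0, period_seconds=300, periods=3):
    """Build keyword arguments for an alarm on unused ODCR capacity."""
    return {
        "AlarmName": f"odcr-unused-{reservation_id}",  # hypothetical naming scheme
        "Namespace": "AWS/EC2CapacityReservations",
        "MetricName": "AvailableInstanceCount",
        "Dimensions": [{"Name": "CapacityReservationId", "Value": reservation_id}],
        # Minimum: alarm only if capacity sat unused for the whole period.
        "Statistic": "Minimum",
        "Period": period_seconds,
        "EvaluationPeriods": periods,
        "Threshold": threshold,
        "ComparisonOperator": "GreaterThanThreshold",
    }


params = unused_capacity_alarm("cr-0123456f9907xxxxx")
print(params["AlarmName"])
```

If you use boto3, you could pass the result to the CloudWatch client as put_metric_alarm(**params), after adding the alarm actions you need.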

Create a resource group named EMRSparkSteadyStateGroup and make a note of the GroupArn value in the output:

aws resource-groups create-group --name EMRSparkSteadyStateGroup \
--configuration '{"Type":"AWS::EC2::CapacityReservationPool"}' '{"Type":"AWS::ResourceGroups::Generic", "Parameters":[{"Name":"allowed-resource-types","Values":["AWS::EC2::CapacityReservation"]}]}'

We get the following output:

"Group": {
         "GroupArn": "arn:aws:resource-groups:us-east-1:XXXX:group/EMRSparkSteadyStateGroup",
         "Name": "EMRSparkSteadyStateGroup"
     }, ...

Use the following code to associate the capacity reservation with the resource group. You can have multiple capacity reservations associated with a resource group.

aws resource-groups group-resources --group EMRSparkSteadyStateGroup \
 --resource-arns arn:aws:ec2:us-east-1:XXXX:capacity-reservation/cr-0123456f9907xxxxx

As a best practice for effective management and cleanup, create a tag Purpose=EMR-Spark-Steady-State for the newly created ODCR and the resource group:

# Tag your Capacity Reservation
 aws ec2 create-tags \
 --resources cr-0123456f9907xxxxx \
 --tags Key=Purpose,Value=EMR-Spark-Steady-State
# Tag your Resource Group
 aws resource-groups tag \
 --arn "arn:aws:resource-groups:us-east-1:XXXX:group/EMRSparkSteadyStateGroup" --tags Purpose=EMR-Spark-Steady-State

Implement Amazon EMR with ODCR

Complete the following steps to create an EMR cluster tagged with the specific targeted ODCR:

Add the required permissions to the EMR service role before using capacity reservations. The following policy allows these actions on all resources; you can lock it down by replacing the wildcard with the specific Amazon Resource Name (ARN) of the resource group where the action supports it:

{
     "Version": "2012-10-17",
     "Statement": [
         {
             "Effect": "Allow",
             "Resource": "*",
             "Action": [
                 "ec2:CreateFleet",
                 "ec2:RunInstances",
                 "ec2:CreateLaunchTemplate",
                 "ec2:CreateLaunchTemplateVersion",
                 "ec2:DeleteLaunchTemplateVersions",
                 "ec2:DescribeCapacityReservations",
                 "ec2:DescribeLaunchTemplateVersions",
                 "resource-groups:ListGroupResources"
             ]
         }
     ]
 }

Configure the EMR cluster to use ODCR with instance fleets. We use the CapacityReservationOptions parameter to configure the EMR cluster, as shown in the following example:

  {
 ...
     "LaunchSpecifications": {
       "OnDemandSpecification": {
         "AllocationStrategy": "LOWEST_PRICE",
         "CapacityReservationOptions": {
           "UsageStrategy": "USE_CAPACITY_RESERVATIONS_FIRST",
           "CapacityReservationResourceGroupArn": "arn:aws:resource-groups:us-east-1:xxxxxx:group/EMRSparkSteadyStateGroup"
         }
       }
     }
   }

The following step-by-step breakdown illustrates the Amazon EMR decision-making process when prioritizing targeted capacity reservations, from core node provisioning through task node allocation:

Cluster provisioning initiation:
- The user chooses to override the lowest-price allocation strategy.
- The user specifies targeted capacity reservations in the launch request.
Core node provisioning:
- Amazon EMR evaluates all EC2 instance capacity pools with targeted capacity reservations, and selects the pool with the lowest price that has sufficient capacity for all requested core nodes.
- If no pool with targeted reservations has sufficient capacity, Amazon EMR reevaluates all specified EC2 instance capacity pools and selects the lowest-priced pool with sufficient capacity for core nodes. Available open capacity reservations are applied automatically.
Availability Zone selection:
- After the core capacity is acquired, Amazon EMR locks in the Availability Zone for your cluster.
Primary and task node provisioning:
- Amazon EMR evaluates EC2 instance capacity pools within that Availability Zone for primary and task fleets. First, Amazon EMR evaluates all the pools with targeted ODCRs specified in the request, ordered by lowest price by default.
- From the ordered list, Amazon EMR launches as much capacity as possible from the unused targeted ODCRs of each instance pool until the request is fulfilled.
- If the unused targeted ODCRs don’t fulfill the request yet, Amazon EMR continues to launch the remaining capacity into On-Demand pools, in the lowest-price order by default.
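The primary and task node steps above can be pictured as a two-pass fill: first consume unused targeted ODCRs in lowest-price order, then fall back to On-Demand pools in the same order. The following is a simplified simulation, not Amazon EMR's actual implementation. The pool dictionaries, their keys, and the prices are hypothetical.

```python
def allocate_on_demand(requested, pools):
    """Fill `requested` instances from targeted ODCRs first, then On-Demand.

    `pools` is a list of dicts with hypothetical keys: type, price,
    odcr_unused (unused targeted ODCR slots), and on_demand_free
    (simulated On-Demand availability).
    """
    by_price = sorted(pools, key=lambda p: p["price"])
    plan = []
    remaining = requested
    # Pass 1 uses unused targeted ODCRs; pass 2 falls back to On-Demand pools.
    for source, key in (("odcr", "odcr_unused"), ("on-demand", "on_demand_free")):
        for pool in by_price:
            if remaining == 0:
                break
            take = min(remaining, pool[key])
            if take > 0:
                plan.append((pool["type"], source, take))
                remaining -= take
    return plan, remaining


# Illustrative pools in the locked Availability Zone; prices are made up.
pools = [
    {"type": "r8g.2xlarge", "price": 0.47, "odcr_unused": 4, "on_demand_free": 10},
    {"type": "r7g.2xlarge", "price": 0.43, "odcr_unused": 2, "on_demand_free": 1},
]
plan, unfilled = allocate_on_demand(9, pools)
for instance_type, source, count in plan:
    print(instance_type, source, count)
```

In this example, the six unused ODCR slots are consumed before any On-Demand capacity is requested, and a nonzero `unfilled` value would correspond to a request that can't be fully met.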

For more details about the allocation strategy, refer to Allocation strategy for instance fleets or Amazon EMR Support for Targeted ODCR.

Spiky workloads

Spiky workloads are defined by unpredictable and often significant fluctuations in processing demands, triggered by factors such as infrequent but resource-intensive periodic batch processing jobs. For example, a geographic information system processes location data from millions of users in real time to provide up-to-date traffic information, calculate routes, and suggest points of interest. User location data is constantly being generated, but the volume can spike dramatically during rush hour or special events, as illustrated in the following figure. This graph shows the number of used resources (Amazon EC2) by hour; it varies from 1 when the cluster scales in, waiting for jobs, to spikes of 1,000 nodes.

If you’re running spiky workloads with limited flexibility in instance type, family, and Availability Zone, you might face ICE errors when the available capacity can’t meet the cluster’s scaling requirements. To address this, we explore a set of best practices for EMR cluster creation to maximize availability and balance price-performance. Although spiky workloads present a unique challenge in resource management, configuring EMR instance fleets offers a powerful solution. By using diverse instance types, prioritized allocation strategies, Availability Zone flexibility, and managed scaling, organizations can create a robust, cost-effective infrastructure capable of handling unpredictable workload patterns. This configuration offers the following benefits:

Improved availability – By diversifying instance types and using multiple Availability Zones, the cluster mitigates insufficient capacity issues
Cost savings – Allocation strategies reduce costs while minimizing interruptions
Resilience for spiky workloads – Prioritizing instance generations provides seamless scaling under varying demands
Optimized performance – Managed scaling dynamically adjusts resources to meet workload demands efficiently

Introduce EC2 instance flexibility and instance fleets with a prioritized allocation strategy

Amazon EMR supports instance flexibility with instance fleet deployment. Instance fleets give you a wider variety of options and intelligence around instance provisioning. You can provide a list of up to 30 instance types with corresponding weighted capacities and Spot bid prices (including Spot blocks) using the AWS CLI or AWS CloudFormation. Amazon EMR automatically provisions On-Demand and Spot capacity across these instance types when creating your cluster. This can make it more straightforward and more cost-effective to quickly obtain and maintain your desired capacity for your clusters.

In August 2024, Amazon EMR introduced the prioritized allocation strategy to enhance instance flexibility with instance fleets. This feature allows you to specify priority levels for your instance types, enabling Amazon EMR to allocate capacity to the highest-priority instances first. This strategy helps improve cost savings and reduces the time required to launch clusters, even in scenarios with limited capacity. For more details, see Amazon EMR support prioritized and capacity-optimized-prioritized allocation strategies for EC2 instances.

To maximize cost-efficiency and availability for spiky workloads, combine the price-performance advantages of new-generation instances with the broader availability of previous-generation instances. For workloads with strict latency requirements, fix the instance size to maintain consistent performance. This approach takes advantage of the strengths of both instance generations, providing flexibility and reliability while decreasing the likelihood of capacity constraints. For On-Demand nodes, choose the prioritized allocation strategy, so the cluster tries to use newer-generation instances first. While configuring the instance fleet, arrange instances in a prioritized order reflecting price-performance and availability trade-offs, for example:

Primary node – m8g.12xlarge > m8g.16xlarge > m7g.12xlarge > m7g.16xlarge
Core node – r8g.8xlarge > r8g.12xlarge > r7g.8xlarge > r6g.16xlarge > r5.16xlarge
Task node – r8g.8xlarge > r8g.12xlarge > r7g.8xlarge > r6g.16xlarge > r5.16xlarge
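A preference order like this can be turned into instance type configuration entries programmatically. The following sketch is a hypothetical helper that assigns increasing Priority values to a most-preferred-first list. It assumes that a lower value means a higher priority, with 0 as the highest, so confirm that against the Amazon EMR API reference before relying on it. Weighted capacities and EBS settings are omitted for brevity.

```python
def prioritized_configs(instance_types):
    """Turn a most-preferred-first list into InstanceTypeConfigs-style entries."""
    return [
        # Assumption: 0 is the highest priority; later entries rank lower.
        {"InstanceType": instance_type, "Priority": priority}
        for priority, instance_type in enumerate(instance_types)
    ]


core_order = ["r8g.8xlarge", "r8g.12xlarge", "r7g.8xlarge", "r6g.16xlarge", "r5.16xlarge"]
core_configs = prioritized_configs(core_order)
for config in core_configs:
    print(config["Priority"], config["InstanceType"])
```

When a fleet mixes instance sizes as in this list, you would typically also set a weighted capacity per instance type so that larger instances count for more units toward the target capacity.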

For Spot Instances, make sure the capacity-optimized-prioritized allocation strategy is selected to reduce interruptions. See the following CloudFormation template snippet as an example:

...
       "Properties": {
         "Instances": {
          "MasterInstanceFleet": {
            "Name": "cfnMaster",
            "InstanceTypeConfigs": [
               {
                 "BidPrice": "10.50",
                 "InstanceType": "m5.xlarge",
                 "Priority": "1",
 ...
             "LaunchSpecifications": {
               "SpotSpecification": {
                 "TimeoutAction": "SWITCH_TO_ON_DEMAND",
                 "TimeoutDurationMinutes": 20,
                 "AllocationStrategy": "CAPACITY_OPTIMIZED_PRIORITIZED"
               },
               "OnDemandSpecification": {
                "AllocationStrategy": "PRIORITIZED"
               }
 ...

Select subnets with EMR instance fleets

When creating a cluster, specify multiple EC2 subnets within a virtual private cloud (VPC), each corresponding to a different Availability Zone. Amazon EMR provides multiple subnet (Availability Zone) options by employing subnet filtering at cluster launch, and selects one of the subnets that has adequate available IP addresses to successfully launch all instance fleets. If Amazon EMR can’t find a subnet with sufficient IP addresses to launch the whole cluster, it will prioritize the subnet that can at least launch the core and primary instance fleets.

Use managed scaling

Managed scaling is another powerful feature of Amazon EMR that automatically adjusts the number of instances in your cluster based on workload demands. This makes sure that your cluster scales up during periods of high demand to meet processing requirements and scales down during idle times to save costs. With managed scaling, you can set minimum and maximum scaling limits, giving you control over costs while benefiting from optimized and efficient cluster performance.
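To make the minimum and maximum limits concrete, the following sketch builds a managed scaling policy for an instance fleet cluster. The helper name and the example numbers are hypothetical. The 1 to 1,000 range mirrors the spiky workload described earlier, and the On-Demand cap of 200 units is an arbitrary illustration.

```python
def managed_scaling_policy(minimum_units, maximum_units,
                           max_on_demand_units=None, max_core_units=None):
    """Build a managed scaling policy dict for an instance fleet cluster."""
    if not 1 <= minimum_units <= maximum_units:
        raise ValueError("expected 1 <= minimum_units <= maximum_units")
    limits = {
        "UnitType": "InstanceFleetUnits",  # unit type for instance fleet clusters
        "MinimumCapacityUnits": minimum_units,
        "MaximumCapacityUnits": maximum_units,
    }
    # Optional caps; when omitted, the keys are left out of the policy.
    if max_on_demand_units is not None:
        limits["MaximumOnDemandCapacityUnits"] = max_on_demand_units
    if max_core_units is not None:
        limits["MaximumCoreCapacityUnits"] = max_core_units
    return {"ComputeLimits": limits}


policy = managed_scaling_policy(1, 1000, max_on_demand_units=200)
print(policy)
```

With boto3, a policy shaped like this could be attached to a running cluster with the EMR client's put_managed_scaling_policy call, passing your cluster ID and the policy as ManagedScalingPolicy.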

The following workflow illustrates Amazon EMR configured with instance fleets and managed scaling.

The workflow consists of the following steps:

The user defines the EMR instance configurations and instance types, along with their launch priority.
The user selects subnets for the Amazon EMR configuration to provide Availability Zone flexibility.
Amazon EMR calls the Amazon EC2 Fleet API to provision instances based on the allocation strategy.
The EMR instance fleet is launched.
The cycle is repeated for scaling operations within the launched Availability Zone, providing optimized performance and scalability.

Conclusion

In this post, we demonstrated how to optimize capacity by analyzing EMR workloads and implementing strategies tailored to your workload patterns. As you implement any of the preceding strategies, remember to continuously monitor your cluster’s performance and adjust configurations based on your specific workload patterns and business needs. With the right approach, the challenges of spiky workloads can be transformed into opportunities for optimized performance and cost savings.

To effectively manage workloads with both baseline demands and unexpected spikes, consider implementing a hybrid approach in Amazon EMR. Use ODCRs for consistent baseline capacity, and configure instance fleets with a strategic mix of ODCR, On-Demand, and Spot Instances that prioritizes ODCR usage.

Try these strategies with your own use case, and leave your questions in the comments.

About the Authors

Deepmala Agarwal works as an AWS Data Specialist Solutions Architect. She is passionate about helping customers build out scalable, distributed, and data-driven solutions on AWS. When not at work, Deepmala likes spending time with family, walking, listening to music, watching movies, and cooking!

Suba Palanisamy is a Senior Technical Account Manager, helping customers achieve operational excellence on AWS. Suba is passionate about all things data and analytics. She enjoys traveling with her family and playing board games.

Flavio Torres is a Principal Technical Account Manager at AWS. Flavio helps Enterprise Support customers design, deploy, and scale resilient cloud applications. Outside of work, he enjoys hiking and barbecuing.

Enhance your workload resilience with new Amazon EMR instance fleet features

2025-02-19 Deepmala Agarwal

Post Syndicated from Deepmala Agarwal original https://aws.amazon.com/blogs/big-data/enhance-your-workload-resilience-with-new-amazon-emr-instance-fleet-features/

Big data processing and analytics have emerged as fundamental components of modern data architectures. Organizations worldwide use these capabilities to extract actionable insights and facilitate data-driven decision-making processes. Amazon EMR has long been a cornerstone for big data processing in the cloud. Now, with a suite of exciting new features for EMR instance fleets that enables you to effectively manage your compute, Amazon is taking cloud-based analytics to the next level.

Amazon EMR has introduced new features for instance fleets that address critical challenges in big data operations. This post explores how these innovations improve cluster resilience, scalability, and efficiency, enabling you to build more robust data processing architectures on AWS. It introduces instance fleets, demonstrates the new allocation strategies, explains how enhanced Availability Zone and subnet selection works, and examines how these features improve your cluster’s resilience, so you can implement more resilient and efficient EMR clusters for your organization’s big data processing needs.

The current challenges

Organizations using big data operations might face several challenges:

When preferred instance types are unavailable, finding suitable alternatives often delays cluster launches and disrupts workflows
Selecting the optimal Availability Zone for cluster launch is challenging due to constantly changing available compute capacity, especially when considering future scaling needs
Maintaining uninterrupted operation of mission-critical long-running clusters becomes complex as data processing requirements evolve over time
Organizations frequently struggle to scale their operations to meet growing data processing demands, leading to performance bottlenecks and delayed insights

These challenges underscore the need for more advanced, flexible, and intelligent solutions in the realm of big data operations, driving the demand for innovative features in cloud-based data processing platforms.

Introducing improved EMR instance fleets

Amazon EMR, a cloud-based big data platform, allows you to process large datasets using various open source tools such as Apache Spark, Apache Flink, and Trino. To address the aforementioned challenges, Amazon EMR introduced instance fleets, with a robust set of features.

When setting up an EMR cluster, Amazon EMR offers two configuration options for configuring the primary, core, and task nodes: uniform instance groups or instance fleets.

Uniform instance groups offer a streamlined approach to cluster setup, allowing up to 50 instance groups per cluster. An EMR cluster has a primary instance group for the primary node, a core instance group with one or more Amazon Elastic Compute Cloud (Amazon EC2) instances, and the option to add up to 48 task instance groups. Both core and task instance groups are flexible in instance count, allowing any number of EC2 instances within each group, and each node type (primary, core, or task) consists of instances sharing the same specifications and purchasing model (On-Demand or Spot). However, this approach limits the ability to mix different instance types or purchasing options within a single group.

Instance fleets provide a versatile approach to provisioning EC2 instances, offering flexibility in cluster configuration. This setup assigns one instance fleet each for primary and core nodes, with the task instance fleet being optional. It allows you to specify up to five EC2 instance types (or up to 30 when using the AWS Command Line Interface (AWS CLI) or API with an instance allocation strategy) for each node type in a cluster, providing enhanced instance diversity to optimize cost and performance while increasing the likelihood of fulfilling capacity requirements. Instance fleets automatically manage the mix of instance types to meet specified target capacities for On-Demand and Spot, reducing operational overhead and improving compute availability.

Key benefits of instance fleets include improved cluster resilience to capacity fluctuations, superior management of Spot Instances with the ability to set timeouts and specify actions if Spot capacity can’t be provisioned, and faster cluster provisioning. The feature also allows you to select multiple subnets for different Availability Zones, enabling Amazon EMR to optimally launch clusters and automatically route traffic away from impacted zones during large-scale events. Additionally, instance fleets offer capacity reservation options for On-Demand Instances and support allocation strategies that prioritize instance types based on user-defined criteria, further enhancing the flexibility and efficiency of EMR cluster management.

Achieve resiliency with instance fleets

Now that you have a good understanding of instance fleets, let’s explore how the new instance fleet capabilities help achieve resiliency for your workloads through the following methods:

EC2 instance allocation – Enables precise control over instance type selection and prioritization
Enhanced subnet selection – Optimizes cluster deployment across Availability Zones

EC2 instance allocation

EMR instance fleets now offer newer allocation strategies for both Spot and On-Demand Instances, giving you control over selection and prioritization of instance types and allowing you to optimize for greater flexibility, resilience, and cost-efficiency.

Amazon EMR supports the following allocation strategies for On-Demand Instances:

Prioritized (new) – Allows you to define a priority order for instance types, giving you precise control over instance selection
Lowest-price (existing) – Selects the lowest-priced instance type from the available options

Amazon EMR supports the following allocation strategies for Spot Instances:

Price-capacity optimized (new) – Selects instances with the lowest price while also considering the available capacity
Capacity-optimized-prioritized (new) – Similar to capacity-optimized, but respects instance type priorities that you specify, on a best-effort basis
Capacity-optimized (existing) – Selects instances from the pools with the most available capacity
Lowest-price (existing) – Selects the lowest-priced Spot Instances
Diversified (existing) – Distributes instances across all pools

When you set priorities with the prioritized On-Demand allocation strategy, Amazon EMR applies the same priority value to both your On-Demand and Spot Instances.

For Spot Instances, Amazon EMR recommends the capacity-optimized allocation strategy. This approach allocates instances from the most available capacity pools, thereby reducing the chance of interruptions and enhancing cluster stability. Amazon EMR also allows you to launch a cluster without an allocation strategy. However, using an allocation strategy is recommended for faster cluster provisioning, more accurate Spot Instance allocation, and fewer Spot Instance interruptions.

Enhanced subnet selection

Amazon EMR on EC2 offers improved reliability and cluster launch experience for instance fleet clusters through the newly launched enhanced subnet selection. With this feature, EMR on EC2 reduces cluster launch failures resulting from an IP address shortage. Previously, the subnet selection for EMR clusters only considered the available IP addresses for the core instance fleet. Amazon EMR now employs subnet filtering at cluster launch and selects one of the subnets that have adequate available IP addresses to successfully launch all instance fleets. If Amazon EMR can’t find a subnet with sufficient IP addresses to launch the whole cluster, it will prioritize the subnet that can at least launch the core and primary instance fleets. In this scenario, Amazon EMR will also publish an Amazon CloudWatch alert event to notify the user. If none of the configured subnets can be used to provision the core and primary fleet, Amazon EMR will fail the cluster launch and provide a critical error event. These CloudWatch events enable you to monitor your clusters and take remedial actions as necessary. This capability is enabled by default when you configure more than one subnet for cluster launch, and you don’t need to make any configuration changes to benefit from it.
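The selection order described here can be sketched as a small function. This is a simplified model, not Amazon EMR's implementation. It looks only at free IP addresses, picks the first qualifying subnet rather than weighing capacity, and uses hypothetical event labels ("alert" and "critical") in place of the actual CloudWatch events.

```python
def select_subnet(subnets, fleet_ip_needs):
    """Return (subnet_id, event) following the selection order described above.

    `subnets` maps subnet ID to free IP addresses; `fleet_ip_needs` maps
    "primary", "core", and "task" to the IPs each fleet needs. Both inputs
    and the event labels are hypothetical stand-ins.
    """
    total = sum(fleet_ip_needs.values())
    essential = fleet_ip_needs["primary"] + fleet_ip_needs["core"]
    # Preferred: a subnet that can hold every instance fleet.
    for subnet_id, free_ips in subnets.items():
        if free_ips >= total:
            return subnet_id, None
    # Fallback: a subnet that can hold at least the primary and core fleets.
    for subnet_id, free_ips in subnets.items():
        if free_ips >= essential:
            return subnet_id, "alert"
    # No subnet fits the primary and core fleets: the launch fails.
    return None, "critical"


subnets = {"subnet-aaaa": 40, "subnet-bbbb": 120}
print(select_subnet(subnets, {"primary": 1, "core": 50, "task": 60}))
print(select_subnet(subnets, {"primary": 1, "core": 50, "task": 200}))
print(select_subnet(subnets, {"primary": 1, "core": 150, "task": 0}))
```

The three calls exercise the full launch, the fallback to primary and core fleets only, and the failed launch, in that order.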

Solution overview

Now that you have a comprehensive grasp of the two new features, let’s integrate the elements of instance fleets and look at the implementation flow for each feature.

EC2 instance allocation

The following diagram illustrates the instance fleet lifecycle management architecture.

The workflow consists of the following steps:

Create a cluster configuration with the prioritized allocation strategy, specifying instance types, their priority, and a list of potential subnets.
When you launch an EMR cluster, it evaluates compute capacity and available IPs across the specified subnets. Amazon EMR then selects a single Availability Zone that best meets capacity and instance availability needs for the entire cluster.
Amazon EMR launches the cluster using available instance types in one of the configured Availability Zones based on enhanced subnet selection.
During a scale-up scenario, Amazon EMR adds new instances to the clusters while following the configured compute allocation strategy.
If a specific instance type is unavailable, Amazon EMR will select the next available instance types based on the priority order. This flexibility provides capacity availability for production workloads while maintaining scalability.
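The priority-order fallback in the last step can be illustrated with a few lines of Python. This is a conceptual sketch only. The availability set is simulated, the function name is hypothetical, and it assumes that a lower Priority value means a higher priority. For Spot Instances, priorities are honored on a best-effort basis, which this sketch doesn't model.

```python
def pick_instance_type(configs, available):
    """Return the highest-priority instance type that is available, or None."""
    # Assumption: a lower Priority value means a higher priority.
    for config in sorted(configs, key=lambda c: c["Priority"]):
        if config["InstanceType"] in available:
            return config["InstanceType"]
    return None


# Illustrative fleet definition and simulated availability.
fleet = [
    {"InstanceType": "m5.xlarge", "Priority": 0},
    {"InstanceType": "m5a.xlarge", "Priority": 1},
    {"InstanceType": "m4.xlarge", "Priority": 2},
]
choice = pick_instance_type(fleet, {"m5a.xlarge", "m4.xlarge"})
print(choice)  # m5a.xlarge
```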

The following example code provisions an EMR cluster with a primary and core instance fleet configuration with both Spot and On-Demand Instances, using the Capacity-optimized-prioritized allocation strategy for Spot Instances and the Prioritized strategy for On-Demand Instances:

{
  "AWSTemplateFormatVersion": "2010-09-09",
  "Resources": {
    "myCluster": {
      "Type": "AWS::EMR::Cluster",
      "Properties": {
        "Instances": {
          "MasterInstanceFleet": {
            "Name": "cfnPrimary",
            "InstanceTypeConfigs": [
              {
                "BidPrice": "10.50",
                "InstanceType": "m5.xlarge",
                "Priority": "1",
                "EbsConfiguration": {
                  "EbsBlockDeviceConfigs": [
                    {
                      "VolumeSpecification": {
                        "VolumeType": "gp2",
                        "SizeInGB": 32
                      }
                    }
                  ]
                }
              }
            ],
            "TargetOnDemandCapacity": 1
          },
          "CoreInstanceFleet": {
            "Name": "cfnCore",
            "InstanceTypeConfigs": [
              {
                "BidPrice": "10.50",
                "InstanceType": "m5.xlarge",
                "Priority": "1",
                "WeightedCapacity": "1",
                "EbsConfiguration": {
                  "EbsBlockDeviceConfigs": [
                    {
                      "VolumeSpecification": {
                        "VolumeType": "gp2",
                        "SizeInGB": 32
                      }
                    }
                  ]
                }
              }
            ],
            "LaunchSpecifications": {
              "SpotSpecification": {
                "TimeoutAction": "SWITCH_TO_ON_DEMAND",
                "TimeoutDurationMinutes": 20,
                "AllocationStrategy": "CAPACITY_OPTIMIZED_PRIORITIZED"
              },
              "OnDemandSpecification": {
                "AllocationStrategy": "PRIORITIZED"
              }
            },
            "TargetOnDemandCapacity": "5",
            "TargetSpotCapacity": "0"
          }
        },
        "Name": "blog-test",
        "JobFlowRole": "EMR_EC2_DefaultRole",
        "ServiceRole": "EMR_DefaultRole",
        "ReleaseLabel": "emr-7.2.0"
      }
    }
  }
}
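The template above lists a single instance type per fleet, so the priority order has nothing to fall back on. As a minimal sketch, the following Python helper generates an InstanceTypeConfigs list with ascending Priority values for several instance types. The helper name and the additional instance types are hypothetical; only the InstanceType, Priority, and WeightedCapacity keys come from the template.

```python
import json


def build_instance_type_configs(instance_types: list[str], weighted_capacity: int = 1) -> list[dict]:
    """Assign ascending Priority values in list order (first entry = most preferred)."""
    return [
        {"InstanceType": name, "Priority": position, "WeightedCapacity": weighted_capacity}
        for position, name in enumerate(instance_types, start=1)
    ]


# Hypothetical preference order for the core fleet.
core_configs = build_instance_type_configs(["m5.xlarge", "m5.2xlarge", "r5.xlarge"])
print(json.dumps(core_configs, indent=2))
```

The printed JSON can be pasted into the InstanceTypeConfigs array of a fleet. In practice, you would likely also vary WeightedCapacity by instance size and add the EBS configuration for each entry.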

Enhanced subnet selection

To better understand Step 3 in the preceding workflow, let’s explore how enhanced subnet selection works with instance fleet EMR clusters.

For our example, let’s configure an EMR instance fleet as follows:

Primary fleet (1 unit) – r8g.xlarge, r6g.xlarge, r8g.2xlarge
Core fleet (48 units) – r6g.xlarge, r6g.2xlarge, m7g.2xlarge
Task fleet (48 units) – m7g.2xlarge, r6g.xlarge, r6a.4xlarge

For this example, let’s use the lowest price allocation strategy. Next, let’s check the available IP addresses in our subnets using the AWS CLI:

aws ec2 describe-subnets \
  --query "sort_by(Subnets, &SubnetId)[*].[SubnetId, AvailableIpAddressCount, AvailabilityZone]" \
  --output table

We get the following results:

-----------------------------------------------------
|                  DescribeSubnets                  |
+---------------------------+-------+---------------+
|  subnet-XXXXXXXXXXXXXXXX1 |  27   |  us-east-1a   |
|  subnet-XXXXXXXXXXXXXXXX2 |  251  |  us-east-1b   |
|  subnet-XXXXXXXXXXXXXXXX3 |  11   |  us-east-1a   |
+---------------------------+-------+---------------+

When launching an EMR cluster, Amazon EMR follows a specific subnet filtering process. First, EMR on EC2 evaluates subnets based on the total IP addresses required for all node types: primary, core, and task nodes. If multiple subnets have sufficient IP capacity to accommodate all instance fleets, Amazon EMR selects one based on the cluster’s allocation strategy. However, if no subnet has enough IPs to support all node types, Amazon EMR considers subnets that can at least accommodate the primary and core nodes, again using the allocation strategy to make the final selection. In our case, Amazon EMR selected the subnet in Availability Zone us-east-1b, whose 251 available IPs are enough for the 97 instances (1 primary, 48 core, and 48 task) needed to launch the whole cluster. It bypassed the smaller subnets with only 27 or 11 available IPs because they didn’t meet the minimum IP requirements for the cluster configuration. The cluster launched with the following instance types:

Primary fleet (1 unit) – r6g.xlarge
Core fleet (48 units) – m7g.2xlarge
Task fleet (48 units) – r6g.xlarge

The EMR and CloudWatch event for this cluster would be:

Amazon EMR cluster j-X40BEI1Oxxx (Cluster) 
is being created in subnet (subnet-XXXXXXXXXXXXXXXX2) 
in VPC (vpc-XXXXXXXXXXXXXXXX1) in Availability Zone (us-east-1b), 
which was chosen from the specified VPC options.

If Amazon EMR can’t find a subnet with sufficient IP addresses to launch the entire cluster, it will prioritize launching the core and primary instance fleets. If no configured subnet can accommodate even the core and primary fleets, Amazon EMR will fail the cluster launch and provide a critical error event. These CloudWatch events enable you to monitor your clusters and take necessary actions.
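The filtering tiers described in this section can be illustrated with a short Python sketch that uses the IP counts from the example. It is a simplified model, not the Amazon EMR implementation: it assumes one IP address per instance and one instance per capacity unit, the function and tier names are hypothetical, and it returns the candidate subnets without modeling how the allocation strategy picks among them.

```python
from typing import NamedTuple


class Subnet(NamedTuple):
    subnet_id: str
    available_ips: int


def filter_subnets(subnets: list[Subnet], primary: int, core: int, task: int) -> tuple[str, list[str]]:
    """Return the launch tier and the subnets that qualify for it."""
    full_need = primary + core + task   # IPs needed for every fleet
    minimum_need = primary + core       # IPs needed for primary and core only

    full = [s.subnet_id for s in subnets if s.available_ips >= full_need]
    if full:
        return "all_fleets", full

    partial = [s.subnet_id for s in subnets if s.available_ips >= minimum_need]
    if partial:
        return "primary_and_core_only", partial  # EMR also publishes a warning event

    return "launch_failed", []  # EMR fails the launch with a critical error event


subnets = [
    Subnet("subnet-XXXXXXXXXXXXXXXX1", 27),
    Subnet("subnet-XXXXXXXXXXXXXXXX2", 251),
    Subnet("subnet-XXXXXXXXXXXXXXXX3", 11),
]
tier, candidates = filter_subnets(subnets, primary=1, core=48, task=48)
print(tier, candidates)
```

With the example values, only the subnet with 251 available IPs can hold all 97 instances, so it is the single candidate in the first tier.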

Conclusion

The latest enhancements to EMR instance fleets mark a significant advancement in cloud-based big data processing, addressing key challenges in resource allocation, scalability, and reliability. These features, including priority-based instance selection and enhanced subnet selection, provide you with greater control over resource strategies, improved cluster availability, enhanced capacity optimization across Availability Zones, and more efficient fallback mechanisms for production workloads. Instance fleets help you tackle current resource management challenges while laying the groundwork for future scalability.

Get started today by setting up an EMR cluster using the example configuration provided in this post. For additional configuration options and implementation guidance, refer to the Amazon EMR documentation or reach out to your AWS account team.

About the Authors

Ravi Kumar Singh is a Senior Product Manager Technical-ES (PMT) at Amazon Web Services, specialized in building petabyte-scale data infrastructure and analytics platforms. With a passion for building innovative tools, he helps customers unlock valuable insights from their structured and unstructured data. Ravi’s expertise lies in creating robust data foundations using open source technologies and advanced cloud computing that power advanced artificial intelligence and machine learning use cases. A recognized thought leader in the field, he advances the data and AI ecosystem through pioneering solutions and collaborative industry initiatives. As a strong advocate for customer-centric solutions, Ravi constantly seeks ways to simplify complex data challenges and enhance user experiences. Outside of work, Ravi is an avid technology enthusiast who enjoys exploring emerging trends in data science, cloud computing, and machine learning.

Mandisa Nxumalo is a Cloud Engineer at Amazon Web Services (AWS) with over 5 years of experience in topics related to cloud services (databases, automation, and others). She currently specializes in the big data service Amazon EMR. She is passionate about engaging customers to effectively adopt and use data-driven approaches to improve their big data workflows. Outside work, Mandisa enjoys hiking mountains, chasing waterfalls, and travelling across countries.

Kashif Khan is a Sr. Analytics Specialist Solutions Architect at AWS, specializing in big data services like Amazon EMR, AWS Lake Formation, AWS Glue, Amazon Athena, and Amazon DataZone. With over a decade of experience in the big data domain, he possesses extensive expertise in architecting scalable and robust solutions. His role involves providing architectural guidance and collaborating closely with customers to design tailored solutions using AWS analytics services to unlock the full potential of their data.

Gaurav Sharma is a Specialist Solutions Architect (Analytics) at AWS, supporting US public sector customers on their cloud journey. Outside of work, Gaurav enjoys spending time with his family and reading books.

Enhance data security with fine-grained access controls in Amazon DataZone

2024-07-03 Deepmala Agarwal

Post Syndicated from Deepmala Agarwal original https://aws.amazon.com/blogs/big-data/enhance-data-security-with-fine-grained-access-controls-in-amazon-datazone/

Fine-grained access control is a crucial aspect of data security for modern data lakes and data warehouses. As organizations handle vast amounts of data across multiple data sources, the need to manage sensitive information has become increasingly important. Making sure the right people have access to the right data, without exposing sensitive information to unauthorized individuals, is essential for maintaining data privacy, compliance, and security.

Today, Amazon DataZone has introduced fine-grained access control, providing you granular control over your data assets in the Amazon DataZone business data catalog across data lakes and data warehouses. With the new capability, data owners can now restrict access to specific records of data at row and column levels, instead of granting access to the entire data asset. For example, if your data contains columns with sensitive information such as personally identifiable information (PII), you can restrict access to only the necessary columns, making sure sensitive information is protected while still allowing access to non-sensitive data. Similarly, you can control access at the row level, allowing users to see only the records that are relevant to their role or task.

In this post, we discuss how to implement fine-grained access control with row and column asset filters using this new feature in Amazon DataZone.

Row and column filters

Row filters enable you to restrict access to specific rows based on criteria you define. For instance, if your table contains data for two regions (America and Europe) and you want to make sure that employees in Europe only access data relevant to their region, you can create a row filter that includes only rows where the region is Europe (for example, region = 'Europe'). This way, employees in Europe won’t have access to America’s data.

Column filters allow you to limit access to specific columns within your data assets. For example, if your table includes sensitive information such as PII, you can create a column filter to exclude PII columns. This makes sure subscribers can only access non-sensitive data.

The row and column asset filters in Amazon DataZone enable you to control who can access what using a consistent, business user-friendly mechanism for all of your data across AWS data lakes and data warehouses. To use fine-grained access control in Amazon DataZone, you can create row and column filters on top of your data assets in the Amazon DataZone business data catalog. When a user requests a subscription to your data asset, you can approve the subscription by applying the appropriate row and column filters. Amazon DataZone enforces these filters using AWS Lake Formation and Amazon Redshift, making sure the subscriber can only access the rows and columns that they are authorized to use.
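To make the combined effect of row and column filters concrete, the following Python sketch applies both to a handful of in-memory records. It only mimics the outcome a subscriber would see. Amazon DataZone enforces filters through AWS Lake Formation and Amazon Redshift, not through code like this, and the sample records, column names, and helper function are made up for illustration.

```python
def apply_filters(rows, row_predicate, included_columns):
    """Keep rows matching the row filter, then keep only the included columns."""
    return [
        {column: row[column] for column in included_columns}
        for row in rows
        if row_predicate(row)
    ]


# Made-up sample records; the column names are hypothetical.
sales = [
    {"region": "Europe", "order_id": 1, "customer_email": "a@example.com"},
    {"region": "America", "order_id": 2, "customer_email": "b@example.com"},
    {"region": "Europe", "order_id": 3, "customer_email": "c@example.com"},
]

# Row filter: region = 'Europe'. Column filter: everything except the PII column.
europe_view = apply_filters(
    sales,
    row_predicate=lambda row: row["region"] == "Europe",
    included_columns=["region", "order_id"],
)
print(europe_view)
```

The subscriber ends up with the two Europe rows and no customer_email column, which is the behavior the filters are meant to guarantee.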

Solution overview

To demonstrate the new capability, we consider a sample customer use case where an electronics ecommerce platform is looking to implement fine-grained access controls using Amazon DataZone. The customer has multiple product categories, each operated by different divisions of the company. The platform governance team wants to make sure each division has visibility only to data belonging to their own categories. Additionally, the platform governance team needs to adhere to the finance team requirements that pricing information should be visible only to the finance team.

The sales team, acting as the data producer, has published an AWS Glue table called Product Sales that contains data for both the Laptops and Servers categories to the Amazon DataZone business data catalog using the project Product-Sales. The analytics teams in both the laptop and server divisions need to access this data for their respective analytics projects. The data owner’s objective is to grant data access to consumers based on the division they belong to. This means giving the laptop sales analytics team access to only the rows with laptop sales, and the server sales analytics team access to only the rows with server sales. Additionally, the data owner wants to restrict both teams from accessing the pricing data. This post demonstrates the implementation steps to achieve this use case in Amazon DataZone.

The steps to configure this solution are as follows:

The publisher creates asset filters for limiting access:
1. We create two row filters: a Laptop Only row filter that limits access to only the rows of data with laptop sales, and a Server Only row filter that limits access to the rows of data with server sales.
2. We also create a column filter called exclude-price-columns that excludes the price-related columns from the Product Sales data asset.
Consumers discover and request subscriptions:
1. The analyst from the laptops division requests a subscription to the Product Sales data asset.
2. The analyst from the servers division also requests a subscription to the Product Sales data asset.
3. Both subscription requests are sent to the publisher for approval.
The publisher approves the subscriptions and applies the appropriate filters:
1. The publisher approves the request from the analyst in the laptops division, applying the Laptop Only row filter and the exclude-price-columns column filter.
2. The publisher approves the request from the analyst in the servers division, applying the Server Only row filter and the exclude-price-columns column filter.
Consumers access the authorized data in Amazon Athena:
1. After the subscription is approved, we query the data in Athena to make sure that the analyst from the laptops division can now access only the product sales data for the Laptops category.
2. Similarly, the analyst from the servers division can access only the product sales data for the Servers category.
3. Both consumers can see all columns except the price-related columns, as per the applied column filter.

The following diagram illustrates the solution architecture and process flow.

Prerequisites

To follow along with this post, the publisher of the product sales data asset must have published a sales dataset in Amazon DataZone.

Publisher creates asset filters for limiting access

In this section, we detail the steps the publisher takes to create asset filters.

Create row filters

This dataset contains the product categories Laptops and Servers. We want to restrict access to the dataset that is authorized based on the product category. We use the row filter feature in Amazon DataZone to achieve this.

Amazon DataZone allows you to create row filters that can be used when approving subscriptions to make sure that the subscriber can only access rows of data as defined in the row filters. To create a row filter, complete the following steps:

On the Amazon DataZone console, navigate to the Product-Sales project (the project to which the asset belongs).
Navigate to the Data tab for the project.
Choose Inventory data in the navigation pane, then the asset Product Sales, where you want to create the row filter.

You can add row filters for assets of type AWS Glue tables or Redshift tables.

On the asset detail page, on the Asset filters tab, choose Add asset filter.

We create two row filters, one each for the Laptops and Servers categories.

Complete the following steps to create a laptop only asset row filter:
1. Enter a name for this filter (Laptop Only).
2. Enter a description of the filter (Allow rows with product category as Laptop Only).
3. For the filter type, select Row filter.
4. For the row filter expression, enter one or more expressions:
  1. Choose the column Product Category from the column dropdown menu.
  2. Choose the operator = from the operator dropdown menu.
  3. Enter the value Laptops in the Value field.
5. If you need to add another condition to the filter expression, choose Add condition. For this post, we create a filter with one condition.
6. When using multiple conditions in the row filter expression, choose And or Or to link the conditions.
7. You can also define the subscriber visibility. For this post, we kept the default value (No, show values to subscriber).
8. Choose Create asset filter.
Repeat the same steps to create a row filter called Server Only, except this time enter the value Servers in the Value field.

Create column filters

Next, we create column filters to restrict access to columns with price-related data. Complete the following steps:

In the same asset, add another asset filter of type column filter.
On the Asset filters tab, choose Add asset filter.
For Name, enter a name for the filter (for this post, exclude-price-columns).
For Description, enter a description of the filter (for this post, exclude price data columns).
For the filter type, select Column to create the column filter. This will display all the available columns in the data asset’s schema.
Select all columns except the price-related ones.
Choose Create asset filter.
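If you prefer to script these steps instead of using the console, the filters can be described as request payloads. The following Python sketch only builds the dictionaries. The field names (rowConfiguration, rowFilter, expression, equalTo, columnConfiguration, includedColumnNames) are assumptions based on the configuration parameter of the Amazon DataZone CreateAssetFilter API and should be verified against the current API reference before use. The column names are hypothetical.

```python
def row_filter_config(column: str, value: str) -> dict:
    """Row filter with a single 'column = value' condition (assumed API shape)."""
    return {
        "rowConfiguration": {
            "rowFilter": {
                "expression": {"equalTo": {"columnName": column, "value": value}}
            }
        }
    }


def column_filter_config(schema: list[str], excluded: set[str]) -> dict:
    """Column filter expressed as an include list: every column except the excluded ones."""
    return {
        "columnConfiguration": {
            "includedColumnNames": [name for name in schema if name not in excluded]
        }
    }


# Hypothetical schema for the Product Sales asset.
schema = ["product_category", "product_name", "units_sold", "price"]

laptop_only = row_filter_config("product_category", "Laptops")
server_only = row_filter_config("product_category", "Servers")
exclude_price_columns = column_filter_config(schema, excluded={"price"})
print(laptop_only, exclude_price_columns, sep="\n")
```

Each dictionary would be passed as the configuration argument of a create-asset-filter call, together with the domain identifier, asset identifier, filter name, and description. As in the console, the column filter is an include list, which is why the sketch derives it from the schema minus the price-related columns.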

Consumers discover and request subscriptions

In this section, we switch to the role of an analyst from the laptop division who is working within the project Sales Analytics - Laptops. As the data consumer, we search the catalog to find the Product Sales data asset and request access by subscribing to it.

Log in to your project as a consumer and search for the Product Sales data asset.
On the Product Sales data asset details page, choose Subscribe.
For Project, choose Sales Analytics – Laptops.
For Reason for request, enter the reason for the subscription request.
Choose Subscribe to submit the subscription request.

Publisher approves subscriptions with filters

After the subscription request is submitted, the publisher will receive the request, and they can approve it by following these steps:

As the publisher, open the project Product-Sales.
On the Data tab, choose Incoming requests in the left navigation pane.
Locate the request and choose View request. You can filter by Pending to see only requests that are still open.

This opens the details of the request, where you can see details like who requested the access, for what project, and the reason for the request.

To approve the request, there are two options:
1. Full access – If you choose to approve the subscription with full access option, the subscriber will get access to all the rows and columns in our data asset.
2. Approve with row and column filters – To limit access to specific rows and columns of data, you can choose the option to approve with row and column filters. For this post, we use both filters that we created earlier.
Select Choose filter, then on the dropdown menu, choose the Laptop Only and exclude-price-columns filters.
Choose Approve to approve the request.

After access is granted and fulfilled, the subscription looks as shown in the following screenshot.

Now let’s log in as a consumer from the server division.
Repeat the same steps, but this time, while approving the subscription, the publisher of the sales data approves with the Server Only row filter. The other steps remain the same.

Consumers access authorized data in Athena

Now that we have successfully published an asset to the Amazon DataZone catalog and subscribed to it, we can analyze it. Let’s log in as a consumer from the laptop division.

In the Amazon DataZone data portal, choose the consumer project Sales Analytics - Laptops.
On the Schema tab, we can view the subscribed assets.
Choose the project Sales Analytics - Laptops and choose the Overview tab.
In the right pane, open the Athena environment.

We can now run queries on the subscribed table.

Choose the table under Tables and views, then choose Preview to view the SELECT statement in the query editor.
Run a query as the consumer of Sales Analytics - Laptops, in which we can view data only with product category Laptops.

Under Tables and views, you can expand the table product_sales. The price-related columns are not visible in the Athena environment for querying.

Next, you can switch to the role of an analyst from the server division and analyze the dataset in a similar way.
We run the same query and see that under product_category, the analyst can see Servers only.

Conclusion

Amazon DataZone offers a straightforward way to implement fine-grained access controls on top of your data assets. This feature allows you to define column-level and row-level filters to enforce data privacy before the data is available to data consumers. Amazon DataZone fine-grained access control is generally available in all AWS Regions that support Amazon DataZone.

Try out the fine-grained access control feature in your own use case, and let us know your feedback in the comments section.

About the Authors

Leonardo Gomez is a Principal Analytics Specialist Solutions Architect at AWS. He has over a decade of experience in data management, helping customers around the globe address their business and technical needs. Connect with him on LinkedIn.

Utkarsh Mittal is a Senior Technical Product Manager for Amazon DataZone at AWS. He is passionate about building innovative products that simplify customers’ end-to-end analytics journeys. Outside of the tech world, Utkarsh loves to play music, with drums being his latest endeavor.
