Tag Archives: EMR Serverless

How the GoDaddy data platform achieved over 60% cost reduction and 50% performance boost by adopting Amazon EMR Serverless

Post Syndicated from Brandon Abear original https://aws.amazon.com/blogs/big-data/how-the-godaddy-data-platform-achieved-over-60-cost-reduction-and-50-performance-boost-by-adopting-amazon-emr-serverless/

This is a guest post co-written with Brandon Abear, Dinesh Sharma, John Bush, and Ozcan Ilikhan from GoDaddy.

GoDaddy empowers everyday entrepreneurs by providing all the help and tools to succeed online. With more than 20 million customers worldwide, GoDaddy is the place people come to name their ideas, build a professional website, attract customers, and manage their work.

At GoDaddy, we take pride in being a data-driven company. Our relentless pursuit of valuable insights from data fuels our business decisions and ensures customer satisfaction. Our commitment to efficiency is unwavering, and we’ve undertaken an exciting initiative to optimize our batch processing jobs. In this journey, we have identified a structured approach that we refer to as the seven layers of improvement opportunities. This methodology has become our guide in the pursuit of efficiency.

In this post, we discuss how we enhanced operational efficiency with Amazon EMR Serverless. We share our benchmarking methodology and results, along with insights into the cost-effectiveness of EMR Serverless vs. fixed-capacity Amazon EMR on EC2 transient clusters for our data workflows orchestrated using Amazon Managed Workflows for Apache Airflow (Amazon MWAA). We also share our strategy for adopting EMR Serverless in areas where it excels. Our findings reveal significant benefits, including over 60% cost reduction, 50% faster Spark workloads, a remarkable five-times improvement in development and testing speed, and a significant reduction in our carbon footprint.

Background

In late 2020, GoDaddy’s data platform initiated its AWS Cloud journey, migrating an 800-node Hadoop cluster with 2.5 PB of data from its data center to EMR on EC2. This lift-and-shift approach facilitated a direct comparison between on-premises and cloud environments, ensuring a smooth transition to AWS pipelines while minimizing data validation issues and migration delays.

By early 2022, we successfully migrated our big data workloads to EMR on EC2. Using best practices learned from the AWS FinHack program, we fine-tuned resource-intensive jobs, converted Pig and Hive jobs to Spark, and reduced our batch workload spend by 22.75% in 2022. However, scalability challenges emerged due to the multitude of jobs. This prompted GoDaddy to embark on a systematic optimization journey, establishing a foundation for more sustainable and efficient big data processing.

Seven layers of improvement opportunities

In our quest for operational efficiency, we have identified seven distinct layers of opportunities for optimization within our batch processing jobs, as shown in the following figure. These layers range from precise code-level enhancements to more comprehensive platform improvements. This multi-layered approach has become our strategic blueprint in the ongoing pursuit of better performance and higher efficiency.

Seven layers of improvement opportunities

The layers are as follows:

  • Code optimization – Focuses on refining the code logic and how it can be optimized for better performance. This involves performance enhancements through selective caching, partition and projection pruning, join optimizations, and other job-specific tuning. Using AI coding solutions is also an integral part of this process.
  • Software updates – Updating to the latest versions of open source software (OSS) to capitalize on new features and improvements. For example, Adaptive Query Execution in Spark 3 brings significant performance and cost improvements.
  • Custom Spark configurations – Tuning of custom Spark configurations to maximize resource utilization, memory, and parallelism. We can achieve significant improvements by right-sizing tasks, such as through spark.sql.shuffle.partitions, spark.sql.files.maxPartitionBytes, spark.executor.cores, and spark.executor.memory (see the configuration sketch after this list). However, these custom configurations might be counterproductive if they are not compatible with the specific Spark version.
  • Resource provisioning time – The time it takes to launch resources like ephemeral EMR clusters on Amazon Elastic Compute Cloud (Amazon EC2). Although some factors influencing this time are outside of an engineer’s control, identifying and addressing the factors that can be optimized can help reduce overall provisioning time.
  • Fine-grained scaling at task level – Dynamically adjusting resources such as CPU, memory, disk, and network bandwidth based on each stage’s needs within a task. The aim here is to avoid fixed cluster sizes that could result in resource waste.
  • Fine-grained scaling across multiple tasks in a workflow – Given that each task has unique resource requirements, maintaining a fixed resource size may result in under- or over-provisioning for certain tasks within the same workflow. Traditionally, the size of the largest task determines the cluster size for a multi-task workflow. However, dynamically adjusting resources across multiple tasks and steps within a workflow results in a more cost-effective implementation.
  • Platform-level enhancements – Enhancements at preceding layers can only optimize a given job or a workflow. Platform improvement aims to attain efficiency at the company level. We can achieve this through various means, such as updating or upgrading the core infrastructure, introducing new frameworks, allocating appropriate resources for each job profile, balancing service usage, optimizing the use of Savings Plans and Spot Instances, or implementing other comprehensive changes to boost efficiency across all tasks and workflows.
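
To make the Layer 3 tuning above concrete, the following is a minimal sketch of how such configurations might be set on a Spark session; the specific values are hypothetical placeholders and must be validated against the job's data volume and Spark version.

    from pyspark.sql import SparkSession

    # Hypothetical Layer 3 tuning: right-size shuffle partitions and executor
    # resources for a specific job profile. These values are illustrative only.
    spark = (
        SparkSession.builder
        .appName("layer3-custom-tuning")
        .config("spark.sql.shuffle.partitions", "400")
        .config("spark.sql.files.maxPartitionBytes", str(256 * 1024 * 1024))
        .config("spark.executor.cores", "4")
        .config("spark.executor.memory", "16g")
        .getOrCreate()
    )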

Layers 1–3: Previous cost reductions

After we migrated from on premises to AWS Cloud, we primarily focused our cost-optimization efforts on the first three layers shown in the diagram. By transitioning our most costly legacy Pig and Hive pipelines to Spark and optimizing Spark configurations for Amazon EMR, we achieved significant cost savings.

For example, a legacy Pig job took 10 hours to complete and ranked among the top 10 most expensive EMR jobs. Upon reviewing TEZ logs and cluster metrics, we discovered that the cluster was vastly over-provisioned for the data volume being processed and remained under-utilized for most of the runtime. Transitioning from Pig to Spark was more efficient. Although no automated tools were available for the conversion, manual optimizations were made, including:

  • Reduced unnecessary disk writes, saving serialization and deserialization time (Layer 1)
  • Replaced Airflow task parallelization with Spark, simplifying the Airflow DAG (Layer 1)
  • Eliminated redundant Spark transformations (Layer 1)
  • Upgraded from Spark 2 to 3, using Adaptive Query Execution (Layer 2)
  • Addressed skewed joins and optimized smaller dimension tables (Layer 3)

As a result, job cost decreased by 95%, and job completion time was reduced to 1 hour. However, this approach was labor-intensive and not scalable for numerous jobs.

Layers 4–6: Find and adopt the right compute solution

In late 2022, following our significant accomplishments in optimization at the previous levels, our attention moved towards enhancing the remaining layers.

Understanding the state of our batch processing

We use Amazon MWAA to orchestrate our data workflows in the cloud at scale. Apache Airflow is an open source tool used to programmatically author, schedule, and monitor sequences of processes and tasks, referred to as workflows. In this post, the terms workflow and job are used interchangeably, referring to the Directed Acyclic Graphs (DAGs) of tasks orchestrated by Amazon MWAA. Each workflow contains sequential or parallel tasks, or a combination of both, in the DAG between create_emr and terminate_emr tasks, running on a transient EMR cluster with fixed compute capacity for the duration of the workflow run. Even after optimizing a portion of our workload, we still had numerous non-optimized workflows that were under-utilized because compute was over-provisioned based on the most resource-intensive task in the workflow, as shown in the following figure.
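
To make the create_emr/terminate_emr pattern concrete, the following is a condensed, hypothetical sketch of such a DAG using the standard Airflow Amazon provider operators; the cluster configuration, step definition, and names are illustrative placeholders rather than our production DAGs.

    from datetime import datetime

    from airflow import DAG
    from airflow.providers.amazon.aws.operators.emr import (
        EmrAddStepsOperator,
        EmrCreateJobFlowOperator,
        EmrTerminateJobFlowOperator,
    )

    with DAG(
        dag_id="sample_transient_emr_workflow",  # hypothetical name
        start_date=datetime(2023, 1, 1),
        schedule_interval="@daily",
        catchup=False,
    ) as dag:
        # Fixed-capacity transient cluster sized for the largest task in the DAG
        create_emr = EmrCreateJobFlowOperator(
            task_id="create_emr",
            job_flow_overrides={
                "Name": "transient-cluster",      # placeholder; plus Instances,
                "ReleaseLabel": "emr-6.9.0",      # Applications, and so on
            },
        )

        run_spark_step = EmrAddStepsOperator(
            task_id="run_spark_step",
            job_flow_id=create_emr.output,
            steps=[{
                "Name": "spark-job",
                "ActionOnFailure": "CONTINUE",
                "HadoopJarStep": {
                    "Jar": "command-runner.jar",
                    "Args": ["spark-submit", "s3://my-bucket/jobs/job.py"],  # placeholder
                },
            }],
        )

        # Always shut the cluster down, even if the step fails
        terminate_emr = EmrTerminateJobFlowOperator(
            task_id="terminate_emr",
            job_flow_id=create_emr.output,
            trigger_rule="all_done",
        )

        create_emr >> run_spark_step >> terminate_emr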

This highlighted the impracticality of static resource allocation and led us to recognize the necessity of a dynamic resource allocation (DRA) system. Before proposing a solution, we gathered extensive data to thoroughly understand our batch processing. Analyzing the cluster step time, excluding provisioning and idle time, revealed significant insights: a right-skewed distribution with over half of the workflows completing in 20 minutes or less and only 10% taking more than 60 minutes. This distribution guided our choice of a fast-provisioning compute solution, dramatically reducing workflow runtimes. The following diagram illustrates step times (excluding provisioning and idle time) of EMR on EC2 transient clusters in one of our batch processing accounts.

Furthermore, based on the step time (excluding provisioning and idle time) distribution of the workflows, we categorized our workflows into three groups:

  • Quick run – Lasting 20 minutes or less
  • Medium run – Lasting between 20 and 60 minutes
  • Long run – Exceeding 60 minutes, often spanning several hours or more

Another factor we needed to consider was the extensive use of transient clusters for reasons such as security, job and cost isolation, and purpose-built clusters. Additionally, there was a significant variation in resource needs between peak hours and periods of low utilization.

Instead of fixed-size clusters, we could potentially use managed scaling on EMR on EC2 to achieve some cost benefits. However, migrating to EMR Serverless appears to be a more strategic direction for our data platform. In addition to potential cost benefits, EMR Serverless offers advantages such as one-click upgrades to the newest Amazon EMR versions, a simplified operational and debugging experience, and automatic adoption of the latest hardware generations as they roll out. These features collectively simplify the process of operating a platform on a larger scale.

Evaluating EMR Serverless: A case study at GoDaddy

EMR Serverless is a serverless option in Amazon EMR that eliminates the complexities of configuring, managing, and scaling clusters when running big data frameworks like Apache Spark and Apache Hive. With EMR Serverless, businesses can enjoy numerous benefits, including cost-effectiveness, faster provisioning, simplified developer experience, and improved resilience to Availability Zone failures.

Recognizing the potential of EMR Serverless, we conducted an in-depth benchmark study using real production workflows. The study aimed to assess EMR Serverless performance and efficiency while also creating an adoption plan for large-scale implementation. The findings were highly encouraging, showing EMR Serverless can effectively handle our workloads.

Benchmarking methodology

We split our data workflows into three categories based on total step time (excluding provisioning and idle time): quick run (0–20 minutes), medium run (20–60 minutes), and long run (over 60 minutes). We analyzed the impact of the EMR deployment type (Amazon EC2 vs. EMR Serverless) on two key metrics: cost-efficiency and total runtime speedup, which served as our overall evaluation criteria. Although we did not formally measure ease of use and resiliency, these factors were considered throughout the evaluation process.

The high-level steps to assess the environment are as follows:

  1. Prepare the data and environment:
    1. Choose three to five random production jobs from each job category.
    2. Implement required adjustments to prevent interference with production.
  2. Run tests:
    1. Run scripts over several days or through multiple iterations to gather precise and consistent data points.
    2. Perform tests using EMR on EC2 and EMR Serverless.
  3. Validate data and test runs:
    1. Validate input and output datasets, partitions, and row counts to ensure identical data processing (see the comparison sketch after this list).
  4. Gather metrics and analyze results:
    1. Gather relevant metrics from the tests.
    2. Analyze results to draw insights and conclusions.
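
For the validation in step 3, the following is a minimal PySpark sketch of the kind of output comparison performed; the bucket names and partition values are hypothetical placeholders.

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("benchmark-output-validation").getOrCreate()

    # Hypothetical output locations for the same workflow run on each deployment type
    ec2_out = spark.read.parquet("s3://bench-bucket/output-emr-on-ec2/dt=2023-01-01/")
    srv_out = spark.read.parquet("s3://bench-bucket/output-emr-serverless/dt=2023-01-01/")

    # Row counts must match, and neither side may contain rows the other lacks
    assert ec2_out.count() == srv_out.count(), "row counts differ"
    assert ec2_out.exceptAll(srv_out).count() == 0, "rows missing from EMR Serverless output"
    assert srv_out.exceptAll(ec2_out).count() == 0, "rows missing from EMR on EC2 output"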

Benchmark results

Our benchmark results showed significant enhancements across all three job categories for both runtime speedup and cost-efficiency. The improvements were most pronounced for quick jobs, directly resulting from faster startup times. For instance, a 20-minute data workflow (including cluster provisioning and shutdown) running on an EMR on EC2 transient cluster of fixed compute capacity finishes in 10 minutes on EMR Serverless, providing a shorter runtime with cost benefits. Overall, the shift to EMR Serverless delivered substantial performance improvements and cost reductions at scale across job brackets, as seen in the following figure.

Historically, we devoted more time to tuning our long-run workflows. Interestingly, we discovered that the existing custom Spark configurations for these jobs did not always translate well to EMR Serverless. In cases where the gains were insignificant, a common approach was to discard previous Spark configurations related to executor cores. By allowing EMR Serverless to autonomously manage these Spark configurations, we often observed improved outcomes. The following graph shows the average runtime and cost improvement per job when comparing EMR Serverless to EMR on EC2.

Per Job Improvement

The following table shows a sample comparison of results for the same workflow running on different deployment options of Amazon EMR (EMR on EC2 and EMR Serverless).

Metric | EMR on EC2 (Average) | EMR Serverless (Average) | EMR on EC2 vs. EMR Serverless
Total Run Cost ($) | $5.82 | $2.60 | 55%
Total Run Time (Minutes) | 53.40 | 39.40 | 26%
Provisioning Time (Minutes) | 10.20 | 0.05 | .
Provisioning Cost ($) | $1.19 | . | .
Steps Time (Minutes) | 38.20 | 39.16 | -3%
Steps Cost ($) | $4.30 | . | .
Idle Time (Minutes) | 4.80 | . | .
EMR Release Label | emr-6.9.0 | . | .
Hadoop Distribution | Amazon 3.3.3 | . | .
Spark Version | Spark 3.3.0 | . | .
Hive/HCatalog Version | Hive 3.1.3, HCatalog 3.1.3 | . | .
Job Type | Spark | . | .

AWS Graviton2 on EMR Serverless performance evaluation

After seeing compelling results with EMR Serverless for our workloads, we decided to further analyze the performance of the AWS Graviton2 (arm64) architecture within EMR Serverless. AWS had benchmarked Spark workloads on Graviton2 EMR Serverless using the TPC-DS 3TB scale, showing a 27% overall price-performance improvement.

To better understand the integration benefits, we ran our own study using GoDaddy’s production workloads on a daily schedule and observed an impressive 23.8% price-performance enhancement across a range of jobs when using Graviton2. For more details about this study, see GoDaddy benchmarking results in up to 24% better price-performance for their Spark workloads with AWS Graviton2 on Amazon EMR Serverless.

Adoption strategy for EMR Serverless

We strategically implemented a phased rollout of EMR Serverless via deployment rings, enabling systematic integration. This gradual approach let us validate improvements and halt further adoption of EMR Serverless if needed. It served both as a safety net to catch issues early and as a means to refine our infrastructure. The process mitigated change impact through smooth operations while building expertise across our Data Engineering and DevOps teams. Additionally, it fostered tight feedback loops, allowing prompt adjustments and ensuring efficient EMR Serverless integration.

We divided our workflows into three main adoption groups, as shown in the following image:

  • Canaries – This group aids in detecting and resolving any potential problems early in the deployment stage.
  • Early adopters – This is the second batch of workflows that adopt the new compute solution after initial issues have been identified and rectified by the canaries group.
  • Broad deployment rings – The largest group of rings, this group represents the wide-scale deployment of the solution. These are deployed after successful testing and implementation in the previous two groups.

Rings

We further broke down these workflows into granular deployment rings to adopt EMR Serverless, as shown in the following table.

Ring # | Name | Details
Ring 0 | Canary | Low adoption risk jobs that are expected to yield some cost saving benefits.
Ring 1 | Early Adopters | Low-risk quick-run Spark jobs that are expected to yield high gains.
Ring 2 | Quick-run | Rest of the quick-run (step_time <= 20 min) Spark jobs
Ring 3 | LargerJobs_EZ | High potential gain, easy move, medium-run and long-run Spark jobs
Ring 4 | LargerJobs | Rest of the medium-run and long-run Spark jobs with potential gains
Ring 5 | Hive | Hive jobs with potentially higher cost savings
Ring 6 | Redshift_EZ | Easy migration Redshift jobs that suit EMR Serverless
Ring 7 | Glue_EZ | Easy migration Glue jobs that suit EMR Serverless

Production adoption results summary

The encouraging benchmarking and canary adoption results generated considerable interest in wider EMR Serverless adoption at GoDaddy. To date, the EMR Serverless rollout remains underway. Thus far, it has reduced costs by 62.5% and accelerated total batch workflow completion by 50.4%.

Based on preliminary benchmarks, our team expected substantial gains for quick jobs. To our surprise, actual production deployments surpassed projections, averaging 64.4% faster vs. 42% projected, and 71.8% cheaper vs. 40% predicted.

Remarkably, long-running jobs also saw significant performance improvements due to the rapid provisioning of EMR Serverless and aggressive scaling enabled by dynamic resource allocation. We observed substantial parallelization during high-resource segments, resulting in a 40.5% faster total runtime compared to traditional approaches. The following chart illustrates the average enhancements per job category.

Prod Jobs Savings

Additionally, we observed the highest degree of dispersion for speed improvements within the long-run job category, as shown in the following box-and-whisker plot.

Whisker Plot

Sample workflows that adopted EMR Serverless

For a large workflow migrated to EMR Serverless, comparing 3-week averages pre- and post-migration revealed impressive cost savings: a 75.30% decrease based on retail pricing, along with a 10% improvement in total runtime, boosting operational efficiency. The following graph illustrates the cost trend.

Although quick-run jobs saved only small dollar amounts per run, they delivered the most significant percentage cost savings. With thousands of these workflows running daily, the accumulated savings are substantial. The following graph shows the cost trend for a small workload migrated from EMR on EC2 to EMR Serverless. Comparing 3-week pre- and post-migration averages revealed a remarkable 92.43% cost savings on the retail on-demand pricing, alongside an 80.6% acceleration in total runtime.

Sample workflows adopted EMR Serverless 2

Layer 7: Platform-wide improvements

We aim to revolutionize compute operations at GoDaddy, providing simplified yet powerful solutions for all users with our Intelligent Compute Platform. Built on AWS compute solutions like EMR Serverless and EMR on EC2, it provides optimized runs of data processing and machine learning (ML) workloads. An ML-powered job broker intelligently determines when and how to run jobs based on various parameters, while still allowing power users to customize. Additionally, an ML-powered compute resource manager pre-provisions resources based on load and historical data, providing efficient, fast provisioning at optimum cost. Intelligent compute empowers users with out-of-the-box optimization, catering to diverse personas without limiting power users.

The following diagram shows a high-level illustration of the intelligent compute architecture.

Insights and recommended best practices

The following section discusses the insights we’ve gathered and the recommended best practices we’ve developed during our preliminary and wider adoption stages.

Infrastructure preparation

Although EMR Serverless is a deployment option within Amazon EMR, it requires some infrastructure preparation to realize its full potential. Consider the following requirements and practical guidance on implementation:

  • Use large subnets across multiple Availability Zones – When running EMR Serverless workloads within your VPC, make sure the subnets span across multiple Availability Zones and are not constrained by IP addresses. Refer to Configuring VPC access and Best practices for subnet planning for details.
  • Modify maximum concurrent vCPU quota – For extensive compute requirements, it is recommended to increase your max concurrent vCPUs per account service quota.
  • Amazon MWAA version compatibility – When adopting EMR Serverless, GoDaddy’s decentralized Amazon MWAA ecosystem for data pipeline orchestration created compatibility issues stemming from disparate versions of the Airflow Amazon provider package. Directly upgrading Amazon MWAA was more efficient than updating numerous DAGs. We facilitated adoption by upgrading Amazon MWAA instances ourselves, documenting issues, and sharing findings and effort estimates for accurate upgrade planning.
  • GoDaddy EMR operator – To streamline migrating numerous Airflow DAGs from EMR on EC2 to EMR Serverless, we developed custom operators adapting existing interfaces. This allowed seamless transitions while retaining familiar tuning options. Data engineers could easily migrate pipelines with simple find-replace imports and immediately use EMR Serverless.

Unexpected behavior mitigation

The following are unexpected behaviors we ran into and what we did to mitigate them:

  • Spark DRA aggressive scaling – For some jobs (8.33% of initial benchmarks, 13.6% of production), cost increased after migrating to EMR Serverless. This was due to Spark DRA briefly assigning an excessive number of new workers, prioritizing performance over cost. To counteract this, we set maximum executor thresholds by adjusting spark.dynamicAllocation.maxExecutors, effectively limiting how aggressively EMR Serverless scales. When migrating from EMR on EC2, we suggest observing the max core count in the Spark History UI to replicate similar compute limits in EMR Serverless, such as --conf spark.executor.cores and --conf spark.dynamicAllocation.maxExecutors.
  • Managing disk space for large-scale jobs – When transitioning jobs that process large data volumes with substantial shuffles and significant disk requirements to EMR Serverless, we recommend configuring spark.emr-serverless.executor.disk by referring to existing Spark job metrics. Furthermore, configurations like spark.executor.cores combined with spark.emr-serverless.executor.disk and spark.dynamicAllocation.maxExecutors allow control over the underlying worker size and total attached storage when advantageous. For example, a shuffle-heavy job with relatively low disk usage may benefit from using a larger worker to increase the likelihood of local shuffle fetches (see the submission sketch after this list).
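
The following sketch shows where these settings are applied when submitting an EMR Serverless Spark job with Boto3; the application ID, role ARN, entry point, and values are hypothetical placeholders to be replaced with limits derived from your existing job metrics.

    import boto3

    emr_serverless = boto3.client("emr-serverless")

    # Cap DRA scaling and size worker disk based on metrics observed on EMR on EC2
    spark_params = (
        "--conf spark.executor.cores=4 "
        "--conf spark.executor.memory=16g "
        "--conf spark.emr-serverless.executor.disk=100g "
        "--conf spark.dynamicAllocation.maxExecutors=50"
    )

    response = emr_serverless.start_job_run(
        applicationId="00example123",  # placeholder application ID
        executionRoleArn="arn:aws:iam::111122223333:role/emr-serverless-job-role",  # placeholder
        name="shuffle-heavy-job",
        jobDriver={
            "sparkSubmit": {
                "entryPoint": "s3://my-bucket/jobs/shuffle_heavy_job.py",  # placeholder
                "sparkSubmitParameters": spark_params,
            }
        },
    )
    print(response["jobRunId"])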

Conclusion

As discussed in this post, our experiences with adopting EMR Serverless on arm64 have been overwhelmingly positive. The impressive results we’ve achieved, including a 60% reduction in cost, 50% faster runs of batch Spark workloads, and an astounding five-times improvement in development and testing speed, speak volumes about the potential of this technology. Furthermore, our current results suggest that by widely adopting Graviton2 on EMR Serverless, we could potentially reduce the carbon footprint by up to 60% for our batch processing.

However, it’s crucial to understand that these results are not a one-size-fits-all scenario. The enhancements you can expect are subject to factors including, but not limited to, the specific nature of your workflows, cluster configurations, resource utilization levels, and fluctuations in computational capacity. Therefore, we strongly advocate for a data-driven, ring-based deployment strategy when considering the integration of EMR Serverless, which can help optimize its benefits to the fullest.

Special thanks to Mukul Sharma and Boris Berlin for their contributions to benchmarking. Many thanks to Travis Muhlestein (CDO), Abhijit Kundu (VP Eng), Vincent Yung (Sr. Director Eng.), and Wai Kin Lau (Sr. Director Data Eng.) for their continued support.


About the Authors

Brandon Abear is a Principal Data Engineer in the Data & Analytics (DnA) organization at GoDaddy. He enjoys all things big data. In his spare time, he enjoys traveling, watching movies, and playing rhythm games.

Dinesh Sharma is a Principal Data Engineer in the Data & Analytics (DnA) organization at GoDaddy. He is passionate about user experience and developer productivity, always looking for ways to optimize engineering processes and saving cost. In his spare time, he loves reading and is an avid manga fan.

John Bush is a Principal Software Engineer in the Data & Analytics (DnA) organization at GoDaddy. He is passionate about making it easier for organizations to manage data and use it to drive their businesses forward. In his spare time, he loves hiking, camping, and riding his ebike.

Ozcan Ilikhan is the Director of Engineering for the Data and ML Platform at GoDaddy. He has over two decades of multidisciplinary leadership experience, spanning startups to global enterprises. He has a passion for leveraging data and AI in creating solutions that delight customers, empower them to achieve more, and boost operational efficiency. Outside of his professional life, he enjoys reading, hiking, gardening, volunteering, and embarking on DIY projects.

Harsh Vardhan is an AWS Solutions Architect, specializing in big data and analytics. He has over 8 years of experience working in the field of big data and data science. He is passionate about helping customers adopt best practices and discover insights from their data.

Define per-team resource limits for big data workloads using Amazon EMR Serverless

Post Syndicated from Gaurav Sharma original https://aws.amazon.com/blogs/big-data/define-per-team-resource-limits-for-big-data-workloads-using-amazon-emr-serverless/

Customers face a challenge when distributing cloud resources between different teams running workloads such as development, testing, or production. The resource distribution challenge also occurs when you have different line-of-business users. The objective is not only to ensure that sufficient resources are consistently available to production workloads and critical teams, but also to prevent ad hoc jobs from consuming all the resources and delaying other critical workloads due to misconfigured or non-optimized code. Cost controls and usage tracking across these teams are also critical factors.

In legacy big data Hadoop clusters, as well as in provisioned Amazon EMR clusters, this problem was addressed through YARN resource management by defining YARN queues for different workloads or teams. Another approach was to allocate independent clusters for different teams or different workloads.

Amazon EMR Serverless is a serverless option in Amazon EMR that makes it straightforward to run your big data workloads using open-source analytics frameworks such as Apache Spark and Hive without the need to configure, manage, or scale the clusters. With EMR Serverless, you don’t have to configure, optimize, secure, or operate clusters to run your workloads. You continue to get the benefits of Amazon EMR, such as open-source compatibility, concurrency, and optimized runtime performance for popular big data frameworks. EMR Serverless provides shorter job startup latency, automatic resource management, and effective cost controls.

In this post, we show how to define per-team resource limits for big data workloads using EMR Serverless.

Solution overview

EMR Serverless comes with a concept called an EMR Serverless application, which is an isolated environment with the option to choose one of the open source analytics frameworks (Spark or Hive) to submit your workloads. You can include your own custom libraries, specify your EMR release version, and most importantly define the resource limits for the compute and memory resources. For instance, if your production Spark jobs run on Amazon EMR 6.9.0 and you need to test the same workload on Amazon EMR 6.10.0, you could use EMR Serverless to define EMR 6.10.0 as your version and test your workload using a predefined limit on resources.

The following diagram illustrates our solution architecture. It shows two different teams, a Prod team and a Dev team, submitting their jobs independently to two different EMR Serverless applications (ProdApp and DevApp, respectively), each with dedicated resources.

EMR Serverless provides controls at the account, application, and job levels to limit the use of resources such as CPU, memory, or disk. In the following sections, we discuss some of these controls.

Service quotas at account level

Amazon EMR Serverless has a default quota of 16 for maximum concurrent vCPUs per account. In other words, a new account can have a maximum of 16 vCPUs running at a given point in time in a particular Region across all EMR Serverless applications. However, this quota is auto-adjustable based on the usage patterns, which are monitored at the account and Region levels.

Resource limits and runtime configurations at the application level

In addition to quotas at the account level, administrators can limit the use of resources at the application level using a feature known as “maximum capacity”, which defines the maximum total vCPU, memory, and disk capacity that can be consumed collectively by all the jobs running under this application.
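
As a sketch of how maximum capacity can be declared, the following Boto3 call creates a Spark application capped at the totals from the sizing example later in this post (100 vCPU, 800 GB of memory, and 1,000 GB of disk for 50 workers of 2 vCPUs, 16 GB, and 20 GB each); the application name is a placeholder.

    import boto3

    emr_serverless = boto3.client("emr-serverless")

    # Application-level ceiling shared by all jobs submitted to this application
    response = emr_serverless.create_application(
        name="prod-team-spark-app",  # placeholder name
        type="SPARK",
        releaseLabel="emr-6.9.0",
        maximumCapacity={
            "cpu": "100 vCPU",
            "memory": "800 GB",
            "disk": "1000 GB",
        },
    )
    print(response["applicationId"])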

You also have an option to specify common runtime and monitoring configurations at the application level that you would otherwise put in the specific job configurations. This helps create a standardized runtime environment for all the jobs running under an application. This can include settings like common connection settings your jobs need access to, log configurations that all your jobs will inherit by default, or Spark resource settings to help balance ad hoc workloads. You can override these configurations at the job level, but defining them at the application level can help reduce the configuration necessary for individual jobs.

For further details, refer to Declaring configurations at application level.

Runtime configurations at the job level

After you have set service and application quotas and runtime configurations at the application level, you also have the option to override or add new configurations at the job level. For example, you can use different Spark job parameters to define how many maximum executors can be run by that specific job. One such parameter is spark.dynamicAllocation.maxExecutors, which defines an upper bound for the number of executors in a job and therefore controls the number of workers in an EMR Serverless application, because each executor runs within a single worker. This parameter is part of the dynamic allocation feature of Apache Spark, which allows you to dynamically scale the number of executors (workers) registered with the job up and down based on the workload. Dynamic allocation is enabled by default on EMR Serverless. For detailed steps, refer to Declaring configurations at application level.

With these configurations, you can control the resources used across accounts, applications, and jobs. For example, you can create applications with a predefined maximum capacity to constrain costs or configure jobs with resource limits in order to allow multiple ad hoc jobs to run simultaneously without consuming too many resources.

Best practices and considerations

Extending these usage scenarios further, EMR Serverless provides features and capabilities to implement the following design considerations and best practices based on your workload requirements:

  • To make sure that users or teams submit their jobs only to their approved applications, you could use tag-based AWS Identity and Access Management (IAM) policy conditions. For more details, refer to Using tags for access control.
  • You can use custom images for applications belonging to different teams that have distinct use cases and software requirements. Using custom images is possible with Amazon EMR 6.9.0 and onwards. Custom images allow you to package various application dependencies into a single container. Some of the important benefits of using custom images include the ability to use your own JDK and Python versions, apply your organization-specific security policies, and integrate EMR Serverless into your build, test, and deploy pipelines. For more information, refer to Customizing an EMR Serverless image.
  • If you need to estimate how much a Spark job would cost when run on EMR Serverless, you can use the open-source tool EMR Serverless Estimator. This tool analyzes Spark event logs to provide you with the cost estimate. For more details, refer to Amazon EMR Serverless cost estimator.
  • We recommend that you determine your maximum capacity relative to the supported worker sizes by multiplying the number of workers by their size. For example, if you want to limit your application with 50 workers to 2 vCPUs, 16 GB of memory and 20 GB of disk, set the maximum capacity to 100 vCPU, 800 GB of memory, and 1000 GB of disk.
  • You can use tags when you create the EMR Serverless application to help search and filter your resources, or track the AWS costs using AWS Cost Explorer. You can also use tags for controlling who can submit jobs to a particular application or modify its configurations. Refer to Tagging your resources for more details.
  • You can configure the pre-initialized capacity at the time of application creation, which keeps the resources ready to be consumed by the time-sensitive jobs you submit.
  • The number of concurrent jobs you can run depends on important factors like maximum capacity limits, the workers required for each job, and the available IP addresses if using a VPC.
  • EMR Serverless sets up elastic network interfaces (ENIs) to securely communicate with resources in your VPC. Make sure you have enough IP addresses in your subnet for the job.
  • It’s a best practice to select multiple subnets from multiple Availability Zones. This is because the subnets you select determine the Availability Zones that are available to run the EMR Serverless application. Each worker uses an IP address in the subnet where it is launched. Make sure the configured subnets have enough IP addresses for the number of workers you plan to run.

Resource usage tracking

EMR Serverless not only allows cloud administrators to limit the resources for each application, it also enables them to monitor the applications and track the usage of resources across these applications. For more details, refer to EMR Serverless usage metrics.

You can also deploy an AWS CloudFormation template to build a sample CloudWatch Dashboard for EMR Serverless which would help visualize various metrics for your applications and jobs. For more information, refer to EMR Serverless CloudWatch Dashboard.

Conclusion

In this post, we discussed how EMR Serverless empowers cloud and data platform administrators to efficiently distribute as well as restrict cloud resources at different levels, for different organizational units, users, and teams, as well as between critical and non-critical workloads. EMR Serverless resource limiting features help keep cloud costs under control and resource usage tracked effectively.

For more information on EMR Serverless applications and resource quotas, please refer to EMR Serverless User Guide and Configuring an application.


About the Authors

Gaurav Sharma is a Specialist Solutions Architect (Analytics) at Amazon Web Services (AWS), supporting US public sector customers on their cloud journey. Outside of work, Gaurav enjoys spending time with his family and reading books.

Damon Cortesi is a Principal Developer Advocate with Amazon Web Services. He builds tools and content to help make the lives of data engineers easier. When not hard at work, he still builds data pipelines and splits logs in his spare time.

How CyberSolutions built a scalable data pipeline using Amazon EMR Serverless and the AWS Data Lab

Post Syndicated from Constantin Scoarță original https://aws.amazon.com/blogs/big-data/how-cybersolutions-built-a-scalable-data-pipeline-using-amazon-emr-serverless-and-the-aws-data-lab/

This post is co-written by Constantin Scoarță and Horațiu Măiereanu from CyberSolutions Tech.

CyberSolutions is one of the leading ecommerce enablers in Germany. We design, implement, maintain, and optimize award-winning ecommerce platforms end to end. Our solutions are based on best-in-class software like SAP Hybris and Adobe Experience Manager, and complemented by unique services that help automate the pricing and sourcing processes.

We have built data pipelines to process, aggregate, and clean our data for our forecasting service. With the growing interest in our services, we wanted to scale our batch-based data pipeline to process more historical data on a daily basis and yet remain performant, cost-efficient, and predictable. To meet our requirements, we have been exploring the use of Amazon EMR Serverless as a potential solution.

To accelerate our initiative, we worked with the AWS Data Lab team. They offer joint engineering engagements between customers and AWS technical resources to create tangible deliverables that accelerate data and analytics initiatives. We chose to work through a Build Lab, which is a 2–5-day intensive build with a technical customer team.

In this post, we share how we engaged with the AWS Data Lab program to build a scalable and performant data pipeline using EMR Serverless.

Use case

Our forecasting and recommendation algorithm is fed with historical data, which needs to be curated, cleaned, and aggregated. Our solution was based on AWS Glue workflows orchestrating a set of AWS Glue jobs, which worked fine for our requirements. However, as our use case developed, it required more computations and bigger datasets, resulting in unpredictable performance and cost.

This pipeline performs daily extracts from our data warehouse and a few other systems, curates the data, and does some aggregations (such as daily average). Those will be consumed by our internal tools and generate recommendations accordingly. Prior to the engagement, the pipeline was processing 28 days’ worth of historical data in approximately 70 minutes. We wanted to extend that to 100 days and 365 days of data without having to extend the extraction window or increase the resources configured.

Solution overview

While working with the Data Lab team, we decided to structure our efforts into two approaches. As a short-term improvement, we were looking into optimizing the existing pipeline based on AWS Glue extract, transform, and load (ETL) jobs, orchestrated via AWS Glue workflows. However, for the mid-term to long-term, we looked at EMR Serverless to run our forecasting data pipeline.

EMR Serverless is an option in Amazon EMR that makes it easy and cost-effective for data engineers and analysts to run petabyte-scale data analytics in the cloud. With EMR Serverless, we could run applications built using open-source frameworks such as Apache Spark (as in our case) without having to configure, manage, optimize, or secure clusters. The following factors influenced our decision to use EMR Serverless:

  • Our pipeline had minimal dependency on the AWS Glue context and its features, instead running native Apache Spark
  • EMR Serverless offers configurable drivers and workers
  • With EMR Serverless, we were able to take advantage of its cost tracking feature for applications
  • The need for managing our own Spark History Server was eliminated because EMR Serverless automatically creates a monitoring Spark UI for each job

Therefore, we planned the lab activities to be categorized as follows:

  • Improve the existing code to be more performant and scalable
  • Create an EMR Serverless application and adapt the pipeline
  • Run the entire pipeline with different date intervals

The following solution architecture depicts the high-level components we worked with during the Build Lab.

In the following sections, we dive into the lab implementation in more detail.

Improve the existing code

After examining our code decisions, we identified a step in our pipeline that consumed the most time and resources, and we decided to focus on improving it. Our target job for this optimization was the “Create Moving Average” job, which involves computing various aggregations such as averages, medians, and sums on a moving window. Initially, this step took around 4.7 minutes to process an interval of 28 days. However, running the job for larger datasets proved to be challenging – it didn’t scale well and even resulted in errors in some cases.

While reviewing our code, we focused on several areas, including checking data frames at certain steps to ensure that they contained content before proceeding. Initially, we used the count() API to achieve this, but we discovered that head() was a better alternative because it returns the first n rows only and is faster than count() for large input data. With this change, we were able to save around 15 seconds when processing 28 days’ worth of data. Additionally, we optimized our output writing by using coalesce() instead of repartition().

These changes managed to shave off some time, down to 4 minutes per run. However, we could achieve a better performance by using cache() on data frames before performing the aggregations, which materializes the data frame the first time an action computes it. Additionally, we used unpersist() to free up executors’ memory after we were done with the mentioned aggregations. This led to a runtime of approximately 3.5 minutes for this job.
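
The following is a simplified sketch of the caching pattern described above; the input path, column names, and output locations are hypothetical, and the real job computes several moving-window aggregations.

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.appName("create-moving-average").getOrCreate()

    trips = spark.read.parquet("s3://forecast-bucket/curated/trips/")  # placeholder path

    # head() is cheaper than count() when we only need to know the frame is non-empty
    if not trips.head(1):
        raise SystemExit("no input data for this interval")

    # Cache once, reuse across several aggregations, then free executor memory
    trips = trips.cache()

    daily_avg = trips.groupBy("item_id", "order_date").agg(F.avg("quantity").alias("avg_qty"))
    daily_sum = trips.groupBy("item_id", "order_date").agg(F.sum("quantity").alias("sum_qty"))

    daily_avg.coalesce(8).write.mode("overwrite").parquet("s3://forecast-bucket/out/daily_avg/")
    daily_sum.coalesce(8).write.mode("overwrite").parquet("s3://forecast-bucket/out/daily_sum/")

    trips.unpersist()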

Following the successful code improvements, we managed to extend the data input to 100 days, 1 year, and 3 years. For this specific job, the coalesce() function wasn’t avoiding the shuffle operation and caused uneven data distribution per executor, so we switched back to repartition() for this job. By the end, we managed to get successful runs in 4.7, 12, and 57 minutes, using the same number of workers in AWS Glue (10 standard workers).

Adapt code to EMR Serverless

To observe if running the same job in EMR Serverless would yield better results, we configured an application that uses a comparable number of executors as in AWS Glue jobs. In the job configurations, we used 2 cores and 6 GB of memory for the driver and 20 executors with 4 cores and 16 GB of memory. However, we didn’t use additional ephemeral storage (by default, workers come with free 20 GB).

By the time we had the Build Lab, AWS Glue supported Apache Spark 3.1.1; however, we opted to use Spark 3.2.0 (Amazon EMR version 6.6.0) instead. Additionally, during the Build Lab, only x86_64 EMR Serverless applications were available, although it now also supports arm64-based architecture.

We adapted the code that utilized the AWS Glue context to work with native Apache Spark. For instance, we needed to overwrite existing partitions and sync updates with the AWS Glue Data Catalog, especially when old partitions were replaced and new ones were added. We achieved this by setting spark.conf.set("spark.sql.sources.partitionOverwriteMode", "DYNAMIC") and using an MSCK REPAIR query to sync the relevant table. Similarly, we replaced the read and write operations to rely on Apache Spark APIs.
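
The following is a minimal sketch of those two replacements, assuming a partitioned Parquet table already registered in the AWS Glue Data Catalog; the paths, partition column, and table name are placeholders.

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("glue-to-native-spark").getOrCreate()

    # Overwrite only the partitions present in the output instead of the whole table
    spark.conf.set("spark.sql.sources.partitionOverwriteMode", "DYNAMIC")

    df = spark.read.parquet("s3://forecast-bucket/staging/moving_average/")  # placeholder input

    (df.write
        .mode("overwrite")
        .partitionBy("order_date")  # placeholder partition column
        .parquet("s3://forecast-bucket/curated/moving_average/"))  # placeholder output

    # Sync replaced and newly added partitions with the AWS Glue Data Catalog
    spark.sql("MSCK REPAIR TABLE analytics.moving_average")  # placeholder table name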

During the tests, we intentionally disabled the fine-grained auto scaling feature of EMR Serverless while running jobs, in order to observe how the code would perform with the same number of workers but different date intervals. We achieved that by setting spark.dynamicAllocation.enabled to false (the default is true).

For the same code, number of workers, and data inputs, we managed to get better performance results with EMR Serverless, which were 2.5, 2.9, 6, and 16 minutes for 28 days, 100 days, 1 year, and 3 years, respectively.

Run the entire pipeline with different date intervals

Because the code for our jobs was implemented in a modular fashion, we were able to quickly test all of them with EMR Serverless and then link them together to orchestrate the pipeline via Amazon Managed Workflows for Apache Airflow (Amazon MWAA).
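
The following is a hypothetical sketch of how such a chained pipeline might look in an Amazon MWAA DAG using the Amazon provider's EmrServerlessStartJobOperator; the application ID, role, and entry points are placeholders rather than our actual configuration.

    from datetime import datetime

    from airflow import DAG
    from airflow.providers.amazon.aws.operators.emr import EmrServerlessStartJobOperator

    APPLICATION_ID = "00example123"  # placeholder EMR Serverless application ID
    JOB_ROLE_ARN = "arn:aws:iam::111122223333:role/emr-serverless-job-role"  # placeholder

    def spark_job(task_id: str, script: str) -> EmrServerlessStartJobOperator:
        # Each pipeline step runs as its own EMR Serverless Spark job
        return EmrServerlessStartJobOperator(
            task_id=task_id,
            application_id=APPLICATION_ID,
            execution_role_arn=JOB_ROLE_ARN,
            job_driver={"sparkSubmit": {"entryPoint": script}},
        )

    with DAG(
        dag_id="forecasting_pipeline",  # placeholder name
        start_date=datetime(2023, 1, 1),
        schedule_interval="@daily",
        catchup=False,
    ) as dag:
        extract = spark_job("extract", "s3://my-bucket/jobs/extract.py")
        moving_average = spark_job("create_moving_average", "s3://my-bucket/jobs/create_moving_average.py")
        aggregate = spark_job("aggregate", "s3://my-bucket/jobs/aggregate.py")

        extract >> moving_average >> aggregate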

Regarding performance, our previous pipeline using AWS Glue took around 70 minutes to run with our regular workload. However, our new pipeline, powered by Amazon MWAA-backed EMR Serverless, achieved similar results in approximately 60 minutes. Although this is a notable improvement, the most significant benefit was our ability to scale up to process larger amounts of data using the same number of workers. For instance, processing 1 year’s worth of data only took around 107 minutes to complete.

Conclusion and key takeaways

In this post, we outlined the approach taken by the CyberSolutions team in conjunction with the AWS Data Lab to create a high-performing and scalable demand forecasting pipeline. By using optimized Apache Spark jobs on customizable EMR Serverless workers, we were able to surpass the performance of our previous workflow. Specifically, the new setup resulted in 50–72% better performance for most jobs when processing 100 days of data, resulting in an overall cost savings of around 38%.

EMR Serverless applications’ features helped us have better control over cost. For example, we configured the pre-initialized capacity, which resulted in job start times of 1–4 seconds. And we set up the application behavior to start with the first submitted job and automatically stop after a configurable idle time.

As a next step, we are actively testing AWS Graviton2-based EMR applications, which come with more performance gains and lower cost.


About the Authors

 Constantin Scoarță is a Software Engineer at CyberSolutions Tech. He is mainly focused on building data cleaning and forecasting pipelines. In his spare time, he enjoys hiking, cycling, and skiing.

Horațiu Măiereanu is the Head of Python Development at CyberSolutions Tech. His team builds smart microservices for ecommerce retailers to help them improve and automate their workloads. In his free time, he likes hiking and traveling with his family and friends.

Ahmed Ewis is a Solutions Architect at the AWS Data Lab. He helps AWS customers design and build scalable data platforms using AWS database and analytics services. Outside of work, Ahmed enjoys playing with his child and cooking.

Build incremental data pipelines to load transactional data changes using AWS DMS, Delta 2.0, and Amazon EMR Serverless

Post Syndicated from Sankar Sundaram original https://aws.amazon.com/blogs/big-data/build-incremental-data-pipelines-to-load-transactional-data-changes-using-aws-dms-delta-2-0-and-amazon-emr-serverless/

Building data lakes from continuously changing transactional data of databases and keeping data lakes up to date is a complex task and can be an operational challenge. A solution to this problem is to use AWS Database Migration Service (AWS DMS) for migrating historical and real-time transactional data into the data lake. You can then apply transformations and store data in Delta format for managing inserts, updates, and deletes.

Amazon EMR Serverless is a serverless option in Amazon EMR that makes it easy for data analysts and engineers to run open-source big data analytics frameworks without configuring, managing, and scaling clusters or servers. EMR Serverless automatically provisions and scales the compute and memory resources required by your applications, and you only pay for the resources that the applications use. EMR Serverless also provides you with more flexibility on overriding default Spark configurations, customizing EMR Serverless images, and customizing Spark driver and executor sizes to better suit specific workloads.

This post demonstrates how to implement a solution that uses AWS DMS to stream ongoing replication or change data capture (CDC) from an Amazon Aurora PostgreSQL-Compatible Edition database into Amazon Simple Storage Service (Amazon S3). We then apply transformations using Spark jobs on an EMR Serverless application and write transformed output into open-source Delta tables in Amazon S3. The Delta tables created by the EMR Serverless application are exposed through the AWS Glue Data Catalog and can be queried through Amazon Athena. Although this post uses an Aurora PostgreSQL database hosted on AWS as the data source, the solution can be extended to ingest data from any of the AWS DMS supported databases hosted on your data centers.

Solution overview

The following diagram shows the overall architecture of the solution that we implement in this post.

Architecture diagram

The solution consists of the following steps for implementing a full and incremental (CDC) data ingestion from a relational database:

  • Data storage and data generation – We create an Aurora PostgreSQL database and generate fictional trip data by running a stored procedure. The data will have attributes like trip ID (primary key), timestamp, source location, and destination location. Incremental data is generated in the PostgreSQL table by running custom SQL scripts.
  • Data ingestion – Steps 1 and 2 use AWS DMS, which connects to the source database and moves full and incremental data (CDC) to Amazon S3 in Parquet format. Let’s refer to this S3 bucket as the raw layer.
  • Data transformation – Steps 3 and 4 represent an EMR Serverless Spark application (Amazon EMR 6.9 with Apache Spark version 3.3.0) created using Amazon EMR Studio. The script reads input data from the S3 raw bucket, and then invokes Delta Lake’s MERGE statements to merge the data with the target S3 bucket (curated layer); see the merge sketch after this list. The script also creates and updates a manifest file on Amazon S3 every time the job is run to enable data access from Athena and Amazon Redshift Spectrum.
  • Data access – The EMR Serverless job has code snippets that create a Delta table in the AWS Glue Data Catalog in Step 5. Steps 6 and 7 describe using Athena and Redshift Spectrum to query data from the Delta tables using standard SQL through the AWS Glue Data Catalog.
  • Data pipeline – Step 8 describes the process for triggering the data pipeline in a periodic manner through Airflow operators using Amazon Managed Workflows for Apache Airflow (Amazon MWAA). Refer to Submitting EMR Serverless jobs from Airflow for additional details. In this post, AWS DMS has been configured to replicate data from Amazon Aurora PostgreSQL-Compatible Edition into an S3 bucket with hourly partitions. The Airflow DAG can be configured to call an EMR Serverless job to process the past X hours of data based on specific project requirements. Implementation of the Airflow setup is not explored within the scope of this post.
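
To illustrate the transformation step, the following is a simplified sketch of a Delta Lake merge followed by manifest generation, keyed on the trip_id primary key from the sample dataset; the bucket paths are placeholders, and the actual emr_delta_cdc.py script additionally handles full load versus CDC records, deletes, and Data Catalog registration.

    from delta.tables import DeltaTable
    from pyspark.sql import SparkSession

    spark = (
        SparkSession.builder
        .appName("emr-delta-cdc")
        .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
        .config("spark.sql.catalog.spark_catalog",
                "org.apache.spark.sql.delta.catalog.DeltaCatalog")
        .getOrCreate()
    )

    # Latest records landed by AWS DMS in the raw layer (placeholder path)
    cdc_df = spark.read.parquet("s3://raw-bucket/delta_emr_source/travel_details/")

    target = DeltaTable.forPath(spark, "s3://curated-bucket/travel_details/")  # placeholder

    # Upsert on the trip_id primary key
    (target.alias("t")
        .merge(cdc_df.alias("s"), "t.trip_id = s.trip_id")
        .whenMatchedUpdateAll()
        .whenNotMatchedInsertAll()
        .execute())

    # Refresh the manifest so Athena and Redshift Spectrum see the new files
    target.generate("symlink_format_manifest")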

The architecture has the following major features:

  • Reliability – The end-to-end architecture is made resilient with the Multi-AZ feature of EMR Serverless and using Multi-AZ deployments for AWS DMS and Amazon Aurora PostgreSQL-Compatible Edition. When you submit jobs to an EMR Serverless application, those jobs are automatically distributed to different Availability Zones in the Region. A job is run in a single Availability Zone to avoid performance implications of network traffic across Availability Zones. In case an Availability Zone is impaired, a job submitted to your EMR Serverless application is automatically run in a different (healthy) Availability Zone. When using resources in a private VPC, EMR Serverless recommends that you specify the private VPC configuration for multiple Availability Zones so that EMR Serverless can automatically select a healthy Availability Zone.
  • Cost optimization – When you run Spark or Hive applications using EMR Serverless, you pay for the amount of vCPU, memory, and storage resources consumed by your applications, leading to optimal utilization of resources. There is no separate charge for Amazon Elastic Compute Cloud (Amazon EC2) instances or Amazon Elastic Block Store (Amazon EBS) volumes. For additional details on cost, refer to Amazon EMR Serverless cost estimator.
  • Performance efficiency – You can run analytics workloads at any scale with automatic on-demand scaling that resizes resources in seconds to meet changing data volumes and processing requirements. EMR Serverless includes the Amazon EMR performance-optimized runtime for Apache Spark and Hive. The Amazon EMR runtime for Spark is 100% API-compatible with OSS Spark and is over 3.5 times as fast as standard open-source Spark, so your jobs run faster and incur less compute cost. With fast and fine-grained scaling in EMR Serverless, if a pipeline runs daily and needs to process 1 GB of data one day and 100 GB of data another day, EMR Serverless automatically scales to handle that load.
  • Monitoring – EMR Serverless sends metrics to Amazon CloudWatch at the application and job level every 1 minute. You can set up a single-view dashboard in CloudWatch to visualize application-level and job-level metrics using an AWS CloudFormation template provided on the EMR Serverless CloudWatch Dashboard GitHub repository. Also, EMR Serverless can store application logs in a managed storage, Amazon S3, or both based on your configuration settings. After you submit a job to an EMR Serverless application, you can view the real-time Spark UI or the Hive Tez UI for the running job from the EMR Studio console or request a secure URL using the GetDashboardForJobRun API. For completed jobs, you can view the Spark History Server or the Persistent Hive Tez UI from the EMR Studio console.

The following steps are performed to implement this solution:

  1. Connect to the Aurora PostgreSQL instance and generate a sample dataset.
  2. Set up a data pipeline for loading data from Amazon Aurora PostgreSQL-Compatible Edition into Delta Lake on Amazon S3 and query using Athena:
    • Start the AWS DMS task to perform full table load and capture ongoing replication to the S3 raw layer.
    • Run the EMR Serverless Spark application to load data into Delta Lake.
    • Query the Delta tables (native tables) through Athena.
  3. Run the data pipeline to capture incremental data changes into Delta Lake:
    • Generate an incremental (CDC) dataset and insert it into the Aurora PostgreSQL database.
    • Run the EMR Serverless Spark application to merge CDC data in the S3 curated layer (incremental load).
    • Query the Delta Lake tables through Athena to validate the merged data.

Prerequisites

We use a CloudFormation template to provision the AWS resources required for the solution. The CloudFormation template requires you to select an EC2 key pair. This key is configured on an EC2 instance that lives in the public subnet. We use this EC2 instance to connect to the Aurora PostgreSQL instance that lives in the private subnet. Make sure you have a key in the Region where you deploy the template. If you don’t have one, you can create a new key pair.

To walk through this post, we use Delta Lake version > 2.0.0, which is supported in Apache Spark 3.2.x. Choose the Delta Lake version compatible with your Spark version by visiting the Delta Lake releases page. We use an EMR Serverless application with version emr-6.9.0, which supports Spark version 3.3.0.

Deploy your resources

To provision the resources needed for the solution, complete the following steps:

  1. Choose Launch Stack:

  2. For Stack name, enter emr-serverless-deltalake-blog.
  3. For DatabaseUserName, enter the user name for logging in to Amazon Aurora PostgreSQL-Compatible Edition. Keep the default value if you don’t want to change it.
  4. For DatabasePassword, enter the password for logging in to Amazon Aurora PostgreSQL-Compatible Edition.
  5. For ClientIPCIDR, enter the IP address of your SQL client that will be used to connect to the EC2 instance. We use this EC2 instance to connect to the Aurora PostgreSQL database.
  6. For KeyName, enter the key pair to be used in your EC2 instance. This EC2 instance will be used as a proxy to connect from your SQL client to the Aurora PostgreSQL source database.
  7. For EC2ImageId, PrivateSubnet1CIDR, PrivateSubnet2CIDR, PublicSubnetCIDR, and VpcCIDR, keep the default values or choose appropriate values for the VPC and EC2 image for your specific environment.
  8. Choose Next.
  9. Choose Next again.
  10. On the review page, select I acknowledge that AWS CloudFormation might create IAM resources with custom names.
  11. Choose Create stack.

After the CloudFormation stack creation is complete and the resources are created, the Outputs tab shows the information shown in the following screenshot.

The CloudFormation template creates all the resources needed for the solution workflow:

  • S3 raw and curated buckets
  • Aurora PostgreSQL database
  • AWS DMS migration task, replication instance, and other resources
  • EC2 instance for running data ingestion scripts
  • AWS Identity and Access Management (IAM) roles and policies needed to perform the necessary activities as part of this solution
  • VPC, subnets, security groups, and relevant network components
  • AWS Lambda functions that perform setup activities required for this workflow
  • Additional components needed for running the EMR Serverless workflow

You can find the PySpark script in the raw S3 bucket on the Amazon S3 console as shown in the following screenshot. The bucket will have the naming structure <CloudFormation template name>-rawS3bucket-<random string>. Make a note of the S3 path to the emr_delta_cdc.py script; you need this information while submitting the Spark job via the EMR Serverless application.

The preceding task for creating the resources via CloudFormation assumes that AWS Lake Formation is not enabled in the Region (which we enable later in this post). If you already have Lake Formation enabled in the Region, make sure the IAM user or role used in the CloudFormation template has the necessary permissions to create a database in the AWS Glue Data Catalog.

Connect to the Aurora PostgreSQL instance and generate a sample dataset

Connect to the Aurora PostgreSQL endpoint using your preferred client. For this post, we use the PSQL command line tool. Note that the IP address of the client machine from which you’re connecting to the database must be updated in the Aurora PostgreSQL security group. This is done by the CloudFormation template based on the input parameter value for ClientIPCIDR. If you’re accessing the database from another machine, update the security group accordingly.

  1. Connect to your EC2 instance from the command line using the public DNS of the EC2 instance from the CloudFormation template output.
  2. Log in to the EC2 instance and connect to the Aurora PostgreSQL instance using the following commands (the Aurora PostgreSQL endpoint is available on the Outputs tab of the CloudFormation stack):
    psql -h << Aurora PostgreSQL endpoint >> -p 5432 -U <<username>> -d emrdelta_source_db

  3. Run the following commands to create a schema and table for the fictional trip dataset:
    create schema delta_emr_source;
    
    create table delta_emr_source.travel_details (trip_id int PRIMARY KEY,tstamp timestamp, route_id varchar(2),destination varchar(50),source_location varchar(50));

  4. Create the following stored procedure to generate the records for the trip dataset and insert the records into the table:
    create or replace procedure delta_emr_source.insert_records(records int)
    language plpgsql
    as $$
    declare
    max_trip_id integer;
    begin
    --get max trip_id
    select coalesce(max(trip_id),1) into max_trip_id from delta_emr_source.travel_details;
    
    --insert records
    for i in max_trip_id+1..max_trip_id+records loop
    INSERT INTO delta_emr_source.travel_details (trip_id, tstamp, route_id,destination,source_location) values (i, current_timestamp, chr(65 + (i % 10)),(array['Seattle', 'New York', 'New Jersey', 'Los Angeles', 'Las Vegas',
    'Tucson', 'Washington DC', 'Philadelphia', 'Miami', 'San Francisco'])[(floor(random() * 10))+1],(array['Seattle', 'New York', 'New Jersey', 'Los Angeles', 'Las Vegas',
    'Tucson', 'Washington DC', 'Philadelphia', 'Miami', 'San Francisco'])[(floor(random() * 10))+1]);
    end loop;
    
    commit;
    
    raise notice 'Inserted record count - %', records;
    end; $$;

  5. Call the preceding stored procedure to insert 20,000 records into the Aurora PostgreSQL database:
    call delta_emr_source.insert_records(20000);

  6. After the stored procedure is complete, verify that the records have been inserted successfully:
    select count(*) from delta_emr_source.travel_details;
    

Set up a data pipeline for loading data into Delta tables on Amazon S3 and query using Athena

In this section, we walk through the steps to set up a data pipeline that loads data from Amazon Aurora PostgreSQL-Compatible Edition into Delta tables on Amazon S3 and then query the data using Athena.

Start the AWS DMS task to perform full table load to the S3 raw layer

To perform the full table load, complete the following steps:

  1. On the AWS DMS console, choose Database migration tasks in the navigation pane.
  2. Select the task that was created by the CloudFormation template (emrdelta-postgres-s3-migration).
  3. On the Actions menu, choose Restart/Resume.

The task starts the full load and ongoing replication of data from the source database to Amazon S3.

  4. Wait for the task to complete.

You can validate that the data has been migrated successfully by checking the Load state column for the AWS DMS task.

  5. Navigate to the S3 bucket created by the CloudFormation template to store raw data from AWS DMS. The bucket will have the naming structure <CloudFormation template name>-rawS3bucket-<random string>.
  6. Navigate to the folder delta_emr_source/travel_details in the raw S3 bucket. You can verify the S3 folder has Parquet data populated from the AWS DMS task.
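If you prefer to verify this from code instead of the console, a quick check with boto3 could look like the following (replace the bucket name with your raw bucket from the CloudFormation outputs):

import boto3

s3 = boto3.client("s3")

# List the Parquet objects written by AWS DMS under the full-load prefix
response = s3.list_objects_v2(
    Bucket="<raw-bucket-name>",
    Prefix="delta_emr_source/travel_details/"
)
for obj in response.get("Contents", []):
    print(obj["Key"], obj["Size"])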

Run the EMR Serverless Spark application to load data into Delta tables

We use EMR Studio to manage and submit jobs in an EMR Serverless application.

  1. Launch EMR Studio and create an EMR Serverless application.
  2. For Name, enter emr-delta-blog.
  3. For Type, choose Spark.
  4. For Release version, choose your release version.
  5. For Architecture, select x86_64.
  6. For Application setup options, select Choose default settings.

  7. Choose Create application and verify that the EMR application has been created successfully on the Amazon EMR console.

  8. Choose emr-delta-blog and then choose Start application. You can verify that the EMR application has started successfully on the Amazon EMR console, as shown in the following screenshot.


The application moves to Stopped status after a period of inactivity. When you submit a job to the application, it starts again and runs the job. This provides cost savings because jobs run on demand, as opposed to maintaining a running EMR cluster.
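The auto start and auto stop behavior is configurable per application. As a sketch, you could adjust the idle timeout through the boto3 EMR Serverless API (the application ID and the 15-minute timeout are example values):

import boto3

emr_serverless = boto3.client("emr-serverless")

# Keep auto start enabled and stop the application after 15 idle minutes.
# The application must be in a created or stopped state before its configuration can be updated.
emr_serverless.update_application(
    applicationId="<application-id>",
    autoStartConfiguration={"enabled": True},
    autoStopConfiguration={"enabled": True, "idleTimeoutMinutes": 15}
)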

  9. While the application is in Started status, choose Submit job.

Create a new job on the Job details page:

  10. For Name, enter emr-delta-load-job.
  11. For Runtime role, choose emrserverless-execution-role.
  12. For S3 URI, enter the S3 (raw bucket) path where the script emr_delta_cdc.py is uploaded.
  13. For Script arguments, enter ["I","delta_emr_source","9999-12-31-01","travel_details","route_id"].

The script arguments provide the following details to the EMR Serverless application:

  • I – The first argument represents the data load type. The allowed values are I for full load and U for incremental data load.
  • delta_emr_source – The second argument represents the source database schema from which data is being migrated through the AWS DMS task.
  • 9999-12-31-01 – The third argument represents the partition from which data needs to be loaded in an incremental fashion. This argument is used only during CDC data load; for full load, we have provided a default value (9999-12-31-01).
  • travel_details – The fourth argument represents the source database table from which data is being migrated through the AWS DMS task. Use a semicolon as a delimiter when entering multiple tables.
  • route_id – The fifth argument represents the partition keys on which the table data should be partitioned when stored in the S3 curated bucket. Use a semicolon as a delimiter when entering comma-separated partition keys for multiple tables.

With these arguments, you can group a set of tables into a single job submission to an EMR Serverless application. Provide multiple table names separated by semicolons, and provide the partition keys for those tables in the same order, also separated by semicolons. If a particular table doesn’t have a partition key, leave its position empty between semicolons. The number of semicolon-separated table entries must match the number of semicolon-separated partition key entries for the script to run successfully, as shown in the example that follows.
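For example, to run a full load for two tables where travel_details is partitioned on route_id and a second (hypothetical) trip_audit table has no partition key, the script arguments would look like the following:

["I","delta_emr_source","9999-12-31-01","travel_details;trip_audit","route_id;"]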

Also, if you want to add tables to an existing EMR Serverless job, you need to run a separate full-load job for the new tables (set the first argument to I along with the new table names), and then add those tables to the argument list of the existing job to capture incremental data load going forward.

EMR Serverless version 6.9.0 comes pre-installed with Delta version 2.1.0. Refer to About Amazon EMR Releases for more details about pre-installed libraries and applications for a specific Amazon EMR release. With earlier releases, you have to upload the Delta JAR files to an S3 bucket in your account and provide the JAR file path in the application configurations using the spark.jars option. In this walkthrough, we create an EMR Serverless 6.9.0 application and use the pre-installed Delta JARs from Amazon EMR.

  14. Under Spark properties, choose Edit in text and enter the following configurations:
--conf spark.jars=/usr/share/aws/delta/lib/delta-core.jar,/usr/share/aws/delta/lib/delta-storage.jar --conf spark.sql.extensions=io.delta.sql.DeltaSparkSessionExtension --conf spark.sql.catalog.spark_catalog=org.apache.spark.sql.delta.catalog.DeltaCatalog --conf spark.submit.pyFiles=/usr/share/aws/delta/lib/delta-core.jar --conf spark.hadoop.hive.metastore.client.factory.class=com.amazonaws.glue.catalog.metastore.AWSGlueDataCatalogHiveClientFactory

If you want to use a different version of Delta JAR files, you can replace the S3 path of the JAR files in these configuration options.
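For example, if you upload Delta Lake 2.2.0 JARs to your own bucket, the configuration might look like the following (the bucket and JAR file names are placeholders for whichever Delta version you choose):

--conf spark.jars=s3://<your-bucket>/jars/delta-core_2.12-2.2.0.jar,s3://<your-bucket>/jars/delta-storage-2.2.0.jar --conf spark.submit.pyFiles=s3://<your-bucket>/jars/delta-core_2.12-2.2.0.jar --conf spark.sql.extensions=io.delta.sql.DeltaSparkSessionExtension --conf spark.sql.catalog.spark_catalog=org.apache.spark.sql.delta.catalog.DeltaCatalog --conf spark.hadoop.hive.metastore.client.factory.class=com.amazonaws.glue.catalog.metastore.AWSGlueDataCatalogHiveClientFactory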

  15. Leave the rest of the configurations at their default and choose Submit job.
  16. Wait for the job to complete successfully. You can verify this on the EMR Serverless console.

  17. Additionally, go to the S3 location (the curated bucket created by AWS CloudFormation) and verify that the Delta files are created along with the manifest file.

  18. Select a job run and then choose Spark History Server (Completed jobs) on the View Application UIs menu.


You can now use the Spark History Server UI to navigate to various tabs and analyze the job run in detail. For Spark error and output logs, you can navigate to the Executors tab and explore the driver or executor logs as required. This can help you debug the job in case of failures by looking at the Spark logs. You can also choose Spark UI (Running jobs) to track the progress of the EMR Serverless Spark jobs while they are running.

The data load script is the same for the initial and incremental data loads because it can handle both workflows through script arguments:

from pyspark.sql import SparkSession
from datetime import datetime
from pyspark.sql.functions import *
from pyspark.sql.window import Window
import boto3
import sys
from delta import *

# S3 bucket location, auto-populated for this post. Replace for other jobs
raw_bucket="<<raw_bucket_name>>"
curated_bucket= "<<curated_bucket_name>>"

spark = (
    SparkSession.builder.appName("SparkSQL")
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog", "org.apache.spark.sql.delta.catalog.DeltaCatalog")
    .enableHiveSupport()
    .getOrCreate()
)

# Check the argument list; if it doesn't match the expected argument count, exit the program
if len(sys.argv) != 6:
    print("This script requires 5 arguments for successful execution - Load_type,database_schema,CDC_path,source_table,Partition_keys")
    print(sys.argv)
    sys.exit(0)

s3 = boto3.client('s3')

# Split table names into a list if there is more than one table separated by semicolons
tables = sys.argv[4].split(";")

schema = sys.argv[2]
load_type = sys.argv[1]
cdc_partition = sys.argv[3]
deltaHivePath = "s3://" + curated_bucket + "/" + schema + "/"
columns_to_drop = ["Op", "schema_name", "table_name", "update_ts_dms", "tstamp"]
db_name = "emrserverless_delta"

# Split table partition keys into a list if there is more than one table separated by semicolons
partition_keys = sys.argv[5].split(";")
# Exit if the number of table names and partition keys differ, to ensure data is provided for all tables
if len(tables) != len(partition_keys):
    print("Please enter partition keys for all tables. If a partition key is not present, enter an empty value between semicolons - T1_PK;;T3_PK")
    sys.exit(0)


i = 0
while i < len(tables):
    table = tables[i]
    partition_key = partition_keys[i].split(",")
    print(partition_key)
    if load_type == 'I':
        print("Moving to Full-load logic for the table", table)

        # Read the data from the raw bucket
        source_df1 = spark.read.format("parquet").load(
            "s3://" + raw_bucket + "/" + schema + "/" + table + "/")

        # There is no target table in Delta format. Loading for the first time.
        # The following code segment populates the Delta table in S3 and also
        # updates the Glue catalog for querying with Athena.
        additional_options = {"path": deltaHivePath + table + "/"}
        if columns_to_drop is not None and columns_to_drop != '':
            source_df1 = source_df1.drop(*columns_to_drop)

        # Check for the presence of a partition key before writing data to the curated bucket
        if partition_key[0]:
            source_df1.write.mode("append")\
                .format("delta")\
                .partitionBy(*partition_key)\
                .options(**additional_options)\
                .saveAsTable(db_name + ".spark_" + table)
        else:
            source_df1.write.mode("append")\
                .format("delta")\
                .options(**additional_options)\
                .saveAsTable(db_name + ".spark_" + table)

        # Generate symlink for Amazon Redshift Spectrum to read data
        deltaTable = DeltaTable.forPath(spark, deltaHivePath + table + "/")
        deltaTable.generate("symlink_format_manifest")

    else:
        print("Moving to upsert logic, Reading data from partition - ", cdc_partition)
        # The following logic verifies that the CDC path has data before proceeding with the
        # incremental load. If the CDC path is not available for a specific table, the load
        # process is skipped to avoid a Spark read error.
        resp = s3.list_objects_v2(
            Bucket=raw_bucket,
            Prefix=schema + "/" + table + "/" + cdc_partition,
            Delimiter='/',
            MaxKeys=1)
        if 'CommonPrefixes' in resp:
            update_df = spark.read.format("parquet").load(
                "s3://" + raw_bucket + "/" + schema + "/" + table + "/" + cdc_partition + "/")

            # Get the most recent record for each primary key to apply only the latest transaction to the Delta table.
            # This step is needed to de-dup transactions like inserts and deletes within the same batch.
            sort_order = Window.partitionBy(
                col('trip_id')).orderBy(
                col('update_ts_dms').desc())
            update_df = update_df.withColumn("rec_val", row_number().over(
                sort_order)).filter("rec_val=1").drop("rec_val")

            # Upsert using the merge operation. The following code updates/inserts data
            # on all columns. If you need to insert/update specific columns, use the
            # whenNotMatchedInsert/whenMatchedUpdate functions and parameterize the input for each table.
            deltaTable = DeltaTable.forPath(spark, deltaHivePath + table + "/")
            deltaTable.alias('trg') \
                .merge(update_df.alias('src'), 'trg.trip_id = src.trip_id')\
                .whenNotMatchedInsertAll(condition="src.Op = 'I'") \
                .whenMatchedUpdateAll(condition="src.Op='U'") \
                .whenMatchedDelete(condition="src.Op = 'D'") \
                .execute()

            # Generate symlink for Amazon Redshift Spectrum to read data
            deltaTable.generate("symlink_format_manifest")
        else:
            print("The path is empty for table -", table)
    i = i + 1
print("The Job has completed execution...")

Monitor EMR Serverless application using CloudWatch dashboards

We can optionally monitor the EMR Serverless application using CloudWatch dashboards by installing the CloudFormation template from the EMR Serverless CloudWatch Dashboard GitHub repository. Follow the instructions in the Getting started section of the GitHub repository and deploy the CloudFormation template in your account.

You need to provide the EMR Serverless application ID as a parameter while deploying the CloudFormation stack, which can be obtained on the EMR Studio Applications page as shown in the following screenshot.

After the CloudFormation template is successfully deployed, navigate to the CloudWatch console to see a custom dashboard created for the EMR Serverless application ID that was provided to the CloudFormation template.

Choose the dashboard to see the different metrics for the EMR Serverless application in a single dashboard view.

You can see the available workers (one driver and two executors that were pre-initialized in the default configuration) and also the spike under the successful job count, which indicates the initial data load job completed successfully.

You could also monitor the CPU, memory, and storage allocated for the application, driver, and executor nodes separately.

The following image shows application metrics for three workers with 12 vCPUs (the driver and each executor initialized with 4 vCPUs) and also the memory and storage usage. You can monitor the metrics from this dashboard and pre-initialize application capacity that suits your specific workloads.

We can see the number of executors that were utilized for this job execution from the executor metrics section within the CloudWatch dashboard. We have used two executors and a driver for running this job.
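If you want to pull these metrics outside of the dashboard, you can first discover what EMR Serverless publishes with boto3; the CloudWatch namespace below is an assumption to verify against what list_metrics returns in your account:

import boto3

cloudwatch = boto3.client("cloudwatch")

# Discover the metrics and dimensions published by EMR Serverless
# (namespace assumed to be AWS/EMRServerless)
metrics = cloudwatch.list_metrics(Namespace="AWS/EMRServerless")
for metric in metrics["Metrics"]:
    print(metric["MetricName"], metric["Dimensions"])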

Query the Delta tables through Athena

Previously, Delta tables were accessed through Athena by generating manifest files (which maintain the list of data files to read for querying a Delta table). With the newly launched support in Athena for reading native Delta tables, it’s no longer required to generate and update manifest files. Athena SQL engine version 3 can directly query native Delta tables. If you’re using an older engine version, change the workgroup to engine version 3.

Navigate to the Athena console and start querying the data. Run a SELECT query and fetch the first 10 records to verify the data:

SELECT * FROM "AwsDataCatalog"."emrserverless_delta"."spark_travel_details" limit 10;

The table (a native Delta table) has been created and registered in the AWS Glue Data Catalog from the EMR Serverless application code. You can successfully query and explore the data through Athena or Spark applications, but the schema definitions for individual columns aren’t updated in the Data Catalog with this approach.

The following screenshot shows that the Delta table created through code is registered in the Data Catalog with a single array column. Because Athena supports reading native Delta tables, we can read the data successfully even though the Data Catalog shows only a single array column.

If you need the individual column-level metadata to be available in the Data Catalog, run an AWS Glue crawler periodically to keep the AWS Glue metadata updated. For more information, refer to Introducing native Delta Lake table support with AWS Glue crawlers.
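If you take this approach, the crawler run can also be automated. The following sketch uses boto3 with a hypothetical crawler name that is assumed to already be configured with the Delta table's S3 location as its target:

import boto3

glue = boto3.client("glue")

# Schedule the existing crawler hourly to keep column-level metadata current, then run it on demand
glue.update_crawler(Name="emrserverless-delta-crawler", Schedule="cron(0 * * * ? *)")
glue.start_crawler(Name="emrserverless-delta-crawler")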

Run the data pipeline to load incremental data changes into the Delta tables

In this section, we walk through the steps to run the data pipeline.

Generate an incremental (CDC) dataset and insert it into the Aurora PostgreSQL database

  1. Log in to the EC2 instance via SSH and using the PSQL CLI, run the following SQL commands to generate CDC data on the source database:

update delta_emr_source.travel_details set destination='Tucson' where destination='Miami';
call delta_emr_source.insert_records(200);
delete from delta_emr_source.travel_details where destination='Los Angeles';

  2. Navigate to the AWS DMS console and verify that the incremental records are populated to the S3 raw bucket by the replication task.

You can also verify in the S3 raw bucket location that the files are created under hourly partitioned folders.

Run the EMR Serverless Spark application to merge CDC data in the S3 curated layer (incremental load)

After the AWS DMS task has successfully loaded the incremental data, submit the Spark job on the EMR Serverless application to load the incremental data (CDC) with the following script arguments:

["U", "delta_emr_source", "2022-10-25-21", "travel_details","route_id"]

The partition path given here as 2022-10-25-21 should be changed as applicable in your use case. We use an example use case where the EMR Serverless job runs every hour, and the input data folder is partitioned on an hourly basis from AWS DMS. You can choose an appropriate partitioning strategy on the S3 raw bucket for your use case.
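If your raw layer follows the same hourly yyyy-MM-dd-HH folder layout as in this example, the partition argument for an hourly scheduled job could be derived at submission time; the following small sketch assumes that folder naming convention and processes the previous hour:

from datetime import datetime, timedelta

# Partition folder for the previous hour, matching the yyyy-MM-dd-HH layout used in this example
cdc_partition = (datetime.utcnow() - timedelta(hours=1)).strftime("%Y-%m-%d-%H")
print(cdc_partition)  # for example, 2022-10-25-21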

  3. Under Spark properties, choose Edit in text and enter the following configurations:
--conf spark.jars=/usr/share/aws/delta/lib/delta-core.jar,/usr/share/aws/delta/lib/delta-storage.jar --conf spark.sql.extensions=io.delta.sql.DeltaSparkSessionExtension --conf spark.sql.catalog.spark_catalog=org.apache.spark.sql.delta.catalog.DeltaCatalog --conf spark.submit.pyFiles=/usr/share/aws/delta/lib/delta-core.jar --conf spark.hadoop.hive.metastore.client.factory.class=com.amazonaws.glue.catalog.metastore.AWSGlueDataCatalogHiveClientFactory

  4. When the job is successful, verify in Amazon S3 that more files are created in the _delta_log folder, capturing the changes from the current run.

Query the Delta tables through Athena to validate the merged data

Go to the Athena console to query the data and validate count to ensure that the table contains the most recent data:

SELECT destination, count(*) FROM "AwsDataCatalog"."emrserverless_delta"."spark_travel_details" group by destination;

If you also want to query this data from Amazon Redshift, you can create external tables in Redshift Spectrum for Delta tables. For more information, refer to Creating external tables for data managed in Delta Lake. Redshift Spectrum currently supports querying Delta tables through the manifest file option. A Delta table manifest contains a list of files that make up a consistent snapshot of the Delta table. The code snippet given in this post updates the manifest files every time new data is loaded in the Delta tables to ensure only the latest data is read from the Delta tables.

Clean up

To avoid incurring ongoing charges, clean up your infrastructure by deleting the stack from the AWS CloudFormation console. Delete the EMR Serverless application and any other resources you created during this exercise.

Conclusion

In this post, we demonstrated how to create a transactional data lake with the Delta table format using EMR Serverless and AWS DMS. With the flexibility provided by EMR Serverless, you can use the latest version of the open-source Delta Lake framework on EMR Serverless (with the latest version of Spark) to support a wide range of transactional data lake needs across various use cases.

Now you can build a transactional data lake for your organization with Delta table format and access data using Athena and Redshift Spectrum for various analytical workloads. You could use this high-level architecture for any other use cases where you need to use the latest version of Spark on EMR Serverless.


About the Authors

Sankar Sundaram is a Data Lab Architect at AWS, where he helps customers build and modernize data architectures and build secure, scalable, and performant data lake, database, and data warehouse solutions.

Monjumi Sarma is a Data Lab Solutions Architect at AWS. She helps customers architect data analytics solutions, giving them an accelerated path toward their modernization initiatives.