All posts by Hernan Garcia

Introducing Amazon MWAA larger environment sizes

2024-04-16 Hernan Garcia

Post Syndicated from Hernan Garcia original https://aws.amazon.com/blogs/big-data/introducing-amazon-mwaa-larger-environment-sizes/

Amazon Managed Workflows for Apache Airflow (Amazon MWAA) is a managed service for Apache Airflow that streamlines the setup and operation of the infrastructure to orchestrate data pipelines in the cloud. Customers use Amazon MWAA to manage the scalability, availability, and security of their Apache Airflow environments. As they design more intensive, complex, and ever-growing data processing pipelines, customers have asked us for additional underlying resources to provide greater concurrency and capacity for their tasks and workflows.

To address this, today, we are announcing the availability of larger environment classes in Amazon MWAA. In this post, we dive into the capabilities of these new XL and 2XL environments, the scenarios they are well suited for, and how you can set up or upgrade your existing Amazon MWAA environment to take advantage of the increased resources.

Current challenges

When you create an Amazon MWAA environment, a set of managed Amazon Elastic Container Service (Amazon ECS) with AWS Fargate containers are provisioned with defined virtual CPUs and RAM.

As you work with larger, complex, resource-intensive workloads, or run thousands of Directed Acyclic Graphs (DAGs) per day, you may start exhausting CPU availability on schedulers and workers, or reaching memory limits in workers. Running Apache Airflow at scale puts proportionally greater load on the Airflow metadata database, sometimes leading to CPU and memory issues on the underlying Amazon Relational Database Service (Amazon RDS) cluster. A resource-starved metadata database may lead to dropped connections from your workers, failing tasks prematurely.

To improve performance and resiliency of your tasks, consider following Apache Airflow best practices to author DAGs. As an alternative, you can create multiple Amazon MWAA environments to distribute workloads. However, this requires additional engineering and management effort.

New environment classes

With today’s release, you can now create XL and 2XL environments in Amazon MWAA in addition to the existing environment classes. They have two and four times the compute, and three and six times the memory, respectively, of the current large Amazon MWAA environment instance class. These instances add compute and RAM linearly to directly improve capacity and performance of all Apache Airflow components. The following table summarizes the environment capabilities.

Scheduler and Worker CPU / RAM

Web Server

CPU / RAM

Concurrent Tasks

DAG Capacity

mw1.xlarge

8 vCPUs / 24 GB

4 vCPUs / 12 GB

40 tasks (default)

Up to 2000

mw1.2xlarge

16 vCPUs / 48 GB

8 vCPUs / 24 GB

80 tasks (default)

Up to 4000

With the introduction of these larger environments, your Amazon Aurora metadata database will now use larger, memory-optimized instances powered by AWS Graviton2. With the Graviton2 family of processors, you get compute, storage, and networking improvements, and the reduction of your carbon footprint offered by the AWS family of processors.

Pricing

Amazon MWAA pricing dimensions remains unchanged, and you only pay for what you use:

The environment class
Additional worker instances
Additional scheduler instances
Metadata database storage consumed

You now get two additional options in the first three dimensions: XL and 2XL for environment class, additional workers, and schedulers instances. Metadata database storage pricing remains the same. Refer to Amazon Managed Workflows for Apache Airflow Pricing for rates and more details.

Observe Amazon MWAA performance to plan scaling to larger environments

Before you start using the new environment classes, it’s important to understand if you are in a scenario that relates to capacity issues, such as metadata database out of memory, or workers or schedulers running at high CPU usage. Understanding the performance of your environment resources is key to troubleshooting issues related to capacity. We recommend following the guidance described in Introducing container, database, and queue utilization metrics for the Amazon MWAA environment to better understand the state of Amazon MWAA environments, and get insights to right-size your instances.

In the following test, we simulate a high load scenario, use the CloudWatch observability metrics to identify common problems, and make an informed decision to plan scaling to larger environments to mitigate the issues.

During our tests, we ran a complex DAG that dynamically creates over 500 tasks and uses external sensors to wait for a task completion in a different DAG. After running on an Amazon MWAA large environment class with auto scaling set up to a maximum of 10 worker nodes, we noticed the following metrics and values in the CloudWatch dashboard.

The worker nodes have reached maximum CPU capacity, causing the number of queued tasks to keep increasing. The metadata database CPU utilization has peaked at over 65% capacity, and the available database free memory has been reduced. In this situation, we could further increase the worker nodes to scale, but that would put additional load on the metadata database CPU. This might lead to a drop in the number of worker database connections and available free database memory.

With new environment classes, you can vertically scale to increase available resources by editing the environment and selecting a higher class of environment, as shown in the following screenshot.

From the list of environments, we select the one in use for this test. Choose Edit to navigate to the Configure advanced settings page, and select the appropriate xlarge or 2xlarge environment as required.

After you save the change, the environment upgrade will take 20–30 minutes to complete. Any running DAG that got interrupted during the upgrade is scheduled for a retry, depending on the way you configured the retries for your DAGs. You can now choose to invoke them manually or wait for the next scheduled run.

After we upgraded the environment class, we tested the same DAG and observed the metrics were showing improved values because more resources are now available. With this XL environment, you can run more tasks on fewer worker nodes, and therefore the number of queued tasks kept decreasing. Alternately, if you have tasks that require more memory and/or CPU, you can reduce the tasks per worker, but still achieve a high number of tasks per worker with a larger environment size. For example, if you have a large environment where the worker node CPU is maxed out with celery.worker_autoscale (the Airflow configuration that defines the number of tasks per worker) Set at 20,20, you can increase to an XL environment and set celery.worker_autoscale to 20,20 on the XL, rather than the default 40 tasks per worker on an XL environment and the CPU load should reduce significantly.

Set up a new XL environment in Amazon MWAA

You can get started with Amazon MWAA in your account and preferred AWS Region using the AWS Management Console, API, or AWS Command Line Interface (AWS CLI). If you’re adopting infrastructure as code (IaC), you can automate the setup using AWS CloudFormation, the AWS Cloud Development Kit (AWS CDK), or Terraform scripts.

Amazon MWAA XL and 2XL environment classes are available today in all Regions where Amazon MWAA is currently available.

Conclusion

Today, we are announcing the availability of two new environment classes in Amazon MWAA. With XL and 2XL environment classes, you can orchestrate larger volumes of complex or resource-intensive workflows. If you are running DAGs with a high number of dependencies, running thousands of DAGs across multiple environments, or in a scenario that requires you to heavily use workers for compute, you can now overcome the related capacity issues by increasing your environment resources in a few straightforward steps.

In this post, we discussed the capabilities of the two new environment classes, including pricing and some common resource constraint problems they solve. We provided guidance and an example of how to observe your existing environments to plan scaling to XL or 2XL, and we described how you can upgrade existing environments to use the increased resources.

For additional details and code examples on Amazon MWAA, visit the Amazon MWAA User Guide and the Amazon MWAA examples GitHub repo.

Apache, Apache Airflow, and Airflow are either registered trademarks or trademarks of the Apache Software Foundation in the United States and/or other countries.

About the Authors

Hernan Garcia is a Senior Solutions Architect at AWS based in the Netherlands. He works in the financial services industry, supporting enterprises in their cloud adoption. He is passionate about serverless technologies, security, and compliance. He enjoys spending time with family and friends, and trying out new dishes from different cuisines.

Jeetendra Vaidya is a Senior Solutions Architect at AWS, bringing his expertise to the realms of AI/ML, serverless, and data analytics domains. He is passionate about assisting customers in architecting secure, scalable, reliable, and cost-effective solutions.

Sriharsh Adari is a Senior Solutions Architect at AWS, where he helps customers work backward from business outcomes to develop innovative solutions on AWS. Over the years, he has helped multiple customers on data platform transformations across industry verticals. His core area of expertise includes technology strategy, data analytics, and data science. In his spare time, he enjoys playing sports, watching TV shows, and playing Tabla.

Introducing Apache Airflow version 2.6.3 support on Amazon MWAA

2023-08-10 Hernan Garcia

Post Syndicated from Hernan Garcia original https://aws.amazon.com/blogs/big-data/introducing-apache-airflow-version-2-6-3-support-on-amazon-mwaa/

Amazon Managed Workflows for Apache Airflow (Amazon MWAA) is a managed orchestration service for Apache Airflow that makes it simple to set up and operate end-to-end data pipelines in the cloud. Trusted across various industries, Amazon MWAA helps organizations like Siemens, ENGIE, and Choice Hotels International enhance and scale their business workflows, while significantly improving security and reducing infrastructure management overhead.

Today, we are announcing the availability of Apache Airflow version 2.6.3 environments. If you’re currently running Apache Airflow version 2.x, you can seamlessly upgrade to v2.6.3 using in-place version upgrades, thereby retaining your workflow run history and environment configurations.

In this post, we delve into some of the new features and capabilities of Apache Airflow v2.6.3 and how you can set up or upgrade your Amazon MWAA environment to accommodate this version as you orchestrate your workflows in the cloud at scale.

New feature: Notifiers

Airflow now gives you an efficient way to create reusable and standardized notifications to handle systemic errors and failures. Notifiers introduce a new object in Airflow, designed to be an extensible layer for adding notifications to DAGs. This framework can send messages to external systems when a task instance or an individual DAG run changes its state. You can build notification logic from a new base object and call it directly from your DAG files. The BaseNotifier is an abstract class that provides a basic structure for sending notifications in Airflow using the various on_*__callback. It is intended for providers to extend and customize this for their specific needs.

Using this framework, you can build custom notification logic directly within your DAG files. For instance, notifications can be sent through email, Slack, or Amazon Simple Notification Service (Amazon SNS) based on the state of a DAG (on_failure, on_success, and so on). You can also create your own custom notifier that updates an API or posts a file to your storage system of choice.

For details on how to create and use a notifier, refer to Creating a notifier.

New feature: Managing tasks stuck in a queued state

Apache Airflow v2.6.3 brings a significant improvement to address the long-standing issue of tasks getting stuck in the queued state when using the CeleryExecutor. In a typical Apache Airflow workflow, tasks progress through a lifecycle, moving from the scheduled state to the queued state, and eventually to the running state. However, tasks can occasionally remain in the queued state longer than expected due to communication issues among the scheduler, the executor, and the worker. In Amazon MWAA, customers have experienced such tasks being queued for up to 12 hours due to how it utilizes the native integration of Amazon Simple Queue Service (Amazon SQS) with CeleryExecutor.

To mitigate this issue, Apache Airflow v2.6.3 introduced a mechanism that checks the Airflow database for tasks that have remained in the queued state beyond a specified timeout, defaulting to 600 seconds. This default can be modified using the environment configuration parameter scheduler.task_queued_timeout. The system then retries such tasks if retries are still available or fails them otherwise, ensuring that your data pipelines continue to function smoothly.

Notably, this update deprecates the previously used celery.stalled_task_timeout and celery.task_adoption_timeout settings, and consolidates their functionalities into a single configuration, scheduler.task_queued_timeout. This enables more effective management of tasks that remain in the queued state. Operators can also configure scheduler.task_queued_timeout_check_interval, which controls how frequently the system checks for tasks that have stayed in the queued state beyond the defined timeout.

For details on how to use task_queued_timeout, refer to the official Airflow documentation.

New feature: A new continuous timetable and support for continuous schedule

With prior versions of Airflow, to run a DAG continuously in a loop, you had to use the TriggerDagRunOperator to rerun the DAG after the last task is finished. With Apache Airflow v2.6.3, you can now run a DAG continuously with a predefined timetable. The simplifies scheduling for continual DAG runs. The new ContinuousTimetable construct will create one continuous DAG run, respecting start_date and end_date, with the new run starting as soon as the previous run has completed, regardless of whether the previous run has succeeded or failed. Using a continuous timetable is especially useful when sensors are used to wait for highly irregular events from external data tools.

You can bound the degree of parallelism to ensure that only one DAG is running at any given time with the max_active_runs parameter:

@dag(
    start_date=datetime(2023, 5, 9),
    schedule="@continuous",
    max_active_runs=1,  
    catchup=False,
)

New feature: Trigger the DAG UI extension with flexible user form concept

Prior to Apache Airflow v2.6.3, you could provide parameters in JSON structure through the Airflow UI for custom workflow runs. You had to model, check, and understand the JSON and enter parameters manually without the option to validate them before triggering a workflow. With Apache Airflow v2.6.3, when you choose Trigger DAG w/ config, a trigger UI form is rendered based on the predefined DAG Params. For your ad hoc, testing, or custom runs, this simplifies the DAG’s parameter entry. If the DAG has no parameters defined, a JSON entry mask is shown. The form elements can be defined with the Param class and attributes define how a form field is displayed.

For an example DAG the following form is generated by DAG Params.

Set Up a New Apache Airflow v2.6.3 Environment

You can set up a new Apache Airflow v2.6.3 environment in your account and preferred Region using the AWS Management Console, API, or AWS Command Line Interface (AWS CLI). If you’re adopting infrastructure as code (IaC), you can automate the setup using either AWS CloudFormation, the AWS Cloud Development Kit (AWS CDK), or Terraform scripts.

When you have successfully created an Apache Airflow v2.6.3 environment in Amazon MWAA, the following packages are automatically installed on the scheduler and worker nodes along with other provider packages:

apache-airflow-providers-amazon==8.2.0

python==3.10.8

For a complete list of provider packages installed, refer to Apache Airflow provider packages installed on Amazon MWAA environments.

Upgrade from older versions of Apache Airflow to Apache Airflow v2.6.3

You can perform in-place version upgrades of your existing Amazon MWAA environments to update your older Apache Airflow v2.x-based environments to v2.6.3. To learn more about in-place version upgrades, refer to Upgrading the Apache Airflow version or Introducing in-place version upgrades with Amazon MWAA.

Conclusion

In this post, we talked about some of the new features of Apache Airflow v2.6.3 and how you can get started using them in Amazon MWAA. Try out these new features like notifiers and continuous timetables, and other enhancements to improve your data orchestration pipelines.

For additional details and code examples on Amazon MWAA, visit the Amazon MWAA User Guide and the Amazon MWAA examples GitHub repo.

Apache, Apache Airflow, and Airflow are either registered trademarks or trademarks of the Apache Software Foundation in the United States and/or other countries.

About the Authors

Hernan Garcia is a Senior Solutions Architect at AWS, based out of Amsterdam, working in the Financial Services Industry since 2018. He specializes in application modernization and supports his customers in the adoption of cloud operating models and serverless technologies.

Parnab Basak is a Solutions Architect and a Serverless Specialist at AWS. He specializes in creating new solutions that are cloud native using modern software development practices like serverless, DevOps, and analytics. Parnab works closely in the analytics and integration services space helping customers adopt AWS services for their workflow orchestration needs.

Shubham Mehta is an experienced product manager with over eight years of experience and a proven track record of delivering successful products. In his current role as a Senior Product Manager at AWS, he oversees Amazon Managed Workflows for Apache Airflow (Amazon MWAA) and spearheads the Apache Airflow open-source contributions to further enhance the product’s functionality.

Noise

All posts by Hernan Garcia

Introducing Amazon MWAA larger environment sizes

Current challenges

New environment classes

Pricing

Observe Amazon MWAA performance to plan scaling to larger environments

Set up a new XL environment in Amazon MWAA

Conclusion

About the Authors

Introducing Apache Airflow version 2.6.3 support on Amazon MWAA

New feature: Notifiers

New feature: Managing tasks stuck in a queued state

New feature: A new continuous timetable and support for continuous schedule

New feature: Trigger the DAG UI extension with flexible user form concept

Set Up a New Apache Airflow v2.6.3 Environment

Upgrade from older versions of Apache Airflow to Apache Airflow v2.6.3

Conclusion

About the Authors

The collective thoughts of the interwebz