Automating stopping and starting Amazon MWAA environments to reduce cost

Post Syndicated from James Beswick original https://aws.amazon.com/blogs/compute/automating-stopping-and-starting-amazon-mwaa-environments-to-reduce-cost/

This was written by Uma Ramadoss, Specialist Integration Services, and Chandan Rupakheti, Solutions Architect.

This blog post shows how you can save cost by automating the stopping and starting of an Amazon Managed Workflows for Apache Airflow (Amazon MWAA) environment. It describes how you can retain the data stored in a metadata database and presents an automated solution you can use in your AWS account.

Customers run end to end data pipelines at scale with MWAA. It is a common best practice to run non-production environments for development and testing. A nonproduction environment often does not need to run throughout the day due to factors such as working hours of the development team. As there is no automatic way to stop an MWAA environment when not in use and deleting an environment causes metadata loss, customers often run it continually and pay the full cost.

Overview

Amazon MWAA has a distributed architecture with multiple components such as scheduler, worker, webserver, queue, and database. Customers build data pipelines as Directed Acyclic Graphs (DAGs) and run in Amazon MWAA. The DAGs use variables and connections from the Amazon MWAA metadata database. The history of DAG runs and related data are stored in the same metadata database. The database also stores other information such as user roles and permissions.

When you delete the Amazon MWAA environment, all the components including the database are deleted so that you do not incur any cost. As this normal deletion results in loss of metadata, you need a customized solution to back up the data and to automate the deletion and recreation.

The sample application deletes and recreates your MWAA environment at a scheduled interval defined by you using Amazon EventBridge Scheduler. It exports all metadata into an Amazon S3 bucket before deletion and imports the metadata back to the environment after creation. As this is a managed database and you cannot access the database outside the Amazon MWAA environment, it uses DAGs to import and export the data. The entire process is orchestrated using AWS Step Functions.

Deployment architecture

The sample application is in a GitHub repository. Use the instructions in the readme to deploy the application.

Sample architecture

The sample application deploys the following resources –

  1. A Step Functions state machine to orchestrate the steps needed to delete the MWAA environment.
  2. A Step Functions state machine to orchestrate the steps needed to recreate the MWAA environment.
  3. EventBridge Scheduler rules to trigger the state machines at the scheduled time.
  4. An S3 bucket to store metadata database backup and environment details.
  5. Two DAG files uploaded to the source S3 bucket configured with the MWAA environment. The export DAG exports metadata from the MWAA metadata database to backup S3 bucket. The import DAG restores the metadata from the backup S3 bucket to the newly created MWAA environment.
  6. AWS Lambda functions for triggering the DAGs using MWAA CLI API.
  7. A Step Functions state machine to wait for the long-running MWAA creation and deletion process.
  8. Amazon EventBridge rule to notify on state machine failures.
  9. Amazon Simple Notification Service (Amazon SNS) topic as a target to the EventBridge rule for failure notifications.
  10. Amazon Interface VPC Endpoint for Step Functions for MWAA environment deployed in the private mode.

Stop workflow

At a scheduled time, Amazon EventBridge Scheduler triggers a Step Functions state machine to stop the MWAA environment. The state machine performs the following actions:

Stop workflow

  1. Fetch Amazon MWAA environment details such as airflow configurations, IAM execution role, logging configurations and VPC details.
  2. If the environment is not in the “AVAILABLE” status, it fails the workflow by branching to the “Pausing unsuccessful” state.
  3. Otherwise, it runs the normal workflow and stores the environment details in an S3 bucket so that Start workflow can recreate the environment with this data.
  4. Trigger an MWAA DAG using AWS Lambda function to export metadata to the Amazon S3 bucket. This step uses Step Functions to wait for callback token integration.
  5. Resume the workflow when the task token is returned from the MWAA DAG.
  6. Delete Amazon MWAA environment.
  7. Wait to confirm the deletion.

Start workflow

At a scheduled time, EventBridge Scheduler triggers the Step Functions state machine to recreate the MWAA environment. The steps in the state machine perform the following actions:

Start workflow

  1. Retrieve the environment details stored in Amazon S3 bucket by the stop workflow.
  2. Create an MWAA environment with the same configuration as the original.
  3. Trigger an MWAA DAG through the Lambda function to restore the metadata from the S3 bucket to the newly created environment.

Cost savings

Consider a small MWAA environment in us-east-2 with a minimum of one worker, a maximum of one worker, and 1GB data storage. At the time of this writing, the monthly cost of the environment is $357.80. Let’s assume you use this environment between 6 am and 6 pm on weekdays.

The schedule in the env file of the sample application looks like:

MWAA_PAUSE_CRON_SCHEDULE=’0 18 ? * MON-FRI *’
MWAA_RESUME_CRON_SCHEDULE=’30 5 ? * MON-FRI *’

As MWAA environment creation takes anywhere between 20 and 30 minutes, the MWAA_RESUME_CRON_SCHEDULE is set at 5.30 pm.

Assuming 21 weekdays per month, the monthly cost of the environment is $123.48 and is 65.46% less compared to running the environment continuously:

  • 21 weekdays * 12 hours * 0.49 USD per hour = $123.48

Additional considerations

The sample application only restores at-store data. Though the deletion process pauses all the DAGs before making the backup, it cannot stop any running tasks or in-flight messages in the queue. It also does not backup tasks that are not in completed state. This can result in task history loss for the tasks that were running during the backup.

Over time, the metadata grows in size, which can increase latency in query performance. You can use a DAG as shown in the example to clean up the database regularly.

Avoid setting the catchup by default configuration flag in the environment setting to true or in the DAG definition unless it is required. Catch up feature runs all the DAG runs that are missed for any data interval. When the environment is created again, if the flag is true, it catches up with the missed DAG runs and can overload the environment.

Conclusion

Automating the deletion and recreation of Amazon MWAA environments is a powerful solution for cost optimization and efficient management of resources. By following the steps outlined in this blog post, you can ensure that your MWAA environment is deleted and recreated without losing any of the metadata or configurations. This allows you to deploy new code changes and updates more quickly and easily, without having to configure your environment each time manually.

The potential cost savings of running your MWAA environment for only 12 hours on weekdays are significant. The example shows how you can save up to 65% of your monthly costs by choosing this option. This makes it an attractive solution for organizations that are looking to reduce cost while maintaining a high level of performance.

Visit the samples repository to learn more about Amazon MWAA. It contains a wide variety of examples and templates that you can use to build your own applications.

For more serverless learning resources, visit Serverless Land.