Monitoring delay of AWS Batch jobs in transit before execution

Post Syndicated from Emma White original https://aws.amazon.com/blogs/compute/monitoring-delay-of-aws-batch-jobs-in-transit-before-execution/

This post is written by Nikhil Anand, Solutions Architect 

AWS Batch enables developers, scientists, and engineers to easily and efficiently run hundreds of thousands of batch processing jobs on AWS. With AWS Batch you no longer have to install and manage batch computing software or server clusters used to run your jobs. This lets you focus on analyzing results and solving problems, not managing infrastructure. During its lifetime, an AWS Batch job moves through several states. When you create a compute environment and submit jobs to it, a misconfigured setting can leave a job stuck in a transit state, so that it never reaches the desired RUNNING state. This is a common issue for customers.

If your compute environment contains compute resources, but your jobs don’t progress beyond the RUNNABLE state, then something is preventing the jobs from being placed on a compute resource. There are various reasons why a job could remain in the RUNNABLE state. The usual call to action is to refer to the troubleshooting documentation in order to fix the issue. Similarly, if your job depends on another job, then it stays in the PENDING state until that dependency is satisfied.

However, if you have scheduled actions to be completed with Batch jobs, or if you do not have any mechanism monitoring the jobs, then unattended jobs might stay in one of the transit states. You may end up continuing forward, unaware that your job has yet to run. By the time you notice that the jobs have not progressed beyond the RUNNABLE or PENDING state, the task that the job was expected to do has already missed its timeframe. This can result in additional time and effort troubleshooting the stuck job.

To close this gap in in-transit job monitoring, this post provides a monitoring solution for jobs in transit (from the SUBMITTED to the RUNNING state) in AWS Batch.

You can configure a threshold duration for your jobs so that if a job stays in SUBMITTED/PENDING/RUNNABLE longer than that, then you get a notification. For example, you might have a job that you expect to reach the RUNNING state within approximately 15 minutes of submission, but a slight misconfiguration can cause it to get stuck in RUNNABLE indefinitely. In that case, you can set a threshold of 15 minutes. Or, suppose you have a job that depends on another job that is stuck in processing. In these situations, once the specified duration is crossed, you are notified that your job has stayed in transit beyond your defined threshold.
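The threshold comparison can be sketched as follows. This is a minimal illustration, not the code shipped in the solution's Lambda function. It assumes each job is a dictionary shaped like a ListJobs job summary, with a `status` field and a `createdAt` timestamp in milliseconds since the epoch. The names `TRANSIT_STATES` and `is_overdue` are hypothetical.

```python
import time

# States that count as "in transit" in this post
TRANSIT_STATES = ("SUBMITTED", "PENDING", "RUNNABLE")


def is_overdue(job, threshold_seconds, now_ms=None):
    """Return True if the job is in a transit state for longer than the threshold.

    `job` is assumed to carry `status` and `createdAt` (milliseconds since
    the epoch), as in a ListJobs job summary.
    """
    if now_ms is None:
        now_ms = int(time.time() * 1000)
    if job["status"] not in TRANSIT_STATES:
        return False
    age_seconds = (now_ms - job["createdAt"]) / 1000
    return age_seconds > threshold_seconds
```

With a 15-minute threshold, you would pass `threshold_seconds=900`.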

The solution is deployed by using AWS CloudFormation.

Overview of solution

The solution creates an Amazon CloudWatch Events rule that triggers an AWS Lambda function on a schedule. On each run, the Lambda function checks all compute environments for jobs that have been in transit for more than ‘X’ seconds since submission. Specify your own value for ‘X’ when you launch the AWS CloudFormation stack. The solution consists of the following components created via CloudFormation:

  • An Amazon CloudWatch Events rule that invokes the target Lambda function on a schedule to monitor the submitted jobs in Batch
  • An AWS Lambda function with the logic to monitor the submitted jobs and trigger Amazon Simple Notification Service (Amazon SNS) notifications
  • A Lambda execution AWS Identity and Access Management (IAM) role
  • An Amazon SNS topic that end users subscribe to in order to be notified about the submitted jobs

The solution components and workflow.
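The check that the Lambda function performs on each scheduled run might look like the sketch below. It is an assumption about the logic, not the function from the CloudFormation template. Because ListJobs works per job queue, the sketch enumerates job queues rather than compute environments. The function name `find_stuck_jobs` is hypothetical, and `batch` is expected to be a boto3 Batch client (for example, `boto3.client("batch")`). For brevity, only the first page of each response is read.

```python
import time

TRANSIT_STATES = ("SUBMITTED", "PENDING", "RUNNABLE")


def find_stuck_jobs(batch, threshold_seconds, now_ms=None):
    """Return (queue name, job ID, status) for jobs in transit beyond the threshold."""
    if now_ms is None:
        now_ms = int(time.time() * 1000)
    stuck = []
    # ListJobs is scoped to a job queue, so enumerate the queues first
    for queue in batch.describe_job_queues()["jobQueues"]:
        name = queue["jobQueueName"]
        for status in TRANSIT_STATES:
            response = batch.list_jobs(jobQueue=name, jobStatus=status)
            for job in response["jobSummaryList"]:
                # createdAt is in milliseconds since the epoch
                if now_ms - job["createdAt"] > threshold_seconds * 1000:
                    stuck.append((name, job["jobId"], status))
    return stuck
```

Passing the client in as a parameter keeps the function easy to exercise with a stub before deploying it.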

Prerequisites

For this walkthrough, you should have the following prerequisites:

Walkthrough

To provision the necessary solution components, use this CloudFormation template. 

  1. While launching the CloudFormation stack, you will be asked to input the following information in addition to the CloudFormation stack name:
    1. The upper threshold (in seconds) for the jobs to stay in the transit state
    2. The evaluation period, that is, the interval at which the Lambda function runs
    3. The email ID that receives notifications when a job stays in the transit state beyond the defined threshold value

Specify parameter values during CloudFormation stack launch

  2. Once the stack is created, the following resources will be provisioned: SNS topic, CloudWatch Events rule, Lambda function, Lambda invoke permissions, and Lambda execution role. View them in the ‘Resources’ tab of your CloudFormation stack.

Successful creation of the CloudFormation stack

  3. After the stack is created, the email ID that you entered during stack launch will receive an email from Amazon SNS asking you to confirm the Amazon SNS subscription.

Subscription confirmation email that you receive on the specified email ID.

Click Confirm subscription in the email.

Subscription confirmed.

  4. Based on your inputs during stack launch, the Lambda function is periodically invoked to look for Batch jobs that have been stuck in a transit state (SUBMITTED, PENDING, or RUNNABLE) for longer than the defined threshold.
  5. An Amazon SNS notification is sent out at each evaluation period with the job IDs of the jobs that have stayed stuck in transit.
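The notification step could be sketched like this. It is illustrative only; the message wording, the subject line, and the names `build_message` and `notify` are assumptions rather than the solution's actual output. `sns` is expected to be a boto3 SNS client, and `stuck_jobs` a list of `(queue name, job ID, status)` tuples.

```python
def build_message(stuck_jobs, threshold_seconds):
    """Format one line per stuck job for the notification body."""
    lines = [
        f"{len(stuck_jobs)} AWS Batch job(s) in transit for more than "
        f"{threshold_seconds} seconds:"
    ]
    for queue, job_id, status in stuck_jobs:
        lines.append(f"- {job_id} ({status}) in job queue {queue}")
    return "\n".join(lines)


def notify(sns, topic_arn, stuck_jobs, threshold_seconds):
    """Publish a notification, or do nothing when no job is stuck."""
    if not stuck_jobs:
        return None
    return sns.publish(
        TopicArn=topic_arn,
        Subject="AWS Batch jobs stuck in transit",
        Message=build_message(stuck_jobs, threshold_seconds),
    )
```

Skipping the publish call when the list is empty avoids sending an email at every evaluation period.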

Verifying the solution

Launch your monitoring solution by using the CloudFormation template. Once the stack creation is complete, you get an email asking you to subscribe to the SNS topic. Confirm the subscription.

Click to launch Stack. 

Submit a job in AWS Batch by using the console, CLI, or SDK. To test the solution, submit a job, Job1, to a job queue associated with a compute environment with no public subnets. Compute resources need network access to communicate with the Amazon ECS service endpoint. This can be provided through an interface VPC endpoint or by giving your compute resources public IP addresses. Since the compute environment was configured with only a private subnet, Job1 will not proceed beyond the RUNNABLE state. Similarly, submit another job, Job2, and during submission make Job2 dependent on Job1. Therefore, Job2 will not proceed beyond the PENDING state. This creates a test scenario in which two jobs are stuck in transit.
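If you submit the two test jobs with the SDK, the dependency can be expressed through the `dependsOn` parameter of SubmitJob. The sketch below assumes `batch` is a boto3 Batch client; the function name `submit_test_jobs` and the `job_queue` and `job_definition` arguments are placeholders for your own resources.

```python
def submit_test_jobs(batch, job_queue, job_definition):
    """Submit Job1, then Job2 depending on Job1, and return both job IDs."""
    job1 = batch.submit_job(
        jobName="Job1",
        jobQueue=job_queue,
        jobDefinition=job_definition,
    )
    # Job2 stays in PENDING until Job1 completes
    job2 = batch.submit_job(
        jobName="Job2",
        jobQueue=job_queue,
        jobDefinition=job_definition,
        dependsOn=[{"jobId": job1["jobId"]}],
    )
    return job1["jobId"], job2["jobId"]
```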

AWS Batch jobs submitted and in transit.

Based on the CloudFormation template inputs, you will get notified at the subscribed email ID when a job stays in transit for more than ‘X’ seconds (the input provided during stack launch).

Notification received for the jobs that stayed in transit longer than expected.

Modifications

The Lambda function uses the ListJobs API call. ListJobs returns paginated output, with at most 100 results per call. Therefore, if you are submitting many jobs, then you must modify the Lambda function to fetch the remaining results by using the nextToken response element. Pass this nextToken value to the next call, and repeat in a loop until the response no longer contains a nextToken element.
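A pagination loop of this kind could be sketched as follows. The helper name `list_all_jobs` is hypothetical, and `batch` is assumed to be a boto3 Batch client.

```python
def list_all_jobs(batch, job_queue, job_status):
    """Collect job summaries from every page of ListJobs for one queue and status."""
    jobs = []
    kwargs = {"jobQueue": job_queue, "jobStatus": job_status}
    while True:
        response = batch.list_jobs(**kwargs)
        jobs.extend(response["jobSummaryList"])
        token = response.get("nextToken")
        if not token:
            # No nextToken means this was the last page
            break
        kwargs["nextToken"] = token
    return jobs
```

boto3 also provides a paginator for this operation (`batch.get_paginator("list_jobs")`), which handles the token for you.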

Cleaning up

To avoid incurring future charges, delete the resources. Deleting the CloudFormation stack cleans up every resource that it provisioned for the monitoring solution.

Conclusion

This solution lets you detect AWS Batch jobs that remain in the transit state longer than expected. It provides you with an efficient way to monitor your Batch jobs. If the jobs stay in the RUNNABLE/PENDING/SUBMITTED state for a significant amount of time, then this indicates a potential misconfiguration in either the compute environment setup or the job parameters that were passed during job submission. An early notification about the issue can help you troubleshoot the misconfiguration early on and take subsequent actions.

If you have multiple jobs that remain in the RUNNABLE state and you realize that they will not proceed further to the RUNNING state due to a misconfiguration, then you can shut down all RUNNABLE jobs by using a simple bash script.
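The bash script itself is not reproduced here. As a rough Python equivalent, and not the script the post refers to, the sketch below collects the RUNNABLE jobs in one queue and cancels each with CancelJob. The function name `cancel_runnable_jobs` and the default reason text are assumptions; `batch` is assumed to be a boto3 Batch client.

```python
def cancel_runnable_jobs(batch, job_queue, reason="Cancelled: stuck in RUNNABLE"):
    """Cancel every RUNNABLE job in the queue and return the cancelled job IDs."""
    job_ids = []
    kwargs = {"jobQueue": job_queue, "jobStatus": "RUNNABLE"}
    # Collect all IDs first so that cancelling does not disturb pagination
    while True:
        response = batch.list_jobs(**kwargs)
        job_ids.extend(job["jobId"] for job in response["jobSummaryList"])
        token = response.get("nextToken")
        if not token:
            break
        kwargs["nextToken"] = token
    for job_id in job_ids:
        batch.cancel_job(jobId=job_id, reason=reason)
    return job_ids
```

Run it only after confirming that the jobs cannot proceed, since a cancelled job must be resubmitted.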

For additional references regarding troubleshooting RUNNABLE jobs in AWS Batch, refer to the suggested Knowledge Center article and the troubleshooting documentation.