Creating an AWS Batch environment for mixed CPU and GPU genomics workflows

Post Syndicated from Chris Munns original https://aws.amazon.com/blogs/compute/creating-an-aws-batch-environment-for-mixed-cpu-and-gpu-genomics-workflows/

This post is courtesy of Lee Pang – AWS Technical Business Development 

I recently worked with a customer who needed to process a bunch of raw sequence files (FastQs) into Hi-C format (*.hic), which is used for the structural analysis of DNA/chromatin loops and sequence accessibility. The tooling they were interested in using was the Juicer suite and they needed a minimal workflow:

  • Align the sample to the reference using the juicer CLI utility.
  • Annotate loops using the HiCCUPS algorithm from the juicer-tools library.

Because they had many files to process, they wanted to do this as scalably as possible. The juicer step of the workflow was CPU and memory-intensive, while the HiCCUPS step needed GPU acceleration. So, they were interested in using AWS Step Functions and AWS Batch.

Since its launch, AWS Batch has enabled the ability to create scalable compute environments for processing a mixture of CPU and memory-intensive jobs. This covers the needs of the majority of genomics workflows. So, how do you create a genomics workflow environment using AWS Batch that also includes using GPUs? Thankfully, AWS Batch recently announced support for GPU resources!

In this post, I show you how to use these new features to execute mixed CPU and GPU genomics workflows. By the end of this post, you will be able to build the architecture shown below.

Configuring AWS Batch for running CPU and GPU jobs

To handle a mixture of CPU and GPU jobs, the recommended strategy is to create multiple compute environments and job queues:

  • GPU-only resources
    • Compute environments (using the ECS GPU Optimized AMI)
    • Spot Instances of the P2 and P3 instance family
    • On-Demand Instances of the P2 and P3 instance family
    • Job queue for GPU compute environments
  • CPU-only resources
    • Compute environments (using the default ECS Optimized AMI)
    • Spot Instances of the “optimal” instance family
    • On-Demand Instances of the “optimal” instance family
    • Job queue for CPU compute environments

With the above in place, you then point CPU jobs to the “CPU” queue and GPU jobs to the “GPU” queue.

Notice that CPU and GPU resources are kept separate with queues and compute environments. I don’t recommend creating a compute environment or job queue that mixes CPU and GPU optimized instance types. In a mixed compute environment or queue, there is a chance that CPU jobs could be placed on GPU instances when no GPU jobs are scheduled. This could result in few (or no) GPU instances available when GPU jobs must be run.

Create a GPU compute environment

Amazon EC2 has a wide variety of instance types. This includes the P3 family of instances that enable GPU-accelerated computing. Previously, the best option for using GPUs in AWS Batch was to create a compute environment based on the publicly available Deep Learning AMI used by AI/ML services. For example, Amazon SageMaker has support for running containers and NVIDIA / CUDA drivers pre-installed.

Earlier this year, the Amazon ECS team announced the availability of the ECS GPU-optimized AMI. It’s essentially the same as the existing Amazon ECS-optimized Amazon Linux 2 AMI but with pre-installed capabilities to provide Docker containers with access to GPU acceleration. For more information about what’s included, see Amazon ECS-optimized AMIs.

The key points are:

This is a much more lightweight solution, and makes creating GPU-specific AWS Batch compute environments much easier.

Creating an AWS Batch compute environment specifically for GPU jobs is similar to creating one for CPU jobs. The key difference is that you select only GPU instance families for the instance types.

For compute environments that use the P2, P3, and P3dn instance families, AWS Batch automatically associates the ECS GPU-optimized AMI. You don’t have to create a custom AMI for GPU jobs to run on these instances.

The G2 and G3 families use a different type of GPU and have different drivers. Compute environments that use G2 and G3 families need a custom AMI to take advantage of acceleration. Otherwise, they default to the ECS-optimized AMI.

To create a GPU-enabled compute environment with the AWS CLI, create a file called gpu-ce.json with the following contents:

{
    "computeEnvironmentName": "gpu",
    "type": "MANAGED",
    "state": "ENABLED",
    "serviceRole": "arn:aws:iam::<account-id>:role/service-role/AWSBatchServiceRole",
    "computeResources": {
        "type": "EC2",
        "subnets": [
            "<subnet-id-1>",
            "<subnet-id-2>",
            "<subnet-id-3>"
        ],
        "tags": {
            "Name": "batch-gpu-worker"
        },
        "desiredvCpus": 0,
        "minvCpus": 0,
        "instanceTypes": [
            "p3",
            "p2"
        ],
        "instanceRole": "arn:aws:iam::<account-id>:instance-profile/ecsInstanceRole",
        "maxvCpus": 256,
        "securityGroupIds": [
            "<security-group-1>",
            "<security-group-2>"
        ],
        "ec2KeyPair": "<keypair-name>"
    }
}

From the command line, run the following:

aws batch create-compute-environment --cli-input-json file://gpu-ce.json

Create a GPU job queue

When you have a GPU compute environment, you can associate it with a dedicated GPU job queue.

From the command line, run the following:

aws batch create-job-queue \
    --job-queue-name gpu \
    --state ENABLED \
    --priority 100 \
    --compute-environment-order order=1,computeEnvironment=gpu

Specifying GPU resources in AWS Batch jobs

Creating job definitions in AWS Batch has not changed much. However, now there’s an additional field under Resource requirements that lets you specify how many GPUs the job should use.

The JSON for registering a job definition with GPU requirements looks like the following:

{
    "jobDefinitionName": "hiccups", 
    "type": "container", 
    "parameters": {
        "OutputS3Prefix": "s3://<bucket-name>/juicer/HIC003", 
        "InputHICS3Path": "s3://<bucket-name>/juicer/HIC003/aligned/inter.hic"
    }, 
    "containerProperties": {
        "mountPoints": [], 
        "image": "<docker-image-repository>/juicer-tools:latest", 
        "environment": [], 
        "vcpus": 8, 
        "command": [
            "hiccups", 
            "Ref::InputHICS3Path", 
            "Ref::OutputS3Prefix"
        ], 

        /* BEGIN NEW STUFF (delete comment before use) */
        "resourceRequirements" : [
            {
                "type" : "GPU",
                "value" : "1"
            }
        ],
        /* END NEW STUFF (delete comment before use) */
        
        "volumes": [], 
        "memory": 60000, 
        "ulimits": []
    }
}

Containerization considerations

To ensure that your containerized task can use GPU acceleration, use nvidia/cuda base images when building the container image for your job. For example, for a CentOS-based image with CUDA 9.2, your Dockerfile should have the following:

FROM nvidia/cuda:9.2-devel-centos7

This could be at the top if you are building the container entirely from scratch, or later if you are using a multi-stage build.

In this case, I already had a CentOS-based image for the juicer utility that I could recycle for juicer-tools. So my Dockerfile looked like the following:

FROM juicer AS base
FROM nvidia/cuda:9.2-devel-centos7

COPY --from=base /opt/juicer /opt/juicer

RUN yum install -y awscli
RUN yum install -y java-1.8.0-openjdk

WORKDIR /opt/juicer/scripts
COPY juicer-tools.aws.sh .

WORKDIR /opt/juicer/work
ENTRYPOINT ["/opt/juicer/scripts/juicer-tools.aws.sh"]

Running a mixed CPU / GPU workflow

To run a workflow that contains a mixture of CPU– and GPU-based jobs, you can use AWS Step Functions. This blog channel previously covered how Step Functions and AWS Batch can be combined to run genomics workflows on AWS. Also, Step Functions and AWS Batch are now more tightly integrated, making scalable workflow solutions much easier to build.

With all the AWS Batch resources in place, it is a matter of pointing each task at the job queue with the right compute resources.

The solution I put together for the customer with the Juicer suite resulted in the following state machine:

{
    "Comment": "State machine for FASTQ to HIC with annotation using juicer and hiccups",
    "StartAt": "JuicerTask",
    "States": {
        "JuicerTask": {
            "Type": "Task",
            "InputPath": "$",
            "ResultPath": "$.juicer.status",
            "Resource": "arn:aws:states:::batch:submitJob.sync",
            "Parameters": {
                "JobDefinition": "arn:aws:batch:<region>:<account_number>:job-definition/juicer:2",
                "JobName": "juicer",
                "JobQueue": "arn:aws:batch:<region>:<account_number>:job-queue/cpu",
                "Parameters.$": "$.juicer.parameters"
            },
            "Next": "HiccupsTask"
        },
        "HiccupsTask": {
            "Type": "Task",
            "InputPath": "$",
            "ResultPath": "$.hiccups.status",
            "Resource": "arn:aws:states:::batch:submitJob.sync",
            "Parameters": {
                "JobDefinition": "arn:aws:batch:<region>:<account_number>:job-definition/juicer-tools:2",
                "JobName": "hiccups",
                "JobQueue": "arn:aws:batch:<region>:<account_number>:job-queue/gpu",
                "Parameters.$": "$.hiccups.parameters"
            },
            "End": true
        }
    }
}

JuicerTask is submitted to the cpu job queue, while HiccupsTask is submitted to the gpu job queue.

Conclusion

With the ECS GPU-optimized AMI and AWS Batch support for GPU resources, you can easily build a scalable solution for running genomics workflows with both CPU and GPU resources.

Build on!