Tag Archives: AWS Batch

New – Fully Serverless Batch Computing with AWS Batch Support for AWS Fargate

2020-12-04 Harunobu Kameda

Post Syndicated from Harunobu Kameda original https://aws.amazon.com/blogs/aws/new-fully-serverless-batch-computing-with-aws-batch-support-for-aws-fargate/

We launched AWS Batch on December 2016 as a fully managed batch computing service that enables developers, scientists and engineers to easily and efficiently run hundreds of thousands of batch computing jobs on AWS. With AWS Batch, you no longer need to install and manage batch computing software or server clusters to run your jobs. AWS Batch is designed to remove the heavy lifting of batch workload management by creating compute environments, managing queues, and launching the appropriate compute resources to run your jobs quickly and efficiently.

Today, we are happy to introduce the ability to specify AWS Fargate as a computing resource for AWS Batch jobs. AWS Fargate is a serverless computing engine for containers that eliminates the need to provision and manage your own servers. With this enhancement, customers will now have a way to run their jobs on serverless computing resources: Simply submit your analysis, ML inference, map reduce analysis, and other batch workloads, and let Batch and Fargate handle the rest.

Basic Concept
Customers running batch workloads in the cloud have a variety of orchestration needs: for example, workloads need to be queued, submitted to a compute resource, given priorities, dependencies and retries need to be handled, compute needs to be scalable and available, and users need to account for utilization and resource management. While AWS Batch simplifies all the queuing, scheduling, and lifecycle management for customers, and even provisions and manages compute in the customer account, customers are looking for even more simplicity where they can get up and running in minutes. Time spent on image maintenance, right-sizing of compute, and monitoring is time not spent on applications. These customer needs have led us to develop Fargate integration, which we are pleased to announce today.

How It Works
Simply specify Fargate or Fargate Spot as the resource type in Batch and submit a Fargate job definition, and customers can now take advantage of the benefits of serverless computing without the need for image patching, isolation of VM boundaries, and calculation of the correct size.

To start, access the AWS Management Console of AWS Batch. Select Compute environments and Create.We now have 2 new options for Provisioning model: Fargate and Fargate Spot.

With Fargate or Fargate Spot, you don’t need to worry about Amazon EC2 instances or Amazon Machine Images. Just set Fargate or Fargate Spot, your subnets, and the maximum total vCPU of the jobs running in the compute environment, and you have a ready-to-go Fargate computing environment. With Fargate Spot, you can take advantage of up to 70% discount for your fault-tolerant, time-flexible jobs.

Select Create compute environment. Then, Batch will create your Fargate-based compute environment.

Next step is to create the Job Queue, which is where your jobs live when waiting to be run. Then, Connect that to your Fargate compute environment.

After you finished setting up the job queue, next step is to create Job definitions for your Fargate jobs. Select Job definitions from the left pane, and click the Create button.

Once you’ve selected Fargate for the job definition, you are now ready to submit your job. Batch will handle queueing, submission, and job lifecycle for you! You can access Job definitions by clicking Job definitions in the left pane. After selecting Job Definition, click Submit new job.

You need to select the Job queue previously set up for your Fargate compute environment.

You can now submit your new job by pressing the Submit button at the bottom.

Follow the steps below to set up your Fargate-based compute environment using the AWS CLI.

1. Creating Compute Environment

aws batch create-compute-environment --cli-input-json file://below_sample.json

{
    "computeEnvironmentName": "FargateComputeEnvironment",
    "type": "MANAGED",
    "state": "ENABLED",
    "computeResources": {
        "type": "FARGATE", # or FARGATE_SPOT
        "maxvCpus": 40,
        "subnets": [
             "subnet-xxxxxxxx","subnet-xxxxxxxx","subnet-xxxxxxxx"
        ],
        "securityGroupIds": ["sg-xxxxxxxxxxxxxxxx"],
        "tags": {
            "KeyName": "fargate"
        }
    },
"serviceRole": "arn:aws:iam::xxxxxxxxxxxx:role/service-role/AWSBatchServiceRole"
}

2.Creating Job Queue

aws batch create-job-queue --cli-input-json file://below_job_queue.json

{
  "jobQueueName": "FargateJobQueue",
  "state": "ENABLED",
  "priority": 1,
  "computeEnvironmentOrder": [
    {
      "order": 1,
      "computeEnvironment": "FargateComputeEnvironment"
    }
  ]
}

3.Creating and Registering Job Definitions
aws batch-fargate register-job-definition --cli-input-json file://below_job_definition.json

{
    "jobDefinitionName": "FargateJobDefinition",
    "type": "container",
    "propagateTags": true,
     "containerProperties": {
        "image": "xxxxxxxxxxxx.dkr.ecr.us-east-1.amazonaws.com/test:latest",
        "networkConfiguration": {
            "assignPublicIp": "ENABLED"
        },
        "fargatePlatformConfiguration": {
            "platformVersion": "LATEST"
        },
        "resourceRequirements": [
            {
                "value": "0.25",
                "type": "VCPU"
            },
            {
                "value": "512",
                "type": "MEMORY"
            }
        ],
        "jobRoleArn": "arn:aws:iam::xxxxxxxxxxxx:role/ecsTaskExecutionRole",
        "executionRoleArn":"arn:aws:iam::xxxxxxxxxxxx:role/ecsTaskExecutionRole",
        "logConfiguration": {
            "logDriver": "awslogs",
            "options": {
            "awslogs-group": "/ecs/sleepenv",
            "awslogs-region": "us-east-1",
            "awslogs-stream-prefix": "ecs"
            }
        }
     },
   "platformCapabilities": [
        "FARGATE"
    ],
    "tags": {
    "Service": "Batch",
    "Name": "JobDefinitionTag",
    "Expected": "MergeTag"
    }

You can also use other container image registries like Docker Hub in addition to Amazon Elastic Container Registry.

4.Submitting Job
aws batch submit-job --job-name faragteJob --job-queue FargateJobQueue --job-definition FargateJobDefinition

Generally Available Today
AWS Batch support for AWS Fargate is generally available today for all AWS Regions where AWS Batch and AWS Fargate are available. Please visit the AWS Batch page and technical documentation for more details.

– Kame

Using shared memory for low-latency, intra-node communication in AWS Batch

2020-11-05 James Beswick

Post Syndicated from James Beswick original https://aws.amazon.com/blogs/compute/using-shared-memory-for-low-latency-intra-node-communication-in-aws-batch/

This post is courtesy of Dario La Porta, Senior Consultant, HPC.

AWS Batch enables developers, scientists, and engineers to run hundreds of thousands of HPC jobs in AWS. By managing the provisioning of computing resources, this allows you to focus on your core business. Shared memory support is a new feature that can help improve overall performance.

This post explains the shared memory paradigm and how it can help you improve the performance of your single and multi-node applications. Performance gains can also help you to reduce the total runtime of your jobs and therefore reduce the overall cost.

The second part of the post shows you how to use shared memory in AWS Batch both in the AWS Management Console and the AWS CLI. Finally, I show the performance gains that are made possible with shared memory usage by walking through a benchmarking analysis with OSU Micro-Benchmarks and GROMACS.

Shared memory paradigm

Advanced, compute-intensive workloads require high-performance hardware to use scalability to deliver results. The Amazon EC2 C5n instance type provides cost-efficient, high-performance hardware with a configurable number of cores.

HPC workloads use algorithms that require parallelization and a low latency communication between the different processes. The two main technologies used for the parallel communications are message-passing with distributed memory and shared memory.

Message Passing Interface (MPI) is a message-passing standard used for the communication in a parallel distributed environment. Elastic Fabric Adapter (EFA) enables your MPI applications to use low-latency, inter-node communication.

The shared memory paradigm allows multiple processors in the same system to communicate using a memory (RAM) portion that is shared between the processes. This method takes advantage of the high-speed memory bus.

MPI with intra-node shared memory communication

The two main MPI implementations, OpenMPI and Intel MPI, enable an intra-node shared memory communication in a distributed compute environment. When configured, you take advantage of the EFA libfabic implementation having consistent and reduced latency. This results in higher throughput than the TCP transport for the intra-node communication. From libfabric 1.9 onwards, the shared memory support has been directly added to the EFA provider. You no longer need to perform any modification to the OFI MTL.

MPI jobs in AWS Batch

AWS Batch enables the execution of MPI jobs using a multi-node configuration. First, a job definition is created that enables the execution of the job in multiple nodes. To learn how to create this definition, see Creating a Multi-node Parallel Job Definition.

To take advantage of the EFA capabilities, select a supported instance type and read Leveraging Elastic Fabric Adapter to run HPC and ML Workloads on AWS Batch. This post shows how to create the necessary resources in AWS Batch and run your first job with EFA.

Shared memory in AWS Batch

The new AWS Batch console interface enables you to configure the shared memory of the container inside the Job Definition. To see this, expand the Additional configuration in the Container properties section:

The Linux parameters section contains the Shared memory size parameter in MB.

You can set the same configuration in the AWS CLI by passing JSON parameters to the RegisterJobDefinition API:

"linuxParameters": {
    "sharedMemorySize": integer
}

When you run the job, it creates a shared memory area on each node that uses two or more processes. The shared memory area cannot be changed during the execution of the job. The size of the shared memory area is determined by the number of cores available in the node and the application requirements. For most jobs, a suggested initial value is 4096 MB.

Modern Linux kernels support a POSIX shared memory API. You can inspect the size of the container shared memory using the df -h /dev/shm command. The output can help you determine the shared memory space needed for your job.

Benchmarks

The following section compares the different performance using shared memory with Intel MPI 2019 update 7 and EFA.

The instance type used for the benchmark is the c5n.18xlarge and, for the multi-node use case, a cluster placement group. This compares the performance increase from shared memory versus using pure EFA communication. The first benchmark focuses on the latency of the communication in a single node use case.

OSU Micro-Benchmarks is a suite of benchmarks for measuring and evaluating the performance of MPI operations. The specific test case used is the osu_latency, measuring the minimum, maximum and average latency of a ping-pong communication between a sender and a receiver. Specifically, this is where the message sender waits for the reply from the receiver. The benchmark uses a variety of data sizes to report the average one-way latency.

The chart shows the latency in μs on the horizontal axis and packet size in bits on the vertical axis. The result shows a decrease in the communication latency using shared memory for the intra-node communication compared with using only EFA. The following chart shows the latency improvement:

The next benchmark demonstrates how shared memory can also increase the performance in a multi-node configuration. The test application is GROMACS, a versatile package to perform molecular dynamics. The overall performance of the application is susceptible to communication latency variance.

The code for the test has been downloaded from the Unified European Applications Benchmark Suite. The specific use case is named lignocellulose-rf and it uses the Reaction field for electrostatics. The details and the download link can be found in the UEABS repository.

The benchmark uses one thread per core and the following mdrun parameters:

-maxh 0.50 -resethway -noconfout -nsteps 10000 -dlb yes -nstlist 100 -pin on

The compilation options and the parameters configuration are explained in the README file. The test is run on a c5n.18xlarge instance, instead of a GPU instance, to focus on measuring the performance improvement caused specifically by increasing the number of total cores of the simulation. The chart explains the performance gain (measure in ns/day) that are achieved by increasing the number of cores. This is possible by using the shared memory for the intra-node communication during the simulation instead of using only EFA networking.

The following chart illustrates a significant percentage performance improvement by using the shared memory:

Conclusion

In this post, I show how the new shared memory support in AWS Batch is able to improve performance while decreasing the latency of the intra-node communication. This performance gain can also lower the cost of running jobs overall.

I show how to enable the usage of the shared memory in AWS Batch from the AWS Management Console or the AWS CLI. I also highlight the performance gain from using shared memory with the high-speed memory bus of the c5n.18xlarge instance, using benchmarking analysis with OSU Micro-Benchmarks and GROMACS.

AWS Batch multi-node parallel jobs are now even more performant with EFA and shared memory configurations, enabling you to focus more on your applications and less on tuning. In addition, the Elastic Fabric Adapter (EFA) has a more consistent latency and higher throughput than the TCP transport for the inter-node communication.

To learn more about using this feature, visit the Getting Started guide.

Introducing retry strategies for AWS Batch

2020-10-26 Emma White

Post Syndicated from Emma White original https://aws.amazon.com/blogs/compute/introducing-retry-strategies-for-aws-batch/

This post is contributed by Christian Kniep, Sr. Developer Advocate, HPC and AWS Batch.

Scientists, researchers, and engineers are using AWS Batch to run workloads reliably at scale, and to offload the undifferentiated heavy lifting in their day-to-day work. But even with a slight chance of failure in the stack, the act of mitigating these failures reminds customers that infrastructure, middleware and software are not error proof.

Many customers use Amazon EC2 Spot Instances to save up to 90% on their computing cost by leveraging unused EC2 capacity. If unused EC2 capacity is unavailable, an EC2 Spot Instance can be reclaimed by EC2. While AWS Batch takes care of rescheduling the job on a different instance, this rescheduling should not be handled differently depending on whether it is an application failure or some infrastructure event interrupting the job.

Starting today, customers can define how many retries are performed in cases where a task does not finish correctly. AWS Batch now allows customers define custom retry conditions, so that failures like an interruption of an instance or an infrastructure agent are handled differently, and do not just exhaust the number of retries attempted.

In this blog, I show the benefits of custom retry with AWS Batch by using different error codes from a job to control whether it should be retried. I will also demonstrate how to handle infrastructure events like a failing container image download, or an EC2 Spot interruption.

Example setup

To showcase this new feature, I use the AWS Command Line Interface (AWS CLI) to set up the following:

IAMroles, policies, and profiles to grant access and permissions
A compute environment (CE) to provide the compute resources to run jobs
A job queue, which supervises the job execution and schedules jobs on the CE
Job definitions with different retry strategies,which use a simple job to demonstrate how the new configuration can be applied

Once those tasks are completed, I submit jobs to show how you can handle different scenarios, such as infrastructure failure, application handling via error code or a middleware event.

Prerequisite

To make things easier, I first set up a couple of environment variables to have the information available for later use. I use the following code to set up the environment variables:

# in case it is not already installed
sudo yum install -y jq 
export MD_URL=http://169.254.169.254/latest/meta-data
export IFACE=$(curl -s ${MD_URL}/network/interfaces/macs/)
export SUBNET_ID=$(curl -s ${MD_URL}/network/interfaces/macs/${IFACE}/subnet-id)
export VPC_ID=$(curl -s ${MD_URL}/network/interfaces/macs/${IFACE}/vpc-id)
export AWS_REGION=$(curl -s ${MD_URL}/placement/availability-zone | sed 's/[a-z]$//')
export AWS_ACCT_ID=$(curl -s ${MD_URL}/identity-credentials/ec2/info |jq -r .AccountId)
export AWS_SG_DEFAULT=$(aws ec2 describe-security-groups \
--filters Name=group-name,Values=default \
|jq -r '.SecurityGroups[0].GroupId')

IAM

When using the AWS Management Console, I must create IAM roles manually.

Trust policies

IAM roles are defined to be used by an individual service. In the simplest case, I want a role to be used by Amazon EC2 – the service that provides the compute capacity in the cloud. The definition of which entity is able to use an IAM role is called a Trust Policy. To set up a Trust Policy for an IAM role, I use the following code snippet:

cat > ec2-trust-policy.json << EOF
{
  "Version": "2012-10-17",
  "Statement": [{
    "Effect": "Allow",
    "Principal": {
      "Service": "ec2.amazonaws.com"
    },
    "Action": "sts:AssumeRole"
  }]
}
EOF

Instance role

With the IAM trust policy, I can now create an ecsInstanceRole and attach the pre-defined policy AmazonEC2ContainerServiceforEC2Role. This allows an instance to interact with Amazon ECS.

aws iam create-role --role-name ecsInstanceRole \
 --assume-role-policy-document file://ec2-trust-policy.json
aws iam create-instance-profile --instance-profile-name ecsInstanceProfile
aws iam add-role-to-instance-profile \
    --instance-profile-name ecsInstanceProfile \
    --role-name ecsInstanceRole
aws iam attach-role-policy --role-name ecsInstanceRole \
 --policy-arn arn:aws:iam::aws:policy/service-role/AmazonEC2ContainerServiceforEC2Role

Service role

The AWS Batch service uses a role to interact with different services. The trust relationship reflects that the AWS Batch service is going to assume this role. I can set up this role with the following logic:

cat > svc-trust-policy.json << EOF
{
  "Version": "2012-10-17",
  "Statement": [{
    "Effect": "Allow",
    "Principal": {
      "Service": "batch.amazonaws.com"
    },
    "Action": "sts:AssumeRole"
  }]
}
EOF
aws iam create-role --role-name AWSBatchServiceRole \
--assume-role-policy-document file://svc-trust-policy.json
aws iam attach-role-policy --role-name AWSBatchServiceRole \
--policy-arn arn:aws:iam::aws:policy/service-role/AWSBatchServiceRole

At this point, I have created the IAM roles and policies so that the instances and services are able to interact with the AWS API operations, including trust policies to define which services are meant to use them. EC2 for the ecsInstanceRole and the AWSBatchServiceRole for the AWS Batch service itself.

Compute environment

Now, I am going to create a CE, which will launch instances to run the example jobs.

cat > compute-environment.json << EOF
{
  "computeEnvironmentName": "compute-0",
  "type": "MANAGED",
  "state": "ENABLED",
  "computeResources": {
    "type": "SPOT",
    "allocationStrategy": "SPOT_CAPACITY_OPTIMIZED",
    "minvCpus": 2,
    "maxvCpus": 32,
    "desiredvCpus": 4,
    "instanceTypes": [ "m5.xlarge","m5.2xlarge","m4.xlarge","m4.2xlarge","m5a.xlarge","m5a.2xlarge"],
    "subnets": ["${SUBNET_ID}"],
    "securityGroupIds": ["${AWS_SG_DEFAULT}"],
    "instanceRole": "arn:aws:iam::${AWS_ACCT_ID}:instance-profile/ecsInstanceRole",
    "tags": {"Name": "aws-batch-instances"},
    "ec2KeyPair": "batch-ssh-key",
    "bidPercentage": 0
  },
  "serviceRole": "arn:aws:iam::${AWS_ACCT_ID}:role/AWSBatchServiceRole"
}
EOF
aws batch create-compute-environment --cli-input-json file:// compute-environment.json

Once this is complete, my compute environment begins to launch instances. This takes a few minutes. I can use the following command to check on the status of the compute environment whenever I want:

aws batch describe-compute-environments |jq '.computeEnvironments[] |select(.computeEnvironmentName=="compute-0")'

The command uses jq to filter the output to only show the compute environment I just created.

Job queue

Now that I have my compute environment up and running, I can create a job queue, which accepts job submissions and schedules the jobs to the compute environment.

cat > job-queue.json << EOF
{
  "jobQueueName": "queue-0",
  "state": "ENABLED",
  "priority": 1,
  "computeEnvironmentOrder": [{
    "order": 0,
    "computeEnvironment": "compute-0"
  }]
}
EOF
aws batch create-job-queue --cli-input-json file://job-queue.json

Job definition

The job definition is used as a template for jobs. It is referenced in a job submission to specify the defaults of a job configuration, while some of the parameters can be overwritten when you submit.

Within the job definition, different retry strategies can be configured along with a maximum number of attempts for the job.
Three possible conditions can be used:

onExitCode will evaluate non-zero exit codes
onReason matched against middleware errors
onStatusReason can be used to react to infrastructure events such as an instance termination

Different conditions are assigned an action to either EXIT or RETRY the job. Important to note, that a job finishing with an exit code of zero will EXIT the job and not evaluate the retry condition. The default behavior for all non-zero exit code is the following:

{
  "onExitCode" : ""
  "onStatusReason" : ""
  "onReason" : "*"
  "action": retry
}

This condition retries every job that does not succeed (exit code 0) until the attempts are exhausted.

Spot Instance interruptions

AWS Batch works great with Spot Instances and customers are using this to reduce their compute cost. If Spot Instances become unavailable, instances are reclaimed by EC2, which can lead to one or more of my hosts being shut down. When this happens, the jobs running on those hosts are shut down due to an infrastructure event, not an application failure. Previously, separating these kinds of events from one another was only possible by catching the notification on the instance itself or through CloudWatch Events. Now with customer retry, you don’t have to rely on instance notifications or CloudWatch Events.

Using the job definition below, the job is restarted if the instance running the job gets shut down, which includes the termination due to a Spot Instance reclaim. The additional condition makes sure that the job exits whenever the exit code is not zero, otherwise the job would be rescheduled until the attempts are exhausted (see default behavior above).

cat > jdef-spot..json << EOF
{
    "jobDefinitionName": "spot",
    "type": "container",
    "containerProperties": {
        "image": "alpine:latest",
        "vcpus": 2,
        "memory": 256,
        "command":  ["sleep","600"],
        "readonlyRootFilesystem": false
    },
    "retryStrategy": { 
        "attempts": 5,
        "evaluateOnExit": 
        [{
            "onStatusReason" :"Host EC2*",
            "action": "RETRY"
        },{
  		  "onReason" : "*"
            "action": "EXIT"
        }]
    }
}
EOF
aws batch register-job-definition --cli-input-json file://jdef-spot.json

To simulate a Spot Instances reclaim, I submit a job, and manually shut down the host the job is running on. This triggers my condition to ask AWS Batch to make 5 attempts to finish the job before it marks the job a failure.

When I use the AWS CLI to describe my job, it displays the number of attempts to retry.

By shutting down my instance, the job returns to the status RUNNABLE and will be scheduled again until it succeeds or reaches the maximum attempts defined.

Exit code mitigation

I can also use the exit code to decide which mitigation I want to use based on the exit code of the job script or application itself.

To illustrate this, I can create a new job definition that uses a container image that exits on a random exit code between 0 and 3. Traditionally, an exit code of 0 means success, and won’t trigger this retry strategy. For all other (nonzero) exit codes the retry strategy is evaluated. In my example, 1 or 2 reflect situations where a retry is needed, but an exit code of 3 means that AWS Batch should let the job fail.

cat > jdef-randomEC.json << EOF
{
    "jobDefinitionName": "randomEC",
    "type": "container",
    "containerProperties": {
        "image": "qnib/random-ec:2020-10-13.3",
        "vcpus": 2,
        "memory": 256,
        "readonlyRootFilesystem": false
    },
    "retryStrategy": { 
        "attempts": 10,
        "evaluateOnExit": 
        [{
            "onExitCode": "1",
            "action": "RETRY"
        },{
            "onExitCode": "2",
            "action": "RETRY"
        },{
            "onExitCode": "3",
            "action": "EXIT"
        }]
    }
}
EOF
aws batch register-job-definition --cli-input-json file://jdef-randomEC.json

A submitted job retries until the exit code 0 is successful, 3 for a failure or the attempts are exhausted (in this case, 10 of them).

aws batch submit-job  --job-name randomEC-$(date +"%F_%H-%M-%S") --job-queue queue-0   --job-definition randomEC:1

The output of a job submission shows the job name and the job id.

In case the exit code is 1, and the job will be requeued.

Container image pull failure

The first example showed an error on the infrastructure layer and the second showed how to handle errors on the application layer. In this last example, I show how to handle errors that are introduced in the middleware layer, in this case: the container daemon.

It might happen if your Docker registry is down or having issues. To demonstrate this, I used an image name that is not present in the registry. In that case, the job should not get rescheduled to fail again immediately.

The following job definition again defines 10 attempts for a job, except when the container cannot be pulled. This leads to a direct failure of the job.

cat > jdef-noContainer.json << EOF
{
    "jobDefinitionName": "noContainer",
    "type": "container",
    "containerProperties": {
        "image": "no-container-image",
        "vcpus": 2,
        "memory": 256,
        "readonlyRootFilesystem": false
    },
    "retryStrategy": { 
        "attempts": 10,
        "evaluateOnExit": 
        [{
            "onReason": "CannotPullContainerError:*",
            "action": "EXIT"
        }]
    }
}
EOF
aws batch register-job-definition --cli-input-json file://jdef-noContainer.json

Note that the job defines an image name (“no-container-image”) which is not present in the registry. The job is set up to fail when trying to download the image, and will do so repeatedly, if AWS Batch keeps trying.

Even though the job definition has 10 attempts configured for this job, it fell straight through to FAILED as the retry strategy sets the action exit when a CannotPullContainerError occurs. Many of the error codes I can create conditions for are documented in the Amazon ECS user guide (e.g. task error codes / container pull error).

Conclusion

In this blog post, I showed three different scenarios that leverage the new custom retry features in AWS Batch to control when a job should exit or get rescheduled.

By defining retry strategies you can react to an infrastructure event (like an EC2 Spot interruption), an application signal (via the exit code), or an event within the middleware (like a container image not being available).

This new feature allows you to have fine grained control over how your jobs react to different error scenarios.

How to run 3D interactive applications with NICE DCV in AWS Batch

2020-10-06 Ben Peven

Post Syndicated from Ben Peven original https://aws.amazon.com/blogs/compute/how-to-run-3d-interactive-applications-with-nice-dcv-in-aws-batch/

This post is contributed by Alberto Falzone, Consultant, HPC and Roberto Meda, Senior Consultant, HPC.

High Performance Computing (HPC) workflows across industry verticals such as Design and Engineering, Oil and Gas, and Life Sciences often require GPU-based 3D/OpenGL rendering. Setting up drivers and applications for these types of workflows can require significant effort.

Similar GPU intensive workloads, such as AI/ML, are heavily using containers to package software stacks and reduce the complexity of installing and setting up the required binaries and scripts to download and run a simple container image. This approach is rarely used in the visualization of previously mentioned pre- and post-processing steps due to the complexity of using a graphical user interface within a container.

This post describes how to reduce the complexity of installing and configuring a GPU accelerated application while maintaining performance by using NICE DCV. NICE DCV is a high-performance remote display protocol that provides customers with a secure way to deliver remote desktops and application streaming from any cloud or data center to any device, over varying network conditions.

With remote server-side graphical rendering, and optimized streaming technology over network, huge volume data can be analyzed easily without moving or downloading on client, saving on data transfer costs.

Services and solution overview

This post provides a step-by-step guide on how to build a container able to run accelerated graphical applications using NICE DCV, and setup AWS Batch to run it. Finally, I will showcase how to submit an AWS Batch job that will provision the compute environment (CE) that contains a set of managed or unmanaged compute resources that are used to run jobs, launch the application in a container, and how to connect to the application with NICE DCV.

Services

Before reviewing the solution, below are the AWS services and products you will use to run your application:

AWS Batch (AWS Batch) plans, schedules, and runs batch workloads on Amazon Elastic Container Service (ECS), dynamically provisioning the defined CE with Amazon EC2
Amazon Elastic Container Registry (Amazon ECR) is a fully managed Docker container registry that simplifies how developers store, manage, and deploy Docker container images. In this example, you use it to register the Docker image with all the required software stack that will be used from AWS Batch to submit batch jobs.
NICE DCV (NICE DCV) is a high-performance remote display protocol that delivers remote desktops and application streaming from any cloud or data center to any device, over varying network conditions. With NICE DCV and Amazon EC2, customers can run graphics-intensive applications remotely on G3/G4 EC2 instances, and stream the results to client machines not provided with a GPU.
AWS Secrets Manager (AWS Secrets Manager) helps you to securely encrypt, store, and retrieve credentials for your databases and other services. Instead of hardcoding credentials in your apps, you can make calls to Secrets Manager to retrieve your credentials whenever needed.
AWS Systems Manager (AWS Systems Manager) gives you visibility and control of your infrastructure on AWS, and provides a unified user interface so you can view operational data from multiple AWS services. It also allows you to automate operational tasks across your AWS resources. Here it is used to retrieve a public parameter.
Amazon Simple Notification Service (Amazon SNS) enables applications, end-users, and devices to instantly send and receive notifications from the cloud. You can send notifications by email to the user who has created a valid and verified subscription.

Solution

The goal of this solution is to run an interactive Linux desktop session in a single Amazon ECS container, with support for GPU rendering, and connect remotely through NICE DCV protocol. AWS Batch will dynamically provision EC2 instances, with or without GPU (e.g. G3/G4 instances).

You will build and register the DCV Container image to be used for the DCV Desktop Sessions. In AWS Batch, we will set up a managed CE starting from the Amazon ECS GPU-optimized AMI, which comes with the NVIDIA drivers and Amazon ECS agent already installed. Also, you will use Amazon Secrets Manager to safely store user credentials and Amazon SNS to automatically notify the user that the interactive job is ready.

Tutorial

As a Computational Fluid Dynamics (CFD) visualization application example you will use Paraview.

This blog post goes through the following steps:

Prepare required components
- Launch temporary EC2 instance to build a DCV container image
- Store user’s credentials and notification data
- Create required roles
Build DCV container image
Create a repository on Amazon ECR
- Push the DCV container image
Configure AWS Batch
- Create a managed CE
- Create a related job queue
- Create its Job Definition
Submit a batch job
Connect to the interactive desktop session using NICE DCV
- Run the Paraview application to visualize results of a job simulation

Prerequisites

An Amazon Linux 2 instance as a Docker host, launched from the latest Amazon ECS GPU-optimized AMI
In order to connect to desktop sessions, inbound DCV port must be opened (by default DCV port is 8443)
AWS account credentials with the necessary access permissions
AWS Command Line Interface (CLI) installed and configured with the same AWS credentials
To easily install third-party/open source required software, assume that the Docker host has outbound internet access allowed

Step 1. Required components

In this step you’ll create a temporary EC2 instance dedicated to a Docker image, and create the IAM policies required for the next steps. Next create the secrets in AWS Secrets Manager service to store sensible data like credentials and SNS topic ARN, and apply and verify the required system settings.

1.1 Launch the temporary EC2 instance for Docker image building

Launch the EC2 instance that becomes your Docker host from the Amazon ECS GPU-optimized AMI. Retrieve its AMI ID. For cost saving, you can use one of t3* family instance type for this stage (e.g. t3.medium).

1.2 Store user credentials and notification data

As an example of avoiding hardcoded credentials or keys into scripts used in next stages, we’ll use AWS Secrets Manager to safely store final user’s OS credentials and other sensible data.

In the AWS Management Console select Secrets Manager, create a new secret, select type Other type of secrets, and specify key pair. Store the user login name as a key, e.g.: user001, and the password as value, then name the secret as Run_DCV_in_Batch, or alternatively you can use the commands. Note xxxxxxxxxx is your chosen password.

aws secretsmanager create-secret --secret-id Run_DCV_in_Batchaws secretsmanager put-secret-value --secret-id Run_DCV_in_Batch --secret-string '{"user001":"xxxxxxxxxx"}'

Create an SNS Topic to send email notifications to the user when a DCV session is ready for connection:
- In the AWS Management Console select Simple Notification Service, then Topics, and finally Create Topic and its related subscription with the chosen email address. Learn more.
In the AWS Management Console select Secrets Manager service to create a new secret named DCV_Session_Ready_Notification, with type other type of secrets and key pair values. Store the string sns_topic_arn as a key and the SNS Topic ARN as value:

aws secretsmanager create-secret --secret-id DCV_Session_Ready_Notification aws secretsmanager put-secret-value --secret-id DCV_Session_Ready_Notification --secret-string '{"sns_topic_arn":"<put here your SNS Topic ARN>"}'

1.3 Create required role and policy

To simplify, define a single role named dcv-ecs-batch-role gathering all the necessary policies. This role will be associated to the EC2 instance that launches from an AWS Batch job submission, so it is included inside the CE definition later.

To allow DCV sessions, push images into Amazon ECR and AWS Batch operations, create the role and include the following AWS managed and custom policies:

AmazonEC2ContainerRegistryFullAccess
AmazonEC2ContainerServiceforEC2Role
SecretsManagerReadWrite
AmazonSNSFullAccess
AmazonECSTaskExecutionRolePolicy

To reach the NICE DCV licenses stored in Amazon S3 (see licensing the NICE DCV server for more details), define a custom policy named DCVLicensePolicy (the following policy is for eu-west-1 Region, you might also use us-east-1):

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": "s3:GetObject",
            "Resource": "arn:aws:s3:::dcv-license.eu-west-1/*"
        }
    ]
}

Note: If needed, you can add additional policies to allow the copy data from/to S3 bucket.

Update the Trust relationships of the same role in order to allow the Amazon ECS tasks execution and use this role from the AWS Batch Job definition as well:

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": {
        "Service": "ec2.amazonaws.com"
      },
      "Action": "sts:AssumeRole"
    },
    {
      "Effect": "Allow",
      "Principal": {
        "Service": "ecs-tasks.amazonaws.com"
      },
      "Action": "sts:AssumeRole"
    }
  ]
}

1.4 Create required Security Group

In the AWS Management Console, access EC2, and create a Security Group, named dcv-sg, that is open to DCV sessions and DCV clients by enabling tcp port 8443 in Inbound.

Step 2. DCV container image

Now you will build a container that provides OpenGL acceleration via NICE DCV. You’ll write the Dockerfile starting from Amazon Linux 2 base image, and add DCV with its related requirements.

2.1 Define the Dockerfile

The base software packages in the Dockerfile will contain: NVIDIA libraries, X server and GNOME desktop and some external scripts to manage the DCV service startup and email notification for the user.

Starting from the base image just pulled, our Dockerfile does install all required (and optional) system tools and libraries, desktop manager packages, manage the Prerequisites for Linux NICE DCV Servers , Install the NICE DCV Server on Linux and Paraview application for 2D/3D data visualization.

The final contents of the Dockerfile is available here; in the same repository, you can also find scripts that manage the DCV service system script, the notification message sent to the User, the creation of local User at startup and the run script for the DCV container.

2.2 Build Dockerfile

Install required tools both to unpack archives and perform command on AWS:

sudo yum install -y unzip awscli

Download the Git archive within the EC2 instance, and unpack on a temporary directory:

curl -s -L -o - https://github.com/aws-samples/aws-batch-using-nice-dcv/archive/latest.tar.gz | tar zxvf -

From inside the folder containing aws-batch-using-nice-dcv.dockerfile, let’s build the Docker image:

docker build -t dcv -f aws-batch-using-nice-dcv.dockerfile .

The first time it takes a while since it has to download and install all the required packages and related dependencies. After the command completes, check it has been built and tagged correctly with the command:

docker images

Step 3. Amazon ECR configuration

In this step, you’ll push/archive our newly built DCV container AMI into Amazon ECR. Having this image in Amazon ECR allows you to use it inside Amazon ECS and AWS Batch.

3.1 Push DCV image into Amazon ECR repository

Set a desired name for your new repository, e.g. dcv, and push your latest dcv image into it. The push procedure is described in Amazon ECR by selecting your repository, and clicking on the top-right button View push commands.

Install the required tool to manage content in JSON format:

sudo yum install -y jq

Amazon ECR push commands to run include:

AWS_REGION="$(curl -s http://169.254.169.254/latest/dynamic/instance-identity/document | jq -r .region)"
eval $(aws ecr get-login --no-include-email --region "${AWS_REGION}") Note: If you receive an “Unknown options: –no-include-email” error when using the AWS CLI, ensure that you have the latest version installed. Learn more.

Create the repository:

aws ecr create-repository --repository-name=dcv —region "${AWS_REGION}"DCV_REPOSITORY=$(aws ecr describe-repositories --repository-names=dcv --region "${AWS_REGION}"| jq -r '.repositories[0].repositoryUri')

Tag the image to push the image to the Amazon ECR repository:

docker build -t "${DCV_REPOSITORY}:$(date +%F)" -f aws-batch-using-nice-dcv.dockerfile .

Push command:

docker push "${DCV_REPOSITORY}:$(date +%F)"

Step 4. AWS Batch configuration

The final step is to set up AWS Batch to manage your DCV containers. The link to all previous steps is the use of our DCV container image inside the AWS Batch CE.

4.1 Compute environment

Create an AWS Batch CE using othe newly created AMI.

Log into the AWS Management Console, select AWS Batch, select ‘get started’, and skip the wizard on next page.
Choose Compute Environments on the left, and click on Create Environment.
Specify all your desired settings, e.g.:
- - Managed type
  - Name: DCV-GPU-CE
  - Service role: AWSBatchServiceRole
  - Instance role: dcv-ecs-batch-role
Since you want OpenGL acceleration, choose an instance type with GPU (e.g. g4dn.xlarge).
Choose an allocation strategy. In this example I choose BEST_FIT_PROGRESSIVE
Assign the security group dcv-sg, created previously at step 1.4 that keeps DCV port 8443 open.
Add a Nametag with the value e.g. “DCV-GPU-Batch-Instance”; to assign it to the EC2 instances started by AWS Batch automatically, so you can recognize it if needed.

4.2 Job Queue

Time to create a Job Queue for DCV with your preferred settings.

Select Job Queues from the left menu, then select Create queue (naming, for instance, e.g. DCV-GPU-Queue)
Specify a required Priority integer value.
Associate to this queue the CE you defined in the previous step (e.g. DCV-GPU-CE).

4.3 Job Definition

Now, we create a Job Definition by selecting the related item in the left menu, and select Create.

We’ll use, listed per section:

Job Definition name (e.g. DCV-GPU-JD)
Execution timeout to 1h: 3600
Parameter section:
- Add the Parameter named command with value: --network=host
  - Note: This parameter is required and equivalent to specify the same option to the docker run.Learn more.
Environment section:
- Job role: dcv-ecs-batch-role
- Container image: Use the ECR repository previously created, e.g. dkr.ecr.eu-west-1.amazonaws.com/dcv. If you don’t remember the Amazon ECR image URI, just return to Amazon ECR -> Repository -> Images.
- vCPUs: 8
  - Note: Value equal to the vCPUs of the chosen instance type (in this example: gdn4.2xlarge), having one job per node to avoid conflicts on usage of TCP ports required by NICE DCV daemons.
- Memory (MiB): 2048
Security section:
- Check Privileged
- Set user root (run as root)
Environment Variables section:
- DISPLAY: 0
- NVIDIA_VISIBLE_DEVICES: 0
- NVIDIA_ALL_CAPABILITIES: all

Note: Amazon ECS provides a GPU-optimized AMI that comes ready with pre-configured NVIDIA kernel drivers and a Docker GPU runtime, learn more; the variables above make available the required graphic device(s) inside the container.

4.4 Create and submit a Job

We can finally, create an AWS Batch job, by selecting Batch → Jobs → Submit Job.
Let’s specify the job queue and job definition defined in the previous steps. Leave the command filed as pre-filled from job definition.

4.5 Connect to sessions

Once the job is in RUNNING state, go to the AWS Batch dashboard, you can get the IP address/DNS in several ways as noted in How do I get the ID or IP address of an Amazon EC2 instance for an AWS Batch job. For example, assuming the tag Name set on CE is DCV-GPU-Batch-Instance:

aws ec2 describe-instances --filters Name=instance-state-name,Values=running Name=tag:Name,Values="DCV-GPU-Batch-Instance" --query "Reservations[].Instances[].{id: InstanceId, tm: LaunchTime, ip: PublicIpAddress}" | jq -r 'sort_by(.tm) | reverse | .[0]' | jq -r .ip

Note: It could be required to add the EC2 policy to the list of instances in the IAM role. If the AWS SNS Topic is properly configured, as mentioned in subsection 1.2, you receive the notification email message with the URL link to connect to the interactive graphical DCV session.

Finally, connect to it:

https://<ip address>:8443

Note: You might need to wait for the host to report as running on EC2 in AWS Management Console.

Below is a NICE DCV session running inside a container using the web browser, or equivalently the NICE DCV native client as well, running Paraview visualization application. It shows the basic elbow results coming from an external OpenFoam simulation, which data has been previously copied over from an S3 bucket; and the dcvgltest as well:

Cleanup

Once you’ve finished running the application, avoid incurring future charges by navigating to the AWS Batch console and terminate the job, set CE parameter Minimum vCPUs and Desired vCPUs equal to 0. Also, navigate to Amazon EC2 and stop the temporary EC2 instance used to build the Docker image.

For a full cleanup of all of the configurations and resources used, delete: the job definition, the job queue and the CE (AWS Batch), the Docker image and ECR Repository (Amazon ECR), the role dcv-ecs-batch-role (Amazon IAM), the security group dcv-sg (Amazon EC2), the Topic DCV_Session_Ready_Notification (AWS SNS), and the secret Run_DCV_in_Batch (Amazon Secrets Manager).

Conclusion

This blog post demonstrates how AWS Batch enables innovative approaches to run HPC workflows including not only batch jobs, but also pre-/post-analysis steps done through interactive graphical OpenGL/3D applications.

You are now ready to start interactive applications with AWS Batch and NICE DCV on G-series instance types with dedicated 3D hardware. This allows you to take advantage of graphical remote rendering on optimized infrastructure without moving data to save costs.

Custom logging with AWS Batch

2020-10-02 Emma White

Post Syndicated from Emma White original https://aws.amazon.com/blogs/compute/custom-logging-with-aws-batch/

This post was written by Christian Kniep, Senior Developer Advocate for HPC and AWS Batch.

For HPC workloads, visibility into the logs of jobs is important to debug a job which failed, but also to have insights into a running job and track its trajectory to influence the configuration of the next job or terminate the job because it went off track.

With AWS Batch, customers are able to run batch workloads at scale, reliably and with ease as this managed serves takes out the undifferentiated heavy lifting. The customer can then focus on submitting jobs and getting work done. Customers told us that at a certain scale, the single logging driver available within AWS Batch made it hard to separate logs as they were all ending up in the same log group in Amazon CloudWatch.

With the new release of customer logging driver support, customers are now able to adjust how the job output is logged. Not only customize the Amazon CloudWatch setting, but enable the use of external logging frameworks such as splunk, fluentd, json-files, syslog, gelf, journald.

This allow AWS Batch jobs to use the existing systems they are accustom to, with fine-grained control of the log data for debugging and access control purposes.

In this blog, I show the benefits of custom logging with AWS Batch by adjusting the log targets for jobs. The first example will customize the Amazon CloudWatch log group, the second will log to Splunk, an external logging service.

Example setup

To showcase this new feature, I use the AWS Command Line Interface (CLI) to setup the following:

IAM roles, policies, and profiles to grant access and permissions
A compute environment to provide the compute resources to run jobs
A job queue, which supervises the job execution and schedules jobs on a compute environment
A job definition, which uses a simple job to demonstrate how the new configuration can be applied

Once those tasks are completed, I submit a job and send logs to a customized CloudWatch log-group and Splunk.

Prerequisite

To make things easier, I first set a couple of environment variables to have the information handy for later use. I use the following code to set up the environment variables.

# in case it is not already installed
sudo yum install -y jq 
export MD_URL=http://169.254.169.254/latest/meta-data
export IFACE=$(curl -s ${MD_URL}/network/interfaces/macs/)
export SUBNET_ID=$(curl -s ${MD_URL}/network/interfaces/macs/${IFACE}/subnet-id)
export VPC_ID=$(curl -s ${MD_URL}/network/interfaces/macs/${IFACE}/vpc-id)
export AWS_REGION=$(curl -s ${MD_URL}/placement/availability-zone | sed 's/[a-z]$//')
export AWS_ACCT_ID=$(curl -s ${MD_URL}/identity-credentials/ec2/info |jq -r .AccountId)
export AWS_SG_DEFAULT=$(aws ec2 describe-security-groups \
--filters Name=group-name,Values=default \
|jq -r '.SecurityGroups[0].GroupId')

IAM

When using the AWS Management Console, you must create IAM roles manually.

Trust Policies

IAM Roles are defined to be used by a certain service. In the simplest case, you want a role to be used by Amazon EC2 – the service that provides the compute capacity in the cloud. This defines which entity is able to use an IAM Role, called Trust Policy. To set up a trust policy for an IAM role, use the following code snippet.

cat > ec2-trust-policy.json << EOF
{
  "Version": "2012-10-17",
  "Statement": [{
    "Effect": "Allow",
    "Principal": {
      "Service": "ec2.amazonaws.com"
    },
    "Action": "sts:AssumeRole"
  }]
}
EOF

Instance role

With the IAM trust policy, I now create an ecsInstanceRole and attach the pre-defined policy AmazonEC2ContainerServiceforEC2Role. This allows an instance to interact with Amazon ECS.

aws iam create-role --role-name ecsInstanceRole \
 --assume-role-policy-document file://ec2-trust-policy.json
aws iam create-instance-profile --instance-profile-name ecsInstanceProfile
aws iam add-role-to-instance-profile \
    --instance-profile-name ecsInstanceProfile \
    --role-name ecsInstanceRole
aws iam attach-role-policy --role-name ecsInstanceRole \
 --policy-arn arn:aws:iam::aws:policy/service-role/AmazonEC2ContainerServiceforEC2Role

Service Role

The AWS Batch service uses a role to interact with different services. The trust relationship reflects that the AWS Batch service is going to assume this role. You can set up this role with the following logic.

cat > svc-trust-policy.json << EOF
{
  "Version": "2012-10-17",
  "Statement": [{
    "Effect": "Allow",
    "Principal": {
      "Service": "batch.amazonaws.com"
    },
    "Action": "sts:AssumeRole"
  }]
}
EOF
aws iam create-role --role-name AWSBatchServiceRole \
--assume-role-policy-document file://svc-trust-policy.json
aws iam attach-role-policy --role-name AWSBatchServiceRole \
--policy-arn arn:aws:iam::aws:policy/service-role/AWSBatchServiceRole

In addition to dealing with Amazon ECS, the instance role can create and write to Amazon CloudWatch log groups, to control which log group names are used, a condition is attached.

While the compute environment is coming up, let us create and attach a policy to make a new log-group possible.

cat > policy.json << EOF
{
  "Version": "2012-10-17",
  "Statement": [{
    "Effect": "Allow",
    "Action": [
      "logs:CreateLogGroup"
    ],
    "Resource": "*",
    "Condition": {
      "StringEqualsIfExists": {
        "batch:LogDriver": ["awslogs"],
        "batch:AWSLogsGroup": ["/aws/batch/custom/*"]
      }
    }
  }]
}
EOF
aws iam create-policy --policy-name batch-awslog-policy \
    --policy-document file://policy.json
aws iam attach-role-policy --policy-arn arn:aws:iam::${AWS_ACCT_ID}:policy/batch-awslog-policy --role-name ecsInstanceRole

At this point, I created the IAM roles and policies so that the instance and service are able to interact with the AWS APIs, including trust-policies to define which services are meant to use them. EC2 for the ecsInstanceRole and the AWSBatchServiceRole for the AWS Batch service itself.

Compute environment

Now, I am going to create a compute environment, which is going to spin up an instance (one vCPU target) to run the example job in.

cat > compute-environment.json << EOF
{
  "computeEnvironmentName": "od-ce",
  "type": "MANAGED",
  "state": "ENABLED",
  "computeResources": {
    "type": "EC2",
    "allocationStrategy": "BEST_FIT_PROGRESSIVE",
    "minvCpus": 1,
    "maxvCpus": 8,
    "desiredvCpus": 1,
    "instanceTypes": ["m5.xlarge"],
    "subnets": ["${SUBNET_ID}"],
    "securityGroupIds": ["${AWS_SG_DEFAULT}"],
    "instanceRole": "arn:aws:iam::${AWS_ACCT_ID}:instance-profile/ecsInstanceRole",
    "tags": {"Name": "aws-batch-compute"},
    "bidPercentage": 0
  },
  "serviceRole": "arn:aws:iam::${AWS_ACCT_ID}:role/AWSBatchServiceRole"
}
EOF
aws batch create-compute-environment --cli-input-json file://compute-environment.json

Once this section is complete, a compute environment is being spun up in the back. This will take a moment. You can use the following command to check on the status of the compute environment.

aws batch describe-compute-environments

Once it is enabled and valid we can continue by setting up the job queue.

Job Queue

Now that I have a compute environment up and running, I will create a job queue which accepts job submissions and schedules the jobs on the compute environment.

cat > job-queue.json << EOF
{
  "jobQueueName": "jq",
  "state": "ENABLED",
  "priority": 1,
  "computeEnvironmentOrder": [{
    "order": 0,
    "computeEnvironment": "od-ce"
  }]
}
EOF
aws batch create-job-queue --cli-input-json file://job-queue.json

Job definition

The job definition is used as a template for jobs. This example runs a plain container and prints the environment variables. With the new release of AWS Batch, the logging driver awslogs now allows you to change the log group configuration within the job definition.

cat > job-definition.json << EOF
{
  "jobDefinitionName": "alpine-env",
  "type": "container",
  "containerProperties": {
  "image": "alpine",
  "vcpus": 1,
  "memory": 128,
  "command": ["env"],
  "readonlyRootFilesystem": true,
  "logConfiguration": {
    "logDriver": "awslogs",
    "options": { 
      "awslogs-region": "${AWS_REGION}", 
      "awslogs-group": "/aws/batch/custom/env-queue",
      "awslogs-create-group": "true"}
    }
  }
}
EOF
aws batch register-job-definition --cli-input-json file://job-definition.json

Job Submission

Using the above job definition, you can now submit a job.

aws batch submit-job \
  --job-name test-$(date +"%F_%H-%M-%S") \
  --job-queue arn:aws:batch:${AWS_REGION}:${AWS_ACCT_ID}:job-queue/jq \
  --job-definition arn:aws:batch:${AWS_REGION}:${AWS_ACCT_ID}:job-definition/alpine-env:1

Now, you can check the ‘Log Group’ in CloudWatch. Go to the CloudWatch console and find the ‘Log Group’ section on the left.

log groups in cloudwatch

Now, click on the log group defined above, and you should see the output of the job which allows for debugging if something within the container went wrong or processing logs and create alarms and reports.

cloudwatch log events

Splunk

Splunk is an established log engine for a broad set of customers. You can use the Docker container to set up a Splunk server quickly. More information can be found in the Splunk documentation. You need to configure the HTTP Event Collector, which provides you with a link and a token.

To send logs to Splunk, create an additional job-definition with the Splunk token and URL. Please adjust the splunk-url and splunk-token to match your Splunk setup.

{
  "jobDefinitionName": "alpine-splunk",
  "type": "container",
  "containerProperties": {
    "image": "alpine",
    "vcpus": 1,
    "memory": 128,
    "command": ["env"],
    "readonlyRootFilesystem": false,
    "logConfiguration": {
      "logDriver": "splunk",
      "options": {
        "splunk-url": "https://<splunk-url>",
        "splunk-token": "XXX-YYY-ZZZ"
      }
    }
  }
}

This forwards the logs to Splunk, as you can see in the following image.

forward to splunk

Conclusion

This blog post showed you how to apply custom logging to AWS Batch using the awslog and Splunk logging driver. While these are two important logging drivers, please head over to the documentation to find out about fluentd, syslog, json-file and other drivers to find the best driver to match your current logging infrastructure.

Shared memory paradigm

MPI with intra-node shared memory communication

MPI jobs in AWS Batch

Shared memory in AWS Batch

Benchmarks

Conclusion

Example setup

Prerequisite

IAM

Trust policies

Instance role

Service role

Compute environment

Job queue

Job definition

Spot Instance interruptions

Exit code mitigation

Container image pull failure

Conclusion

Services and solution overview

Services

Solution

Tutorial

Prerequisites

Step 1. Required components

1.1 Launch the temporary EC2 instance for Docker image building

1.2 Store user credentials and notification data

1.3 Create required role and policy

1.4 Create required Security Group

Step 2. DCV container image

2.1 Define the Dockerfile

2.2 Build Dockerfile

Step 3. Amazon ECR configuration

3.1 Push DCV image into Amazon ECR repository

Step 4. AWS Batch configuration

4.1 Compute environment

4.2 Job Queue

4.3 Job Definition

4.4 Create and submit a Job

4.5 Connect to sessions

Cleanup

Conclusion

Example setup

Prerequisite

IAM

Trust Policies

Instance role

Service Role

Compute environment

Job Queue

Job definition

Job Submission

Splunk

Conclusion

The collective thoughts of the interwebz