Tag Archives: GPU

VC4 and V3D OpenGL drivers for Raspberry Pi: an update

Post Syndicated from Liz Upton original https://www.raspberrypi.org/blog/vc4-and-v3d-opengl-drivers-for-raspberry-pi-an-update/

Here’s an update from Iago Toral of Igalia on development of the open source VC4 and V3D OpenGL drivers used by Raspberry Pi.

Some of you may already know that Eric Anholt, the original developer of the open source VC4 and V3D OpenGL drivers used by Raspberry Pi, is no longer actively developing these drivers and a team from Igalia has stepped in to continue his work. My name is Iago Toral (itoral), and together with my colleagues Alejandro Piñeiro (apinheiro) and José Casanova (chema), we have been hard at work learning about the V3D GPU hardware and Eric’s driver design over the past few months.

Learning a new GPU is a lot of work, but I think we have been making good progress and in this post we would like to share with the community some of our recent contributions to the driver and some of the plans we have for the future.

But before we go into the technical details of what we have been up to, I would like to give some context about the GPU hardware and current driver status for Raspberry Pi 4, which is where we have been focusing our efforts.

The GPU bundled with Raspberry Pi 4 is a VideoCore VI capable of OpenGL ES 3.2, a significant step up from the VideoCore IV in Raspberry Pi 3, which could only do OpenGL ES 2.0. Although both GPU models belong to Broadcom’s VideoCore family, they have quite significant architectural differences, so we also have two separate OpenGL driver implementations. Unfortunately, as you may have guessed, this also means that driver work on one GPU isn’t directly useful for the other, and that any new feature development we do for the Raspberry Pi 4 driver stack won’t automatically carry over to Raspberry Pi 3.

The driver code for both GPU models is available in the Mesa upstream repository. The codename for the VideoCore IV driver is VC4, and the codename for the VideoCore VI driver is V3D. There are no downstream repositories – all development happens directly upstream, which has a number of benefits for end users:

  1. It is relatively easy for the more adventurous users to experiment with development builds of the driver (see the build sketch after this list).
  2. It is fairly simple to follow development activities by tracking merge requests with the V3D and VC4 labels.
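
For the adventurous, here is a minimal sketch of what a development build might look like on a Raspberry Pi 4. Treat it as a starting point rather than official instructions: the dependency list and meson options depend on your Mesa version and windowing setup.

# Sketch only: build Mesa with the VC4/V3D Gallium drivers. Install Mesa's build
# dependencies first (meson, ninja, python3-mako, libdrm, plus your X11/Wayland dev packages).
git clone https://gitlab.freedesktop.org/mesa/mesa.git
cd mesa
meson setup build -Dgallium-drivers=v3d,vc4,kmsro -Dvulkan-drivers=
ninja -C build
# Run a test program against the freshly built drivers without installing them system-wide:
LIBGL_DRIVERS_PATH=$PWD/build/src/gallium/targets/dri glxgears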

At present, the V3D driver exposes OpenGL ES 3.0 and OpenGL 2.1. As I mentioned above, the VideoCore VI GPU can do OpenGL ES 3.2, but it can’t do OpenGL 3.0, so future feature work will focus on OpenGL ES.

Okay, so with that introduction out of the way, let’s now go into the nitty-gritty of what we have been working on as we ramped up over the last few months:

Disclaimer: I won’t detail here everything we have been doing because then this would become a long and boring changelog list; instead I will try to summarize the areas where we put more effort and the benefits that the work should bring. For those interested in the full list of changes, you can always go to the upstream Mesa repository and scan it for commits with Igalia authorship and the v3d tag.
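
For example, from a checkout of the Mesa repository, something like the following (a rough filter rather than an official changelog) lists the relevant commits:

# Rough filter: recent V3D commits authored at Igalia.
git log --oneline --author='igalia.com' --grep='^v3d' -- src/broadcom src/gallium/drivers/v3d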

First we have the shader compiler, where we implemented a bunch of optimizations that should be producing better (faster) code for many shader workloads. This involved work at the NIR level, the lower-level IR specific to V3D, and the assembly instruction scheduler. The shader-db graph below shows how the shader compiler has evolved over the last few months. It should be noted here that one of the benefits of working within the Mesa ecosystem is that we get a lot of shader optimization work done by other Mesa contributors, since some parts of the compiler stack are shared across multiple drivers.

Bar chart with y-axis range from -12.00% to +2.00%. It is annotated, "Lower is better except for Threads". There are four bars: Instructions (about -4.75%); Threads (about 0.25%); Uniforms (about -11.00%); and Splits (about 0.50%).

Evolution of the shader compiler (June vs present)

Another area where we have done significant work is transform feedback. Here, we fixed some relevant flushing bugs that could cause transform feedback results to not be visible to applications after rendering. We also optimized the transform feedback process to better use the hardware for in-pipeline synchronization of transform feedback workloads, without always having to resort to external job flushing, which should be better for performance. Finally, we provided a better implementation of transform feedback primitive count queries that makes better use of the GPU (the previous implementation handled all of this on the CPU side) and that correctly handles overflow of the transform feedback buffers (there was no overflow handling previously).

We also implemented support for OpenGL Logic Operations, an OpenGL 2.0 feature that was missing in the V3D driver. As it turns out, the default LibreOffice theme in Raspbian was triggering a path in Glamor that relied on this feature to render the cursor, which caused a rendering bug. Although Raspbian has since been updated to use a different theme, we made sure to implement this feature and verify that the bug is now fixed for the original theme as well.

Fixing Piglit and CTS test failures has been another focus of our work in these initial months, bringing us closer to driver conformance. You can check the graph below showcasing Piglit test results for a quick view of how things have evolved over the last few months. This work includes a fix for a rather annoying bug in the way the kernel driver was handling L2 cache invalidation, which could lead to GPU hangs. If you have observed kernel messages warning about write violations (maybe accompanied by GPU hangs), those should now be fixed in the kernel driver. This fix is accompanied by a user-space fix that should be merged soon into the upstream V3D driver.

A bar chart with y-axis ranging from 0 to 16000. There are three groups of bars: "June (master)"; "Present (master)"; Present (GLES 3.1)". Each group has three bars: Pass; Fail; Skip. Passes are higher in the "Present (master)" and "Present (GLES 3.1)" groups of bars than in the "June (master)" group, and skips and fails are lower.

Evolution of Piglit test results (June vs present)

As a curiosity, here is a picture of our own little continuous integration system, which we use to run regression tests both regularly and before submitting code for review.

Ten Raspberry Pis with small black fans, most of them in colourful Pimoroni Pibow open cases, in a nest of cables and labels

Our continuous integration system

The other big piece of work we have been tackling, and that we are very excited about, is OpenGL ES 3.1, which will bring Compute Shaders to Raspberry Pi 4! Credit for this goes to Eric Anholt, who did all the implementation work before leaving – he just never got to the point where it was ready to be merged, so we have picked up Eric’s original work, rebased it, and worked on bug fixes to have a fully conformant implementation. We are currently hard at work squashing the last few bugs exposed by the Khronos Conformance Test Suite and we hope to be ready to merge this functionality in the next major Mesa release, so look forward to it!

Compute Shaders are a really cool feature, but they won’t be the last. I’d like to end this post with a small note on another large feature that is currently in the early stages of development: Geometry Shaders, which will bring the V3D driver one step closer to exposing a fully programmable 3D pipeline, so look forward to that as well!

The post VC4 and V3D OpenGL drivers for Raspberry Pi: an update appeared first on Raspberry Pi.

Creating an AWS Batch environment for mixed CPU and GPU genomics workflows

Post Syndicated from Chris Munns original https://aws.amazon.com/blogs/compute/creating-an-aws-batch-environment-for-mixed-cpu-and-gpu-genomics-workflows/

This post is courtesy of Lee Pang – AWS Technical Business Development 

I recently worked with a customer who needed to process a bunch of raw sequence files (FastQs) into Hi-C format (*.hic), which is used for the structural analysis of DNA/chromatin loops and sequence accessibility. The tooling they were interested in using was the Juicer suite and they needed a minimal workflow:

  • Align the sample to the reference using the juicer CLI utility.
  • Annotate loops using the HiCCUPS algorithm from the juicer-tools library.

Because they had many files to process, they wanted to do this as scalably as possible. The juicer step of the workflow was CPU and memory-intensive, while the HiCCUPS step needed GPU acceleration. So, they were interested in using AWS Step Functions and AWS Batch.

Since its launch, AWS Batch has made it possible to create scalable compute environments for processing a mixture of CPU- and memory-intensive jobs, which covers the needs of the majority of genomics workflows. So, how do you create a genomics workflow environment using AWS Batch that also includes GPUs? Thankfully, AWS Batch recently announced support for GPU resources!

In this post, I show you how to use these new features to execute mixed CPU and GPU genomics workflows. By the end of this post, you will be able to build the architecture shown below.

Configuring AWS Batch for running CPU and GPU jobs

To handle a mixture of CPU and GPU jobs, the recommended strategy is to create multiple compute environments and job queues:

  • GPU-only resources
    • Compute environments (using the ECS GPU Optimized AMI)
    • Spot Instances of the P2 and P3 instance family
    • On-Demand Instances of the P2 and P3 instance family
    • Job queue for GPU compute environments
  • CPU-only resources
    • Compute environments (using the default ECS Optimized AMI)
    • Spot Instances of the “optimal” instance family
    • On-Demand Instances of the “optimal” instance family
    • Job queue for CPU compute environments

With the above in place, you then point CPU jobs to the “CPU” queue and GPU jobs to the “GPU” queue.

Notice that CPU and GPU resources are kept separate with queues and compute environments. I don’t recommend creating a compute environment or job queue that mixes CPU and GPU optimized instance types. In a mixed compute environment or queue, there is a chance that CPU jobs could be placed on GPU instances when no GPU jobs are scheduled. This could result in few (or no) GPU instances available when GPU jobs must be run.

Create a GPU compute environment

Amazon EC2 has a wide variety of instance types, including the P3 family of instances that enable GPU-accelerated computing. Previously, the best option for using GPUs in AWS Batch was to create a compute environment based on the publicly available Deep Learning AMI (the same AMI used by AI/ML services such as Amazon SageMaker), which supports running containers and comes with NVIDIA/CUDA drivers pre-installed.

Earlier this year, the Amazon ECS team announced the availability of the ECS GPU-optimized AMI. It’s essentially the same as the existing Amazon ECS-optimized Amazon Linux 2 AMI but with pre-installed capabilities to provide Docker containers with access to GPU acceleration. For more information about what’s included, see Amazon ECS-optimized AMIs.

The key point is that this is a much more lightweight solution, which makes creating GPU-specific AWS Batch compute environments much easier.

Creating an AWS Batch compute environment specifically for GPU jobs is similar to creating one for CPU jobs. The key difference is that you select only GPU instance families for the instance types.

For compute environments that use the P2, P3, and P3dn instance families, AWS Batch automatically associates the ECS GPU-optimized AMI. You don’t have to create a custom AMI for GPU jobs to run on these instances.

The G2 and G3 families use a different type of GPU and have different drivers. Compute environments that use G2 and G3 families need a custom AMI to take advantage of acceleration. Otherwise, they default to the ECS-optimized AMI.

To create a GPU-enabled compute environment with the AWS CLI, create a file called gpu-ce.json with the following contents:

{
    "computeEnvironmentName": "gpu",
    "type": "MANAGED",
    "state": "ENABLED",
    "serviceRole": "arn:aws:iam::<account-id>:role/service-role/AWSBatchServiceRole",
    "computeResources": {
        "type": "EC2",
        "subnets": [
            "<subnet-id-1>",
            "<subnet-id-2>",
            "<subnet-id-3>"
        ],
        "tags": {
            "Name": "batch-gpu-worker"
        },
        "desiredvCpus": 0,
        "minvCpus": 0,
        "instanceTypes": [
            "p3",
            "p2"
        ],
        "instanceRole": "arn:aws:iam::<account-id>:instance-profile/ecsInstanceRole",
        "maxvCpus": 256,
        "securityGroupIds": [
            "<security-group-1>",
            "<security-group-2>"
        ],
        "ec2KeyPair": "<keypair-name>"
    }
}

From the command line, run the following:

aws batch create-compute-environment --cli-input-json file://gpu-ce.json

Create a GPU job queue

When you have a GPU compute environment, you can associate it with a dedicated GPU job queue.

From the command line, run the following:

aws batch create-job-queue \
    --job-queue-name gpu \
    --state ENABLED \
    --priority 100 \
    --compute-environment-order order=1,computeEnvironment=gpu
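
The CPU-only queue is created the same way. Here is a sketch, assuming you have already created a matching compute environment named cpu built on the default ECS-optimized AMI and the “optimal” instance family:

aws batch create-job-queue \
    --job-queue-name cpu \
    --state ENABLED \
    --priority 100 \
    --compute-environment-order order=1,computeEnvironment=cpu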

Specifying GPU resources in AWS Batch jobs

Creating job definitions in AWS Batch has not changed much. However, now there’s an additional field under Resource requirements that lets you specify how many GPUs the job should use.

The JSON for registering a job definition with GPU requirements looks like the following:

{
    "jobDefinitionName": "hiccups", 
    "type": "container", 
    "parameters": {
        "OutputS3Prefix": "s3://<bucket-name>/juicer/HIC003", 
        "InputHICS3Path": "s3://<bucket-name>/juicer/HIC003/aligned/inter.hic"
    }, 
    "containerProperties": {
        "mountPoints": [], 
        "image": "<docker-image-repository>/juicer-tools:latest", 
        "environment": [], 
        "vcpus": 8, 
        "command": [
            "hiccups", 
            "Ref::InputHICS3Path", 
            "Ref::OutputS3Prefix"
        ], 

        /* BEGIN NEW STUFF (delete comment before use) */
        "resourceRequirements" : [
            {
                "type" : "GPU",
                "value" : "1"
            }
        ],
        /* END NEW STUFF (delete comment before use) */
        
        "volumes": [], 
        "memory": 60000, 
        "ulimits": []
    }
}
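
After saving the definition (for example as hiccups-jobdef.json; the filename is arbitrary), you can register it and submit a test job to the GPU queue:

aws batch register-job-definition --cli-input-json file://hiccups-jobdef.json
aws batch submit-job \
    --job-name hiccups-test \
    --job-queue gpu \
    --job-definition hiccups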

Containerization considerations

To ensure that your containerized task can use GPU acceleration, use nvidia/cuda base images when building the container image for your job. For example, for a CentOS-based image with CUDA 9.2, your Dockerfile should have the following:

FROM nvidia/cuda:9.2-devel-centos7

This could be at the top if you are building the container entirely from scratch, or later if you are using a multi-stage build.

In this case, I already had a CentOS-based image for the juicer utility that I could recycle for juicer-tools. So my Dockerfile looked like the following:

FROM juicer AS base
FROM nvidia/cuda:9.2-devel-centos7

COPY --from=base /opt/juicer /opt/juicer

RUN yum install -y awscli
RUN yum install -y java-1.8.0-openjdk

WORKDIR /opt/juicer/scripts
COPY juicer-tools.aws.sh .

WORKDIR /opt/juicer/work
ENTRYPOINT ["/opt/juicer/scripts/juicer-tools.aws.sh"]
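
Building and publishing the image is then a standard Docker workflow; a sketch, with the repository URI left as a placeholder:

docker build -t juicer-tools:latest .
docker tag juicer-tools:latest <docker-image-repository>/juicer-tools:latest
docker push <docker-image-repository>/juicer-tools:latest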

Running a mixed CPU / GPU workflow

To run a workflow that contains a mixture of CPU- and GPU-based jobs, you can use AWS Step Functions. This blog channel previously covered how Step Functions and AWS Batch can be combined to run genomics workflows on AWS. Also, Step Functions and AWS Batch are now more tightly integrated, making scalable workflow solutions much easier to build.

With all the AWS Batch resources in place, it is a matter of pointing each task at the job queue with the right compute resources.

The solution I put together for the customer with the Juicer suite resulted in the following state machine:

{
    "Comment": "State machine for FASTQ to HIC with annotation using juicer and hiccups",
    "StartAt": "JuicerTask",
    "States": {
        "JuicerTask": {
            "Type": "Task",
            "InputPath": "$",
            "ResultPath": "$.juicer.status",
            "Resource": "arn:aws:states:::batch:submitJob.sync",
            "Parameters": {
                "JobDefinition": "arn:aws:batch:<region>:<account_number>:job-definition/juicer:2",
                "JobName": "juicer",
                "JobQueue": "arn:aws:batch:<region>:<account_number>:job-queue/cpu",
                "Parameters.$": "$.juicer.parameters"
            },
            "Next": "HiccupsTask"
        },
        "HiccupsTask": {
            "Type": "Task",
            "InputPath": "$",
            "ResultPath": "$.hiccups.status",
            "Resource": "arn:aws:states:::batch:submitJob.sync",
            "Parameters": {
                "JobDefinition": "arn:aws:batch:<region>:<account_number>:job-definition/juicer-tools:2",
                "JobName": "hiccups",
                "JobQueue": "arn:aws:batch:<region>:<account_number>:job-queue/gpu",
                "Parameters.$": "$.hiccups.parameters"
            },
            "End": true
        }
    }
}

JuicerTask is submitted to the cpu job queue, while HiccupsTask is submitted to the gpu job queue.
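
Creating and running the state machine from the CLI looks roughly like the following. The state machine name, execution role, and input file are placeholders; the input JSON must carry the juicer and hiccups parameter blocks referenced by the Parameters.$ paths above:

aws stepfunctions create-state-machine \
    --name juicer-hic-workflow \
    --definition file://state-machine.json \
    --role-arn arn:aws:iam::<account_number>:role/StepFunctionsBatchExecutionRole
aws stepfunctions start-execution \
    --state-machine-arn arn:aws:states:<region>:<account_number>:stateMachine:juicer-hic-workflow \
    --input file://workflow-input.json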

Conclusion

With the ECS GPU-optimized AMI and AWS Batch support for GPU resources, you can easily build a scalable solution for running genomics workflows with both CPU and GPU resources.

Build on!

GPU workloads on AWS Batch

Post Syndicated from Josh Rad original https://aws.amazon.com/blogs/compute/gpu-workloads-on-aws-batch/

Contributed by Manuel Manzano Hoss, Cloud Support Engineer

I remember playing around with graphics processing unit (GPU) workload examples in 2017 when the Deep Learning on AWS Batch post was published by my colleague Kiuk Chung. He provided an example of how to train a convolutional neural network (CNN), the LeNet architecture, to recognize handwritten digits from the MNIST dataset using Apache MXNet as the framework. Back then, to run such jobs with GPU capabilities, I had to do the following:

  • Create a custom GPU-enabled AMI with Docker, the ECS agent, the NVIDIA driver and container runtime, and CUDA installed.
  • Identify the type of P2 EC2 instance that had the required amount of GPU for my job.
  • Check the number of vCPUs that it offered (even if I was not interested in using them).
  • Specify that number of vCPUs for my job.

All that, without any certainty that the required GPU would actually be available once my job was running. Back then, there was no GPU pinning: other jobs running on the same EC2 instance were able to use that GPU, making the orchestration of my jobs a tricky task.

Fast forward two years. Today, AWS Batch announced integrated support for Amazon EC2 Accelerated Instances. It is now possible to specify a number of GPUs as a resource that AWS Batch considers when choosing the EC2 instance to run your job, along with vCPUs and memory. That allows me to take advantage of the main benefits of using AWS Batch: the compute resource selection algorithm and the job scheduler. It also frees me from having to check which EC2 instance types have enough GPUs.

Also, I can take advantage of the Amazon ECS GPU-optimized AMI maintained by AWS. It comes with the NVIDIA drivers and all the necessary software to run GPU-enabled jobs. When I allow the P2 or P3 instance types on my compute environment, AWS Batch launches my compute resources using the Amazon ECS GPU-optimized AMI automatically.

In other words, now I don’t worry about the GPU task list mentioned earlier. I can focus on deciding which framework and command to run on my GPU-accelerated workload. At the same time, I’m now sure that my jobs have access to the required performance, as physical GPUs are pinned to each job and not shared among them.

A GPU race against the past

As a kind of GPU-race exercise, I checked a similar example to the one from Kiuk’s post, to see how fast it could be to run a GPU-enabled job now. I used the AWS Management Console to demonstrate how simple the steps are.

In this case, I decided to use the deep neural network architecture called multilayer perceptron (MLP), not the LeNet CNN, to compare the validation accuracy between them.

To make the test even simpler and faster to implement, I thought I would use one of the recently announced AWS Deep Learning containers, which come pre-packed with different frameworks and ready-to-process data. I chose the container that comes with MXNet and Python 2.7, customized for Training and GPU. For more information about the Docker images available, see the AWS Deep Learning Containers documentation.

In the AWS Batch console, I created a managed compute environment with the default settings, allowing AWS Batch to create the required IAM roles on my behalf.

On the configuration of the compute resources, I selected the P2 and P3 families of instances, as those are the type of instance with GPU capabilities. You can select On-Demand Instances, but in this case I decided to use Spot Instances to take advantage of the discounts that this pricing model offers. I left the defaults for all other settings, selecting the AmazonEC2SpotFleetRole role that I created the first time that I used Spot Instances:

Finally, I also left the network settings as default. My compute environment selected the default VPC, three subnets, and a security group. They are enough to run my jobs and at the same time keep my environment safe by limiting connections from outside the VPC:

I created a job queue, GPU_JobQueue, attaching it to the compute environment that I just created:

Next, I registered the same job definition that I would have created following Kiuk’s post. I specified enough memory to run this test, one vCPU, and the AWS Deep Learning Docker image that I chose, in this case mxnet-training:1.4.0-gpu-py27-cu90-ubuntu16.04. The number of GPUs required was, in this case, one. To have access to run the script, the container must run as privileged, or using the root user.
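
For reference, a rough CLI equivalent of those console steps might look like the following. The registry prefix, memory value, and job definition name are illustrative placeholders; check the AWS Deep Learning Containers documentation for the exact image URI in your Region:

aws batch register-job-definition \
    --job-definition-name mxnet-train-gpu \
    --type container \
    --container-properties '{
        "image": "<dl-containers-registry>/mxnet-training:1.4.0-gpu-py27-cu90-ubuntu16.04",
        "vcpus": 1,
        "memory": 8192,
        "privileged": true,
        "resourceRequirements": [{"type": "GPU", "value": "1"}]
    }'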

Finally, I submitted the job. I first cloned the MXNet repository for the train_mnist.py Python script. Then I ran the script itself, with the parameter --gpus 0 to indicate that the assigned GPU should be used. The job inherits all the other parameters from the job definition:

sh -c 'git clone -b 1.3.1 https://github.com/apache/incubator-mxnet.git && python /incubator-mxnet/example/image-classification/train_mnist.py --gpus 0'

That’s all, and my GPU-enabled job was running. It took me less than two minutes to go from zero to having the job submitted. This is the log of my job, from which I removed the iterations from epoch 1 to 18 to make it shorter:

14:32:31     Cloning into 'incubator-mxnet'...

14:33:50     Note: checking out '19c501680183237d52a862e6ae1dc4ddc296305b'.

14:33:51     INFO:root:start with arguments Namespace(add_stn=False, batch_size=64, disp_batches=100, dtype='float32', gc_threshold=0.5, gc_type='none', gpus='0', initializer='default', kv_store='device', load_epoch=None, loss='', lr=0.05, lr_factor=0.1, lr_step_epochs='10', macrobatch_size=0, model_prefix=None, mom=0.9, monitor=0, network='mlp', num_classes=10, num_epochs=20, num_examples=60000, num_layers=No

14:33:51     DEBUG:urllib3.connectionpool:Starting new HTTP connection (1): yann.lecun.com:80

14:33:54     DEBUG:urllib3.connectionpool:http://yann.lecun.com:80 "GET /exdb/mnist/train-labels-idx1-ubyte.gz HTTP/1.1" 200 28881

14:33:55     DEBUG:urllib3.connectionpool:Starting new HTTP connection (1): yann.lecun.com:80

14:33:55     DEBUG:urllib3.connectionpool:http://yann.lecun.com:80 "GET /exdb/mnist/train-images-idx3-ubyte.gz HTTP/1.1" 200 9912422

14:33:59     DEBUG:urllib3.connectionpool:Starting new HTTP connection (1): yann.lecun.com:80

14:33:59     DEBUG:urllib3.connectionpool:http://yann.lecun.com:80 "GET /exdb/mnist/t10k-labels-idx1-ubyte.gz HTTP/1.1" 200 4542

14:33:59     DEBUG:urllib3.connectionpool:Starting new HTTP connection (1): yann.lecun.com:80

14:34:00     DEBUG:urllib3.connectionpool:http://yann.lecun.com:80 "GET /exdb/mnist/t10k-images-idx3-ubyte.gz HTTP/1.1" 200 1648877

14:34:04     INFO:root:Epoch[0] Batch [0-100] Speed: 37038.30 samples/sec accuracy=0.793472

14:34:04     INFO:root:Epoch[0] Batch [100-200] Speed: 36457.89 samples/sec accuracy=0.906719

14:34:04     INFO:root:Epoch[0] Batch [200-300] Speed: 36981.20 samples/sec accuracy=0.927500

14:34:04     INFO:root:Epoch[0] Batch [300-400] Speed: 36925.04 samples/sec accuracy=0.935156

14:34:04     INFO:root:Epoch[0] Batch [400-500] Speed: 37262.36 samples/sec accuracy=0.940156

14:34:05     INFO:root:Epoch[0] Batch [500-600] Speed: 37729.64 samples/sec accuracy=0.942813

14:34:05     INFO:root:Epoch[0] Batch [600-700] Speed: 37493.55 samples/sec accuracy=0.949063

14:34:05     INFO:root:Epoch[0] Batch [700-800] Speed: 37320.80 samples/sec accuracy=0.953906

14:34:05     INFO:root:Epoch[0] Batch [800-900] Speed: 37705.85 samples/sec accuracy=0.958281

14:34:05     INFO:root:Epoch[0] Train-accuracy=0.924024

14:34:05     INFO:root:Epoch[0] Time cost=1.633

...  LOGS REMOVED

14:34:44     INFO:root:Epoch[19] Batch [0-100] Speed: 36864.44 samples/sec accuracy=0.999691

14:34:44     INFO:root:Epoch[19] Batch [100-200] Speed: 37088.35 samples/sec accuracy=1.000000

14:34:44     INFO:root:Epoch[19] Batch [200-300] Speed: 36706.91 samples/sec accuracy=0.999687

14:34:44     INFO:root:Epoch[19] Batch [300-400] Speed: 37941.19 samples/sec accuracy=0.999687

14:34:44     INFO:root:Epoch[19] Batch [400-500] Speed: 37180.97 samples/sec accuracy=0.999844

14:34:44     INFO:root:Epoch[19] Batch [500-600] Speed: 37122.30 samples/sec accuracy=0.999844

14:34:45     INFO:root:Epoch[19] Batch [600-700] Speed: 37199.37 samples/sec accuracy=0.999687

14:34:45     INFO:root:Epoch[19] Batch [700-800] Speed: 37284.93 samples/sec accuracy=0.999219

14:34:45     INFO:root:Epoch[19] Batch [800-900] Speed: 36996.80 samples/sec accuracy=0.999844

14:34:45     INFO:root:Epoch[19] Train-accuracy=0.999733

14:34:45     INFO:root:Epoch[19] Time cost=1.617

14:34:45     INFO:root:Epoch[19] Validation-accuracy=0.983579

Summary

As you can see, after AWS Batch launched the instance, the job took slightly more than two minutes to run. I spent roughly five minutes from start to finish. That was much faster than the time that I was previously spending just to configure the AMI. Using the AWS CLI, one of the AWS SDKs, or AWS CloudFormation, the same environment could be created even faster.

From a training point of view, I lost some validation accuracy, as the results obtained using the LeNet CNN are higher than those from an MLP network. On the other hand, my job was faster, with a time cost of 1.6 seconds on average for each epoch. As the software stack evolves and increased hardware capabilities come along, these numbers keep improving, but that shouldn’t mean extra complexity. Using managed primitives like the one presented in this post enables a simpler implementation.

I encourage you to test this example and see for yourself how just a few clicks or commands lets you start running GPU jobs with AWS Batch. Then, it is just a matter of replacing the Docker image that I used for one with the framework of your choice, TensorFlow, Caffe, PyTorch, Keras, etc. Start to run your GPU-enabled machine learning, deep learning, computational fluid dynamics (CFD), seismic analysis, molecular modeling, genomics, or computational finance workloads. It’s faster and easier than ever.

If you decide to give it a try, have any doubt or just want to let me know what you think about this post, please write in the comments section!

Scheduling GPUs for deep learning tasks on Amazon ECS

Post Syndicated from Anuneet Kumar original https://aws.amazon.com/blogs/compute/scheduling-gpus-for-deep-learning-tasks-on-amazon-ecs/

This post is contributed by Brent Langston – Sr. Developer Advocate, Amazon Container Services

Last week, AWS announced enhanced Amazon Elastic Container Service (Amazon ECS) support for GPU-enabled EC2 instances. This means that now GPUs are first class resources that can be requested in your task definition, and scheduled on your cluster by ECS.

Previously, to schedule a GPU workload, you had to maintain your own custom configured AMI, with a custom configured Docker runtime. You also had to use custom vCPU logic as a stand-in for assigning your GPU workloads to GPU instances. Even when all that was in place, there was still no pinning of a GPU to a task. One task might consume more GPU resources than it should. This could cause other tasks to not have a GPU available.

Now, AWS maintains an ECS-optimized AMI that includes the correct NVIDIA drivers and Docker customizations. You can use this AMI to provision your GPU workloads. With this enhancement, GPUs can also be requested directly in the task definition. Like allocating CPU or RAM to a task, now you can explicitly request a number of GPUs to be allocated to your task. The scheduler looks for matching resources on the cluster to place those tasks. The GPUs are pinned to the task for as long as the task is running, and can’t be allocated to any other tasks.

I thought I’d see how easy it is to deploy GPU workloads to my ECS cluster. I’m working in the US-EAST-2 (Ohio) region, from my AWS Cloud9 IDE, so these commands work for Amazon Linux. Feel free to adapt to your environment as necessary.

If you’d like to run this example yourself, you can find all the code in this GitHub repo. If you run this example in your own account, be aware of the instance pricing, and clean up your resources when your experiment is complete.

Clone the repo using the following command:

git clone https://github.com/brentley/tensorflow-container.git

Setup

You need the latest version of the AWS CLI (for this post, I used 1.16.98):

echo "export PATH=$HOME/.local/bin:$HOME/bin:$PATH" >> ~/.bash_profile
source ~/.bash_profile
pip install --user -U awscli

Provision an ECS cluster, with two C5 instances, and two P3 instances:

aws cloudformation deploy --stack-name tensorflow-test --template-file cluster-cpu-gpu.yml --capabilities CAPABILITY_IAM                            

While AWS CloudFormation is provisioning resources, examine the template used to build your infrastructure. Open `cluster-cpu-gpu.yml`, and you see that you are provisioning a test VPC with two c5.2xlarge instances, and two p3.2xlarge instances. This gives you one NVIDIA Tesla V100 GPU per instance, for a total of two GPUs to run training tasks.

I adapted the TensorFlow benchmark Docker container to create a training workload. I use this container to compare the GPU scheduling and runtime.

When the CloudFormation stack is deployed, register a task definition with the ECS service:

aws ecs register-task-definition --cli-input-json file://gpu-1-taskdef.json

To request GPU resources in the task definition, the only change needed is to include a GPU resource requirement in the container definition:

            "resourceRequirements": [
                {
                    "type": "GPU",
                    "value": "1"
                }
            ],

Including this resource requirement ensures that the ECS scheduler allocates the task to an instance with a free GPU resource.

Launch a single-GPU training workload

Now you’re ready to launch the first GPU workload.

export cluster=$(aws cloudformation describe-stacks --stack-name tensorflow-test \
    --query 'Stacks[0].Outputs[?OutputKey==`ClusterName`].OutputValue' --output text)
echo $cluster
aws ecs run-task --cluster $cluster --task-definition tensorflow-1-gpu

When you launch the task, the output shows the `gpuIds` values that are assigned to the task. This GPU is pinned to this task, and can’t be shared with any other tasks. If all GPUs are allocated, you can’t schedule additional GPU tasks until a running task with a GPU completes. That frees the GPU to be scheduled again.
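
You can also check which GPU was pinned after the fact; a sketch, where the task ARN comes from the run-task output:

aws ecs describe-tasks --cluster $cluster --tasks <task-arn> \
    --query 'tasks[0].containers[0].gpuIds'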

When you look at the log output in Amazon CloudWatch Logs, you see that the container discovered one GPU: `/gpu0` and the training benchmark trained at a rate of 321.16 images/sec.

With your two p3.2xlarge nodes in the cluster, you are limited to two concurrent single-GPU workloads. To scale horizontally, you could add additional p3.2xlarge nodes, which would still limit each workload to a single GPU. To scale vertically, you could bump up the instance type, which would allow you to assign multiple GPUs to a single task. Now, let’s see how fast your TensorFlow container can train when assigned multiple GPUs.

Launch a multiple-GPU training workload

To begin, replace the p3.2xlarge instances with p3.16xlarge instances. This gives your cluster two instances that each have eight GPUs, for a total of 16 GPUs that can be allocated.

aws cloudformation deploy --stack-name tensorflow-test --template-file cluster-cpu-gpu.yml --parameter-overrides GPUInstanceType=p3.16xlarge --capabilities CAPABILITY_IAM

When the CloudFormation deploy is complete, register two more task definitions to launch your benchmark container requesting more GPUs:

aws ecs register-task-definition --cli-input-json file://gpu-4-taskdef.json  
aws ecs register-task-definition --cli-input-json file://gpu-8-taskdef.json 

Next, launch two TensorFlow benchmark containers, one requesting four GPUs, and one requesting eight GPUs:

aws ecs run-task --cluster $cluster --task-definition tensorflow-4-gpu
aws ecs run-task --cluster $cluster --task-definition tensorflow-8-gpu

With each task request, GPUs are allocated: four in the first request, and eight in the second request. Again, these GPUs are pinned to the task, and not usable by any other task until these tasks are complete.

Check the log output in CloudWatch Logs:

On the “devices” lines, you can see that the container discovered and used four (or eight) GPUs. Also, the total images/sec improved to 1297.41 with four GPUs, and 1707.23 with eight GPUs.

Because you can pin single or multiple GPUs to a task, running advanced GPU based training tasks on Amazon ECS is easier than ever!

Cleanup

To clean up your running resources, delete the CloudFormation stack:

aws cloudformation delete-stack --stack-name tensorflow-test

Conclusion

For more information, see Working with GPUs on Amazon ECS.

If you want to keep up on the latest container info from AWS, please follow me on Twitter and tweet any questions! @brentContained

Running GPU-Accelerated Kubernetes Workloads on P3 and P2 EC2 Instances with Amazon EKS

Post Syndicated from Nathan Taber original https://aws.amazon.com/blogs/compute/running-gpu-accelerated-kubernetes-workloads-on-p3-and-p2-ec2-instances-with-amazon-eks/

This post contributed by Scott Malkie, AWS Solutions Architect

Amazon EC2 P3 and P2 instances, featuring NVIDIA GPUs, power some of the most computationally advanced workloads today, including machine learning (ML), high performance computing (HPC), financial analytics, and video transcoding. Now Amazon Elastic Container Service for Kubernetes (Amazon EKS) supports P3 and P2 instances, making it easy to deploy, manage, and scale GPU-based containerized applications.

This blog post walks through how to start up GPU-powered worker nodes and connect them to an existing Amazon EKS cluster. Then it demonstrates an example application to show how containers can take advantage of all that GPU power!

Prerequisites

You need an existing Amazon EKS cluster, kubectl, and the aws-iam-authenticator set up according to Getting Started with Amazon EKS.

Two steps are required to enable GPU workloads. First, join Amazon EC2 P3 or P2 GPU compute instances as worker nodes to the Kubernetes cluster. Second, configure pods to enable container-level access to the node’s GPUs.

Spinning up Amazon EC2 GPU instances and joining them to an existing Amazon EKS Cluster

To start the worker nodes, use the standard AWS CloudFormation template for Amazon EKS worker nodes, specifying the AMI ID of the new Amazon EKS-optimized AMI for GPU workloads. This AMI is available on AWS Marketplace.

Subscribe to the AMI and then launch it using the AWS CloudFormation template. The template takes care of networking, configuring kubelets, and placing your worker nodes into an Auto Scaling group, as shown in the following image.

This template creates an Auto Scaling group with up to two p3.8xlarge Amazon EC2 GPU instances. Powered by up to eight NVIDIA Tesla V100 GPUs, these instances deliver up to 1 petaflop of mixed-precision performance per instance to significantly accelerate ML and HPC applications. Amazon EC2 P3 instances have been proven to reduce ML training times from days to hours and to reduce time-to-results for HPC.

After the AWS CloudFormation template completes, the Outputs view contains the NodeInstanceRole parameter, as shown in the following image.

NodeInstanceRole needs to be passed in to the AWS Authenticator ConfigMap, as documented in the AWS EKS Getting Started Guide. To do so, edit the ConfigMap template and run the command kubectl apply -f aws-auth-cm.yaml in your terminal to apply the ConfigMap. You can then run kubectl get nodes --watch to watch the two Amazon EC2 GPU instances join the cluster, as shown in the following image.

Configuring Kubernetes pods to access GPU resources

First, use the following command to apply the NVIDIA Kubernetes device plugin as a daemon set on the cluster.

kubectl create -f https://raw.githubusercontent.com/NVIDIA/k8s-device-plugin/v1.10/nvidia-device-plugin.yml

This command produces the following output:

Once the daemon set is running on the GPU-powered worker nodes, use the following command to verify that each node has allocatable GPUs.

kubectl get nodes \
"-o=custom-columns=NAME:.metadata.name,GPU:.status.allocatable.nvidia\.com/gpu"

The following output shows that each node has four GPUs available:

Next, modify any Kubernetes pod manifests, such as the following one, to take advantage of these GPUs. In general, adding a resources configuration (resources: limits: nvidia.com/gpu) to a pod manifest gives its containers access to the requested number of GPUs. A pod can have access to all of the GPUs available on the node that it’s running on.

apiVersion: v1
kind: Pod
metadata:
  name: pod-name
spec:
  containers:
  - name: container-name
    ...
    resources:
      limits:
        nvidia.com/gpu: 4
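
To use a manifest like this, save the completed version and apply it, then confirm that the GPU request shows up on the scheduled pod (the file and pod names here are placeholders):

kubectl apply -f gpu-pod.yaml
kubectl describe pod pod-name | grep 'nvidia.com/gpu'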

As a more specific example, the following sample manifest displays the results of the nvidia-smi binary, which shows diagnostic information about all GPUs visible to the container.

apiVersion: v1
kind: Pod
metadata:
  name: nvidia-smi
spec:
  restartPolicy: OnFailure
  containers:
  - name: nvidia-smi
    image: nvidia/cuda:latest
    args:
    - "nvidia-smi"
    resources:
      limits:
        nvidia.com/gpu: 4

Download this manifest as nvidia-smi-pod.yaml and launch it with kubectl apply -f nvidia-smi-pod.yaml.

To confirm successful nvidia-smi execution, use the following command to examine the log.

kubectl logs nvidia-smi

The above commands produce the following output:

Existing limitations

  • GPUs cannot be overprovisioned – containers and pods cannot share GPUs
  • The maximum number of GPUs that you can schedule to a pod is capped by the number of GPUs available to that pod’s node
  • Depending on your account, you might have Amazon EC2 service limits on how many and which type of Amazon EC2 GPU compute instances you can launch simultaneously

For more information about GPU support in Kubernetes, see the Kubernetes documentation. For more information about using Amazon EKS, see the Amazon EKS documentation. Guidance setting up and running Amazon EKS can be found in the AWS Workshop for Kubernetes on GitHub.

Please leave any comments about this post and share what you’re working on. I can’t wait to see what you build with GPU-powered workloads on Amazon EKS!

Deploy an 8K HEVC pipeline using Amazon EC2 P3 instances with AWS Batch

Post Syndicated from Geoff Murase original https://aws.amazon.com/blogs/compute/deploy-an-8k-hevc-pipeline-using-amazon-ec2-p3-instances-with-aws-batch/

Contributed by Amr Ragab, HPC Application Consultant, AWS Professional Services

AWS provides several managed services for file- and streaming-based media encoding options.

Currently, these services offer up to 4K encoding. Recent developments and the growing popularity of 8K content has now increased the need to distribute higher resolution content.

In this solution, you use an Amazon EC2 P3 instance to create a file-based encoding pipeline utilizing AWS Batch by first uploading a sample 8K (7680×4320) file to Amazon S3.

AWS Batch

AWS Batch enables developers, scientists, and engineers to easily and efficiently run hundreds of thousands of batch computing jobs on AWS. AWS Batch dynamically provisions the optimal quantity and type of compute resources (e.g., CPU or memory optimized instances) based on the volume and specific resource requirements of the batch jobs submitted. With AWS Batch, there is no need to install and manage batch computing software or server clusters that you use to run your jobs, allowing you to focus on analyzing results and solving problems. AWS Batch plans, schedules, and executes your batch computing workloads across the full range of AWS compute services and features, such as Amazon EC2 and Spot Instances.

P3 instances for video transcoding workloads

The P3 instance comes equipped with the NVIDIA Tesla V100 GPU. The V100 is a 16 GB, 5,120-CUDA-core GPU based on the latest Volta architecture, well suited to video coding workloads. The largest instance size in that family, p3.16xlarge, has 64 vCPUs, 488 GB of RAM, 8 NVIDIA Tesla V100 GPUs, and 25 Gbps of networking bandwidth.

Besides being a mainstay for computational workloads, the V100 offers enhanced hardware-based encoding/decoding (NVENC/NVDEC). The following tables summarize the NVENC/NVDEC options available compared with other GPUs offered on AWS.

NVENC Support Matrix

  • G2 (Kepler, GRID K520): H.264 (AVCHD) YUV 4:2:0
  • P2 (Kepler 2nd Gen, Tesla K80): H.264 (AVCHD) YUV 4:2:0
  • G3 (Maxwell 2nd Gen, Tesla M60): H.264 (AVCHD) YUV 4:2:0, YUV 4:4:4, and Lossless; H.265 (HEVC) 4K YUV 4:2:0
  • P3 (Volta, Tesla V100): H.264 (AVCHD) YUV 4:2:0, YUV 4:4:4, and Lossless; H.265 (HEVC) 4K YUV 4:2:0, 4K YUV 4:4:4, 4K Lossless, and 8K

NVDEC Support Matrix

  • G2 (Kepler, GRID K520): MPEG-2, VC-1, H.264 (AVCHD)
  • P2 (Kepler 2nd Gen, Tesla K80): MPEG-2, VC-1, H.264 (AVCHD)
  • G3 (Maxwell 2nd Gen, Tesla M60): MPEG-2, VC-1, H.264 (AVCHD), VP8
  • P3 (Volta, Tesla V100): MPEG-2, VC-1, H.264 (AVCHD), H.265 (HEVC), VP8, VP9

Cinematic 8K encoding is supported using the Tesla V100 (P3 instance family) either in landscape or portrait orientations using the HEVC codec. 

GPU         H264   H264_444   H264_ME   H264_WxH    HEVC   HEVC_Main10   HEVC_Lossless   HEVC_SAO   HEVC_444   HEVC_ME   HEVC_WxH
Tesla M60   +      +          +         4096x4096   +      -             -               -          -          -         4096x4096
Tesla V100  +      +          +         4096x4096   +      +             +               +          +          +         8192x8192

Prerequisites

To follow along with these procedures, ensure that you have the following:

  • An AWS account with permissions to create IAM roles and policies, as well as read and write access to S3
  • Registration with the NVIDIA Developer Network
  • Familiarity with Docker

Deployment

For deployment, you containerize the encoding pipeline. After building the underlying P3 container instance, you then use nvidia-docker2 to build the video-encoding Docker image, which is registered with Amazon Elastic Container Registry (Amazon ECR).

As shown in the following diagram, the pipeline reads an input raw YUV file from S3, then pulls the containerized encoding application to execute at scale on the P3 container instance. The encoded video file is then transferred to S3.

The nvidia-docker2 image video encoding stack contains the following components:

  • NVIDIA CUDA 9.2
  • FFMPEG 4.0
  • NVIDIA Video Codec SDK 8.1

This is a relatively lengthy procedure. However, after it’s built, the underlying instance and Docker image are reusable and can be quickly deployed as part of a high performance computing (HPC) pipeline.

Creating the ECS container instance

The underlying instance can be built by selecting the Amazon Linux AMI with the p3.2xlarge instance type in a public subnet. Additionally, add an EBS volume (150 GB), which is used for the 8K input, raw YUV, and output files; scale the storage amount for larger input files, and persist the mount in /etc/fstab. Connect to the instance over SSH and install any OS updates, the EPEL release and supporting packages, and the base docker-ce.

sudo yum update -y
sudo yum install yum-utils \
                 device-mapper-persistent-data \
                 lvm2

sudo yum-config-manager --add-repo https://download.docker.com/linux/centos/docker-ce.repo
wget https://dl.fedoraproject.org/pub/epel/epel-release-latest-7.noarch.rpm
sudo yum install epel-release-latest-7.noarch.rpm
sudo yum update
sudo yum install docker-ce -y
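
If you added the extra 150 GB EBS volume for the media files, format and mount it now. A sketch, assuming the volume appears as /dev/xvdb (the device name may differ on your instance):

sudo mkfs -t ext4 /dev/xvdb
sudo mkdir -p /mnt
sudo mount /dev/xvdb /mnt
echo '/dev/xvdb /mnt ext4 defaults,nofail 0 2' | sudo tee -a /etc/fstab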

The NVIDIA/CUDA stack can be installed using the cuda-repo-rhel7.rpm file. The CUDA framework installs the NVIDIA driver dependencies.
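
Register the repository first; a sketch, where the exact .rpm filename depends on the CUDA 9.2 package you download from NVIDIA:

sudo rpm -i cuda-repo-rhel7-<version>.x86_64.rpm
sudo yum clean expire-cache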

sudo yum install cuda -y

Next, install nvidia-docker2 as provided in the NVIDIA GitHub repo.

distribution=$(. /etc/os-release;echo $ID$VERSION_ID)
curl -s -L https://nvidia.github.io/nvidia-docker/$distribution/nvidia-docker.repo | \
  sudo tee /etc/yum.repos.d/nvidia-docker.repo
sudo yum install -y nvidia-docker2
sudo pkill -SIGHUP dockerd

sudo tee /etc/docker/daemon.json <<EOF
{
    "runtimes": {
        "nvidia": {
            "path": "/usr/bin/nvidia-container-runtime",
            "runtimeArgs": []
        }
    },
    "default-runtime": "nvidia"
}
EOF

sudo systemctl restart docker

With the base components in place, make this instance compatible with the ECS service:

sudo yum install ecs-init -y

Create the /etc/ecs/ecs.config file with the following template:

sudo tee /etc/ecs/ecs.config << EOF
ECS_DATADIR=/data
ECS_ENABLE_TASK_IAM_ROLE=true
ECS_ENABLE_TASK_IAM_ROLE_NETWORK_HOST=true
ECS_LOGFILE=/log/ecs-agent.log
ECS_AVAILABLE_LOGGING_DRIVERS=["json-file","awslogs"]
ECS_LOGLEVEL=info
ECS_CLUSTER=default
EOF

Iptables and packet forwarding rules need to be created to pass IAM roles into task operations:

sudo sh -c "echo 'net.ipv4.conf.all.route_localnet = 1' >> /etc/sysctl.conf"
sudo sysctl -p /etc/sysctl.conf
sudo iptables -t nat -A PREROUTING -p tcp -d 169.254.170.2 --dport 80 -j DNAT --to-destination 127.0.0.1:51679
sudo iptables -t nat -A OUTPUT -d 169.254.170.2 -p tcp -m tcp --dport 80 -j REDIRECT --to-ports 51679
sudo sh -c 'iptables-save > /etc/sysconfig/iptables'

Finally, a systemd unit file needs to be created:

sudo tee /etc/systemd/system/docker-container@ecs-agent.service << EOF
[Unit]
Description=Docker Container %I
Requires=docker.service
After=docker.service

[Service]
Restart=always
ExecStartPre=-/usr/bin/docker rm -f %i
ExecStart=/usr/bin/docker run --name %i \
--privileged \
--restart=on-failure:10 \
--volume=/var/run:/var/run \
--volume=/var/log/ecs/:/log:Z \
--volume=/var/lib/ecs/data:/data:Z \
--volume=/etc/ecs:/etc/ecs \
--net=host \
--env-file=/etc/ecs/ecs.config \
amazon/amazon-ecs-agent:latest
ExecStop=/usr/bin/docker stop %i

[Install]
WantedBy=default.target
EOF

sudo systemctl enable docker-container@ecs-agent.service
sudo systemctl start docker-container@ecs-agent.service
sudo systemctl status docker-container@ecs-agent.service

Ensure that the docker-container@ecs-agent service starts successfully.

Creating the NVIDIA-Docker image

With Docker installed, pull the latest nvidia/cuda:latest image from DockerHub.

docker pull nvidia/cuda:latest

It is best at this point to run the Docker container in interactive mode; a Docker build file can be created afterwards. At the time of publication, only CUDA 9.0 is installed in the base image, but NVIDIA has already provided the necessary repositories. Install CUDA 9.2 and supporting packages inside the Docker container; the (docker) label below marks commands run inside the container:

docker run -it --runtime=nvidia --rm nvidia/cuda
(docker) apt update
(docker) apt install pkg-config build-essential wget curl nasm unzip \
                     git libglew-dev cuda-toolkit-9-2 python3-pip -y
(docker) pip3 install awscli

Next, download the FFMPEG 4.0, nv-codec-headers, and the Video Codec SDK 8.1 from the NVIDIA Developer platform.

First, extract the nv-codec-headers archive, change into its directory, and build and install the headers:

(docker) make
(docker) make install

Extract the ffmpeg-4.0 directory and compile and install FFmpeg:

(docker) ./configure --enable-cuda --enable-cuvid --enable-nvenc --enable-nonfree --enable-libnpp --extra-cflags=-I/usr/local/cuda/include --extra-ldflags=-L/usr/local/cuda/lib64
(docker) make -j 4
(docker) make install

Download and extract the NVIDIA Video Codec SDK 8.1. The “Samples” directory has a preconfigured Makefile that compiles the binaries in the SDK. After it’s successful, confirm that the binaries are correctly set up.

(docker): ~/Video_Codec_SDK_8.1.24/Samples/AppEncode/AppEncCuda$ ./AppEncCuda -h
Options:
-i Input file path
-o Output file path
-s Input resolution in this form: WxH
-if Input format: iyuv nv12 yuv444 p010 yuv444p16 bgra bgra10 ayuv abgr abgr10
-gpu Ordinal of GPU to use
-codec Codec: h264 hevc
-preset Preset: default hp hq bd ll ll_hp ll_hq lossless lossless_hp
-profile H264: baseline main high high444; HEVC: main main10 frext
-444 (Only for RGB input) YUV444 encode
-rc Rate control mode: constqp vbr cbr cbr_ll_hq cbr_hq vbr_hq
-fps Frame rate
-gop Length of GOP (Group of Pictures)
-bf Number of consecutive B-frames
-bitrate Average bit rate, can be in unit of 1, K, M
-maxbitrate Max bit rate, can be in unit of 1, K, M
-vbvbufsize VBV buffer size in bits, can be in unit of 1, K, M
-vbvinit VBV initial delay in bits, can be in unit of 1, K, M
-aq Enable spatial AQ and set its stength (range 1-15, 0-auto)
-temporalaq (No value) Enable temporal AQ
-lookahead Maximum depth of lookahead (range 0-32)
-cq Target constant quality level for VBR mode (range 1-51, 0-auto)
-qmin Min QP value
-qmax Max QP value
-initqp Initial QP value
-constqp QP value for constqp rate control mode
Note: QP value can be in the form of qp_of_P_B_I or qp_P,qp_B,qp_I (no space)

Encoder Capability
# GPU H264 H264_444 H264_ME H264_WxH HEVC HEVC_Main10 HEVC_Lossless HEVC_SAO HEVC_444 HEVC_ME HEVC_WxH
0 Tesla V100-SXM2-16GB + + + 4096x4096 + + + + + + 8192x8192

Create a small script to be used for the 8K-encoding test inside the Docker container. Save the file as /root/nvenc-processor.sh. In the basic form, this script encodes using a single thread. For comparison, the same file is encoded using four threads.

(docker)
#!/bin/bash -xe
time aws s3 cp $S3_INPUT /mnt/8k.webm

time /usr/local/bin/ffmpeg -y -hwaccel cuda -i /mnt/8k.webm -c:v rawvideo -pix_fmt yuv420p /mnt/8k.yuv
time /root/Video_Codec_SDK/Samples/AppEncode/AppEncCuda/AppEncCuda -i /mnt/8k.yuv -o /mnt/8k.hevc -s 7680x4320 -codec hevc
time /root/Video_Codec_SDK/Samples/AppEncode/AppEncPerf/AppEncPerf -i /mnt/8k.yuv -s 7680x4320 -thread 4 -codec hevc

time aws s3 cp /mnt/8k.hevc $S3_OUTPUT

This script downloads a file from S3 and processes it through FFmpeg. Using the AppEncCuda and AppEncPerf methods, create the 8K-encoded file to be uploaded back to S3. Commit your Docker container into a new Docker image:

docker commit -m "creating hvec-processor image" <containerid> nvidia-hvec:latest

Ensure that a Docker repository has been created in Amazon ECR. Choose Repositories, Create repository. After you open the repository, choose View Push Commands. Push the newly created image to your ECR repo.
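
The push sequence follows the usual ECR pattern; a sketch with Region and account ID as placeholders (the console’s View Push Commands shows the exact commands for your repo):

$(aws ecr get-login --no-include-email --region us-east-1)
docker tag nvidia-hvec:latest <accountnumber>.dkr.ecr.us-east-1.amazonaws.com/nvidia/nvidia-hvec:latest
docker push <accountnumber>.dkr.ecr.us-east-1.amazonaws.com/nvidia/nvidia-hvec:latest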

After confirming that your image is in your ECR repo, delete all images locally in the instance:

docker rmi -f $(docker images -a -q)

Before stopping the instance, remove the ECS agent checkpoint file:

sudo rm -rf /var/lib/ecs/data/ecs_agent_data.json

Create an AMI from the instance, maintaining the attached EBS volume. Note the AMI ID.

Creating IAM role permissions

To ensure that access to ECS is controlled and to allow AWS Batch to be called, create two IAM roles:

  • BatchServiceRole allows AWS Batch to call services on your behalf.
  • ecsInstanceRole is specific to this workflow and adds permissions for S3FullAccess. This allows the container to read from and write to your S3 bucket. The following screenshot shows the example policy stack.

In AWS Batch, create a managed compute environment. Assign a cluster name and minimum and maximum vCPU values, and use the AMI ID and IAM roles created earlier. Use the Spot pricing model, starting at 60% of the On-Demand price; look at the current Spot price to see if more aggressive discounts are possible.
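
A sketch of the equivalent CLI call, with the ARNs, AMI ID, subnet, and security group left as placeholders:

aws batch create-compute-environment \
    --compute-environment-name nvenc-spot \
    --type MANAGED \
    --service-role arn:aws:iam::<accountnumber>:role/BatchServiceRole \
    --compute-resources 'type=SPOT,bidPercentage=60,minvCpus=0,desiredvCpus=0,maxvCpus=64,instanceTypes=p3.2xlarge,imageId=<ami-id>,subnets=<subnet-id>,securityGroupIds=<sg-id>,instanceRole=ecsInstanceRole,spotIamFleetRole=arn:aws:iam::<accountnumber>:role/AmazonEC2SpotFleetRole'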

Note the cluster name. In Amazon ECS, you should see the cluster created. Next, create a job queue and associate this job queue with the compute environment created earlier. Note the job queue name.

Next, create a job definition file. This provides the job parameters to be used including mounting paths, CPU, and memory requirements.

{
    "containerProperties": {
        "mountPoints": [
            {
                "sourceVolume": "codec-data",
                "readOnly": false,
                "containerPath": "/mnt"
            }],
        "image": "<accountnumber>.dkr.ecr.us-east-1.amazonaws.com/nvidia/nvidia-hvec:latest",
        "command": ["/root/nvenc-processor.sh"],
        "volumes": [
            {
                "host": {"sourcePath": "/mnt"},
                "name": "codec-data"
            }],
        "memory": 32768,
        "vcpus": 8,
        "privileged": true,
        "environment": [
            {
                "name": "S3_INPUT",
                "value": "s3://<bucket>/<key_name>"
            },
            {
                "name": "S3_OUTPUT",
                "value": "s3://<bucket>"
            }
        ],
        "ulimits": []
    },
    "type": "container",
    "jobDefinitionName": "nvenc-test"
}

Save the file as nvenc-test.json and register the job in AWS Batch.

aws batch register-job-definition --cli-input-json file://nvenc-test.json

In the AWS Batch console, create a job, assigning a job name, the job definition, and the job queue created earlier (priority 1, associated with your compute environment). Add the following environment variables for the S3 input and output, and ensure that these buckets and the input file exist.

S3_INPUT = s3://<bucket>/<key_name> 
S3_OUTPUT = s3://<bucket> 

Execute the job. In a few moments, the job should be in the Running state. Check the CloudWatch logs for an updated status of the job progression. Open the job record information and scroll down to CloudWatch metrics. The events are logged in a new AWS Batch log stream.
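
For reference, the same submission from the CLI looks roughly like this (the job queue name and bucket values are placeholders):

aws batch submit-job \
    --job-name nvenc-8k-test \
    --job-queue <job-queue-name> \
    --job-definition nvenc-test \
    --container-overrides 'environment=[{name=S3_INPUT,value=s3://<bucket>/<key_name>},{name=S3_OUTPUT,value=s3://<bucket>}]'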

A 1-minute 8K YUV 4:2:0 file took approximately 10 minutes single-threaded (top panel), and 58 seconds using four threads (bottom panel). The nvenc-processor.sh script serves as a basic implementation of 8K encoding. Explore the options provided by the NVIDIA Video Codec SDK for additional encoding/decoding and transcoding options.

Conclusion

With AWS Batch, a customized container instance, and a dockerized NVIDIA video encoding platform, AWS can provide your HD, 4K, and now 8K media distribution. I invite you to incorporate this into your automated pipeline.

With some minor modification, it’s possible to trigger this pipeline after a new file is uploaded into S3. Then, execute through AWS Lambda or as part of an AWS Step Functions workflow.


Building a GPU workstation for visual effects with AWS

Post Syndicated from Geoff Murase original https://aws.amazon.com/blogs/compute/building-a-gpu-workstation-for-visual-effects-with-aws/

Contributed by Mike Owen, Solutions Architect, AWS Thinkbox

The elasticity, scalability, and cost effectiveness of the cloud are attractive to media customers. One of the key design patterns in media and entertainment (M&E) workloads is using the cloud as a content lake and bringing the underlying processes closer to the data without having to synchronize it. In this high-end graphics visualization business, a pixel-perfect, color-accurate, fully interactive native desktop experience is required for both Windows and Linux platforms. Visual effects (VFX) artists also require input peripherals such as latest-generation Wacom 8K pressure-sensitive tablets and Wacom Cintiq monitors to work as seamlessly as they do on-premises.

AWS offers Amazon EC2 G3 instances backed by NVIDIA Tesla M60 GPUs with powerful graphics capabilities: OpenGL 4.6, DirectX 12, CUDA 9.2, GRID 6.1. You can combine these instances with the Teradici streaming protocol via their Cloud Access Software (CAS) agent to enable a high-end desktop experience on either Windows or Linux with an on-demand pricing model to fit your business needs. Teradici PCoIP is a popular protocol in the M&E industry, where Teradici have delivered a custom silicon accelerated zero-client hardware device to deliver secure pixel streaming to on-premises monitors. AWS also enables customers to create managed virtual desktop environments with Amazon WorkSpaces Graphics bundles (Windows and Linux) or Amazon AppStream 2.0 (Windows). Both solutions offer a managed environment with GPU-backed instances. This blog describes how you can set up an unmanaged VFX desktop using Amazon EC2 G3 instances combined with high-performance storage and scalable compute options such as Amazon EC2 Spot Instances.

Configuration

The following diagram describes a typical Windows and Linux configuration. In this setup, you use a Teradici PCoIP Zero Client over a dedicated network connection from your on-premises location via your chosen network provider to their nearest AWS Region containing an Amazon EC2 G3 instance. AWS Direct Connect provides a low-latency, high-bandwidth dedicated connection that doesn’t traverse the public internet. With the Windows instance, you might use a creative pen display such as a Wacom Cintiq monitor or, on a Linux instance, the latest generation of Wacom 8K pressure-sensitive tablets. You can connect both types of environments to dual 2K monitors and be ready for film VFX work.

Once built, the g3.4xl instance runs your custom Amazon Machine Image (AMI) with encrypted volume(s) in Amazon Elastic Block Store (Amazon EBS) containing all your software, pulling floating licenses from your on-premises license servers where necessary. For Linux, you have the option of centrally installing your software via a fast NVMe SSD–based i3 instance type and building a minimal-sized boot AMI. In both cases, you can add encrypted Amazon EBS SSD volumes for increased local storage. The Teradici CAS agent runs on each individual G3 instance and can be provisioned, brokered, and managed by the optional Teradici Cloud Access Manager (CAM) solution. Finally, Amazon WorkSpaces Graphics bundles are compatible with a Teradici zero client, providing easy access to a fully managed Windows desktop. This might be useful for Linux-based studios that require ad hoc Windows usage, such as Adobe Creative Cloud.

In this configuration, a Teradici zero client interacts with the provisioned desktop (served on a G3 instance) in the cloud. The Teradici CAS agent captures the frame buffer and sends it in real time to the zero client over the network via UDP using the PCoIP protocol. A smooth, reliable experience depends on a low-latency and high-bandwidth connection to the Amazon EC2 instance hosting the desktop. Bandwidth requirements depend on the number of monitors used, resolution, frame rate, and lossless quality of the desktop experience. For Wacom tablet support, Teradici CAS 2.12 requires the latency level to be less than 25 ms. You can use ping.psa.fun or cloudping.info to check the latency time of public pings between your location and your closest AWS Region. Ideally, you will provision an AWS Direct Connect connection for private (doesn’t traverse the public internet) and fast (low-latency) connectivity to the AWS Region from your location. You can also use a public internet connection for initial testing. In both cases, you can route traffic over a VPN for added security.

Shortcut

Instead of doing a manual build, you can visit AWS Marketplace and subscribe to a Teradici-provided pre-built AMI. It already has the NVIDIA GRID driver and Teradici CAS software installed, configured, and licensed as part of the overall usage cost. Search AWS Marketplace for the Teradici Cloud Access Software offerings.

Prerequisites

Make sure that everything in the following list is in place before deploying to either platform:

  • Create an AWS account.
  • Ensure that your AWS account has an EC2 key-pair associated with it by going to the AWS Management Console and checking Key Pairs under Network and Security in the applicable AWS Region.
  • Set up an AWS access key (<ACCESS KEY>) and secret access key (<SECRET ACCESS KEY>) so that you can download the NVIDIA GRID driver from an Amazon S3 bucket. The deployment instructions explain how to install and set up the AWS Command Line Interface (AWS CLI).
  • Minimum version: CentOS 7.2 or Windows 2016.
  • Recommended Teradici PCoIP Zero Client firmware version: 6.0. Contact Teradici to download.
  • Contact Teradici who will provide a 60-day trial license: <TERADICI LICENSE CODE> for Cloud Access Software. You should receive your license within 1 business day. If you don’t receive your license, please contact [email protected].
  • You must have superuser (root) or Administrator privileges to the AMI.
  • The Amazon EC2 security group provides a stateful firewall on each instance via a set of rules. The following inbound ports must be open on the Amazon EC2 instance to a specific client source IP address (restrictive access); a CLI sketch for creating these rules follows the table.
Type              Protocol  Port Range  Source            Description  Platform
Custom TCP Rule   TCP       443         <YOUR SOURCE IP>  HTTPS        Both
SSH               TCP       22          <YOUR SOURCE IP>  SSH          Linux only
Custom TCP Rule   TCP       4172        <YOUR SOURCE IP>  PCoIP        Both
Custom UDP Rule   UDP       4172        <YOUR SOURCE IP>  PCoIP        Both
Custom TCP Rule   TCP       60443       <YOUR SOURCE IP>  PCoIP        Both
RDP               TCP       3389        <YOUR SOURCE IP>  RDP          Windows only
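
As referenced above, a minimal sketch of adding these rules with the AWS CLI; the security group ID and source IP are placeholders.

SG_ID="<SECURITY GROUP ID>"
SRC="<YOUR SOURCE IP>/32"

aws ec2 authorize-security-group-ingress --group-id $SG_ID --protocol tcp --port 443   --cidr $SRC   # HTTPS
aws ec2 authorize-security-group-ingress --group-id $SG_ID --protocol tcp --port 22    --cidr $SRC   # SSH (Linux only)
aws ec2 authorize-security-group-ingress --group-id $SG_ID --protocol tcp --port 4172  --cidr $SRC   # PCoIP
aws ec2 authorize-security-group-ingress --group-id $SG_ID --protocol udp --port 4172  --cidr $SRC   # PCoIP
aws ec2 authorize-security-group-ingress --group-id $SG_ID --protocol tcp --port 60443 --cidr $SRC   # PCoIP
aws ec2 authorize-security-group-ingress --group-id $SG_ID --protocol tcp --port 3389  --cidr $SRC   # RDP (Windows only)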

Deploying the desktop on Linux

For our Linux deployment, we use the latest CentOS 7.5 AMI from AWS Marketplace and install the NVIDIA/Xorg/KDE/Wacom stack to create a fully functioning VFX Linux desktop environment. This stack contains the following components:

  • CentOS 7.5.1804_2 AMI
  • NVIDIA Grid 6.1 (390.57 May 2018) driver
  • Teradici CAS 2.12
  • Wacom 0.40 driver

Feel free to use your own CentOS 7.2+ AMI and modify the step-by-step instructions accordingly.

Setting up the desktop on Linux

To launch a g3.4xl instance in the closest AWS Region in your AWS account using the created key pair and security group, use an AMI ID from the following table (a CLI launch sketch follows the table). For reference, search for the AMI using the keywords CentOS Linux 7 x86_64 HVM EBS 1804_2.

AWS Region                AWS Region ID   AMI ID
US East (N. Virginia)     us-east-1       ami-d5bf2caa
US East (Ohio)            us-east-2       ami-77724e12
US West (N. California)   us-west-1       ami-3b89905b
US West (Oregon)          us-west-2       ami-5490ed2c
EU (Frankfurt)            eu-central-1    ami-9a183671
EU (Ireland)              eu-west-1       ami-4c457735
Asia Pacific (Tokyo)      ap-northeast-1  ami-3185744e
Asia Pacific (Singapore)  ap-southeast-1  ami-da6151a6
Asia Pacific (Sydney)     ap-southeast-2  ami-0d13c26f
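
As referenced above, a minimal sketch of launching the instance with the AWS CLI; the key pair, security group, and subnet are placeholders, and ami-5490ed2c is the US West (Oregon) entry from the table.

aws ec2 run-instances \
    --image-id ami-5490ed2c \
    --instance-type g3.4xlarge \
    --key-name <KEY PAIR NAME> \
    --security-group-ids <SECURITY GROUP ID> \
    --subnet-id <SUBNET ID> \
    --count 1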

Once the g3.4xl instance has passed its EC2 instance 2/2 status checks, we can build in true AWS style.

First, log in to the instance and set up the environment.

# ssh into the running Amazon EC2 instance (the default user on the CentOS AMI is "centos")
ssh centos@ec2-<IP-ADDRESS>.<AWS-REGION>.compute.amazonaws.com
# answer 'yes' to accept the host key

# set a password for your user
sudo passwd centos

# disable selinux
sudo sed -i 's/SELINUX=\(disabled\|enforcing\|permissive\)/SELINUX=disabled/' /etc/selinux/config

# install the EPEL repository
sudo yum install wget -y
sudo wget https://dl.fedoraproject.org/pub/epel/epel-release-latest-7.noarch.rpm
sudo rpm -i epel-release-latest-7.noarch.rpm

# run yum update to make sure all packages are up-to-date
sudo yum update -y

# install the "Server with GUI" group
sudo yum groupinstall "Server with GUI" -y

# prefer KDE desktop? (optional)
sudo yum groupinstall -y "KDE Plasma Workspaces"
sudo systemctl set-default graphical.target
echo "exec startkde" >> ~/.xinitrc
startx

# uninstall KDE (optional)
# sudo yum groupremove -y "KDE Plasma Workspaces"
# sudo yum autoremove -y
# sudo reboot

# reboot to make sure the latest installed kernel is running
sudo reboot

# install kernel-devel
sudo yum install kernel-devel -y

Next, install and register the Teradici CAS 2.12 software.

# import the Teradici signing key
sudo rpm --import https://downloads.teradici.com/rhel/teradici.pub.gpg

# grab the PCoIP repo file
sudo curl -o /etc/yum.repos.d/pcoip.repo https://downloads.teradici.com/rhel/pcoip.repo

# install PCoIP agent package
sudo yum install pcoip-agent-graphics -y

# load vhci-hcd kernel modules
sudo modprobe -a usb-vhci-hcd usb-vhci-iocifc
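
# optionally, load these modules automatically at boot (a sketch using the
# standard systemd modules-load.d mechanism; the file name is illustrative)
printf "usb-vhci-hcd\nusb-vhci-iocifc\n" | sudo tee /etc/modules-load.d/usb-vhci.conf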

# register with the licensing service
pcoip-register-host --registration-code=<TERADICI LICENSE CODE>

# set up PCoIP agent config to enable USB (one setting per line)
printf 'pcoip.grid_diff_map = 0\npcoip.enable_usb = 1\npcoip.usb_auth_table = "23XXXXXX"\npcoip.usb_unauth_table = ""\n' | sudo tee /etc/pcoip-agent/pcoip-agent.conf

# make sure you're running latest pcoip-agent version
sudo yum update pcoip-agent-graphics

Then install the NVIDIA GRID graphics driver and apply performance optimization to its configuration.

# NVIDIA GRID driver
# https://docs.nvidia.com/grid/index.html
# https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/install-nvidia-driver.html

# install nano editor
sudo yum install nano -y

# remove any old NVIDIA drivers/CUDA
sudo yum erase nvidia cuda

# disable the nouveau open source driver for NVIDIA graphics cards
sudo touch /etc/modprobe.d/blacklist.conf

# paste the following lines in one go into your shell
cat << EOF | sudo tee --append /etc/modprobe.d/blacklist.conf
blacklist vga16fb
blacklist nouveau
blacklist rivafb
blacklist nvidiafb
blacklist rivatv
EOF

# edit the /etc/default/grub file and add the line:
sudo nano /etc/default/grub
GRUB_CMDLINE_LINUX="rdblacklist=nouveau"

# rebuild grub2 config
sudo grub2-mkconfig -o /boot/grub2/grub.cfg
sudo reboot

# install pip
curl -O https://bootstrap.pypa.io/get-pip.py
python get-pip.py --user

# install AWS CLI
pip install awscli --upgrade --user

# configure AWS CLI credentials
aws configure

# AWS Access Key ID [None]: <ACCESS KEY>
# AWS Secret Access Key [None]: <SECRET ACCESS KEY>
# Default Region name [None]: <AWS REGION>
# Default output format [None]: <enter>

# 390.57 driver
aws s3 cp --recursive s3://ec2-linux-nvidia-drivers/latest/ .
chmod +x NVIDIA-Linux-x86_64-390.57-grid.run

sudo /bin/bash ./NVIDIA-Linux-x86_64-390.57-grid.run

# respond to the NVIDIA installer prompts as follows:
    # <accept> the EULA
    # <Yes> to register kernel module sources with DKMS
    # <No> to installing 32-bit libraries
    # <No> to modifying the x.org file at end of install
    # <OK> to complete the installer

# check driver installed
nvidia-smi -q | head

# g3/NVIDIA optimization settings
# https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/optimize_gpu.html
sudo nvidia-persistenced
sudo nvidia-smi --auto-boost-default=0
sudo nvidia-smi -ac 2505,1177

sudo reboot

Install CUDA if required by any of your VFX software such as Autodesk Maya or SideFX Houdini:

# install CUDA and OpenCL
# https://developer.download.nvidia.com/compute/cuda/9.2/Prod/docs/sidebar/CUDA_Installation_Guide_Linux.pdf
# https://developer.nvidia.com/cuda-downloads?target_os=Linux&target_arch=x86_64&target_distro=CentOS&target_version=7&target_type=runfilelocal

wget https://developer.nvidia.com/compute/cuda/9.2/Prod/local_installers/cuda_9.2.88_396.26_linux
mv cuda_9.2.88_396.26_linux cuda_9.2.88_396.26_linux.run

# don't install the actual graphics driver, just CUDA 9.2 toolkit, sym-link
sudo /bin/sh cuda_9.2.88_396.26_linux.run

#########################################
Do you accept the previously read EULA?
accept/decline/quit: accept

Install NVIDIA Accelerated Graphics Driver for Linux-x86_64 396.26?
(y)es/(n)o/(q)uit: n

Install the CUDA 9.2 Toolkit?
(y)es/(n)o/(q)uit: y

Enter Toolkit Location
[ default is /usr/local/cuda-9.2 ]: 

Do you want to install a symbolic link at /usr/local/cuda?
(y)es/(n)o/(q)uit: y

Install the CUDA 9.2 Samples?
(y)es/(n)o/(q)uit: n

Installing the CUDA Toolkit in /usr/local/cuda-9.2 ...
#########################################

# CUDA Patch 1 (Released May 16, 2018)
wget https://developer.nvidia.com/compute/cuda/9.2/Prod/patches/1/cuda_9.2.88.1_linux
mv cuda_9.2.88.1_linux cuda_9.2.88.1_linux.run
sudo /bin/sh cuda_9.2.88.1_linux.run

# ensure these environment variables are set, for example via a script under /etc/profile.d
export PATH=/usr/local/cuda-9.2/bin${PATH:+:${PATH}}
export LD_LIBRARY_PATH=/usr/local/cuda-9.2/lib64${LD_LIBRARY_PATH:+:${LD_LIBRARY_PATH}}
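
# one way to persist them (a sketch; the file name under /etc/profile.d is illustrative)
cat << 'EOF' | sudo tee /etc/profile.d/cuda-9.2.sh
export PATH=/usr/local/cuda-9.2/bin${PATH:+:${PATH}}
export LD_LIBRARY_PATH=/usr/local/cuda-9.2/lib64${LD_LIBRARY_PATH:+:${LD_LIBRARY_PATH}}
EOF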

Finally, install Wacom drivers.

# install Wacom driver
# https://github.com/linuxwacom/input-wacom/releases
cd ~
wget https://github.com/linuxwacom/input-wacom/releases/download/input-wacom-0.40.0/input-wacom-0.40.0.tar.bz2
tar jxf input-wacom-0.40.0.tar.bz2
cd input-wacom-0.40.0
sudo su
./configure
make && make install
modprobe wacom
dracut --force
sudo touch /etc/X11/xorg.conf.d/99-wacom-pressure2k.conf

# edit Wacom conf file as follows
sudo nano /etc/X11/xorg.conf.d/99-wacom-pressure2k.conf

Section "InputClass"
    Identifier "Wacom pressure compatibility"
    MatchDriver "wacom"
    Option "Pressure2K" "true"
EndSection

# check Elastic Network Adapter (ENA) is running on your instance
modinfo ena
ethtool -i eth0
aws ec2 describe-images --image-id <AMI-ID> --query 'Images[].EnaSupport'

# if that command returns false, proceed to enable it
# make sure that you have AWS CLI installed with AWS credentials on your local machine
sudo shutdown now
aws ec2 modify-instance-attribute --instance-id <CURRENT EC2 INSTANCE ID> --ena-support

# if you're using a pre-existing Linux AMI, you need to install the ENA driver yourself
# https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/enhanced-networking-ena.html#enhanced-networking-ena-linux

sudo reboot

Deploying the desktop on Windows

We use the latest AWS-provided Windows 2016 AMI for our deployment and install the NVIDIA/Teradici/Wacom stack to create a fully functioning VFX Windows desktop environment. This stack contains the following components:

  • Windows Server 2016 Base 2018.04.11
  • NVIDIA Grid 6.1 (391.58 May 2018) driver
  • Teradici CAS 2.12
  • Latest Wacom driver

Feel free to use your own Windows 2016 AMI and modify the step-by-step instructions accordingly.

Windows Instructions

To launch a g3.4xl instance in the closest AWS Region in your AWS account using the created key-pair and security group, use an AMI ID from the ones in the following table. For reference, the AMI name is Microsoft Windows Server 2016 Base 2018.04.11.

AWS Region                AWS Region ID   AMI ID
US East (N. Virginia)     us-east-1       ami-3633b149
US East (Ohio)            us-east-2       ami-5984b43c
US West (N. California)   us-west-1       ami-3dd1c25d
US West (Oregon)          us-west-2       ami-f3dcbc8b
EU (Frankfurt)            eu-central-1    ami-b5530b5e
EU (Ireland)              eu-west-1       ami-4cc09a35
Asia Pacific (Tokyo)      ap-northeast-1  ami-0e809272
Asia Pacific (Singapore)  ap-southeast-1  ami-00a2847c
Asia Pacific (Sydney)     ap-southeast-2  ami-7279b010

Once the g3.4xl instance has passed its Amazon EC2 instance 2/2 status checks, let’s go build:

# use AWS Management Console to right-click EC2 instance and "Get Windows Password" -> <RDP PASSWORD>

# RDP into machine
# address: ec2-<IP-ADDRESS>.<AWS-REGION>.compute.amazonaws.com
# username: Administrator
# password: <RDP PASSWORD>

# set a password in command prompt
# https://docs.aws.amazon.com/AWSEC2/latest/WindowsGuide/ec2-windows-passwords.html
net user Administrator <NEW PASSWORD>

# configure Powershell - Allow ExecutionPolicy of Powershell scripts
Set-ExecutionPolicy -ExecutionPolicy AllSigned
# answer "A" (Yes to All) at the confirmation prompt

# enable Software Secure Attention Sequence (SAS) setting
Open gpedit.msc
Expand Computer Configuration > Administrative Templates > Windows Components
Select Windows Logon Options
Double-click Disable or enable software Secure Attention Sequence
Select Enabled
Select Services from the drop down list in the bottom left pane
Click OK

# install AWS CLI
# https://docs.aws.amazon.com/cli/latest/userguide/awscli-install-windows.html
# download and install: https://s3.amazonaws.com/aws-cli/AWSCLI64.msi

# configure AWS CLI credentials in Powershell
aws configure

# AWS Access Key ID [None]: <ACCESS KEY>
# AWS Secret Access Key [None]: <SECRET ACCESS KEY>
# Default Region name [None]: <AWS REGION>
# Default output format [None]: <enter>

# download NVIDIA GRID driver from Amazon S3
# right-click Powershell, Run As Administrator, paste following into Powershell

$Bucket = "ec2-windows-nvidia-drivers"
$KeyPrefix = "latest"
$LocalPath = "C:\Users\Administrator\Desktop\NVIDIA"
$Objects = Get-S3Object -BucketName $Bucket -KeyPrefix $KeyPrefix -Region us-east-1
foreach ($Object in $Objects) {
    $LocalFileName = $Object.Key
    if ($LocalFileName -ne '' -and $Object.Size -ne 0) {
        $LocalFilePath = Join-Path $LocalPath $LocalFileName
        Copy-S3Object -BucketName $Bucket -Key $Object.Key -LocalFile $LocalFilePath -Region us-east-1
    }
}

# run NVIDIA GRID installer
C:\Users\Administrator\Desktop\NVIDIA\391.57_grid_win10_server2016_64bit_international.exe

# reboot machine via command prompt
shutdown /r

# Optimize GPU settings (follow these instructions)
# https://docs.aws.amazon.com/AWSEC2/latest/WindowsGuide/optimize_gpu.html

# via Powershell
cd "C:\Program Files\NVIDIA Corporation\NVSMI"
.\nvidia-smi --auto-boost-default=0
.\nvidia-smi -ac "2505,1177"

# go to www.teradici.com, create account, and request access from Teradici via support ticket
# download Teradici PCoIP CAS software: PCoIP Graphics Agent 2.12 for Windows or later

# install PCoIP graphics agent package via GUI based installer
enter <TERADICI LICENSE CODE> via GUI installer
reboot machine

# download and install latest Wacom drivers from Wacom website
# https://www.wacom.com/en/support/product-support/drivers

# double-check the Elastic Network Adapter (ENA) is running
# ensure you have AWS CLI installed with AWS credentials on your local machine
aws ec2 describe-instances --instance-ids <CURRENT EC2 INSTANCE ID> --query "Reservations[].Instances[].EnaSupport"

# if the check returns false, install ENA drivers
# https://docs.aws.amazon.com/AWSEC2/latest/WindowsGuide/enhanced-networking-ena.html

# if you're using a pre-existing Windows AMI, you need to install the ENA driver yourself
# https://docs.aws.amazon.com/AWSEC2/latest/WindowsGuide/enhanced-networking-ena.html

Validating the desktop

Finally, take your new Linux or Windows VFX workstation for a spin. Using a zero client:

# connect the Wacom tablet to the zero client and start a PCoIP session
# ensure the zero client is configured to connect via
# "Auto-Detect" in the local zero-client connection settings

# install any other software you need...

# don't forget to configure your floating license servers...

# finally, create a new AMI to capture your new custom VFX workstation image in your account
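
# for example (a sketch; the image name is illustrative):
aws ec2 create-image --instance-id <CURRENT EC2 INSTANCE ID> --name "vfx-workstation-v1" --description "Custom VFX workstation AMI"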

Teradici provides a software client for Windows and macOS that you can use to validate the setup of your Windows or Linux desktop. It’s also handy for system administrators who need to access a graphics workstation for artist technical support.

Testing the desktop

For testing, let’s run Autodesk 3ds Max on Windows and Autodesk Maya on Linux.

In 3ds Max, we have a 35-million-poly scene from the GPU-accelerated renderer Redshift, fully interactive and able to use the NVIDIA card to perform CUDA-based GPU final rendering.

In Maya, we show the 16 vCPUs and 120 GB of RAM available to this 3D scene file. The file takes 10 minutes to final render at HD resolution on a g3.4xl instance or, if you decide to offload the CUDA rendering to the Amazon EC2 P3.16xl instance type, just 19 seconds!

Conclusion

The Amazon EC2 G3 instance type is purpose-built to provide a high-end professional graphics infrastructure for visual computing applications. With remote protocols like Teradici PCoIP, G3 instances are the next-generation VFX cloud desktops that can deliver outstanding performance. With many studios already taking advantage of elastic cloud scaling for rendering, now is a great time to deploy cloud desktops for your business.

Deploying a 4K, GPU-backed Linux desktop instance on AWS

Post Syndicated from Roshni Pary original https://aws.amazon.com/blogs/compute/deploying-4k-gpu-backed-linux-desktop-instance-on-aws/

Contributed by Amr Ragab, HPC Application Consultant, AWS Professional Services

AWS currently supports many managed desktop delivery mechanisms. Amazon WorkSpaces and Amazon AppStream 2.0 both deliver managed Windows-based machine images with GPU-backed instances. However, many desktop services and applications are better served through a Linux-backed instance. Given the variety of Linux distributions as well as desktop managers, it can be valuable to have a generic solution for provisioning a Linux desktop on Amazon EC2.

A GPU-backed instance reduces the computational requirements on the client (local) machine, eliminating the need for a local discrete GPU to run graphical workloads. The framebuffer objects generated by the GPU are compressed when sent over the network, and decompressed by the local CPU resources. This allows clients to take advantage of the server GPU and display the high-resolution content on local thin clients, mobile devices, and low-powered desktops and laptops. Such GPU-backed Linux instances have been used for VFX rendering, computational drug discovery, and computational fluid dynamics (CFD) simulation use cases. An upcoming follow-up post details enabling this technology on the Windows platform.

Configuration

In this configuration, a client machine connects to the provisioned desktop (server) in the cloud. The server captures the framebuffer, which is sent in real time to the client machine over the network. Thus, latency is an important metric to consider when provisioning this solution. I recommend choosing the nearest AWS Region (under 100 ms). Some customers may even prefer to provision AWS Direct Connect.

Region                     Latency
US-East (Virginia)         18 ms
US East (Ohio)             31 ms
US-West (California)       77 ms
US-West (Oregon)           97 ms
Canada (Central)           29 ms
Europe (Ireland)           89 ms
Europe (London)            90 ms
Europe (Frankfurt)         108 ms
Asia Pacific (Mumbai)      197 ms
Asia Pacific (Seoul)       198 ms
Asia Pacific (Singapore)   288 ms
Asia Pacific (Sydney)      218 ms
Asia Pacific (Tokyo)       188 ms
South America (São Paulo)  138 ms
China (Beijing)            267 ms
AWS GovCloud (US)          97 ms

Source: http://www.cloudping.info/ from the Amazon offices located in Herndon, VA

Bandwidth requirements depend on the quality of the desktop experience as well as the desired resolution. Provision the backend Linux desktop instance with a 4096×2160 (4K) resolution. Depending on the specific G3 instance type selected, multi-GPU managed desktops give additional performance benefits. Each instance can also host multiple users, either in collaborative sessions, or with up to four independent 4K monitors. The GPU framebuffer memory used per session generally limits the number of sessions per managed desktop.

A smooth, reliable experience depends on a low-latency, high-bandwidth connection to the EC2 instance hosting the desktop. One benefit of using a multithreaded framebuffer reader is that only the blocks of the rendered desktop that are changing need to be sent over the network; full-screen redraws are necessary only in rare cases. The minimum requirements for this 4K (3840×2160) configuration are as follows:

  • Bandwidth: 50 Mbps
  • Latency: < 30 ms
  • Jitter: < 5 ms

Deployment

Use RHEL/CentOS for the deployment. Except for DCV, this stack is compatible with Debian/Ubuntu distributions. Use the CentOS 7.5 Server AMI and install the NVIDIA/Xorg/KDE stack to create a fully functioning desktop environment with a maximum resolution of 16384x8640 (that is, 4x4K) at 60 Hz.

This stack contains the following software:

  • CentOS 7.5 Base
  • Xorg 1.19
  • NVIDIA Grid Driver 6.1 (for the G3 instance family)
  • KDE Desktop environment
  • VirtualGL
  • TurboVNC
  • NICE DCV

To make the most efficient use of the NVIDIA Tesla M60 framebuffer memory, disable the compositing features of the desktop manager. Other non-compositing desktop managers (such as XFCE, MATE, etc.) are supported as well. This ensures that the GPU is reserved for specific OpenGL API tasks for the application, and that the performance is not impacted by the desktop environment decorations.

Start up a CentOS 7.5 server desktop based on the latest AMI available in the closest Region:

Distributor ID:    CentOS
Description:       CentOS Linux release 7.5.1804 (Core)
Release:           7.5.1804
Codename:          Core
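
If you want to locate the AMI programmatically, the following is a minimal sketch; it assumes the AWS Marketplace listing keeps the CentOS Linux 7 x86_64 HVM EBS naming convention used earlier in this post.

aws ec2 describe-images \
    --owners aws-marketplace \
    --filters "Name=name,Values=CentOS Linux 7 x86_64 HVM EBS*" \
    --query 'sort_by(Images, &CreationDate)[-1].[ImageId,Name]' \
    --output text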

Now install the Xorg stack with the KDE desktop manager:

sudo yum install epel-release
sudo yum update
sudo yum groupinstall "Development Tools"
sudo yum install xorg-* kernel-devel dkms python-pip lsb
sudo pip install awscli
sudo yum groupinstall "KDE Plasma Workspaces"
sudo systemctl disable firewalld #AWS security groups will provide our firewall rules
# if there is a kernel update
sudo reboot

Download the NVIDIA Grid driver (6.1). For more information, see Installing the NVIDIA Driver on Linux Instances.

aws s3 cp --recursive s3://ec2-linux-nvidia-drivers/ .
chmod +x latest/NVIDIA-Linux-x86_64-390.57-grid.run
sudo ./latest/NVIDIA-Linux-x86_64-390.57-grid.run
# register the driver with dkms, ignore errors associated with 32bit compatible libraries

Place the following xorg.conf file at /etc/X11/xorg.conf:

Section "ServerLayout"
        Identifier     "X.org Configured"
        Screen      0  "Screen0" 0 0
        InputDevice    "Mouse0" "CorePointer"
        InputDevice    "Keyboard0" "CoreKeyboard"
EndSection
 
Section "Files"
        ModulePath   "/usr/lib64/xorg/modules"
        FontPath     "catalogue:/etc/X11/fontpath.d"
        FontPath     "built-ins"
EndSection
 
Section "Module"
        Load  "glx"
EndSection
 
Section "InputDevice"
        Identifier  "Keyboard0"
        Driver      "kbd"
EndSection
 
Section "InputDevice"
        Identifier  "Mouse0"
        Driver      "mouse"
        Option      "Protocol" "auto"
        Option      "Device" "/dev/input/mice"
        Option      "ZAxisMapping" "4 5 6 7"
EndSection
 
Section "Monitor"
        Identifier   "Monitor0"
        VendorName   "Monitor Vendor"
        ModelName    "Monitor Model"
        Modeline "3840x2160_60.00"  712.34  3840 4152 4576 5312  2160 2161 2164 2235  -HSync +Vsync
EndSection

 
Section "Device"
        Identifier  "Card0"
        Driver      "nvidia"
        Option "ConnectToAcpid" "0"
        BusID       "PCI:0:30:0"
EndSection
 
Section "Screen"
        Identifier "Screen0"
        Device     "Card0"
        Monitor    "Monitor0"
        SubSection "Display"
                Viewport   0 0
                Depth     24
        Modes    "4096x2160" "3840x2160"
        EndSubSection
EndSection
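
The BusID in the Device section must match the PCI address where the GPU appears. On a g3.4xlarge the Tesla M60 typically shows up at PCI:0:30:0, but you can verify it with a quick check (a sketch, using the nvidia-xconfig utility installed with the driver):

nvidia-xconfig --query-gpu-info | grep -i busid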

Reboot again and check that the nvidia-gridd service is running. You may notice errors. They can be safely ignored after the nvidia-gridd service successfully acquires a license.

[root@ip-10-0-125-164 ~]# systemctl status nvidia-gridd.service
● nvidia-gridd.service - NVIDIA Grid Daemon
   Loaded: loaded (/usr/lib/systemd/system/nvidia-gridd.service; enabled; vendor preset: disabled)
   Active: active (running) since Tue 2018-05-29 18:37:35 UTC; 39s ago
  Process: 863 ExecStart=/usr/bin/nvidia-gridd (code=exited, status=0/SUCCESS)
 Main PID: 881 (nvidia-gridd)
   CGroup: /system.slice/nvidia-gridd.service
           └─881 /usr/bin/nvidia-gridd
May 29 18:37:35 ip-10-0-125-164.ec2.internal systemd[1]: Starting NVIDIA Grid Daemon...
May 29 18:37:35 ip-10-0-125-164.ec2.internal nvidia-gridd[881]: Started (881)
May 29 18:37:35 ip-10-0-125-164.ec2.internal systemd[1]: Started NVIDIA Grid Daemon.
May 29 18:37:36 ip-10-0-125-164.ec2.internal nvidia-gridd[881]: Configuration parameter ( ServerAddress  FeatureType) not set
May 29 18:37:40 ip-10-0-125-164.ec2.internal nvidia-gridd[881]: Calling load_byte_array(tra)
May 29 18:37:41 ip-10-0-125-164.ec2.internal nvidia-gridd[881]: License acquired successfully (2)

You can confirm that 4K resolution is enabled by running the following command:

DISPLAY=:0 xrandr -q
Screen 0: minimum 8 x 8, current 4096 x 2160, maximum 16384 x 8640
DVI-D-0 connected primary 4096x2160+0+0 (normal left inverted right x axis y axis) 641mm x 400mm
2560x1600 59.86+
4096x2160 60.03*
3840x2160 60.00 

Finally, check that your underlying GL renderer is using the NVIDIA driver by querying glxinfo:

DISPLAY=:0 glxinfo

OpenGL vendor string: NVIDIA Corporation
OpenGL renderer string: Quadro FX Tesla M60/PCIe/SSE2
OpenGL core profile version string: 4.5.0 NVIDIA 390.57
OpenGL core profile shading language version string: 4.50 NVIDIA
OpenGL core profile context flags: (none)
OpenGL core profile profile mask: core profile
OpenGL core profile extensions:
OpenGL version string: 4.6.0 NVIDIA 390.57
OpenGL shading language version string: 4.60 NVIDIA

At the time of publication, OpenGL 4.5 is enabled. Your applications can take advantage of that API for rendering.

To interact with the instance, install server-side desktop remote display software that can specifically take advantage of the 3D hardware acceleration. For example, AWS provides the NICE DCV platform.

DCV is an accelerated remote desktop framework that provides in-web browser desktop connections. DCV is supported in both Windows and Linux (RHEL/CentOS). In the Windows platform, OpenGL and DirectX are fully supported. DCV entitlement is free when provisioning on AWS. NICE DCV is also provided as a component to the AWS EnginFrame and myHPC solutions.

To install DCV, download the NICE DCV 2017 EL7 archive and Administrative Guide. After you extract the archive on the instance, you see a list of nice-* RPMs. You don’t have to worry about licensing, because the installer detects that the instance is running in AWS.

sudo yum localinstall nice-*
sudo systemctl enable dcvserver
sudo systemctl start dcvserver

When the DCV server starts, you have the option to create a single console session or multiple virtual sessions. Assign a password for the centos user by running the following command:

sudo passwd centos

Start the console session:

sudo dcv create-session --type=console --owner centos session1
sudo dcv list-sessions

The instance's security group must allow inbound TCP 8443 traffic so that you can reach the DCV login portal in a web browser and interact with the instance. Other remote display frameworks from the stack installed earlier, such as VirtualGL and TurboVNC, can also be used.
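
A minimal sketch for opening the DCV port with the AWS CLI; the security group ID and client IP are placeholders.

aws ec2 authorize-security-group-ingress \
    --group-id <SECURITY GROUP ID> \
    --protocol tcp --port 8443 \
    --cidr <YOUR SOURCE IP>/32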

You can also find plug and play images for managed desktops in the AWS Marketplace.

Optimization

Implement the changes outlined in the Optimizing GPU Settings (P2, P3, and G3 Instances) topic. You can turn off the autoboost feature and set the maximum graphics and memory clocks manually.

sudo nvidia-smi --auto-boost-default=0
sudo nvidia-smi -ac 2505,1177

Application testing

For testing, look at PyMOL (PyMOL Molecular Graphics System, Version 2.0, Schrödinger, LLC). PyMOL is a standard commercial drug discovery application used for processing and visualizing biochemical structures. I used the open-source fork.

With the NVIDIA GRID licensing enabled earlier, PyMOL can take advantage of the Quadro features supplied by the Tesla M60. After it’s installed and loaded, you can confirm the functionality of the entire G3 instance software stack installed earlier:

PyMOL(TM) Molecular Graphics System, Version 2.1.0.
 Copyright (c) Schrodinger, LLC.
 All Rights Reserved.
 
    Created by Warren L. DeLano, Ph.D. 
 
    PyMOL is user-supported open-source software.  Although some versions
    are freely available, PyMOL is not in the public domain.
 
    If PyMOL is helpful in your work or study, then please volunteer 
    support for our ongoing efforts to create open and affordable scientific
    software by purchasing a PyMOL Maintenance and/or Support subscription.

    More information can be found at "http://www.pymol.org".
 
    Enter "help" for a list of commands.
    Enter "help <command-name>" for information on a specific command.

 Hit ESC anytime to toggle between text and graphics.

 Detected OpenGL version 2.0 or greater. Shaders available.
 Detected GLSL version 4.60.
 OpenGL graphics engine:
  GL_VENDOR:   NVIDIA Corporation
  GL_RENDERER: Quadro FX Tesla M60/PCIe/SSE2
  GL_VERSION:  4.6.0 NVIDIA 390.57
 Adapting to Quadro hardware.
 Detected 16 CPU cores.  Enabled multithreaded rendering.

In the PyMOL window, run “fetch 5ta3”, which is a 39k amino acid protein, under the 4K desktop environment. Rotating and translating the protein should be smooth and respond quickly to pointer events.

The PyMOL Gallery contains other representative examples that take advantage of various visualization and processing workflows. Also, you can find many demos (choose Wizard, Demo).

Under the Sculpting demo, you can show the pointer latency between the client and server.

Finally, look at ray tracing. From the PyMOL wiki, take a chemical structure and render each frame with ray tracing to produce a video. On the Tesla M60 with Quadro features enabled, the total render time was approximately 1 minute.

Scalability

As I mentioned previously, the framebuffer redirection protocols have a feature set to create multiple virtual sessions per node. A virtual session is not necessarily tied to a single user either. In other words, the number of independent virtual sessions is limited by the total amount of GPU frame buffer memory used in all sessions per GPU. Thus, it’s possible to scale horizontally by increasing the number of G3 instances, or vertically by using larger instance types in the G3 family.
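
For example, a minimal sketch of adding independent virtual sessions for additional users; the user and session names are placeholders, and virtual sessions require the corresponding components from the DCV archive installed earlier.

sudo dcv create-session --type=virtual --owner artist1 session-artist1
sudo dcv create-session --type=virtual --owner artist2 session-artist2
sudo dcv list-sessions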

Summary

The G3 instance type is purpose-built to provide a managed, high-end professional graphics infrastructure for visual computing needs. With NICE DCV, you can take advantage of NVIDIA Quadro software features for a range of applications including drug discovery and VFX rendering. Connected with the AWS high-performance network backbone, the instance can become an integral part of your graphics workload pipeline. Now, you can power up and deliver your applications to teams working anywhere in the world.