All posts by Geoff Murase

Integrating an Inferencing Pipeline with NVIDIA DeepStream and the G4 Instance Family

Post Syndicated from Geoff Murase original

Contributed by: Amr Ragab, Business Development Manager, Accelerated Computing, AWS and Kong Zhao, Solution Architect, NVIDIA Corporation

AWS continually evolves GPU offerings, striving to showcase how new technical improvements created by AWS partners improve the platform’s performance.

One result from AWS’s collaboration with NVIDIA is the recent release of the G4 instance type, a technology update from the G2 and G3. The G4 features a Turing T4 GPU with 16GB of GPU memory, offered under the Nitro hypervisor with one GPU to 4 GPUS per node. A bare metal option will be released in the coming months. It also includes up to 1.8 TB of local non-volatile memory express (NVMe) storage and up to 100 Gbps of network bandwidth.

The Turing T4 is the latest offering from NVIDIA, accelerating machine learning (ML) training and inferencing, video transcoding, and other compute-intensive workloads. With such a diverse array of optimized directives, you can now perform diverse accelerated compute workloads on a single instance family.

NVIDIA has also taken the lead in providing a robust and performant software layer in the form of SDKs and container solutions through the NVIDIA GPU Cloud (NGC) container registry. These accelerated components, combined with AWS elasticity and scale, provide a powerful combination for performant pipelines on AWS.


This post focuses on one such NVIDIA SDK: DeepStream.

The DeepStream SDK is built to provide an end-to-end video processing and ML inferencing analytics solution. It uses the Video Codec API and TensorRT as key components.

DeepStream also supports an edge-cloud strategy to stream perception on the edge and other sensor metadata into AWS for further processing. An example includes wide-area consumption of multiple camera streams and metadata through the Amazon Kinesis platform.

Another classic workload that can take advantage of DeepStream is compiling the model artifacts resulting from distributed training in AWS with Amazon SageMaker Neo. Use this model on the edge or on an Amazon S3 video data lake.

If you are interested in exploring these solutions, contact your AWS account team.


Set up programmatic access to AWS to instantiate a g4dn.2xlarge instance type with Ubuntu 18.04 in a subnet that routes SSH access. If you are interested in the full stack details, the following are required to set up the instance to execute DeepStream SDK workflows.

  • An Ubuntu 18.04 Instance with:
    • NVIDIA Turing T4 Driver (418.67 or latest)
    • CUDA 10.1
    • nvidia-docker2

Alternatively, you can launch the NVIDIA Deep Learning AMI available in AWS Marketplace, which includes the latest drivers and SDKs.

aws ec2 run-instances --region us-east-1 --image-id ami-026c8acd92718196b --instance-type g4dn.2xlarge --key-name <key-name> —subnet-id <subnet> --security-group-ids {<security-groupids>} —block-device-mappings 'DeviceName=/dev/sda1,Ebs={VolumeSize=75}'

When the instance is up, connect with SSH and pull the latest DeepStream SDK Docker image from the NGC container registry.

docker pull

nvidia-docker run -it --rm -v /usr/lib/x86_64-linux-gnu/ -v /tmp/.X11-unix:/tmp/.X11-unix -p 8554:8554 -e DISPLAY=$DISPLAY 4.0-19.07

If your instance is running a full X environment, you can pass the authentication and display to the container to view the results in real time. However, for the purposes of this post, just execute the workload on the shell.

Go to the /root/deepstream_sdk_v4.0_x86_64/samples/configs/deepstream-app/ folder.

The following configuration files are included in the package:

  • source30_1080p_resnet_dec_infer_tiled_display_int8.txt: This configuration file demonstrates 30 stream decodes with primary inferencing.
  • source4_1080p_resnet_dec_infer_tiled_display_int8.txt: This configuration file demonstrates four stream decodes with primary inferencing, object tracking, and three different secondary classifiers.
  • source4_1080p_resnet_dec_infer_tracker_sgie_tiled_display_int8_gpu1.txt: This configuration file demonstrates four stream decodes with primary inferencing, object tracking, and three different secondary classifiers on GPU 1.
  • config_infer_primary.txt: This configuration file configures an nvinfer element as the primary detector.
  • config_infer_secondary_carcolor.txt, config_infer_secondary_carmake.txt, config_infer_secondary_vehicletypes.txt: These configuration files configure an nvinfer element as the secondary classifier.
  • iou_config.txt: This configuration file configures a low-level Intersection over Union (IOU) tracker.
  • source1_usb_dec_infer_resnet_int8.txt: This configuration file demonstrates one USB camera as input.

The following sample models are provided with the SDK.

ModelModel typeNumber of classesResolution
Primary DetectorResnet104640 x 368
Secondary Car Color ClassifierResnet1812224 x 224
Secondary Car Make ClassifierResnet186224 x 224
Secondary Vehicle Type ClassifierResnet1820224 x 224

Edit the configuration file source30_1080p_dec_infer-resnet_tiled_display_int8.txt to disable [sink0] and enable [sink1] for file output. Save the file, then run the DeepStream sample code.

#Type - 1=FakeSink 2=EglSink 3=File
#1=mp4 2=mkv
#1=h264 2=h265
deepstream-app -c source30_1080p_dec_infer-resnet_tiled_display_int8.txt

You get performance data on the inferencing workflow.

 (deepstream-app:1059): GLib-GObject-WARNING **: 20:38:25.991: g_object_set_is_valid_property: object class 'nvv4l2h264enc' has no property named 'bufapi-version'
Creating LL OSD context new
Runtime commands:
        h: Print this help
        q: Quit
        p: Pause
        r: Resume
NOTE: To expand a source in the 2D tiled display and view object details, left-click on the source.
      To go back to the tiled display, right-click anywhere on the window.
** INFO: <bus_callback:163>: Pipeline ready
Creating LL OSD context new
** INFO: <bus_callback:149>: Pipeline running
**PERF: FPS 0 (Avg)     FPS 1 (Avg)     FPS 2 (Avg)     FPS 3 (Avg)     FPS 4 (Avg)     FPS 5 (Avg)     FPS 6 (Avg)     FPS 7 (Avg)     FPS 8 (Avg)     FPS 9 (Avg)     FPS 10 (Avg)   FPS 11 (Avg)     FPS 12 (Avg)    FPS 13 (Avg)    FPS 14 (Avg)    FPS 15 (Avg)    FPS 16 (Avg)    FPS 17 (Avg)    FPS 18 (Avg)    FPS 19 (Avg)    FPS 20 (Avg)    FPS 21 (Avg)    FPS 22 (Avg)    FPS 23 (Avg)    FPS 24 (Avg)    FPS 25 (Avg)    FPS 26 (Avg)    FPS 27 (Avg)    FPS 28 (Avg)    FPS 29 (Avg)
**PERF: 35.02 (35.02)   37.92 (37.92)   37.93 (37.93)   35.85 (35.85)   36.39 (36.39)   38.40 (38.40)   35.85 (35.85)   35.18 (35.18)   35.60 (35.60)   35.02 (35.02)   38.77 (38.77)  37.71 (37.71)    35.18 (35.18)   38.60 (38.60)   38.40 (38.40)   38.60 (38.60)   34.77 (34.77)   37.70 (37.70)   35.97 (35.97)   37.00 (37.00)   35.51 (35.51)   38.40 (38.40)   38.60 (38.60)   38.40 (38.40)   38.13 (38.13)   37.70 (37.70)   35.85 (35.85)   35.97 (35.97)   37.92 (37.92)   37.92 (37.92)
**PERF: 39.10 (37.76)   38.90 (38.60)   38.70 (38.47)   38.70 (37.78)   38.90 (38.10)   38.90 (38.75)   39.10 (38.05)   38.70 (37.55)   39.10 (37.96)   39.10 (37.76)   39.10 (39.00)  39.10 (38.68)    39.10 (37.83)   39.10 (38.95)   39.10 (38.89)   39.10 (38.95)   38.90 (37.55)   38.70 (38.39)   38.90 (37.96)   38.50 (38.03)   39.10 (37.98)   38.90 (38.75)   38.30 (38.39)   38.70 (38.61)   38.90 (38.67)   39.10 (38.66)   39.10 (38.05)   39.10 (38.10)   39.10 (38.74)   38.90 (38.60)
**PERF: 38.91 (38.22)   38.71 (38.65)   39.31 (38.82)   39.31 (38.40)   39.11 (38.51)   38.91 (38.82)   39.31 (38.56)   39.31 (38.26)   39.11 (38.42)   38.51 (38.06)   38.51 (38.80)  39.31 (38.94)    39.31 (38.42)   39.11 (39.02)   37.71 (38.41)   39.31 (39.10)   39.31 (38.26)   39.31 (38.77)   39.31 (38.51)   39.31 (38.55)   39.11 (38.44)   39.31 (38.98)   39.11 (38.69)   39.31 (38.90)   39.11 (38.85)   39.31 (38.93)   39.31 (38.56)   39.31 (38.59)   39.31 (38.97)   39.31 (38.89)
**PERF: 37.56 (38.03)   38.15 (38.50)   38.35 (38.68)   38.35 (38.38)   37.76 (38.29)   38.15 (38.62)   38.35 (38.50)   37.56 (38.06)   38.15 (38.35)   37.76 (37.97)   37.96 (38.55)  38.35 (38.77)    38.35 (38.40)   37.56 (38.59)   38.35 (38.39)   37.96 (38.77)   36.96 (37.88)   38.35 (38.65)   38.15 (38.41)   38.35 (38.49)   38.35 (38.41)   38.35 (38.80)   37.96 (38.47)   37.96 (38.62)   37.56 (38.47)   37.56 (38.53)   38.15 (38.44)   38.35 (38.52)   38.35 (38.79)   38.35 (38.73)
**PERF: 40.71 (38.63)   40.31 (38.91)   40.51 (39.09)   40.90 (38.95)   39.91 (38.65)   40.90 (39.14)   40.90 (39.04)   40.51 (38.60)   40.71 (38.87)   40.51 (38.54)   40.71 (39.04)  40.90 (39.25)    40.71 (38.92)   40.90 (39.11)   40.90 (38.96)   40.90 (39.25)   40.90 (38.56)   40.90 (39.15)   40.11 (38.79)   40.90 (39.03)   40.90 (38.97)   40.90 (39.27)   40.90 (39.02)   40.90 (39.14)   39.51 (38.71)   40.90 (39.06)   40.51 (38.90)   40.71 (39.01)   40.90 (39.27)   40.90 (39.22)
**PERF: 39.46 (38.78)   39.26 (38.97)   39.46 (39.16)   39.26 (39.00)   39.26 (38.76)   39.26 (39.16)   39.06 (39.04)   39.46 (38.76)   39.46 (38.98)   39.26 (38.67)   39.46 (39.12)  39.46 (39.29)    39.26 (38.98)   39.26 (39.14)   39.26 (39.01)   38.65 (39.14)   38.45 (38.54)   39.46 (39.21)   39.46 (38.91)   39.46 (39.11)   39.26 (39.03)   39.26 (39.27)   39.46 (39.10)   39.26 (39.16)   39.26 (38.81)   39.26 (39.10)   39.06 (38.93)   39.46 (39.09)   39.06 (39.23)   39.26 (39.23)
**PERF: 39.04 (38.82)   38.84 (38.95)   38.84 (39.11)   38.84 (38.98)   38.84 (38.77)   38.64 (39.08)   39.04 (39.04)   39.04 (38.80)   39.04 (38.99)   39.04 (38.73)   38.64 (39.04)  39.04 (39.25)    38.44 (38.90)   39.04 (39.13)   38.84 (38.99)   38.44 (39.03)   39.04 (38.62)   39.04 (39.18)   38.84 (38.90)   38.84 (39.07)   37.84 (38.84)   39.04 (39.24)   39.04 (39.09)   39.04 (39.14)   38.64 (38.78)   38.64 (39.03)   39.04 (38.95)   38.84 (39.05)   38.64 (39.14)   38.24 (39.08)
** INFO: <bus_callback:186>: Received EOS. Exiting ...
App run successful

The output video file, out.mp4, is under the current folder and can be played after download.

Extending the architecture further, you can make use of AWS Batch to execute an event-driven pipeline.

Here, the input file from S3 triggers an Amazon CloudWatch event, standing up a G4 instance with a DeepStream Docker image, sourced in Amazon ECR, to process the pipeline. The video and ML analytics results can be pushed back to S3 for further processing.


With this basic architecture in place, you can execute a video analytics and ML inferencing pipeline. Future work can also include integration with Kinesis and cataloging DeepStream results. Let us know how it goes working with DeepStream and the NVIDIA container stack on AWS.


Scalable deep learning training using multi-node parallel jobs with AWS Batch and Amazon FSx for Lustre

Post Syndicated from Geoff Murase original

Contributed by Amr Ragab, HPC Application Consultant, AWS Professional Services

How easy is it to take an AWS reference architecture and implement a production solution? At re:Invent 2018, Toyota Research Institute presented their production DL HPC architecture. This was based on a reference architecture for a scalable, deep learning, high performance computing solution, released earlier in the year.  The architecture was designed to run ImageNet and ResNet-50 benchmarks on Apache MXNet and TensorFlow machine learning (ML) frameworks. It used cloud best practices to take advantage of the scale and elasticity that AWS offers.

With the pace of innovation at AWS, I can now show an evolution of that deep learning solution with new services.

A three-component HPC cluster is common in tightly coupled, multi-node distributed training solutions. The base layer is a high-performance file system optimized for reading the images packed as TFRecords or RecordIO as well as in its original form. The reference architecture originally referenced BeeGFS. In this post, I use the high performance Amazon FSx for Lustre file system, announced at re:Invent 2018. The second layer is the scalable compute, which originally used p3.16xl instances containing eight NVIDIA Tesla V100 per node. Finally, a job scheduler is the third layer for managing multiuser access to plan and distribute the workload across the available nodes.

In this post, I demonstrate how to create a fully managed HPC infrastructure, execute the distributed training job, and collapse it using native AWS services. In the three-component HPC design, the scheduler and compute layers are achieved by using AWS Batch as a managed service built to run thousands of batch computing jobs. AWS Batch dynamically provisions compute resources based on the specific job requirements of the distributed training job.

AWS Batch recently started supporting multi-node parallel jobs, allowing tightly coupled jobs to be executed. This compute layer can be coupled with the FSx for Lustre file system.

FSx for Lustre is a fully managed, parallel file system based on Lustre that can scale to millions of IOPS, and hundreds of gigabytes per second throughput. FSx for Lustre is seamlessly integrated with Amazon S3 to parallelize the ingestion of data from the object store.


Coupled together, this provides a core compute solution for running workloads requiring high performance layers. One additional benefit is that AWS Batch and FSx for Lustre are API-driven services and can be programmatically orchestrated.

The goal of this post is to showcase an innovative architecture, replacing self-managed roll-your-own file system and compute to platform managed services using FSx for Lustre and AWS Batch running containerized applications, hence reducing complexity and maintenance. This can also serve as a template for other HPC applications requiring similar compute/networking and storage topologies. With that in mind, benchmarks related to distributed deep learning are out of scope. As you see at the end of this post, I achieved linear scalability over a broad range (8 – 160) of GPUs spanning 1–20 p3.16xlarge nodes.


Much of the deployment was covered in a previous post, Building a tightly coupled molecular dynamics workflow with multi-node parallel jobs in AWS Batch. However, some feature updates since then have simplified the initial deployment.

In brief, you provision the following resources:

  • A FSx for Lustre file system hydrated from a S3 bucket that provides the source ImageNet 2012 images
  • A new Ubuntu 16.04 ECS instance:
    • Lustre kernel driver and FS mount
    • CUDA 10 with NVIDIA Tesla 410 driver
    • Docker 18.09-ce including nvidia-docker2
    • A multi-node parallel batch–compatible TensorFlow container with the following stack:
      • Ubuntu 18.04 container image
      • HOROVOD_VERSION=0.15.2
      • NCCL_VERSION=2.3.7-1+cuda10.0
      • OPENMPI 4.0.0

FSx for Lustre setup

First, create a file system in the FSx for Lustre console. The default minimum file system size of 3600 GiB is sufficient.

  • File system name: ImageNet2012 dataset
  • Storage capacity: 3600 (GiB)

In the console, ensure that you have specified the appropriate network access and security groups so that clients can access the FSx for Lustre file system. For this post, find the scripts to prepare the dataset in the deep-learning-models GitHub repo.

  • Data repository type: Amazon S3
  • Import path: Point to an S3 bucket holding the ImageNet 2012 dataset.

While the FSx for Lustre layer is being provisioned, spin up an instance in the Amazon EC2 console with the Ubuntu 16.04 ECS AMI using a p3.2xlarge instance type. One modification required, when preparing the ecs-agent systemd file. Replace the ExecStart= stanza with the following:

ExecStart=docker run --name ecs-agent \
  --init \
  --restart=on-failure:10 \
  --volume=/var/run:/var/run \
  --volume=/var/log/ecs/:/log \
  --volume=/var/lib/ecs/data:/data \
  --volume=/etc/ecs:/etc/ecs \
  --volume=/sbin:/sbin \
  --volume=/lib:/lib \
  --volume=/lib64:/lib64 \
  --volume=/usr/lib:/usr/lib \
  --volume=/proc:/host/proc \
  --volume=/sys/fs/cgroup:/sys/fs/cgroup \
  --volume=/var/lib/ecs/dhclient:/var/lib/dhclient \
  --net=host \
  --env ECS_LOGFILE=/log/ecs-agent.log \
  --env ECS_DATADIR=/data \
  --env ECS_UPDATES_ENABLED=false \
  --env ECS_AVAILABLE_LOGGING_DRIVERS='["json-file","syslog","awslogs"]' \
  --env ECS_UPDATES_ENABLED=true \
  --env ECS_ENABLE_TASK_ENI=true \
  --env-file=/etc/ecs/ecs.config \
  --cap-add=sys_admin \
  --cap-add=net_admin \
  -d \

During the provisioning workflow, add a 500 GB SSD (gp2) Amazon EBS volume. For ease of installation, install the Lustre kernel driver first. Also, modify the kernel for compatibility. Install the dkms package first.

sudo apt install -y dkms git

Follow the instructions for Ubuntu 16.04.

Install the CUDA 10 and NVIDIA 410 driver branch according to the instructions provided by NVIDIA. It’s important that the dkms system is installed with the kernel modules being built against the kernel installed earlier.

When complete, install the latest Docker release, as well as nvidia-docker2, according to the instructions in the nvidia-docker GitHub repo, setting the default runtime to “nvidia.”

    "runtimes": {
        "nvidia": {
            "path": "nvidia-container-runtime",
            "runtimeArgs": []
    "default-runtime": "nvidia"

At this stage, you can create this AMI and keep it for future deployments. This saves time in bootstrapping, as the generic AMI can be used for a diverse set of applications.

When the FSx for Lustre file system is complete, add the file system information into /etc/fstab:

<file_system_dns_name>@tcp:/fsx /fsx lustre defaults,_netdev 0 0

Confirm that the mounting is successful by using the following command:

sudo mkdir /fsx && sudo mount -a

Building the multi-node parallel batch TensorFlow Docker image

Now, set up the multi-node TensorFlow container image. Keep in mind that this process takes approximately two hours to build on a p3.2xlarge. Use the Dockerfile build scripts for setting up multinode parallel batch jobs.

git clone
cd aws-mnpbatch-template
docker build -t nvidia/mnp-batch-tensorflow .

As part of the Docker container’s ENTRYPOINT, use the script from the Building a tightly coupled molecular dynamics workflow with multi-node parallel jobs in AWS Batch post. Optimize it for running the TensorFlow distributed training as follows:

 export INTERFACE=eth0
 export MODEL_HOME=/root/deep-learning-models/models/resnet/tensorflow
 /opt/openmpi/bin/mpirun --allow-run-as-root -np $MPI_GPUS --machinefile ${HOST_FILE_PATH}-deduped -mca plm_rsh_no_tree_spawn 1 \
                        -bind-to socket -map-by slot \
                        $EXTRA_MPI_PARAMS -x LD_LIBRARY_PATH -x PATH -mca pml ob1 -mca btl ^openib \
                        -x NCCL_SOCKET_IFNAME=$INTERFACE -mca btl_tcp_if_include $INTERFACE \
                        -x TF_CPP_MIN_LOG_LEVEL=0 \
                        python3 -W ignore $MODEL_HOME/ \
                        --data_dir $JOB_DIR --num_epochs 90 -b $BATCH_SIZE \
                        --lr_decay_mode poly --warmup_epochs 10 --clear_log

There are some undefined environment variables in the startup command. Those are filled in when you create the multi-node batch job definition file in later stages of this post.

Upon successfully building the Docker image, commit this image to the Amazon ECR registry, to be pulled later. Consult the ECR push commands in the registry by selecting the registry and choose View Push Commands.

One additional tip:  Notice that the Docker image is approximately 12 GB, to ensure that your container instance starts up quickly. I would cache this image in the Docker cache so that incremental layer updates can be pulled from ECR instead of pulling the entire image, which takes more time.

Finally, you should be ready to create this AMI for the AWS Batch compute environment phase of the workflow. In the AWS Batch console, choose Compute environment and create an environment with the following parameters.

Compute environment

  • Compute environment type:  Managed
  • Compute environment name:  tensorflow-gpu-fsx-ce
  • Service role:  AWSBatchServiceRole
  • EC2 instance role:  ecsInstanceRole

Compute resources

Set the minimum and desired vCPUs at 0. When a job is submitted, the underlying AWS Batch service recruits the nodes, taking advantage of the elasticity and scale offered on AWS.

  • Provisioning model: On-Demand
  • Allowed instance types: p3 family, p3dn.24xlarge
  • Minimum vCPUs: 0
  • Desired vCPUs: 0
  • Maximum vCPUs: 4096
  • User-specified AMI: Use the Amazon Linux 2 AMI mentioned earlier.


AWS Batch makes it easy to specify the placement groups. If you do this, the internode communication between instances has the lowest latencies possible, which is a requirement when running tightly coupled workloads.

  • VPC Id: Choose a VPC that allows access to the FSx cluster created earlier.
  • Security groups: FSx security group, Cluster security group
  • Placement group: tf-group (Create the placement group.)

EC2 tags

  • Key: Name
  • Value: tensorflow-gpu-fsx-processor

Associate this compute environment with a queue called tf-queue. Finally, create a job definition that ties the process together and executes the container.

The following parameters in JSON format sets up the mnp-tensorflow job definition.

    "jobDefinitionName": "mnptensorflow-gpu-mnp1",
    "jobDefinitionArn": "arn:aws:batch:us-east-2:<accountid>:job-definition/mnptensorflow-gpu-mnp1:1",
    "revision": 2,
    "status": "ACTIVE",
    "type": "multinode",
    "parameters": {},
    "retryStrategy": {
        "attempts": 1
    "nodeProperties": {
        "numNodes": 20,
        "mainNode": 0,
        "nodeRangeProperties": [
                "targetNodes": "0:19",
                "container": {
                    "image": "<accountid>",
                    "vcpus": 62,
                    "memory": 424000,
                    "command": [],
                    "jobRoleArn": "arn:aws:iam::<accountid>:role/ecsTaskExecutionRole",
                    "volumes": [
                            "host": {
                                "sourcePath": "/scratch"
                            "name": "scratch"
                            "host": {
                                "sourcePath": "/fsx"
                            "name": "fsx"
                    "environment": [
                            "name": "SCRATCH_DIR",
                            "value": "/scratch"
                            "name": "JOB_DIR",
                            "value": "/fsx/resized"
                            "name": "BATCH_SIZE",
                            "value": "256"
                            "name": "EXTRA_MPI_PARAMS",
                            "name": "MPI_GPUS",
                            "value": "160"
                    "mountPoints": [
                            "containerPath": "/fsx",
                            "sourceVolume": "fsx"
                            "containerPath": "/scratch",
                            "sourceVolume": "scratch"
                    "ulimits": [],
                    "instanceType": "p3.16xlarge"


Total number of GPUs in the cluster. In this case, it’s 20 x p3.16xlarge = 160.


Number of images of per GPU to load at time for training on 16 GB of memory per GPU = 256.


Location of the TFrecords prepared earlier optimized for the number of shards = /fsx/resized.


Path to the model outputs = /scratch.

One additional tip:  You have the freedom to expose additional parameters in the job definition. This means that you can also expose model training hyperparameters, which opens the door to multi-parameter optimization (MPO) studies on the AWS Batch layer.

With the job definition created, submit a new job sourcing this job definition, executing on the tf-queue created earlier. This spawns the compute environment.

The AWS Batch service only launches the requested number of nodes. You don’t pay for the running EC2 instances until all requested nodes are launched in your compute environment.

After the job enters the RUNNING state, you can monitor the main container:0 for activity with the CloudWatch log stream created for this job. Some of key entries are as follows, with the 20 nodes joining the cluster. One additional tip: It is possible to use this infrastructure to push the model parameters and training performance to a Tensorboard for additional monitoring.

The next log screenshot shows the main TensorFlow and Horovod workflow starting up. 

Performance monitoring

On 20 p3.16xl nodes, I achieved a comparable speed of approximately 100k images/sec, with close to 90-100% GPU utilization across all 160 GPUs with the containerized Horovod TensorFlow Docker image.

When you have this implemented, try out the cluster using the recently announced p3dn.24xlarge, a 32-GB NVIDIA Tesla V100 memory variant of the p3.16xl with 100-Gbps networking. To take advantage of the full GPU memory of the p3dn in the job definition, increase the BATCH_SIZEenvironmental variable.


With the evolution of a scalable, deep learning–focused, high performance computing environment, you can now use a cloud-native approach. Focus on your code and training while AWS handles the undifferentiated heavy lifting.

As mentioned earlier, this reference architecture has an API interface, thus an event-driven workflow can further extend this work. For example, you can integrate this core compute in an AWS Step Functions workflow to stand up the FSx for Lustre layer. Submit the batch job and collapse the FSx for Lustre layer.

Or through an API Gateway, create a web application for the job submission. Integrate with on-premises resources to transfer data to the S3 bucket and hydrate the FSx for Lustre file system.

If you have any questions about this deployment or how to integrate with a longer AWS posture, please comment below. Now go power up your deep learning workloads with a fully managed, high performance compute framework!

Amazon ElastiCache performance boost with Amazon EC2 M5 and R5 instances

Post Syndicated from Geoff Murase original

Contributed by Ruchita Arora, Sr. Product Manager, Allen Farris, Software Dev Engineer, and Itay Maoz, Sr. Software Engineering Manager

Earlier this year, Amazon EC2 introduced two exciting new instance families, M5 and R5. These instances are based on the new AWS Nitro system, a combination of dedicated hardware and lightweight hypervisor that aims to deliver performance indistinguishable from bare-metal performance. These new instance families deliver up to 25 Gbps of aggregate network bandwidth, with enhanced networking based on the Elastic Network Adapter (ENA).

R5 and M5 instances feature custom hardware and custom Intel Xeon Scalable processors to enable a sustained all core frequency of up to 3.1 GHz and support Intel Advanced Vector Extension 512 (AVX-512). The latest fifth generation EC2 instances offer up to 50% more vCPUs and 60% more memory over the previous generation instances, and larger r5.24xlarge and m5.24xlarge instances.

Amazon ElastiCache

Amazon ElastiCache offers a Redis or Memcached-compatible, fully managed, in-memory data store and caching service in the cloud. The service embodies much of what makes fast data a reality for customers who are looking to process a high volume of data at incredible rates, faster than traditional databases.

As part of adding support for M5 and R5 instances in ElastiCache, we spent the time to take advantage of the AWS Nitro-based system and optimize these instances for ElastiCache for Redis. Developers love the performance, simplicity, and in-memory capabilities of Redis, making it among the most popular NoSQL key-value stores. Redis’s microsecond latency has made it a default choice for caching. The support for advanced data structures (for example, lists, sets, and sorted sets) also enables a variety of in-memory use cases such as leaderboards, in-memory analytics, messaging, and more.

Optimizing performance for ElastiCache for Redis

We started with the M5 and R5 instances and tuned performance by optimizing the Amazon Linux operating system configuration on these instances to maximize network performance for running in-memory workloads.

Using the open-source benchmarking tool rpc-perf, we ran a Redis benchmark with 14.7 million unique keys, 200-byte string values, 80% gets, 20% sets, and no command pipelining. We ran this benchmark on 20 client instances connecting to an optimized R5 instance in the same Availability Zone. We saw up to 30% more transactions per second than running ElastiCache for Redis on the same size instance with the default Linux configuration. For details, see the following table.

Vanilla R4Vanilla R5Tuned R5Vanilla R4 to Tuned R5 Improvement
large88,000 RPS179,000 RPS215,000 RPS144%
xlarge93,000 RPS180,000 RPS207,000 RPS122%
2xlarge107,000 RPS187,000 RPS217,000 RPS102%
4xlarge131,000 RPS208,000 RPS225,000 RPS71%
8xlarge/12xlarge128,000 RPS211,000 RPS247,000 RPS92%
16xlarge/24xlarge149,000 RPS181,000 RPS237,000 RPS59%

We also reduced average (p50) and tail (p99) latencies up to 23%, resulting in average latencies as low as 350 microseconds after these optimizations. The optimized M5 instances yielded 9%-42% incremental requests per second and better CPU utilization for ElastiCache for Redis workloads.

For the same caching use case scenario, ElastiCache for Redis optimized R5 instances benefited from a significant performance improvement over self-managed Redis on R4 instances. The optimized R5 instances supported 59%-144% more transactions per second than similarly sized R4 instances.

We saw similar incremental performance improvements on optimized M5 instances relative to previous generation M4 instances. The optimized M5 instances benefited from throughput improvements of up to 356% relative to previous generation M4 instances.

Among the M5 instances, the most significant improvements were in the smaller size of the M5 family. They take advantage of ENA performance with burst networking up to 10 Gbps for the m5.large through m5.4xlarge sizes, which is useful for handling infrequent traffic spikes.


We are excited to bring these new instances to customers. You benefit from less hypervisor overhead and better networking, but you also see a dramatic upside from the performance tuning work that the ElastiCache team did to take advantage of the AWS Nitro system. This is just the beginning.

Our performance team is continuing to enhance the full system for optimal ElastiCache for Redis performance, which we are rolling out in the coming months. To get started with ElastiCache on the new M5 and R5 EC2 instances, see the AWS Management Console.

Deploying a Burstable and Event-driven HPC Cluster on AWS Using SLURM, Part 2

Post Syndicated from Geoff Murase original

Contributed by Amr Ragab, HPC Application Consultant, AWS Professional Services

In part 1 of this series, you deployed the base components to create the HPC cluster. This unique deployment stands up the SLURM headnode. For every job submitted to the queue, the headnode provisions the needed compute resources to run the job, based on job submission parameters.

By provisioning the compute nodes dynamically, you can immediately see the benefit of elasticity, scale, and optimized operational compute costs. As new technologies are released, you can take advantage of heterogeneous deployments, such as scaling high, tightly coupled, CPU-bound workloads independently from high memory or distributed GPU-based workloads.

To further extend a cloud-native approach to designing HPC architectures, you can integrate with existing AWS services and provide additional benefits by abstracting the underlying compute resources. It is possible for the HPC cluster to be event-driven in response to requests from a web application or from direct API calls.

Additional frontend components can be added to take advantage of an API-instantiated execution of an HPC workload. The following reference architecture describes the pattern.


The difference from the previous reference architecture in Part 1 is that the user submits the job described as JSON through an HTTP call to Amazon API Gateway, which is then processed by an AWS Lambda function to submit the job.


I recommend that you start this section after completing the deployment in Part I . Write down the private IP address of the SLURM controller.

In the Amazon EC2 console, select the SLURM headnode and retrieve the private IPv4 address. In the Lambda console, create a new function based on Python 2.7 authored from scratch.

Under the environment variables, add a new entry for “HEADNODE”, “SLURM_BUCKET_S3”, “SLURM_KEY_S3” and set the value to the private IPv4 address of the SLURM controller noted earlier, plus the bucket and key pair. This allows the Lambda function to connect to the instance using SSH.

In the AWS GitHub repo that you cloned in part 1, find the lambda/ file and upload the contents to the Function Code section of the Lambda function. A derivative of this function was referenced by Puneet Agarwal, in the Scheduling SSH jobs using AWS Lambda post.

The Lambda function needs to launch in the VPC as the SLURM node and have the same security groups as the SLURM headnode. This is because the Lambda function connects to the SLURM controller using SSH. Ignore the error about creating the Lambda function across two Availability Zones for high availability (HA).

The default memory settings, with a timeout of 20 seconds, are sufficient. The Lambda execution role needs access to Amazon EC2, Amazon CloudWatch, and Amazon S3.

In the API Gateway console, create a new API from scratch and name it “hpc.” Under Resources, create a new resource as “hpc.” Then, create a new method under the “hpc” resource for POST.

Under the POST method, set the integration method to the Lambda function created earlier.

Under the resource “hpc”, choose to deploy the API for staging, calling the endpoint “dev.” You get an endpoint to execute:

curl -H "Content-Type: application/json" -X POST https://<endpoint> -d @test.json

Then, create a JSON file with the following code.

    "username": "awsuser", 
    "jobname": "hpc_test", 
    "nodes": 2, 
    "tasks-per-node": 1, 
    "cpus-per-task": 4, 
    "feature": "us-west-2a|us-west-2b|us-west-2c", 
        [{"workdir": "/home/centos/job123"},
         {"input": "s3://ar-job-input/test.input"},
         {"output": "s3://ar-job-output"}],
    "launch": "env && sleep 60"

Next, in the API Gateway console, watch the following four events happen:

  1. The API gateway passes the input JSON to the Lambda function.
  2. The Lambda function writes out a SLURM sbatch job submission file.
  3. The job is executed and held until the instance is provisioned
  4. After the instance is running, the job script executes, copies data from S3, and completes the job.

In the response body of the API call, you return the job ID.

"body": "{\"error\": \"\", \"name\": \"awsuser\", \"jobid\": \"Submitted batch job 5\\n\"}",
"statusCode": 200

When the job completes, the instance is held for 60 seconds in case another job is submitted. If no jobs are submitted, the instance is terminated by the SLURM cluster.


End-to-end scalable job submission and instance provisioning is one way to execute your HPC workloads in a scalable and elastic fashion. Now, go power your HPC workloads on AWS!

Deploying a Burstable and Event-driven HPC Cluster on AWS Using SLURM, Part 1

Post Syndicated from Geoff Murase original

Contributed by Amr Ragab, HPC Application Consultant, AWS Professional Services

When you execute high performance computing (HPC) workflows on AWS, you can take advantage of the elasticity and concomitant scale associated with recruiting resources for your computational workloads. AWS offers a variety of services, solutions, and open source tools to deploy, manage, and dynamically destroy compute resources for running various types of HPC workloads.

Best practices in deploying HPC resources on AWS include creating much of the infrastructure on-demand, and making it as ephemeral and dynamic as possible. Traditional HPC clusters use a resource scheduler that maintains a set of computational resources and distributes those resources over a collection of queued jobs.

With a central resource scheduler, all users have a single point of entry to a broad range of compute. Traditionally, many of these schedulers managed on-premises systems. They weren’t offered dynamically as much as cloud-based HPC clusters, and they usually only needed to manage a largely static set of resources.

However, many schedulers now support the ability to burst into AWS or manage a dynamically changing HPC environment through plugins, connectors, and custom scripting. Some of the more common resource schedulers include:


Simple Linux Resource Manager (SLURM) by SchedMD is one such popular scheduler. Using a derivative of SLURM’s elastic power plugin, you can coordinate the launch of a set of compute nodes with the appropriate CPU/disk/GPU/network topologies. You standup the compute resources for the job, instead of trying to fit a job within a set of pre-existing compute topologies.

We have recently released an example implementation of the SLURM bursting capability in the AWS Samples GitHub repo.

The following diagram shows the reference architecture.


Download the aws-plugin-for-slurm directory locally. Use the following AWS CLI commands to sync the directory with an S3 bucket to be referenced later. For more detailed instructions on the deployment follow the README within the GitHub.

git clone 
aws s3 sync aws-plugin-for-slurm/ s3://<bucket-name>

Included is a CloudFormation script, which you use to stand up the VPC and subnets, as well as the headnode. In AWS CloudFormation, choose Create Stack and import the slurm_headnode-clouformation.yml script.

The CloudFormation script lays down the landing zone with the appropriate network topology. The headnode is based on the publicly available CentOS 7.5 available in the AWS Marketplace. The latest security packages are installed with the dependencies needed to install SLURM.

I have found that scheduling performance is best if the source is compiled at runtime, which the CloudFormation script takes care of. The script sets up the headnode as a single controller. However, with minor modifications, it can be set up in a highly available manner with a backup SLURM controller.

After the deployment moves to CREATE_COMPLETEstatus in CloudFormation, use SSH to connect to the slurm headnode:

ssh -i <path/to/private/key.pem> [email protected]<public-ip-address>

Create a new sbatch job submission file by running the following commands in the vi/nano text editor:

#SBATCH —nodes=2
#SBATCH —ntasks-per-node=1
#SBATCH —cpus-per-task=4
#SBATCH —constraint=[us-west-2a]

sleep 60

This job submission script requests two nodes to be allocated, running one task per node and using four CPUs. The constraint is optional but allows SLURM to allocate the job among the available zones.

The elasticity of the cluster comes in setting the slurm.conf parameters SuspendProgram and ResumeProgram in the slurm.conf file.


You can set the responsiveness of the scaling on AWS by modifying SuspendTime. Do not set a value for ResumeRateor SuspendRate, as the underlying SuspendProgram and ResumeProgramscripts have API calls that impose their own rate limits. If you find that your API call rate limit is reached at scale (approximately 1000 nodes/sec), you can set ResumeRate and SuspendRate accordingly.

If you are familiar with SLURM’s configuration options, you can make further modifications by editing the /nfs/slurm/etc/slurm.conf.d/slurm_nodes.conffile. That file contains the node definitions, with some minor modifications. You can schedule GPU-based workloads separate from CPU to instantiate a heterogeneous cluster layout. You also get more flexibility running tightly coupled workloads alongside loosely coupled jobs, as well as job array support. For additional commands administrating the SLURM cluster, see the SchedMD SLURM documentation.

The initial state of the cluster shows that no compute resources are available. Run the sinfocommand:

[[email protected] ~]$ sinfo
all*         up  infinite     0   n/a 
gpu          up  infinite     0   n/a 
[[email protected] ~]$ 

The job described earlier is submitted with the sbatchcommand:

sbatch test.sbatch

The power plugin allocates the requested number of nodes based on your job definition and runs the Amazon EC2 API operations to request those instances.

[[email protected] ~]$ sinfo
all*      up    infinite      2 alloc# ip-10-0-1-[6-7]
gpu       up    infinite      2 alloc# ip-10-0-1-[6-7]
[[email protected] ~]$ 

The log file is located at /var/log/power_save.log.

Wed Sep 12 18:37:45 UTC 2018 Resume invoked 
/nfs/slurm/bin/ ip-10-0-1-[6-7]

After the request job is complete, the compute nodes remain idle for the duration of the SuspendTime=60value in the slurm.conffile.

[[email protected] ~]$ sinfo
all*         up  infinite     2  idle ip-10-0-1-[6-7]
gpu          up  infinite     2  idle ip-10-0-1-[6-7]
[[email protected] ~]$ 

Ideally, ensure that other queued jobs have an opportunity to run on the current infrastructure, assuming that the job requirements are fulfilled by the compute nodes.

If the job requirements are not fulfilled and there are no more jobs in the queue, the aws-slurm shutdown script takes over and terminates the instance. That’s one of the benefits of an elastic cluster.

Wed Sep 12 18:42:38 UTC 2018 Suspend invoked /nfs/slurm/bin/ ip-10-0-1-[6-7]
    "TerminatingInstances": [
            "InstanceId": "i-0b4c6ec4945afe52e", 
            "CurrentState": {
                "Code": 32, 
                "Name": "shutting-down"
            "PreviousState": {
                "Code": 16, 
                "Name": "running"
    "TerminatingInstances": [
            "InstanceId": "i-0f3139a52f2602c60", 
            "CurrentState": {
                "Code": 32, 
                "Name": "shutting-down"
            "PreviousState": {
                "Code": 16, 
                "Name": "running"

The SLURM elastic compute plugin provisions the compute resources based on the scheduler queue load. In this example implementation, you are distributing a set of compute nodes to take advantage of scale and capacity across all Availability Zones within an AWS Region.

With a minor modification on the VPC layer, you can use this same plugin to stand up compute resources across multiple Regions. With this implementation, you can truly take advantage of a global HPC footprint.

“Imagine creating your own custom cluster mix of GPUs, CPUs, storage, memory, and networking – just the way you want it, then running your experiment, getting the results, and then tearing it all down.” — InsideHPC. Innovation Unbound: What Would You do with a Million Cores?

Part 2

In Part 2 of this post series, you integrate this cluster with AWS native services to handle scalable job submission and monitoring using Amazon API Gateway, AWS Lambda, and Amazon CloudWatch.

Improving application performance and reducing costs with Amazon EBS-Optimized Instance burst capability

Post Syndicated from Geoff Murase original

Contributed by Sooraj Prasannan, Senior Product Manager, Amazon Elastic Block Store

In November 2017, Amazon EC2 introduced C5 compute-intensive instances and M5 general-purpose instances. In the first half of 2018, we released EC2 C5d instances and M5d instances by adding high-speed, ultra-low latency local NVMe storage to the EC2 C5 and M5 instance families. EC2 C5/C5d and M5/M5d instances are built on the Nitro system. This collection of AWS-built hardware and software components enables high performance, high availability, high security, and bare metal capabilities to reduce virtualization overhead.

During the design of the Nitro system, we analyzed real-world workloads and recognized the need for smaller instance sizes to drive higher performance from their Amazon EBS volumes. We found that the majority of application storage needs are bursty, with short, intense periods of high I/O and plenty of idle time between bursts. To improve the experience for these workloads, we developed burst capability for smaller instance sizes. Available on EC2 C5/C5d and M5/M5d instances, this feature enables large, xlarge, and 2xlarge instance sizes to drive the same performance as the 4xlarge instance for 30 minutes each day.

For applications with spiky Amazon EBS demand, you can right-size your instances based on your CPU and memory requirements and still meet your EBS-optimized instance performance requirements. This higher performance also enables you to speed up sections of your workflow dependent on EBS-optimized instance performance. Faster workflows result in quicker job completions and improved resource utilization. The burst capability ultimately enables you to reduce costs by right-sizing your instance and improving total resource usage.

With this performance increase, you will be able to handle unplanned spikes in demand without any impact to your application performance. You can now size your instances based on historical average trends. This burst capability gives you more performance to absorb spikes without affecting your customer experience.

Using Amazon CloudWatch metrics to monitor burst usage

For better visibility into your performance, instances based on the Nitro system provide Amazon CloudWatch metrics to help profile your usage. Based on the usage profile, you can decide if smaller instances meet your requirements.

These instances give you the ability to monitor your usage via instance level CloudWatch metrics for operations (EBSReadOpsandEBSWriteOps) and bytes transferred (EBSReadBytesand EBSWriteBytes). For more information on these metrics, see List of available CloudWatch metrics for your instances. These metrics support basic monitoring (five-minute frequency) by default, but you can enable detailed monitoring (one-minute frequency) for an additional cost. For more information, see Amazon CloudWatch pricing.

For large, xlarge, and 2xlarge instances, we also provide burst balance metrics. EBSIOBalance% monitors the instance I/O burst bucket, and EBSByteBalance% monitors the instance byte burst bucket. These metrics give information about the percentage of I/O or bytes credits remaining in the respective burst buckets. The metrics are expressed as a percentage, where 100% means that the instance has accumulated the maximum number of credits. You can set up an alarm that triggers if the balance gets too low.

To demonstrate these metrics, we launched an m5.large instance. We then attached a 500GB io1 Amazon EBS volume with 32,000 provisioned IOPS to the instance. Amazon EBS volumes attached to instances based on the Nitro system are exposed as NVMe devices.

First, we ran a large block (128 KiB) test using fio to /dev/disk/by-id/nvme-Amazon_Elastic_Block_Store_vol02f2f9a66c2ebfd66 and monitored both EBSIOBalance% and EBSByteBalance%.

$ sudo fio --filename= /dev/disk/by-id/nvme-
Amazon_Elastic_Block_Store_vol02f2f9a66c2ebfd66 --rw=randread --
bs=128k --runtime=2400 --time_based=1 --iodepth=32 --
ioengine=libaio --direct=1 --name=large-block-test 

Because this is a large block workload, it’s not driving enough IOPS to deplete EBSIOBalance%. It depletes EBSByteBalance% instead, as shown in the following image.

Then we ran a small block test to understand how it affects EBSIOBalance% and EBSByteBalance%.

$ sudo fio --filename= /dev/disk/by-id/nvme-
Amazon_Elastic_Block_Store_vol02f2f9a66c2ebfd66 --rw=randread --
bs=16k --runtime=2400 --time_based=1 --iodepth=32 --
ioengine=libaio --direct=1 --name=small-block-test 

Because this is a small block test, it drives higher IOPS than bytes/second. Hence, EBSIOBalance% drops faster than EBSByteBalance%, as shown in the following image.

As long as EBSIOBalance% and EBSByteBalance% are above 0%, the instance can drive the burst performance. When the instance I/O activity is below the baseline rate, the burst buckets refill. After the tests finished, we paused all I/O from the instance. This period of inactivity allows the instance burst buckets to refill, as EBSIOBalance% and EBSByteBalance% show in the following image.

The refill rate for a burst bucket is the difference between the baseline rate and the instance I/O activity. For example, m5.large has a baseline throughput rate of 60 MB/s and a baseline IOPS rate of 3600 IOPS. Suppose the instance I/O activity is 10 MB/s and 1000 IOPS. The byte bucket fills at the rate of 50 MB/s (60 MB/s minus 10 MB/s). The IOPS bucket fills at the rate of 2600 IOPS (3600 IOPS minus 1000 IOPS). For the baseline rates for the different instances, see Amazon EBS–optimized instances. In addition, we top off the burst buckets every 24 hours, which means that the instance has burst performance available for 30 minutes each day.

Performance enhancements

We have continued to make enhancements to the Nitro system. With the latest set of enhancements, we have increased the maximum burst bandwidth on the large, xlarge, and 2xlarge EC2 C5/C5d and M5/M5d instances to 3.5 Gbps, up from 2.25 Gbps and 2.12 Gbps, respectively. We have also increased the maximum burst IOPS for EC2 C5/C5d to 20,000 IOPS and to 18,750 IOPS for M5/M5d, up from 16,000 IOPS for both. All new EC2 C5/C5d and M5/M5d smaller instances can take advantage of this performance increase at no additional cost.

For the latest list of instances based on the Nitro system that support this burst feature and their corresponding performance numbers, see Amazon EBS–optimized instances.

Deploy an 8K HEVC pipeline using Amazon EC2 P3 instances with AWS Batch

Post Syndicated from Geoff Murase original

Contributed by Amr Ragab, HPC Application Consultant, AWS Professional Services

AWS provides several managed services for file- and streaming-based media encoding options.

Currently, these services offer up to 4K encoding. Recent developments and the growing popularity of 8K content has now increased the need to distribute higher resolution content.

In this solution, you use an Amazon EC2 P3 instance to create a file-based encoding pipeline utilizing AWS Batch by first uploading a sample 8K (7680×4320) file to Amazon S3.

AWS Batch

AWS Batch enables developers, scientists, and engineers to easily and efficiently run hundreds of thousands of batch computing jobs on AWS. AWS Batch dynamically provisions the optimal quantity and type of compute resources (e.g., CPU or memory optimized instances) based on the volume and specific resource requirements of the batch jobs submitted. With AWS Batch, there is no need to install and manage batch computing software or server clusters that you use to run your jobs, allowing you to focus on analyzing results and solving problems. AWS Batch plans, schedules, and executes your batch computing workloads across the full range of AWS compute services and features, such as Amazon EC2 and Spot Instances.

P3 instances for video transcoding workloads

The P3 instance comes equipped with the NVIDIA Tesla V100 GPU. The V100 is a 16 GB 5,120 CUDA Core-GPU based on the latest Volta architecture; well suited for video coding workloads. The largest instance size in that family, p3.16xlarge, has 64 vCPU, 488 GB of RAM, 8 NVIDIA Tesla V100 GPUs, and 25 Gbps networking bandwidth.

Other than being a mainstay in computational workloads the V100 offers enhanced hardware-based encoding/decoding (NVENC/NVDEC). The following tables summarize the NVENC/NVDEC options available compared to other GPUs offered at AWS.

NVENC Support Matrix

AWS GPU instance
GPU FAMILYGPUH.264 (AVCHD) YUV 4:2:0H.264 (AVCHD) YUV 4:4:4H.264 (AVCHD) LosslessH.265 (HEVC) 4K YUV 4:2:0H.265 (HEVC) 4K YUV 4:4:4H.265 (HEVC) 4K LosslessH.265 (HEVC) 8k
G2KeplerGRID K520YES
P2Kepler (2nd Gen)Tesla K80YES
G3Maxwell (2nd Gen)Tesla M60YESYESYESYES

NVDEC Support Matrix

P2Kepler (2nd Gen)Tesla K80YESYESYES
G3Maxwell (2nd Gen)Tesla M60YESYESYESYES

Cinematic 8K encoding is supported using the Tesla V100 (P3 instance family) either in landscape or portrait orientations using the HEVC codec. 

Tesla M60+++4096x
Tesla V100+++4096x


To follow along with these procedures, ensure that you have the following:

  • An AWS account with permissions to create IAM roles and policies, as well as read and write access to S3
  • Registration with the NVIDIA Developer Network
  • Familiarity with Docker


For deployment, you containerize the encoding pipeline. After building the underlying P3 container instance, you then use nvidia-docker2 to build the video-encoding Docker image, which is registered with Amazon Elastic Container Registry (Amazon ECR).

As shown in the following diagram, the pipeline reads an input raw YUV file from S3, then pulls the containerized encoding application to execute at scale on the P3 container instance. The encoded video file is then be transferred to S3.

The nvidia-docker2 image video encoding stack contains the following components:

  • FFMPEG 4.0
  • NVIDIA Video Codec SDK 8.1

This is a relatively lengthy procedure. However, after it’s built, the underlying instance and Docker image are reusable and can be quickly deployed as part of a high performance computing (HPC) pipeline.

Creating the ECS container instance

The underlying instance can be built by selecting the Amazon Linux AMI with the p3.2xlarge instance type in a public subnet. Additionally, add an EBS volume (150 GB), which is used for the 8k input, raw yuv, and output files. Scale the storage amount for larger input files. Persist the mount in /etc/fstab. Connect to the instance over SSH and install any OS updates as well as the EPEL Release and support packages as well as the base docker-ce.

sudo yum update -y
sudo yum install yum-utils \
                 device-mapper-persistent-data \
                 lvm2 \

sudo yum-config-manager --add-repo
sudo yum install epel-release-latest-7.noarch.rpm
sudo yum update
sudo yum install docker-ce -y

The NVIDIA/CUDA stack can be installed using the cuda-repo-rhel7.rpm file. The CUDA framework installs the NVIDIA driver dependencies.

sudo yum install cuda -y

Next, install nvidia-docker2 as provided in the NVIDIA GitHub repo.

distribution=$(. /etc/os-release;echo $ID$VERSION_ID)
curl -s -L$distribution/nvidia-docker.repo | \
  sudo tee /etc/yum.repos.d/nvidia-docker.repo
sudo yum install -y nvidia-docker2
sudo pkill -SIGHUP dockerd

sudo tee /etc/docker/daemon.json <<EOF
    "runtimes": {
        "nvidia": {
            "path": "/usr/bin/nvidia-container-runtime",
            "runtimeArgs": []
    "default-runtime": "nvidia"

sudo systemctl restart docker

With the base components in place, make this instance compatible with the ECS service:

sudo yum install ecs-init -y

Create the /etc/ecs/ecs.config file with the following template:

cat << EOF > /etc/ecs/ecs.config

Iptables and packet forwarding rules need to be created to pass IAM roles into task operations:

sudo sh -c "echo 'net.ipv4.conf.all.route_localnet = 1' >> /etc/sysctl.conf"
sudo sysctl -p /etc/sysctl.conf
sudo iptables -t nat -A PREROUTING -p tcp -d --dport 80 -j DNAT --to-destination
sudo iptables -t nat -A OUTPUT -d -p tcp -m tcp --dport 80 -j REDIRECT --to-ports 51679
sudo sh -c 'iptables-save > /etc/sysconfig/iptables'

Finally, a systemd unit file needs to be created:

sudo cat << EOF > /etc/systemd/system/[email protected]
Description=Docker Container %I

ExecStartPre=-/usr/bin/docker rm -f %i
ExecStart=/usr/bin/docker run --name %i \
--privileged \
--restart=on-failure:10 \
--volume=/var/run:/var/run \
--volume=/var/log/ecs/:/log:Z \
--volume=/var/lib/ecs/data:/data:Z \
--volume=/etc/ecs:/etc/ecs \
--net=host \
--env-file=/etc/ecs/ecs.config \
ExecStop=/usr/bin/docker stop %i


sudo systemctl enable [email protected]
sudo systemctl start [email protected]
sudo systemctl status [email protected]

Ensure that the [email protected] service starts successfully.

Creating the NVIDIA-Docker image

With Docker installed, pull the latest nvidia/cuda:latest image from DockerHub.

docker pull nvidia/cuda:latest

It is best at this point to run the Docker container in interactive mode. However, a Docker build file can be created afterwards. At the time of publication, only CUDA 9.0 is installed. NVIDIA has already provided the necessary repositories. Install CUDA 9.2, and support packages, inside the Docker container, referenced by the (docker)  label:

docker run -it --runtime=nvidia --rm nvidia/cuda
(docker) apt update
(docker) apt install pkg-config build-essential wget curl nasm unzip \
                     git libglew-dev cuda-toolkit-9-2 python3-pip -y
(docker) pip3 install awscli

Next, download the FFMPEG 4.0, nv-codec-headers, and the Video Codec SDK 8.1 from the NVIDIA Developer platform.

First, extract the nv-codec-headers and into the directory:

(docker) make
(docker) make install

Extract the ffmpeg-4.0 directory and compile and install FFmpeg:

(docker) ./configure --enable-cuda --enable-cuvid --enable-nvenc --enable-nonfree --enable-libnpp --extra-cflags=-I/usr/local/cuda/include --extra-ldflags=-L/usr/local/cuda/lib64
(docker) make -j 4
(docker) make install

Download and extract the NVIDIA Video Codec SDK 8.1. The “Samples” directory has a preconfigured Makefile that compiles the binaries in the SDK. After it’s successful, confirm that the binaries are correctly set up.

(docker): ~/Video_Codec_SDK_8.1.24/Samples/AppEncode/AppEncCuda$ ./AppEncCuda -h
-i Input file path
-o Output file path
-s Input resolution in this form: WxH
-if Input format: iyuv nv12 yuv444 p010 yuv444p16 bgra bgra10 ayuv abgr abgr10
-gpu Ordinal of GPU to use
-codec Codec: h264 hevc
-preset Preset: default hp hq bd ll ll_hp ll_hq lossless lossless_hp
-profile H264: baseline main high high444; HEVC: main main10 frext
-444 (Only for RGB input) YUV444 encode
-rc Rate control mode: constqp vbr cbr cbr_ll_hq cbr_hq vbr_hq
-fps Frame rate
-gop Length of GOP (Group of Pictures)
-bf Number of consecutive B-frames
-bitrate Average bit rate, can be in unit of 1, K, M
-maxbitrate Max bit rate, can be in unit of 1, K, M
-vbvbufsize VBV buffer size in bits, can be in unit of 1, K, M
-vbvinit VBV initial delay in bits, can be in unit of 1, K, M
-aq Enable spatial AQ and set its stength (range 1-15, 0-auto)
-temporalaq (No value) Enable temporal AQ
-lookahead Maximum depth of lookahead (range 0-32)
-cq Target constant quality level for VBR mode (range 1-51, 0-auto)
-qmin Min QP value
-qmax Max QP value
-initqp Initial QP value
-constqp QP value for constqp rate control mode
Note: QP value can be in the form of qp_of_P_B_I or qp_P,qp_B,qp_I (no space)

Encoder Capability
# GPU H264 H264_444 H264_ME H264_WxH HEVC HEVC_Main10 HEVC_Lossless HEVC_SAO HEVC_444 HEVC_ME HEVC_WxH
0 Tesla V100-SXM2-16GB + + + 4096x4096 + + + + + + 8192x8192

Create a small script to be used for the 8K-encoding test inside the Docker container. Save the file as /root/ In the basic form, this script encodes using a single thread. For comparison, the same file is encoded using four threads.

#!/bin/bash -xe
time aws s3 cp $S3_INPUT /mnt/8k.webm

time /usr/local/bin/ffmpeg -y -hwaccel cuda -i /mnt/8k.webm -c:v rawvideo -pix_fmt yuv420p /mnt/8k.yuv
time /root/Video_Codec_SDK/Samples/AppEncode/AppEncCuda/AppEncCuda -i /mnt/8k.yuv -o /mnt/8k.hevc -s 7680x4320 -codec hevc
time /root/Video_Codec_SDK/Samples/AppEncode/AppEncPerf/AppEncPerf -i /mnt/8k.yuv -s 7680x4320 -thread 4 -codec hevc

time aws s3 cp /mnt/8k.hevc $S3_OUTPUT

This script downloads a file from S3 and processes it through FFmpeg. Using the AppEncCuda and AppEncPerf methods, create the 8K-encoded file to be uploaded back to S3. Commit your Docker container into a new Docker image:

docker commit -m "creating hvec-processor image" <containerid> nvidia-hvec:latest

Ensure that a Docker repo has been created in Amazon ECS. Choose Repositories, Create Repository. After you open the repository, choose View Push Commands. Commit the new created image to your ECR repo.

After confirming that your image is in your ECR repo, delete all images locally in the instance:

docker rmi -f $(docker images -a -q)

Before stopping the instance, remove the ECS agent checkpoint file:

sudo rm -rf /var/lib/ecs/data/ecs_agent_data.json

Create an AMI from the instance, maintaining the attached EBS volume. Note the AMI ID.

Creating IAM role permissions

To ensure that access to ECS is controlled and to allow AWS Batch to be called, create two IAM roles:

  • BatchServiceRole allows AWS Batch to call services on your behalf.
  • ecsInstanceRole is specific to this workflow and adds permissions for S3FullAccess. This allows the container to read from and write to your S3 bucket. The following screenshot shows the example policy stack.

In AWS Batch, select the compute environment and create a managed compute environment. Assign a cluster name and min and max vCPUs values. Use the AMI ID, and IAM roles created earlier. Use the Spot pricing model with a consideration of running at 60% of the On-Demand price. Look at the current Spot price to see if more aggressive discounts are possible.

Note the cluster name. In Amazon ECS, you should see the cluster created. Next, create a job queue and associate this job queue with the compute environment created earlier. Note the job queue name.

Next, create a job definition file. This provides the job parameters to be used including mounting paths, CPU, and memory requirements.

    "containerProperties": {
        "mountPoints": [
                "sourceVolume": "codec-data",
                "readOnly": false,
                "containerPath": "/mnt"
        "image": "<accountnumber>",
        "command": ["/root/"],
        "volumes": [
                "host": {"sourcePath": "/mnt"},
                "name": "codec-data"
        "memory": 32768,
        "vcpus": 8,
        "privileged": true,
        "environment": [
                "name": "S3_INPUT",
                "value": "s3://<bucket>/<key_name>"
                "name": "S3_OUTPUT",
                "value": "s3://<bucket>"
        "ulimits": []
    "type": "container",
    "jobDefinitionName": "nvenc-test"

Save the file as nvenc-test.json and register the job in AWS Batch.

aws batch register-job-definition --cli-input-json file://nvenc-test.json

In the AWS Batch console, create a job queue assigning a priority of 1 to the compute environment created earlier. Create a job assigning a job name, with the job definition file, and job queue. Add additional environment variables for the S3 bucket. Ensure that these buckets and input file are created.

S3_INPUT = s3://<bucket>/<key_name> 
S3_OUTPUT = s3://<bucket> 

Execute the job. In a few moments, the job should be in the Running state. Check the CloudWatch logs for an updated status of the job progression. Open the job record information and scroll down to CloudWatch metrics. The events are logged in a new AWS Batch log stream.

A 1-minute 8K YUV 4:2:0 file took approximately 10 minutes single-threaded (top panel), and 58 seconds using four threads (bottom panel). The script serves as a basic implementation of 8K encoding. Explore the options provided by the NVIDIA Video Codec SDK for additional encoding/decoding and transcoding options.


With AWS Batch, a customized container instance, and a dockerized NVIDIA video encoding platform, AWS can provide your HD, 4K, and now 8K media distribution. I invite you to incorporate this into your automated pipeline.

With some minor modification, it’s possible to trigger this pipeline after a new file is uploaded into S3. Then, execute through AWS Lambda or as part of an AWS Step Functions workflow.






Building a GPU workstation for visual effects with AWS

Post Syndicated from Geoff Murase original

Contributed by Mike Owen, Solutions Architect, AWS Thinkbox

The elasticity, scalability, and cost effectiveness of the cloud value proposition is attractive to media customers. One of the key design patterns in media and entertainment (M&E) workloads is using the cloud as a content lake and bringing the underlying processes closer without having to synchronize data. In this high-end graphics visualization business, a pixel-perfect, color-accurate, fully interactive native desktop experience is required for both Windows and Linux platforms. Visual effects (VFX) artists also require input peripherals such as latest-generation Wacom 8K pressure-sensitive tablets and Wacom Cintiq monitors to work as seamlessly as they do on-premises.

AWS offers Amazon EC2 G3 instances backed by NVIDIA Tesla M60 GPUs with powerful graphics capabilities: OpenGL 4.6, DirectX 12, CUDA 9.2, GRID 6.1. You can combine these instances with the Teradici streaming protocol via their Cloud Access Software (CAS) agent to enable a high-end desktop experience on either Windows or Linux with an on-demand pricing model to fit your business needs. Teradici PCoIP is a popular protocol in the M&E industry, where Teradici have delivered a custom silicon accelerated zero-client hardware device to deliver secure pixel streaming to on-premises monitors. AWS also enables customers to create managed virtual desktop environments with Amazon WorkSpaces Graphics bundles (Windows and Linux) or Amazon AppStream 2.0 (Windows). Both solutions offer a managed environment with GPU-backed instances. This blog describes how you can set up an unmanaged VFX desktop using Amazon EC2 G3 instances combined with high-performance storage and scalable compute options such as Amazon EC2 Spot Instances.


The following diagram describes a typical Windows and Linux configuration. In this setup, you use a Teradici PCoIP Zero Client over a dedicated network connection from your on-premises location via your chosen network provider to their nearest AWS Region containing an Amazon EC2 G3 instance. AWS Direct Connect provides a low-latency, high-bandwidth dedicated connection that doesn’t traverse the public internet. With the Windows instance, you might use a creative pen display such as a Wacom Cintiq monitor or, on a Linux instance, the latest generation of Wacom 8K pressure-sensitive tablets. You can connect both types of environments to dual 2K monitors and be ready for film VFX work.

Once built, the g3.4xl instance runs your custom Amazon Machine Image (AMI) with encrypted volume(s) in Amazon Elastic Block Storage (EBS) containing all your software, pulling floating licenses from your on-premises license servers where necessary. For Linux, you have the option of centrally installing your software via a fast NVMe SSD–based i3 instance type and building a minimal-sized boot AMI. In both cases, you can add encrypted Amazon EBS SSD volumes for increased local storage. The Teradici CAS agent runs on each individual G3 instance and can be provisioned, brokered, and managed by the optional Teradici Cloud Access Manager (CAM) solution. Finally, Amazon WorkSpaces Graphics bundles are compatible with a Teradici zero client, providing easy access to a fully managed Windows desktop. This might be useful for Linux-based studios that require ad hoc Windows usage such as Adobe Creative Cloud.

In this configuration, a Teradici zero client interacts with the provisioned desktop (served on a G3 instance) in the cloud. The Teradici CAS agent captures the frame buffer and sends it in real time to the zero client over the network via UDP using the PCoIP protocol. A smooth, reliable experience depends on a low-latency and high-bandwidth connection to the Amazon EC2 instance hosting the desktop. Bandwidth requirements depend on the number of monitors used, resolution, frame rate, and lossless quality of the desktop experience. For Wacom tablet support, Teradici CAS 2.12 requires the latency level to be less than 25 ms. You can use or to check the latency time of public pings between your location and your closest AWS Region. Ideally, you will provision an AWS Direct Connect connection for private (doesn’t traverse the public internet) and fast (low-latency) connectivity to the AWS Region from your location. You can also use a public internet connection for initial testing. In both cases, you can route traffic over a VPN for added security.


Instead of doing a manual build, you can visit the AWS Marketplace and subscribe to a Teradici-provided pre-built AMI. It already has the NVIDIA GRID driver and Teradici CAS software installed, configured, and licensed as part of the overall usage cost. See the following offerings on AWS Marketplace:


Make sure that everything in the following list is in place before deploying to either platform:

  • Create an AWS account.
  • Ensure that your AWS account has an EC2 key-pair associated with it by going to the AWS Management Console and checking Key Pairs under Network and Security in the applicable AWS Region.
  • Set up an AWS account <ACCESS KEY> and <SECRET ACCESS KEY> to access the NVIDIA GRID driver from an Amazon S3 bucket. The deployment instructions explain how to install and set up the AWS Command-Line Interface (AWS CLI).
  • Minimum version: CentOS 7.2 or Windows 2016.
  • Recommended Teradici PCoIP Zero Client firmware version: 6.0. Contact Teradici to download.
  • Contact Teradici who will provide a 60-day trial license: <TERADICI LICENSE CODE> for Cloud Access Software. You should receive your license within 1 business day. If you don’t receive your license, please contact [email protected].
  • You must have superuser (root) or Administrator privileges to the AMI.
  • The Amazon EC2 security group provides a stateful firewall on each instance via a set of rules. The following inbound ports must be available on the Amazon EC2 instance from a specific client’s source IP address (restrictive access).
TypeProtocolPort RangeSourceDescriptionPlatform

Deploying the desktop on Linux

For our Linux deployment, we use the latest CentOS 7.5 AMI from AWS Marketplace and install the NVIDIA/Xorg/KDE/Wacom stack to create a fully functioning VFX Linux desktop environment. This stack contains the following components:

  • CentOS 7.5.1804_2 AMI
  • NVIDIA Grid 6.1 (390.57 May 2018) driver
  • Teradici CAS 2.12
  • Wacom 0.40 driver

Feel free to use your own CentOS 7.2+ AMI and modify the step by step instructions accordingly.

Setting up the desktop on Linux

To launch a g3.4xl instance in the closest AWS Region in your AWS account using the created key-pair and security group, use an AMI ID from the ones in the following table. For reference, search for the AMI using the keywords CentOS Linux 7 x86_64 HVM EBS 1804_2.

US East (N. Virginia)us-east-1ami-d5bf2caa
US East (Ohio)us-east-2ami-77724e12
US West (N. California)us-west-1ami-3b89905b
US West (Oregon)us-west-2ami-5490ed2c
EU (Frankfurt)eu-central-1ami-9a183671
EU (Ireland)eu-west-1ami-4c457735
Asia Pacific (Tokyo)ap-northeast-1ami-3185744e
Asia Pacific (Singapore)ap-southeast-1ami-da6151a6
Asia Pacific (Sydney)ap-southeast-2ami-0d13c26f

Once the g3.4xl instance has passed its EC2 instance 2/2 status checks, we can build in true AWS style.

First, log in to the instance and set up the environment.

# ssh into running Amazon EC2 instance
ssh [email protected]<IP-ADDRESS>.<AWS-REGION>
# yes

# set a password for your user
sudo passwd centos

# disable selinux
sudo sed -ir 's/SELINUX=\(disabled\|enforcing\|permissive\)/SELINUX=disabled/' /etc/selinux/config

# install the EPEL repository
sudo yum install wget -y
sudo wget
sudo rpm -i epel-release-latest-7.noarch.rpm

# run yum update to make sure all packages are up-to-date
sudo yum update -y

# install the "Server with GUI" group
sudo yum groupinstall "Server with GUI" -y

# prefer KDE desktop? (optional)
sudo yum groupinstall -y "KDE Plasma Workspaces"
sudo systemctl set-default
echo "exec startkde" >> ~/.xinitrc

# uninstall KDE (optional)
# sudo yum groupremove -y "KDE Plasma Workspaces"
# sudo yum autoremove -y
# sudo reboot

# reboot to make sure the latest installed kernel is running
sudo reboot

# install kernel-devel
sudo yum install kernel-devel -y

Next, install and register the Teradici CAS 2.12 software.

# import the Teradici signing key
sudo rpm --import

# grab the PCoIP repo file
sudo curl -o /etc/yum.repos.d/pcoip.repo

# install PCoIP agent package
sudo yum install pcoip-agent-graphics -y

# load vhci-hcd kernel modules
sudo modprobe -a usb-vhci-hcd usb-vhci-iocifc

# register with the licensing service
pcoip-register-host --registration-code=<TERADICI LICENSE CODE>

# set up PCoIP agent config to enable USB
echo """pcoip.grid_diff_map = 0 pcoip.enable_usb = 1 pcoip.usb_auth_table = "23XXXXXX" pcoip.usb_unauth_table = "" """ | sudo tee /etc/pcoip-agent/pcoip-agent.conf

# make sure you're running latest pcoip-agent version
sudo yum update pcoip-agent-graphics

Then install the NVIDIA GRID graphics driver and apply performance optimization to its configuration.

# NVIDIA GRID driver

# install nano editor
sudo yum install nano -y

# remove any old NVIDIA drivers/CUDA
sudo yum erase nvidia cuda

# disable the nouveau open source driver for NVIDIA graphics cards
sudo touch /etc/modprobe.d/blacklist.conf

# paste the following lines in one go into your shell
cat << EOF | sudo tee --append /etc/modprobe.d/blacklist.conf
blacklist vga16fb
blacklist nouveau
blacklist rivafb
blacklist nvidiafb
blacklist rivatv

# edit the /etc/default/grub file and add the line:
sudo nano /etc/default/grub

# rebuild grub2 config
sudo grub2-mkconfig -o /boot/grub2/grub.cfg
sudo reboot

# install pip
curl -O
python --user

# install AWS CLI
pip install awscli --upgrade --user

# configure AWS CLI credentials
aws configure

# AWS Access Key ID [None]: <ACCESS KEY>
# AWS Secret Access Key [None]: <SECRET ACCESS KEY>
# Default Region name [None]: <AWS REGION>
# Default output format [None]: <enter>

# 390.57 driver
aws s3 cp --recursive s3://ec2-linux-nvidia-drivers/latest/ .
chmod +x

sudo /bin/bash ./

# respond to the NVIDIA installer prompts as follows:
    # <accept> the EULA
    # <Yes> to register kernel module sources with DKMS
    # <No> to installing 32-bit libraries
    # <No> to modifying the file at end of install
    # <OK> to complete the installer

# check driver installed
nvidia-smi -q | head

# g3/NVIDIA optimization settings
sudo nvidia-persistenced
sudo nvidia-smi --auto-boost-default=0
sudo nvidia-smi -ac 2505,1177

sudo reboot

Install CUDA if required by any of your VFX software such as Autodesk Maya or SideFX Houdini:

# install CUDA and OpenCL

mv cuda_9.2.88_396.26_linux

# don't install the actual graphics driver, just CUDA 9.2 toolkit, sym-link
sudo /bin/sh

Do you accept the previously read EULA?
accept/decline/quit: accept

Install NVIDIA Accelerated Graphics Driver for Linux-x86_64 396.26?
(y)es/(n)o/(q)uit: n

Install the CUDA 9.2 Toolkit?
(y)es/(n)o/(q)uit: y

Enter Toolkit Location
[ default is /usr/local/cuda-9.2 ]: 

Do you want to install a symbolic link at /usr/local/cuda?
(y)es/(n)o/(q)uit: y

Install the CUDA 9.2 Samples?
(y)es/(n)o/(q)uit: n

Installing the CUDA Toolkit in /usr/local/cuda-9.2 ...

# CUDA Patch 1 (Released May 16, 2018)
mv cuda_9.2.88.1_linux
sudo /bin/sh

# Ensure these ENV VARs are present: /etc/profile.d
export PATH=/usr/local/cuda-9.2/bin${PATH:+:${PATH}}
export LD_LIBRARY_PATH=/usr/local/cuda-9.2/lib64${LD_LIBRARY_PATH:+:${LD_LIBRARY_PATH}}

Finally, install Wacom drivers.

# install Wacom driver
cd ~
tar jxf input-wacom-0.40.0.tar.bz2
cd input-wacom-0.40.0
sudo su
make && make install
modprobe wacom
dracut --force
sudo touch /etc/X11/xorg.conf.d/99-wacom-pressure2k.conf

# edit Wacom conf file as follows
sudo nano /etc/X11/xorg.conf.d/99-wacom-pressure2k.conf

Section "InputClass"
    Identifier "Wacom pressure compatibility"
    MatchDriver "wacom"
    Option "Pressure2K" "true"

# check Elastic Network Adapter (ENA) is running on your instance
modinfo ena
ethtool -i eth0
aws ec2 describe-images --image-id <AMI-ID> --query 'Images[].EnaSupport'

# if that command returns false, proceed to enable it
# make sure that you have AWS CLI installed with AWS credentials on your local machine
sudo shutdown now
aws ec2 modify-instance-attribute --instance-id <CURRENT EC2 INSTANCE ID> --ena-support

# if you're using a pre-existing Linux AMI, you need to install the ENA driver yourself

sudo reboot

Deploying the desktop on Windows

We use the latest AWS-provided Windows 2016 AMI for our deployment and install the NVIDIA/Teradici/Wacom stack to create a fully functioning VFX Windows desktop environment. This stack contains the following components:

  • Windows Server 2016 Base 2018.04.11
  • NVIDIA Grid 6.1 (391.58 May 2018) driver
  • Teradici CAS 2.12
  • Latest Wacom driver

Feel free to use your own Windows 2016 AMI and modify the step by step instructions accordingly.

Windows Instructions

To launch a g3.4xl instance in the closest AWS Region in your AWS account using the created key-pair and security group, use an AMI ID from the ones in the following table. For reference, the AMI name is Microsoft Windows Server 2016 Base 2018.04.11.

US East (N. Virginia)us-east-1ami-3633b149
US East (Ohio)us-east-2ami-5984b43c
US West (N. California)us-west-1ami-3dd1c25d
US West (Oregon)us-west-2ami-f3dcbc8b
EU (Frankfurt)eu-central-1ami-b5530b5e
EU (Ireland)eu-west-1ami-4cc09a35
Asia Pacific (Tokyo)ap-northeast-1ami-0e809272
Asia Pacific (Singapore)ap-southeast-1ami-00a2847c
Asia Pacific (Sydney)ap-southeast-2ami-7279b010

Once the g3.4xl instance has passed its Amazon EC2 instance 2/2 status checks, let’s go build:

# use AWS Management Console to right-click EC2 instance and "Get Windows Password" -> <RDP PASSWORD>

# RDP into machine
# address: ec2-<IP-ADDRESS>.<AWS-REGION>
# username: Administrator
# password: <RDP PASSWORD>

# set a password in command prompt
net user Administrator <NEW PASSWORD>

# configure Powershell - Allow ExecutionPolicy of Powershell scripts
Set-ExecutionPolicy -ExecutionPolicy AllSigned

# enable Software Secure Attention Sequence (SAS) setting
Open gpedit.msc
Expand Computer Configuration > Administrative Templates > Windows Components
Select Windows Logon Options
Double-click Disable or enable software Secure Attention Sequence
Select Enabled
Select Services from the drop down list in the bottom left pane
Click OK

# install AWS CLI
# download and install:

# configure AWS CLI credentials in Powershell
aws configure

# AWS Access Key ID [None]: <ACCESS KEY>
# AWS Secret Access Key [None]: <SECRET ACCESS KEY>
# Default Region name [None]: <AWS REGION>
# Default output format [None]: <enter>

# download NVIDIA GRID driver from Amazon S3
# right-click Powershell, Run As Administrator, paste following into Powershell

$Bucket = "ec2-windows-nvidia-drivers"
$KeyPrefix = "latest"
$LocalPath = "C:\Users\Administrator\Desktop\NVIDIA"
$Objects = Get-S3Object -BucketName $Bucket -KeyPrefix $KeyPrefix -Region us-east-1
foreach ($Object in $Objects) {
    $LocalFileName = $Object.Key
    if ($LocalFileName -ne '' -and $Object.Size -ne 0) {
        $LocalFilePath = Join-Path $LocalPath $LocalFileName
        Copy-S3Object -BucketName $Bucket -Key $Object.Key -LocalFile $LocalFilePath -Region us-east-1

# run NVIDIA GRID installer

# reboot machine via command prompt
cmd shutdown /r

# Optimize GPU settings (follow these instructions)

# via Powershell
cd "C:\Program Files\NVIDIA Corporation\NVSMI"
.\nvidia-smi --auto-boost-default=0
.\nvidia-smi -ac "2505,1177"

# go to, create account, and request access from Teradici via support ticket
# download Teradici PCoIP CAS software: PCoIP Graphics Agent 2.12 for Windows or later

# install PCoIP graphics agent package via GUI based installer
enter <TERADICI LICENSE CODE> via GUI installer
reboot machine

# download and install latest Wacom drivers from Wacom website

# double-check the Elastic Network Adapter (ENA) is running
# ensure you have AWS CLI installed with AWS credentials on your local machine
aws ec2 describe-instances --instance-ids <CURRENT EC2 INSTANCE ID> --query "Reservations[].Instances[].EnaSupport"

# if the check returns false, install ENA drivers

# if you're using a pre-existing Windows AMI, you need to install the ENA driver yourself

Validating the desktop

Finally, take your new Linux or Windows VFX workstation for a spin. Using a zero client:

# connect Wacom tablet to zero-client and start a PCoIP session...
# ensure you configure zero-client to connect via:
# “Auto-Detect” in local z/c connection settings

# install any other software you need...

# don't forget to configure your floating license servers...

# finally, create a new AMI to capture your new custom VFX workstation image in your account

Teradici provides a software client for Windows and macOS that you can use to validate the setup of your Windows or Linux desktop. It’s also handy for system administrators who need to access a graphics workstation for artist technical support.

Testing the desktop

For testing, let’s run Autodesk 3ds Max on Windows and Autodesk Maya on Linux.

In 3ds Max, we have a 35-million-poly scene from the GPU-accelerated renderer Redshift, fully interactive and able to use the NVIDIA card to perform CUDA-based GPU final rendering.

In Maya, we show the 16 vCPUs and 120 GB of RAM available to this 3D scene file. The file takes 10 minutes to final render at HD resolution on a g3.4xl instance or, if you decide to offload the CUDA rendering to the Amazon EC2 P3.16xl instance type, just 19 seconds!


The Amazon EC2 G3 instance type is purpose-built to provide a high-end professional graphics infrastructure for visual computing applications. With remote protocols like Teradici PCoIP, G3 instances are the next-generation VFX cloud desktops that can deliver outstanding performance. With many studios already taking advantage of elastic cloud scaling for rendering, now is a great time to deploy cloud desktops for your business.