Tag Archives: high performance computing

Monitoring dashboard for AWS ParallelCluster

Post Syndicated from Ben Peven original https://aws.amazon.com/blogs/compute/monitoring-dashboard-for-aws-parallelcluster/

This post is contributed by Nicola Venuti, Sr. HPC Solutions Architect.

AWS ParallelCluster is an AWS-supported, open source cluster management tool that makes it easy to deploy and manage High Performance Computing (HPC) clusters on AWS. While AWS ParallelCluster includes many benefits for its users, it has not provided straightforward support for monitoring your workloads. In this post, I walk you through an add-on extension that you can use to help monitor your cloud resources with AWS ParallelCluster.

Product overview

AWS ParallelCluster offers many benefits that hide the complexity of the underlying platform and processes. Some of these benefits include:

  • Automatic resource scaling
  • A batch scheduler
  • Easy cluster management that allows you to build and rebuild your infrastructure without the need for manual actions
  • Seamless migration to the cloud supporting a wide variety of operating systems.

While all of these benefits streamline processes for you, one crucial task for HPC stakeholders that customers have found challenging with AWS ParallelCluster is monitoring.

Resources in an HPC cluster like schedulers, instances, storage, and networking are important assets to monitor, especially when you pay for what you use. Organizing and displaying all these metrics becomes challenging as the scale of the infrastructure increases or changes over time which is the typical scenario in the Cloud.

Given these challenges, AWS wants to provide AWS ParallelCluster users with two additional benefits: 1/ facilitate the optimization of price performance and 2/ visualize and monitor the components of cost for their workloads.

The HPC Solution Architects team has created an AWS ParallelCluster add-on that is easy to use and customize. In this blog post, I demonstrate how Grafana – a platform for monitoring and observability – can run on AWS ParallelCluster to enable infrastructure monitoring.

AWS ParallelCluster integrated dashboard and Grafana add-on dashboards

Many customers want a tool that consolidates information coming from different AWS services and makes it easy to monitor the computing infrastructure created and managed by AWS ParallelCluster. The AWS ParallelCluster team – just like every other service team at AWS – is open and keen to listen from our customers.

This feedback helps us inform our product roadmap and build new, customer-focused features.

We recently released AWS ParallelCluster 2.10.0 based on customer feedback. Here are some key features have been released with it:

  • Support for the CentOS 8 Operating System: Customers can now choose CentOS 8 as their base operating system of choice to run their clustersfor both x86 and Arm architectures.
  • Support for P4d instances along with NVIDIA GPUDirect Remote Direct Memory Access (RDMA).
  • FSx for Lustre enhancements (support for  FSx AutoImport and HDD-based support filesystem options)
  • And finally, a new CloudWatch Dashboards designed to help aggregating cluster information, metrics, and logging already available on CloudWatch.

The last of these features above, integrated CloudWatch Dashboards, are designed to help customers face the challenge of cluster monitoring mentioned above. In addition, the Grafana dashboards demonstrated later in this blog are a complementary add-on to this new, CloudWatch-based, dashboard. There are some key differences between the two, summarized in the table below.

The latter does not require any additional component or agent running on either the head or the compute nodes. It aggregates the metrics already pushed by AWS ParallelCluster on CloudWatch into a single dashboard. This translates into zero overhead for the cluster, but at the expense of less flexibility and expandability.

Instead, the Grafana-based dashboard offers additional flexibility and customizability and requires a few lightweight components installed on either the head or the compute nodes. Another key difference between the two monitoring dashboards is that the CloudWatch based one requires AWS credentials, IAM User and Roles configured and access to the AWS web-console, while the Grafana-based one has its-own built-in authentication and authorization system unrelated from the AWS account, end-users or HPC admins might not have permissions (or simply are not willing) to access the AWS Management Console in order to monitor their clusters.

CloudWatch based dashboard Grafana-based monitoring add-on
No additional component Grafana + Prometheus
No overhead Minimal overhead
Little to no expandability Support full customizability
Requires AWS credentials and IAM configured Custom credentials, no AWS access required
Custom user-interface

Grafana add-on dashboards on GitHub

There are many components of an HPC cluster to monitor. Moreover, the cluster is built on a system that is continuously evolving: AWS ParallelCluster and AWS services in general are updated and new features are released often. Because of this, we wanted to have a monitoring solution that is developed on flexible components, that can evolve rapidly. So, we released a Grafana add-on as an open-source project onto this GitHub repository.

Releasing it as an open-source project allows us to more easily and frequently release new updates and enhancements. It also enables users to customize and extend its functionalities by adding new dashboards, or extending the dashboards functionalities (like GPU monitoring or track EC2 Spot Instance interruptions).

At the moment, this new solution is composed of the following open-source components:

  • Grafana is an open-source platform for monitoring and observing. Grafana allows you to query, visualize, alert, and understand your metrics. It also helps you create, explore, and share dashboards fostering a data-driven culture.
  • Prometheus is an open source project for systems and service monitoring from the Cloud Native Computing Foundation. It collects metrics from configured targets at given intervals, evaluates rule expressions, displays the results, and can trigger alerts if some condition is observed to be true.
  • The Prometheus Pushgateway is an open source tool that allows ephemeral and batch jobs to expose their metrics to Prometheus.
  • NGINX is an HTTP and reverse proxy server, a mail proxy server, and a generic TCP/UDP proxy server.
  • Prometheus-Slurm-Exporter is a Prometheus collector and exporter for metrics extracted from the Slurm resource scheduling system.
  • Node_exporter is a Prometheus exporter for hardware and OS metrics exposed by *NIX kernels, written in Go with pluggable metric collectors.

Note: while almost all components are under the Apache2 license, only Prometheus-Slurm-Exporter is licensed under GPLv3. You should be aware of this license and accept its terms before proceeding and installing this component.

The dashboards

I demonstrate a few different Grafana dashboards in this post. These dashboards are available for you in the AWS Samples GitHub repository. In addition, two dashboards – still under development – are proposed in beta. The first shows the cluster logs coming from Amazon CloudWatch Logs. The second one shows the costs associated to each AWS service utilized.

All these dashboards can be used as they are or customized as you need:

  • AWS ParallelCluster Summary – This is the main dashboard that shows general monitoring info and metrics for the whole cluster. It includes: Slurm metrics, compute related metrics, storage performance metrics, and network usage.

ParallelCluster dashboard

  • Master Node Details – This dashboard shows detailed metrics for the head node, including CPU, memory, network, and storage utilization.

Master node details

  • Compute Node List – This dashboard shows the list of the available compute nodes. Each entry is a link to a more detailed dashboard: the compute node details (see the following image).

Compute node list

  • Compute Node Details – Similar to the head node details, this dashboard shows detailed metrics for the compute nodes.
  • Cluster Logs – This dashboard (still under development) shows all the logs of your HPC cluster. The logs are pushed by AWS ParallelCluster to Amazon CloudWatch Logs and are reported here.

Cluster logs

  • Cluster Costs (also under development) – This dashboard shows the cost associated to AWS Service utilized by your cluster. It includes: EC2, Amazon EBS, FSx, Amazon S3, Amazon EFS. as well as an aggregation of all the costs of every single component.

Cluster costs

How to deploy it

You can simply use the post-install script that you can find in this GitHub repo as it is, or customize it as you need. For instance, you might want to change your Grafana password to something more secure and meaningful for you, or you might want to customize some dashboards by adding additional components to monitor.

#Load AWS Parallelcluster environment variables
. /etc/parallelcluster/cfnconfig
#get GitHub repo to clone and the installation script
monitoring_url=$(echo ${cfn_postinstall_args}| cut -d ',' -f 1 )
monitoring_dir_name=$(echo ${cfn_postinstall_args}| cut -d ',' -f 2 )
setup_command=$(echo ${cfn_postinstall_args}| cut -d ',' -f 3 )
case ${cfn_node_type} in
        wget ${monitoring_url} -O ${monitoring_tarball}
        mkdir -p ${monitoring_home}
        tar xvf ${monitoring_tarball} -C ${monitoring_home} --strip-components 1
#Execute the monitoring installation script
bash -x "${monitoring_home}/parallelcluster-setup/${setup_command}" >/tmp/monitoring-setup.log 2>&1
exit $?

The proposed post-install script takes care of installing and configuring everything for you. Although a few additional parameters are needed in the AWS ParallelCluster configuration file: the post-install argument, additional IAM policies, custom security group, and a tag. You can find an AWS ParallelCluster template here.

Please note that, at the moment, the post install script has only been tested using Amazon Linux 2.

Add the following parameters to your AWS ParallelCluster configuration file, and then build up your cluster:

base_os = alinux2
post_install = s3://<my-bucket-name>/post-install.sh
post_install_args = https://github.com/aws-samples/aws-parallelcluster-monitoring/tarball/main,aws-parallelcluster-monitoring,install-monitoring.sh
additional_iam_policies = arn:aws:iam::aws:policy/CloudWatchFullAccess,arn:aws:iam::aws:policy/AWSPriceListServiceFullAccess,arn:aws:iam::aws:policy/AmazonSSMFullAccess,arn:aws:iam::aws:policy/AWSCloudFormationReadOnlyAccess
tags = {"Grafana" : "true"}

Make sure that port 80 and port 443 of your head node are accessible from the internet or from your network. You can achieve this by creating the appropriate security group via AWS Management Console or via Command Line Interface (AWS CLI), see the following example:

aws ec2 create-security-group --group-name my-grafana-sg --description "Open Grafana dashboard ports" —vpc-id vpc-1a2b3c4d
aws ec2 authorize-security-group-ingress --group-id sg-12345 --protocol tcp --port 443 —cidr
aws ec2 authorize-security-group-ingress --group-id sg-12345 --protocol tcp --port 80 —cidr

There is more information on how to create your security groups here.

Finally, set the additional_sg parameter in the [VPC] section of your AWS ParallelCluster configuration file.

After your cluster is created, you can open a web-browser and connect to https://your_public_ip. You should see a landing page with links to the Prometheus database service and the Grafana dashboards.

Size your compute and head nodes

We looked into resource utilization of the components required for building this monitoring solution. In particular, the Prometheus node exporter installed in the compute nodes uses a small (almost negligible) number of CPU cycles, memory, and network.

Depending on the size of your cluster the components installed on the head node (see the list in the chapter “Solution components” of this blog) it might require additional CPU, memory and network capabilities. In particular, if you expect to run a large-scale cluster (hundreds of instances) because of the higher volume of network traffic due to the compute nodes continuously pushing metrics into the head node, we recommend you use an instance type bigger than what you planned.

We cannot advise you exactly how to size your head node because there are many factors that can influence resource utilization. The best recommendation we could give you is to use the Grafana dashboard itself to monitor the CPU, memory, disk, and most importantly, network utilization, and then resize your head node (or other components) accordingly.


This monitoring solution for AWS ParallelCluster is an add-on for your HPC Cluster on AWS and a complement to new features recently released in AWS ParallelCluster 2.10. This blog aimed to provide you with instructions and basic tooling that can be customized based on your needs and that can be adapted quickly as the underlying AWS services evolve.

We are open to hear from you, receive feedback, issues, and pull requests to extend its functionalities.

Amazon EC2 P4d instances deep dive

Post Syndicated from Neelay Thaker original https://aws.amazon.com/blogs/compute/amazon-ec2-p4d-instances-deep-dive/

This post is contributed by Amr Ragab, Senior Solutions Architect, Amazon EC2


AWS is excited to announce that the new Amazon EC2 P4d instances are now generally available. This instance type brings additional benefits with 2.5x higher deep learning performance; adding to the accelerated instances portfolio, new features, and technical breakthroughs that our customers can benefit from with this latest technology. This blog post details some of those key features and how to integrate them into your current workloads and architectures.


P4d instances

As you can see from the generalized block diagram above, the p4d comes with dual socket Intel Cascade Lake 8275CL processors totaling 96 vCPUs at 3.0 GHz with 1.1 TB of RAM and 8 TB of NVMe local storage. P4d also comes with 8 x 40 GB NVIDIA Tesla A100 GPUs with NVSwitch and 400 Gbps Elastic Fabric Adapter (EFA) enabled networking. This instance configuration represents the latest generation of computing for our customers spanning Machine Learning (ML), High Performance Computing (HPC), and analytics.

One of the improvements of the p4d is in the networking stack.  This new instance type has 400 Gbps with support for EFA and GPUDirect RDMA. Now, on AWS, you can take advantage of point-to-point GPU to GPU communication (across nodes), bypassing the CPU. Look out for additional blogs and webinars detailing use cases of GPUDirect and how this feature helps decrease latency and improve performance for certain workloads.

Let’s look at some new features and performance metrics for the P4d instances.


Local ephemeral NVMe storage
The p4d instance type comes with 8 TB of local NVMe storage. Each device has a maximum read/write throughput of 2.7 GB/s. To create a local namespace and staging area for input into the GPUs, you can create a local RAID 0 of all the drives. This results in aggregate read throughput of about 16 GB/s. The following table summarizes the I/O tests on the NVMe drives in this configuration.

FIO – Test Block Size Threads Bandwidth
1 Sequential Read 128k 96 16.4 GiB/s
2 Sequential Write 128k 96 8.2 GiB/s
3 Random Read 128k 96 16.3 GiB/s
4 Random Write 128k 96 8.1 GiB/s


Introduced with the p4d instance type is NVSwitch. Every GPU in the node is connected to each other in a full mesh topology up to 600 GB/s bidirectional bandwidth. ML frameworks and HPC applications that use NVIDIA communication collectives library (NCCL) can take full advantage of this all-to-all communication layer.

P4d GPU to GPU bandwidth

P3 GPU to GPU bandwidth

P4d uses a full mesh NVLink topology for optimized all-to-all communication, compared to the previous generation P3/P3dn instances, which have all-to-all communication across various data path domains (NUMA, PCIe switch, NVLink).  This new topology accessed via NCCL will improve performance for multiGPU workloads.
To make optimal use of the NVSwitch ensure that in your instance, all GPUs application boost clocks are set to its maximum values:

sudo nvidia-smi -ac 1215,1410

Multi-Instance GPU (MIG)

It’s now possible, at the user level, to have control of fractionating a GPU into multiple GPU slices, with each GPU slice isolated from each other. This enables multiple users to run different workloads on the same GPU without impacting performance. I walk you through an example implementation of MIG in the following steps:

With every newly launched instance, MIG is disabled. So, you must enable it with the following command:

[email protected]:~# sudo nvidia-smi -mig 1 

Enabled MIG Mode for GPU 00000000:10:1C.0
You can get a list of supported MIG profiles.
Next, you can create seven slices, and create compute instances for each slice.
[email protected]:~# sudo nvidia-smi mig -cgi 19,19,19,19,19,19,19 
Successfully created GPU instance ID 9 on GPU 0 using profile MIG 1g.5gb (ID 19) 
Successfully created GPU instance ID 7 on GPU 0 using profile MIG 1g.5gb (ID 19) 
Successfully created GPU instance ID 8 on GPU 0 using profile MIG 1g.5gb (ID 19) 
Successfully created GPU instance ID 11 on GPU 0 using profile MIG 1g.5gb (ID 19) 
Successfully created GPU instance ID 12 on GPU 0 using profile MIG 1g.5gb (ID 19) 
Successfully created GPU instance ID 13 on GPU 0 using profile MIG 1g.5gb (ID 19) 
Successfully created GPU instance ID 14 on GPU 0 using profile MIG 1g.5gb (ID 19)
[email protected]:~# nvidia-smi mig -cci -gi 7,8,9,11,12,13,14 
Successfully created compute instance ID 0 on GPU 0 GPU instance ID 7 using profile MIG 1g.5gb (ID 0) 
Successfully created compute instance ID 0 on GPU 0 GPU instance ID 8 using profile MIG 1g.5gb (ID 0) 
Successfully created compute instance ID 0 on GPU 0 GPU instance ID 9 using profile MIG 1g.5gb (ID 0) 
Successfully created compute instance ID 0 on GPU 0 GPU instance ID 11 using profile MIG 1g.5gb (ID 0) 
Successfully created compute instance ID 0 on GPU 0 GPU instance ID 12 using profile MIG 1g.5gb (ID 0) 
Successfully created compute instance ID 0 on GPU 0 GPU instance ID 13 using profile MIG 1g.5gb (ID 0) 
Successfully created compute instance ID 0 on GPU 0 GPU instance ID 14 using profile MIG 1g.5gb (ID 0)

You can split a GPU into a maximum of seven slices. To pass the GPU through into a docker container, you can specify the index pair at runtime:

docker run -it --gpus '"device=1:0"' nvcr.io/nvidia/tensorflow:20.09-tf1-py3

With MIG, you can run multiple smaller workloads on the same GPU without compromising performance. We will follow up with additional blogs on this feature as we integrate it with additional AWS services.


For workloads optimized for multiGPU capabilities, we introduced GPUDirect over EFA fabric. This allows direct GPU-GPU communication across multiple p4d nodes for decreased latency and improved performance. Follow this user guide to get started with installing the EFA driver and setting up the environment. The code sample below can be used as a template to use GPUDirect RDMA over EFA.

/opt/amazon/openmpi/bin/mpirun \
     -n ${NUM_PROCS} -N ${NUM_PROCS_NODE} \
     -x RDMAV_FORK_SAFE=1 -x NCCL_DEBUG=info \
     --hostfile ${HOSTS_FILE} \
     --mca pml ^cm --mca btl tcp,self --mca btl_tcp_if_exclude lo,docker0 --bind-to none \
     $HOME/nccl-tests/build/all_reduce_perf -b 8 -e 4G -f 2 -g 1 -c 1 -n 100

Machine learning Optimizations

You can quickly get started with the all benefits mentioned earlier for the p4d by using our latest Deep Learning AMI (DLAMI). The DLAMI now comes with CUDA11 and the latest NVLink and cuDNN SDKs and drivers to take advantage of the p4d.

TensorFloat32 – TF32

TF32 is a new 19 bit precision datatype from NVIDIA introduced for the first time for the p4d.24xlarge instance. This datatype improves performance with little to no loss of training and validation accuracy for most mainstream models. We have more detailed benchmarks for individual algorithms. But, on the p4d.24xlarge you can achieve approximately a 2.5 fold increase compared to FP32 on the p3dn.24xlarge for mainstream deep learning models.

We have updated our machine learning models here to show examples (see the following chart) of popular algorithms our customers are using today including general DNNs and Bert.

DNN P3dn FP32 (imgs/sec) P3dn FP16 (imgs/sec) P4d Throughput TF32 (imgs/sec) P4d Throughput FP16 (imgs/sec) P4d over p3dn TF32/FP32 P4d over P3dn FP16
1 Resnet50 3057 7413 6841 15621 2.2 2.1
2 Resnet152 1145 2644 2823 5700 2.5 2.2
3 Inception3 2010 4969 4808 10433 2.4 2.1
4 Inception4 847 1778 2025 3811 2.4 2.1
5 VGG16 1202 2092 4532 7240 3.8 3.5
6 Alexnet 32198 50708 82192 133068 2.6 2.6
7 SSD300 1554 2918 3467 6016 2.2 2.1

BERT Large – Wikipedia/Books Corpus

GPUs Sequence Length Batch size / GPU: mixed precision, TF32 Gradient Accumulation: mixed precision, TF32 Throughput – mixed precision
1 1 128 64,64 1024,1024 372
2 4 128 64,64 256,256 1493
3 8 128 64,64 128,128 2936
4 1 512 16,8 2048,4096 77
5 4 512 16,8 512,1024 303
6 8 512 16,8 256,512 596

You can find other code examples at github.com/NVIDIA/DeepLearningExamples.

If you want to builld your own AMI or extend an AMI maintained by your organization you can use the github repo, which provides Packer scripts to build AMIs for Amazon Linux 2 or Ubuntu 18.04 versions.


The stack includes the following components:

  • NVIDIA Driver 450.80.02
  • CUDA 11
  • NVIDIA Fabric Manager
  • cuDNN 8
  • NCCL 2.7.8
  • EFA latest driver
  • FSx kernel and client driver and utilities
  • Intel OneDNN
  • NVIDIA-runtime Docker


Get started with the new P4d instances with support on Amazon EKS, AWS Batch, and Amazon Sagemaker. We are excited to hear about what you develop and run with the new P4d instances. If you have any questions please reach out to your account team. Now, go power up your ML and HPC workloads with NVIDIA Tesla A100s and the P4d instances.

New – GPU-Equipped EC2 P4 Instances for Machine Learning & HPC

Post Syndicated from Jeff Barr original https://aws.amazon.com/blogs/aws/new-gpu-equipped-ec2-p4-instances-for-machine-learning-hpc/

The Amazon EC2 team has been providing our customers with GPU-equipped instances for nearly a decade. The first-generation Cluster GPU instances were launched in late 2010, followed by the G2 (2013), P2 (2016), P3 (2017), G3 (2017), P3dn (2018), and G4 (2019) instances. Each successive generation incorporates increasingly-capable GPUs, along with enough CPU power, memory, […]

Introducing retry strategies for AWS Batch

Post Syndicated from Emma White original https://aws.amazon.com/blogs/compute/introducing-retry-strategies-for-aws-batch/

This post is contributed by Christian Kniep, Sr. Developer Advocate, HPC and AWS Batch.

Scientists, researchers, and engineers are using AWS Batch to run workloads reliably at scale, and to offload the undifferentiated heavy lifting in their day-to-day work. But even with a slight chance of failure in the stack, the act of mitigating these failures reminds customers that infrastructure, middleware and software are not error proof.

Many customers use Amazon EC2 Spot Instances to save up to 90% on their computing cost by leveraging unused EC2 capacity. If unused EC2 capacity is unavailable, an EC2 Spot Instance can be reclaimed by EC2. While AWS Batch takes care of rescheduling the job on a different instance, this rescheduling should not be handled differently depending on whether it is an application failure or some infrastructure event interrupting the job.

Starting today, customers can define how many retries are performed in cases where a task does not finish correctly. AWS Batch now allows customers define custom retry conditions, so that failures like an interruption of an instance or an infrastructure agent are handled differently, and do not just exhaust the number of retries attempted.

In this blog, I show the benefits of custom retry with AWS Batch by using different error codes from a job to control whether it should be retried. I will also demonstrate how to handle infrastructure events like a failing container image download, or an EC2 Spot interruption.

Example setup

To showcase this new feature, I use the AWS Command Line Interface (AWS CLI) to set up the following:

  1. IAMroles, policies, and profiles to grant access and permissions
  2. A compute environment (CE) to provide the compute resources to run jobs
  3. A job queue, which supervises the job execution and schedules jobs on the CE
  4. Job definitions with different retry strategies,which use a simple job to demonstrate how the new configuration can be applied

Once those tasks are completed, I submit jobs to show how you can handle different scenarios, such as infrastructure failure, application handling via error code or a middleware event.


To make things easier, I first set up a couple of environment variables to have the information available for later use. I use the following code to set up the environment variables:

# in case it is not already installed
sudo yum install -y jq 
export MD_URL=
export IFACE=$(curl -s ${MD_URL}/network/interfaces/macs/)
export SUBNET_ID=$(curl -s ${MD_URL}/network/interfaces/macs/${IFACE}/subnet-id)
export VPC_ID=$(curl -s ${MD_URL}/network/interfaces/macs/${IFACE}/vpc-id)
export AWS_REGION=$(curl -s ${MD_URL}/placement/availability-zone | sed 's/[a-z]$//')
export AWS_ACCT_ID=$(curl -s ${MD_URL}/identity-credentials/ec2/info |jq -r .AccountId)
export AWS_SG_DEFAULT=$(aws ec2 describe-security-groups \
--filters Name=group-name,Values=default \
|jq -r '.SecurityGroups[0].GroupId')


When using the AWS Management Console, I must create IAM roles manually.

Trust policies

IAM roles are defined to be used by an individual service. In the simplest case, I want a role to be used by Amazon EC2 – the service that provides the compute capacity in the cloud. The definition of which entity is able to use an IAM role is called a Trust Policy. To set up a Trust Policy for an IAM role, I use the following code snippet:

cat > ec2-trust-policy.json << EOF
  "Version": "2012-10-17",
  "Statement": [{
    "Effect": "Allow",
    "Principal": {
      "Service": "ec2.amazonaws.com"
    "Action": "sts:AssumeRole"

Instance role

With the IAM trust policy, I can now create an ecsInstanceRole and attach the pre-defined policy AmazonEC2ContainerServiceforEC2Role. This allows an instance to interact with Amazon ECS.

aws iam create-role --role-name ecsInstanceRole \
 --assume-role-policy-document file://ec2-trust-policy.json
aws iam create-instance-profile --instance-profile-name ecsInstanceProfile
aws iam add-role-to-instance-profile \
    --instance-profile-name ecsInstanceProfile \
    --role-name ecsInstanceRole
aws iam attach-role-policy --role-name ecsInstanceRole \
 --policy-arn arn:aws:iam::aws:policy/service-role/AmazonEC2ContainerServiceforEC2Role

Service role

The AWS Batch service uses a role to interact with different services. The trust relationship reflects that the AWS Batch service is going to assume this role. I can set up this role with the following logic:

cat > svc-trust-policy.json << EOF
  "Version": "2012-10-17",
  "Statement": [{
    "Effect": "Allow",
    "Principal": {
      "Service": "batch.amazonaws.com"
    "Action": "sts:AssumeRole"
aws iam create-role --role-name AWSBatchServiceRole \
--assume-role-policy-document file://svc-trust-policy.json
aws iam attach-role-policy --role-name AWSBatchServiceRole \
--policy-arn arn:aws:iam::aws:policy/service-role/AWSBatchServiceRole

At this point, I have created the IAM roles and policies so that the instances and services are able to interact with the AWS API operations, including trust policies to define which services are meant to use them. EC2 for the ecsInstanceRole and the AWSBatchServiceRole for the AWS Batch service itself.

Compute environment

Now, I am going to create a CE, which will launch instances to run the example jobs.

cat > compute-environment.json << EOF
  "computeEnvironmentName": "compute-0",
  "type": "MANAGED",
  "state": "ENABLED",
  "computeResources": {
    "type": "SPOT",
    "allocationStrategy": "SPOT_CAPACITY_OPTIMIZED",
    "minvCpus": 2,
    "maxvCpus": 32,
    "desiredvCpus": 4,
    "instanceTypes": [ "m5.xlarge","m5.2xlarge","m4.xlarge","m4.2xlarge","m5a.xlarge","m5a.2xlarge"],
    "subnets": ["${SUBNET_ID}"],
    "securityGroupIds": ["${AWS_SG_DEFAULT}"],
    "instanceRole": "arn:aws:iam::${AWS_ACCT_ID}:instance-profile/ecsInstanceRole",
    "tags": {"Name": "aws-batch-instances"},
    "ec2KeyPair": "batch-ssh-key",
    "bidPercentage": 0
  "serviceRole": "arn:aws:iam::${AWS_ACCT_ID}:role/AWSBatchServiceRole"
aws batch create-compute-environment --cli-input-json file:// compute-environment.json 

Once this is complete, my compute environment begins to launch instances. This takes a few minutes. I can use the following command to check on the status of the compute environment whenever I want:

aws batch describe-compute-environments |jq '.computeEnvironments[] |select(.computeEnvironmentName=="compute-0")'

The command uses jq to filter the output to only show the compute environment I just created.

Job queue

Now that I have my compute environment up and running, I can create a job queue, which accepts job submissions and schedules the jobs to the compute environment.

cat > job-queue.json << EOF
  "jobQueueName": "queue-0",
  "state": "ENABLED",
  "priority": 1,
  "computeEnvironmentOrder": [{
    "order": 0,
    "computeEnvironment": "compute-0"
aws batch create-job-queue --cli-input-json file://job-queue.json

Job definition

The job definition is used as a template for jobs. It is referenced in a job submission to specify the defaults of a job configuration, while some of the parameters can be overwritten when you submit.

Within the job definition, different retry strategies can be configured along with a maximum number of attempts for the job.
Three possible conditions can be used:

  • onExitCode will evaluate non-zero exit codes
  • onReason matched against middleware errors
  • onStatusReason can be used to react to infrastructure events such as an instance termination

Different conditions are assigned an action to either EXIT or RETRY the job. Important to note, that a job finishing with an exit code of zero will EXIT the job and not evaluate the retry condition. The default behavior for all non-zero exit code is the following:

  "onExitCode" : ""
  "onStatusReason" : ""
  "onReason" : "*"
  "action": retry

This condition retries every job that does not succeed (exit code 0) until the attempts are exhausted.

Spot Instance interruptions

AWS Batch works great with Spot Instances and customers are using this to reduce their compute cost. If Spot Instances become unavailable, instances are reclaimed by EC2, which can lead to one or more of my hosts being shut down. When this happens, the jobs running on those hosts are shut down due to an infrastructure event, not an application failure. Previously, separating these kinds of events from one another was only possible by catching the notification on the instance itself or through CloudWatch Events. Now with customer retry, you don’t have to rely on instance notifications or CloudWatch Events.

Using the job definition below, the job is restarted if the instance running the job gets shut down, which includes the termination due to a Spot Instance reclaim. The additional condition makes sure that the job exits whenever the exit code is not zero, otherwise the job would be rescheduled until the attempts are exhausted (see default behavior above).

cat > jdef-spot..json << EOF
    "jobDefinitionName": "spot",
    "type": "container",
    "containerProperties": {
        "image": "alpine:latest",
        "vcpus": 2,
        "memory": 256,
        "command":  ["sleep","600"],
        "readonlyRootFilesystem": false
    "retryStrategy": { 
        "attempts": 5,
            "onStatusReason" :"Host EC2*",
            "action": "RETRY"
  		  "onReason" : "*"
            "action": "EXIT"
aws batch register-job-definition --cli-input-json file://jdef-spot.json

To simulate a Spot Instances reclaim, I submit a job, and manually shut down the host the job is running on. This triggers my condition to ask AWS Batch to make 5 attempts to finish the job before it marks the job a failure.

When I use the AWS CLI to describe my job, it displays the number of attempts to retry.

By shutting down my instance, the job returns to the status RUNNABLE and will be scheduled again until it succeeds or reaches the maximum attempts defined.

Exit code mitigation

I can also use the exit code to decide which mitigation I want to use based on the exit code of the job script or application itself.

To illustrate this, I can create a new job definition that uses a container image that exits on a random exit code between 0 and 3. Traditionally, an exit code of 0 means success, and won’t trigger this retry strategy. For all other (nonzero) exit codes the retry strategy is evaluated. In my example, 1 or 2 reflect situations where a retry is needed, but an exit code of 3 means that AWS Batch should let the job fail.

cat > jdef-randomEC.json << EOF
    "jobDefinitionName": "randomEC",
    "type": "container",
    "containerProperties": {
        "image": "qnib/random-ec:2020-10-13.3",
        "vcpus": 2,
        "memory": 256,
        "readonlyRootFilesystem": false
    "retryStrategy": { 
        "attempts": 10,
            "onExitCode": "1",
            "action": "RETRY"
            "onExitCode": "2",
            "action": "RETRY"
            "onExitCode": "3",
            "action": "EXIT"
aws batch register-job-definition --cli-input-json file://jdef-randomEC.json

A submitted job retries until the exit code 0 is successful, 3 for a failure or the attempts are exhausted (in this case, 10 of them).

aws batch submit-job  --job-name randomEC-$(date +"%F_%H-%M-%S") --job-queue queue-0   --job-definition randomEC:1

The output of a job submission shows the job name and the job id.

In case the exit code is 1, and the job will be requeued.

Container image pull failure

The first example showed an error on the infrastructure layer and the second showed how to handle errors on the application layer. In this last example, I show how to handle errors that are introduced in the middleware layer, in this case: the container daemon.

It might happen if your Docker registry is down or having issues. To demonstrate this, I used an image name that is not present in the registry. In that case, the job should not get rescheduled to fail again immediately.

The following job definition again defines 10 attempts for a job, except when the container cannot be pulled. This leads to a direct failure of the job.

cat > jdef-noContainer.json << EOF
    "jobDefinitionName": "noContainer",
    "type": "container",
    "containerProperties": {
        "image": "no-container-image",
        "vcpus": 2,
        "memory": 256,
        "readonlyRootFilesystem": false
    "retryStrategy": { 
        "attempts": 10,
            "onReason": "CannotPullContainerError:*",
            "action": "EXIT"
aws batch register-job-definition --cli-input-json file://jdef-noContainer.json

Note that the job defines an image name (“no-container-image”) which is not present in the registry. The job is set up to fail when trying to download the image, and will do so repeatedly, if AWS Batch keeps trying.

Even though the job definition has 10 attempts configured for this job, it fell straight through to FAILED as the retry strategy sets the action exit when a CannotPullContainerError occurs. Many of the error codes I can create conditions for are documented in the Amazon ECS user guide (e.g. task error codes / container pull error).


In this blog post, I showed three different scenarios that leverage the new custom retry features in AWS Batch to control when a job should exit or get rescheduled.

By defining retry strategies you can react to an infrastructure event (like an EC2 Spot interruption), an application signal (via the exit code), or an event within the middleware (like a container image not being available).

This new feature allows you to have fine grained control over how your jobs react to different error scenarios.

Fire Dynamics Simulation CFD workflow using AWS ParallelCluster, Elastic Fabric Adapter, Amazon FSx for Lustre and NICE DCV

Post Syndicated from Emma White original https://aws.amazon.com/blogs/compute/fire-dynamics-simulation-cfd-workflow-using-aws-parallelcluster-elastic-fabric-adapter-amazon-fsx-for-lustre-and-nice-dcv/

This post was written by By Kevin Tuil, AWS HPC consultant 

Modeling fires is key for many industries, from the design of new buildings, defining evacuation procedures for trains, planes and ships, and even the spread of wildfires. Modeling these fires is complex. It involves both the need to model the three-dimensional unsteady turbulent flow of the fire and the many potential chemical reactions. To achieve this, the fire modeling community has moved to higher-fidelity turbulence modeling approaches such as the Large Eddy Simulation, which requires both significant temporal and spatial resolution. It means that the computational cost for these simulations is typically in the order of days to weeks on a single workstation.
While there are a number of software packages, one of the most popular is the open-source code: Fire Dynamics Simulation (FDS) developed by National Institute of Standards and Technology (NIST).

In this blog, I focus on how AWS High Performance Computing (HPC) resources (e.g AWS ParallelCluster, Amazon FSx for Lustre, Elastic Fabric Adapter (EFA), and Amazon S3) allow FDS users to scale up beyond a single workstation to hundreds of cores to achieve simulation times of hours rather than days or weeks. In this blog, I outline the architecture needed, providing scripts and templates to compile FDS and run your simulation.

Service and solution overview

AWS ParallelCluster

AWS ParallelCluster is an open source cluster management tool that simplifies deploying and managing HPC clusters with Amazon FSx for Lustre, EFA, a variety of job schedulers, and the MPI library of your choice. AWS ParallelCluster simplifies cluster orchestration on AWS so that HPC environments become easy-to-use, even if you are new to the cloud. AWS released AWS ParallelCluster 2.9.1 and its user guide – which is the version I use in this blog.

These three AWS HPC resources are optimal for Fire Dynamics Simulation. Together, they provide easy deployment of HPC systems on AWS, low latency network communication for MPI workloads, and a fast, parallel file system.

Elastic Fabric Adapter

EFA is a critical service that provides low latency and high-bandwidth 100 Gbps network communication. EFA allows applications to scale at the level of on-premises HPC clusters with the on-demand elasticity and flexibility of the AWS Cloud. Computational Fluid Dynamics (CFD), among other tightly coupled applications, is an excellent candidate for the use of EFA.

Amazon FSx for Lustre

Amazon FSx for Lustre is a fully managed, high-performance file system, optimized for fast processing workloads, like HPC. Amazon FSx for Lustre allows users to access and alter data from either Amazon S3 or on-premises seamlessly and exceptionally fast. For example, you can launch and run a file system that provides sub-millisecond latency access to your data. Additionally, you can read and write data at speeds of up to hundreds of gigabytes per second of throughput, and millions of IOPS. This speed and low-latency unleash innovation at an unparalleled pace. This blog post uses the latest version of Amazon FSx for Lustre, which recently added a new API for moving data in and out of Amazon S3. This API also includes POSIX support, which allows files to mount with the same user id. Additionally, the latest version also includes a new backup feature that allows you to back up your files to an S3 bucket.

Solution and steps

The overall solution that I deploy in this blog is represented in the following diagram:

solution overview diagram

Step 1: Access to AWS Cloud9 terminal and upload data

There are two ways to start using AWS ParallelCluster. You can either install AWS CLI or turn on AWS Cloud9, which is a cloud-based integrated development environment (IDE) that includes a terminal. For simplicity, I use AWS Cloud9 to create the HPC cluster. Please refer to this link to proceed to AWS Cloud9 set up and to this link for AWS CLI setup.

Once logged into your AWS Cloud9 instance, the first thing you want to create is the S3 bucket. This bucket is key to exchange user data in and out from the corporate data center and the AWS HPC cluster. Please make sure that your bucket name is unique globally, meaning there is only one worldwide across all AWS Regions.

aws s3 mb s3://fds-smv-bucket-unique
make_bucket: fds-smv-bucket-unique

Download the latest FDS-SMV Linux version package from the official NIST website. It looks something like: FDS6.7.4_SMV6.7.14_lnx.sh

For the geometry, it should be renamed to “geometry.fds”, and must be uploaded to your AWS Cloud9 or directly to your S3 bucket.

Please note that once the FDS-SMV package has been downloaded locally to the instance, you must upload it to the S3 bucket using the following command.

aws s3 cp FDS6.7.4_SMV6.7.14_lnx.sh s3://fds-smv-bucket-unique
aws s3 cp geometry.fds s3://fds-smv-bucket-unique

You use the same S3 bucket to install FDS-SMV later on with the Amazon FSx for Lustre File System.

Step 2: Set up AWS ParallelCluster

You can install AWS ParallelCluster running the following command from your AWS Cloud9 instance:

sudo pip install aws-parallelcluster

Once it is installed, you can run the following command to check the version:

pcluster version 

At the time of writing this blog, 2.9.1 is the most up-to-date version.

Then use the text editor of your choice and open the configuration file as follows:

vim ~/.parallelcluster/config

Replace the bolded section, if not yet filled in, by your own information and save the configuration file.

aws_region_name = <AWS-REGION>

sanity_check = true
cluster_template = fds-smv-cluster
update_check = true

[vpc public]
vpc_id = vpc-<VPC-ID>
master_subnet_id = subnet-<SUBNET-ID>

[cluster fds-smv-cluster]
key_name = <Key-Name>
vpc_settings = public
initial_queue_size = 0
max_queue_size = 100
cluster_type = ondemand
placement_group = DYNAMIC
placement = compute
base_os = alinux2
tags = {"Name" : "fds-smv"}
disable_hyperthreading = true
fsx_settings = fsxshared
enable_efa = compute
dcv_settings = hpc-dcv

[dcv hpc-dcv]
enable = master

[fsx fsxshared]
shared_dir = /fsx
storage_capacity = 1200
import_path = s3://fds-smv-bucket-unique
imported_file_chunk_size = 1024
export_path = s3://fds-smv-bucket-unique

ssh = ssh {CFN_USER}@{MASTER_IP} {ARGS}

Let’s review the different sections of the configuration file and explain their role:

  • scheduler: Supported job schedulers are SGE, TORQUE, SLURM and AWS Batch. I have selected SLURM for this example.
  • cluster_type: You have the choice between On-Demand (ondemand) or Spot Instances (spot) for your compute instances. For On-Demand, instances are available for use without condition (if available in the Region selected) at a certain price per hour with the pay-as-you-go model, meaning that as soon as they are started, they are reserved for your utilization. For Spot Instances, you can take advantage of unused EC2 capacity in the AWS Cloud. Spot Instances are available at up to a 90% discount compared to On-Demand Instance prices. You can use Spot Instances for various stateless, fault-tolerant, or flexible applications such as HPC, for more information about Spot Instances, feel free to visit this webpage.
  • s3_read_write_resource: This parameter allows you to read and write objects directly on your S3 bucket from the cluster you created without additional permissions. It acts as a role for your cluster, allowing you access to your specified S3 bucket.  
  • placement_groupUse DYNAMIC to ensure that your instances are located as physically close to one another as possible. Close placement minimizes the latency between compute nodes and takes advantage of EFA’s low latency networking.
  • placement: By selecting compute you only enforce compute instances to be placed within the same placement group, leaving the head node placement free.
  • compute_instance_type:Select C5n.18xlarge because it is optimized for compute-intensive workloads and supports EFA for better scaling of HPC applications. Note that EFA is supported only for specific instance types. Please visit currently supported instances for more information.
  • master_instance_type:This can be any instance type. As traffic between head and compute nodes is relatively small, and the head node runs during the entire lifetime of the cluster, I use c5.xlarge because it is inexpensive and is a good fit for this use case.
  • initial_queue_size:You start with no compute instances after the HPC cluster is up. This means that any new job submitted has some delay (time for the nodes to be powered on) before they are seen as available by the job scheduler. This helps you pay for what you use and keeps costs as low as possible.
  • max_queue_size:Limit the maximum compute fleet to 100 instances. This allows you room to scale your jobs up to a large number of cores, while putting a limit on the number of compute nodes to help control costs.
  • base_osFor this blog, select Amazon Linux 2 (alinux2) as a base OS. Currently we also support Amazon Linux (alinux), CentOS 7 (centos7), Ubuntu 16.04 (ubuntu1604), and Ubuntu 18.04 (ubuntu1804) with EFA.
  • disable_hyperthreading: This setting turns off hyperthreading (true) on your cluster, which is the right configuration in this use case.[fsx fsxshared]: This section contains the settings to define your FSx for Lustre parallel file system, including the location where the shared directory is mounted, the storage capacity for the file system, the chunk size for files to be imported, and the location from which the data will be imported. You can read more about FSx for Lustre here.
  • enable_efa: Mark as (true) in this use case since it is a tightly coupled CFD simulation use case.
  • dcv_settings:With AWS ParallelCluster, you can use NICE DCV to support your remote visualization needs.
  • [dcv hpc-dcv]:This section contains the settings to define your remote visualization setup. You can read more about DCV with AWS ParallelCluster here.
  • import_path: This parameter enables all the objects on the S3 bucket available when creating the cluster to be seen directly from the FSx for Lustre file system. In this case, you are able to access the FDS-SMV package and the geometry under the /fsx mounted folder.
  • export_path: This parameter is useful for backup purposes using the Data Repository Tasks. I share more details about this in step 7 (optional).

Step 3: Create the HPC cluster and log in

Now, you can create the HPC cluster, named fds-smv. It takes around 10 minutes to complete and you can see the status changing (going through the different AWS CloudFormation template steps). At the end of creation, two IP addresses are prompted, a public IP and/or a private IP depending on your network choice.

pcluster create fds-smv
Creating stack named: parallelcluster-fds-smv
Status: parallelcluster-fds-smv - CREATE_COMPLETE                               
MasterPublicIP: X.X.X.X
ClusterUser: ec2-user
MasterPrivateIP: X.X.X.X

In order to log in, you must use the key you specified in the AWS ParallelCluster configuration file before creating the cluster:

pcluster ssh fds-smv -i <Key-Name>

You should now be logged in as an ec2-user (since we are using Amazon Linux 2 base OS).

Step 4: Install FDS-SMV package

Now that the HPC cluster using AWS ParallelCluster is set up, it is time to install the FDS-SMV package.  In the prior steps, you uploaded both the FDS-SMV package and the geometry to your S3 bucket. Since you enabled “import_path” to that bucket, they are already available on the Amazon FSx for Lustre storage under /fsx.

Run the script as follows and select /fsx/fds-smv as final target for installation:

cd /fsx
[[email protected] fsx]$ ./FDS6.7.4_SMV6.7.14_lnx.sh 

Installing FDS and Smokeview  for Linux

  1) Press <Enter> to begin installation [default]
  2) Type "extract" to copy the installation files to:

FDS install options:
  Press 1 to install in /home/ec2-user/FDS/FDS6 [default]
  Press 2 to install in /opt/FDS/FDS6
  Press 3 to install in /usr/local/bin/FDS/FDS6
  Enter a directory path to install elsewhere

It is important to source the following scripts as part of the installed packages to check if the installation is successful with the correct versions. Here is the correct output you should get:

[[email protected] ~]$ source /fsx/fds-smv/bin/SMV6VARS.sh 
[[email protected] ~]$ source /fsx/fds-smv/bin/FDS6VARS.sh 
[[email protected] ~]$ fds -version
FDS revision       : FDS6.7.4-0-gbfaa110-release
MPI library version: Intel(R) MPI Library 2019 Update 4 for Linux* OS

[[email protected] ~]$ smokeview -version

Smokeview  SMV6.7.14-0-g568693b-release - Mar  9 2020

Revision         : SMV6.7.14-0-g568693b-release
Revision Date    : Wed Mar 4 23:13:42 2020 -0500
Compilation Date : Mar  9 2020 16:31:22
Compiler         : Intel C/C++
Checksum(SHA1)   : e801eace7c6597dc187739e51ba6f546bfde4e48
Platform         : LINUX64

Important notes:

The way FDS-SMV package has been installed is the default installation. Binaries are already compiled and Intel MPI libraries are embedded as part of the installation package. It is what one would call a self-contained application. For further builds and source codes, please visit this webpage.

Step 5: Running the fire dynamics simulation using FDS

Now that everything is installed, it is time to create the SLURM submission script. In this step, you take advantage of the FSx for Lustre File System, the compute-optimized instance, and the EFA network to maximize simulation performance.

cd /fsx/
vi fds-smv.sbatch

Here is the information you should specify in your submission script:

#SBATCH --job-name=fds-smv-job
#SBATCH --ntasks=<Total number of MPI processes>
#SBATCH --ntasks-per-node=36
#SBATCH --output=%x_%j.out

source /fsx/fds-smv/bin/FDS6VARS.sh
source /fsx/fds-smv/bin/SMV6VARS.sh

module load intelmpi 

export I_MPI_PIN_DOMAIN=omp

cd /fsx/<results>

time mpirun -ppn 36 -np <Total number of MPI processes>  fds geometry.fds

Replace the <results> with the one of your choice, and don’t forget to copy the geometry.fds file in it before submitting your job. Once ready, save the file and submit the job using the following command:

sbatch fds-smv.sbatch 

If you decided to build your HPC cluster with c5n.18xlarge instances, the number of MPI processes per node is 36 since you turned off the hyperthreading, and that the instance has 36 physical cores. That is the meaning of the “#SBATCH --ntasks-per-node=36” line.

For any run exceeding 36 MPI processes, the job is split among multiple instances and take advantage of EFA for internode communication.

It is important to note that FDS only allows the number of MPI processes to be equal to the number of meshes in the input geometry (geometry.fds in this scenario). In case the number of meshes in the input geometry cannot be modified, OpenMP threads can be enabled and efficiently increase performance. Do this using up to four OpenMP Threads across four CPU cores attached to one MPI process.

Please read best practices provided by NIST for that topic on their user guide.

In order to take advantage of the distributed computing capability of FDS, it is mandatory to work first on the input geometry, and divide it into the appropriate number of meshes. It is also highly advised to evenly distribute the number of cells/elements per mesh across all meshes. This best practice optimizes the load balancing for each CPU core.

Step 6: Visualizing the results using NICE DCV and SMV

In order to visualize results, you must connect to the head node using NICE DCV streaming protocol.

As a reminder, the current instance type for the head node is a c5.xlarge, which is not a graphics-accelerated instance. For heavy and GPU intensive visualization, it is important to set up a more appropriate instance such as the G4 instance group.

Go back to your AWS Cloud9 instance, open a new terminal side by side to your session connected to your AWS HPC cluster, and enter the following command in the terminal:

pcluster dcv connect fds-smv -k <Key-Name>

You are provided a one-time HTTPS URL available for a short period of time in order to connect to your head node using the NICE DCV protocol.

Once connected, open the terminal inside your session and source the FDS-SMV scripts as before:

source /fsx/fds-smv/bin/FDS6VARS.sh
source /fsx/fds-smv/bin/SMV6VARS.sh

Navigate to your <results> folder and start SMV with your result.

I have selected one of the geometries named fire_whirl_pool.fds in the Examples folder, part of the default FDS-SMV installation package located here:


You can find other scenarios under the Examples folder to run some more use cases if you did not already choose your geometry.fds file.

Now you can run SMV and visualize your results:

smokeview fire_whirl_pool.smv

SMV (smokeview) takes as an input .smv extension files, please replace with your appropriate file. If you have already chosen your geometry.fds, then run the following command:

smokeview geometry.smv

The application then open as follows, and you can visualize the results. The following image is an output of the SOOT DENSITY of the 3D smoke.

fire simulation picture

Step 7 (optional): Back up your FDS-SMV results to an S3 bucket

First update the AWS CLI to its most recent version. It is compatible with 1.16.309 and above.

After running your FDS-SMV simulation, you can back up your data in /fsx to the S3 bucket you used earlier to upload the installation package, and input files using Data Repository Tasks.

Data Repository Tasks represent bulk operations between your Amazon FSx for Lustre file system and your S3 bucket. One of the jobs is to export your changed file system contents back to its linked S3 bucket.

Open your AWS Cloud9 terminal and exit the HPC head node cluster. Retrieve your Amazon FSx for Lustre ID using:

aws fsx describe-file-systems

It looks something like, fs-0533eebf1148fc8dd. Then create a backup of the data as follows:

aws fsx create-data-repository-task --file-system-id fs-0533eebf1148fc8dd --type EXPORT_TO_REPOSITORY --paths results --report Enabled=true,Scope=FAILED_FILES_ONLY,Format=REPORT_CSV_20191124,Path=s3://fds-smv-bucket-unique/

The following are definitions about the command parameters:

  • file-system-id: Your file system ID.
  • type EXPORT_TO_REPOSITORY: Exports the data back to the S3 bucket.
  • paths results: The directory you want to export to your S3 bucket. If you have more than one folder to back up, use a comma-separated notation such as: results1,results2,…
  • Format=REPORT_CSV_20191124: Note this is only the name the Amazon FSx Lustre supports. Please keep it the same.

You can check the backup status by running:

aws fsx describe-data-repository-tasks

Please wait for the copy to be achieved, once finished you should see on the Lifecycle line "Lifecycle": "SUCCEEDED"

Also go back to your S3 bucket, and your folder(s) should appear with all the files correctly uploaded from your /fsx folder you specified.

In terms of data management, Amazon S3 is an important service. You started by uploading installation package and geometry files from an external source, such as your laptop or an on-premises system. Then made these files available to the AWS HPC cluster under the Amazon FSx for Lustre file system and ran the simulation. Finally, you backed up the results from the Amazon FSx for Lustre to Amazon S3. You can also decide to download the results on Amazon S3 back to your local system if needed.

Step 8: Delete your AWS resources created during the deployment of this blog

After your run is completed and your data backed up successfully (Step 7 is optional) on your S3 bucket, you can then delete your cluster by using the following command in your Cloud9 terminal:

pcluster delete fds-smv


If you run the command above all resources you created during this blog are automatically deleted beside your Cloud9 session and your data on your S3 bucket you created earlier.

Your S3 bucket still contains your input “geometry.fds” and your installation package “FDS6.7.4_SMV6.7.14_lnx.sh” files.

If you selected to back up your data during Step 7 (optional), then your S3 bucket also contains that data on top of the two previous files mentioned above.

If you want to delete your S3 bucket and all data mentioned above, go to your AWS Management Console, select S3 service then select your S3 bucket and hit delete on the top section.

If you want to terminate your Cloud9 session, go to your AWS Management Console, select Cloud9 service then select your session and hit delete on the top right section.

After performing these operations, there will be no more resources running on AWS related to this blog.


I showed that AWS ParallelCluster, Amazon FSx for Lustre, EFA, and Amazon S3 are key AWS services and features for HPC workloads such as CFD and in particular for FDS.

You can achieve simulation times of hours on AWS rather than days or weeks on a single workstation.

Please visit this workshop  for a more in-depth tutorial on running Fire Dynamics Simulation on AWS and our HPC dedicated homepage.


How to run 3D interactive applications with NICE DCV in AWS Batch

Post Syndicated from Ben Peven original https://aws.amazon.com/blogs/compute/how-to-run-3d-interactive-applications-with-nice-dcv-in-aws-batch/

This post is contributed by Alberto Falzone, Consultant, HPC and Roberto Meda, Senior Consultant, HPC.

High Performance Computing (HPC) workflows across industry verticals such as Design and Engineering, Oil and Gas, and Life Sciences often require GPU-based 3D/OpenGL rendering. Setting up drivers and applications for these types of workflows can require significant effort.

Similar GPU intensive workloads, such as AI/ML, are heavily using containers to package software stacks and reduce the complexity of installing and setting up the required binaries and scripts to download and run a simple container image. This approach is rarely used in the visualization of previously mentioned pre- and post-processing steps due to the complexity of using a graphical user interface within a container.

This post describes how to reduce the complexity of installing and configuring a GPU accelerated application while maintaining performance by using NICE DCV. NICE DCV is a high-performance remote display protocol that provides customers with a secure way to deliver remote desktops and application streaming from any cloud or data center to any device, over varying network conditions.

With remote server-side graphical rendering, and optimized streaming technology over network, huge volume data can be analyzed easily without moving or downloading on client, saving on data transfer costs.

Services and solution overview

This post provides a step-by-step guide on how to build a container able to run accelerated graphical applications using NICE DCV, and setup AWS Batch to run it. Finally, I will showcase how to submit an AWS Batch job that will provision the compute environment (CE) that contains a set of managed or unmanaged compute resources that are used to run jobs, launch the application in a container, and how to connect to the application with NICE DCV.


Before reviewing the solution, below are the AWS services and products you will use to run your application:

  • AWS Batch (AWS Batch) plans, schedules, and runs batch workloads on Amazon Elastic Container Service (ECS), dynamically provisioning the defined CE with Amazon EC2
  • Amazon Elastic Container Registry (Amazon ECR) is a fully managed Docker container registry that simplifies how developers store, manage, and deploy Docker container images. In this example, you use it to register the Docker image with all the required software stack that will be used from AWS Batch to submit batch jobs.
  • NICE DCV (NICE DCV) is a high-performance remote display protocol that delivers remote desktops and application streaming from any cloud or data center to any device, over varying network conditions. With NICE DCV and Amazon EC2, customers can run graphics-intensive applications remotely on G3/G4 EC2 instances, and stream the results to client machines not provided with a GPU.
  • AWS Secrets Manager (AWS Secrets Manager) helps you to securely encrypt, store, and retrieve credentials for your databases and other services. Instead of hardcoding credentials in your apps, you can make calls to Secrets Manager to retrieve your credentials whenever needed.
  • AWS Systems Manager (AWS Systems Manager) gives you visibility and control of your infrastructure on AWS, and provides a unified user interface so you can view operational data from multiple AWS services. It also allows you to automate operational tasks across your AWS resources. Here it is used to retrieve a public parameter.
  • Amazon Simple Notification Service (Amazon SNS) enables applications, end-users, and devices to instantly send and receive notifications from the cloud. You can send notifications by email to the user who has created a valid and verified subscription.


The goal of this solution is to run an interactive Linux desktop session in a single Amazon ECS container, with support for GPU rendering, and connect remotely through NICE DCV protocol. AWS Batch will dynamically provision EC2 instances, with or without GPU (e.g. G3/G4 instances).

Solution scheme

You will build and register the DCV Container image to be used for the DCV Desktop Sessions. In AWS Batch, we will set up a managed CE starting from the Amazon ECS GPU-optimized AMI, which comes with the NVIDIA drivers and Amazon ECS agent already installed. Also, you will use Amazon Secrets Manager to safely store user credentials and Amazon SNS to automatically notify the user that the interactive job is ready.


As a Computational Fluid Dynamics (CFD) visualization application example you will use Paraview.

This blog post goes through the following steps:

  1. Prepare required components
    • Launch temporary EC2 instance to build a DCV container image
    • Store user’s credentials and notification data
    • Create required roles
  2. Build DCV container image
  3. Create a repository on Amazon ECR
    • Push the DCV container image
  4. Configure AWS Batch
    • Create a managed CE
    • Create a related job queue
    • Create its Job Definition
  5. Submit a batch job
  6. Connect to the interactive desktop session using NICE DCV
    • Run the Paraview application to visualize results of a job simulation


  • An Amazon Linux 2 instance as a Docker host, launched from the latest Amazon ECS GPU-optimized AMI
  • In order to connect to desktop sessions, inbound DCV port must be opened (by default DCV port is 8443)
  • AWS account credentials with the necessary access permissions
  • AWS Command Line Interface (CLI) installed and configured with the same AWS credentials
  • To easily install third-party/open source required software, assume that the Docker host has outbound internet access allowed

Step 1. Required components

In this step you’ll create a temporary EC2 instance dedicated to a Docker image, and create the IAM policies required for the next steps. Next create the secrets in AWS Secrets Manager service to store sensible data like credentials and SNS topic ARN, and apply and verify the required system settings.

1.1 Launch the temporary EC2 instance for Docker image building

Launch the EC2 instance that becomes your Docker host from the Amazon ECS GPU-optimized AMI. Retrieve its AMI ID. For cost saving, you can use one of t3* family instance type for this stage (e.g. t3.medium).

1.2 Store user credentials and notification data

As an example of avoiding hardcoded credentials or keys into scripts used in next stages, we’ll use AWS Secrets Manager to safely store final user’s OS credentials and other sensible data.

  • In the AWS Management Console select Secrets Manager, create a new secret, select type Other type of secrets, and specify key pair. Store the user login name as a key, e.g.: user001, and the password as value, then name the secret as Run_DCV_in_Batch, or alternatively you can use the commands. Note xxxxxxxxxx is your chosen password.

aws secretsmanager  create-secret --secret-id Run_DCV_in_Batch
aws secretsmanager put-secret-value --secret-id Run_DCV_in_Batch --secret-string '{"user001":"xxxxxxxxxx"}'

  • Create an SNS Topic to send email notifications to the user when a DCV session is ready for connection:
  • In the AWS Management Console select Secrets Manager service to create a new secret named DCV_Session_Ready_Notification, with type other type of secrets and key pair values. Store the string sns_topic_arn as a key and the SNS Topic ARN as value:

aws secretsmanager  create-secret --secret-id DCV_Session_Ready_Notification
aws secretsmanager put-secret-value --secret-id DCV_Session_Ready_Notification --secret-string '{"sns_topic_arn":"<put here your SNS Topic ARN>"}'

1.3 Create required role and policy

To simplify, define a single role named dcv-ecs-batch-role gathering all the necessary policies. This role will be associated to the EC2 instance that launches from an AWS Batch job submission, so it is included inside the CE definition later.

To allow DCV sessions, push images into Amazon ECR and AWS Batch operations, create the role and include the following AWS managed and custom policies:

  • AmazonEC2ContainerRegistryFullAccess
  • AmazonEC2ContainerServiceforEC2Role
  • SecretsManagerReadWrite
  • AmazonSNSFullAccess
  • AmazonECSTaskExecutionRolePolicy

To reach the NICE DCV licenses stored in Amazon S3 (see licensing the NICE DCV server for more details), define a custom policy named DCVLicensePolicy (the following policy is for eu-west-1 Region, you might also use us-east-1):

    "Version": "2012-10-17",
    "Statement": [
            "Effect": "Allow",
            "Action": "s3:GetObject",
            "Resource": "arn:aws:s3:::dcv-license.eu-west-1/*"

create role

Note: If needed, you can add additional policies to allow the copy data from/to S3 bucket.

Update the Trust relationships of the same role in order to allow the Amazon ECS tasks execution and use this role from the AWS Batch Job definition as well:

  "Version": "2012-10-17",
  "Statement": [
      "Effect": "Allow",
      "Principal": {
        "Service": "ec2.amazonaws.com"
      "Action": "sts:AssumeRole"
      "Effect": "Allow",
      "Principal": {
        "Service": "ecs-tasks.amazonaws.com"
      "Action": "sts:AssumeRole"

Trusted relationships and Trusted entities

1.4 Create required Security Group

In the AWS Management Console, access EC2, and create a Security Group, named dcv-sg, that is open to DCV sessions and DCV clients by enabling tcp port 8443 in Inbound.

Step 2. DCV container image

Now you will build a container that provides OpenGL acceleration via NICE DCV. You’ll write the Dockerfile starting from Amazon Linux 2 base image, and add DCV with its related requirements.

2.1 Define the Dockerfile

The base software packages in the Dockerfile will contain: NVIDIA libraries, X server and GNOME desktop and some external scripts to manage the DCV service startup and email notification for the user.

Starting from the base image just pulled, our Dockerfile does install all required (and optional) system tools and libraries, desktop manager packages, manage the Prerequisites for Linux NICE DCV Servers , Install the NICE DCV Server on Linux and Paraview application for 2D/3D data visualization.

The final contents of the Dockerfile is available here; in the same repository, you can also find scripts that manage the DCV service system script, the notification message sent to the User, the creation of local User at startup and the run script for the DCV container.

2.2 Build Dockerfile

Install required tools both to unpack archives and perform command on AWS:

sudo yum install -y unzip awscli

Download the Git archive within the EC2 instance, and unpack on a temporary directory:

curl -s -L -o - https://github.com/aws-samples/aws-batch-using-nice-dcv/archive/latest.tar.gz | tar zxvf -

From inside the folder containing aws-batch-using-nice-dcv.dockerfile, let’s build the Docker image:

docker build -t dcv -f aws-batch-using-nice-dcv.dockerfile .

The first time it takes a while since it has to download and install all the required packages and related dependencies. After the command completes, check it has been built and tagged correctly with the command:

docker images

Step 3. Amazon ECR configuration

In this step, you’ll push/archive our newly built DCV container AMI into Amazon ECR. Having this image in Amazon ECR allows you to use it inside Amazon ECS and AWS Batch.

3.1 Push DCV image into Amazon ECR repository

Set a desired name for your new repository, e.g. dcv, and push your latest dcv image into it. The push procedure is described in Amazon ECR by selecting your repository, and clicking on the top-right button View push commands.

Install the required tool to manage content in JSON format:

sudo yum install -y jq

Amazon ECR push commands to run include:

  • Login command to authenticate your Docker client to Amazon ECS registry. Using the AWS CLI:

AWS_REGION="$(curl -s | jq -r .region)"
eval $(aws ecr get-login --no-include-email --region "${AWS_REGION}") Note: If you receive an “Unknown options: –no-include-email” error when using the AWS CLI, ensure that you have the latest version installed. Learn more.

  • Create the repository:

aws ecr create-repository --repository-name=dcv —region "${AWS_REGION}"DCV_REPOSITORY=$(aws ecr describe-repositories --repository-names=dcv --region "${AWS_REGION}"| jq -r '.repositories[0].repositoryUri')

  • Tag the image to push the image to the Amazon ECR repository:

docker build -t "${DCV_REPOSITORY}:$(date +%F)" -f aws-batch-using-nice-dcv.dockerfile .

  • Push command:

docker push "${DCV_REPOSITORY}:$(date +%F)"

Step 4. AWS Batch configuration

The final step is to set up AWS Batch to manage your DCV containers. The link to all previous steps is the use of our DCV container image inside the AWS Batch CE.

4.1 Compute environment

Create an AWS Batch CE using othe newly created AMI.

  • Log into the AWS Management Console, select AWS Batch, select ‘get started’, and skip the wizard on next page.
  • Choose Compute Environments on the left, and click on Create Environment.
  • Specify all your desired settings, e.g.:
      • Managed type
      • Name: DCV-GPU-CE
      • Service role: AWSBatchServiceRole
      • Instance role: dcv-ecs-batch-role
  • Since you want OpenGL acceleration, choose an instance type with GPU (e.g. g4dn.xlarge).
  • Choose an allocation strategy. In this example I choose BEST_FIT_PROGRESSIVE
  • Assign the security group dcv-sg, created previously at step 1.4 that keeps DCV port 8443 open.
  • Add a Nametag with the value e.g. “DCV-GPU-Batch-Instance”; to assign it to the EC2 instances started by AWS Batch automatically, so you can recognize it if needed.

4.2 Job Queue

Time to create a Job Queue for DCV with your preferred settings.

  • Select Job Queues from the left menu, then select Create queue (naming, for instance, e.g. DCV-GPU-Queue)
  • Specify a required Priority integer value.
  • Associate to this queue the CE you defined in the previous step (e.g. DCV-GPU-CE).

4.3 Job Definition

Now, we create a Job Definition by selecting the related item in the left menu, and select Create. 

We’ll use, listed per section:

  • Job Definition name (e.g. DCV-GPU-JD)
  • Execution timeout to 1h: 3600
  • Parameter section:
    • Add the Parameter named command with value: --network=host
      • Note: This parameter is required and equivalent to specify the same option to the docker run.Learn more.
  • Environment section:
    • Job role: dcv-ecs-batch-role
    • Container image: Use the ECR repository previously created, e.g. dkr.ecr.eu-west-1.amazonaws.com/dcv. If you don’t remember the Amazon ECR image URI, just return to Amazon ECR -> Repository -> Images.
    • vCPUs: 8
      • Note: Value equal to the vCPUs of the chosen instance type (in this example: gdn4.2xlarge), having one job per node to avoid conflicts on usage of TCP ports required by NICE DCV daemons.
    • Memory (MiB): 2048
  • Security section:
    • Check Privileged
    • Set user root (run as root)
  • Environment Variables section:
    • DISPLAY: 0

Note: Amazon ECS provides a GPU-optimized AMI that comes ready with pre-configured NVIDIA kernel drivers and a Docker GPU runtime, learn more; the variables above make available the required graphic device(s) inside the container.

4.4 Create and submit a Job

We can finally, create an AWS Batch job, by selecting Batch → Jobs → Submit Job.
Let’s specify the job queue and job definition defined in the previous steps. Leave the command filed as pre-filled from job definition.

Running DCV job on AWS Batch

4.5 Connect to sessions

Once the job is in RUNNING state, go to the AWS Batch dashboard, you can get the IP address/DNS in several ways as noted in How do I get the ID or IP address of an Amazon EC2 instance for an AWS Batch job. For example, assuming the tag Name set on CE is DCV-GPU-Batch-Instance:

aws ec2 describe-instances --filters Name=instance-state-name,Values=running Name=tag:Name,Values="DCV-GPU-Batch-Instance" --query "Reservations[].Instances[].{id: InstanceId, tm: LaunchTime, ip: PublicIpAddress}" | jq -r 'sort_by(.tm) | reverse | .[0]' | jq -r .ip

Note: It could be required to add the EC2 policy to the list of instances in the IAM role. If the AWS SNS Topic is properly configured, as mentioned in subsection 1.2, you receive the notification email message with the URL link to connect to the interactive graphical DCV session.

Email from SNS

Finally, connect to it:

  • https://<ip address>:8443

Note: You might need to wait for the host to report as running on EC2 in AWS Management Console.

Below is a NICE DCV session running inside a container using the web browser, or equivalently the NICE DCV native client as well, running Paraview visualization application. It shows the basic elbow results coming from an external OpenFoam simulation, which data has been previously copied over from an S3 bucket; and the dcvgltest as well:

DCV Client connected to a running session


Once you’ve finished running the application, avoid incurring future charges by navigating to the AWS Batch console and terminate the job, set CE parameter Minimum vCPUs and Desired vCPUs equal to 0. Also, navigate to Amazon EC2 and stop the temporary EC2 instance used to build the Docker image.

For a full cleanup of all of the configurations and resources used, delete: the job definition, the job queue and the CE (AWS Batch), the Docker image and ECR Repository (Amazon ECR), the role dcv-ecs-batch-role (Amazon IAM), the security group dcv-sg (Amazon EC2), the Topic DCV_Session_Ready_Notification (AWS SNS), and the secret Run_DCV_in_Batch (Amazon Secrets Manager).


This blog post demonstrates how AWS Batch enables innovative approaches to run HPC workflows including not only batch jobs, but also pre-/post-analysis steps done through interactive graphical OpenGL/3D applications.

You are now ready to start interactive applications with AWS Batch and NICE DCV on G-series instance types with dedicated 3D hardware. This allows you to take advantage of graphical remote rendering on optimized infrastructure without moving data to save costs.

Custom logging with AWS Batch

Post Syndicated from Emma White original https://aws.amazon.com/blogs/compute/custom-logging-with-aws-batch/

This post was written by Christian Kniep, Senior Developer Advocate for HPC and AWS Batch. 

For HPC workloads, visibility into the logs of jobs is important to debug a job which failed, but also to have insights into a running job and track its trajectory to influence the configuration of the next job or terminate the job because it went off track.

With AWS Batch, customers are able to run batch workloads at scale, reliably and with ease as this managed serves takes out the undifferentiated heavy lifting. The customer can then focus on submitting jobs and getting work done. Customers told us that at a certain scale, the single logging driver available within AWS Batch made it hard to separate logs as they were all ending up in the same log group in Amazon CloudWatch.

With the new release of customer logging driver support, customers are now able to adjust how the job output is logged. Not only customize the Amazon CloudWatch setting, but enable the use of external logging frameworks such as splunk, fluentd, json-files, syslog, gelf, journald.

This allow AWS Batch jobs to use the existing systems they are accustom to, with fine-grained control of the log data for debugging and access control purposes.

In this blog, I show the benefits of custom logging with AWS Batch by adjusting the log targets for jobs. The first example will customize the Amazon CloudWatch log group, the second will log to Splunk, an external logging service.

Example setup

To showcase this new feature, I use the AWS Command Line Interface (CLI) to setup the following:

  1. IAM roles, policies, and profiles to grant access and permissions
  2. A compute environment to provide the compute resources to run jobs
  3. A job queue, which supervises the job execution and schedules jobs on a compute environment
  4. A job definition, which uses a simple job to demonstrate how the new configuration can be applied

Once those tasks are completed, I submit a job and send logs to a customized CloudWatch log-group and Splunk.


To make things easier, I first set a couple of environment variables to have the information handy for later use. I use the following code to set up the environment variables.

# in case it is not already installed
sudo yum install -y jq 
export MD_URL=
export IFACE=$(curl -s ${MD_URL}/network/interfaces/macs/)
export SUBNET_ID=$(curl -s ${MD_URL}/network/interfaces/macs/${IFACE}/subnet-id)
export VPC_ID=$(curl -s ${MD_URL}/network/interfaces/macs/${IFACE}/vpc-id)
export AWS_REGION=$(curl -s ${MD_URL}/placement/availability-zone | sed 's/[a-z]$//')
export AWS_ACCT_ID=$(curl -s ${MD_URL}/identity-credentials/ec2/info |jq -r .AccountId)
export AWS_SG_DEFAULT=$(aws ec2 describe-security-groups \
--filters Name=group-name,Values=default \
|jq -r '.SecurityGroups[0].GroupId')


When using the AWS Management Console, you must create IAM roles manually.

Trust Policies

IAM Roles are defined to be used by a certain service. In the simplest case, you want a role to be used by Amazon EC2 – the service that provides the compute capacity in the cloud. This defines which entity is able to use an IAM Role, called Trust Policy. To set up a trust policy for an IAM role, use the following code snippet.

cat > ec2-trust-policy.json << EOF
  "Version": "2012-10-17",
  "Statement": [{
    "Effect": "Allow",
    "Principal": {
      "Service": "ec2.amazonaws.com"
    "Action": "sts:AssumeRole"

Instance role

With the IAM trust policy, I now create an ecsInstanceRole and attach the pre-defined policy AmazonEC2ContainerServiceforEC2Role. This allows an instance to interact with Amazon ECS.

aws iam create-role --role-name ecsInstanceRole \
 --assume-role-policy-document file://ec2-trust-policy.json
aws iam create-instance-profile --instance-profile-name ecsInstanceProfile
aws iam add-role-to-instance-profile \
    --instance-profile-name ecsInstanceProfile \
    --role-name ecsInstanceRole
aws iam attach-role-policy --role-name ecsInstanceRole \
 --policy-arn arn:aws:iam::aws:policy/service-role/AmazonEC2ContainerServiceforEC2Role

Service Role

The AWS Batch service uses a role to interact with different services. The trust relationship reflects that the AWS Batch service is going to assume this role.  You can set up this role with the following logic.

cat > svc-trust-policy.json << EOF
  "Version": "2012-10-17",
  "Statement": [{
    "Effect": "Allow",
    "Principal": {
      "Service": "batch.amazonaws.com"
    "Action": "sts:AssumeRole"
aws iam create-role --role-name AWSBatchServiceRole \
--assume-role-policy-document file://svc-trust-policy.json
aws iam attach-role-policy --role-name AWSBatchServiceRole \
--policy-arn arn:aws:iam::aws:policy/service-role/AWSBatchServiceRole

In addition to dealing with Amazon ECS, the instance role can create and write to Amazon CloudWatch log groups, to control which log group names are used, a condition is attached.

While the compute environment is coming up, let us create and attach a policy to make a new log-group possible.

cat > policy.json << EOF
  "Version": "2012-10-17",
  "Statement": [{
    "Effect": "Allow",
    "Action": [
    "Resource": "*",
    "Condition": {
      "StringEqualsIfExists": {
        "batch:LogDriver": ["awslogs"],
        "batch:AWSLogsGroup": ["/aws/batch/custom/*"]
aws iam create-policy --policy-name batch-awslog-policy \
    --policy-document file://policy.json
aws iam attach-role-policy --policy-arn arn:aws:iam::${AWS_ACCT_ID}:policy/batch-awslog-policy --role-name ecsInstanceRole

At this point, I created the IAM roles and policies so that the instance and service are able to interact with the AWS APIs, including trust-policies to define which services are meant to use them. EC2 for the ecsInstanceRole and the AWSBatchServiceRole for the AWS Batch service itself.

Compute environment

Now, I am going to create a compute environment, which is going to spin up an instance (one vCPU target) to run the example job in.

cat > compute-environment.json << EOF
  "computeEnvironmentName": "od-ce",
  "type": "MANAGED",
  "state": "ENABLED",
  "computeResources": {
    "type": "EC2",
    "allocationStrategy": "BEST_FIT_PROGRESSIVE",
    "minvCpus": 1,
    "maxvCpus": 8,
    "desiredvCpus": 1,
    "instanceTypes": ["m5.xlarge"],
    "subnets": ["${SUBNET_ID}"],
    "securityGroupIds": ["${AWS_SG_DEFAULT}"],
    "instanceRole": "arn:aws:iam::${AWS_ACCT_ID}:instance-profile/ecsInstanceRole",
    "tags": {"Name": "aws-batch-compute"},
    "bidPercentage": 0
  "serviceRole": "arn:aws:iam::${AWS_ACCT_ID}:role/AWSBatchServiceRole"
aws batch create-compute-environment --cli-input-json file://compute-environment.json  

Once this section is complete, a compute environment is being spun up in the back. This will take a moment. You can use the following command to check on the status of the compute environment.

aws batch  describe-compute-environments

Once it is enabled and valid we can continue by setting up the job queue.

Job Queue

Now that I have a compute environment up and running, I will create a job queue which accepts job submissions and schedules the jobs on the compute environment.

cat > job-queue.json << EOF
  "jobQueueName": "jq",
  "state": "ENABLED",
  "priority": 1,
  "computeEnvironmentOrder": [{
    "order": 0,
    "computeEnvironment": "od-ce"
aws batch create-job-queue --cli-input-json file://job-queue.json

Job definition

The job definition is used as a template for jobs. This example runs a plain container and prints the environment variables. With the new release of AWS Batch, the logging driver awslogs now allows you to change the log group configuration within the job definition.

cat > job-definition.json << EOF
  "jobDefinitionName": "alpine-env",
  "type": "container",
  "containerProperties": {
  "image": "alpine",
  "vcpus": 1,
  "memory": 128,
  "command": ["env"],
  "readonlyRootFilesystem": true,
  "logConfiguration": {
    "logDriver": "awslogs",
    "options": { 
      "awslogs-region": "${AWS_REGION}", 
      "awslogs-group": "/aws/batch/custom/env-queue",
      "awslogs-create-group": "true"}
aws batch register-job-definition --cli-input-json file://job-definition.json

Job Submission

Using the above job definition, you can now submit a job.

aws batch submit-job \
  --job-name test-$(date +"%F_%H-%M-%S") \
  --job-queue arn:aws:batch:${AWS_REGION}:${AWS_ACCT_ID}:job-queue/jq \
  --job-definition arn:aws:batch:${AWS_REGION}:${AWS_ACCT_ID}:job-definition/alpine-env:1

Now, you can check the ‘Log Group’ in CloudWatch. Go to the CloudWatch console and find the ‘Log Group’ section on the left.

log groups in cloudwatch

Now, click on the log group defined above, and you should see the output of the job which allows for debugging if something within the container went wrong or processing logs and create alarms and reports.

cloudwatch log events


Splunk is an established log engine for a broad set of customers. You can use the Docker container to set up a Splunk server quickly. More information can be found in the Splunk documentation. You need to configure the HTTP Event Collector, which provides you with a link and a token.

To send logs to Splunk, create an additional job-definition with the Splunk token and URL. Please adjust the splunk-url and splunk-token to match your Splunk setup.

  "jobDefinitionName": "alpine-splunk",
  "type": "container",
  "containerProperties": {
    "image": "alpine",
    "vcpus": 1,
    "memory": 128,
    "command": ["env"],
    "readonlyRootFilesystem": false,
    "logConfiguration": {
      "logDriver": "splunk",
      "options": {
        "splunk-url": "https://<splunk-url>",
        "splunk-token": "XXX-YYY-ZZZ"

This forwards the logs to Splunk, as you can see in the following image.

forward to splunk


This blog post showed you how to apply custom logging to AWS Batch using the awslog and Splunk logging driver. While these are two important logging drivers, please head over to the documentation to find out about fluentd, syslog, json-file and other drivers to find the best driver to match your current logging infrastructure.


EFA-enabled C5n instances to scale Simcenter STAR-CCM+

Post Syndicated from Ben Peven original https://aws.amazon.com/blogs/compute/efa-enabled-c5n-instances-to-scale-simcenter-star-ccm/

This post was contributed by Dnyanesh Digraskar, Senior Partner SA, High Performance Computing; Linda Hedges, Principal SA, High Performance Computing

In this blog, we define and demonstrate the scalability metrics for a typical real-world application using Computational Fluid Dynamics (CFD) software from Siemens, Simcenter STAR-CCM+, running on a High Performance Computing (HPC) cluster on Amazon Web Services (AWS). This scenario demonstrates the scaling of an external aerodynamics CFD case with 97 million cells to over 4,000 cores of Amazon EC2 C5n.18xlarge instances using the Simcenter STAR-CCM+ software. We also discuss the effects of scaling on efficiency, simulation turn-around time, and total simulation costs. TLG Aerospace, a Seattle-based aerospace engineering services company, contributed the data used in this blog. For a detailed case study describing TLG Aerospace’s experience and the results they achieved, see the TLG Aerospace case study.

For HPC workloads that use multiple nodes, the cluster setup including the network is at the heart of scalability concerns. Some of the most common concerns from CFD or HPC engineers are “how well will my application scale on AWS?”, “how do I optimize the associated costs for best performance of my application on AWS?”, “what are the best practices in setting up an HPC cluster on AWS to reduce the simulation turn-around time and maintain high efficiency?” This post aims to answer these concerns by defining and explaining important scalability-related parameters by illustrating the results from the CFD case. For detailed HPC-specific information, see visit the High Performance Computing page and download the CFD whitepaper, Computational Fluid Dynamics on AWS.

CFD scaling on AWS


HPC applications, such as CFD, depend heavily on the applications’ ability to scale compute tasks efficiently in parallel across multiple compute resources. We often evaluate parallel performance by determining an application’s scale-up. Scale-up – a function of the number of processors used – is the time to complete a run on one processor, divided by the time to complete the same run on the number of processors used for the parallel run.

Scale-up formula

In addition to characterizing the scale-up of an application, scalability can be further characterized as “strong” or “weak”. Strong scaling offers a traditional view of application scaling, where a problem size is fixed and spread over an increasing number of processors. As more processors are added to the calculation, good strong scaling means that the time to complete the calculation decreases proportionally with increasing processor count. In comparison, weak scaling does not fix the problem size used in the evaluation, but purposely increases the problem size as the number of processors also increases. An application demonstrates good weak scaling when the time to complete the calculation remains constant as the ratio of compute effort to the number of processors is held constant. Weak scaling offers insight into how an application behaves with varying case size.

Figure 1, the following image, shows scale-up as a function of increasing processor count for the Simcenter STAR-CCM+ case data provided by TLG Aerospace. This is a demonstration of “strong” scalability. The blue line shows what ideal or perfect scalability looks like. The purple triangles show the actual scale-up for the case as a function of increasing processor count. The closeness of these two curves demonstrates excellent scaling to well over 3,000 processors for this mid-to-large-sized 97M cell case. This example was run on Amazon EC2 C5n.18xlarge Intel Skylake instances, 3.0 GHz, each providing 36 cores with Hyper-Threading disabled.

Figure 1. Strong scaling demonstrated for a 97M cell Simcenter STAR-CCM+ CFD calculation


Now that you understand the variation of scale-up with the number of processors, we discuss the relation of scale-up with number of grid cells per processor, which determines the efficiency of the parallel simulation. Efficiency is the scale-up divided by the number of processors used in the calculation. By plotting grid cells per processor, as in Figure 2, scaling estimates can be made for simulations with different grid sizes with Simcenter STAR-CCM+. The purple line in Figure 2 shows scale-up as a function of grid cells per processor. The vertical axis for scale-up is on the left-hand side of the graph as indicated by the purple arrow. The green line in Figure 2 shows efficiency as a function of grid cells per processor. The vertical axis for efficiency is on the right side of the graph and is indicated by the green arrow.

Figure 2. Scale-up and efficiency as a function of cells per processor.

Fewer grid cells per processor means reduced computational effort per processor. Maintaining efficiency while reducing cells per processor demonstrates the strong scalability of Simcenter STAR-CCM+ on AWS.

Efficiency remains at about 100% between approximately 700,000 cells per processor core and 60,000 cells per processor core. Efficiency starts to fall off at about 60,000 cells per core. An efficiency of at least 80% is maintained until 25,000 cells per core. Decreasing cells per core leads to decreased efficiency because the total computational effort per processor core is reduced. The goal of achieving more than 100% efficiency (here, at about 250,000 cells per core) is common in scaling studies, is case-specific, and often related to smaller effects such as timing variation and memory caching.

Turn-around time and cost

Case turn-around time and cost is what really matters to most HPC users. A plot of turn-around time versus CPU cost for this case is shown in Figure 3. As the number of cores increases, the total turn-around time decreases. But as the number of cores increases, the inefficiency also increases, which leads to increased costs. The cost, represented by solid blue curve, is based on the On-Demand price for the C5n.18xlarge, and only includes the computational costs. Small costs are also incurred for data storage. Minimum cost and turn-around time are achieved with approximately 60,000 cells per core.

Figure 3. Cost per run for: On-Demand pricing ($3.888 per hour for C5n.18xlarge in US-East-1) with and without the Simcenter STAR-CCM+ POD license cost as a function of turn-around time [Blue]; 3-yr all-upfront pricing ($1.475 per hour for C5n.18xlarge in US-East-1) [Green]

Many users choose a cell count per core count to achieve the lowest possible cost. Others may choose a cell count per core count to achieve the fastest turn-around time. If a run is desired in 1/3rd the time of the lowest price point, it can be achieved with approximately 25,000 cells per core.

Additional information about the test scenario

TLG Aerospace has used the Simcenter STAR-CCM+ Power-On-Demand (POD) license for running the simulations for this case. POD license enables flexible On-Demand usage of the software on unlimited cores for a fixed price of $22 per hour. The total cost per run, which includes the computational cost, plus the POD license cost is represented in Figure 3 by the dashed blue curve. As POD license is charged per hour, the total cost per run increases for higher turn-around times. Note that many users run Simcenter STAR-CCM+ with fewer cells per core than this case. While this increases the compute cost, other concerns—such as license costs or schedules—can be overriding factors. However, many find the reduced turn-around time well worth the price of the additional instances.

AWS also offers Savings Plans, which are a flexible pricing model offering substantially lower price on EC2 instances compared to On-Demand prices for a committed usage of 1- or 3-year term. For example, the 3-year all-upfront pricing of C5n.18xlarge instance is 62% cheaper than the On-Demand pricing. The total cost per run using the 3-year all-upfront pricing model is illustrated in Figure 3 by solid green line. The 3-year all-upfront pricing plan offers a substantial reduction in price for running the simulations.

Amazon Linux is optimized to run on AWS and offers excellent performance for running HPC applications. For the case presented here, the operating system used was Amazon Linux 2. While other Linux distributions are also performant, we strongly recommend that for Linux HPC applications, you use a current Linux kernel.

Amazon Elastic Block Store (Amazon EBS) is a persistent, block-level storage device that is often used for cluster storage on AWS. A standard EBS General Purpose SSD (gp2) volume was used for this scenario. For other HPC applications that may require faster I/O to prevent data writes from being a bottleneck to turn-around speed, we recommend FSx for Lustre. FSx for Lustre seamlessly integrates with Amazon S3, allowing users for efficient data interaction with Amazon S3.

AWS customers can choose to run their applications on either threads or cores. With hyper-threading, a single CPU physical core appears as two logical CPUs to the operating system. For an application like Simcenter STAR-CCM+, excellent linear scaling can be seen when using either threads or cores, though we generally recommend disabling hyper-threading. Most HPC applications benefit from disabling hyper-threading, and therefore, it tends to be the preferred environment for running HPC workloads. For more information, see Well-Architected Framework HPC Lens.

Elastic Fabric Adapter (EFA)

Elastic Fabric Adapter (EFA) is a network device that can be attached to Amazon EC2 instances to accelerate HPC applications by providing lower and consistent latency and higher throughput than the Transmission Control Protocol (TCP) transport. C5n.18xlarge instances used for running Simcenter STAR-CCM+ for this case support EFA technology, which is generally recommended for best scaling.


This post demonstrates the scalability of a commercial CFD software Simcenter STAR-CCM+ for an external aerodynamics simulation performed on the Amazon EC2 C5n.18xlarge instances. The availability of EFA, a high-performing network device on these instances result in excellent scalability of the application. The case turn-around time and associated costs of running Simcenter STAR-CCM+ on AWS hardware are discussed. In general, excellent performance can be achieved on AWS for most HPC applications. In addition to low cost and quick turn-around time, important considerations for HPC also include throughput and availability. AWS offers high throughput, scalability, security, cost-savings, and high availability, decreasing a long queue time and reducing the case turn-around time.

OpenFOAM on Amazon EC2 C6g Arm-based Graviton2 Instances – up to 37% better price/performance

Post Syndicated from Emma White original https://aws.amazon.com/blogs/compute/c6g-openfoam-better-price-performance/

This post is contributed to by: Neil Ashton (AWS) – Principal CFD Specialist SA, Karthik Raman (AWS) – Senior HPC Specialist SA, Oliver Perks (Arm) – Principal HPC Engineer

Over the past 30 years, Computational Fluid Dynamics (CFD) has become a key part of many engineering design processes. From aircraft design to modelling the blood flow inside our bodies, the ability to understand the behavior of fluids has enabled countless innovations and improved the time to market for many products. Typically, the modelling of the underlying fluid flow equations remains the major limit to accuracy; however, the scale and availability of HPC resources is arguably the main bottleneck for the typical CFD user. The need for more HPC resources has accelerated over recent years due to the move to higher-fidelity approaches, in addition to the growing use of optimization techniques and machine learning driven workflows. These workflows require the simulation of thousands or even millions of small jobs to explore a design space or train an improved turbulence model. AWS enables engineers to overcome this HPC bottleneck by allowing the quick deployment of a supercomputer. This can scale to practically any size and with the right hardware to match your needs, for example, CPUs, GPUs.

CFD users have a broad choice for instance types on AWS.  Typically the majority of users select the compute-optimized Amazon EC2 C5 family of instances. These offer a better price/performance over the Amazon EC2 M5 family of instances for the majority of CFD cases due to a need for memory bandwidth over total memory needs. This broad set of available instances allows users to easily incorporate Amazon EC2 M5 or Amazon EC2 R5 instances when higher memory is needed, for example, for pre or post-processing.

There are typically two competing needs for the typical CFD user: turn-around time and cost. Depending on the underlying numerical methods, the speed of a simulation does not linearly increase with an increased core count. At some point, the cost of network communication means a non-linear increase. This means the cost of increasing the number of cores is not matched by a similar decrease in runtime. Therefore, the user typically faces a choice between the fastest possible runtime or the most efficient cost for the simulation.

The recently announced Amazon EC2 C6g instances powered by AWS-designed Arm-based Graviton2 processors bring in a new instance option. So, the logical question for any AWS customer is whether this is suitable for their CFD workload. In this blog, we provide benchmarking results for the open-source CFD code OpenFOAM that can be easily compiled and run with the Amazon EC2 C6g instance. We demonstrate that you can achieve 37% better price/performance for both single-node and multi-node cases on more than 200M cells on thousands of cores.


AWS Graviton2-based Instances

AWS Graviton2 processors are custom built by AWS using the 64-bit Arm Neoverse cores to deliver great price performance for your cloud workloads running in Amazon EC2. The Graviton2-powered EC2 instances were announced at re:Invent 2019. The new Amazon EC2 C6g compute-optimized instances are available as part of the sixth generation EC2 offering. These instances are powered by AWS Graviton2 processors that use 64-bit Arm Neoverse N1 cores and custom silicon designed by AWS, built using advanced 7-nanometer manufacturing technology.

Graviton2 processor cores feature 64 KB L1 cache and 1 MB L2 cache, includes dual SIMD units to double the floating-point performance versus first-generation Graviton processors.  This targets high performance computing workloads and also support int8/fp16 number formats to accelerate machine learning inference workloads. Every vCPU is a physical core (that is, no simultaneous multithreading – SMT). The instances are single socket, so there are no NUMA concerns since every core sees the same path to memory and other cores. There are 8x DDR4 memory channels running at 3200 MT/s delivering over 200 GB/s of peak memory bandwidth.

The compute-optimized Amazon EC2 C6g instances are available in 9 sizes 1, 2, 4, 8, 16, 32, 48, and 64 vCPUs, or as bare metal instances. The compute-optimized Amazon EC2 C6g instances support configurations with up to 128 GB of DDR4 memory or 2 GB/ CPU. The instances support up-to 25 Gbps of network bandwidth and 19 Gbps of Amazon EBS bandwidth. These instances are powered by the AWS Nitro System, a combination of dedicated hardware and lightweight hypervisor.

Benchmarking Results

Single-Node Performance

In order to demonstrate the performance of the Amazon EC2 C6g instance, we have taken two test-cases that reflect the range of cases that CFD users run. For the first case, we simulate a 4 million cell motorbike case that is part of the standard OpenFOAM tutorial suite. It might seem that these cases do not represent the complexity of some production CFD workflows. However, these were chosen to allow readers to easily replicate the results. This case uses the OpenFOAM simpleFoam solver, which uses a steady-state incompressible segregated SIMPLE approach combined with a multigrid method for the linear solver. The k- SST model is used with second order upwinding for both momentum and turbulent equations. The scotch domain decomposition approach is used to ensure best parallel efficiency. However, for this first case the focus is on the single-node performance, given the low cell count.

Figures 1 and 2 show the time taken to run 5000 iterations as well as the cost for that simulation based upon on-demand pricing. It can be seen that the C6g.16xlarge offers a clear cost per simulation benefit over both the C5n.18xlarge (37%), and C5.24xlarge (29%) instances. There is however an increase in total run time compared to the C5.24xlarge (33%) and C5n.18xlarge (13%) which may mean that the C5.24xlarge or C5n.18xlarge will still be preferred for those prioritizing turn-around time.


Figure 1 - Single-node OpenFOAM cost per simulation

Figure 1 – Single-node OpenFOAM cost per simulation

Figure 2 - Single-node OpenFOAM run-time performance

Figure 2 – Single-node OpenFOAM run-time performance

Single-Node Performance Profile

Profiling performed on the simpleFoam solver (used in the motorbike example) showed that it is limited by the sustained memory bandwidth on the instance. We therefore measured the peak sustainable memory bandwidth performance (Figure 3) on the three instance types (C5n.18xlarge, C6g.16xlarge, C5.24xlarge) using STREAM Triad. Then calculated the corresponding bandwidth efficiencies, which are shown in Figure 4.


Figure 3 - Peak memory bandwidth using STREAM

Figure 3 – Peak memory bandwidth using STREAM

Figure 4 - Memory bandwidth efficiency using STREAM Triad

Figure 4 – Memory bandwidth efficiency using STREAM Triad

The C5n.18xlarge and C5.24xlarge instances have higher peak memory bandwidth due to more memory channels (dual socket) compared to C6g.16xlarge instance. However, the Graviton2 based processor can achieve a higher percentage of peak (~86% bandwidth efficiency) compared to the x86 processors (~79% on C5.24xlarge). This in addition to the cost benefits (47% vs. C5.24xlarge, 44% vs. C5n.18xlarge) is the reason for the improved cost per simulation with the C6g.16xlarge instance as shown in Figure 1.

Multi-Node Performance

A further, larger mesh of 222 million cells was studied on the same motorbike geometry with the same underlying numerical methods. We did this to assess whether the same performance trends continue for much larger multi-node simulations. Figure 5 shows the number of iterations possible per minute for different number of cores for the C5.24xlarge, C5n.18xlarge, and C6g.16xlarge instances. You can see in Figure 5 that the scaling of the C5n.18xlarge instances is much better than C5.24xlarge and C6g.16xlarge instances due to the use of the Elastic Fabric Adapter (EFA), which enables optimum scaling thanks to low latency, high-bandwidth  communication. However, while the scaling is much improved for C5n.18xlarge instances with EFA, the cost per simulation (Figure 6) shows up to 37% better price/performance for the C6g.16xlarge instances over the C5n.18xlarge and C5n.24xlarge instances.

Figure 5 - Multi-node OpenFOAM Scaling Performance

Figure 5 – Multi-node OpenFOAM Scaling Performance

Figure 6 - Multi-node OpenFOAM cost per simulation

Figure 6 – Multi-node OpenFOAM cost per simulation


This blog has demonstrated the ease with which both single-node and multi-node CFD simulations can be ported to the new Amazon EC2 C6g instances powered by Arm-based AWS Graviton2 processors. With no code modifications, we have been able to port an x86 workload to the new C6g instances and achieve competitive single node and multi-node performance with up to 37% better price/performance over other C family instances. We encourage you to test out your applications for yourself and reach out to us if you have any questions!

OpenFOAM compilation on AWS Graviton2 Instances

OpenFOAM v1912 builds on modern Arm-based systems, with support in-place for both the GCC compiler and the Arm Compiler for Linux (ACfL). No source code modifications are required to obtain a working OpenFOAM binary. The process for building OpenFOAM on Arm based systems is equivalent to that on x86 systems, by specifying the desired compiler in OpenFOAM bashrc file.

The only external dependencies are a working MPI library. For this experiment, we used Open MPI version 4.0.3, built with UCX 1.8 and GCC 9.2. We note that we have not applied any additional hardware-specific performance optimizations. Although deploying OpenFOAM on the Amazon EC2 C6g instances was trivial, there is documentation on the Arm community GitLab pages. This documentation covers the installation of different OpenFOAM versions and with different compilers.

For the C5n.18xlarge and C5.24xlarge simulations, OpenFOAM v1912 was compiled using GCC 8.2 and IntelMPI 2019.7. For all simulations Amazon Linux 2 was the operating system. The HPC environment for all tests was created using AWS ParallelCluster, which will soon include official support for AWS’ latest Graviton2 instances, including C6g. You can stay up to date on the latest information and releases of AWS ParallelCluster on GitHub.

Instance Details:

Model vCPU Memory (GiB) Instance Storage (GiB) Network Bandwidth (Gbps) EBS Bandwidth (Mbps)
C5n.18xlarge 72 192 EBS-only 100 19,000
C6g.16xlarge 64 128 EBS-only 25 19,000
C5.24xlarge 96 192 EBS-only 25 19,000


Using AWS ParallelCluster with a serverless API

Post Syndicated from Chris Munns original https://aws.amazon.com/blogs/compute/using-aws-parallelcluster-with-a-serverless-api/

This post is contributed by Dario La Porta, AWS Senior Consultant – HPC

AWS ParallelCluster simplifies the creation and the deployment of HPC clusters. Amazon API Gateway is a fully managed service that makes it easy for developers to create, publish, maintain, monitor, and secure APIs at any scale. AWS Lambda automatically runs your code without requiring you to provision or manage servers.

In this post, I create a serverless API of the AWS ParallelCluster command line interface using these services. With this API, you can create, monitor, and destroy your clusters. This makes it possible to integrate AWS ParallelCluster programmatically with other applications you may have running on-premises or in the AWS Cloud.

The serverless integration of AWS ParallelCluster can enable a cleaner and more reproducible infrastructure as code paradigm to legacy HPC environments.

Taking this serverless, infrastructure as code approach enables several new types of functionality for HPC environments. For example, you can build on-demand clusters from an API when on-premises resources cannot handle the workload. AWS ParallelCluster can extend on-premises resources for running elastic and large-scale HPC on AWS’ virtually unlimited infrastructure.

You can also create an event-driven workflow in which new clusters are created when new data is stored in an S3 bucket. With event-driven workflows, you can be creative in finding new ways to build HPC infrastructure easily. It also helps optimize time for researchers.

Security is paramount in HPC environments because customers are performing scientific analyses that are central to their businesses. By using a serverless API, this solution can improve security by removing the need to run the AWS ParallelCluster CLI in a user environment. This help keep customer environments secure and more easily control the IAM roles and security groups that researchers have access to.

Additionally, the Amazon API Gateway for HPC job submission post explains how to submit a job in the cluster using the API. You can use this instead of connecting to the master node via SSH.

This diagram shows the components required to create the cluster and interact with the solution.

Cluster Architecture

Cluster Architecture

Cost of the solution

You can deploy the solution in this blog post within the AWS Free Tier. Make sure that your AWS ParallelCluster configuration uses the t2.micro instance type for the cluster’s master and compute instances. This is the default instance type for AWS ParallelCluster configuration.

For real-world HPC use cases, you most likely want to use a different instance type, such as C5 or C5n. C5n in particular can work well for HPC workloads because it includes the option to use the Elastic Fabric Adapter (EFA) network interface. This makes it possible to scale tightly coupled workloads to more compute instances and reduce communications latency when using protocols such as MPI.

To stay within the AWS Free Tier allowance, be sure to destroy the created resources as described in the teardown section of this post.

VPC configuration

Choose Launch Stack to create the VPC used for this configuration in your account:

The stack creates the VPC, the public subnets, and the private subnet required for the cluster in the eu-west-1 Region.

Stack Outputs

Stack Outputs

You can also use an existing VPC that complies with the AWS ParallelCluster network requirements.

Deploy the API with AWS SAM

The AWS Serverless Application Model (AWS SAM) is an open-source framework that you can use to build serverless applications on AWS. You use AWS SAM to simplify the setup of the serverless architecture.

In this case, the framework automates the manual configuration of setting up the API Gateway and Lambda function. Instead you can focus more on how the API works with AWS ParallelCluster. It improves security and provides a simple, alternative method for cluster lifecycle management.

You can install the AWS SAM CLI by following the Installing the AWS SAM CLI documentation. You can download the code used in this example from this repo. Inside the repo:

  • the sam-app folder in the aws-sample repository contains the code required to build the AWS ParallelCluster serverless API.
  • sam-app/template.yml contains the policy required for the Lambda function for the creation of the cluster. Be sure to modify <AWS ACCOUNT ID> to match the value for your account.

The AWS Identity and Access Management Roles in AWS ParallelCluster document contains the latest version of the policy, in the ParallelClusterUserPolicy section.

To deploy the application, run the following commands:

cd sam-app
sam build
sam deploy --guided

From here, provide parameter values for the SAM deployment wizard that are appropriate for your Region and AWS account. After the deployment, take a note of the Outputs:

SAM deploying

SAM deploying

SAM Stack Outputs

SAM Stack Outputs

The API Gateway endpoint URL is used to interact with the API, and has the following format:

https:// <ServerlessRestApi>.execute-api.eu-west-1.amazonaws.com/Prod/pcluster

AWS ParallelCluster configuration file

AWS ParallelCluster is an open source cluster management tool to deploy and manage HPC clusters in the AWS Cloud. AWS ParallelCluster uses a configuration file to build the cluster and its syntax is explained in the documentation guide. The pcluster.conf configuration file can be created in a directory of your local file system.

The configuration file has been tested with AWS ParallelCluster v2.6.0. The master_subnet_id contains the id of the created public subnet and the compute_subnet_id contains the private one.

Deploy the cluster with the pcluster API

The pcluster API created in the previous steps requires some parameters:

  • command – the pcluster command to execute. A detailed list is available commands is available in the AWS ParallelCluster CLI commands page.
  • cluster_name – the name of the cluster.
  • –data-binary “$(base64 /path/to/pcluster/config)” – parameter used to pass the local AWS ParallelCluster configuration file to the API.
  • -H “additional_parameters: <param1> <param2> <…>” – used to pass additional parameters to the pcluster cli.

The following command creates a cluster named “cluster1”:

$ curl --request POST -H "additional_parameters: --nowait"  --data-binary "$(base64 /tmp/pcluster.conf)" "https://<ServerlessRestApi>.execute-api.eu-west-1.amazonaws.com/Prod/pcluster?command=create&cluster_name=cluster1"

Beginning cluster creation for cluster: cluster1
Creating stack named: parallelcluster-cluster1

The cluster creation status can be queried with the following:

$ curl --request POST -H "additional_parameters: --nowait"  --data-binary "$(base64 /tmp/pcluster.conf)" "https://<ServerlessRestApi>.execute-api.eu-west-1.amazonaws.com/Prod/pcluster?command=status&cluster_name=cluster1"


When the cluster is in the “CREATE_COMPLETE” state, you can retrieve the master node IP address using the following API call:

$ curl --request POST -H "additional_parameters: --nowait"  --data-binary "$(base64 /tmp/pcluster.conf)" "https://<ServerlessRestApi>.execute-api.eu-west-1.amazonaws.com/Prod/pcluster?command=status&cluster_name=cluster1"


$ curl --request POST -H "additional_parameters: "  --data-binary "$(base64 /tmp/pcluster.conf)" "https://<ServerlessRestApi>.execute-api.eu-west-1.amazonaws.com/Prod/pcluster?command=status&cluster_name=cluster1"

MasterServer: RUNNING
ClusterUser: ec2-user

When the cluster is not needed anymore, destroy it with the following API call:

$ curl --request POST -H "additional_parameters: --nowait"  --data-binary "$(base64 /tmp/pcluster.conf)" "https://<ServerlessRestApi>.execute-api.eu-west-1.amazonaws.com/Prod/pcluster?command=delete&cluster_name=cluster1"

Deleting: cluster1

The additional_parameters: —nowait prevents waiting for stack events after executing a stack command and avoids triggering the Lambda function timeout. The Amazon API Gateway for HPC job submission post explains how you can submit a job in the cluster using the API, instead of connecting to the master node via SSH.

The authentication to the API can be managed by following the Controlling and Managing Access to a REST API in API Gateway Documentation.


You can destroy the resources by deleting the CloudFormation stacks created during installation. Deleting a Stack on the AWS CloudFormation Console explains the required steps.


In this post, I show how to integrate AWS ParallelCluster with Amazon API Gateway and manage the lifecycle of an HPC cluster using this API. Using Amazon API Gateway and AWS Lambda, you can run a serverless implementation of the AWS ParallelCluster CLI. This makes it possible to integrate AWS ParallelCluster programmatically with other applications you run on-premise or in the AWS Cloud.

This solution can help you improve the security of your HPC environment by simplifying the IAM roles and security groups that must be granted to individual users to successfully create HPC clusters. With this implementation, researchers no longer must run the AWS ParallelCluster CLI in their own user environment. As a result, by simplifying the security management of your HPC clusters’ lifecycle management, you can better ensure that important research is safe and secure.

To learn more, read more about how to use AWS ParallelCluster.

Enabling job accounting for HPC with AWS ParallelCluster and Amazon RDS

Post Syndicated from Emma White original https://aws.amazon.com/blogs/compute/enabling-job-accounting-for-hpc-with-aws-parallelcluster-and-amazon-rds/

This post is written by Nicola Venuti, HPC Specialist SA, and contributed to by Rex Chen, Software Development Engineer.


Accounting, reporting, and advanced analytics used for data-driven planning and decision making are key areas of focus for High Performance Computing (HPC) Administrators. In the cloud, these areas are more relevant to the costs of the services, which directly impact budgeting and forecasting of expenses. With the growth of new HPC services that perform analyses and corrective actions, you can better optimize for performance, which reduces cost.

Solution Overview

In this blog post, we walk through an easy way to collect accounting information for evert job and step executed in a cluster with job scheduling. This post uses a new feature in the latest version (2.6.0) of AWS ParallelCluster, which makes this process easier than before, and Slurm.  Accounting records are saved into a relational database for both currently executing jobs and jobs, which have already terminated.


This tutorial assumes you already have a cluster in AWS ParallelCluster. If  you don’t, refer to the AWS ParallelCluster documentation, a getting started blog post, or a how-to blog post.


Choose your architecture

There are two common architectures to save job accounting information into a database:

  1. Installing and directly managing a DBMS in the master node of your cluster (or in an additional EC2 instance dedicated to it)
  2. Using a fully managed service like Amazon Relational Database Service (RDS)

While the first option might appear to be the most economical solution, it requires heavy lifting. You must install and manage the database, which is not a core part of running your HPC workloads.  Alternatively, Amazon RDS reduces this burden of installing updates, managing security patches, and properly allocating resources.  Additionally, Amazon RDS Free Tier can get you started with a managed database service in the cloud for free. Refer to the hyperlink for a list of free resources.

Amazon RDS is my preferred choice, and the following sections implement this architecture. Bear in mind, however, that the settings and the customizations required in the AWS ParallelCluster environment are the same, regardless of which architecture you prefer.


Set up and configure your database

Now, with your architecture determined, let’s configure it.  First, go to Amazon RDS’ console.  Select the same Region where your AWS ParallelCluster is deployed, then click on Create Database.

There are two database instances to consider: Amazon Aurora and MySQL.

Amazon Aurora has many benefits compared to MySQL. However, in this blog post, I use MySQL to show how to build your HPC accounting database within the Free-tier offering.

The following steps are the same regardless of your database choice. So, if you’re interested in one of the many features that differentiate Amazon Aurora from MySQL, feel free to use. Check out Amazon Aurora’s landing page to learn more about its benefits, such as its faster performance and cost effectiveness.

To configure your database, you must complete the following steps:

  1. Name the database
  2. Establish credential settings
  3. Select the DB instance size
  4. Identify storage type
  5. Allocate amount of storage

The following images show the settings that I chose for storage options and the “Free tier” template.  Feel free to change it accordingly to the scope and the usage you expect.

Make sure you select the corresponding VPC to wherever your “compute fleet” is deployed by AWS ParallelCluster, and wherever the Security Group of your compute fleet is selected.  You can access information for your “compute fleet” in your AWS ParallelCluster config file. The Security Group should look something like this: “parallelcluster-XXX-ComputeSecurityGroup-XYZ”.

At this stage, you can click on Create database and wait until the Database status moves from the Creating to Available in the Amazon RDS Dashboard.

The last step for this section is to grant privileges on the database.

  1. Connect to your database. Use the master node of your AWS ParallelCluster as a client.
  2. Install the MySQL client by running sudo yum install mysql on AmazonLinux and CentOS and sudo apt-get install mysql-client on Ubuntu.
  3. Connect to your MySQL RDS database using the following code: mysql --host=<your_rds_endpoint> --port=3306 -u admin -p The following screenshot shows how to find your RDS endpoint and port.


4. Run GRANT ALL ON `%`.* TO [email protected]`%`; to grant the required privileges.

The following code demonstrates these steps together:

[[email protected]]$ mysql --host=parallelcluster-accounting.c68dmmc6ycyr.us-east-1.rds.amazonaws.com --port=3306 -u admin -p
Enter password:
Welcome to the MySQL monitor.  Commands end with ; or \g.
Your MySQL connection id is 251
Server version: 8.0.16 Source distribution

Copyright (c) 2000, 2018, Oracle and/or its affiliates. All rights reserved.

Oracle is a registered trademark of Oracle Corporation and/or its
affiliates. Other names may be trademarks of their respective

Type 'help;' or '\h' for help. Type '\c' to clear the current input statement.

mysql> GRANT ALL ON `%`.* TO [email protected]`%`;

Note: typically this command is run as GRANT ALL ON *.* TO 'admin'@'%';   With Amazon RDS , for security reasons, this is not possible as the master account does not have access to the MySQL database. Using *.*  triggers an error. To work around this, I use the _ and % wildcards that are permitted. To look at the actual grants, you can run the following: SHOW GRANTS;

Enable Slurm Database logging

Now, your database is fully configured. The next step is to enable Slurm as a workload manager.

A few steps must occur to let Slurm log its job accounting information on an external database. The following code demonstrates the steps you must make.

  1. Add the DB configuration file, slurmdbd.conf after /opt/slurm/etc/ 
  2. Slurm’s slurm.conf file requires a few modifications. These changes are noted after the following code examples.


Note: You do not need to configure each and every compute node because AWS ParallelCluster installs Slurm in a shared directory. All of these nodes share this directory, and, thus the same configuration files with the master node of your cluster.

Below, you can find two example configuration files that you can use just by modifying a few parameters accordingly to your setup.

For more information about all the possible settings of configuration parameters, please refer to the official Slurm documentation, and in particular to the accounting section.

Add the DB configuration file

## Sample /opt/slurm/etc/slurmdbd.conf
DbdHost=ip-10-0-16-243  #YOUR_MASTER_IP_ADDRESS_OR_NAME
StorageHost=parallelcluster-accounting.c68dmmc6ycyr.us-east-1.rds.amazonaws.com # Endpoint from RDS console
StoragePort=3306                                                                # Port from RDS console

See below for key values that you should plug into the example configuration file:

  • DbdHost: the name of the machine where the Slurm Database Daemon is executed. This is typically the master node of your AWS ParallelCluster. You can run hostname -s on your master node to get this value.
  • DbdPort: The port number that the Slurm Database Daemon (slurmdbd) listens to for work. 6819 is the default value.
  • StorageUser: Define the user name used to connect to the database. This has been defined during the Amazon RDS configuration as shown in the second step of the previous section.
  • StoragePass: Define the password used to gain access to the database. Defined as the user name during the Amazon RDS configuration.
  • StorageHost: Define the name of the host running the database. You can find this value in the Amazon RDS console, under “Connectivity & security”.
  • StoragePort: Define the port on which the database is listening. You can find this value in the Amazon RDS console, under “Connectivity & security”. (see the screenshot below for more information).

Modify the file

Add the following lines at the end of the slurm configuration file:

## /opt/slurm/etc/slurm.conf

Modify the following:

  • AccountingStorageHost: The hostname or address of the host where SlurmDBD executes. In our case this is again the master node of our AWS ParallelCluster, you can get this value by running hostname -s again.
  • AccountingStoragePort: The network port that SlurmDBD accepts communication on. It must be the same as DbdPort specified in /opt/slurm/etc/slurmdbd.conf
  • AccountingStorageUser: it must be the same as in /opt/slurm/etc/slurmdbd.conf (specified in the “Credential Settings” of your Amazon RDS database).


Restart the Slurm service and start the SlurmDB demon on the master node


Depending on the operating system you are running, this would look like:

  • Amazon Linux / Amazon Linux 2
[[email protected]]$ sudo /etc/init.d/slurm restart                                                                                                                                                                                                                                             
stopping slurmctld:                                        [  OK  ]
slurmctld is stopped
slurmctld is stopped
starting slurmctld:                                        [  OK  ]
[[email protected]]$ 
[[email protected]]$ sudo /opt/slurm/sbin/slurmdbd
  • CentOS7 and Ubuntu 16/18
[[email protected] ~]$ sudo /opt/slurm/sbin/slurmdbd
[[email protected] ~]$ sudo systemctl restart slurmctld

Note: even if you have jobs running, restarting the daemons will not affect them.

Check to see if your cluster is already in the Slurm Database:

/opt/slurm/bin/sacctmgr list cluster

And if it is not (see below):

[[email protected]]$ /opt/slurm/bin/sacctmgr list cluster
   Cluster     ControlHost  ControlPort   RPC     Share GrpJobs       GrpTRES GrpSubmit MaxJobs       MaxTRES MaxSubmit     MaxWall                  QOS   Def QOS 
---------- --------------- ------------ ----- --------- ------- ------------- --------- ------- ------------- --------- ----------- -------------------- --------- 
[[email protected]]$ 

You can add it as follows:

sudo /opt/slurm/bin/sacctmgr add cluster parallelcluster

You should now see something like the following:

[[email protected]]$ /opt/slurm/bin/sacctmgr list cluster
   Cluster     ControlHost  ControlPort   RPC     Share GrpJobs       GrpTRES GrpSubmit MaxJobs       MaxTRES MaxSubmit     MaxWall                  QOS   Def QOS 
---------- --------------- ------------ ----- --------- ------- ------------- --------- ------- ------------- --------- ----------- -------------------- --------- 
parallelc+         6817  8704         1                                                                                           normal           
[[email protected]]$ 

At this stage, you should be all set with your AWS ParallelCluster accounting configured to be stored in the Amazon RDS database.

Replicate the process on multiple clusters

The same database instance can be easily used for multiple clusters to log its accounting data in. To do this, repeat the last configuration step for your clusters built using AWS ParallelCluster that you want to share the same database.

The additional steps to follow are:

  • Ensure that all the clusters are in the same VPC (or, if you prefer to use multiple VPCs, you can choose to set up VPC-Peering)
  • Add the SecurityGroup of your new compute fleets (“parallelcluster-XXX-ComputeSecurityGroup-XYZ”) to your RDS database
  • Change the cluster name parameter at the very top of the file. This is in addition to the slurm configuration file ( /opt/slurm/etc/slurm.conf) editing explained prior.  By default, your cluster is called “parallelcluster.” You may want to change that to clearly identify other clusters using the same database. For instance: ClusterName=parallelcluster2

Once these additional steps are complete, you can run /opt/slurm/bin/sacctmgr list cluster  again. Now, you should see two (or multiple) clusters:

[[email protected] ]# /opt/slurm/bin/sacctmgr list cluster
   Cluster     ControlHost  ControlPort   RPC     Share GrpJobs       GrpTRES GrpSubmit MaxJobs       MaxTRES MaxSubmit     MaxWall                  QOS   Def QOS 
---------- --------------- ------------ ----- --------- ------- ------------- --------- ------- ------------- --------- ----------- -------------------- --------- 
parallelc+                            0  8704         1                                                                                           normal           
parallelc+         6817  8704         1                                                                                           normal           
[[email protected] ]#   

If you want to see the full name of your clusters, run the following:

[[email protected] ]# /opt/slurm/bin/sacctmgr list cluster format=cluster%30
[[email protected] ]# 

Note: If you check the Slurm logs (under /var/log/slurm*), you may see this error:

error: Database settings not recommended values: innodb_buffer_pool_size innodb_lock_wait_timeout

This error refers to default parameters that Amazon RDS sets for you on your MySQL database. You can change them by setting new “group parameters” as explained in the official documentation and in this support article. Please also note that the innodb_buffer_pool_size is related to the amount of memory available on your instance, so you may want to use a different instance type with higher memory to avoid this warning.

Run your first job and check the accounting

Now that the application is installed and configured, you can test it! Submit a job to Slurm, query your database, and check your job accounting information.

If you are using a brand new cluster, test it with a simple hostname job as follows:

[[email protected]]$ sbatch -N2 <<EOF
> #!/bin/sh
> srun hostname |sort
> srun sleep 10
Submitted batch job 31
[[email protected]]$

Immediately after you have submitted the job, you should see it with a state of “pending”:

[[email protected]]$ sacct
       JobID    JobName  Partition    Account  AllocCPUS      State ExitCode 
------------ ---------- ---------- ---------- ---------- ---------- -------- 
38               sbatch    compute                     2    PENDING      0:0 
[[email protected]]$ 

And, after a while the job should be “completed”:

[[email protected]]$ sacct
       JobID    JobName  Partition    Account  AllocCPUS      State ExitCode 
------------ ---------- ---------- ---------- ---------- ---------- -------- 
38               sbatch    compute                     2  COMPLETED      0:0 
38.batch           batch                                1  COMPLETED      0:0 
38.0           hostname                                2  COMPLETED      0:0 
[[email protected]]$

Now that you know your cluster works, you can build complex queries using sacct. See few examples below, and refer to the official documentation for more details:

[[email protected]]$ sacct --format=jobid,elapsed,ncpus,ntasks,state
       JobID    Elapsed      NCPUS   NTasks      State 
------------ ---------- ---------- -------- ---------- 
38             00:00:00          2           COMPLETED 
38.batch       00:00:00          1        1  COMPLETED 
38.0           00:00:00          2        2  COMPLETED 
39             00:00:00          2           COMPLETED 
39.batch       00:00:00          1        1  COMPLETED 
39.0           00:00:00          2        2  COMPLETED 
40             00:00:10          2           COMPLETED 
40.batch       00:00:10          1        1  COMPLETED 
40.0           00:00:00          2        2  COMPLETED 
40.1           00:00:10          2        2  COMPLETED 
[[email protected]]$ sacct --allocations
       JobID    JobName  Partition    Account  AllocCPUS      State ExitCode 
------------ ---------- ---------- ---------- ---------- ---------- -------- 
38               sbatch    compute                     2  COMPLETED      0:0 
39               sbatch    compute                     2  COMPLETED      0:0 
40               sbatch    compute                     2  COMPLETED      0:0 
[[email protected]]$ sacct -S2020-02-17 -E2020-02-20 -X -ojobid,start,end,state
       JobID               Start                 End      State 
------------ ------------------- ------------------- ---------- 
38           2020-02-17T12:25:12 2020-02-17T12:25:12  COMPLETED 
39           2020-02-17T12:25:12 2020-02-17T12:25:12  COMPLETED 
40           2020-02-17T12:27:59 2020-02-17T12:28:09  COMPLETED 
[[email protected]]$ 

If you have configured your cluster(s) for multiple users, you may want to look at the accounting info for all of these.  If you want to configure your clusters with multiple users, follow this blog post. It demonstrates how to configure AWS ParallelCluster with AWS Directory Services to create a multiuser, POSIX-compliant system with centralized authentication.

Each and every user can only look at his own accounting data. However, Slurm admins (or root) can see accounting info for every user. The following code shows accounting data coming from two clusters (parallelcluster and parallelcluster2) and from two users (ec2-user and nicola):

[[email protected] ~]# sacct -S 2020-01-01 --clusters=parallelcluster,parallelcluster2 --format=jobid,elapsed,ncpus,ntasks,state,user,cluster%20                                                                                                                                                        
       JobID    Elapsed      NCPUS   NTasks      State      User              Cluster 
------------ ---------- ---------- -------- ---------- --------- -------------------- 
36             00:00:00          2              FAILED  ec2-user      parallelcluster 
36.batch       00:00:00          1        1     FAILED                parallelcluster 
37             00:00:00          2           COMPLETED  ec2-user      parallelcluster 
37.batch       00:00:00          1        1  COMPLETED                parallelcluster 
37.0           00:00:00          2        2  COMPLETED                parallelcluster 
38             00:00:00          2           COMPLETED  ec2-user      parallelcluster 
38.batch       00:00:00          1        1  COMPLETED                parallelcluster 
38.0           00:00:00          2        2  COMPLETED                parallelcluster 
39             00:00:00          2           COMPLETED  ec2-user      parallelcluster 
39.batch       00:00:00          1        1  COMPLETED                parallelcluster 
39.0           00:00:00          2        2  COMPLETED                parallelcluster 
40             00:00:10          2           COMPLETED  ec2-user      parallelcluster 
40.batch       00:00:10          1        1  COMPLETED                parallelcluster 
40.0           00:00:00          2        2  COMPLETED                parallelcluster 
40.1           00:00:10          2        2  COMPLETED                parallelcluster 
41             00:00:29        144              FAILED  ec2-user      parallelcluster 
41.batch       00:00:29         36        1     FAILED                parallelcluster 
41.0           00:00:00        144      144  COMPLETED                parallelcluster 
41.1           00:00:01        144      144  COMPLETED                parallelcluster 
41.2           00:00:00          3        3  COMPLETED                parallelcluster 
41.3           00:00:00          3        3  COMPLETED                parallelcluster 
42             01:22:03        144           COMPLETED  ec2-user      parallelcluster 
42.batch       01:22:03         36        1  COMPLETED                parallelcluster 
42.0           00:00:01        144      144  COMPLETED                parallelcluster 
42.1           00:00:00        144      144  COMPLETED                parallelcluster 
42.2           00:00:39          3        3  COMPLETED                parallelcluster 
42.3           00:34:55          3        3  COMPLETED                parallelcluster 
43             00:00:11          2           COMPLETED  ec2-user      parallelcluster 
43.batch       00:00:11          1        1  COMPLETED                parallelcluster 
43.0           00:00:01          2        2  COMPLETED                parallelcluster 
43.1           00:00:10          2        2  COMPLETED                parallelcluster 
44             00:00:11          2           COMPLETED    nicola      parallelcluster 
44.batch       00:00:11          1        1  COMPLETED                parallelcluster 
44.0           00:00:01          2        2  COMPLETED                parallelcluster 
44.1           00:00:10          2        2  COMPLETED                parallelcluster 
4              00:00:10          2           COMPLETED    nicola     parallelcluster2 
4.batch        00:00:10          1        1  COMPLETED               parallelcluster2 
4.0            00:00:00          2        2  COMPLETED               parallelcluster2 
4.1            00:00:10          2        2  COMPLETED               parallelcluster2 
[[email protected] ~]# 

You can also directly query your database, and look at the accounting information stored in it or link your preferred BI tool to get insights from your HPC cluster. To do so, run the following code:

[[email protected]]$ mysql --host=parallelcluster-accounting.c68dmmc6ycyr.us-east-1.rds.amazonaws.com --port=3306 -u admin -p
Enter password: 
Welcome to the MySQL monitor.  Commands end with ; or \g.
Your MySQL connection id is 48
Server version: 8.0.16 Source distribution

Copyright (c) 2000, 2018, Oracle and/or its affiliates. All rights reserved.

Oracle is a registered trademark of Oracle Corporation and/or its
affiliates. Other names may be trademarks of their respective

Type 'help;' or '\h' for help. Type '\c' to clear the current input statement.

mysql> show databases;
| Database           |
| information_schema |
| mysql              |
| performance_schema |
| slurm_acct_db      |
4 rows in set (0.00 sec)

mysql> use slurm_acct_db;
Reading table information for completion of table and column names
You can turn off this feature to get a quicker startup with -A

Database changed
mysql> show tables;
| Tables_in_slurm_acct_db                 |
| acct_coord_table                        |
| acct_table                              |
| clus_res_table                          |
| cluster_table                           |
| convert_version_table                   |
| federation_table                        |
| parallelcluster_assoc_table             |
| parallelcluster_assoc_usage_day_table   |
| parallelcluster_assoc_usage_hour_table  |
| parallelcluster_assoc_usage_month_table |
| parallelcluster_event_table             |
| parallelcluster_job_table               |
| parallelcluster_last_ran_table          |
| parallelcluster_resv_table              |
| parallelcluster_step_table              |
| parallelcluster_suspend_table           |
| parallelcluster_usage_day_table         |
| parallelcluster_usage_hour_table        |
| parallelcluster_usage_month_table       |
| parallelcluster_wckey_table             |
| parallelcluster_wckey_usage_day_table   |
| parallelcluster_wckey_usage_hour_table  |
| parallelcluster_wckey_usage_month_table |
| qos_table                               |
| res_table                               |
| table_defs_table                        |
| tres_table                              |
| txn_table                               |
| user_table                              |
29 rows in set (0.00 sec)



You’re finally all set! In this blog post you set up a database using Amazon RDS, configured AWS ParallelCluster and Slurm to enable job accounting with your database, and learned how to query your job accounting history from your database using the sacct command or by running SQL queries.

Deriving insights for your HPC workloads doesn’t end when your workloads finish running. Now, you can better understand and optimize your usage patterns and generate ideas about how to wring more price-performance out of your HPC clusters on AWS. For retrospective analysis, you can easily understand whether specific jobs, projects, or users are responsible for driving your HPC usage on AWS.  For forward-looking analysis, you can better forecast future usage to set budgets with appropriate insight into your costs and your resource consumption.

You can also use these accounting systems to identify users who may require additional training on how to make the most of cloud resources on AWS. Finally, together with your spending patterns, you can better capture and explain the return on investment from all of the valuable HPC work you do. And, this gives you the raw data to analyze how you can get even more value and price-performance out of the work you’re doing on AWS.





Running Simcenter STAR-CCM+ on AWS with AWS ParallelCluster, Elastic Fabric Adapter and Amazon FSx for Lustre

Post Syndicated from Bala Thekkedath original https://aws.amazon.com/blogs/compute/running-simcenter-star-ccm-on-aws/

This post is contributed by Anh Tran, Senior HPC Specialist Solution Architect, AWS


AWS recently introduced many HPC services that boost the performance and scalability of Computational Fluid Dynamics (CFD) workloads on AWS. These services include: Amazon FSx for Lustre, Elastic Fabric Adapter (EFA), and AWS ParallelCluster 2.5.1. In this technical post, I walk through these three services. Additionally, I outline an example of using AWS ParallelCluster to set up an HPC system with EFA and Amazon FSx Lustre to run a CFD workload.  The CFD application that you will set up during this blog post is Simcenter STAR-CCM+  – the predominant CFD application from Siemens.

Service and solution overview


This blog primarily uses three services – Amazon FSx for Lustre, EFA, and AWS ParallelCluster. Let’s dig into each of these services before reviewing the solution.

Amazon FSx for Lustre

In December 2018, AWS released Amazon FSx for Lustre. This is a fully managed, high-performance file system, optimized for fast processing workloads, like HPC. Amazon FSx for Lustre allows users to access and alter data from either Amazon S3 or on-premises seamlessly and exceptionally fast. For example, you can launch and run a file system that provides low latency access to your data. Additionally, you can read and write data at speeds of up to hundreds of gigabytes per second of throughput, and millions of IOPS. This speed and low-latency unleashes innovation at an unparalleled pace.  This blog post uses the latest version of Amazon FSx for Lustre which recently added a new API for moving data in and out of Amazon S3. This API also includes POSIX support, which allows files to mount with the same user id. Additionally, the latest version also includes a new backup feature that allows you to back up your files to an S3 bucket. I go into more detail of how to take advantage of this at the end of the blog.

Elastic Fabric Adapter

In April of 2019, AWS released EFA. This enables you to run applications requiring high levels of inter-node communications at scale on AWS.

AWS ParallelCluster 2.5.1.

AWS ParallelCluster is an open source cluster management tool that simplifies deploying and managing HPC clusters with Amazon FSx for Lustre, EFA, a variety of job schedulers, and the MPI library of your choice. AWS ParallelCluster simplifies cluster orchestration on AWS so that HPC environments become easy-to-use even for if you’re new to the cloud.  AWS recently released AWS ParallelCluster 2.5.1 – which is the version we will use for this blog.

These three AWS HPC components are optimal for CFD applications. Together, they provide simple deployment of HPC systems on AWS, low latency network communication for MPI workloads, and a fast, parallel filesystem. Now, let’s take a look at how these services come together and seamlessly run a real CFD application: Simcenter STAR-CCM+.


AWS has a long-standing collaboration with Siemens. AWS and Siemens are dedicated to enhancing Siemens’ customer experiences when they run Simcenter STAR-CCM+ apps on AWS. I am excited to walk you through the steps and the best practices for running Simcenter STAR-CCM+, the predominant CFD application from Siemens.

The Simcenter STAR-CCM+ application runs on an HPC system. This system is optimized with EFA and Amazon FSx for Lustre — all of which is managed by AWS ParallelCluster. AWS ParallelCluster simplifies the deployment process to such an extent that you can set up your HPC cluster with a high throughput parallel file system (Amazon FSx for Lustre), a high-throughput and low-latency network interface (EFA), and high-bandwidth network interconnects (100 Gbps using C5n instances) in less than 15 minutes.

Now that you have the services and solution overviews, we can get started. This blog post includes the following steps:

  1. Creating an HPC infrastructure stack on AWS, which will include:
    • How to set up AWS ParallelCluster for best performance
    • How to enable EFA
    • How to enable the Amazon FSx Lustre file system and how to use some basic Amazon FSx for Lustre for STAR-CCM+
    • How to connect to a remote desktop session using NICE DCV
  2. Installing the Simcenter STAR-CCM+ application and how to submit a Simcenter STAR-CCM+ job to HPC cluster.

The following diagram outlines the steps


Setting up the HPC infrastructure stack

Before you can run Simcenter STAR-CCM+, you need to build an HPC cluster first.  Some best practices that you should consider when setting up a cluster on AWS include:

  • Turn off Hyper-Threading (HT): AWS instances have HT turned on by default.
  • Use EFA and a cluster placement group for your compute fleet to minimize the latency between nodes.
  • Select the right instance type for your compute fleet. Here, I use C5n.18xlarge because of its high-performance CPU, high-bandwidth bandwidth networking, and the EFA network interface capabilities.

With HPC best practices in mind, you can set up your AWS ParallelCluster

Set up AWS ParallelCluster:

$aws s3 mb s3://benchmark-starccm
make_bucket: benchmark-starccm

*Note: As an example, I create an S3 bucket benchmark-starccm, however you should create an S3 bucket with a different name of your choice because the S3 bucket name must be globally unique.

Let’s download the STAR-CCM+ installation file and a case file, then upload them to the S3 bucket that we just created.

  • Download latest Simcenter STAR-CCM+ package from the Siemens portal. It will look something like this: STAR-CCM+15.02.003_01_linux-x86_64-2.12_gnu7.1.zip
  • Download the Le Mans case file. The Le Mans case is 104 Million cells, which even today is considered large for a CFD case.  The file will look something like this:  LeMans_104M.sim

After downloading the STAR-CCM+ software and LeMans case file, upload them to the S3 bucket created above

aws s3 cp STAR-CCM+15.02.003_01_linux-x86_64-2.12_gnu7.1.zip s3://benchmark-starccm
aws s3 cp LeMans_104M.sim s3://benchmark-starccm/

We will use this same S3 bucket to install the Simcenter STAR-CCM+ application later in this tutorial

With all the ground work done, we can now build our HPC cluster. For more detailed instructions, you can consult Getting Started with AWS ParallelCluster.

  • Install AWS ParallelCluster
pip install aws-parallelcluster
  • Configure AWS ParallelCluster with some basic network information such as AWS Region ID, VPC ID, Subnet ID

pcluster configure

Modify your ~/.parallelcluster/config file to include a cluster section that minimally includes the following:

aws_region_name = us-east-2

[cluster default]
vpc_settings = public
key_name = <Key-Name>
initial_queue_size = 2
max_queue_size = 100
maintain_initial_size = true
placement_group = DYNAMIC
placement = cluster
master_instance_type = c5.xlarge
compute_instance_type = c5n.18xlarge
cluster_type = ondemand

base_os = centos7
tags = {“Name” : “STARCCM”}

enable_efa = compute
fsx_settings = fsxshared
disable_hyperthreading = true

dcv_settings = hpc-dcv

[vpc public]
master_subnet_id = subnet-<Subnet-ID>
vpc_id = vpc-<VPC-ID>

update_check = true
sanity_check = true
cluster_template = default

[dcv hpc-dcv]

enable = master

[fsx fsxshared]
shared_dir = /fsx
storage_capacity = 1200
imported_file_chunk_size = 1024
import_path = s3://benchmark-starccm

export_path = s3://benchmark-starccm

ssh = ssh {CFN_USER}@{MASTER_IP} {ARGS}

  • Now, create your first HPC cluster with the name starccmby running

pcluster create starccm

Your HPC cluster should be ready in about 15 minutes.

While you’re waiting for your cluster to be ready, let’s take a deeper look at what some of the different parameters we used mean for our HPC cluster:

initial_queue_size: We will start with two compute instances after the HPC cluster is up.

max_queue_size: We will limit the maximum compute fleet to 100 instances. This allows us room to scale our jobs up to a large number of cores while putting a limit on the number of compute nodes to help control costs.

base_os: For this blog, we will select centos 7 as a base os. Currently we support Amazon Linux (alinux), Centos 7 (centos7), Ubuntu 16.04 (ubuntu1604), and Ubuntu 18.04 (ubuntu1804) with EFA.

master_instance_type: This can be any instance type. Here we choose c5.xlarge because it is inexpensive and relatively fast for the head node.

compute_instance_type: We select C5n.18xlarge because it is optimized for compute-intensive workloads and supports EFA for better scaling of HPC. Note that EFA is currently only available on c5n.18xlarge, c5n.metal, i3en.24xlarge, p3dn.24xlarge, inf1.24xlarge, m5dn.24xlarge, m5n.24xlarge, r5dn.24xlarge, r5n.24xlarge, and p3dn.24xlarge. See the docs for Currently supported instances.

placement_group: We use placement_group to ensure our instances are located as physically close to one another as possible to minimize the latency between compute nodes and take advantage of EFA’s low latency networking.

enable_efa: with just one configuration line, we can easily turn on EFA support for our HPC cluster.

dcv_settings = hpc-dcv: With AWS ParallelCluster 2.5.1 you can use NICE DCV to support your remote visualization needs.

disable_hyperthreading: This setting turns off hyper-threading on the cluster

[fsx fsxshared]: This section contains the settings to define your FSx for Lustre parallel file system, including the location where the shared directory will be mounted, the storage capacity for the filesystem, the chunk size for files to be imported, and the location from which the data will be imported. You can read more about FSx for Lustre here.

[dcv hpc-dcv]: This section contains the settings to define your remote visualization setup. You can read more about DCV with AWS ParallelCluster here.

  •  After you set up your config file for AWS ParallelCluster, log in and verify that you can access the cluster’s head node
pcluster ssh starccm -i ~/path/to/ssh_key
  • Verify the compute nodes are up. We should see two c5n.18xlarge nodes.
$ qhost
global - - - - - - - - - -
ip-172-31-14-220 lx-amd64 36 2 36 36 0.49 184.6G 11.7G 0.0 0.0
ip-172-31-2-137 lx-amd64 36 2 36 36 0.45 184.6G 11.7G 0.0 0.0
  • Verify the EFA driver has been loaded successfully.

In order to verify if EFA is installed correctly you will need to ssh into one of compute nodes and run :

[[email protected] ~]$ which mpirun

[[email protected] ~]$ fi_info -p efa

provider: efa
fabric: EFA-fe80::97:9fff:fe1e:4e78
domain: efa_0-rdm
version: 2.0
type: FI_EP_RDM
protocol: FI_PROTO_EFA

provider: efa
fabric: EFA-fe80::97:9fff:fe1e:4e78
domain: efa_0-dgrm
version: 2.0
protocol: FI_PROTO_EFA
provider: efa;ofi_rxd
fabric: EFA-fe80::97:9fff:fe1e:4e78
domain: efa_0-dgrm
version: 1.0
type: FI_EP_RDM
protocol: FI_PROTO_RXD

At this point, EFA is verified.

Install Simcenter STAR-CCM+ application

Now that the HPC cluster using AWS ParallelCluster is set up, it’s time to install the Simcenter STAR-CCM+ application.  In the prior steps, you uploaded a Simcenter STAR-CCM+ application and a case file to S3 bucket and used that S3 bucket as a source for the Amazon FSx for Lustre /fsx storage. As soon as the cluster created, the installation file and the case file will be available in /fsx

[[email protected] fsx]$ cd /fsx
[[email protected] fsx]$ ls
LeMans_104M.sim  STAR-CCM-14.06.013_01_linux-x86_64.tar.gz  STAR-CCM+14.06.013_01_linux-x86_64.tar.gz

As you can see, Amazon FSx for Lustre has already downloaded the case file from the S3 bucket to the /fsx partition, so now you can start install Simcenter STAR-CCM+ software using the following steps.

  • Install Simcenter STAR-CCM+: install Simcenter STAR-CCM+ on /fsx  – a 1.2TB Lustre filesystem that you configured in a previous step.
cd /fsx
sudo unzip STAR-CCM+15.02.003_01_linux-x86_64-2.12_gnu7.1.zip
cd STAR-CCM+15.02.003_01_linux-x86_64-2.12_gnu7.1
Select Installation Location : /fsx/Siemens

After following all the standard installation steps from Simcenter STAR-CCM+, the application should be installed at the following location:


  • Test the installation
[[email protected] fsx]$ /fsx/Siemens/15.02.003/STAR-CCM+15.02.003/star/bin/starccm+ -version
Simcenter STAR-CCM+ 2020.1 Build 15.02.003 (linux-x86_64-2.12/gnu7.1)

When the above code shows up, you correctly installed Simcenter STAR-CCM+, so now you can run the application.

Running Simcenter STAR-CCM+ on AWS ParallelCluster

Before I move on, let’s recap what you’ve have done so far.

  1. You set up an HPC cluster using AWS ParallelCluster with compute-optimized C5n instances, an Amazon FSx for Lustre filesystem, and EFA-enabled networking.
  2. You installed Simcenter STAR-CCM+ application

Now, let us create an SGE submission script and submit a job to the HPC cluster.

  • Create an SGE job submission script :
cd /fsx/

vi star-ccm.qsub

#$ -N check
#$ -cwd
#$ -j Y
#$ -pe mpi 252


your_pod_key="your license key"
ccmp=" /fsx/Siemens/15.02.003/STAR-CCM+15.02.003/star/bin/starccm+”


${ccmp} \
-power \
-podkey $your_pod_key \
-licpath [email protected] \
-mpi openmpi \
-bs sge \
-benchmark:"-preits 40 -nits 20 -nps ${NSLOTS} -tag ${tag}" \


mkdir ${NSLOTS}new
mv *xml ${NSLOTS}new
  • Submit your Simcenter STAR-CCM+ job to your HPC cluster
qsub -V star-ccm.qsub

Now you have submitted an HPC job that requests 252 cores of c5n.18xlarge.

  • Check the status of your jobs by running
qstat -f

Simcenter STAR-CCM+ result analysis

Here are sample scaling results for the LeMans 104M cell benchmark case. As you can see, the Simcenter STAR-CCM+ result shows exciting performance and scaling results running on AWS. The performance is fast and the simulation scales very well with EFA-based networking and great CPU performance.

Connect to a NICE DCV session

AWS ParallelCluster is now natively integrated with NICE DCV. You can configure a NICE DCV session to visualize your STAR-CCM+ result or connect to the application remotely.

As you will recall from when we configured our AWS ParallelCluster in the previous section, I named the DCV server as hpc-dcv.  To create and connect to a NICE DCV session, just run:

pcluster dcv connect hpc-dcv -k <Key-Name>

After you connect to NICE DCV session, you will be able to access to a Linux desktop to work with STAR-CCM+ application.

Backup SIMCENTER STAR-CCM+ result to S3 bucket

After you finish your STAR-CCM+ simulation, you can backup data in /fsx to your S3 bucket that you created from running your application. You can now use Data Repository Tasks. Data Repository Tasks represent bulk operations between your Amazon FSx for Lustre file system and your S3 bucket. One of the jobs is to export your changed file system contents back to its linked S3 bucket.

*Note: in order to use new Amazon FSx for Lustre feature, you will need to have AWS CLI version 1.16.309 or above.

In this case, I select to export the  STAR-CCM+ application directory Siemens as an example.

  • Exit the HPC head node, and go back to your laptop or Cloud9 environment where you have configured your AWS CLI. Find out your Amazon FSx Lustre ID by running:
aws fsx describe-file-systems
  • After you find the Amazon FSx for Lustre ID, which looks simiar to fs-0d72d520f620d765a, create a backup of the data by running:

aws fsx create-data-repository-task --file-system-id fs-0d72d520f620d765a --type EXPORT_TO_REPOSITORY --paths Siemens,testfsx --report Enabled=true,Scope=FAILED_FILES_ONLY,Format=REPORT_CSV_20191124,Path=s3://benchmark-starccm/


–file-system-id: your file system ID

–type EXPORT_TO_REPOSITORY: we will export the data back to the S3 bucket

–paths Siemens,testfsx: the directories you want to export to S3 bucket

Format=REPORT_CSV_20191124:  note this is only name the Amazon FSx Lustre supports. Please keep it the same.

  • Check the status of the backup by running describe-create-data-repository-task
aws fsx describe-data-repository-tasks

As you can see, I can now use FSx for Lustre to install the Simcenter STAR-CCM+ application, and seamlessly move case data between on-premise to Amazon S3 and AWS HPC system. I can create a file system linked to a S3 bucket, create a FSx for Lustre filesystem, and export data back to their S3 bucket after running the CFD app.


Give it a try, set up an HPC environment, and let us know how it goes! If you need more information about running CFD and HPC cases on AWS you can find it on our HPC home page. Please feel free to contact us with questions that you might have.


Processing batch jobs quickly, cost-efficiently, and reliably with Amazon EC2 On-Demand and Spot Instances

Post Syndicated from Bala Thekkedath original https://aws.amazon.com/blogs/compute/processing-batch-jobs-quickly-cost-efficiently-and-reliably-with-amazon-ec2-on-demand-and-spot-instances/

This post is contributed by Alex Kimber, Global Solutions Architect

No one asks for their High Performance Computing (HPC) jobs to take longer, cost more, or have more variability in the time to get results. Fortunately, you can combine Amazon EC2 and Amazon EC2 Auto Scaling to make the delivery of batch workloads fast, cost-efficient, and reliable. Spot Instances offer spare AWS compute power at a considerable discount. Customers such as Yelp, NASA, and FINRA use them to reduce costs and get results faster.

This post outlines an approach that combines On-Demand Instances and Spot Instances to balance a predictable delivery of HPC results with an opportunistic approach to cost optimization.



This approach will be demonstrated via a simple batch-processing environment with the following components:

  • A producer Python script to generate batches of tasks to process. You can develop this script in the AWS Cloud9 development environment. This solution also uses the environment to run the script and generate tasks.
  • An Amazon SQS queue to manage the tasks.
  • A consumer Python script to take incomplete tasks from the queue, simulate work, and then remove them from the queue after they’re complete.
  • Amazon EC2 Auto Scaling groups to model scenarios.
  • Amazon CloudWatch alarms to trigger the Auto Scaling groups and detect whether the queue is empty. The EC2 instances run the consumer script in a loop on startup.


Testing On-Demand Instances

In this scenario, an HPC batch of 6,000 tasks must complete within five hours. Each task takes eight minutes to complete on a single vCPU.

A simple approach to meeting the target is to provision 160 vCPUs using 20 c5.2xlarge On-Demand Instances. Each of the instances should complete 60 tasks per hour, completing the batch in approximately five hours. This approach provides an adequate level of predictability. You can test this approach with a simple Auto Scaling group configuration, set to create 20 c5.2xlarge instances if the queue has any pending visible messages. As expected, the batch takes approximately five hours, as shown in the following screenshot.

In the Ireland Region, using 20 c5.2xlarge instances for five hours results in a cost of $0.384 per hour for each instance.  The batch total is $38.40.


Testing On-Demand and Spot Instances

The alternative approach to the scenario also provisions sufficient capacity for On-Demand Instances to meet the target time, in this case 20 instances. This approach gives confidence that you can meet the batch target of five hours regardless of what other capacity you add.

You can then configure the Auto Scaling group to also add a number of Spot Instances. These instances are more numerous, with the aim of delivering the results at a lower cost and also allowing the batch to complete much earlier than it would otherwise. When the queue is empty it automatically terminates all of the instances involved to prevent further charges. This example configures the Auto Scaling group to have 80 instances in total, with 20 On-Demand Instances and 60 Spot Instances. Selecting multiple different instance types is a good strategy to help secure Spot capacity by diversification.

Spot Instances occasionally experience interruptions when AWS must reclaim the capacity with a two-minute warning. You can handle this occurrence gracefully by configuring your batch processor code to react to the interruption, such as checkpointing progress to some data store. This example has the SQS visibility timeout on the SQS queue set to nine minutes, so SQS re-queues any task that doesn’t complete in that time.

To test the impact of the new configuration another 6000 tasks are submitted into the SQS queue. The Auto Scaling group quickly provisions 20 On-Demand and 60 Spot Instances.

The instances then quickly set to work on the queue.

The batch completes in approximately 30 minutes, which is a significant improvement. This result is due to the additional Spot Instance capacity, which gave a total of 2,140 vCPUs.

The batch used the following instances for 30 minutes.


Instance Type Provisioning Host Count Hourly Instance Cost Total 30-minute batch cost
c5.18xlarge Spot 15  $     1.2367  $     9.2753
c5.2xlarge Spot 22  $     0.1547  $     1.7017
c5.4xlarge Spot 12  $     0.2772  $     1.6632
c5.9xlarge Spot 11  $     0.6239  $     3.4315
c5.2xlarge On-Demand 13  $     0.3840  $     2.4960
c5.4xlarge On-Demand 3  $     0.7680  $     1.1520
c5.9xlarge On-Demand 4  $     1.7280  $     3.4560

The total cost is $23.18, which is approximately 60 percent of the On-Demand cost and allows you to compute the batch 10 times faster. This example also shows no interruptions to the Spot Instances.



This post demonstrated that by combining On-Demand and Spot Instances you can improve the performance of a loosely coupled HPC batch workload without compromising on the predictability of runtime. This approach balances reliability with improved performance while reducing costs. The use of Auto Scaling groups and CloudWatch alarms makes the solution largely automated, responding to demand and provisioning and removing capacity as required.

Leveraging Elastic Fabric Adapter to run HPC and ML Workloads on AWS Batch

Post Syndicated from Bala Thekkedath original https://aws.amazon.com/blogs/compute/leveraging-efa-to-run-hpc-and-ml-workloads-on-aws-batch/

Leveraging Elastic Fabric Adapter to run HPC and ML Workloads on AWS Batch

 This post is contributed by  Sean Smith, Software Development Engineer II, AWS ParallelCluster & Arya Hezarkhani, Software Development Engineer II, AWS Batch and HPC


On August 2, 2019, AWS Batch announced support for Elastic Fabric Adapter (EFA). This enables you to run highly performant, distributed high performance computing (HPC) and machine learning (ML) workloads by using AWS Batch’s managed resource provisioning and job scheduling.

EFA is a network interface for Amazon EC2 instances that enables you to run applications requiring high levels of inter-node communications at scale on AWS. Its custom-built operating system bypasses the hardware interface and enhances the performance of inter-instance communications, which is critical to scaling these applications. With EFA, HPC applications using the Message Passing Interface (MPI) and ML applications using NVIDIA Collective Communications Library (NCCL) can scale to thousands of cores or GPUs. As a result, you get the application performance of on-premises HPC clusters with the on-demand elasticity and flexibility of the AWS Cloud.

AWS Batch is a cloud-native batch scheduler that manages instance provisioning and job scheduling. AWS Batch automatically provisions instances according to job specifications, with the appropriate placement group, networking configurations, and any user-specified file system. It also automatically sets up the EFA interconnect to the instances it launches, which you specify through a single launch template parameter.

In this post, we walk through the setup of EFA on AWS Batch and run the NAS Parallel Benchmark (NPB), a benchmark suite that evaluates the performance of parallel supercomputers, using the open source implementation of MPI, OpenMPI.



This walk-through assumes:


Configuring your compute environment

First, configure your compute environment to launch instances with the EFA device.

Creating an EC2 placement group

The first step is to create a cluster placement group. This is a logical grouping of instances within a single Availability Zone. The chief benefit of a cluster placement group is non-blocking, non-oversubscribed, fully bi-sectional network connectivity. Use a Region that supports EFA—currently, that is us-east-1, us-east-2, us-west-2, and eu-west-1. Run the following command:

$ aws ec2 create-placement-group –group-name “efa” –strategy “cluster” –region [your-region]

Creating an EC2 launch template

Next, create a launch template that contains a user-data script to install EFA libraries onto the instance. Launch templates enable you to store launch parameters so that you do not have to specify them every time you launch an instance. This will be the launch template used by AWS Batch to scale the necessary compute resources in your AWS Batch Compute Environment.

First, encode the user data into base64-encoding. This example uses the CLI utility base64 to do so.


$ echo “MIME-Version: 1.0

Content-Type: multipart/mixed; boundary=”==MYBOUNDARY==”



Content-Type: text/cloud-boothook; charset=”us-ascii” cloud-init-per once yum_wget yum install -y wget

cloud-init-per once wget_efa wget -q –timeout=20 https://s3-us-west-2.amazonaws.com/

cloud-init-per once tar_efa tar -xf /tmp/aws-efa-installer-latest.tar.gz -C /tmp pushd /tmp/aws-efa-installer

cloud-init-per once install_efa ./efa_installer.sh -y pop /tmp/aws-efa-installer

cloud-init-per once efa_info /opt/amazon/efa/bin/fi_info -p efa

–==MYBOUNDARY==–” | base64


Save the base64-encoded output, because you need it to create the launch template.


Next, make sure that your default security group is configured correctly. On the EC2 console, select the default security group associated with your default VPC and edit the inbound rules to allow SSH and All traffic to itself. This must be set explicitly to the security group ID for EFA to work, as seen in the following screenshot.




Then edit the outbound rules and add a rule that allows all inbound traffic from the security group itself, as seen in the following screenshot. This is a requirement for EFA to work.




Now, create an ecsInstanceRole, the Amazon ECS instance profile that will be applied to Amazon EC2 instances in a Compute Environment. To create a role, follow these steps.

  1. Choose Roles, then Create Role.
  2. Select AWS Service, then EC2.
  3. Choose Permissions.
  4. Attach the managed policy AmazonEC2ContainerServiceforEC2Role.
  5. Name the role ecsInstanceRole.


You will create the launch template using the ID of the security group, the ID of a subnet in your default VPC, and the ecsInstanceRole that you created.

Next, choose an instance type that supports EFA, that’s denoted by the n in the instance name. This example uses c5n.18xlarge instances.

You also need an Amazon Machine Image (AMI) ID. This example uses the latest ECS-optimized AMI based on Amazon Linux 2. Grab the AMI ID that corresponds to the Region that you are using.

This example uses UserData to install EFA. This adds 1.5 minutes of bootstrap time to the instance launch. In production workloads, bake the EFA installation into the AMI to avoid this additional bootstrap delay.

Now create a file called launch_template.json with the following content, making sure to substitute the account ID, security group, subnet ID, AMI ID, and key name.


“LaunchTemplateName”: “EFA-Batch-LaunchTemplate”, “LaunchTemplateData”: {

“InstanceType”: “c5n.18xlarge”,

“IamInstanceProfile”: {

“Arn”: “arn:aws:iam::<Account Id>:instance-profile/ecsInstanceRole”


“NetworkInterfaces”: [


“DeviceIndex”: 0,

“Groups”: [

“<Security Group>”


“SubnetId”: “<Subnet Id>”,

“InterfaceType”: “efa”,

“Description”: “NetworkInterfaces Configuration For EFA and Batch”



“Placement”: {

“GroupName”: “efa”


“TagSpecifications”: [


“ResourceType”: “instance”,

“Tags”: [


“Key”: “from-lt”,

“Value”: “networkInterfacesConfig-EFA-Batch”





“UserData”: “TUlNRS1WZXJzaW9uOiAxLjAKQ29udGVudC1UeXBlOiBtdWx0aXBhcnQvbWl4ZWQ7IGJvdW5kYXJ5PSI9PU1ZQk9VTkRBUlk9PSIKCi0tPT1NWUJPVU5EQVJZPT0KQ29udGVudC1UeXBlOiB0ZXh0L2Nsb3VkLWJvb3Rob29rOyBjaGFyc2V0PSJ1cy1hc2NpaSIKCmNsb3VkLWluaXQtcGVyIG9uY2UgeXVtX3dnZXQgeXVtIGluc3RhbGwgLXkgd2dldAoKY2xvdWQtaW5pdC1wZXIgb25jZSB3Z2V0X2VmYSB3Z2V0IC1xIC0tdGltZW91dD0yMCBodHRwczovL3MzLXVzLXdlc3QtMi5hbWF6b25hd3MuY29tL2F3cy1lZmEtaW5zdGFsbGVyL2F3cy1lZmEtaW5zdGFsbGVyLWxhdGVzdC50YXIuZ3ogLU8gL3RtcC9hd3MtZWZhLWluc3RhbGxlci1sYXRlc3QudGFyLmd6CgpjbG91ZC1pbml0LXBlciBvbmNlIHRhcl9lZmEgdGFyIC14ZiAvdG1wL2F3cy1lZmEtaW5zdGFsbGVyLWxhdGVzdC50YXIuZ3ogLUMgL3RtcAoKcHVzaGQgL3RtcC9hd3MtZWZhLWluc3RhbGxlcgpjbG91ZC1pbml0LXBlciBvbmNlIGluc3RhbGxfZWZhIC4vZWZhX2luc3RhbGxlci5zaCAteQpwb3AgL3RtcC9hd3MtZWZhLWluc3RhbGxlcgoKY2xvdWQtaW5pdC1wZXIgb25jZSBlZmFfaW5mbyAvb3B0L2FtYXpvbi9lZmEvYmluL2ZpX2luZm8gLXAgZWZhCgotLT09TVlCT1VOREFSWT09LS0K”



Create a launch template from that file:


$ aws ec2 create-launch-template –cli-input-json file://launch_template.json


“LaunchTemplate”: {

“LatestVersionNumber”: 1,

“LaunchTemplateId”: “lt-*****************”, “LaunchTemplateName”: “EFA-Batch-LaunchTemplate”, “DefaultVersionNumber”: 1,

“CreatedBy”: “arn:aws:iam::************:user/desktop-user”, “CreateTime”: “2019-09-23T13:00:21.000Z”




Creating a compute environment

Next, create an AWS Batch Compute Environment. This uses the information from the launch template

EFA-Batch-Launch-Template created earlier.



“computeEnvironmentName”: “EFA-Batch-ComputeEnvironment”,

“type”: “MANAGED”,

“state”: “ENABLED”,

“computeResources”: {

“type”: “EC2”,

“minvCpus”: 0,

“maxvCpus”: 2088,

“desiredvCpus”: 0,

“instanceTypes”: [



“subnets”: [



“instanceRole”: “arn:aws:iam::<account-id>:instance-profile/ecsInstanceRole”,

“launchTemplate”: {

“launchTemplateName”: “EFA-Batch-LaunchTemplate”,

“version”: “$Latest”



“serviceRole”: “arn:aws:iam::<account-id>:role/service-role/AWSBatchServiceRole”



Now, create the compute environment:

$ aws batch create-compute-environment –cli-input-json file://compute_environment.json


“computeEnvironmentName”: “EFA-Batch-ComputeEnvironment”, “computeEnvironmentArn”: “arn:aws:batch:us-east-1:<Account Id>:compute-environment”



Building the container image

To build the container, clone the repository that contains the Dockerfile used in this example.

First, install git:

$ git clone https://github.com/aws-samples/aws-batch-efa.git


In that repository, there are several files, one of which is the following Dockerfile.


FROM amazonlinux:1 ENV USER efauser


RUN yum update -y

RUN yum install -y which util-linux make tar.x86_64 iproute2 gcc-gfortran openssh-serv RUN pip-2.7 install supervisor


RUN useradd -ms /bin/bash $USER ENV HOME /home/$USER


##################################################### ## SSH SETUP


RUN mkdir -p ${SSHDIR} \

&& touch ${SSHDIR}/sshd_config \

&& ssh-keygen -t rsa -f ${SSHDIR}/ssh_host_rsa_key -N ” \

&& cp ${SSHDIR}/ssh_host_rsa_key.pub ${SSHDIR}/authorized_keys \ && cp ${SSHDIR}/ssh_host_rsa_key ${SSHDIR}/id_rsa \

&& echo ”  IdentityFile ${SSHDIR}/id_rsa” >> ${SSHDIR}/config \ && echo ”  StrictHostKeyChecking no” >> ${SSHDIR}/config \

&& echo ”  UserKnownHostsFile /dev/null” >> ${SSHDIR}/config \ && echo ”  Port 2022″ >> ${SSHDIR}/config \

&& echo ‘Port 2022’ >> ${SSHDIR}/sshd_config \

&& echo ‘UsePrivilegeSeparation no’ >> ${SSHDIR}/sshd_config \

&& echo “HostKey ${SSHDIR}/ssh_host_rsa_key” >> ${SSHDIR}/sshd_config \ && echo “PidFile ${SSHDIR}/sshd.pid” >> ${SSHDIR}/sshd_config \

&& chmod -R 600 ${SSHDIR}/* \

&& chown -R ${USER}:${USER} ${SSHDIR}/


# check if ssh agent is running or not, if not, run RUN eval `ssh-agent -s` && ssh-add ${SSHDIR}/id_rsa


################################################# ## EFA and MPI SETUP

RUN curl -O https://s3-us-west-2.amazonaws.com/aws-efa-installer/aws-efa-installer-1. && tar -xf aws-efa-installer-1.5.0.tar.gz \

&& cd aws-efa-installer \

&& ./efa_installer.sh -y –skip-kmod –skip-limit-conf –no-verify


RUN wget https://www.nas.nasa.gov/assets/npb/NPB3.3.1.tar.gz \ && tar xzf NPB3.3.1.tar.gz

COPY make.def_efa /NPB3.3.1/NPB3.3-MPI/config/make.def COPY suite.def      /NPB3.3.1/NPB3.3-MPI/config/suite.def


RUN cd /NPB3.3.1/NPB3.3-MPI \

&& make suite \

&& chmod -R 755 /NPB3.3.1/NPB3.3-MPI/



## supervisor container startup


ADD conf/supervisord/supervisord.conf /etc/supervisor/supervisord.conf ADD supervised-scripts/mpi-run.sh supervised-scripts/mpi-run.sh

RUN chmod 755 supervised-scripts/mpi-run.sh



ADD batch-runtime-scripts/entry-point.sh batch-runtime-scripts/entry-point.sh RUN chmod 755 batch-runtime-scripts/entry-point.sh


CMD /batch-runtime-scripts/entry-point.sh


To build this Dockerfile, run the included Makerfile with:


Now, push the created container image to Amazon Elastic Container Registry (ECR), so you can use it in your AWS Batch JobDefinition:

From the AWS CLI, create an ECR repository, we’ll call it aws-batch-efa:

$ aws ecr create-repository –repository-name aws-batch-efa


“repository”: {

“registryId”: “<Account-Id>”,
“repositoryName”: “aws-batch-efa”,
“repositoryArn”: “arn:aws:ecr:us-east-2:<Account-Id>:repository/aws-batch-efa”,
“createdAt”: 1568154893.0,
“repositoryUri”: “<Account-Id>.dkr.ecr.us-east-2.amazonaws.com/aws-batch-efa”


Edit the top of the makefile and add your AWS account number and AWS Region.


To push the image to the ECR repository, run:

make tag
make push


Run the application

To run the application using AWS Batch multi-node parallel jobs, follow these steps.


Setting up the AWS Batch multi-node job definition

Set up the AWS Batch multi-node job definition and expose the EFA device to the container by following these steps.


First, create a file called job_definition.json with the following contents. This file holds the configurations for the AWS Batch JobDefinition. Specifically, this JobDefinition uses the newly supported field LinuxParameters.Devices to expose a particular device—in this case, the EFA device path /dev/infiniband/uverbs0—to the container. Be sure to substitute the image URI with the one you pushed to ECR in the previous step. This is used to start the container.



“jobDefinitionName”: “EFA-MPI-JobDefinition”,

“type”: “multinode”,

“nodeProperties”: {

“numNodes”: 8,

“mainNode”: 0,

“nodeRangeProperties”: [


“targetNodes”: “0:”,

“container”: {

“user”: “efauser”,

“image”: “<Docker Image From Previous Section>”,

“vcpus”: 72,

“memory”: 184320,

“linuxParameters”: {

“devices”: [


“hostPath”: “/dev/infiniband/uverbs0”




“ulimits”: [


“hardLimit”: -1,

“name”: “memlock”,

“softLimit”: -1









$ aws batch register-job-definition –cli-input-json file://job_definition.json


“jobDefinitionArn”: “arn:aws:batch:us-east-1:<account-id>:job-definition/EFA-MPI-JobDefinition”,

“jobDefinitionName”: “EFA-MPI-JobDefinition”,

“revision”: 1



Run the job

Next, create a job queue. This job queue points at the compute environment created before. When jobs are submitted to it, they queue until instances are available to run them.



“jobQueueName”: “EFA-Batch-JobQueue”,

“state”: “ENABLED”,

“priority”: 10,

“computeEnvironmentOrder”: [


“order”: 1,

“computeEnvironment”: “EFA-Batch-ComputeEnvironment”





aws   batch create-job-queue –cli-input-json  file://job_queue.json

Now that you’ve created all the resources, submit the job. The numNodes=8 parameter tells the job definition to use eight nodes.

aws   batch submit-job –job-name      example-mpi-job –job-queue EFA-Batch-JobQueue –job-definition EFA-MPI-JobDefinition –node-overrides numNodes=8


NPB overview

NPB is a small set of benchmarks derived from computational fluid dynamics (CFD) applications. They consist of five kernels and three pseudo-applications. This example runs the 3D Fast Fourier Transform (FFT) benchmark, as it tests all-to-all communication. For this run, use c5n.18xlarge, as configured in the AWS compute environment earlier. This is an excellent choice for this workload as it has an Intel Skylake processor (72 hyperthreaded cores) and 100 GB Enhanced Networking (ENA), which you can take advantage of with EFA.


This test runs the FT “C” Benchmark with eight nodes * 72 vcpus = 576 vcpus.


NAS Parallel Benchmarks 3.3 — FT Benchmark

No input file inputft.data. Using compiled defaults Size : 512x 512x 512

Iterations : 20

Number of processes : 512 Processor array : 1x 512 Layout type : 1D

Initialization time = 1.3533580760000063

T = 1 Checksum = 5.195078707457D+02 5.149019699238D+02

T = 2 Checksum = 5.155422171134D+02 5.127578201997D+02
T = 3 Checksum = 5.144678022222D+02 5.122251847514D+02
T = 4 Checksum = 5.140150594328D+02 5.121090289018D+02
T = 5 Checksum = 5.137550426810D+02 5.121143685824D+02
T = 6 Checksum = 5.135811056728D+02 5.121496764568D+02
T = 7 Checksum = 5.134569343165D+02 5.121870921893D+02
T = 8 Checksum = 5.133651975661D+02 5.122193250322D+02
T = 9 Checksum = 5.132955192805D+02 5.122454735794D+02
T = 10 Checksum = 5.132410471738D+02 5.122663649603D+02
T = 11 Checksum = 5.131971141679D+02 5.122830879827D+02
T = 12 Checksum = 5.131605205716D+02 5.122965869718D+02
T = 13 Checksum = 5.131290734194D+02 5.123075927445D+02
T = 14 Checksum = 5.131012720314D+02 5.123166486553D+02
T = 15 Checksum = 5.130760908195D+02 5.123241541685D+02
T = 16 Checksum = 5.130528295923D+02 5.123304037599D+02
T = 17 Checksum = 5.130310107773D+02 5.123356167976D+02
T = 18 Checksum = 5.130103090133D+02 5.123399592211D+02
T = 19 Checksum = 5.129905029333D+02 5.123435588985D+02
T = 20 Checksum = 5.129714421109D+02 5.123465164008D+02


Result verification successful class = C

FT Benchmark Completed. Class = C

Size = 512x 512x 512 Iterations = 20

Time in seconds = 1.92 Total processes = 512 Compiled procs = 512 Mop/s total = 206949.17 Mop/s/process = 404.20

Operation type = floating point Verification = SUCCESSFUL



In this post, we covered how to run MPI Batch jobs with an EFA-enabled elastic network interface using AWS Batch multi-node parallel jobs and an EC2 launch template. We used a launch template to configure the AWS Batch compute environment to launch an instance with the EFA device installed. We showed you how to expose the EFA device to the container. You also learned how to package an MPI benchmarking application, the NPB, as a Docker container, and how to run the application as an AWS Batch multi-node parallel job.

We hope you found the information in this post helpful and encouraging as to all the possibilities for HPC on AWS.

Deploying a Burstable and Event-driven HPC Cluster on AWS Using SLURM, Part 1

Post Syndicated from Geoff Murase original https://aws.amazon.com/blogs/compute/deploying-a-burstable-and-event-driven-hpc-cluster-on-aws-using-slurm-part-1/

Contributed by Amr Ragab, HPC Application Consultant, AWS Professional Services

When you execute high performance computing (HPC) workflows on AWS, you can take advantage of the elasticity and concomitant scale associated with recruiting resources for your computational workloads. AWS offers a variety of services, solutions, and open source tools to deploy, manage, and dynamically destroy compute resources for running various types of HPC workloads.

Best practices in deploying HPC resources on AWS include creating much of the infrastructure on-demand, and making it as ephemeral and dynamic as possible. Traditional HPC clusters use a resource scheduler that maintains a set of computational resources and distributes those resources over a collection of queued jobs.

With a central resource scheduler, all users have a single point of entry to a broad range of compute. Traditionally, many of these schedulers managed on-premises systems. They weren’t offered dynamically as much as cloud-based HPC clusters, and they usually only needed to manage a largely static set of resources.

However, many schedulers now support the ability to burst into AWS or manage a dynamically changing HPC environment through plugins, connectors, and custom scripting. Some of the more common resource schedulers include:


Simple Linux Resource Manager (SLURM) by SchedMD is one such popular scheduler. Using a derivative of SLURM’s elastic power plugin, you can coordinate the launch of a set of compute nodes with the appropriate CPU/disk/GPU/network topologies. You standup the compute resources for the job, instead of trying to fit a job within a set of pre-existing compute topologies.

We have recently released an example implementation of the SLURM bursting capability in the AWS Samples GitHub repo.

The following diagram shows the reference architecture.


Download the aws-plugin-for-slurm directory locally. Use the following AWS CLI commands to sync the directory with an S3 bucket to be referenced later. For more detailed instructions on the deployment follow the README within the GitHub.

git clone https://github.com/aws-samples/aws-plugin-for-slurm.git 
aws s3 sync aws-plugin-for-slurm/ s3://<bucket-name>

Included is a CloudFormation script, which you use to stand up the VPC and subnets, as well as the headnode. In AWS CloudFormation, choose Create Stack and import the slurm_headnode-clouformation.yml script.

The CloudFormation script lays down the landing zone with the appropriate network topology. The headnode is based on the publicly available CentOS 7.5 available in the AWS Marketplace. The latest security packages are installed with the dependencies needed to install SLURM.

I have found that scheduling performance is best if the source is compiled at runtime, which the CloudFormation script takes care of. The script sets up the headnode as a single controller. However, with minor modifications, it can be set up in a highly available manner with a backup SLURM controller.

After the deployment moves to CREATE_COMPLETEstatus in CloudFormation, use SSH to connect to the slurm headnode:

ssh -i <path/to/private/key.pem> [email protected]<public-ip-address>

Create a new sbatch job submission file by running the following commands in the vi/nano text editor:

#SBATCH —nodes=2
#SBATCH —ntasks-per-node=1
#SBATCH —cpus-per-task=4
#SBATCH —constraint=[us-west-2a]

sleep 60

This job submission script requests two nodes to be allocated, running one task per node and using four CPUs. The constraint is optional but allows SLURM to allocate the job among the available zones.

The elasticity of the cluster comes in setting the slurm.conf parameters SuspendProgram and ResumeProgram in the slurm.conf file.


You can set the responsiveness of the scaling on AWS by modifying SuspendTime. Do not set a value for ResumeRateor SuspendRate, as the underlying SuspendProgram and ResumeProgramscripts have API calls that impose their own rate limits. If you find that your API call rate limit is reached at scale (approximately 1000 nodes/sec), you can set ResumeRate and SuspendRate accordingly.

If you are familiar with SLURM’s configuration options, you can make further modifications by editing the /nfs/slurm/etc/slurm.conf.d/slurm_nodes.conffile. That file contains the node definitions, with some minor modifications. You can schedule GPU-based workloads separate from CPU to instantiate a heterogeneous cluster layout. You also get more flexibility running tightly coupled workloads alongside loosely coupled jobs, as well as job array support. For additional commands administrating the SLURM cluster, see the SchedMD SLURM documentation.

The initial state of the cluster shows that no compute resources are available. Run the sinfocommand:

[[email protected] ~]$ sinfo
all*         up  infinite     0   n/a 
gpu          up  infinite     0   n/a 
[[email protected] ~]$ 

The job described earlier is submitted with the sbatchcommand:

sbatch test.sbatch

The power plugin allocates the requested number of nodes based on your job definition and runs the Amazon EC2 API operations to request those instances.

[[email protected] ~]$ sinfo
all*      up    infinite      2 alloc# ip-10-0-1-[6-7]
gpu       up    infinite      2 alloc# ip-10-0-1-[6-7]
[[email protected] ~]$ 

The log file is located at /var/log/power_save.log.

Wed Sep 12 18:37:45 UTC 2018 Resume invoked 
/nfs/slurm/bin/slurm-aws-startup.sh ip-10-0-1-[6-7]

After the request job is complete, the compute nodes remain idle for the duration of the SuspendTime=60value in the slurm.conffile.

[[email protected] ~]$ sinfo
all*         up  infinite     2  idle ip-10-0-1-[6-7]
gpu          up  infinite     2  idle ip-10-0-1-[6-7]
[[email protected] ~]$ 

Ideally, ensure that other queued jobs have an opportunity to run on the current infrastructure, assuming that the job requirements are fulfilled by the compute nodes.

If the job requirements are not fulfilled and there are no more jobs in the queue, the aws-slurm shutdown script takes over and terminates the instance. That’s one of the benefits of an elastic cluster.

Wed Sep 12 18:42:38 UTC 2018 Suspend invoked /nfs/slurm/bin/slurm-aws-shutdown.sh ip-10-0-1-[6-7]
    "TerminatingInstances": [
            "InstanceId": "i-0b4c6ec4945afe52e", 
            "CurrentState": {
                "Code": 32, 
                "Name": "shutting-down"
            "PreviousState": {
                "Code": 16, 
                "Name": "running"
    "TerminatingInstances": [
            "InstanceId": "i-0f3139a52f2602c60", 
            "CurrentState": {
                "Code": 32, 
                "Name": "shutting-down"
            "PreviousState": {
                "Code": 16, 
                "Name": "running"

The SLURM elastic compute plugin provisions the compute resources based on the scheduler queue load. In this example implementation, you are distributing a set of compute nodes to take advantage of scale and capacity across all Availability Zones within an AWS Region.

With a minor modification on the VPC layer, you can use this same plugin to stand up compute resources across multiple Regions. With this implementation, you can truly take advantage of a global HPC footprint.

“Imagine creating your own custom cluster mix of GPUs, CPUs, storage, memory, and networking – just the way you want it, then running your experiment, getting the results, and then tearing it all down.” — InsideHPC. Innovation Unbound: What Would You do with a Million Cores?

Part 2

In Part 2 of this post series, you integrate this cluster with AWS native services to handle scalable job submission and monitoring using Amazon API Gateway, AWS Lambda, and Amazon CloudWatch.

AWS Online Tech Talks – May and Early June 2018

Post Syndicated from Devin Watson original https://aws.amazon.com/blogs/aws/aws-online-tech-talks-may-and-early-june-2018/

AWS Online Tech Talks – May and Early June 2018  

Join us this month to learn about some of the exciting new services and solution best practices at AWS. We also have our first re:Invent 2018 webinar series, “How to re:Invent”. Sign up now to learn more, we look forward to seeing you.

Note – All sessions are free and in Pacific Time.

Tech talks featured this month:

Analytics & Big Data

May 21, 2018 | 11:00 AM – 11:45 AM PT Integrating Amazon Elasticsearch with your DevOps Tooling – Learn how you can easily integrate Amazon Elasticsearch Service into your DevOps tooling and gain valuable insight from your log data.

May 23, 2018 | 11:00 AM – 11:45 AM PTData Warehousing and Data Lake Analytics, Together – Learn how to query data across your data warehouse and data lake without moving data.

May 24, 2018 | 11:00 AM – 11:45 AM PTData Transformation Patterns in AWS – Discover how to perform common data transformations on the AWS Data Lake.


May 29, 2018 | 01:00 PM – 01:45 PM PT – Creating and Managing a WordPress Website with Amazon Lightsail – Learn about Amazon Lightsail and how you can create, run and manage your WordPress websites with Amazon’s simple compute platform.

May 30, 2018 | 01:00 PM – 01:45 PM PTAccelerating Life Sciences with HPC on AWS – Learn how you can accelerate your Life Sciences research workloads by harnessing the power of high performance computing on AWS.


May 24, 2018 | 01:00 PM – 01:45 PM PT – Building Microservices with the 12 Factor App Pattern on AWS – Learn best practices for building containerized microservices on AWS, and how traditional software design patterns evolve in the context of containers.


May 21, 2018 | 01:00 PM – 01:45 PM PTHow to Migrate from Cassandra to Amazon DynamoDB – Get the benefits, best practices and guides on how to migrate your Cassandra databases to Amazon DynamoDB.

May 23, 2018 | 01:00 PM – 01:45 PM PT5 Hacks for Optimizing MySQL in the Cloud – Learn how to optimize your MySQL databases for high availability, performance, and disaster resilience using RDS.


May 23, 2018 | 09:00 AM – 09:45 AM PT.NET Serverless Development on AWS – Learn how to build a modern serverless application in .NET Core 2.0.

Enterprise & Hybrid

May 22, 2018 | 11:00 AM – 11:45 AM PTHybrid Cloud Customer Use Cases on AWS – Learn how customers are leveraging AWS hybrid cloud capabilities to easily extend their datacenter capacity, deliver new services and applications, and ensure business continuity and disaster recovery.


May 31, 2018 | 11:00 AM – 11:45 AM PTUsing AWS IoT for Industrial Applications – Discover how you can quickly onboard your fleet of connected devices, keep them secure, and build predictive analytics with AWS IoT.

Machine Learning

May 22, 2018 | 09:00 AM – 09:45 AM PTUsing Apache Spark with Amazon SageMaker – Discover how to use Apache Spark with Amazon SageMaker for training jobs and application integration.

May 24, 2018 | 09:00 AM – 09:45 AM PTIntroducing AWS DeepLens – Learn how AWS DeepLens provides a new way for developers to learn machine learning by pairing the physical device with a broad set of tutorials, examples, source code, and integration with familiar AWS services.

Management Tools

May 21, 2018 | 09:00 AM – 09:45 AM PTGaining Better Observability of Your VMs with Amazon CloudWatch – Learn how CloudWatch Agent makes it easy for customers like Rackspace to monitor their VMs.


May 29, 2018 | 11:00 AM – 11:45 AM PT – Deep Dive on Amazon Pinpoint Segmentation and Endpoint Management – See how segmentation and endpoint management with Amazon Pinpoint can help you target the right audience.


May 31, 2018 | 09:00 AM – 09:45 AM PTMaking Private Connectivity the New Norm via AWS PrivateLink – See how PrivateLink enables service owners to offer private endpoints to customers outside their company.

Security, Identity, & Compliance

May 30, 2018 | 09:00 AM – 09:45 AM PT – Introducing AWS Certificate Manager Private Certificate Authority (CA) – Learn how AWS Certificate Manager (ACM) Private Certificate Authority (CA), a managed private CA service, helps you easily and securely manage the lifecycle of your private certificates.

June 1, 2018 | 09:00 AM – 09:45 AM PTIntroducing AWS Firewall Manager – Centrally configure and manage AWS WAF rules across your accounts and applications.


May 22, 2018 | 01:00 PM – 01:45 PM PTBuilding API-Driven Microservices with Amazon API Gateway – Learn how to build a secure, scalable API for your application in our tech talk about API-driven microservices.


May 30, 2018 | 11:00 AM – 11:45 AM PTAccelerate Productivity by Computing at the Edge – Learn how AWS Snowball Edge support for compute instances helps accelerate data transfers, execute custom applications, and reduce overall storage costs.

June 1, 2018 | 11:00 AM – 11:45 AM PTLearn to Build a Cloud-Scale Website Powered by Amazon EFS – Technical deep dive where you’ll learn tips and tricks for integrating WordPress, Drupal and Magento with Amazon EFS.





Serverless @ re:Invent 2017

Post Syndicated from Chris Munns original https://aws.amazon.com/blogs/compute/serverless-reinvent-2017/

At re:Invent 2014, we announced AWS Lambda, what is now the center of the serverless platform at AWS, and helped ignite the trend of companies building serverless applications.

This year, at re:Invent 2017, the topic of serverless was everywhere. We were incredibly excited to see the energy from everyone attending 7 workshops, 15 chalk talks, 20 skills sessions and 27 breakout sessions. Many of these sessions were repeated due to high demand, so we are happy to summarize and provide links to the recordings and slides of these sessions.

Over the course of the week leading up to and then the week of re:Invent, we also had over 15 new features and capabilities across a number of serverless services, including AWS Lambda, Amazon API Gateway, AWS [email protected], AWS SAM, and the newly announced AWS Serverless Application Repository!

AWS Lambda

Amazon API Gateway

  • Amazon API Gateway Supports Endpoint Integrations with Private VPCs – You can now provide access to HTTP(S) resources within your VPC without exposing them directly to the public internet. This includes resources available over a VPN or Direct Connect connection!
  • Amazon API Gateway Supports Canary Release Deployments – You can now use canary release deployments to gradually roll out new APIs. This helps you more safely roll out API changes and limit the blast radius of new deployments.
  • Amazon API Gateway Supports Access Logging – The access logging feature lets you generate access logs in different formats such as CLF (Common Log Format), JSON, XML, and CSV. The access logs can be fed into your existing analytics or log processing tools so you can perform more in-depth analysis or take action in response to the log data.
  • Amazon API Gateway Customize Integration Timeouts – You can now set a custom timeout for your API calls as low as 50ms and as high as 29 seconds (the default is 30 seconds).
  • Amazon API Gateway Supports Generating SDK in Ruby – This is in addition to support for SDKs in Java, JavaScript, Android and iOS (Swift and Objective-C). The SDKs that Amazon API Gateway generates save you development time and come with a number of prebuilt capabilities, such as working with API keys, exponential back, and exception handling.

AWS Serverless Application Repository

Serverless Application Repository is a new service (currently in preview) that aids in the publication, discovery, and deployment of serverless applications. With it you’ll be able to find shared serverless applications that you can launch in your account, while also sharing ones that you’ve created for others to do the same.

AWS [email protected]

[email protected] now supports content-based dynamic origin selection, network calls from viewer events, and advanced response generation. This combination of capabilities greatly increases the use cases for [email protected], such as allowing you to send requests to different origins based on request information, showing selective content based on authentication, and dynamically watermarking images for each viewer.


Twitch Launchpad live announcements

Other service announcements

Here are some of the other highlights that you might have missed. We think these could help you make great applications:

AWS re:Invent 2017 sessions

Coming up with the right mix of talks for an event like this can be quite a challenge. The Product, Marketing, and Developer Advocacy teams for Serverless at AWS spent weeks reading through dozens of talk ideas to boil it down to the final list.

From feedback at other AWS events and webinars, we knew that customers were looking for talks that focused on concrete examples of solving problems with serverless, how to perform common tasks such as deployment, CI/CD, monitoring, and troubleshooting, and to see customer and partner examples solving real world problems. To that extent we tried to settle on a good mix based on attendee experience and provide a track full of rich content.

Below are the recordings and slides of breakout sessions from re:Invent 2017. We’ve organized them for those getting started, those who are already beginning to build serverless applications, and the experts out there already running them at scale. Some of the videos and slides haven’t been posted yet, and so we will update this list as they become available.

Find the entire Serverless Track playlist on YouTube.

Talks for people new to Serverless

Advanced topics

Expert mode

Talks for specific use cases

Talks from AWS customers & partners

Looking to get hands-on with Serverless?

At re:Invent, we delivered instructor-led skills sessions to help attendees new to serverless applications get started quickly. The content from these sessions is already online and you can do the hands-on labs yourself!
Build a Serverless web application

Still looking for more?

We also recently completely overhauled the main Serverless landing page for AWS. This includes a new Resources page containing case studies, webinars, whitepapers, customer stories, reference architectures, and even more Getting Started tutorials. Check it out!

Well-Architected Lens: Focus on Specific Workload Types

Post Syndicated from Philip Fitzsimons original https://aws.amazon.com/blogs/architecture/well-architected-lens-focus-on-specific-workload-types/

Customers have been building their innovations on AWS for over 11 years. During that time, our solutions architects have conducted tens of thousands of architecture reviews for our customers. In 2012 we created the “Well-Architected” initiative to share with you best practices for building in the cloud, and started publishing them in 2015. We recently released an updated Framework whitepaper, and a new Operational Excellence Pillar whitepaper to reflect what we learned from working with customers every day. Today, we are pleased to announce a new concept called a “lens” that allows you to focus on specific workload types from the well-architected perspective.

A well-architected review looks at a workload from a general technology perspective, which means it can’t provide workload-specific advice. For example, there are additional best practices when you are building high-performance computing (HPC) or serverless applications. Therefore, we created the concept of a lens to focus on what is different for those types of workloads.

In each lens, we document common scenarios we see — specific to that workload — providing reference architectures and a walkthrough. The lens also provides design principles to help you understand how to architect these types of workloads for the cloud, and questions for assessing your own architecture.

Today, we are releasing two lenses:

Well-Architected: High-Performance Computing (HPC) Lens <new>
Well-Architected: Serverless Applications Lens <new>

We expect to create more lenses over time, and evolve them based on customer feedback.

Philip Fitzsimons, Leader, AWS Well-Architected Team