Tag Archives: HPC

Fire Dynamics Simulation CFD workflow using AWS ParallelCluster, Elastic Fabric Adapter, Amazon FSx for Lustre and NICE DCV

Post Syndicated from Emma White original https://aws.amazon.com/blogs/compute/fire-dynamics-simulation-cfd-workflow-using-aws-parallelcluster-elastic-fabric-adapter-amazon-fsx-for-lustre-and-nice-dcv/

This post was written by By Kevin Tuil, AWS HPC consultant 

Modeling fires is key for many industries, from the design of new buildings, defining evacuation procedures for trains, planes and ships, and even the spread of wildfires. Modeling these fires is complex. It involves both the need to model the three-dimensional unsteady turbulent flow of the fire and the many potential chemical reactions. To achieve this, the fire modeling community has moved to higher-fidelity turbulence modeling approaches such as the Large Eddy Simulation, which requires both significant temporal and spatial resolution. It means that the computational cost for these simulations is typically in the order of days to weeks on a single workstation.
While there are a number of software packages, one of the most popular is the open-source code: Fire Dynamics Simulation (FDS) developed by National Institute of Standards and Technology (NIST).

In this blog, I focus on how AWS High Performance Computing (HPC) resources (e.g AWS ParallelCluster, Amazon FSx for Lustre, Elastic Fabric Adapter (EFA), and Amazon S3) allow FDS users to scale up beyond a single workstation to hundreds of cores to achieve simulation times of hours rather than days or weeks. In this blog, I outline the architecture needed, providing scripts and templates to compile FDS and run your simulation.

Service and solution overview

AWS ParallelCluster

AWS ParallelCluster is an open source cluster management tool that simplifies deploying and managing HPC clusters with Amazon FSx for Lustre, EFA, a variety of job schedulers, and the MPI library of your choice. AWS ParallelCluster simplifies cluster orchestration on AWS so that HPC environments become easy-to-use, even if you are new to the cloud. AWS released AWS ParallelCluster 2.9.1 and its user guide – which is the version I use in this blog.

These three AWS HPC resources are optimal for Fire Dynamics Simulation. Together, they provide easy deployment of HPC systems on AWS, low latency network communication for MPI workloads, and a fast, parallel file system.

Elastic Fabric Adapter

EFA is a critical service that provides low latency and high-bandwidth 100 Gbps network communication. EFA allows applications to scale at the level of on-premises HPC clusters with the on-demand elasticity and flexibility of the AWS Cloud. Computational Fluid Dynamics (CFD), among other tightly coupled applications, is an excellent candidate for the use of EFA.

Amazon FSx for Lustre

Amazon FSx for Lustre is a fully managed, high-performance file system, optimized for fast processing workloads, like HPC. Amazon FSx for Lustre allows users to access and alter data from either Amazon S3 or on-premises seamlessly and exceptionally fast. For example, you can launch and run a file system that provides sub-millisecond latency access to your data. Additionally, you can read and write data at speeds of up to hundreds of gigabytes per second of throughput, and millions of IOPS. This speed and low-latency unleash innovation at an unparalleled pace. This blog post uses the latest version of Amazon FSx for Lustre, which recently added a new API for moving data in and out of Amazon S3. This API also includes POSIX support, which allows files to mount with the same user id. Additionally, the latest version also includes a new backup feature that allows you to back up your files to an S3 bucket.

Solution and steps

The overall solution that I deploy in this blog is represented in the following diagram:

solution overview diagram

Step 1: Access to AWS Cloud9 terminal and upload data

There are two ways to start using AWS ParallelCluster. You can either install AWS CLI or turn on AWS Cloud9, which is a cloud-based integrated development environment (IDE) that includes a terminal. For simplicity, I use AWS Cloud9 to create the HPC cluster. Please refer to this link to proceed to AWS Cloud9 set up and to this link for AWS CLI setup.

Once logged into your AWS Cloud9 instance, the first thing you want to create is the S3 bucket. This bucket is key to exchange user data in and out from the corporate data center and the AWS HPC cluster. Please make sure that your bucket name is unique globally, meaning there is only one worldwide across all AWS Regions.

aws s3 mb s3://fds-smv-bucket-unique
make_bucket: fds-smv-bucket-unique

Download the latest FDS-SMV Linux version package from the official NIST website. It looks something like: FDS6.7.4_SMV6.7.14_lnx.sh

For the geometry, it should be renamed to “geometry.fds”, and must be uploaded to your AWS Cloud9 or directly to your S3 bucket.

Please note that once the FDS-SMV package has been downloaded locally to the instance, you must upload it to the S3 bucket using the following command.

aws s3 cp FDS6.7.4_SMV6.7.14_lnx.sh s3://fds-smv-bucket-unique
aws s3 cp geometry.fds s3://fds-smv-bucket-unique

You use the same S3 bucket to install FDS-SMV later on with the Amazon FSx for Lustre File System.

Step 2: Set up AWS ParallelCluster

You can install AWS ParallelCluster running the following command from your AWS Cloud9 instance:

sudo pip install aws-parallelcluster

Once it is installed, you can run the following command to check the version:

pcluster version 

At the time of writing this blog, 2.9.1 is the most up-to-date version.

Then use the text editor of your choice and open the configuration file as follows:

vim ~/.parallelcluster/config

Replace the bolded section, if not yet filled in, by your own information and save the configuration file.

[aws]
aws_region_name = <AWS-REGION>

[global]
sanity_check = true
cluster_template = fds-smv-cluster
update_check = true

[vpc public]
vpc_id = vpc-<VPC-ID>
master_subnet_id = subnet-<SUBNET-ID>

[cluster fds-smv-cluster]
key_name = <Key-Name>
vpc_settings = public
compute_instance_type=c5n.18xlarge
master_instance_type=c5.xlarge
initial_queue_size = 0
max_queue_size = 100
scheduler=slurm
cluster_type = ondemand
s3_read_write_resource=arn:aws:s3:::fds-smv-bucket-unique*
placement_group = DYNAMIC
placement = compute
base_os = alinux2
tags = {"Name" : "fds-smv"}
disable_hyperthreading = true
fsx_settings = fsxshared
enable_efa = compute
dcv_settings = hpc-dcv

[dcv hpc-dcv]
enable = master

[fsx fsxshared]
shared_dir = /fsx
storage_capacity = 1200
import_path = s3://fds-smv-bucket-unique
imported_file_chunk_size = 1024
export_path = s3://fds-smv-bucket-unique

[aliases]
ssh = ssh {CFN_USER}@{MASTER_IP} {ARGS}

Let’s review the different sections of the configuration file and explain their role:

  • scheduler: Supported job schedulers are SGE, TORQUE, SLURM and AWS Batch. I have selected SLURM for this example.
  • cluster_type: You have the choice between On-Demand (ondemand) or Spot Instances (spot) for your compute instances. For On-Demand, instances are available for use without condition (if available in the Region selected) at a certain price per hour with the pay-as-you-go model, meaning that as soon as they are started, they are reserved for your utilization. For Spot Instances, you can take advantage of unused EC2 capacity in the AWS Cloud. Spot Instances are available at up to a 90% discount compared to On-Demand Instance prices. You can use Spot Instances for various stateless, fault-tolerant, or flexible applications such as HPC, for more information about Spot Instances, feel free to visit this webpage.
  • s3_read_write_resource: This parameter allows you to read and write objects directly on your S3 bucket from the cluster you created without additional permissions. It acts as a role for your cluster, allowing you access to your specified S3 bucket.  
  • placement_groupUse DYNAMIC to ensure that your instances are located as physically close to one another as possible. Close placement minimizes the latency between compute nodes and takes advantage of EFA’s low latency networking.
  • placement: By selecting compute you only enforce compute instances to be placed within the same placement group, leaving the head node placement free.
  • compute_instance_type:Select C5n.18xlarge because it is optimized for compute-intensive workloads and supports EFA for better scaling of HPC applications. Note that EFA is supported only for specific instance types. Please visit currently supported instances for more information.
  • master_instance_type:This can be any instance type. As traffic between head and compute nodes is relatively small, and the head node runs during the entire lifetime of the cluster, I use c5.xlarge because it is inexpensive and is a good fit for this use case.
  • initial_queue_size:You start with no compute instances after the HPC cluster is up. This means that any new job submitted has some delay (time for the nodes to be powered on) before they are seen as available by the job scheduler. This helps you pay for what you use and keeps costs as low as possible.
  • max_queue_size:Limit the maximum compute fleet to 100 instances. This allows you room to scale your jobs up to a large number of cores, while putting a limit on the number of compute nodes to help control costs.
  • base_osFor this blog, select Amazon Linux 2 (alinux2) as a base OS. Currently we also support Amazon Linux (alinux), CentOS 7 (centos7), Ubuntu 16.04 (ubuntu1604), and Ubuntu 18.04 (ubuntu1804) with EFA.
  • disable_hyperthreading: This setting turns off hyperthreading (true) on your cluster, which is the right configuration in this use case.[fsx fsxshared]: This section contains the settings to define your FSx for Lustre parallel file system, including the location where the shared directory is mounted, the storage capacity for the file system, the chunk size for files to be imported, and the location from which the data will be imported. You can read more about FSx for Lustre here.
  • enable_efa: Mark as (true) in this use case since it is a tightly coupled CFD simulation use case.
  • dcv_settings:With AWS ParallelCluster, you can use NICE DCV to support your remote visualization needs.
  • [dcv hpc-dcv]:This section contains the settings to define your remote visualization setup. You can read more about DCV with AWS ParallelCluster here.
  • import_path: This parameter enables all the objects on the S3 bucket available when creating the cluster to be seen directly from the FSx for Lustre file system. In this case, you are able to access the FDS-SMV package and the geometry under the /fsx mounted folder.
  • export_path: This parameter is useful for backup purposes using the Data Repository Tasks. I share more details about this in step 7 (optional).

Step 3: Create the HPC cluster and log in

Now, you can create the HPC cluster, named fds-smv. It takes around 10 minutes to complete and you can see the status changing (going through the different AWS CloudFormation template steps). At the end of creation, two IP addresses are prompted, a public IP and/or a private IP depending on your network choice.

pcluster create fds-smv
Creating stack named: parallelcluster-fds-smv
Status: parallelcluster-fds-smv - CREATE_COMPLETE                               
MasterPublicIP: X.X.X.X
ClusterUser: ec2-user
MasterPrivateIP: X.X.X.X

In order to log in, you must use the key you specified in the AWS ParallelCluster configuration file before creating the cluster:

pcluster ssh fds-smv -i <Key-Name>

You should now be logged in as an ec2-user (since we are using Amazon Linux 2 base OS).

Step 4: Install FDS-SMV package

Now that the HPC cluster using AWS ParallelCluster is set up, it is time to install the FDS-SMV package.  In the prior steps, you uploaded both the FDS-SMV package and the geometry to your S3 bucket. Since you enabled “import_path” to that bucket, they are already available on the Amazon FSx for Lustre storage under /fsx.

Run the script as follows and select /fsx/fds-smv as final target for installation:

cd /fsx
./FDS6.7.4_SMV6.7.14_lnx.sh
[[email protected] fsx]$ ./FDS6.7.4_SMV6.7.14_lnx.sh 

Installing FDS and Smokeview  for Linux

Options:
  1) Press <Enter> to begin installation [default]
  2) Type "extract" to copy the installation files to:
     FDS6.7.4_SMV6.7.14_lnx.tar.gz
 

FDS install options:
  Press 1 to install in /home/ec2-user/FDS/FDS6 [default]
  Press 2 to install in /opt/FDS/FDS6
  Press 3 to install in /usr/local/bin/FDS/FDS6
  Enter a directory path to install elsewhere
/fsx/fds-smv

It is important to source the following scripts as part of the installed packages to check if the installation is successful with the correct versions. Here is the correct output you should get:

[[email protected] ~]$ source /fsx/fds-smv/bin/SMV6VARS.sh 
[[email protected] ~]$ source /fsx/fds-smv/bin/FDS6VARS.sh 
[[email protected] ~]$ fds -version
FDS revision       : FDS6.7.4-0-gbfaa110-release
MPI library version: Intel(R) MPI Library 2019 Update 4 for Linux* OS

[[email protected] ~]$ smokeview -version

Smokeview  SMV6.7.14-0-g568693b-release - Mar  9 2020

Revision         : SMV6.7.14-0-g568693b-release
Revision Date    : Wed Mar 4 23:13:42 2020 -0500
Compilation Date : Mar  9 2020 16:31:22
Compiler         : Intel C/C++ 19.0.4.243
Checksum(SHA1)   : e801eace7c6597dc187739e51ba6f546bfde4e48
Platform         : LINUX64

Important notes:

The way FDS-SMV package has been installed is the default installation. Binaries are already compiled and Intel MPI libraries are embedded as part of the installation package. It is what one would call a self-contained application. For further builds and source codes, please visit this webpage.

Step 5: Running the fire dynamics simulation using FDS

Now that everything is installed, it is time to create the SLURM submission script. In this step, you take advantage of the FSx for Lustre File System, the compute-optimized instance, and the EFA network to maximize simulation performance.

cd /fsx/
vi fds-smv.sbatch

Here is the information you should specify in your submission script:

#!/bin/bash
#SBATCH --job-name=fds-smv-job
#SBATCH --ntasks=<Total number of MPI processes>
#SBATCH --ntasks-per-node=36
#SBATCH --output=%x_%j.out

source /fsx/fds-smv/bin/FDS6VARS.sh
source /fsx/fds-smv/bin/SMV6VARS.sh

module load intelmpi 

export OMP_NUM_THREADS=1
export I_MPI_PIN_DOMAIN=omp

cd /fsx/<results>

time mpirun -ppn 36 -np <Total number of MPI processes>  fds geometry.fds

Replace the <results> with the one of your choice, and don’t forget to copy the geometry.fds file in it before submitting your job. Once ready, save the file and submit the job using the following command:

sbatch fds-smv.sbatch 

If you decided to build your HPC cluster with c5n.18xlarge instances, the number of MPI processes per node is 36 since you turned off the hyperthreading, and that the instance has 36 physical cores. That is the meaning of the “#SBATCH --ntasks-per-node=36” line.

For any run exceeding 36 MPI processes, the job is split among multiple instances and take advantage of EFA for internode communication.

It is important to note that FDS only allows the number of MPI processes to be equal to the number of meshes in the input geometry (geometry.fds in this scenario). In case the number of meshes in the input geometry cannot be modified, OpenMP threads can be enabled and efficiently increase performance. Do this using up to four OpenMP Threads across four CPU cores attached to one MPI process.

Please read best practices provided by NIST for that topic on their user guide.

In order to take advantage of the distributed computing capability of FDS, it is mandatory to work first on the input geometry, and divide it into the appropriate number of meshes. It is also highly advised to evenly distribute the number of cells/elements per mesh across all meshes. This best practice optimizes the load balancing for each CPU core.

Step 6: Visualizing the results using NICE DCV and SMV

In order to visualize results, you must connect to the head node using NICE DCV streaming protocol.

As a reminder, the current instance type for the head node is a c5.xlarge, which is not a graphics-accelerated instance. For heavy and GPU intensive visualization, it is important to set up a more appropriate instance such as the G4 instance group.

Go back to your AWS Cloud9 instance, open a new terminal side by side to your session connected to your AWS HPC cluster, and enter the following command in the terminal:

pcluster dcv connect fds-smv -k <Key-Name>

You are provided a one-time HTTPS URL available for a short period of time in order to connect to your head node using the NICE DCV protocol.

Once connected, open the terminal inside your session and source the FDS-SMV scripts as before:

source /fsx/fds-smv/bin/FDS6VARS.sh
source /fsx/fds-smv/bin/SMV6VARS.sh

Navigate to your <results> folder and start SMV with your result.

I have selected one of the geometries named fire_whirl_pool.fds in the Examples folder, part of the default FDS-SMV installation package located here:

/fsx/fds-smv/Examples/Fires/fire_whirl_pool.fds

You can find other scenarios under the Examples folder to run some more use cases if you did not already choose your geometry.fds file.

Now you can run SMV and visualize your results:

smokeview fire_whirl_pool.smv

SMV (smokeview) takes as an input .smv extension files, please replace with your appropriate file. If you have already chosen your geometry.fds, then run the following command:

smokeview geometry.smv

The application then open as follows, and you can visualize the results. The following image is an output of the SOOT DENSITY of the 3D smoke.

fire simulation picture

Step 7 (optional): Back up your FDS-SMV results to an S3 bucket

First update the AWS CLI to its most recent version. It is compatible with 1.16.309 and above.

After running your FDS-SMV simulation, you can back up your data in /fsx to the S3 bucket you used earlier to upload the installation package, and input files using Data Repository Tasks.

Data Repository Tasks represent bulk operations between your Amazon FSx for Lustre file system and your S3 bucket. One of the jobs is to export your changed file system contents back to its linked S3 bucket.

Open your AWS Cloud9 terminal and exit the HPC head node cluster. Retrieve your Amazon FSx for Lustre ID using:

aws fsx describe-file-systems

It looks something like, fs-0533eebf1148fc8dd. Then create a backup of the data as follows:

aws fsx create-data-repository-task --file-system-id fs-0533eebf1148fc8dd --type EXPORT_TO_REPOSITORY --paths results --report Enabled=true,Scope=FAILED_FILES_ONLY,Format=REPORT_CSV_20191124,Path=s3://fds-smv-bucket-unique/

The following are definitions about the command parameters:

  • file-system-id: Your file system ID.
  • type EXPORT_TO_REPOSITORY: Exports the data back to the S3 bucket.
  • paths results: The directory you want to export to your S3 bucket. If you have more than one folder to back up, use a comma-separated notation such as: results1,results2,…
  • Format=REPORT_CSV_20191124: Note this is only the name the Amazon FSx Lustre supports. Please keep it the same.

You can check the backup status by running:

aws fsx describe-data-repository-tasks

Please wait for the copy to be achieved, once finished you should see on the Lifecycle line "Lifecycle": "SUCCEEDED"

Also go back to your S3 bucket, and your folder(s) should appear with all the files correctly uploaded from your /fsx folder you specified.

In terms of data management, Amazon S3 is an important service. You started by uploading installation package and geometry files from an external source, such as your laptop or an on-premises system. Then made these files available to the AWS HPC cluster under the Amazon FSx for Lustre file system and ran the simulation. Finally, you backed up the results from the Amazon FSx for Lustre to Amazon S3. You can also decide to download the results on Amazon S3 back to your local system if needed.

Step 8: Delete your AWS resources created during the deployment of this blog

After your run is completed and your data backed up successfully (Step 7 is optional) on your S3 bucket, you can then delete your cluster by using the following command in your Cloud9 terminal:

pcluster delete fds-smv

Warning:

If you run the command above all resources you created during this blog are automatically deleted beside your Cloud9 session and your data on your S3 bucket you created earlier.

Your S3 bucket still contains your input “geometry.fds” and your installation package “FDS6.7.4_SMV6.7.14_lnx.sh” files.

If you selected to back up your data during Step 7 (optional), then your S3 bucket also contains that data on top of the two previous files mentioned above.

If you want to delete your S3 bucket and all data mentioned above, go to your AWS Management Console, select S3 service then select your S3 bucket and hit delete on the top section.

If you want to terminate your Cloud9 session, go to your AWS Management Console, select Cloud9 service then select your session and hit delete on the top right section.

After performing these operations, there will be no more resources running on AWS related to this blog.

Conclusion

I showed that AWS ParallelCluster, Amazon FSx for Lustre, EFA, and Amazon S3 are key AWS services and features for HPC workloads such as CFD and in particular for FDS.

You can achieve simulation times of hours on AWS rather than days or weeks on a single workstation.

Please visit this workshop  for a more in-depth tutorial on running Fire Dynamics Simulation on AWS and our HPC dedicated homepage.

 

Custom logging with AWS Batch

Post Syndicated from Emma White original https://aws.amazon.com/blogs/compute/custom-logging-with-aws-batch/

This post was written by Christian Kniep, Senior Developer Advocate for HPC and AWS Batch. 

For HPC workloads, visibility into the logs of jobs is important to debug a job which failed, but also to have insights into a running job and track its trajectory to influence the configuration of the next job or terminate the job because it went off track.

With AWS Batch, customers are able to run batch workloads at scale, reliably and with ease as this managed serves takes out the undifferentiated heavy lifting. The customer can then focus on submitting jobs and getting work done. Customers told us that at a certain scale, the single logging driver available within AWS Batch made it hard to separate logs as they were all ending up in the same log group in Amazon CloudWatch.

With the new release of customer logging driver support, customers are now able to adjust how the job output is logged. Not only customize the Amazon CloudWatch setting, but enable the use of external logging frameworks such as splunk, fluentd, json-files, syslog, gelf, journald.

This allow AWS Batch jobs to use the existing systems they are accustom to, with fine-grained control of the log data for debugging and access control purposes.

In this blog, I show the benefits of custom logging with AWS Batch by adjusting the log targets for jobs. The first example will customize the Amazon CloudWatch log group, the second will log to Splunk, an external logging service.

Example setup

To showcase this new feature, I use the AWS Command Line Interface (CLI) to setup the following:

  1. IAM roles, policies, and profiles to grant access and permissions
  2. A compute environment to provide the compute resources to run jobs
  3. A job queue, which supervises the job execution and schedules jobs on a compute environment
  4. A job definition, which uses a simple job to demonstrate how the new configuration can be applied

Once those tasks are completed, I submit a job and send logs to a customized CloudWatch log-group and Splunk.

Prerequisite

To make things easier, I first set a couple of environment variables to have the information handy for later use. I use the following code to set up the environment variables.

# in case it is not already installed
sudo yum install -y jq 
export MD_URL=http://169.254.169.254/latest/meta-data
export IFACE=$(curl -s ${MD_URL}/network/interfaces/macs/)
export SUBNET_ID=$(curl -s ${MD_URL}/network/interfaces/macs/${IFACE}/subnet-id)
export VPC_ID=$(curl -s ${MD_URL}/network/interfaces/macs/${IFACE}/vpc-id)
export AWS_REGION=$(curl -s ${MD_URL}/placement/availability-zone | sed 's/[a-z]$//')
export AWS_ACCT_ID=$(curl -s ${MD_URL}/identity-credentials/ec2/info |jq -r .AccountId)
export AWS_SG_DEFAULT=$(aws ec2 describe-security-groups \
--filters Name=group-name,Values=default \
|jq -r '.SecurityGroups[0].GroupId')

IAM

When using the AWS Management Console, you must create IAM roles manually.

Trust Policies

IAM Roles are defined to be used by a certain service. In the simplest case, you want a role to be used by Amazon EC2 – the service that provides the compute capacity in the cloud. This defines which entity is able to use an IAM Role, called Trust Policy. To set up a trust policy for an IAM role, use the following code snippet.

cat > ec2-trust-policy.json << EOF
{
  "Version": "2012-10-17",
  "Statement": [{
    "Effect": "Allow",
    "Principal": {
      "Service": "ec2.amazonaws.com"
    },
    "Action": "sts:AssumeRole"
  }]
}
EOF

Instance role

With the IAM trust policy, I now create an ecsInstanceRole and attach the pre-defined policy AmazonEC2ContainerServiceforEC2Role. This allows an instance to interact with Amazon ECS.

aws iam create-role --role-name ecsInstanceRole \
 --assume-role-policy-document file://ec2-trust-policy.json
aws iam create-instance-profile --instance-profile-name ecsInstanceProfile
aws iam add-role-to-instance-profile \
    --instance-profile-name ecsInstanceProfile \
    --role-name ecsInstanceRole
aws iam attach-role-policy --role-name ecsInstanceRole \
 --policy-arn arn:aws:iam::aws:policy/service-role/AmazonEC2ContainerServiceforEC2Role

Service Role

The AWS Batch service uses a role to interact with different services. The trust relationship reflects that the AWS Batch service is going to assume this role.  You can set up this role with the following logic.

cat > svc-trust-policy.json << EOF
{
  "Version": "2012-10-17",
  "Statement": [{
    "Effect": "Allow",
    "Principal": {
      "Service": "batch.amazonaws.com"
    },
    "Action": "sts:AssumeRole"
  }]
}
EOF
aws iam create-role --role-name AWSBatchServiceRole \
--assume-role-policy-document file://svc-trust-policy.json
aws iam attach-role-policy --role-name AWSBatchServiceRole \
--policy-arn arn:aws:iam::aws:policy/service-role/AWSBatchServiceRole

In addition to dealing with Amazon ECS, the instance role can create and write to Amazon CloudWatch log groups, to control which log group names are used, a condition is attached.

While the compute environment is coming up, let us create and attach a policy to make a new log-group possible.

cat > policy.json << EOF
{
  "Version": "2012-10-17",
  "Statement": [{
    "Effect": "Allow",
    "Action": [
      "logs:CreateLogGroup"
    ],
    "Resource": "*",
    "Condition": {
      "StringEqualsIfExists": {
        "batch:LogDriver": ["awslogs"],
        "batch:AWSLogsGroup": ["/aws/batch/custom/*"]
      }
    }
  }]
}
EOF
aws iam create-policy --policy-name batch-awslog-policy \
    --policy-document file://policy.json
aws iam attach-role-policy --policy-arn arn:aws:iam::${AWS_ACCT_ID}:policy/batch-awslog-policy --role-name ecsInstanceRole

At this point, I created the IAM roles and policies so that the instance and service are able to interact with the AWS APIs, including trust-policies to define which services are meant to use them. EC2 for the ecsInstanceRole and the AWSBatchServiceRole for the AWS Batch service itself.

Compute environment

Now, I am going to create a compute environment, which is going to spin up an instance (one vCPU target) to run the example job in.

cat > compute-environment.json << EOF
{
  "computeEnvironmentName": "od-ce",
  "type": "MANAGED",
  "state": "ENABLED",
  "computeResources": {
    "type": "EC2",
    "allocationStrategy": "BEST_FIT_PROGRESSIVE",
    "minvCpus": 1,
    "maxvCpus": 8,
    "desiredvCpus": 1,
    "instanceTypes": ["m5.xlarge"],
    "subnets": ["${SUBNET_ID}"],
    "securityGroupIds": ["${AWS_SG_DEFAULT}"],
    "instanceRole": "arn:aws:iam::${AWS_ACCT_ID}:instance-profile/ecsInstanceRole",
    "tags": {"Name": "aws-batch-compute"},
    "bidPercentage": 0
  },
  "serviceRole": "arn:aws:iam::${AWS_ACCT_ID}:role/AWSBatchServiceRole"
}
EOF
aws batch create-compute-environment --cli-input-json file://compute-environment.json  

Once this section is complete, a compute environment is being spun up in the back. This will take a moment. You can use the following command to check on the status of the compute environment.

aws batch  describe-compute-environments

Once it is enabled and valid we can continue by setting up the job queue.

Job Queue

Now that I have a compute environment up and running, I will create a job queue which accepts job submissions and schedules the jobs on the compute environment.

cat > job-queue.json << EOF
{
  "jobQueueName": "jq",
  "state": "ENABLED",
  "priority": 1,
  "computeEnvironmentOrder": [{
    "order": 0,
    "computeEnvironment": "od-ce"
  }]
}
EOF
aws batch create-job-queue --cli-input-json file://job-queue.json

Job definition

The job definition is used as a template for jobs. This example runs a plain container and prints the environment variables. With the new release of AWS Batch, the logging driver awslogs now allows you to change the log group configuration within the job definition.

cat > job-definition.json << EOF
{
  "jobDefinitionName": "alpine-env",
  "type": "container",
  "containerProperties": {
  "image": "alpine",
  "vcpus": 1,
  "memory": 128,
  "command": ["env"],
  "readonlyRootFilesystem": true,
  "logConfiguration": {
    "logDriver": "awslogs",
    "options": { 
      "awslogs-region": "${AWS_REGION}", 
      "awslogs-group": "/aws/batch/custom/env-queue",
      "awslogs-create-group": "true"}
    }
  }
}
EOF
aws batch register-job-definition --cli-input-json file://job-definition.json

Job Submission

Using the above job definition, you can now submit a job.

aws batch submit-job \
  --job-name test-$(date +"%F_%H-%M-%S") \
  --job-queue arn:aws:batch:${AWS_REGION}:${AWS_ACCT_ID}:job-queue/jq \
  --job-definition arn:aws:batch:${AWS_REGION}:${AWS_ACCT_ID}:job-definition/alpine-env:1

Now, you can check the ‘Log Group’ in CloudWatch. Go to the CloudWatch console and find the ‘Log Group’ section on the left.

log groups in cloudwatch

Now, click on the log group defined above, and you should see the output of the job which allows for debugging if something within the container went wrong or processing logs and create alarms and reports.

cloudwatch log events

Splunk

Splunk is an established log engine for a broad set of customers. You can use the Docker container to set up a Splunk server quickly. More information can be found in the Splunk documentation. You need to configure the HTTP Event Collector, which provides you with a link and a token.

To send logs to Splunk, create an additional job-definition with the Splunk token and URL. Please adjust the splunk-url and splunk-token to match your Splunk setup.

{
  "jobDefinitionName": "alpine-splunk",
  "type": "container",
  "containerProperties": {
    "image": "alpine",
    "vcpus": 1,
    "memory": 128,
    "command": ["env"],
    "readonlyRootFilesystem": false,
    "logConfiguration": {
      "logDriver": "splunk",
      "options": {
        "splunk-url": "https://<splunk-url>",
        "splunk-token": "XXX-YYY-ZZZ"
      }
    }
  }
}

This forwards the logs to Splunk, as you can see in the following image.

forward to splunk

Conclusion

This blog post showed you how to apply custom logging to AWS Batch using the awslog and Splunk logging driver. While these are two important logging drivers, please head over to the documentation to find out about fluentd, syslog, json-file and other drivers to find the best driver to match your current logging infrastructure.

 

OpenFOAM on Amazon EC2 C6g Arm-based Graviton2 Instances – up to 37% better price/performance

Post Syndicated from Emma White original https://aws.amazon.com/blogs/compute/c6g-openfoam-better-price-performance/

This post is contributed to by: Neil Ashton (AWS) – Principal CFD Specialist SA, Karthik Raman (AWS) – Senior HPC Specialist SA, Oliver Perks (Arm) – Principal HPC Engineer

Over the past 30 years, Computational Fluid Dynamics (CFD) has become a key part of many engineering design processes. From aircraft design to modelling the blood flow inside our bodies, the ability to understand the behavior of fluids has enabled countless innovations and improved the time to market for many products. Typically, the modelling of the underlying fluid flow equations remains the major limit to accuracy; however, the scale and availability of HPC resources is arguably the main bottleneck for the typical CFD user. The need for more HPC resources has accelerated over recent years due to the move to higher-fidelity approaches, in addition to the growing use of optimization techniques and machine learning driven workflows. These workflows require the simulation of thousands or even millions of small jobs to explore a design space or train an improved turbulence model. AWS enables engineers to overcome this HPC bottleneck by allowing the quick deployment of a supercomputer. This can scale to practically any size and with the right hardware to match your needs, for example, CPUs, GPUs.

CFD users have a broad choice for instance types on AWS.  Typically the majority of users select the compute-optimized Amazon EC2 C5 family of instances. These offer a better price/performance over the Amazon EC2 M5 family of instances for the majority of CFD cases due to a need for memory bandwidth over total memory needs. This broad set of available instances allows users to easily incorporate Amazon EC2 M5 or Amazon EC2 R5 instances when higher memory is needed, for example, for pre or post-processing.

There are typically two competing needs for the typical CFD user: turn-around time and cost. Depending on the underlying numerical methods, the speed of a simulation does not linearly increase with an increased core count. At some point, the cost of network communication means a non-linear increase. This means the cost of increasing the number of cores is not matched by a similar decrease in runtime. Therefore, the user typically faces a choice between the fastest possible runtime or the most efficient cost for the simulation.

The recently announced Amazon EC2 C6g instances powered by AWS-designed Arm-based Graviton2 processors bring in a new instance option. So, the logical question for any AWS customer is whether this is suitable for their CFD workload. In this blog, we provide benchmarking results for the open-source CFD code OpenFOAM that can be easily compiled and run with the Amazon EC2 C6g instance. We demonstrate that you can achieve 37% better price/performance for both single-node and multi-node cases on more than 200M cells on thousands of cores.

 

AWS Graviton2-based Instances

AWS Graviton2 processors are custom built by AWS using the 64-bit Arm Neoverse cores to deliver great price performance for your cloud workloads running in Amazon EC2. The Graviton2-powered EC2 instances were announced at re:Invent 2019. The new Amazon EC2 C6g compute-optimized instances are available as part of the sixth generation EC2 offering. These instances are powered by AWS Graviton2 processors that use 64-bit Arm Neoverse N1 cores and custom silicon designed by AWS, built using advanced 7-nanometer manufacturing technology.

Graviton2 processor cores feature 64 KB L1 cache and 1 MB L2 cache, includes dual SIMD units to double the floating-point performance versus first-generation Graviton processors.  This targets high performance computing workloads and also support int8/fp16 number formats to accelerate machine learning inference workloads. Every vCPU is a physical core (that is, no simultaneous multithreading – SMT). The instances are single socket, so there are no NUMA concerns since every core sees the same path to memory and other cores. There are 8x DDR4 memory channels running at 3200 MT/s delivering over 200 GB/s of peak memory bandwidth.

The compute-optimized Amazon EC2 C6g instances are available in 9 sizes 1, 2, 4, 8, 16, 32, 48, and 64 vCPUs, or as bare metal instances. The compute-optimized Amazon EC2 C6g instances support configurations with up to 128 GB of DDR4 memory or 2 GB/ CPU. The instances support up-to 25 Gbps of network bandwidth and 19 Gbps of Amazon EBS bandwidth. These instances are powered by the AWS Nitro System, a combination of dedicated hardware and lightweight hypervisor.

Benchmarking Results

Single-Node Performance

In order to demonstrate the performance of the Amazon EC2 C6g instance, we have taken two test-cases that reflect the range of cases that CFD users run. For the first case, we simulate a 4 million cell motorbike case that is part of the standard OpenFOAM tutorial suite. It might seem that these cases do not represent the complexity of some production CFD workflows. However, these were chosen to allow readers to easily replicate the results. This case uses the OpenFOAM simpleFoam solver, which uses a steady-state incompressible segregated SIMPLE approach combined with a multigrid method for the linear solver. The k- SST model is used with second order upwinding for both momentum and turbulent equations. The scotch domain decomposition approach is used to ensure best parallel efficiency. However, for this first case the focus is on the single-node performance, given the low cell count.

Figures 1 and 2 show the time taken to run 5000 iterations as well as the cost for that simulation based upon on-demand pricing. It can be seen that the C6g.16xlarge offers a clear cost per simulation benefit over both the C5n.18xlarge (37%), and C5.24xlarge (29%) instances. There is however an increase in total run time compared to the C5.24xlarge (33%) and C5n.18xlarge (13%) which may mean that the C5.24xlarge or C5n.18xlarge will still be preferred for those prioritizing turn-around time.

 

Figure 1 - Single-node OpenFOAM cost per simulation

Figure 1 – Single-node OpenFOAM cost per simulation

Figure 2 - Single-node OpenFOAM run-time performance

Figure 2 – Single-node OpenFOAM run-time performance

Single-Node Performance Profile

Profiling performed on the simpleFoam solver (used in the motorbike example) showed that it is limited by the sustained memory bandwidth on the instance. We therefore measured the peak sustainable memory bandwidth performance (Figure 3) on the three instance types (C5n.18xlarge, C6g.16xlarge, C5.24xlarge) using STREAM Triad. Then calculated the corresponding bandwidth efficiencies, which are shown in Figure 4.

 

Figure 3 - Peak memory bandwidth using STREAM

Figure 3 – Peak memory bandwidth using STREAM

Figure 4 - Memory bandwidth efficiency using STREAM Triad

Figure 4 – Memory bandwidth efficiency using STREAM Triad

The C5n.18xlarge and C5.24xlarge instances have higher peak memory bandwidth due to more memory channels (dual socket) compared to C6g.16xlarge instance. However, the Graviton2 based processor can achieve a higher percentage of peak (~86% bandwidth efficiency) compared to the x86 processors (~79% on C5.24xlarge). This in addition to the cost benefits (47% vs. C5.24xlarge, 44% vs. C5n.18xlarge) is the reason for the improved cost per simulation with the C6g.16xlarge instance as shown in Figure 1.

Multi-Node Performance

A further, larger mesh of 222 million cells was studied on the same motorbike geometry with the same underlying numerical methods. We did this to assess whether the same performance trends continue for much larger multi-node simulations. Figure 5 shows the number of iterations possible per minute for different number of cores for the C5.24xlarge, C5n.18xlarge, and C6g.16xlarge instances. You can see in Figure 5 that the scaling of the C5n.18xlarge instances is much better than C5.24xlarge and C6g.16xlarge instances due to the use of the Elastic Fabric Adapter (EFA), which enables optimum scaling thanks to low latency, high-bandwidth  communication. However, while the scaling is much improved for C5n.18xlarge instances with EFA, the cost per simulation (Figure 6) shows up to 37% better price/performance for the C6g.16xlarge instances over the C5n.18xlarge and C5n.24xlarge instances.

Figure 5 - Multi-node OpenFOAM Scaling Performance

Figure 5 – Multi-node OpenFOAM Scaling Performance

Figure 6 - Multi-node OpenFOAM cost per simulation

Figure 6 – Multi-node OpenFOAM cost per simulation

Conclusions

This blog has demonstrated the ease with which both single-node and multi-node CFD simulations can be ported to the new Amazon EC2 C6g instances powered by Arm-based AWS Graviton2 processors. With no code modifications, we have been able to port an x86 workload to the new C6g instances and achieve competitive single node and multi-node performance with up to 37% better price/performance over other C family instances. We encourage you to test out your applications for yourself and reach out to us if you have any questions!

OpenFOAM compilation on AWS Graviton2 Instances

OpenFOAM v1912 builds on modern Arm-based systems, with support in-place for both the GCC compiler and the Arm Compiler for Linux (ACfL). No source code modifications are required to obtain a working OpenFOAM binary. The process for building OpenFOAM on Arm based systems is equivalent to that on x86 systems, by specifying the desired compiler in OpenFOAM bashrc file.

The only external dependencies are a working MPI library. For this experiment, we used Open MPI version 4.0.3, built with UCX 1.8 and GCC 9.2. We note that we have not applied any additional hardware-specific performance optimizations. Although deploying OpenFOAM on the Amazon EC2 C6g instances was trivial, there is documentation on the Arm community GitLab pages. This documentation covers the installation of different OpenFOAM versions and with different compilers.

For the C5n.18xlarge and C5.24xlarge simulations, OpenFOAM v1912 was compiled using GCC 8.2 and IntelMPI 2019.7. For all simulations Amazon Linux 2 was the operating system. The HPC environment for all tests was created using AWS ParallelCluster, which will soon include official support for AWS’ latest Graviton2 instances, including C6g. You can stay up to date on the latest information and releases of AWS ParallelCluster on GitHub.

Instance Details:

ModelvCPUMemory (GiB)Instance Storage (GiB)Network Bandwidth (Gbps)EBS Bandwidth (Mbps)
C5n.18xlarge72192EBS-only10019,000
C6g.16xlarge64128EBS-only2519,000
C5.24xlarge96192EBS-only2519,000

 

Enabling job accounting for HPC with AWS ParallelCluster and Amazon RDS

Post Syndicated from Emma White original https://aws.amazon.com/blogs/compute/enabling-job-accounting-for-hpc-with-aws-parallelcluster-and-amazon-rds/

This post is written by Nicola Venuti, HPC Specialist SA, and contributed to by Rex Chen, Software Development Engineer.

Introduction

Accounting, reporting, and advanced analytics used for data-driven planning and decision making are key areas of focus for High Performance Computing (HPC) Administrators. In the cloud, these areas are more relevant to the costs of the services, which directly impact budgeting and forecasting of expenses. With the growth of new HPC services that perform analyses and corrective actions, you can better optimize for performance, which reduces cost.

Solution Overview

In this blog post, we walk through an easy way to collect accounting information for evert job and step executed in a cluster with job scheduling. This post uses a new feature in the latest version (2.6.0) of AWS ParallelCluster, which makes this process easier than before, and Slurm.  Accounting records are saved into a relational database for both currently executing jobs and jobs, which have already terminated.

Prerequisites

This tutorial assumes you already have a cluster in AWS ParallelCluster. If  you don’t, refer to the AWS ParallelCluster documentation, a getting started blog post, or a how-to blog post.

Solution

Choose your architecture

There are two common architectures to save job accounting information into a database:

  1. Installing and directly managing a DBMS in the master node of your cluster (or in an additional EC2 instance dedicated to it)
  2. Using a fully managed service like Amazon Relational Database Service (RDS)

While the first option might appear to be the most economical solution, it requires heavy lifting. You must install and manage the database, which is not a core part of running your HPC workloads.  Alternatively, Amazon RDS reduces this burden of installing updates, managing security patches, and properly allocating resources.  Additionally, Amazon RDS Free Tier can get you started with a managed database service in the cloud for free. Refer to the hyperlink for a list of free resources.

Amazon RDS is my preferred choice, and the following sections implement this architecture. Bear in mind, however, that the settings and the customizations required in the AWS ParallelCluster environment are the same, regardless of which architecture you prefer.

 

Set up and configure your database

Now, with your architecture determined, let’s configure it.  First, go to Amazon RDS’ console.  Select the same Region where your AWS ParallelCluster is deployed, then click on Create Database.

There are two database instances to consider: Amazon Aurora and MySQL.

Amazon Aurora has many benefits compared to MySQL. However, in this blog post, I use MySQL to show how to build your HPC accounting database within the Free-tier offering.

The following steps are the same regardless of your database choice. So, if you’re interested in one of the many features that differentiate Amazon Aurora from MySQL, feel free to use. Check out Amazon Aurora’s landing page to learn more about its benefits, such as its faster performance and cost effectiveness.

To configure your database, you must complete the following steps:

  1. Name the database
  2. Establish credential settings
  3. Select the DB instance size
  4. Identify storage type
  5. Allocate amount of storage

The following images show the settings that I chose for storage options and the “Free tier” template.  Feel free to change it accordingly to the scope and the usage you expect.

Make sure you select the corresponding VPC to wherever your “compute fleet” is deployed by AWS ParallelCluster, and wherever the Security Group of your compute fleet is selected.  You can access information for your “compute fleet” in your AWS ParallelCluster config file. The Security Group should look something like this: “parallelcluster-XXX-ComputeSecurityGroup-XYZ”.

At this stage, you can click on Create database and wait until the Database status moves from the Creating to Available in the Amazon RDS Dashboard.

The last step for this section is to grant privileges on the database.

  1. Connect to your database. Use the master node of your AWS ParallelCluster as a client.
  2. Install the MySQL client by running sudo yum install mysql on AmazonLinux and CentOS and sudo apt-get install mysql-client on Ubuntu.
  3. Connect to your MySQL RDS database using the following code: mysql --host=<your_rds_endpoint> --port=3306 -u admin -p The following screenshot shows how to find your RDS endpoint and port.

 

4. Run GRANT ALL ON `%`.* TO [email protected]`%`; to grant the required privileges.

The following code demonstrates these steps together:

[[email protected]]$ mysql --host=parallelcluster-accounting.c68dmmc6ycyr.us-east-1.rds.amazonaws.com --port=3306 -u admin -p
Enter password:
Welcome to the MySQL monitor.  Commands end with ; or \g.
Your MySQL connection id is 251
Server version: 8.0.16 Source distribution

Copyright (c) 2000, 2018, Oracle and/or its affiliates. All rights reserved.

Oracle is a registered trademark of Oracle Corporation and/or its
affiliates. Other names may be trademarks of their respective
owners.

Type 'help;' or '\h' for help. Type '\c' to clear the current input statement.

mysql> GRANT ALL ON `%`.* TO [email protected]`%`;

Note: typically this command is run as GRANT ALL ON *.* TO 'admin'@'%';   With Amazon RDS , for security reasons, this is not possible as the master account does not have access to the MySQL database. Using *.*  triggers an error. To work around this, I use the _ and % wildcards that are permitted. To look at the actual grants, you can run the following: SHOW GRANTS;

Enable Slurm Database logging

Now, your database is fully configured. The next step is to enable Slurm as a workload manager.

A few steps must occur to let Slurm log its job accounting information on an external database. The following code demonstrates the steps you must make.

  1. Add the DB configuration file, slurmdbd.conf after /opt/slurm/etc/ 
  2. Slurm’s slurm.conf file requires a few modifications. These changes are noted after the following code examples.

 

Note: You do not need to configure each and every compute node because AWS ParallelCluster installs Slurm in a shared directory. All of these nodes share this directory, and, thus the same configuration files with the master node of your cluster.

Below, you can find two example configuration files that you can use just by modifying a few parameters accordingly to your setup.

For more information about all the possible settings of configuration parameters, please refer to the official Slurm documentation, and in particular to the accounting section.

Add the DB configuration file

#
## Sample /opt/slurm/etc/slurmdbd.conf
#
ArchiveEvents=yes
ArchiveJobs=yes
ArchiveResvs=yes
ArchiveSteps=no
ArchiveSuspend=no
ArchiveTXN=no
ArchiveUsage=no
AuthType=auth/munge
DbdHost=ip-10-0-16-243  #YOUR_MASTER_IP_ADDRESS_OR_NAME
DbdPort=6819
DebugLevel=info
PurgeEventAfter=1month
PurgeJobAfter=12month
PurgeResvAfter=1month
PurgeStepAfter=1month
PurgeSuspendAfter=1month
PurgeTXNAfter=12month
PurgeUsageAfter=24month
SlurmUser=slurm
LogFile=/var/log/slurmdbd.log
PidFile=/var/run/slurmdbd.pid
StorageType=accounting_storage/mysql
StorageUser=admin
StoragePass=password
StorageHost=parallelcluster-accounting.c68dmmc6ycyr.us-east-1.rds.amazonaws.com # Endpoint from RDS console
StoragePort=3306                                                                # Port from RDS console

See below for key values that you should plug into the example configuration file:

  • DbdHost: the name of the machine where the Slurm Database Daemon is executed. This is typically the master node of your AWS ParallelCluster. You can run hostname -s on your master node to get this value.
  • DbdPort: The port number that the Slurm Database Daemon (slurmdbd) listens to for work. 6819 is the default value.
  • StorageUser: Define the user name used to connect to the database. This has been defined during the Amazon RDS configuration as shown in the second step of the previous section.
  • StoragePass: Define the password used to gain access to the database. Defined as the user name during the Amazon RDS configuration.
  • StorageHost: Define the name of the host running the database. You can find this value in the Amazon RDS console, under “Connectivity & security”.
  • StoragePort: Define the port on which the database is listening. You can find this value in the Amazon RDS console, under “Connectivity & security”. (see the screenshot below for more information).

Modify the file

Add the following lines at the end of the slurm configuration file:

#
## /opt/slurm/etc/slurm.conf
#
# ACCOUNTING
JobAcctGatherType=jobacct_gather/linux
JobAcctGatherFrequency=30
#
AccountingStorageType=accounting_storage/slurmdbd
AccountingStorageHost=ip-10-0-16-243
AccountingStorageUser=admin
AccountingStoragePort=6819

Modify the following:

  • AccountingStorageHost: The hostname or address of the host where SlurmDBD executes. In our case this is again the master node of our AWS ParallelCluster, you can get this value by running hostname -s again.
  • AccountingStoragePort: The network port that SlurmDBD accepts communication on. It must be the same as DbdPort specified in /opt/slurm/etc/slurmdbd.conf
  • AccountingStorageUser: it must be the same as in /opt/slurm/etc/slurmdbd.conf (specified in the “Credential Settings” of your Amazon RDS database).

 

Restart the Slurm service and start the SlurmDB demon on the master node

 

Depending on the operating system you are running, this would look like:

  • Amazon Linux / Amazon Linux 2
[[email protected]]$ sudo /etc/init.d/slurm restart                                                                                                                                                                                                                                             
stopping slurmctld:                                        [  OK  ]
slurmctld is stopped
slurmctld is stopped
starting slurmctld:                                        [  OK  ]
[[email protected]]$ 
[[email protected]]$ sudo /opt/slurm/sbin/slurmdbd
  • CentOS7 and Ubuntu 16/18
[[email protected] ~]$ sudo /opt/slurm/sbin/slurmdbd
[[email protected] ~]$ sudo systemctl restart slurmctld

Note: even if you have jobs running, restarting the daemons will not affect them.

Check to see if your cluster is already in the Slurm Database:

/opt/slurm/bin/sacctmgr list cluster

And if it is not (see below):

[[email protected]]$ /opt/slurm/bin/sacctmgr list cluster
   Cluster     ControlHost  ControlPort   RPC     Share GrpJobs       GrpTRES GrpSubmit MaxJobs       MaxTRES MaxSubmit     MaxWall                  QOS   Def QOS 
---------- --------------- ------------ ----- --------- ------- ------------- --------- ------- ------------- --------- ----------- -------------------- --------- 
[[email protected]]$ 

You can add it as follows:

sudo /opt/slurm/bin/sacctmgr add cluster parallelcluster

You should now see something like the following:

[[email protected]]$ /opt/slurm/bin/sacctmgr list cluster
   Cluster     ControlHost  ControlPort   RPC     Share GrpJobs       GrpTRES GrpSubmit MaxJobs       MaxTRES MaxSubmit     MaxWall                  QOS   Def QOS 
---------- --------------- ------------ ----- --------- ------- ------------- --------- ------- ------------- --------- ----------- -------------------- --------- 
parallelc+     10.0.16.243         6817  8704         1                                                                                           normal           
[[email protected]]$ 

At this stage, you should be all set with your AWS ParallelCluster accounting configured to be stored in the Amazon RDS database.

Replicate the process on multiple clusters

The same database instance can be easily used for multiple clusters to log its accounting data in. To do this, repeat the last configuration step for your clusters built using AWS ParallelCluster that you want to share the same database.

The additional steps to follow are:

  • Ensure that all the clusters are in the same VPC (or, if you prefer to use multiple VPCs, you can choose to set up VPC-Peering)
  • Add the SecurityGroup of your new compute fleets (“parallelcluster-XXX-ComputeSecurityGroup-XYZ”) to your RDS database
  • Change the cluster name parameter at the very top of the file. This is in addition to the slurm configuration file ( /opt/slurm/etc/slurm.conf) editing explained prior.  By default, your cluster is called “parallelcluster.” You may want to change that to clearly identify other clusters using the same database. For instance: ClusterName=parallelcluster2

Once these additional steps are complete, you can run /opt/slurm/bin/sacctmgr list cluster  again. Now, you should see two (or multiple) clusters:

[[email protected] ]# /opt/slurm/bin/sacctmgr list cluster
   Cluster     ControlHost  ControlPort   RPC     Share GrpJobs       GrpTRES GrpSubmit MaxJobs       MaxTRES MaxSubmit     MaxWall                  QOS   Def QOS 
---------- --------------- ------------ ----- --------- ------- ------------- --------- ------- ------------- --------- ----------- -------------------- --------- 
parallelc+                            0  8704         1                                                                                           normal           
parallelc+     10.0.16.129         6817  8704         1                                                                                           normal           
[[email protected] ]#   

If you want to see the full name of your clusters, run the following:

[[email protected] ]# /opt/slurm/bin/sacctmgr list cluster format=cluster%30
                       Cluster 
------------------------------ 
               parallelcluster 
              parallelcluster2 
[[email protected] ]# 

Note: If you check the Slurm logs (under /var/log/slurm*), you may see this error:

error: Database settings not recommended values: innodb_buffer_pool_size innodb_lock_wait_timeout

This error refers to default parameters that Amazon RDS sets for you on your MySQL database. You can change them by setting new “group parameters” as explained in the official documentation and in this support article. Please also note that the innodb_buffer_pool_size is related to the amount of memory available on your instance, so you may want to use a different instance type with higher memory to avoid this warning.

Run your first job and check the accounting

Now that the application is installed and configured, you can test it! Submit a job to Slurm, query your database, and check your job accounting information.

If you are using a brand new cluster, test it with a simple hostname job as follows:

[[email protected]]$ sbatch -N2 <<EOF
> #!/bin/sh
> srun hostname |sort
> srun sleep 10
> EOF
Submitted batch job 31
[[email protected]]$

Immediately after you have submitted the job, you should see it with a state of “pending”:

[[email protected]]$ sacct
       JobID    JobName  Partition    Account  AllocCPUS      State ExitCode 
------------ ---------- ---------- ---------- ---------- ---------- -------- 
38               sbatch    compute                     2    PENDING      0:0 
[[email protected]]$ 

And, after a while the job should be “completed”:

[[email protected]]$ sacct
       JobID    JobName  Partition    Account  AllocCPUS      State ExitCode 
------------ ---------- ---------- ---------- ---------- ---------- -------- 
38               sbatch    compute                     2  COMPLETED      0:0 
38.batch           batch                                1  COMPLETED      0:0 
38.0           hostname                                2  COMPLETED      0:0 
[[email protected]]$

Now that you know your cluster works, you can build complex queries using sacct. See few examples below, and refer to the official documentation for more details:

[[email protected]]$ sacct --format=jobid,elapsed,ncpus,ntasks,state
       JobID    Elapsed      NCPUS   NTasks      State 
------------ ---------- ---------- -------- ---------- 
38             00:00:00          2           COMPLETED 
38.batch       00:00:00          1        1  COMPLETED 
38.0           00:00:00          2        2  COMPLETED 
39             00:00:00          2           COMPLETED 
39.batch       00:00:00          1        1  COMPLETED 
39.0           00:00:00          2        2  COMPLETED 
40             00:00:10          2           COMPLETED 
40.batch       00:00:10          1        1  COMPLETED 
40.0           00:00:00          2        2  COMPLETED 
40.1           00:00:10          2        2  COMPLETED 
[[email protected]]$ sacct --allocations
       JobID    JobName  Partition    Account  AllocCPUS      State ExitCode 
------------ ---------- ---------- ---------- ---------- ---------- -------- 
38               sbatch    compute                     2  COMPLETED      0:0 
39               sbatch    compute                     2  COMPLETED      0:0 
40               sbatch    compute                     2  COMPLETED      0:0 
[[email protected]]$ sacct -S2020-02-17 -E2020-02-20 -X -ojobid,start,end,state
       JobID               Start                 End      State 
------------ ------------------- ------------------- ---------- 
38           2020-02-17T12:25:12 2020-02-17T12:25:12  COMPLETED 
39           2020-02-17T12:25:12 2020-02-17T12:25:12  COMPLETED 
40           2020-02-17T12:27:59 2020-02-17T12:28:09  COMPLETED 
[[email protected]]$ 

If you have configured your cluster(s) for multiple users, you may want to look at the accounting info for all of these.  If you want to configure your clusters with multiple users, follow this blog post. It demonstrates how to configure AWS ParallelCluster with AWS Directory Services to create a multiuser, POSIX-compliant system with centralized authentication.

Each and every user can only look at his own accounting data. However, Slurm admins (or root) can see accounting info for every user. The following code shows accounting data coming from two clusters (parallelcluster and parallelcluster2) and from two users (ec2-user and nicola):

[[email protected] ~]# sacct -S 2020-01-01 --clusters=parallelcluster,parallelcluster2 --format=jobid,elapsed,ncpus,ntasks,state,user,cluster%20                                                                                                                                                        
       JobID    Elapsed      NCPUS   NTasks      State      User              Cluster 
------------ ---------- ---------- -------- ---------- --------- -------------------- 
36             00:00:00          2              FAILED  ec2-user      parallelcluster 
36.batch       00:00:00          1        1     FAILED                parallelcluster 
37             00:00:00          2           COMPLETED  ec2-user      parallelcluster 
37.batch       00:00:00          1        1  COMPLETED                parallelcluster 
37.0           00:00:00          2        2  COMPLETED                parallelcluster 
38             00:00:00          2           COMPLETED  ec2-user      parallelcluster 
38.batch       00:00:00          1        1  COMPLETED                parallelcluster 
38.0           00:00:00          2        2  COMPLETED                parallelcluster 
39             00:00:00          2           COMPLETED  ec2-user      parallelcluster 
39.batch       00:00:00          1        1  COMPLETED                parallelcluster 
39.0           00:00:00          2        2  COMPLETED                parallelcluster 
40             00:00:10          2           COMPLETED  ec2-user      parallelcluster 
40.batch       00:00:10          1        1  COMPLETED                parallelcluster 
40.0           00:00:00          2        2  COMPLETED                parallelcluster 
40.1           00:00:10          2        2  COMPLETED                parallelcluster 
41             00:00:29        144              FAILED  ec2-user      parallelcluster 
41.batch       00:00:29         36        1     FAILED                parallelcluster 
41.0           00:00:00        144      144  COMPLETED                parallelcluster 
41.1           00:00:01        144      144  COMPLETED                parallelcluster 
41.2           00:00:00          3        3  COMPLETED                parallelcluster 
41.3           00:00:00          3        3  COMPLETED                parallelcluster 
42             01:22:03        144           COMPLETED  ec2-user      parallelcluster 
42.batch       01:22:03         36        1  COMPLETED                parallelcluster 
42.0           00:00:01        144      144  COMPLETED                parallelcluster 
42.1           00:00:00        144      144  COMPLETED                parallelcluster 
42.2           00:00:39          3        3  COMPLETED                parallelcluster 
42.3           00:34:55          3        3  COMPLETED                parallelcluster 
43             00:00:11          2           COMPLETED  ec2-user      parallelcluster 
43.batch       00:00:11          1        1  COMPLETED                parallelcluster 
43.0           00:00:01          2        2  COMPLETED                parallelcluster 
43.1           00:00:10          2        2  COMPLETED                parallelcluster 
44             00:00:11          2           COMPLETED    nicola      parallelcluster 
44.batch       00:00:11          1        1  COMPLETED                parallelcluster 
44.0           00:00:01          2        2  COMPLETED                parallelcluster 
44.1           00:00:10          2        2  COMPLETED                parallelcluster 
4              00:00:10          2           COMPLETED    nicola     parallelcluster2 
4.batch        00:00:10          1        1  COMPLETED               parallelcluster2 
4.0            00:00:00          2        2  COMPLETED               parallelcluster2 
4.1            00:00:10          2        2  COMPLETED               parallelcluster2 
[[email protected] ~]# 

You can also directly query your database, and look at the accounting information stored in it or link your preferred BI tool to get insights from your HPC cluster. To do so, run the following code:

[[email protected]]$ mysql --host=parallelcluster-accounting.c68dmmc6ycyr.us-east-1.rds.amazonaws.com --port=3306 -u admin -p
Enter password: 
Welcome to the MySQL monitor.  Commands end with ; or \g.
Your MySQL connection id is 48
Server version: 8.0.16 Source distribution

Copyright (c) 2000, 2018, Oracle and/or its affiliates. All rights reserved.

Oracle is a registered trademark of Oracle Corporation and/or its
affiliates. Other names may be trademarks of their respective
owners.

Type 'help;' or '\h' for help. Type '\c' to clear the current input statement.

mysql> show databases;
+--------------------+
| Database           |
+--------------------+
| information_schema |
| mysql              |
| performance_schema |
| slurm_acct_db      |
+--------------------+
4 rows in set (0.00 sec)

mysql> use slurm_acct_db;
Reading table information for completion of table and column names
You can turn off this feature to get a quicker startup with -A

Database changed
mysql> show tables;
+-----------------------------------------+
| Tables_in_slurm_acct_db                 |
+-----------------------------------------+
| acct_coord_table                        |
| acct_table                              |
| clus_res_table                          |
| cluster_table                           |
| convert_version_table                   |
| federation_table                        |
| parallelcluster_assoc_table             |
| parallelcluster_assoc_usage_day_table   |
| parallelcluster_assoc_usage_hour_table  |
| parallelcluster_assoc_usage_month_table |
| parallelcluster_event_table             |
| parallelcluster_job_table               |
| parallelcluster_last_ran_table          |
| parallelcluster_resv_table              |
| parallelcluster_step_table              |
| parallelcluster_suspend_table           |
| parallelcluster_usage_day_table         |
| parallelcluster_usage_hour_table        |
| parallelcluster_usage_month_table       |
| parallelcluster_wckey_table             |
| parallelcluster_wckey_usage_day_table   |
| parallelcluster_wckey_usage_hour_table  |
| parallelcluster_wckey_usage_month_table |
| qos_table                               |
| res_table                               |
| table_defs_table                        |
| tres_table                              |
| txn_table                               |
| user_table                              |
+-----------------------------------------+
29 rows in set (0.00 sec)

mysql>

Conclusion

You’re finally all set! In this blog post you set up a database using Amazon RDS, configured AWS ParallelCluster and Slurm to enable job accounting with your database, and learned how to query your job accounting history from your database using the sacct command or by running SQL queries.

Deriving insights for your HPC workloads doesn’t end when your workloads finish running. Now, you can better understand and optimize your usage patterns and generate ideas about how to wring more price-performance out of your HPC clusters on AWS. For retrospective analysis, you can easily understand whether specific jobs, projects, or users are responsible for driving your HPC usage on AWS.  For forward-looking analysis, you can better forecast future usage to set budgets with appropriate insight into your costs and your resource consumption.

You can also use these accounting systems to identify users who may require additional training on how to make the most of cloud resources on AWS. Finally, together with your spending patterns, you can better capture and explain the return on investment from all of the valuable HPC work you do. And, this gives you the raw data to analyze how you can get even more value and price-performance out of the work you’re doing on AWS.

 

 

 

 

Running Simcenter STAR-CCM+ on AWS with AWS ParallelCluster, Elastic Fabric Adapter and Amazon FSx for Lustre

Post Syndicated from Bala Thekkedath original https://aws.amazon.com/blogs/compute/running-simcenter-star-ccm-on-aws/

This post is contributed by Anh Tran, Senior HPC Specialist Solution Architect, AWS

Introduction

AWS recently introduced many HPC services that boost the performance and scalability of Computational Fluid Dynamics (CFD) workloads on AWS. These services include: Amazon FSx for Lustre, Elastic Fabric Adapter (EFA), and AWS ParallelCluster 2.5.1. In this technical post, I walk through these three services. Additionally, I outline an example of using AWS ParallelCluster to set up an HPC system with EFA and Amazon FSx Lustre to run a CFD workload.  The CFD application that you will set up during this blog post is Simcenter STAR-CCM+  – the predominant CFD application from Siemens.

Service and solution overview

Services

This blog primarily uses three services – Amazon FSx for Lustre, EFA, and AWS ParallelCluster. Let’s dig into each of these services before reviewing the solution.

Amazon FSx for Lustre

In December 2018, AWS released Amazon FSx for Lustre. This is a fully managed, high-performance file system, optimized for fast processing workloads, like HPC. Amazon FSx for Lustre allows users to access and alter data from either Amazon S3 or on-premises seamlessly and exceptionally fast. For example, you can launch and run a file system that provides low latency access to your data. Additionally, you can read and write data at speeds of up to hundreds of gigabytes per second of throughput, and millions of IOPS. This speed and low-latency unleashes innovation at an unparalleled pace.  This blog post uses the latest version of Amazon FSx for Lustre which recently added a new API for moving data in and out of Amazon S3. This API also includes POSIX support, which allows files to mount with the same user id. Additionally, the latest version also includes a new backup feature that allows you to back up your files to an S3 bucket. I go into more detail of how to take advantage of this at the end of the blog.

Elastic Fabric Adapter

In April of 2019, AWS released EFA. This enables you to run applications requiring high levels of inter-node communications at scale on AWS.

AWS ParallelCluster 2.5.1.

AWS ParallelCluster is an open source cluster management tool that simplifies deploying and managing HPC clusters with Amazon FSx for Lustre, EFA, a variety of job schedulers, and the MPI library of your choice. AWS ParallelCluster simplifies cluster orchestration on AWS so that HPC environments become easy-to-use even for if you’re new to the cloud.  AWS recently released AWS ParallelCluster 2.5.1 – which is the version we will use for this blog.

These three AWS HPC components are optimal for CFD applications. Together, they provide simple deployment of HPC systems on AWS, low latency network communication for MPI workloads, and a fast, parallel filesystem. Now, let’s take a look at how these services come together and seamlessly run a real CFD application: Simcenter STAR-CCM+.

Solution

AWS has a long-standing collaboration with Siemens. AWS and Siemens are dedicated to enhancing Siemens’ customer experiences when they run Simcenter STAR-CCM+ apps on AWS. I am excited to walk you through the steps and the best practices for running Simcenter STAR-CCM+, the predominant CFD application from Siemens.

The Simcenter STAR-CCM+ application runs on an HPC system. This system is optimized with EFA and Amazon FSx for Lustre — all of which is managed by AWS ParallelCluster. AWS ParallelCluster simplifies the deployment process to such an extent that you can set up your HPC cluster with a high throughput parallel file system (Amazon FSx for Lustre), a high-throughput and low-latency network interface (EFA), and high-bandwidth network interconnects (100 Gbps using C5n instances) in less than 15 minutes.

Now that you have the services and solution overviews, we can get started. This blog post includes the following steps:

  1. Creating an HPC infrastructure stack on AWS, which will include:
    • How to set up AWS ParallelCluster for best performance
    • How to enable EFA
    • How to enable the Amazon FSx Lustre file system and how to use some basic Amazon FSx for Lustre for STAR-CCM+
    • How to connect to a remote desktop session using NICE DCV
  2. Installing the Simcenter STAR-CCM+ application and how to submit a Simcenter STAR-CCM+ job to HPC cluster.

The following diagram outlines the steps

Steps

Setting up the HPC infrastructure stack

Before you can run Simcenter STAR-CCM+, you need to build an HPC cluster first.  Some best practices that you should consider when setting up a cluster on AWS include:

  • Turn off Hyper-Threading (HT): AWS instances have HT turned on by default.
  • Use EFA and a cluster placement group for your compute fleet to minimize the latency between nodes.
  • Select the right instance type for your compute fleet. Here, I use C5n.18xlarge because of its high-performance CPU, high-bandwidth bandwidth networking, and the EFA network interface capabilities.

With HPC best practices in mind, you can set up your AWS ParallelCluster

Set up AWS ParallelCluster:

$aws s3 mb s3://benchmark-starccm
make_bucket: benchmark-starccm

*Note: As an example, I create an S3 bucket benchmark-starccm, however you should create an S3 bucket with a different name of your choice because the S3 bucket name must be globally unique.

Let’s download the STAR-CCM+ installation file and a case file, then upload them to the S3 bucket that we just created.

  • Download latest Simcenter STAR-CCM+ package from the Siemens portal. It will look something like this: STAR-CCM+15.02.003_01_linux-x86_64-2.12_gnu7.1.zip
  • Download the Le Mans case file. The Le Mans case is 104 Million cells, which even today is considered large for a CFD case.  The file will look something like this:  LeMans_104M.sim

After downloading the STAR-CCM+ software and LeMans case file, upload them to the S3 bucket created above

aws s3 cp STAR-CCM+15.02.003_01_linux-x86_64-2.12_gnu7.1.zip s3://benchmark-starccm
aws s3 cp LeMans_104M.sim s3://benchmark-starccm/

We will use this same S3 bucket to install the Simcenter STAR-CCM+ application later in this tutorial

With all the ground work done, we can now build our HPC cluster. For more detailed instructions, you can consult Getting Started with AWS ParallelCluster.

  • Install AWS ParallelCluster
pip install aws-parallelcluster
  • Configure AWS ParallelCluster with some basic network information such as AWS Region ID, VPC ID, Subnet ID

pcluster configure

Modify your ~/.parallelcluster/config file to include a cluster section that minimally includes the following:

[aws]
aws_region_name = us-east-2

[cluster default]
vpc_settings = public
key_name = <Key-Name>
initial_queue_size = 2
max_queue_size = 100
maintain_initial_size = true
placement_group = DYNAMIC
placement = cluster
master_instance_type = c5.xlarge
compute_instance_type = c5n.18xlarge
cluster_type = ondemand

base_os = centos7
tags = {“Name” : “STARCCM”}

enable_efa = compute
fsx_settings = fsxshared
disable_hyperthreading = true

dcv_settings = hpc-dcv

[vpc public]
master_subnet_id = subnet-<Subnet-ID>
vpc_id = vpc-<VPC-ID>

[global]
update_check = true
sanity_check = true
cluster_template = default

[dcv hpc-dcv]

enable = master

[fsx fsxshared]
shared_dir = /fsx
storage_capacity = 1200
imported_file_chunk_size = 1024
import_path = s3://benchmark-starccm

export_path = s3://benchmark-starccm

[aliases]
ssh = ssh {CFN_USER}@{MASTER_IP} {ARGS}

  • Now, create your first HPC cluster with the name starccmby running

pcluster create starccm

Your HPC cluster should be ready in about 15 minutes.

While you’re waiting for your cluster to be ready, let’s take a deeper look at what some of the different parameters we used mean for our HPC cluster:

initial_queue_size: We will start with two compute instances after the HPC cluster is up.

max_queue_size: We will limit the maximum compute fleet to 100 instances. This allows us room to scale our jobs up to a large number of cores while putting a limit on the number of compute nodes to help control costs.

base_os: For this blog, we will select centos 7 as a base os. Currently we support Amazon Linux (alinux), Centos 7 (centos7), Ubuntu 16.04 (ubuntu1604), and Ubuntu 18.04 (ubuntu1804) with EFA.

master_instance_type: This can be any instance type. Here we choose c5.xlarge because it is inexpensive and relatively fast for the head node.

compute_instance_type: We select C5n.18xlarge because it is optimized for compute-intensive workloads and supports EFA for better scaling of HPC. Note that EFA is currently only available on c5n.18xlarge, c5n.metal, i3en.24xlarge, p3dn.24xlarge, inf1.24xlarge, m5dn.24xlarge, m5n.24xlarge, r5dn.24xlarge, r5n.24xlarge, and p3dn.24xlarge. See the docs for Currently supported instances.

placement_group: We use placement_group to ensure our instances are located as physically close to one another as possible to minimize the latency between compute nodes and take advantage of EFA’s low latency networking.

enable_efa: with just one configuration line, we can easily turn on EFA support for our HPC cluster.

dcv_settings = hpc-dcv: With AWS ParallelCluster 2.5.1 you can use NICE DCV to support your remote visualization needs.

disable_hyperthreading: This setting turns off hyper-threading on the cluster

[fsx fsxshared]: This section contains the settings to define your FSx for Lustre parallel file system, including the location where the shared directory will be mounted, the storage capacity for the filesystem, the chunk size for files to be imported, and the location from which the data will be imported. You can read more about FSx for Lustre here.

[dcv hpc-dcv]: This section contains the settings to define your remote visualization setup. You can read more about DCV with AWS ParallelCluster here.

  •  After you set up your config file for AWS ParallelCluster, log in and verify that you can access the cluster’s head node
pcluster ssh starccm -i ~/path/to/ssh_key
  • Verify the compute nodes are up. We should see two c5n.18xlarge nodes.
$ qhost
HOSTNAME ARCH NCPU NSOC NCOR NTHR LOAD MEMTOT MEMUSE SWAPTO SWAPUS
global - - - - - - - - - -
ip-172-31-14-220 lx-amd64 36 2 36 36 0.49 184.6G 11.7G 0.0 0.0
ip-172-31-2-137 lx-amd64 36 2 36 36 0.45 184.6G 11.7G 0.0 0.0
  • Verify the EFA driver has been loaded successfully.

In order to verify if EFA is installed correctly you will need to ssh into one of compute nodes and run :

[[email protected] ~]$ which mpirun
/opt/amazon/openmpi/bin/mpirun

[[email protected] ~]$ fi_info -p efa

provider: efa
fabric: EFA-fe80::97:9fff:fe1e:4e78
domain: efa_0-rdm
version: 2.0
type: FI_EP_RDM
protocol: FI_PROTO_EFA

provider: efa
fabric: EFA-fe80::97:9fff:fe1e:4e78
domain: efa_0-dgrm
version: 2.0
type: FI_EP_DGRAM
protocol: FI_PROTO_EFA
provider: efa;ofi_rxd
fabric: EFA-fe80::97:9fff:fe1e:4e78
domain: efa_0-dgrm
version: 1.0
type: FI_EP_RDM
protocol: FI_PROTO_RXD

At this point, EFA is verified.

Install Simcenter STAR-CCM+ application

Now that the HPC cluster using AWS ParallelCluster is set up, it’s time to install the Simcenter STAR-CCM+ application.  In the prior steps, you uploaded a Simcenter STAR-CCM+ application and a case file to S3 bucket and used that S3 bucket as a source for the Amazon FSx for Lustre /fsx storage. As soon as the cluster created, the installation file and the case file will be available in /fsx

[[email protected] fsx]$ cd /fsx
[[email protected] fsx]$ ls
LeMans_104M.sim  STAR-CCM-14.06.013_01_linux-x86_64.tar.gz  STAR-CCM+14.06.013_01_linux-x86_64.tar.gz

As you can see, Amazon FSx for Lustre has already downloaded the case file from the S3 bucket to the /fsx partition, so now you can start install Simcenter STAR-CCM+ software using the following steps.

  • Install Simcenter STAR-CCM+: install Simcenter STAR-CCM+ on /fsx  – a 1.2TB Lustre filesystem that you configured in a previous step.
cd /fsx
sudo unzip STAR-CCM+15.02.003_01_linux-x86_64-2.12_gnu7.1.zip
cd STAR-CCM+15.02.003_01_linux-x86_64-2.12_gnu7.1
./STAR-CCM+15.02.003_01_linux-x86_64-2.12_gnu7.1.sh
Select Installation Location : /fsx/Siemens

After following all the standard installation steps from Simcenter STAR-CCM+, the application should be installed at the following location:

/fsx/Siemens/15.02.003/STAR-CCM+15.02.003/

  • Test the installation
[[email protected] fsx]$ /fsx/Siemens/15.02.003/STAR-CCM+15.02.003/star/bin/starccm+ -version
Simcenter STAR-CCM+ 2020.1 Build 15.02.003 (linux-x86_64-2.12/gnu7.1)

When the above code shows up, you correctly installed Simcenter STAR-CCM+, so now you can run the application.

Running Simcenter STAR-CCM+ on AWS ParallelCluster

Before I move on, let’s recap what you’ve have done so far.

  1. You set up an HPC cluster using AWS ParallelCluster with compute-optimized C5n instances, an Amazon FSx for Lustre filesystem, and EFA-enabled networking.
  2. You installed Simcenter STAR-CCM+ application

Now, let us create an SGE submission script and submit a job to the HPC cluster.

  • Create an SGE job submission script :
cd /fsx/

vi star-ccm.qsub

#!/bin/bash
#$ -N check
#$ -cwd
#$ -j Y
#$ -pe mpi 252

date

your_pod_key="your license key"
ccmp=" /fsx/Siemens/15.02.003/STAR-CCM+15.02.003/star/bin/starccm+”
case="/fsx/LeMans_104M.sim"

fabric="OFI"
tag="fsx-${fabric}"

${ccmp} \
-power \
-podkey $your_pod_key \
-licpath [email protected] \
-mpi openmpi \
-bs sge \
-benchmark:"-preits 40 -nits 20 -nps ${NSLOTS} -tag ${tag}" \
${case}

date

mkdir ${NSLOTS}new
mv *xml ${NSLOTS}new
  • Submit your Simcenter STAR-CCM+ job to your HPC cluster
qsub -V star-ccm.qsub

Now you have submitted an HPC job that requests 252 cores of c5n.18xlarge.

  • Check the status of your jobs by running
qstat -f

Simcenter STAR-CCM+ result analysis

Here are sample scaling results for the LeMans 104M cell benchmark case. As you can see, the Simcenter STAR-CCM+ result shows exciting performance and scaling results running on AWS. The performance is fast and the simulation scales very well with EFA-based networking and great CPU performance.

Connect to a NICE DCV session

AWS ParallelCluster is now natively integrated with NICE DCV. You can configure a NICE DCV session to visualize your STAR-CCM+ result or connect to the application remotely.

As you will recall from when we configured our AWS ParallelCluster in the previous section, I named the DCV server as hpc-dcv.  To create and connect to a NICE DCV session, just run:

pcluster dcv connect hpc-dcv -k <Key-Name>

After you connect to NICE DCV session, you will be able to access to a Linux desktop to work with STAR-CCM+ application.

Backup SIMCENTER STAR-CCM+ result to S3 bucket

After you finish your STAR-CCM+ simulation, you can backup data in /fsx to your S3 bucket that you created from running your application. You can now use Data Repository Tasks. Data Repository Tasks represent bulk operations between your Amazon FSx for Lustre file system and your S3 bucket. One of the jobs is to export your changed file system contents back to its linked S3 bucket.

*Note: in order to use new Amazon FSx for Lustre feature, you will need to have AWS CLI version 1.16.309 or above.

In this case, I select to export the  STAR-CCM+ application directory Siemens as an example.

  • Exit the HPC head node, and go back to your laptop or Cloud9 environment where you have configured your AWS CLI. Find out your Amazon FSx Lustre ID by running:
aws fsx describe-file-systems
  • After you find the Amazon FSx for Lustre ID, which looks simiar to fs-0d72d520f620d765a, create a backup of the data by running:
create-data-repository-task

aws fsx create-data-repository-task --file-system-id fs-0d72d520f620d765a --type EXPORT_TO_REPOSITORY --paths Siemens,testfsx --report Enabled=true,Scope=FAILED_FILES_ONLY,Format=REPORT_CSV_20191124,Path=s3://benchmark-starccm/

Explanation:

–file-system-id: your file system ID

–type EXPORT_TO_REPOSITORY: we will export the data back to the S3 bucket

–paths Siemens,testfsx: the directories you want to export to S3 bucket

Format=REPORT_CSV_20191124:  note this is only name the Amazon FSx Lustre supports. Please keep it the same.

  • Check the status of the backup by running describe-create-data-repository-task
aws fsx describe-data-repository-tasks

As you can see, I can now use FSx for Lustre to install the Simcenter STAR-CCM+ application, and seamlessly move case data between on-premise to Amazon S3 and AWS HPC system. I can create a file system linked to a S3 bucket, create a FSx for Lustre filesystem, and export data back to their S3 bucket after running the CFD app.

Conclusion

Give it a try, set up an HPC environment, and let us know how it goes! If you need more information about running CFD and HPC cases on AWS you can find it on our HPC home page. Please feel free to contact us with questions that you might have.

 

Running ANSYS Fluent on Amazon EC2 C5n with Elastic Fabric Adapter (EFA)

Post Syndicated from Emma White original https://aws.amazon.com/blogs/compute/running-ansys-fluent-on-amazon-ec2-c5n-with-elastic-fabric-adapter-efa/

Written by: Nicola Venuti, HPC Specialist Solutions Architect 

In July 2019 I published: “Best Practices for Running Ansys Fluent Using AWS ParallelCluster.” The first post demonstrated how to launch ANSYS Fluent on AWS using AWS ParallelCluster. In this blog, I discuss a new AWS service: the Elastic Fabric Adapter (EFA).   I also walk you through an example EFA for tightly coupled workloads. Finally, I demonstrate how you can accelerate your tightly coupled (MPI) workloads with EFA, which lowers your cost per job.

 

EFA is a new network interface for Amazon EC2 instances designed to accelerate tightly coupled (MPI) workloads on AWS. AWS announced EFA at re:Invent in 2018. If you want to learn more about EFA, you can read Jeff Barr’s blog post and watch this technical video led by our Principal Engineer, Brian Barrett.

 

After reading our first blog post, readers asked for benchmark results and possibly the cost associated to the job. So, in addition to a step-by-step guide on how to run your first ANSYS Fluent job with EFA, this blog post also shows the results (in terms of rating and scaling curve) up to 5000 cores of a common ANSYS Fluent benchmark the Formula-1 Race Car (140M cells Mesh), and the costs per job comparison among the most suitable Amazon EC2 instance types.

 

Create your HPC Cluster

In this part of the blog, I will walk you through the following:  the setup of AWS ParallelCluster configuration file, the setup the post-install script, and the deployment of your HPC cluster.

 

Setup AWS ParellelCluster
I use AWS ParallelCluster in this example because it simplifies the deployment of  HPC clusters on AWS.  This AWS supported, open source tool manages and deploys HPC clusters in the cloud.  Additionally, AWS ParellelCluster is already integrated with EFA, which eliminates extra effort to run your preferred HPC applications.

 

The latest release (2.5.1) of AWS ParellelCluster simplifies cluster deployment in three main ways. First, the updates remove the need for custom AMIs. Second, important components (particularly Nice DCV) run on AWS ParallelCluster. Finally, the hyperthreading can be shutdown using a new parameter in the configuration file.

 

Note: If you need additional instructions on how to install AWS ParallelCluster and get started, read this blog post, and/or the AWS ParallelCluster documentation.

 

The first few steps of this blog post differ from the previous post’s because of these updates. This means that the AWS ParallelCluster configuration file is different. In particular, here are the additions:

  1. enable_efa = compute in the [cluster] section
  2. the new [dcv] section and the dcv_settings parameter in the [cluster] section
  3. the new parameter disable_hyperthreading in the [cluster] section

These additions to the configuration file enable automatic functionalities that previously needed to be enabled manually.

 

Next, in your preferred text editor paste the following code:

aws_region_name = <your-preferred-region>

[global]
sanity_check = true
cluster_template = fluentEFA
update_check = true

[vpc my-vpc-1]
vpc_id = vpc-<VPC-ID>
master_subnet_id = subnet-<Subnet-ID>

[cluster fluentEFA]
key_name = <Key-Name>
vpc_settings = my-vpc-1
compute_instance_type=c5n.18xlarge
master_instance_type=c5n.2xlarge
initial_queue_size = 0
max_queue_size = 100
maintain_initial_size = true
scheduler=slurm
cluster_type = ondemand
s3_read_write_resource=arn:aws:s3:::<Your-S3-Bucket>*
post_install = s3://<Your-S3-Bucket>/fluent-efa-post-install.sh
placement_group = DYNAMIC
placement = compute
base_os = centos7
tags = {"Name" : "fluentEFA"}
disable_hyperthreading = true
fsx_settings = parallel-fs
enable_efa = compute
dcv_settings = my-dcv

[dcv my-dcv]
enable = master

[fsx parallel-fs]
shared_dir = /fsx
storage_capacity = 1200
import_path = s3://<Your-S3-Bucket>
imported_file_chunk_size = 1024
export_path = s3://<Your-S3-Bucket>/export

 

Now that the ParellelCluster is set up, you are ready for the second step: post-install script.

 

Edit the Post-Install Script

Below is an example of a post install script. Make sure it is saved in the S3 bucket defined with the parameter post_install = s3://<Your-S3-Bucket>/fluent-efa-post-install.sh in the configuration file above.

 

To upload the post-install script into your S3 bucket, run the following command:

aws s3 cp fluent-efa-post-install.sh s3://<Your-S3-Bucket>/fluent-efa-post-install.sh

#!/bin/bash

#this will disable the ssh host key checking
#usually not needed, but Fluent might require this setting.
cat <<\EOF >> /etc/ssh/ssh_config
StrictHostKeyChecking no
UserKnownHostsFile=/dev/null
EOF

# set higher ulimits,
# usefull when running Fluent (and in general HPC applications) on multiple instances via mpi
cat <<\EOF >> /etc/security/limits.conf
* hard memlock unlimited
* soft memlock unlimited
* hard stack 1024000
* soft stack 1024000
* hard nofile 1024000
* soft nofile 1024000
EOF

#stop and disable the firewall
systemctl disable firewalld
systemctl stop firewalld

Now, you have in place all the s of AWS ParellelCluster, and you are ready to deploy your HPC cluster.

 

Deploy HPC Cluster

Run the following command to create your HPC cluster that is EFA enabled:

pcluster create -c fluent.config fluentEFA -t fluentEFA -r <your-preferred-region>

Note:  The “*” at the end of the s3_read_write_resource parameter line is needed in order to let AWS ParallelCluster accessing your S3 bucket correctly. So, for example, if your S3 bucket is called “ansys-download,” it would look like:

s3_read_write_resource=arn:aws:s3:::ansys-download*

 

You should have your HPC cluster up and running after following the three main steps in this section. Now you can install ANSYS Fluent.

Install ANSYS Fluent

The previous section of this post should take about 10 minutes to produce the following output:

Status: parallelcluster-fluentEFA - CREATE_COMPLETE
MasterPublicIP: 3.212.243.33
ClusterUser: centos
MasterPrivateIP: 10.6.1.153

Once you receive that successful output, you can move on to install ANSYS fluent. Enter the following commands to connect to the master node of your new cluster via SSH and/or DCV:

  1. via SSH: pcluster ssh fluentEFA -i ~/my_key.pem
  2. via DCV: pcluster dcv connect fluentEFA --key-path ~/my_key.pem

 

Once you are logged in, become root (sudo su - or sudo -i ), and install the ANSYS suite under the /fsx directory. You can install it manually, or you can use the sample script.

 

Note: I defined the import_path = s3://<Your-S3-Bucket> in the Amazon FSx section of the configuration file. This tells Amazon FSx to preload all the data from <Your-S3-Bucket>. I recommend copying the ANSYS installation files, and any other file or package you need, to S3 in advance. This step ensures that your files are available under the /fsx directory of your cluster.

 

The example below uses the ANSYS iso installation files. You can use either the tar or the iso file. You can download both from the ANSYS Customer Portal under Download → Current Release.

 

Run this sample script to install ANSYS:

#!/bin/bash

#check the installation directory
if [ ! -d "${1}" -o -z "${1}" ]; then
echo "Error: please check the install dir"
exit -1
fi

ansysDir="${1}/ansys_inc"
installDir="${1}/"

ansysDisk1="ANSYS2019R3_LINX64_Disk1.iso"
ansysDisk2="ANSYS2019R3_LINX64_Disk2.iso"

# mount the Disks
disk1="${installDir}/AnsysDisk1"
disk2="${installDir}/AnsysDisk2"
mkdir -p "${disk1}"
mkdir -p "${disk2}"

echo "Mounting ${ansysDisk1} ..."
mount -o loop "${installDir}/${ansysDisk1}" "${disk1}"

echo "Mounting ${ansysDisk2} ..."
mount -o loop "${installDir}/${ansysDisk2}" "${disk2}"

# INSTALL Ansys WB
echo "Installing Ansys ${ansysver}"
"${disk1}/INSTALL" -silent -install_dir "${ansysDir}" -media_dir2 "${disk2}"

echo "Ansys installed"

umount -l "${disk1}"
echo "${ansysDisk1} unmounted..."

umount -l "${disk2}"
echo "${ansysDisk2} unmounted..."

echo "Cleaning up temporary install directory"
rm -rf "${disk1}"
rm -rf "${disk2}"

echo "Installation process completed!"

 

Congrats, now you have successfully installed ANSYS Workbench!

 

Adapt the ANSYS Fluent mpi_wrapper

Now that your HPC cluster is running and that ANSYS Workbench is installed, you can patch ANSYS Fluent. ANSYS Fluent does not currently support EFA out-of-the-box, so, you need to make a few modifications to get your app running properly.

 

Complete the following steps to make the proper modifications:

 

Open mpirun.fl (an MPI wrapper script) with your preferred text editor:

vim /fsx/ansys_inc/v195/fluent/fluent19.5.0/multiport/mpi_wrapper/bin/mpirun.fl 

 

Comment this line 465:

# For better performance, suggested by Intel
FS_MPIRUN_FLAGS="$FS_MPIRUN_FLAGS -genv I_MPI_ADJUST_REDUCE 2 -genv I_MPI_ADJUST_ALLREDUCE 2 -genv I_MPI_ADJUST_BCAST 1"

 

In addition to that, line 548:

FS_MPIRUN_FLAGS="$FS_MPIRUN_FLAGS -genv LD_PRELOAD $INTEL_ROOT/lib/libmpi_mt.so"

should be modified as follows:

FS_MPIRUN_FLAGS="$FS_MPIRUN_FLAGS -genv LD_PRELOAD $INTEL_ROOT/lib/release_mt/libmpi.so"

The library file location and name changed for Intel 2019 Update 5. Fixing this will remove the following error message:

ERROR: ld.so: object '/opt/intel/parallel_studio_xe_2019/compilers_and_libraries_2019/linux/mpi/intel64//lib/libmpi_mt.so' from LD_PR ELOAD cannot be preloaded: ignored

 

I recommend backing-up the MPI wrapper script before any modification:

cp /fsx/ansys_inc/v195/fluent/fluent19.5.0/multiport/mpi_wrapper/bin/mpirun.fl /fsx/ansys_inc/v195/fluent/fluent19.5.0/multiport/mpi_wrapper/bin/mpirun.fl.ORIG

Once these steps are completed, your ANSYS Fluent installation is properly modified to support EFA.

 

Run your first ANSYS Fluent job using EFA

You are almost ready to run your first ANSYS Fluent job using EFA. You can use the same submission script used previously.  Export INTELMPI_ROOT or OPENMPI_ROOT in order to specify the custom MPI library to use.

 

The following script demonstrates this step:

#!/bin/bash

#SBATCH -J Fluent
#SBATCH -o Fluent."%j".out

module load intelmpi
export INTELMPI_ROOT=/opt/intel/compilers_and_libraries_2019.5.281/linux/mpi/intel64/

export [email protected]<your-license-server>
export [email protected]<your-license-server>

basedir="/fsx"
workdir="${basedir}/$(date "+%d-%m-%Y-%H-%M")-${SLURM_NPROCS}-$RANDOM"
mkdir "${workdir}"
cd "${workdir}"
cp "${basedir}/f1_racecar_140m.tar.gz" .
tar xzvf f1_racecar_140m.tar.gz
rm -f f1_racecar_140m.tar.gz
cd bench/fluent/v6/f1_racecar_140m/cas_dat

srun -l /bin/hostname | sort -n | awk '{print $2}' > hostfile
${basedir}/ansys_inc/v194/fluent/bin/fluentbench.pl f1_racecar_140m -t${SLURM_NPROCS} -cnf=hostfile -part=1 -nosyslog -noloadchk -ssh -mpi=intel -cflush

Save this snippet as fluent-run-efa.sh under /fsx and run it as follows:

sbatch -n 2304 /fsx/fluent-run-efa.sh

 

Note1: The number, 2304 cores, is an example, this command will tell AWS ParallelCluster to spin-up 64 C5n.18xlarge. Feel free to change it and run it as you wish.

Note2: you may want to copy on S3 the benchmark file f1_racecar_140m.tar.gz or any other dataset you want to use, so that it’s preloaded on Amazon FSx and ready for you to use.

 

Performance and cost considerations

Now I will show benchmark results (in terms of rating and scaling efficiency) and cost per job (only EC2 instances costs will be considered). The following graph shows the scaling curve of EFA vs C5.18xlarge vs the ideal scalability.

 

The Formula-1 Race Car used for this benchmark is a 140-M cells mesh. The range of 70k-100k cells per core optimizes cost for performance. Improvement in turnaround time continues up to 40,000 cells per core with an acceptable cost for the performance. C5n.18xlarge + EFA shows ~89% scaling efficiency at 3024 cores. This metric is a great improvement compared to the C5.18xlarge scaling (48% at 3024 cores). In both cases, I ran with the hyperthreading disabled, up to 84 instances in total.

Scaling vs number of cores graph

ANSYS has published some results of this benchmark here. The plot below shows the “Rating” of a Cray XC50 and C5n.18xlarge + EFA. In ANSYS’ own words the rating is defined as: “ the primary metric used to report performance results of the Fluent Benchmarks. It is defined as the number of benchmarks that can be run on a given machine (in sequence) in a 24 hour period. It is computed by dividing the number of seconds in a day (86,400 seconds) by the number of seconds required to run the benchmark. A higher rating means faster performance.”

 

The plot below shows C5n.18xlarge + EFA with a higher rating than the XC50, up to ~2400 cores, and is on par with it up to ~3800 cores.

In addition to turnaround time improvements, EFA brings another primary advantage: cost reduction. At the moment, C5n.18xlarge costs 27% more compared to C5.18xlarge (EFA is available at no additional cost). This price difference is due to the higher, 4x network performance (100-Gbps vs 25-Gbps) and 33% higher memory footprint (192 vs 144 GB). The following chart shows the cost comparison between C5.18xlarge and C5n.18xlarge + EFA as I scale out for the ANSYS benchmark run.

 

cost per run vs number of cores

Please note that the chart above shows the cost per job using the On-Demand price (OD) in US-East-1 (N. Virginia), for short jobs (that last minutes or even hours) you may want to consider using the EC2 Spot price. Spot Instances offer spare EC2 instances at steep discounts. At the moment I am writing this blog post, the C5n.18xlarge Spot price in N. Virginia is 70% lower compared to the On-Demand price: a significant price reduction.

 

Conclusions

This blog post reviewed best practices for running ANSYS Fluent with EFA, and walked through the performance and cost benefits of EFA running a 140-M cell ANSYS Fluent benchmark. Computational Fluid Dynamics and tightly coupled workloads involve an iterative process of tuning, refining, testing, and benchmarking. Many variables can affect performance of these workloads, so the AWS HPC team is committed to document best practices for MPI workloads on AWS.

I would also love to hear from you. Please let us know about other applications you would like us to test and features you would like to request.

Decoupled Serverless Scheduler To Run HPC Applications At Scale on EC2

Post Syndicated from Emma White original https://aws.amazon.com/blogs/compute/decoupled-serverless-scheduler-to-run-hpc-applications-at-scale-on-ec2/

This post is written by Ludvig Nordstrom and Mark Duffield | on November 27, 2019

In this blog post, we dive in to a cloud native approach for running HPC applications at scale on EC2 Spot Instances, using a decoupled serverless scheduler. This architecture is ideal for many workloads in the HPC and EDA industries, and can be used for any batch job workload.

At the end of this blog post, you will have two takeaways.

  1. A highly scalable environment that can run on hundreds of thousands of cores across EC2 Spot Instances.
  2. A fully serverless architecture for job orchestration.

We discuss deploying and running a pre-built serverless job scheduler that can run both Windows and Linux applications using any executable file format for your application. This environment provides high performance, scalability, cost efficiency, and fault tolerance. We introduce best practices and benefits to creating this environment, and cover the architecture, running jobs, and integration in to existing environments.

quick note about the term cloud native: we use the term loosely in this blog. Here, cloud native  means we use AWS Services (to include serverless and microservices) to build out our compute environment, instead of a traditional lift-and-shift method.

Let’s get started!

 

Solution overview

This blog goes over the deployment process, which leverages AWS CloudFormation. This allows you to use infrastructure as code to automatically build out your environment. There are two parts to the solution: the Serverless Scheduler and Resource Automation. Below are quick summaries of each part of the solutions.

Part 1 – The serverless scheduler

This first part of the blog builds out a serverless workflow to get jobs from SQS and run them across EC2 instances. The CloudFormation template being used for Part 1 is serverless-scheduler-app.template, and here is the Reference Architecture:

 

Serverless Scheduler Reference Architecture . Reference Architecture for Part 1. This architecture shows just the Serverless Schduler. Part 2 builds out the resource allocation architecture. Outlined Steps with detail from figure one

    Figure 1: Serverless Scheduler Reference Architecture (grayed-out area is covered in Part 2).

Read the GitHub Repo if you want to look at the Step Functions workflow contained in preceding images. The walkthrough explains how the serverless application retrieves and runs jobs on its worker, updates DynamoDB job monitoring table, and manages the worker for its lifetime.

 

Part 2 – Resource automation with serverless scheduler


This part of the solution relies on the serverless scheduler built in Part 1 to run jobs on EC2.  Part 2 simplifies submitting and monitoring jobs, and retrieving results for users. Jobs are spread across our cost-optimized Spot Instances. AWS Autoscaling automatically scales up the compute resources when jobs are submitted, then terminates them when jobs are finished. Both of these save you money.

The CloudFormation template used in Part 2 is resource-automation.template. Building on Figure 1, the additional resources launched with Part 2 are noted in the following image, they are an S3 Bucket, AWS Autoscaling Group, and two Lambda functions.

Resource Automation using Serverless Scheduler This is Part 2 of the deployment process, and leverages the Part 1 architecture. This provides the resource allocation, that allows for automated job submission and EC2 Auto Scaling. Detailed steps for the prior image

 

Figure 2: Resource Automation using Serverless Scheduler

                               

Introduction to decoupled serverless scheduling

HPC schedulers traditionally run in a classic master and worker node configuration. A scheduler on the master node orchestrates jobs on worker nodes. This design has been successful for decades, however many powerful schedulers are evolving to meet the demands of HPC workloads. This scheduler design evolved from a necessity to run orchestration logic on one machine, but there are now options to decouple this logic.

What are the possible benefits that decoupling this logic could bring? First, we avoid a number of shortfalls in the environment such as the need for all worker nodes to communicate with a single master node. This single source of communication limits scalability and creates a single point of failure. When we split the scheduler into decoupled components both these issues disappear.

Second, in an effort to work around these pain points, traditional schedulers had to create extremely complex logic to manage all workers concurrently in a single application. This stifled the ability to customize and improve the code – restricting changes to be made by the software provider’s engineering teams.

Serverless services, such as AWS Step Functions and AWS Lambda fix these major issues. They allow you to decouple the scheduling logic to have a one-to-one mapping with each worker, and instead share an Amazon Simple Queue Service (SQS) job queue. We define our scheduling workflow in AWS Step Functions. Then the workflow scales out to potentially thousands of “state machines.” These state machines act as wrappers around each worker node and manage each worker node individually.  Our code is less complex because we only consider one worker and its job.

We illustrate the differences between a traditional shared scheduler and decoupled serverless scheduler in Figures 3 and 4.

 

Traditional Scheduler Model This shows a traditional sceduler where there is one central schduling host, and then multiple workers.

Figure 3: Traditional Scheduler Model

 

Decoupled Serverless Scheduler on each instance This shows what a Decoupled Serverless Scheduler design looks like, wit

Figure 4: Decoupled Serverless Scheduler on each instance

 

Each decoupled serverless scheduler will:

  • Retrieve and pass jobs to its worker
  • Monitor its workers health and take action if needed
  • Confirm job success by checking output logs and retry jobs if needed
  • Terminate the worker when job queue is empty just before also terminating itself

With this new scheduler model, there are many benefits. Decoupling schedulers into smaller schedulers increases fault tolerance because any issue only affects one worker. Additionally, each scheduler consists of independent AWS Lambda functions, which maintains the state on separate hardware and builds retry logic into the service.  Scalability also increases, because jobs are not dependent on a master node, which enables the geographic distribution of jobs. This geographic distribution allows you to optimize use of low-cost Spot Instances. Also, when decoupling the scheduler, workflow complexity decreases and you can customize scheduler logic. You can leverage lower latency job monitoring and customize automated responses to job events as they happen.

 

Benefits

  • Fully managed –  With Part 2, Resource Automation deployed, resources for a job are managed. When a job is submitted, resources launch and run the job. When the job is done, worker nodes automatically shut down. This prevents you from incurring continuous costs.

 

  • Performance – Your application runs on EC2, which means you can choose any of the high performance instance types. Input files are automatically copied from Amazon S3 into local Amazon EC2 Instance Store for high performance storage during execution. Result files are automatically moved to S3 after each job finishes.

 

  • Scalability – A worker node combined with a scheduler state machine become a stateless entity. You can spin up as many of these entities as you want, and point them to an SQS queue. You can even distribute worker and state machine pairs across multiple AWS regions. These two components paired with fully managed services optimize your architecture for scalability to meet your desired number of workers.

 

  • Fault Tolerance –The solution is completely decoupled, which means each worker has its own state machine that handles scheduling for that worker. Likewise, each state machine is decoupled into Lambda functions that make up your state machine. Additionally, the scheduler workflow includes a Lambda function that confirms each successful job or resubmits jobs.

 

  • Cost Efficiency – This fault tolerant environment is perfect for EC2 Spot Instances. This means you can save up to 90% on your workloads compared to On-Demand Instance pricing. The scheduler workflow ensures little to no idle time of workers by closely monitoring and sending new jobs as jobs finish. Because the scheduler is serverless, you only incur costs for the resources required to launch and run jobs. Once the job is complete, all are terminated automatically.

 

  • Agility – You can use AWS fully managed Developer Tools to quickly release changes and customize workflows. The reduced complexity of a decoupled scheduling workflow means that you don’t have to spend time managing a scheduling environment, and can instead focus on your applications.

 

 

Part 1 – serverless scheduler as a standalone solution

 

If you use the serverless scheduler as a standalone solution, you can build clusters and leverage shared storage such as FSx for Lustre, EFS, or S3. Additionally, you can use AWS CloudFormation or to deploy more complex compute architectures that suit your application. So, the EC2 Instances that run the serverless scheduler can be launched in any number of ways. The scheduler only requires the instance id and the SQS job queue name.

 

Submitting Jobs Directly to serverless scheduler

The severless scheduler app is a fully built AWS Step Function workflow to pull jobs from an SQS queue and run them on an EC2 Instance. The jobs submitted to SQS consist of an AWS Systems Manager Run Command, and work with any SSM Document and command that you chose for your jobs. Examples of SSM Run Commands are ShellScript and PowerShell.  Feel free to read more about Running Commands Using Systems Manager Run Command.

The following code shows the format of a job submitted to SQS in JSON.

  {

    "job_id": "jobId_0",

    "retry": "3",

    "job_success_string": " ",

    "ssm_document": "AWS-RunPowerShellScript",

    "commands":

        [

            "cd C:\\ProgramData\\Amazon\\SSM; mkdir Result",

            "Copy-S3object -Bucket my-bucket -KeyPrefix jobs/date/jobId_0 -LocalFolder .\\",

            "C:\\ProgramData\\Amazon\\SSM\\jobId_0.bat",

            "Write-S3object -Bucket my-bucket -KeyPrefix jobs/date/jobId_0 –Folder .\\Result\\"

        ],

  }

 

Any EC2 Instance associated with a serverless scheduler it receives jobs picked up from a designated SQS queue until the queue is empty. Then, the EC2 resource automatically terminates. If the job fails, it retries until it reaches the specified number of times in the job definition. You can include a specific string value so that the scheduler searches for job execution outputs and confirms the successful completions of jobs.

 

Tagging EC2 workers to get a serverless scheduler state machine

In Part 1 of the deployment, you must manage your EC2 Instance launch and termination. When launching an EC2 Instance, tag it with a specific tag key that triggers a state machine to manage that instance. The tag value is the name of the SQS queue that you want your state machine to poll jobs from.

In the following example, “my-scheduler-cloudformation-stack-name” is the tag key that serverless scheduler app will for with any new EC2 instance that starts. Next, “my-sqs-job-queue-name” is the default job queue created with the scheduler. But, you can change this to any queue name you want to retrieve jobs from when an instance is launched.

{"my-scheduler-cloudformation-stack-name":"my-sqs-job-queue-name"}

 

Monitor jobs in DynamoDB

You can monitor job status in the following DynamoDB. In the table you can find job_id, commands sent to Amazon EC2, job status, job output logs from Amazon EC2, and retries among other things.

Alternatively, you can query DynamoDB for a given job_id via the AWS Command Line Interface:

aws dynamodb get-item --table-name job-monitoring \

                      --key '{"job_id": {"S": "/my-jobs/my-job-id.bat"}}'

 

Using the “job_success_string” parameter

For the prior DynamoDB table, we submitted two identical jobs using an example script that you can also use. The command sent to the instance is “echo Hello World.” The output from this job should be “Hello World.” We also specified three allowed job retries.  In the following image, there are two jobs in SQS queue before they ran.  Look closely at the different “job_success_strings” for each and the identical command sent to both:

DynamoDB CLI info This shows an example DynamoDB CLI output with job information.

From the image we see that Job2 was successful and Job1 retried three times before permanently labelled as failed. We forced this outcome to demonstrate how the job success string works by submitting Job1 with “job_success_string” as “Hello EVERYONE”, as that will not be in the job output “Hello World.” In “Job2” we set “job_success_string” as “Hello” because we knew this string will be in the output log.

Job outputs commonly have text that only appears if job succeeded. You can also add this text yourself in your executable file. With “job_success_string,” you can confirm a job’s successful output, and use it to identify a certain value that you are looking for across jobs.

 

Part 2 – Resource Automation with the serverless scheduler

The additional services we deploy in Part 2 integrate with existing architectures to launch resources for your serverless scheduler. These services allow you to submit jobs simply by uploading input files and executable files to an S3 bucket.

Likewise, these additional resources can use any executable file format you want, including proprietary application level scripts. The solution automates everything else. This includes creating and submitting jobs to SQS job queue, spinning up compute resources when new jobs come in, and taking them back down when there are no jobs to run. When jobs are done, result files are copied to S3 for the user to retrieve. Similar to Part 1, you can still view the DynamoDB table for job status.

This architecture makes it easy to scale out to different teams and departments, and you can submit potentially hundreds of thousands of jobs while you remain in control of resources and cost.

 

Deeper Look at the S3 Architecture

The following diagram shows how you can submit jobs, monitor progress, and retrieve results. To submit jobs, upload all the needed input files and an executable script to S3. The suffix of the executable file (uploaded last) triggers an S3 event to start the process, and this suffix is configurable.

The S3 key of the executable file acts as the job id, and is kept as a reference to that job in DynamoDB. The Lambda (#2 in diagram below) uses the S3 key of the executable to create three SSM Run Commands.

  1. Synchronize all files in the same S3 folder to a working directory on the EC2 Instance.
  2. Run the executable file on EC2 Instances within a specified working directory.
  3. Synchronize the EC2 Instances working directory back to the S3 bucket where newly generated result files are included.

This Lambda (#2) then places the job on the SQS queue using the schedulers JSON formatted job definition seen above.

IMPORTANT: Each set of job files should be given a unique job folder in S3 or more files than needed might be moved to the EC2 Instance.

 

Figure 5: Resource Automation using Serverless Scheduler - A deeper look A deeper dive in to Part 2, resource allcoation.

Figure 5: Resource Automation using Serverless Scheduler – A deeper look

 

EC2 and Step Functions workflow use the Lambda function (#3 in prior diagram) and the Auto Scaling group to scale out based on the number of jobs in the queue to a maximum number of workers (plus state machine), as defined in the Auto Scaling Group. When the job queue is empty, the number of running instances scale down to 0 as they finish their remaining jobs.

 

Process Submitting Jobs and Retrieving Results

  1. Seen in1, upload input file(s) and an executable file into a unique job folder in S3 (such as /year/month/day/jobid/~job-files). Upload the executable file last because it automatically starts the job. You can also use a script to upload multiple files at a time but each job will need a unique directory. There are many ways to make S3 buckets available to users including AWS Storage Gateway, AWS Transfer for SFTP, AWS DataSync, the AWS Console or any one of the AWS SDKs leveraging S3 API calls.
  2. You can monitor job status by accessing the DynamoDB table directly via the AWS Management Console or use the AWS CLI to call DynamoDB via an API call.
  3. Seen in step 5, you can retrieve result files for jobs from the same S3 directory where you left the input files. The DynamoDB table confirms when jobs are done. The SQS output queue can be used by applications that must automatically poll and retrieve results.

You no longer need to create or access compute nodes as compute resources. These automatically scale up from zero when jobs come in, and then back down to zero when jobs are finished.

 

Deployment

Read the GitHub Repo for deployment instructions. Below are CloudFormation templates to help:

AWS RegionLaunch Stack
eu-north-1link to zone
ap-south-1
eu-west-3
eu-west-2
eu-west-1
ap-northeast-3
ap-northeast-2
ap-northeast-1
sa-east-1
ca-central-1
ap-southeast-1
ap-southeast-2
eu-central-1
us-east-1
us-east-2
us-west-1
us-west-2

 

 

Additional Points on Usage Patterns

 

  • While the two solutions in this blog are aimed at HPC applications, they can be used to run any batch jobs. Many customers that run large data processing batch jobs in their data lakes could use the serverless scheduler.

 

  • You can build pipelines of different applications when the output of one job triggers another to do something else – an example being pre-processing, meshing, simulation, post-processing. You simply deploy the Resource Automation template several times, and tailor it so that the output bucket for one step is the input bucket for the next step.

 

  • You might look to use the “job_success_string” parameter for iteration/verification used in cases where a shot-gun approach is needed to run thousands of jobs, and only one has a chance of producing the right result. In this case the “job_success_string” would identify the successful job from potentially hundreds of thousands pushed to SQS job queue.

 

Scale-out across teams and departments

Because all services used are serverless, you can deploy as many run environments as needed without increasing overall costs. Serverless workloads only accumulate cost when the services are used. So, you could deploy ten job environments and run one job in each, and your costs would be the same if you had one job environment running ten jobs.

 

All you need is an S3 bucket to upload jobs to and an associated AMI that has the right applications and license configuration. Because a job configuration is passed to the scheduler at each job start, you can add new teams by creating an S3 bucket and pointing S3 events to a default Lambda function that pulls configurations for each job start.

 

Setup CI/CD pipeline to start continuous improvement of scheduler

If you are advanced, we encourage you to clone the git repo and customize this solution. The serverless scheduler is less complex than other schedulers, because you only think about one worker and the process of one job’s run.

Ways you could tailor this solution:

  • Add intelligent job scheduling using AWS Sagemaker  – It is hard to find data as ready for ML as log data because every job you run has different run times and resource consumption. So, you could tailor this solution to predict the best instance to use with ML when workloads are submitted.
  • Add Custom Licensing Checkout Logic – Simply add one Lambda function to your Step Functions workflow to make an API call a license server before continuing with one or more jobs. You can start a new worker when you have a license checked out or if a license is not available then the instance can terminate to remove any costs waiting for licenses.
  • Add Custom Metrics to DynamoDB – You can easily add metrics to DynamoDB because the solution already has baseline logging and monitoring capabilities.
  • Run on other AWS Services – There is a Lambda function in the Step Functions workflow called “Start_Job”. You can tailor this Lambda to run your jobs on AWS Sagemaker, AWS EMR, AWS EKS or AWS ECS instead of EC2.

 

Conclusion

 

Although HPC workloads and EDA flows may still be dependent on current scheduling technologies, we illustrated the possibilities of decoupling your workloads from your existing shared scheduling environments. This post went deep into decoupled serverless scheduling, and we understand that it is difficult to unwind decades of dependencies. However, leveraging numerous AWS Services encourages you to think completely differently about running workloads.

But more importantly, it encourages you to Think Big. With this solution you can get up and running quickly, fail fast, and iterate. You can do this while scaling to your required number of resources, when you want them, and only pay for what you use.

Serverless computing  catalyzes change across all industries, but that change is not obvious in the HPC and EDA industries. This solution is an opportunity for customers to take advantage of the nearly limitless capacity that AWS.

Please reach out with questions about HPC and EDA on AWS. You now have the architecture and the instructions to build your Serverless Decoupled Scheduling environment.  Go build!


About the Authors and Contributors

Authors 

 

Ludvig Nordstrom is a Senior Solutions Architect at AWS

 

 

 

 

Mark Duffield is a Tech Lead in Semiconductors at AWS

 

 

 

Contributors

 

Steve Engledow is a Senior Solutions Builder at AWS

 

 

 

 

Arun Thomas is a Senior Solutions Builder at AWS

 

 

Building a tightly coupled molecular dynamics workflow with multi-node parallel jobs in AWS Batch

Post Syndicated from Josh Rad original https://aws.amazon.com/blogs/compute/building-a-tightly-coupled-molecular-dynamics-workflow-with-multi-node-parallel-jobs-in-aws-batch/

Contributed by Amr Ragab, HPC Application Consultant, AWS Professional Services and Aswin Damodar, Senior Software Development Engineer, AWS Batch

At Supercomputing 2018 in Dallas, TX, AWS announced AWS Batch support for running tightly coupled workloads in a multi-node parallel jobs environment. This AWS Batch feature enables applications that require strong scaling for efficient computational workloads.

Some of the more popular workloads that can take advantage of this feature enhancement include computational fluid dynamics (CFD) codes such as OpenFoam, Fluent, and ANSYS. Other workloads include molecular dynamics (MD) applications such as AMBER, GROMACS, NAMD.

Running tightly coupled, distributed, deep learning frameworks is also now possible on AWS Batch. Applications that can take advantage include TensorFlow, MXNet, Pytorch, and Chainer. Essentially, any application scaling that benefits from tightly coupled–based scalability can now be integrated into AWS Batch.

In this post, we show you how to build a workflow executing an MD simulation using GROMACS running on GPUs, using the p3 instance family.

AWS Batch overview

AWS Batch is a service providing managed planning, scheduling, and execution of containerized workloads on AWS. Purpose-built for scalable compute workloads, AWS Batch is ideal for high throughput, distributed computing jobs such as video and image encoding, loosely coupled numerical calculations, and multistep computational workflows.

If you are new to AWS Batch, consider gaining familiarity with the service by following the tutorial in the Creating a Simple “Fetch & Run” AWS Batch Job post.

Prerequisites

You need an AWS account to go through this walkthrough. Other prerequisites include:

  • Launch an ECS instance, p3.2xlarge with a NVIDIA Tesla V100 backend. Use the Amazon Linux 2 AMIs for ECS.
  • In the ECS instance, install the latest CUDA 10 stack, which provides the toolchain and compilation libraries as well as the NVIDIA driver.
  • Install nvidia-docker2.
  • In your /etc/docker/daemon.json file, ensure that the default-runtime value is set to nvidia.
{
    "runtimes": {
        "nvidia": {
            "path": "nvidia-container-runtime",
            "runtimeArgs": []
        }
    },
    "default-runtime": "nvidia"
}
  • Finally, save the EC2 instance as an AMI in your account. Copy the AMI ID, as you need it later in the post.

Deploying the workload

In a production environment, it’s important to efficiently execute the compute workload with multi-node parallel jobs. Most of the optimization is on the application layer and how efficiently the Message Passing Interface (MPI) ranks (MPI and OpenMP threads) are distributed across nodes. Application-level optimization is out of scope for this post, but should be considered when running in production.

One of the key requirements for running on AWS Batch is a Dockerized image with the application, libraries, scripts, and code. For multi-node parallel jobs, you need an MPI stack for the tightly coupled communication layer and a wrapper script for the MPI orchestration. The running child Docker containers need to pass container IP address information to the master node to fill out the MPI host file.

The undifferentiated heavy lifting that AWS Batch provides is the Docker-to-Docker communication across nodes using Amazon ECS task networking. With multi-node parallel jobs, the ECS container receives environmental variables from the backend, which can be used to establish which running container is the master and which is the child.

  • AWS_BATCH_JOB_MAIN_NODE_INDEX—The designation of the master node in a multi-node parallel job. This is the main node in which the MPI job is launched.
  • AWS_BATCH_JOB_MAIN_NODE_PRIVATE_IPV4_ADDRESS—The IPv4 address of the main node. This is presented in the environment for all children nodes.
  • AWS_BATCH_JOB_NODE_INDEX—The designation of the node index.
  • AWS_BATCH_JOB_NUM_NODES – The number of nodes launched as part of the node group for your multi-node parallel job.

If AWS_BATCH_JOB_MAIN_NODE_INDEX = AWS_BATCH_JOB_NODE_INDEX, then this is the main node. The following code block is an example MPI synchronization script that you can include as part of the CMD structure of the Docker container. Save the following code as mpi-run.sh.

#!/bin/bash

cd $JOB_DIR

PATH="$PATH:/opt/openmpi/bin/"
BASENAME="${0##*/}"
log () {
  echo "${BASENAME} - ${1}"
}
HOST_FILE_PATH="/tmp/hostfile"
AWS_BATCH_EXIT_CODE_FILE="/tmp/batch-exit-code"

aws s3 cp $S3_INPUT $SCRATCH_DIR
tar -xvf $SCRATCH_DIR/*.tar.gz -C $SCRATCH_DIR

sleep 2

usage () {
  if [ "${#@}" -ne 0 ]; then
    log "* ${*}"
    log
  fi
  cat <<ENDUSAGE
Usage:
export AWS_BATCH_JOB_NODE_INDEX=0
export AWS_BATCH_JOB_NUM_NODES=10
export AWS_BATCH_JOB_MAIN_NODE_INDEX=0
export AWS_BATCH_JOB_ID=string
./mpi-run.sh
ENDUSAGE

  error_exit
}

# Standard function to print an error and exit with a failing return code
error_exit () {
  log "${BASENAME} - ${1}" >&2
  log "${2:-1}" > $AWS_BATCH_EXIT_CODE_FILE
  kill  $(cat /tmp/supervisord.pid)
}

# Set child by default switch to main if on main node container
NODE_TYPE="child"
if [ "${AWS_BATCH_JOB_MAIN_NODE_INDEX}" == 
"${AWS_BATCH_JOB_NODE_INDEX}" ]; then
  log "Running synchronize as the main node"
  NODE_TYPE="main"
fi


# wait for all nodes to report
wait_for_nodes () {
  log "Running as master node"

  touch $HOST_FILE_PATH
  ip=$(/sbin/ip -o -4 addr list eth0 | awk '{print $4}' | cut -d/ -f1)
  
  if [ -x "$(command -v nvidia-smi)" ] ; then
      NUM_GPUS=$(ls -l /dev/nvidia[0-9] | wc -l)
      availablecores=$NUM_GPUS
  else
      availablecores=$(nproc)
  fi

  log "master details -> $ip:$availablecores"
  echo "$ip slots=$availablecores" >> $HOST_FILE_PATH

  lines=$(uniq $HOST_FILE_PATH|wc -l)
  while [ "$AWS_BATCH_JOB_NUM_NODES" -gt "$lines" ]
  do
    log "$lines out of $AWS_BATCH_JOB_NUM_NODES nodes joined, check again in 1 second"
    sleep 1
    lines=$(uniq $HOST_FILE_PATH|wc -l)
  done
  # Make the temporary file executable and run it with any given arguments
  log "All nodes successfully joined"
  
  # remove duplicates if there are any.
  awk '!a[$0]++' $HOST_FILE_PATH > ${HOST_FILE_PATH}-
deduped
  cat $HOST_FILE_PATH-deduped
  log "executing main MPIRUN workflow"

  cd $SCRATCH_DIR
  . /opt/gromacs/bin/GMXRC
  /opt/openmpi/bin/mpirun --mca btl_tcp_if_include eth0 \
                          -x PATH -x LD_LIBRARY_PATH -x 
GROMACS_DIR -x GMXBIN -x GMXMAN -x GMXDATA \
                          --allow-run-as-root --machinefile 
${HOST_FILE_PATH}-deduped \
                          $GMX_COMMAND
  sleep 2

  tar -czvf $JOB_DIR/batch_output_$AWS_BATCH_JOB_ID.tar.gz 
$SCRATCH_DIR/*
  aws s3 cp $JOB_DIR/batch_output_$AWS_BATCH_JOB_ID.tar.gz 
$S3_OUTPUT

  log "done! goodbye, writing exit code to 
$AWS_BATCH_EXIT_CODE_FILE and shutting down my supervisord"
  echo "0" > $AWS_BATCH_EXIT_CODE_FILE
  kill  $(cat /tmp/supervisord.pid)
  exit 0
}


# Fetch and run a script
report_to_master () {
  # get own ip and num cpus
  #
  ip=$(/sbin/ip -o -4 addr list eth0 | awk '{print $4}' | cut -d/ -f1)

  if [ -x "$(command -v nvidia-smi)" ] ; then
      NUM_GPUS=$(ls -l /dev/nvidia[0-9] | wc -l)
      availablecores=$NUM_GPUS
  else
      availablecores=$(nproc)
  fi

  log "I am a child node -> $ip:$availablecores, reporting to the master node -> 
${AWS_BATCH_JOB_MAIN_NODE_PRIVATE_IPV4_ADDRESS}"
  until echo "$ip slots=$availablecores" | ssh 
${AWS_BATCH_JOB_MAIN_NODE_PRIVATE_IPV4_ADDRESS} "cat >> 
/$HOST_FILE_PATH"
  do
    echo "Sleeping 5 seconds and trying again"
  done
  log "done! goodbye"
  exit 0
  }


# Main - dispatch user request to appropriate function
log $NODE_TYPE
case $NODE_TYPE in
  main)
    wait_for_nodes "${@}"
    ;;

  child)
    report_to_master "${@}"
    ;;

  *)
    log $NODE_TYPE
    usage "Could not determine node type. Expected (main/child)"
    ;;
esac

The synchronization script supports downloading the assets from Amazon S3 as well as preparing the MPI host file based on GPU scheduling for GROMACS.

Furthermore, the mpirun stanza is captured in this script. This script can be a template for several multi-node parallel job applications by just changing a few lines.  These lines are essentially the GROMACS-specific steps:

. /opt/gromacs/bin/GMXRC
export OMP_NUM_THREADS=$OMP_THREADS
/opt/openmpi/bin/mpirun -np $MPI_THREADS --mca btl_tcp_if_include eth0 \
-x OMP_NUM_THREADS -x PATH -x LD_LIBRARY_PATH -x GROMACS_DIR -x GMXBIN -x GMXMAN -x GMXDATA \
--allow-run-as-root --machinefile ${HOST_FILE_PATH}-deduped \
$GMX_COMMAND

In your development environment for building Docker images, create a Dockerfile that prepares the software stack for running GROMACS. The key elements of the Dockerfile are:

  1. Set up a passwordless-ssh keygen.
  2. Download, and compile OpenMPI. In this Dockerfile, you are downloading the recently released OpenMPI 4.0.0 source and compiling on a NVIDIA Tesla V100 GPU-backed instance (p3.2xlarge).
  3. Download and compile GROMACS.
  4. Set up supervisor to run SSH at Docker container startup as well as processing the mpi-run.sh script as the CMD.

Save the following script as a Dockerfile:

FROM nvidia/cuda:latest

ENV USER root

# -------------------------------------------------------------------------------------
# install needed software -
# openssh
# mpi
# awscli
# supervisor
# -------------------------------------------------------------------------------------

RUN apt update
RUN DEBIAN_FRONTEND=noninteractive apt install -y iproute2 cmake openssh-server openssh-client python python-pip build-essential gfortran wget curl
RUN pip install supervisor awscli

RUN mkdir -p /var/run/sshd
ENV DEBIAN_FRONTEND noninteractive

ENV NOTVISIBLE "in users profile"

#####################################################
## SSH SETUP

RUN sed -i 's/PermitRootLogin prohibit-password/PermitRootLogin yes/' /etc/ssh/sshd_config
RUN sed '[email protected]\s*required\s*[email protected] optional [email protected]' -i /etc/pam.d/sshd
RUN echo "export VISIBLE=now" >> /etc/profile

RUN echo "${USER} ALL=(ALL) NOPASSWD:ALL" >> /etc/sudoers
ENV SSHDIR /root/.ssh
RUN mkdir -p ${SSHDIR}
RUN touch ${SSHDIR}/sshd_config
RUN ssh-keygen -t rsa -f ${SSHDIR}/ssh_host_rsa_key -N ''
RUN cp ${SSHDIR}/ssh_host_rsa_key.pub ${SSHDIR}/authorized_keys
RUN cp ${SSHDIR}/ssh_host_rsa_key ${SSHDIR}/id_rsa
RUN echo " IdentityFile ${SSHDIR}/id_rsa" >> /etc/ssh/ssh_config
RUN echo "Host *" >> /etc/ssh/ssh_config && echo " StrictHostKeyChecking no" >> /etc/ssh/ssh_config
RUN chmod -R 600 ${SSHDIR}/* && \
chown -R ${USER}:${USER} ${SSHDIR}/
# check if ssh agent is running or not, if not, run
RUN eval `ssh-agent -s` && ssh-add ${SSHDIR}/id_rsa

##################################################
## S3 OPTIMIZATION

RUN aws configure set default.s3.max_concurrent_requests 30
RUN aws configure set default.s3.max_queue_size 10000
RUN aws configure set default.s3.multipart_threshold 64MB
RUN aws configure set default.s3.multipart_chunksize 16MB
RUN aws configure set default.s3.max_bandwidth 4096MB/s
RUN aws configure set default.s3.addressing_style path

##################################################
## CUDA MPI

RUN wget -O /tmp/openmpi.tar.gz https://download.open-mpi.org/release/open-mpi/v4.0/openmpi-4.0.0.tar.gz && \
tar -xvf /tmp/openmpi.tar.gz -C /tmp
RUN cd /tmp/openmpi* && ./configure --prefix=/opt/openmpi --with-cuda --enable-mpirun-prefix-by-default && \
make -j $(nproc) && make install
RUN echo "export PATH=$PATH:/opt/openmpi/bin" >> /etc/profile
RUN echo "export LD_LIBRARY_PATH=$LD_LIRBARY_PATH:/opt/openmpi/lib:/usr/local/cuda/include:/usr/local/cuda/lib64" >> /etc/profile

###################################################
## GROMACS 2018 INSTALL

ENV PATH $PATH:/opt/openmpi/bin
ENV LD_LIBRARY_PATH $LD_LIRBARY_PATH:/opt/openmpi/lib:/usr/local/cuda/include:/usr/local/cuda/lib64
RUN wget -O /tmp/gromacs.tar.gz http://ftp.gromacs.org/pub/gromacs/gromacs-2018.4.tar.gz && \
tar -xvf /tmp/gromacs.tar.gz -C /tmp
RUN cd /tmp/gromacs* && mkdir build
RUN cd /tmp/gromacs*/build && \
cmake .. -DGMX_MPI=on -DGMX_THREAD_MPI=ON -DGMX_GPU=ON -DCUDA_TOOLKIT_ROOT_DIR=/usr/local/cuda -DGMX_BUILD_OWN_FFTW=ON -DCMAKE_INSTALL_PREFIX=/opt/gromacs && \
make -j $(nproc) && make install
RUN echo "source /opt/gromacs/bin/GMXRC" >> /etc/profile

###################################################
## supervisor container startup

ADD conf/supervisord/supervisord.conf /etc/supervisor/supervisord.conf
ADD supervised-scripts/mpi-run.sh supervised-scripts/mpi-run.sh
RUN chmod 755 supervised-scripts/mpi-run.sh

EXPOSE 22
RUN export PATH="$PATH:/opt/openmpi/bin"
ADD batch-runtime-scripts/entry-point.sh batch-runtime-scripts/entry-point.sh
RUN chmod 0755 batch-runtime-scripts/entry-point.sh

CMD /batch-runtime-scripts/entry-point.sh

After the container is built, push the image to your Amazon ECR repository and note the container image URI for later steps.

Set up GROMACS

For the input files, use the Chalcone Synthase (1CGZ) example, from RCSB.org. For this post, just run a simple simulation following the Lysozyme in Water GROMACS tutorial.

Execute the production MD run before the analysis (that is, after the system is solvated, neutralized, and equilibrated), so you can show that the longest part of the simulation can be achieved in a containizered workflow.

It is possible from the tutorial to run the entire workflow from PDB preparation to solvation, and energy minimization and analysis in AWS Batch.

Set up the compute environment

For the purpose of running the MD simulation in this test case, use two p3.2xlarge instances. Each instance provides one NVIDIA Tesla V100 GPU for which GROMACS distributes the job. You don’t have to launch specific instance types. With the p3 family, the MPI-wrapper can concomitantly modify the MPI ranks to accommodate the current GPU and node topology.

When the job is executed, instantiate two MPI processes with two OpenMP threads per MPI process. For this post, launch EC2 OnDemand, using the Amazon Linux AMIs we can take advantage of per-second billing.

Under Create Compute Environment, choose a managed compute environment and provide a name, such as gromacs-gpu-ce. Attach two roles:

  • AWSBatchServiceRole—Allows AWS Batch to make EC2 calls on your behalf.
  • ecsInstanceRole—Allows the underlying instance to make AWS API calls.

In the next panel, specify the following field values:

  • Provisioning model: EC2
  • Allowed instance types: p3 family
  • Minimum vCPUs: 0
  • Desired vCPUs: 0
  • Maximum vCPUs: 128

For Enable user-specified Ami ID and enter the AMI that you created earlier.

Finally, for the compute environment, specify the network VPC and subnets for launching the instances, as well as a security group. We recommend specifying a placement group for tightly coupled workloads for better performance. You can also create EC2 tags for the launch instances. We used name=gromacs-gpu-processor.

Next, choose Job Queues and create a gromacs-queue queue coupled with the compute environment created earlier. Set the priority to 1 and select Enable job queue.

Set up the job definition

In the job definition setup, you create a two-node group, where each node pulls the gromacs_mpi image. Because you are using the p3.2xlarge instance providing one V100 GPU per instance, your vCPU slots = 8 for scheduling purposes.

{
    "jobDefinitionName": "gromacs-jobdef",
    "jobDefinitionArn": "arn:aws:batch:us-east-2:<accountid>:job-definition/gromacs-jobdef:1",
    "revision": 6,
    "status": "ACTIVE",
    "type": "multinode",
    "parameters": {},
    "nodeProperties": {
        "numNodes": 2,
        "mainNode": 0,
        "nodeRangeProperties": [
            {
                "targetNodes": "0:1",
                "container": {
                    "image": "<accountid>.dkr.ecr.us-east-2.amazonaws.com/gromacs_mpi:latest",
                    "vcpus": 8,
                    "memory": 24000,
                    "command": [],
                    "jobRoleArn": "arn:aws:iam::<accountid>:role/ecsTaskExecutionRole",
                    "volumes": [
                        {
                            "host": {
                                "sourcePath": "/scratch"
                            },
                            "name": "scratch"
                        },
                        {
                            "host": {
                                "sourcePath": "/efs"
                            },
                            "name": "efs"
                        }
                    ],
                    "environment": [
                        {
                            "name": "SCRATCH_DIR",
                            "value": "/scratch"
                        },
                        {
                            "name": "JOB_DIR",
                            "value": "/efs"
                        },
                        {
                            "name": "GMX_COMMAND",
                            "value": "gmx_mpi mdrun -deffnm md_0_1 -nb gpu -ntomp 1"
                        },
                        {
                            "name": "OMP_THREADS",
                            "value": "2"
                        },
                        {
                            “name”: “MPI_THREADS”,
                            “value”: “1”
                        },
                        {
                            "name": "S3_INPUT",
                            "value": "s3://ragab-md/1CGZ.tar.gz"
                        },
                        {
                            "name": "S3_OUTPUT",
                            "value": "s3://ragab-md"
                        }
                    ],
                    "mountPoints": [
                        {
                            "containerPath": "/scratch",
                            "sourceVolume": "scratch"
                        },
                        {
                            "containerPath": "/efs",
                            "sourceVolume": "efs"
                        }
                    ],
                    "ulimits": [],
                    "instanceType": "p3.2xlarge"
                }
            }
        ]
    }
}

Submit the GROMACS job

In the AWS Batch job submission portal, provide a job name and select the job definition created earlier as well as the job queue. Ensure that the vCPU value is set to 8 and the Memory (MiB) value is 24000.

Under Environmental Variables, within in each node group, ensure that the keys are set correctly as follows.

KeyValue
SCRATCH_DIR/scratch
JOB_DIR/efs
OMP_THREADS2
GMX_COMMANDgmx_mpi mdrun -deffnm md_0_1 -nb gpu
MPI_THREADS2
S3_INPUTs3://<your input>
S3_OUTPUTs3://<your output>

Submit the job and wait for it to enter into the RUNNING state. After the job is in the RUNNING state, select the job ID and choose Nodes.

The containers listed each write to a separate Amazon CloudWatch log stream where you can monitor the progress.

After the job is completed the entire working directory is compressed and uploaded to S3, the trajectories (*.xtc) and input .gro files can be viewed in your favorite MD analysis package. For more information about preparing a desktop, see Deploying a 4x4K, GPU-backed Linux desktop instance on AWS.

You can view the trajectories in PyMOL as well as running any subsequent trajectory analysis.

Extending the solution

As we mentioned earlier, you can take this core workload and extend it as part of a job execution chain in a workflow. Native support for job dependencies exists in AWS Batch and alternatively in AWS Step Functions. With Step Functions, you can create a decision-based workflow tree to run the preparation, solvation, energy minimization, equilibration, production MD, and analysis.

Conclusion

In this post, we showed that tightly coupled, scalable MD simulations can be executed using the recently released multi-node parallel jobs feature for AWS Batch. You finally have an end-to-end solution for distributed and MPI-based workloads.

As we mentioned earlier, many other applications can also take advantage of this feature set. We invite you to try this out and let us know how it goes.

Want to discuss how your tightly coupled workloads can benefit on AWS Batch? Contact AWS.

Deploying a Burstable and Event-driven HPC Cluster on AWS Using SLURM, Part 1

Post Syndicated from Geoff Murase original https://aws.amazon.com/blogs/compute/deploying-a-burstable-and-event-driven-hpc-cluster-on-aws-using-slurm-part-1/

Contributed by Amr Ragab, HPC Application Consultant, AWS Professional Services

When you execute high performance computing (HPC) workflows on AWS, you can take advantage of the elasticity and concomitant scale associated with recruiting resources for your computational workloads. AWS offers a variety of services, solutions, and open source tools to deploy, manage, and dynamically destroy compute resources for running various types of HPC workloads.

Best practices in deploying HPC resources on AWS include creating much of the infrastructure on-demand, and making it as ephemeral and dynamic as possible. Traditional HPC clusters use a resource scheduler that maintains a set of computational resources and distributes those resources over a collection of queued jobs.

With a central resource scheduler, all users have a single point of entry to a broad range of compute. Traditionally, many of these schedulers managed on-premises systems. They weren’t offered dynamically as much as cloud-based HPC clusters, and they usually only needed to manage a largely static set of resources.

However, many schedulers now support the ability to burst into AWS or manage a dynamically changing HPC environment through plugins, connectors, and custom scripting. Some of the more common resource schedulers include:

SLURM

Simple Linux Resource Manager (SLURM) by SchedMD is one such popular scheduler. Using a derivative of SLURM’s elastic power plugin, you can coordinate the launch of a set of compute nodes with the appropriate CPU/disk/GPU/network topologies. You standup the compute resources for the job, instead of trying to fit a job within a set of pre-existing compute topologies.

We have recently released an example implementation of the SLURM bursting capability in the AWS Samples GitHub repo.

The following diagram shows the reference architecture.

Deployment

Download the aws-plugin-for-slurm directory locally. Use the following AWS CLI commands to sync the directory with an S3 bucket to be referenced later. For more detailed instructions on the deployment follow the README within the GitHub.

git clone https://github.com/aws-samples/aws-plugin-for-slurm.git 
aws s3 sync aws-plugin-for-slurm/ s3://<bucket-name>

Included is a CloudFormation script, which you use to stand up the VPC and subnets, as well as the headnode. In AWS CloudFormation, choose Create Stack and import the slurm_headnode-clouformation.yml script.

The CloudFormation script lays down the landing zone with the appropriate network topology. The headnode is based on the publicly available CentOS 7.5 available in the AWS Marketplace. The latest security packages are installed with the dependencies needed to install SLURM.

I have found that scheduling performance is best if the source is compiled at runtime, which the CloudFormation script takes care of. The script sets up the headnode as a single controller. However, with minor modifications, it can be set up in a highly available manner with a backup SLURM controller.

After the deployment moves to CREATE_COMPLETEstatus in CloudFormation, use SSH to connect to the slurm headnode:

ssh -i <path/to/private/key.pem> [email protected]<public-ip-address>

Create a new sbatch job submission file by running the following commands in the vi/nano text editor:

#!/bin/bash
#SBATCH —nodes=2
#SBATCH —ntasks-per-node=1
#SBATCH —cpus-per-task=4
#SBATCH —constraint=[us-west-2a]

env
sleep 60

This job submission script requests two nodes to be allocated, running one task per node and using four CPUs. The constraint is optional but allows SLURM to allocate the job among the available zones.

The elasticity of the cluster comes in setting the slurm.conf parameters SuspendProgram and ResumeProgram in the slurm.conf file.

SuspendTime=60
ResumeTimeout=250
TreeWidth=60000
SuspendExcNodes=ip-10-0-0-251
SuspendProgram=/nfs/slurm/bin/slurm-aws-shutdown.sh
ResumeProgram=/nfs/slurm/bin/slurm-aws-startup.sh
ResumeRate=0
SuspendRate=0

You can set the responsiveness of the scaling on AWS by modifying SuspendTime. Do not set a value for ResumeRateor SuspendRate, as the underlying SuspendProgram and ResumeProgramscripts have API calls that impose their own rate limits. If you find that your API call rate limit is reached at scale (approximately 1000 nodes/sec), you can set ResumeRate and SuspendRate accordingly.

If you are familiar with SLURM’s configuration options, you can make further modifications by editing the /nfs/slurm/etc/slurm.conf.d/slurm_nodes.conffile. That file contains the node definitions, with some minor modifications. You can schedule GPU-based workloads separate from CPU to instantiate a heterogeneous cluster layout. You also get more flexibility running tightly coupled workloads alongside loosely coupled jobs, as well as job array support. For additional commands administrating the SLURM cluster, see the SchedMD SLURM documentation.

The initial state of the cluster shows that no compute resources are available. Run the sinfocommand:

[[email protected] ~]$ sinfo
PARTITION AVAIL TIMELIMIT NODES STATE NODELIST
all*         up  infinite     0   n/a 
gpu          up  infinite     0   n/a 
[[email protected] ~]$ 

The job described earlier is submitted with the sbatchcommand:

sbatch test.sbatch

The power plugin allocates the requested number of nodes based on your job definition and runs the Amazon EC2 API operations to request those instances.

[[email protected] ~]$ sinfo
PARTITION AVAIL TIMELIMIT NODES STATE NODELIST
all*      up    infinite      2 alloc# ip-10-0-1-[6-7]
gpu       up    infinite      2 alloc# ip-10-0-1-[6-7]
[[email protected] ~]$ 

The log file is located at /var/log/power_save.log.

Wed Sep 12 18:37:45 UTC 2018 Resume invoked 
/nfs/slurm/bin/slurm-aws-startup.sh ip-10-0-1-[6-7]

After the request job is complete, the compute nodes remain idle for the duration of the SuspendTime=60value in the slurm.conffile.

[[email protected] ~]$ sinfo
PARTITION AVAIL TIMELIMIT NODES STATE NODELIST
all*         up  infinite     2  idle ip-10-0-1-[6-7]
gpu          up  infinite     2  idle ip-10-0-1-[6-7]
[[email protected] ~]$ 

Ideally, ensure that other queued jobs have an opportunity to run on the current infrastructure, assuming that the job requirements are fulfilled by the compute nodes.

If the job requirements are not fulfilled and there are no more jobs in the queue, the aws-slurm shutdown script takes over and terminates the instance. That’s one of the benefits of an elastic cluster.

Wed Sep 12 18:42:38 UTC 2018 Suspend invoked /nfs/slurm/bin/slurm-aws-shutdown.sh ip-10-0-1-[6-7]
{
    "TerminatingInstances": [
        {
            "InstanceId": "i-0b4c6ec4945afe52e", 
            "CurrentState": {
                "Code": 32, 
                "Name": "shutting-down"
            }, 
            "PreviousState": {
                "Code": 16, 
                "Name": "running"
            }
        }
    ]
}
{
    "TerminatingInstances": [
        {
            "InstanceId": "i-0f3139a52f2602c60", 
            "CurrentState": {
                "Code": 32, 
                "Name": "shutting-down"
            }, 
            "PreviousState": {
                "Code": 16, 
                "Name": "running"
            }
        }
    ]
}

The SLURM elastic compute plugin provisions the compute resources based on the scheduler queue load. In this example implementation, you are distributing a set of compute nodes to take advantage of scale and capacity across all Availability Zones within an AWS Region.

With a minor modification on the VPC layer, you can use this same plugin to stand up compute resources across multiple Regions. With this implementation, you can truly take advantage of a global HPC footprint.

“Imagine creating your own custom cluster mix of GPUs, CPUs, storage, memory, and networking – just the way you want it, then running your experiment, getting the results, and then tearing it all down.” — InsideHPC. Innovation Unbound: What Would You do with a Million Cores?

Part 2

In Part 2 of this post series, you integrate this cluster with AWS native services to handle scalable job submission and monitoring using Amazon API Gateway, AWS Lambda, and Amazon CloudWatch.

EC2 Instance Update – C5 Instances with Local NVMe Storage (C5d)

Post Syndicated from Jeff Barr original https://aws.amazon.com/blogs/aws/ec2-instance-update-c5-instances-with-local-nvme-storage-c5d/

As you can see from my EC2 Instance History post, we add new instance types on a regular and frequent basis. Driven by increasingly powerful processors and designed to address an ever-widening set of use cases, the size and diversity of this list reflects the equally diverse group of EC2 customers!

Near the bottom of that list you will find the new compute-intensive C5 instances. With a 25% to 50% improvement in price-performance over the C4 instances, the C5 instances are designed for applications like batch and log processing, distributed and or real-time analytics, high-performance computing (HPC), ad serving, highly scalable multiplayer gaming, and video encoding. Some of these applications can benefit from access to high-speed, ultra-low latency local storage. For example, video encoding, image manipulation, and other forms of media processing often necessitates large amounts of I/O to temporary storage. While the input and output files are valuable assets and are typically stored as Amazon Simple Storage Service (S3) objects, the intermediate files are expendable. Similarly, batch and log processing runs in a race-to-idle model, flushing volatile data to disk as fast as possible in order to make full use of compute resources.

New C5d Instances with Local Storage
In order to meet this need, we are introducing C5 instances equipped with local NVMe storage. Available for immediate use in 5 regions, these instances are a great fit for the applications that I described above, as well as others that you will undoubtedly dream up! Here are the specs:

Instance NamevCPUsRAMLocal StorageEBS BandwidthNetwork Bandwidth
c5d.large24 GiB1 x 50 GB NVMe SSDUp to 2.25 GbpsUp to 10 Gbps
c5d.xlarge48 GiB1 x 100 GB NVMe SSDUp to 2.25 GbpsUp to 10 Gbps
c5d.2xlarge816 GiB1 x 225 GB NVMe SSDUp to 2.25 GbpsUp to 10 Gbps
c5d.4xlarge1632 GiB1 x 450 GB NVMe SSD2.25 GbpsUp to 10 Gbps
c5d.9xlarge3672 GiB1 x 900 GB NVMe SSD4.5 Gbps10 Gbps
c5d.18xlarge72144 GiB2 x 900 GB NVMe SSD9 Gbps25 Gbps

Other than the addition of local storage, the C5 and C5d share the same specs. Both are powered by 3.0 GHz Intel Xeon Platinum 8000-series processors, optimized for EC2 and with full control over C-states on the two largest sizes, giving you the ability to run two cores at up to 3.5 GHz using Intel Turbo Boost Technology.

You can use any AMI that includes drivers for the Elastic Network Adapter (ENA) and NVMe; this includes the latest Amazon Linux, Microsoft Windows (Server 2008 R2, Server 2012, Server 2012 R2 and Server 2016), Ubuntu, RHEL, SUSE, and CentOS AMIs.

Here are a couple of things to keep in mind about the local NVMe storage:

Naming – You don’t have to specify a block device mapping in your AMI or during the instance launch; the local storage will show up as one or more devices (/dev/nvme*1 on Linux) after the guest operating system has booted.

Encryption – Each local NVMe device is hardware encrypted using the XTS-AES-256 block cipher and a unique key. Each key is destroyed when the instance is stopped or terminated.

Lifetime – Local NVMe devices have the same lifetime as the instance they are attached to, and do not stick around after the instance has been stopped or terminated.

Available Now
C5d instances are available in On-Demand, Reserved Instance, and Spot form in the US East (N. Virginia), US West (Oregon), EU (Ireland), US East (Ohio), and Canada (Central) Regions. Prices vary by Region, and are just a bit higher than for the equivalent C5 instances.

Jeff;

PS – We will be adding local NVMe storage to other EC2 instance types in the months to come, so stay tuned!

AWS Online Tech Talks – May and Early June 2018

Post Syndicated from Devin Watson original https://aws.amazon.com/blogs/aws/aws-online-tech-talks-may-and-early-june-2018/

AWS Online Tech Talks – May and Early June 2018  

Join us this month to learn about some of the exciting new services and solution best practices at AWS. We also have our first re:Invent 2018 webinar series, “How to re:Invent”. Sign up now to learn more, we look forward to seeing you.

Note – All sessions are free and in Pacific Time.

Tech talks featured this month:

Analytics & Big Data

May 21, 2018 | 11:00 AM – 11:45 AM PT Integrating Amazon Elasticsearch with your DevOps Tooling – Learn how you can easily integrate Amazon Elasticsearch Service into your DevOps tooling and gain valuable insight from your log data.

May 23, 2018 | 11:00 AM – 11:45 AM PTData Warehousing and Data Lake Analytics, Together – Learn how to query data across your data warehouse and data lake without moving data.

May 24, 2018 | 11:00 AM – 11:45 AM PTData Transformation Patterns in AWS – Discover how to perform common data transformations on the AWS Data Lake.

Compute

May 29, 2018 | 01:00 PM – 01:45 PM PT – Creating and Managing a WordPress Website with Amazon Lightsail – Learn about Amazon Lightsail and how you can create, run and manage your WordPress websites with Amazon’s simple compute platform.

May 30, 2018 | 01:00 PM – 01:45 PM PTAccelerating Life Sciences with HPC on AWS – Learn how you can accelerate your Life Sciences research workloads by harnessing the power of high performance computing on AWS.

Containers

May 24, 2018 | 01:00 PM – 01:45 PM PT – Building Microservices with the 12 Factor App Pattern on AWS – Learn best practices for building containerized microservices on AWS, and how traditional software design patterns evolve in the context of containers.

Databases

May 21, 2018 | 01:00 PM – 01:45 PM PTHow to Migrate from Cassandra to Amazon DynamoDB – Get the benefits, best practices and guides on how to migrate your Cassandra databases to Amazon DynamoDB.

May 23, 2018 | 01:00 PM – 01:45 PM PT5 Hacks for Optimizing MySQL in the Cloud – Learn how to optimize your MySQL databases for high availability, performance, and disaster resilience using RDS.

DevOps

May 23, 2018 | 09:00 AM – 09:45 AM PT.NET Serverless Development on AWS – Learn how to build a modern serverless application in .NET Core 2.0.

Enterprise & Hybrid

May 22, 2018 | 11:00 AM – 11:45 AM PTHybrid Cloud Customer Use Cases on AWS – Learn how customers are leveraging AWS hybrid cloud capabilities to easily extend their datacenter capacity, deliver new services and applications, and ensure business continuity and disaster recovery.

IoT

May 31, 2018 | 11:00 AM – 11:45 AM PTUsing AWS IoT for Industrial Applications – Discover how you can quickly onboard your fleet of connected devices, keep them secure, and build predictive analytics with AWS IoT.

Machine Learning

May 22, 2018 | 09:00 AM – 09:45 AM PTUsing Apache Spark with Amazon SageMaker – Discover how to use Apache Spark with Amazon SageMaker for training jobs and application integration.

May 24, 2018 | 09:00 AM – 09:45 AM PTIntroducing AWS DeepLens – Learn how AWS DeepLens provides a new way for developers to learn machine learning by pairing the physical device with a broad set of tutorials, examples, source code, and integration with familiar AWS services.

Management Tools

May 21, 2018 | 09:00 AM – 09:45 AM PTGaining Better Observability of Your VMs with Amazon CloudWatch – Learn how CloudWatch Agent makes it easy for customers like Rackspace to monitor their VMs.

Mobile

May 29, 2018 | 11:00 AM – 11:45 AM PT – Deep Dive on Amazon Pinpoint Segmentation and Endpoint Management – See how segmentation and endpoint management with Amazon Pinpoint can help you target the right audience.

Networking

May 31, 2018 | 09:00 AM – 09:45 AM PTMaking Private Connectivity the New Norm via AWS PrivateLink – See how PrivateLink enables service owners to offer private endpoints to customers outside their company.

Security, Identity, & Compliance

May 30, 2018 | 09:00 AM – 09:45 AM PT – Introducing AWS Certificate Manager Private Certificate Authority (CA) – Learn how AWS Certificate Manager (ACM) Private Certificate Authority (CA), a managed private CA service, helps you easily and securely manage the lifecycle of your private certificates.

June 1, 2018 | 09:00 AM – 09:45 AM PTIntroducing AWS Firewall Manager – Centrally configure and manage AWS WAF rules across your accounts and applications.

Serverless

May 22, 2018 | 01:00 PM – 01:45 PM PTBuilding API-Driven Microservices with Amazon API Gateway – Learn how to build a secure, scalable API for your application in our tech talk about API-driven microservices.

Storage

May 30, 2018 | 11:00 AM – 11:45 AM PTAccelerate Productivity by Computing at the Edge – Learn how AWS Snowball Edge support for compute instances helps accelerate data transfers, execute custom applications, and reduce overall storage costs.

June 1, 2018 | 11:00 AM – 11:45 AM PTLearn to Build a Cloud-Scale Website Powered by Amazon EFS – Technical deep dive where you’ll learn tips and tricks for integrating WordPress, Drupal and Magento with Amazon EFS.

 

 

 

 

AWS Online Tech Talks – April & Early May 2018

Post Syndicated from Betsy Chernoff original https://aws.amazon.com/blogs/aws/aws-online-tech-talks-april-early-may-2018/

We have several upcoming tech talks in the month of April and early May. Come join us to learn about AWS services and solution offerings. We’ll have AWS experts online to help answer questions in real-time. Sign up now to learn more, we look forward to seeing you.

Note – All sessions are free and in Pacific Time.

April & early May — 2018 Schedule

Compute

April 30, 2018 | 01:00 PM – 01:45 PM PTBest Practices for Running Amazon EC2 Spot Instances with Amazon EMR (300) – Learn about the best practices for scaling big data workloads as well as process, store, and analyze big data securely and cost effectively with Amazon EMR and Amazon EC2 Spot Instances.

May 1, 2018 | 01:00 PM – 01:45 PM PTHow to Bring Microsoft Apps to AWS (300) – Learn more about how to save significant money by bringing your Microsoft workloads to AWS.

May 2, 2018 | 01:00 PM – 01:45 PM PTDeep Dive on Amazon EC2 Accelerated Computing (300) – Get a technical deep dive on how AWS’ GPU and FGPA-based compute services can help you to optimize and accelerate your ML/DL and HPC workloads in the cloud.

Containers

April 23, 2018 | 11:00 AM – 11:45 AM PTNew Features for Building Powerful Containerized Microservices on AWS (300) – Learn about how this new feature works and how you can start using it to build and run modern, containerized applications on AWS.

Databases

April 23, 2018 | 01:00 PM – 01:45 PM PTElastiCache: Deep Dive Best Practices and Usage Patterns (200) – Learn about Redis-compatible in-memory data store and cache with Amazon ElastiCache.

April 25, 2018 | 01:00 PM – 01:45 PM PTIntro to Open Source Databases on AWS (200) – Learn how to tap the benefits of open source databases on AWS without the administrative hassle.

DevOps

April 25, 2018 | 09:00 AM – 09:45 AM PTDebug your Container and Serverless Applications with AWS X-Ray in 5 Minutes (300) – Learn how AWS X-Ray makes debugging your Container and Serverless applications fun.

Enterprise & Hybrid

April 23, 2018 | 09:00 AM – 09:45 AM PTAn Overview of Best Practices of Large-Scale Migrations (300) – Learn about the tools and best practices on how to migrate to AWS at scale.

April 24, 2018 | 11:00 AM – 11:45 AM PTDeploy your Desktops and Apps on AWS (300) – Learn how to deploy your desktops and apps on AWS with Amazon WorkSpaces and Amazon AppStream 2.0

IoT

May 2, 2018 | 11:00 AM – 11:45 AM PTHow to Easily and Securely Connect Devices to AWS IoT (200) – Learn how to easily and securely connect devices to the cloud and reliably scale to billions of devices and trillions of messages with AWS IoT.

Machine Learning

April 24, 2018 | 09:00 AM – 09:45 AM PT Automate for Efficiency with Amazon Transcribe and Amazon Translate (200) – Learn how you can increase the efficiency and reach your operations with Amazon Translate and Amazon Transcribe.

April 26, 2018 | 09:00 AM – 09:45 AM PT Perform Machine Learning at the IoT Edge using AWS Greengrass and Amazon Sagemaker (200) – Learn more about developing machine learning applications for the IoT edge.

Mobile

April 30, 2018 | 11:00 AM – 11:45 AM PTOffline GraphQL Apps with AWS AppSync (300) – Come learn how to enable real-time and offline data in your applications with GraphQL using AWS AppSync.

Networking

May 2, 2018 | 09:00 AM – 09:45 AM PT Taking Serverless to the Edge (300) – Learn how to run your code closer to your end users in a serverless fashion. Also, David Von Lehman from Aerobatic will discuss how they used [email protected] to reduce latency and cloud costs for their customer’s websites.

Security, Identity & Compliance

April 30, 2018 | 09:00 AM – 09:45 AM PTAmazon GuardDuty – Let’s Attack My Account! (300) – Amazon GuardDuty Test Drive – Practical steps on generating test findings.

May 3, 2018 | 09:00 AM – 09:45 AM PTProtect Your Game Servers from DDoS Attacks (200) – Learn how to use the new AWS Shield Advanced for EC2 to protect your internet-facing game servers against network layer DDoS attacks and application layer attacks of all kinds.

Serverless

April 24, 2018 | 01:00 PM – 01:45 PM PTTips and Tricks for Building and Deploying Serverless Apps In Minutes (200) – Learn how to build and deploy apps in minutes.

Storage

May 1, 2018 | 11:00 AM – 11:45 AM PTBuilding Data Lakes That Cost Less and Deliver Results Faster (300) – Learn how Amazon S3 Select And Amazon Glacier Select increase application performance by up to 400% and reduce total cost of ownership by extending your data lake into cost-effective archive storage.

May 3, 2018 | 11:00 AM – 11:45 AM PTIntegrating On-Premises Vendors with AWS for Backup (300) – Learn how to work with AWS and technology partners to build backup & restore solutions for your on-premises, hybrid, and cloud native environments.

New Amazon EC2 Spot pricing model: Simplified purchasing without bidding and fewer interruptions

Post Syndicated from Roshni Pary original https://aws.amazon.com/blogs/compute/new-amazon-ec2-spot-pricing/

Contributed by Deepthi Chelupati and Roshni Pary

Amazon EC2 Spot Instances offer spare compute capacity in the AWS Cloud at steep discounts. Customers—including Yelp, NASA JPL, FINRA, and Autodesk—use Spot Instances to reduce costs and get faster results. Spot Instances provide acceleration, scale, and deep cost savings to big data workloads, containerized applications such as web services, test/dev, and many types of HPC and batch jobs.

At re:Invent 2017, we launched a new pricing model that simplified the Spot purchasing experience. The new model gives you predictable prices that adjust slowly over days and weeks, with typical savings of 70-90% over On-Demand. With the previous pricing model, some of you had to invest time and effort to analyze historical prices to determine your bidding strategy and maximum bid price. Not anymore.

How does the new pricing model work?

You don’t have to bid for Spot Instances in the new pricing model, and you just pay the Spot price that’s in effect for the current hour for the instances that you launch. It’s that simple. Now you can request Spot capacity just like you would request On-Demand capacity, without having to spend time analyzing market prices or setting a maximum bid price.

Previously, Spot Instances were terminated in ascending order of bids, and the Spot price was set to the highest unfulfilled bid. The market prices fluctuated frequently because of this. In the new model, the Spot prices are more predictable, updated less frequently, and are determined by supply and demand for Amazon EC2 spare capacity, not bid prices. You can find the price that’s in effect for the current hour in the EC2 console.

As you can see from the above Spot Instance Pricing History graph (available in the EC2 console under Spot Requests), Spot prices were volatile before the pricing model update. However, after the pricing model update, prices are more predictable and change less frequently.

In the new model, you still have the option to further control costs by submitting a “maximum price” that you are willing to pay in the console when you request Spot Instances:

You can also set your maximum price in EC2 RunInstances or RequestSpotFleet API calls, or in command line requests:

$ aws ec2 run-instances --instance-market-options 
'{"MarketType":"Spot", "SpotOptions": {"SpotPrice": "0.12"}}' \
    --image-id ami-1a2b3c4d --count 1 --instance-type c4.2xlarge

The default maximum price is the On-Demand price and you can continue to set a maximum Spot price of up to 10x the On-Demand price. That means, if you have been running applications on Spot Instances and use the RequestSpotInstances or RequestSpotFleet operations, you can continue to do so. The new Spot pricing model is backward compatible and you do not need to make any changes to your existing applications.

Fewer interruptions

Spot Instances receive a two-minute interruption notice when these instances are about to be reclaimed by EC2, because EC2 needs the capacity back. We have significantly reduced the interruptions with the new pricing model. Now instances are not interrupted because of higher competing bids, and you can enjoy longer workload runtimes. The typical frequency of interruption for Spot Instances in the last 30 days was less than 5% on average.

To reduce the impact of interruptions and optimize Spot Instances, diversify and run your application across multiple capacity pools. Each instance family, each instance size, in each Availability Zone, in every Region is a separate Spot pool. You can use the RequestSpotFleet API operation to launch thousands of Spot Instances and diversify resources automatically. To further reduce the impact of interruptions, you can also set up Spot Instances and Spot Fleets to respond to an interruption notice by stopping or hibernating rather than terminating instances when capacity is no longer available.

Spot Instances are now available in 18 Regions and 51 Availability Zones, and offer 100s of instance options. We have eliminated bidding, simplified the pricing model, and have made it easy to get started with Amazon EC2 Spot Instances for you to take advantage of the largest pool of cost-effective compute capacity in the world. See the Spot Instances detail page for more information and create your Spot Instance here.

Raspberry Pi clusters come of age

Post Syndicated from Alex Bate original https://www.raspberrypi.org/blog/raspberry-pi-clusters-come-of-age/

In today’s guest post, Bruce Tulloch, CEO and Managing Director of BitScope Designs, discusses the uses of cluster computing with the Raspberry Pi, and the recent pilot of the Los Alamos National Laboratory 3000-Pi cluster built with the BitScope Blade.

Raspberry Pi cluster

High-performance computing and Raspberry Pi are not normally uttered in the same breath, but Los Alamos National Laboratory is building a Raspberry Pi cluster with 3000 cores as a pilot before scaling up to 40 000 cores or more next year.

That’s amazing, but why?

I was asked this question more than any other at The International Conference for High-Performance Computing, Networking, Storage and Analysis in Denver last week, where one of the Los Alamos Raspberry Pi Cluster Modules was on display at the University of New Mexico’s Center for Advanced Research Computing booth.

The short answer to this question is: the Raspberry Pi cluster enables Los Alamos National Laboratory (LANL) to conduct exascale computing R&D.

The Pi cluster breadboard

Exascale refers to computing systems at least 50 times faster than the most powerful supercomputers in use today. The problem faced by LANL and similar labs building these things is one of scale. To get the required performance, you need a lot of nodes, and to make it work, you need a lot of R&D.

However, there’s a catch-22: how do you write the operating systems, networks stacks, launch and boot systems for such large computers without having one on which to test it all? Use an existing supercomputer? No — the existing large clusters are fully booked 24/7 doing science, they cost millions of dollars per year to run, and they may not have the architecture you need for your next-generation machine anyway. Older machines retired from science may be available, but at this scale they cost far too much to use and are usually very hard to maintain.

The Los Alamos solution? Build a “model supercomputer” with Raspberry Pi!

Think of it as a “cluster development breadboard”.

The idea is to design, develop, debug, and test new network architectures and systems software on the “breadboard”, but at a scale equivalent to the production machines you’re currently building. Raspberry Pi may be a small computer, but it can run most of the system software stacks that production machines use, and the ratios of its CPU speed, local memory, and network bandwidth scale proportionately to the big machines, much like an architect’s model does when building a new house. To learn more about the project, see the news conference and this interview with insideHPC at SC17.

Traditional Raspberry Pi clusters

Like most people, we love a good cluster! People have been building them with Raspberry Pi since the beginning, because it’s inexpensive, educational, and fun. They’ve been built with the original Pi, Pi 2, Pi 3, and even the Pi Zero, but none of these clusters have proven to be particularly practical.

That’s not stopped them being useful though! I saw quite a few Raspberry Pi clusters at the conference last week.

One tiny one that caught my eye was from the people at openio.io, who used a small Raspberry Pi Zero W cluster to demonstrate their scalable software-defined object storage platform, which on big machines is used to manage petabytes of data, but which is so lightweight that it runs just fine on this:

Raspberry Pi Zero cluster

There was another appealing example at the ARM booth, where the Berkeley Labs’ singularity container platform was demonstrated running very effectively on a small cluster built with Raspberry Pi 3s.

Raspberry Pi 3 cluster demo at a conference stall

My show favourite was from the Edinburgh Parallel Computing Center (EPCC): Nick Brown used a cluster of Pi 3s to explain supercomputers to kids with an engaging interactive application. The idea was that visitors to the stand design an aircraft wing, simulate it across the cluster, and work out whether an aircraft that uses the new wing could fly from Edinburgh to New York on a full tank of fuel. Mine made it, fortunately!

Raspberry Pi 3 cluster demo at a conference stall

Next-generation Raspberry Pi clusters

We’ve been building small-scale industrial-strength Raspberry Pi clusters for a while now with BitScope Blade.

When Los Alamos National Laboratory approached us via HPC provider SICORP with a request to build a cluster comprising many thousands of nodes, we considered all the options very carefully. It needed to be dense, reliable, low-power, and easy to configure and to build. It did not need to “do science”, but it did need to work in almost every other way as a full-scale HPC cluster would.

Some people argue Compute Module 3 is the ideal cluster building block. It’s very small and just as powerful as Raspberry Pi 3, so one could, in theory, pack a lot of them into a very small space. However, there are very good reasons no one has ever successfully done this. For a start, you need to build your own network fabric and I/O, and cooling the CM3s, especially when densely packed in a cluster, is tricky given their tiny size. There’s very little room for heatsinks, and the tiny PCBs dissipate very little excess heat.

Instead, we saw the potential for Raspberry Pi 3 itself to be used to build “industrial-strength clusters” with BitScope Blade. It works best when the Pis are properly mounted, powered reliably, and cooled effectively. It’s important to avoid using micro SD cards and to connect the nodes using wired networks. It has the added benefit of coming with lots of “free” USB I/O, and the Pi 3 PCB, when mounted with the correct air-flow, is a remarkably good heatsink.

When Gordon announced netboot support, we became convinced the Raspberry Pi 3 was the ideal candidate when used with standard switches. We’d been making smaller clusters for a while, but netboot made larger ones practical. Assembling them all into compact units that fit into existing racks with multiple 10 Gb uplinks is the solution that meets LANL’s needs. This is a 60-node cluster pack with a pair of managed switches by Ubiquiti in testing in the BitScope Lab:

60-node Raspberry Pi cluster pack

Two of these packs, built with Blade Quattro, and one smaller one comprising 30 nodes, built with Blade Duo, are the components of the Cluster Module we exhibited at the show. Five of these modules are going into Los Alamos National Laboratory for their pilot as I write this.

Bruce Tulloch at a conference stand with a demo of the Raspberry Pi cluster for LANL

It’s not only research clusters like this for which Raspberry Pi is well suited. You can build very reliable local cloud computing and data centre solutions for research, education, and even some industrial applications. You’re not going to get much heavy-duty science, big data analytics, AI, or serious number crunching done on one of these, but it is quite amazing to see just how useful Raspberry Pi clusters can be for other purposes, whether it’s software-defined networks, lightweight MaaS, SaaS, PaaS, or FaaS solutions, distributed storage, edge computing, industrial IoT, and of course, education in all things cluster and parallel computing. For one live example, check out Mythic Beasts’ educational compute cloud, built with Raspberry Pi 3.

For more information about Raspberry Pi clusters, drop by BitScope Clusters.

I’ll read and respond to your thoughts in the comments below this post too.

Editor’s note:

Here is a photo of Bruce wearing a jetpack. Cool, right?!

Bruce Tulloch wearing a jetpack

The post Raspberry Pi clusters come of age appeared first on Raspberry Pi.

Well-Architected Lens: Focus on Specific Workload Types

Post Syndicated from Philip Fitzsimons original https://aws.amazon.com/blogs/architecture/well-architected-lens-focus-on-specific-workload-types/

Customers have been building their innovations on AWS for over 11 years. During that time, our solutions architects have conducted tens of thousands of architecture reviews for our customers. In 2012 we created the “Well-Architected” initiative to share with you best practices for building in the cloud, and started publishing them in 2015. We recently released an updated Framework whitepaper, and a new Operational Excellence Pillar whitepaper to reflect what we learned from working with customers every day. Today, we are pleased to announce a new concept called a “lens” that allows you to focus on specific workload types from the well-architected perspective.

A well-architected review looks at a workload from a general technology perspective, which means it can’t provide workload-specific advice. For example, there are additional best practices when you are building high-performance computing (HPC) or serverless applications. Therefore, we created the concept of a lens to focus on what is different for those types of workloads.

In each lens, we document common scenarios we see — specific to that workload — providing reference architectures and a walkthrough. The lens also provides design principles to help you understand how to architect these types of workloads for the cloud, and questions for assessing your own architecture.

Today, we are releasing two lenses:

Well-Architected: High-Performance Computing (HPC) Lens <new>
Well-Architected: Serverless Applications Lens <new>

We expect to create more lenses over time, and evolve them based on customer feedback.

Philip Fitzsimons, Leader, AWS Well-Architected Team

Now Available – Compute-Intensive C5 Instances for Amazon EC2

Post Syndicated from Jeff Barr original https://aws.amazon.com/blogs/aws/now-available-compute-intensive-c5-instances-for-amazon-ec2/

I’m thrilled to announce that the new compute-intensive C5 instances are available today in six sizes for launch in three AWS regions!

These instances designed for compute-heavy applications like batch processing, distributed analytics, high-performance computing (HPC), ad serving, highly scalable multiplayer gaming, and video encoding. The new instances offer a 25% price/performance improvement over the C4 instances, with over 50% for some workloads. They also have additional memory per vCPU, and (for code that can make use of the new AVX-512 instructions), twice the performance for vector and floating point workloads.

Over the years we have been working non-stop to provide our customers with the best possible networking, storage, and compute performance, with a long-term focus on offloading many types of work to dedicated hardware designed and built by AWS. The C5 instance type incorporates the latest generation of our hardware offloads, and also takes another big step forward with the addition of a new hypervisor that runs hand-in-glove with our hardware. The new hypervisor allows us to give you access to all of the processing power provided by the host hardware, while also making performance even more consistent and further raising the bar on security. We’ll be sharing many technical details about it at AWS re:Invent.

The New Instances
The C5 instances are available in six sizes:

Instance NamevCPUs
RAM
EBS BandwidthNetwork Bandwidth
c5.large24 GiBUp to 2.25 GbpsUp to 10 Gbps
c5.xlarge48 GiBUp to 2.25 GbpsUp to 10 Gbps
c5.2xlarge816 GiBUp to 2.25 GbpsUp to 10 Gbps
c5.4xlarge1632 GiB2.25 GbpsUp to 10 Gbps
c5.9xlarge3672 GiB4.5 Gbps10 Gbps
c5.18xlarge72144 GiB9 Gbps25 Gbps

Each vCPU is a hardware hyperthread on a 3.0 GHz Intel Xeon Platinum 8000-series processor. This custom processor, optimized for EC2, allows you have full control over the C-states on the two largest sizes, allowing you to run a single core at up to 3.5 GHz using Intel Turbo Boost Technology.

As you can see from the table, the four smallest instance sizes offer substantially more EBS and network bandwidth than the previous generation of compute-intensive instances.

Because all networking and storage functionality is implemented in hardware, C5 instances require HVM AMIs that include drivers for the Elastic Network Adapter (ENA) and NVMe. The latest Amazon Linux, Microsoft Windows, Ubuntu, RHEL, CentOS, SLES, Debian, and FreeBSD AMIs all support C5 instances. If you are doing machine learning inferencing, or other compute-intensive work, be sure to check out the most recent version of the Intel Math Kernel Library. It has been optimized for the Intel® Xeon® Platinum processor and has the potential to greatly accelerate your work.

In order to remain compatible with instances that use the Xen hypervisor, the device names for EBS volumes will continue to use the existing /dev/sd and /dev/xvd prefixes. The device name that you provide when you attach a volume to an instance is not used because the NVMe driver assigns its own device name (read Amazon EBS and NVMe to learn more):

The nvme command displays additional information about each volume (install it using sudo yum -y install nvme-cli if necessary):

The SN field in the output can be mapped to an EBS volume ID by inserting a “-” after the “vol” prefix (sadly, the NVMe SN field is not long enough to store the entire ID). Here’s a simple script that uses this information to create an EBS snapshot of each attached volume:

$ sudo nvme list | \
  awk '/dev/ {print(gensub("vol", "vol-", 1, $2))}' | \
  xargs -n 1 aws ec2 create-snapshot --volume-id

With a little more work (and a lot of testing), you could create a script that expands EBS volumes that are getting full.

Getting to C5
As I mentioned earlier, our effort to offload work to hardware accelerators has been underway for quite some time. Here’s a recap:

CC1 – Launched in 2010, the CC1 was designed to support scale-out HPC applications. It was the first EC2 instance to support 10 Gbps networking and one of the first to support HVM virtualization. The network fabric that we designed for the CC1 (based on our own switch hardware) has become the standard for all AWS data centers.

C3 – Launched in 2013, the C3 introduced Enhanced Networking and uses dedicated hardware accelerators to support the software defined network inside of each Virtual Private Cloud (VPC). Hardware virtualization removes the I/O stack from the hypervisor in favor of direct access by the guest OS, resulting in higher performance and reduced variability.

C4 – Launched in 2015, the C4 instances are EBS Optimized by default via a dedicated network connection, and also offload EBS processing (including CPU-intensive crypto operations for encrypted EBS volumes) to a hardware accelerator.

C5 – Launched today, the hypervisor that powers the C5 instances allow practically all of the resources of the host CPU to be devoted to customer instances. The ENA networking and the NVMe interface to EBS are both powered by hardware accelerators. The instances do not require (or support) the Xen paravirtual networking or block device drivers, both of which have been removed in order to increase efficiency.

Going forward, we’ll use this hypervisor to power other instance types and plan to share additional technical details in a set of AWS re:Invent sessions.

Launch a C5 Today
You can launch C5 instances today in the US East (Northern Virginia), US West (Oregon), and EU (Ireland) Regions in On-Demand and Spot form (Reserved Instances are also available), with additional Regions in the works.

One quick note before I go: The current NVMe driver is not optimized for high-performance sequential workloads and we don’t recommend the use of C5 instances in conjunction with sc1 or st1 volumes. We are aware of this issue and have been working to optimize the driver for this important use case.

Jeff;

Disabling Intel Hyper-Threading Technology on Amazon EC2 Windows Instances

Post Syndicated from Brian Beach original https://aws.amazon.com/blogs/compute/disabling-intel-hyper-threading-technology-on-amazon-ec2-windows-instances/

In a prior post, Disabling Intel Hyper-Threading on Amazon Linux, I investigated how the Linux kernel enumerates CPUs. I also discussed the options to disable Intel Hyper-Threading (HT Technology) in Amazon Linux running on Amazon EC2.

In this post, I do the same for Microsoft Windows Server 2016 running on EC2 instances. I begin with a quick review of HT Technology and the reasons you might want to disable it. I also recommend that you take a moment to review the prior post for a more thorough foundation.

HT Technology

HT Technology makes a single physical processor appear as multiple logical processors. Each core in an Intel Xeon processor has two threads of execution. Most of the time, these threads can progress independently; one thread executing while the other is waiting on a relatively slow operation (for example, reading from memory) to occur. However, the two threads do share resources and occasionally one thread is forced to wait while the other is executing.

There a few unique situations where disabling HT Technology can improve performance. One example is high performance computing (HPC) workloads that rely heavily on floating point operations. In these rare cases, it can be advantageous to disable HT Technology. However, these cases are rare, and for the overwhelming majority of workloads you should leave it enabled. I recommend that you test with and without HT Technology enabled, and only disable threads if you are sure it will improve performance.

Exploring HT Technology on Microsoft Windows

Here’s how Microsoft Windows enumerates CPUs. As before, I am running these examples on an m4.2xlarge. I also chose to run Windows Server 2016, but you can walk through these exercises on any version of Windows. Remember that the m4.2xlarge has eight vCPUs, and each vCPU is a thread of an Intel Xeon core. Therefore, the m4.2xlarge has four cores, each of which run two threads, resulting in eight vCPUs.

Windows does not have a built-in utility to examine CPU configuration, but you can download the Sysinternals coreinfo utility from Microsoft’s website. This utility provides useful information about the system CPU and memory topology. For this walkthrough, you enumerate the individual CPUs, which you can do by running coreinfo -c. For example:

C:\Users\Administrator >coreinfo -c

Coreinfo v3.31 - Dump information on system CPU and memory topology
Copyright (C) 2008-2014 Mark Russinovich
Sysinternals - www.sysinternals.com

Logical to Physical Processor Map:
**------ Physical Processor 0 (Hyperthreaded)
--**---- Physical Processor 1 (Hyperthreaded)
----**-- Physical Processor 2 (Hyperthreaded)
------** Physical Processor 3 (Hyperthreaded)

As you can see from the screenshot, the coreinfo utility displays a table where each row is a physical core and each column is a logical CPU. In other words, the two asterisks on the first line indicate that CPU 0 and CPU 1 are the two threads in the first physical core. Therefore, my m4.2xlarge has for four physical processors and each processor has two threads resulting in eight total CPUs, just as expected.

It is interesting to note that Windows Server 2016 enumerates CPUs in a different order than Linux. Remember from the prior post that Linux enumerated the first thread in each core, followed by the second thread in each core. You can see from the output earlier that Windows Server 2016, enumerates both threads in the first core, then both threads in the second core, and so on. The diagram below shows the relationship of CPUs to cores and threads in both operating systems.

In the Linux post, I disabled CPUs 4–6, leaving one thread per core, and effectively disabling HT Technology. You can see from the diagram that you must disable the odd-numbered threads (that is, 1, 3, 5, and 7) to achieve the same result in Windows. Here’s how to do that.

Disabling HT Technology on Microsoft Windows

In Linux, you can globally disable CPUs dynamically. In Windows, there is no direct equivalent that I could find, but there are a few alternatives.

First, you can disable CPUs using the msconfig.exe tool. If you choose Boot, Advanced Options, you have the option to set the number of processors. In the example below, I limit my m4.2xlarge to four CPUs. Restart for this change to take effect.

Unfortunately, Windows does not disable hyperthreaded CPUs first and then real cores, as Linux does. As you can see in the following output, coreinfo reports that my c4.2xlarge has two real cores and four hyperthreads, after rebooting. Msconfig.exe is useful for disabling cores, but it does not allow you to disable HT Technology.

Note: If you have been following along, you can re-enable all your CPUs by unselecting the Number of processors check box and rebooting your system.

 

C:\Users\Administrator >coreinfo -c

Coreinfo v3.31 - Dump information on system CPU and memory topology
Copyright (C) 2008-2014 Mark Russinovich
Sysinternals - www.sysinternals.com

Logical to Physical Processor Map:
**-- Physical Processor 0 (Hyperthreaded)
--** Physical Processor 1 (Hyperthreaded)

While you cannot disable HT Technology systemwide, Windows does allow you to associate a particular process with one or more CPUs. Microsoft calls this, “processor affinity”. To see an example, use the following steps.

  1. Launch an instance of Notepad.
  2. Open Windows Task Manager and choose Processes.
  3. Open the context (right click) menu on notepad.exe and choose Set Affinity….

This brings up the Processor Affinity dialog box.

As you can see, all the CPUs are allowed to run this instance of notepad.exe. You can uncheck a few CPUs to exclude them. Windows is smart enough to allow any scheduled operations to continue to completion on disabled CPUs. It then saves its state at the next scheduling event, and resumes those operations on another CPU. To ensure that only one thread in each core is able to run a process, you uncheck every other core. This effectively disables HT Technology for this process. For example:

Of course, this can be tedious when you have a large number of cores. Remember that the x1.32xlarge has 128 CPUs. Luckily, you can set the affinity of a running process from PowerShell using the Get-Process cmdlet. For example:

PS C:\&gt; (Get-Process -Name 'notepad').ProcessorAffinity = 0x55;

The ProcessorAffinity attribute takes a bitmask in hexadecimal format. 0x55 in hex is equivalent to 01010101 in binary. Think of the binary encoding as 1=enabled and 0=disabled. This is slightly confusing, but we work left to right so that CPU 0 is the rightmost bit and CPU 7 is the leftmost bit. Therefore, 01010101 means that the first thread in each CPU is enabled just as it was in the diagram earlier.

The calculator built into Windows includes a “programmer view” that helps you convert from hexadecimal to binary. In addition, the ProcessorAffinity attribute is a 64-bit number. Therefore, you can only configure the processor affinity on systems up to 64 CPUs. At the moment, only the x1.32xlarge has more than 64 vCPUs.

In the preceding examples, you changed the processor affinity of a running process. Sometimes, you want to start a process with the affinity already configured. You can do this using the start command. The start command includes an affinity flag that takes a hexadecimal number like the PowerShell example earlier.

C:\Users\Administrator&gt;start /affinity 55 notepad.exe

It is interesting to note that a child process inherits the affinity from its parent. For example, the following commands create a batch file that launches Notepad, and starts the batch file with the affinity set. If you examine the instance of Notepad launched by the batch file, you see that the affinity has been applied to as well.

C:\Users\Administrator&gt;echo notepad.exe > test.bat
C:\Users\Administrator&gt;start /affinity 55 test.bat

This means that you can set the affinity of your task scheduler and any tasks that the scheduler starts inherits the affinity. So, you can disable every other thread when you launch the scheduler and effectively disable HT Technology for all of the tasks as well. Be sure to test this point, however, as some schedulers override the normal inheritance behavior and explicitly set processor affinity when starting a child process.

Conclusion

While the Windows operating system does not allow you to disable logical CPUs, you can set processor affinity on individual processes. You also learned that Windows Server 2016 enumerates CPUs in a different order than Linux. Therefore, you can effectively disable HT Technology by restricting a process to every other CPU. Finally, you learned how to set affinity of both new and running processes using Task Manager, PowerShell, and the start command.

Note: this technical approach has nothing to do with control over software licensing, or licensing rights, which are sometimes linked to the number of “CPUs” or “cores.” For licensing purposes, those are legal terms, not technical terms. This post did not cover anything about software licensing or licensing rights.

If you have questions or suggestions, please comment below.

New – GPU-Powered Streaming Instances for Amazon AppStream 2.0

Post Syndicated from Jeff Barr original https://aws.amazon.com/blogs/aws/new-gpu-powered-streaming-instances-for-amazon-appstream-2-0/

We launched Amazon AppStream 2.0 at re:Invent 2016. This application streaming service allows you to deliver Windows applications to a desktop browser.

AppStream 2.0 is fully managed and provides consistent, scalable performance by running applications on general purpose, compute optimized, and memory optimized streaming instances, with delivery via NICE DCV – a secure, high-fidelity streaming protocol. Our enterprise and public sector customers have started using AppStream 2.0 in place of legacy application streaming environments that are installed on-premises. They use AppStream 2.0 to deliver both commercial and line of business applications to a desktop browser. Our ISV customers are using AppStream 2.0 to move their applications to the cloud as-is, with no changes to their code. These customers focus on demos, workshops, and commercial SaaS subscriptions.

We are getting great feedback on AppStream 2.0 and have been adding new features very quickly (even by AWS standards). So far this year we have added an image builder, federated access via SAML 2.0, CloudWatch monitoring, Fleet Auto Scaling, Simple Network Setup, persistent storage for user files (backed by Amazon S3), support for VPC security groups, and built-in user management including web portals for users.

New GPU-Powered Streaming Instances
Many of our customers have told us that they want to use AppStream 2.0 to deliver specialized design, engineering, HPC, and media applications to their users. These applications are generally graphically intensive and are designed to run on expensive, high-end PCs in conjunction with a GPU (Graphics Processing Unit). Due to the hardware requirements of these applications, cost considerations have traditionally kept them out of situations where part-time or occasional access would otherwise make sense. Recently, another requirement has come to the forefront. These applications almost always need shared, read-write access to large amounts of sensitive data that is best stored, processed, and secured in the cloud. In order to meet the needs of these users and applications, we are launching two new types of streaming instances today:

Graphics Desktop – Based on the G2 instance type, Graphics Desktop instances are designed for desktop applications that use the CUDA, DirectX, or OpenGL for rendering. These instances are equipped with 15 GiB of memory and 8 vCPUs. You can select this instance family when you build an AppStream image or configure an AppStream fleet:

Graphics Pro – Based on the brand-new G3 instance type, Graphics Pro instances are designed for high-end, high-performance applications that can use the NVIDIA APIs and/or need access to large amounts of memory. These instances are available in three sizes, with 122 to 488 GiB of memory and 16 to 64 vCPUs. Again, you can select this instance family when you configure an AppStream fleet:

To learn more about how to launch, run, and scale a streaming application environment, read Scaling Your Desktop Application Streams with Amazon AppStream 2.0.

As I noted earlier, you can use either of these two instance types to build an AppStream image. This will allow you to test and fine tune your applications and to see the instances in action.

Streaming Instances in Action
We’ve been working with several customers during a private beta program for the new instance types. Here are a few stories (and some cool screen shots) to show you some of the applications that they are streaming via AppStream 2.0:

AVEVA is a world leading provider of engineering design and information management software solutions for the marine, power, plant, offshore and oil & gas industries. As part of their work on massive capital projects, their customers need to bring many groups of specialist engineers together to collaborate on the creation of digital assets. In order to support this requirement, AVEVA is building SaaS solutions that combine the streamed delivery of engineering applications with access to a scalable project data environment that is shared between engineers across the globe. The new instances will allow AVEVA to deliver their engineering design software in SaaS form while maximizing quality and performance. Here’s a screen shot of their Everything 3D app being streamed from AppStream:

Nissan, a Japanese multinational automobile manufacturer, trains its automotive specialists using 3D simulation software running on expensive graphics workstations. The training software, developed by The DiSti Corporation, allows its specialists to simulate maintenance processes by interacting with realistic 3D models of the vehicles they work on. AppStream 2.0’s new graphics capability now allows Nissan to deliver these training tools in real time, with up to date content, to a desktop browser running on low-cost commodity PCs. Their specialists can now interact with highly realistic renderings of a vehicle that allows them to train for and plan maintenance operations with higher efficiency.

Cornell University is an American private Ivy League and land-grant doctoral university located in Ithaca, New York. They deliver advanced 3D tools such as AutoDesk AutoCAD and Inventor to students and faculty to support their course work, teaching, and research. Until now, these tools could only be used on GPU-powered workstations in a lab or classroom. AppStream 2.0 allows them to deliver the applications to a web browser running on any desktop, where they run as if they were on a local workstation. Their users are no longer limited by available workstations in labs and classrooms, and can bring their own devices and have access to their course software. This increased flexibility also means that faculty members no longer need to take lab availability into account when they build course schedules. Here’s a copy of Autodesk Inventor Professional running on AppStream at Cornell:

Now Available
Both of the graphics streaming instance families are available in the US East (Northern Virginia), US West (Oregon), EU (Ireland), and Asia Pacific (Tokyo) Regions and you can start streaming from them today. Your applications must run in a Windows 2012 R2 environment, and can make use of DirectX, OpenGL, CUDA, OpenCL, and Vulkan.

With prices in the US East (Northern Virginia) Region starting at $0.50 per hour for Graphics Desktop instances and $2.05 per hour for Graphics Pro instances, you can now run your simulation, visualization, and HPC workloads in the AWS Cloud on an economical, pay-by-the-hour basis. You can also take advantage of fast, low-latency access to Amazon Elastic Compute Cloud (EC2), Amazon Simple Storage Service (S3), AWS Lambda, Amazon Redshift, and other AWS services to build processing workflows that handle pre- and post-processing of your data.

Jeff;