Tag Archives: high performance computing

Deploying a Burstable and Event-driven HPC Cluster on AWS Using SLURM, Part 1

Post Syndicated from Geoff Murase original https://aws.amazon.com/blogs/compute/deploying-a-burstable-and-event-driven-hpc-cluster-on-aws-using-slurm-part-1/

Contributed by Amr Ragab, HPC Application Consultant, AWS Professional Services

When you execute high performance computing (HPC) workflows on AWS, you can take advantage of the elasticity and concomitant scale associated with recruiting resources for your computational workloads. AWS offers a variety of services, solutions, and open source tools to deploy, manage, and dynamically destroy compute resources for running various types of HPC workloads.

Best practices in deploying HPC resources on AWS include creating much of the infrastructure on-demand, and making it as ephemeral and dynamic as possible. Traditional HPC clusters use a resource scheduler that maintains a set of computational resources and distributes those resources over a collection of queued jobs.

With a central resource scheduler, all users have a single point of entry to a broad range of compute. Traditionally, many of these schedulers managed on-premises systems. They weren’t offered dynamically as much as cloud-based HPC clusters, and they usually only needed to manage a largely static set of resources.

However, many schedulers now support the ability to burst into AWS or manage a dynamically changing HPC environment through plugins, connectors, and custom scripting. Some of the more common resource schedulers include:

SLURM

Simple Linux Resource Manager (SLURM) by SchedMD is one such popular scheduler. Using a derivative of SLURM’s elastic power plugin, you can coordinate the launch of a set of compute nodes with the appropriate CPU/disk/GPU/network topologies. You standup the compute resources for the job, instead of trying to fit a job within a set of pre-existing compute topologies.

We have recently released an example implementation of the SLURM bursting capability in the AWS Samples GitHub repo.

The following diagram shows the reference architecture.

Deployment

Download the aws-plugin-for-slurm directory locally. Use the following AWS CLI commands to sync the directory with an S3 bucket to be referenced later. For more detailed instructions on the deployment follow the README within the GitHub.

git clone https://github.com/aws-samples/aws-plugin-for-slurm.git 
aws s3 sync aws-plugin-for-slurm/ s3://<bucket-name>

Included is a CloudFormation script, which you use to stand up the VPC and subnets, as well as the headnode. In AWS CloudFormation, choose Create Stack and import the slurm_headnode-clouformation.yml script.

The CloudFormation script lays down the landing zone with the appropriate network topology. The headnode is based on the publicly available CentOS 7.5 available in the AWS Marketplace. The latest security packages are installed with the dependencies needed to install SLURM.

I have found that scheduling performance is best if the source is compiled at runtime, which the CloudFormation script takes care of. The script sets up the headnode as a single controller. However, with minor modifications, it can be set up in a highly available manner with a backup SLURM controller.

After the deployment moves to CREATE_COMPLETEstatus in CloudFormation, use SSH to connect to the slurm headnode:

ssh -i <path/to/private/key.pem> [email protected]<public-ip-address>

Create a new sbatch job submission file by running the following commands in the vi/nano text editor:

#!/bin/bash
#SBATCH —nodes=2
#SBATCH —ntasks-per-node=1
#SBATCH —cpus-per-task=4
#SBATCH —constraint=[us-west-2a]

env
sleep 60

This job submission script requests two nodes to be allocated, running one task per node and using four CPUs. The constraint is optional but allows SLURM to allocate the job among the available zones.

The elasticity of the cluster comes in setting the slurm.conf parameters SuspendProgram and ResumeProgram in the slurm.conf file.

SuspendTime=60
ResumeTimeout=250
TreeWidth=60000
SuspendExcNodes=ip-10-0-0-251
SuspendProgram=/nfs/slurm/bin/slurm-aws-shutdown.sh
ResumeProgram=/nfs/slurm/bin/slurm-aws-startup.sh
ResumeRate=0
SuspendRate=0

You can set the responsiveness of the scaling on AWS by modifying SuspendTime. Do not set a value for ResumeRateor SuspendRate, as the underlying SuspendProgram and ResumeProgramscripts have API calls that impose their own rate limits. If you find that your API call rate limit is reached at scale (approximately 1000 nodes/sec), you can set ResumeRate and SuspendRate accordingly.

If you are familiar with SLURM’s configuration options, you can make further modifications by editing the /nfs/slurm/etc/slurm.conf.d/slurm_nodes.conffile. That file contains the node definitions, with some minor modifications. You can schedule GPU-based workloads separate from CPU to instantiate a heterogeneous cluster layout. You also get more flexibility running tightly coupled workloads alongside loosely coupled jobs, as well as job array support. For additional commands administrating the SLURM cluster, see the SchedMD SLURM documentation.

The initial state of the cluster shows that no compute resources are available. Run the sinfocommand:

[[email protected] ~]$ sinfo
PARTITION AVAIL TIMELIMIT NODES STATE NODELIST
all*         up  infinite     0   n/a 
gpu          up  infinite     0   n/a 
[[email protected] ~]$ 

The job described earlier is submitted with the sbatchcommand:

sbatch test.sbatch

The power plugin allocates the requested number of nodes based on your job definition and runs the Amazon EC2 API operations to request those instances.

[[email protected] ~]$ sinfo
PARTITION AVAIL TIMELIMIT NODES STATE NODELIST
all*      up    infinite      2 alloc# ip-10-0-1-[6-7]
gpu       up    infinite      2 alloc# ip-10-0-1-[6-7]
[[email protected] ~]$ 

The log file is located at /var/log/power_save.log.

Wed Sep 12 18:37:45 UTC 2018 Resume invoked 
/nfs/slurm/bin/slurm-aws-startup.sh ip-10-0-1-[6-7]

After the request job is complete, the compute nodes remain idle for the duration of the SuspendTime=60value in the slurm.conffile.

[[email protected] ~]$ sinfo
PARTITION AVAIL TIMELIMIT NODES STATE NODELIST
all*         up  infinite     2  idle ip-10-0-1-[6-7]
gpu          up  infinite     2  idle ip-10-0-1-[6-7]
[[email protected] ~]$ 

Ideally, ensure that other queued jobs have an opportunity to run on the current infrastructure, assuming that the job requirements are fulfilled by the compute nodes.

If the job requirements are not fulfilled and there are no more jobs in the queue, the aws-slurm shutdown script takes over and terminates the instance. That’s one of the benefits of an elastic cluster.

Wed Sep 12 18:42:38 UTC 2018 Suspend invoked /nfs/slurm/bin/slurm-aws-shutdown.sh ip-10-0-1-[6-7]
{
    "TerminatingInstances": [
        {
            "InstanceId": "i-0b4c6ec4945afe52e", 
            "CurrentState": {
                "Code": 32, 
                "Name": "shutting-down"
            }, 
            "PreviousState": {
                "Code": 16, 
                "Name": "running"
            }
        }
    ]
}
{
    "TerminatingInstances": [
        {
            "InstanceId": "i-0f3139a52f2602c60", 
            "CurrentState": {
                "Code": 32, 
                "Name": "shutting-down"
            }, 
            "PreviousState": {
                "Code": 16, 
                "Name": "running"
            }
        }
    ]
}

The SLURM elastic compute plugin provisions the compute resources based on the scheduler queue load. In this example implementation, you are distributing a set of compute nodes to take advantage of scale and capacity across all Availability Zones within an AWS Region.

With a minor modification on the VPC layer, you can use this same plugin to stand up compute resources across multiple Regions. With this implementation, you can truly take advantage of a global HPC footprint.

“Imagine creating your own custom cluster mix of GPUs, CPUs, storage, memory, and networking – just the way you want it, then running your experiment, getting the results, and then tearing it all down.” — InsideHPC. Innovation Unbound: What Would You do with a Million Cores?

Part 2

In Part 2 of this post series, you integrate this cluster with AWS native services to handle scalable job submission and monitoring using Amazon API Gateway, AWS Lambda, and Amazon CloudWatch.

AWS Online Tech Talks – May and Early June 2018

Post Syndicated from Devin Watson original https://aws.amazon.com/blogs/aws/aws-online-tech-talks-may-and-early-june-2018/

AWS Online Tech Talks – May and Early June 2018  

Join us this month to learn about some of the exciting new services and solution best practices at AWS. We also have our first re:Invent 2018 webinar series, “How to re:Invent”. Sign up now to learn more, we look forward to seeing you.

Note – All sessions are free and in Pacific Time.

Tech talks featured this month:

Analytics & Big Data

May 21, 2018 | 11:00 AM – 11:45 AM PT Integrating Amazon Elasticsearch with your DevOps Tooling – Learn how you can easily integrate Amazon Elasticsearch Service into your DevOps tooling and gain valuable insight from your log data.

May 23, 2018 | 11:00 AM – 11:45 AM PTData Warehousing and Data Lake Analytics, Together – Learn how to query data across your data warehouse and data lake without moving data.

May 24, 2018 | 11:00 AM – 11:45 AM PTData Transformation Patterns in AWS – Discover how to perform common data transformations on the AWS Data Lake.

Compute

May 29, 2018 | 01:00 PM – 01:45 PM PT – Creating and Managing a WordPress Website with Amazon Lightsail – Learn about Amazon Lightsail and how you can create, run and manage your WordPress websites with Amazon’s simple compute platform.

May 30, 2018 | 01:00 PM – 01:45 PM PTAccelerating Life Sciences with HPC on AWS – Learn how you can accelerate your Life Sciences research workloads by harnessing the power of high performance computing on AWS.

Containers

May 24, 2018 | 01:00 PM – 01:45 PM PT – Building Microservices with the 12 Factor App Pattern on AWS – Learn best practices for building containerized microservices on AWS, and how traditional software design patterns evolve in the context of containers.

Databases

May 21, 2018 | 01:00 PM – 01:45 PM PTHow to Migrate from Cassandra to Amazon DynamoDB – Get the benefits, best practices and guides on how to migrate your Cassandra databases to Amazon DynamoDB.

May 23, 2018 | 01:00 PM – 01:45 PM PT5 Hacks for Optimizing MySQL in the Cloud – Learn how to optimize your MySQL databases for high availability, performance, and disaster resilience using RDS.

DevOps

May 23, 2018 | 09:00 AM – 09:45 AM PT.NET Serverless Development on AWS – Learn how to build a modern serverless application in .NET Core 2.0.

Enterprise & Hybrid

May 22, 2018 | 11:00 AM – 11:45 AM PTHybrid Cloud Customer Use Cases on AWS – Learn how customers are leveraging AWS hybrid cloud capabilities to easily extend their datacenter capacity, deliver new services and applications, and ensure business continuity and disaster recovery.

IoT

May 31, 2018 | 11:00 AM – 11:45 AM PTUsing AWS IoT for Industrial Applications – Discover how you can quickly onboard your fleet of connected devices, keep them secure, and build predictive analytics with AWS IoT.

Machine Learning

May 22, 2018 | 09:00 AM – 09:45 AM PTUsing Apache Spark with Amazon SageMaker – Discover how to use Apache Spark with Amazon SageMaker for training jobs and application integration.

May 24, 2018 | 09:00 AM – 09:45 AM PTIntroducing AWS DeepLens – Learn how AWS DeepLens provides a new way for developers to learn machine learning by pairing the physical device with a broad set of tutorials, examples, source code, and integration with familiar AWS services.

Management Tools

May 21, 2018 | 09:00 AM – 09:45 AM PTGaining Better Observability of Your VMs with Amazon CloudWatch – Learn how CloudWatch Agent makes it easy for customers like Rackspace to monitor their VMs.

Mobile

May 29, 2018 | 11:00 AM – 11:45 AM PT – Deep Dive on Amazon Pinpoint Segmentation and Endpoint Management – See how segmentation and endpoint management with Amazon Pinpoint can help you target the right audience.

Networking

May 31, 2018 | 09:00 AM – 09:45 AM PTMaking Private Connectivity the New Norm via AWS PrivateLink – See how PrivateLink enables service owners to offer private endpoints to customers outside their company.

Security, Identity, & Compliance

May 30, 2018 | 09:00 AM – 09:45 AM PT – Introducing AWS Certificate Manager Private Certificate Authority (CA) – Learn how AWS Certificate Manager (ACM) Private Certificate Authority (CA), a managed private CA service, helps you easily and securely manage the lifecycle of your private certificates.

June 1, 2018 | 09:00 AM – 09:45 AM PTIntroducing AWS Firewall Manager – Centrally configure and manage AWS WAF rules across your accounts and applications.

Serverless

May 22, 2018 | 01:00 PM – 01:45 PM PTBuilding API-Driven Microservices with Amazon API Gateway – Learn how to build a secure, scalable API for your application in our tech talk about API-driven microservices.

Storage

May 30, 2018 | 11:00 AM – 11:45 AM PTAccelerate Productivity by Computing at the Edge – Learn how AWS Snowball Edge support for compute instances helps accelerate data transfers, execute custom applications, and reduce overall storage costs.

June 1, 2018 | 11:00 AM – 11:45 AM PTLearn to Build a Cloud-Scale Website Powered by Amazon EFS – Technical deep dive where you’ll learn tips and tricks for integrating WordPress, Drupal and Magento with Amazon EFS.

 

 

 

 

Serverless @ re:Invent 2017

Post Syndicated from Chris Munns original https://aws.amazon.com/blogs/compute/serverless-reinvent-2017/

At re:Invent 2014, we announced AWS Lambda, what is now the center of the serverless platform at AWS, and helped ignite the trend of companies building serverless applications.

This year, at re:Invent 2017, the topic of serverless was everywhere. We were incredibly excited to see the energy from everyone attending 7 workshops, 15 chalk talks, 20 skills sessions and 27 breakout sessions. Many of these sessions were repeated due to high demand, so we are happy to summarize and provide links to the recordings and slides of these sessions.

Over the course of the week leading up to and then the week of re:Invent, we also had over 15 new features and capabilities across a number of serverless services, including AWS Lambda, Amazon API Gateway, AWS [email protected], AWS SAM, and the newly announced AWS Serverless Application Repository!

AWS Lambda

Amazon API Gateway

  • Amazon API Gateway Supports Endpoint Integrations with Private VPCs – You can now provide access to HTTP(S) resources within your VPC without exposing them directly to the public internet. This includes resources available over a VPN or Direct Connect connection!
  • Amazon API Gateway Supports Canary Release Deployments – You can now use canary release deployments to gradually roll out new APIs. This helps you more safely roll out API changes and limit the blast radius of new deployments.
  • Amazon API Gateway Supports Access Logging – The access logging feature lets you generate access logs in different formats such as CLF (Common Log Format), JSON, XML, and CSV. The access logs can be fed into your existing analytics or log processing tools so you can perform more in-depth analysis or take action in response to the log data.
  • Amazon API Gateway Customize Integration Timeouts – You can now set a custom timeout for your API calls as low as 50ms and as high as 29 seconds (the default is 30 seconds).
  • Amazon API Gateway Supports Generating SDK in Ruby – This is in addition to support for SDKs in Java, JavaScript, Android and iOS (Swift and Objective-C). The SDKs that Amazon API Gateway generates save you development time and come with a number of prebuilt capabilities, such as working with API keys, exponential back, and exception handling.

AWS Serverless Application Repository

Serverless Application Repository is a new service (currently in preview) that aids in the publication, discovery, and deployment of serverless applications. With it you’ll be able to find shared serverless applications that you can launch in your account, while also sharing ones that you’ve created for others to do the same.

AWS [email protected]

[email protected] now supports content-based dynamic origin selection, network calls from viewer events, and advanced response generation. This combination of capabilities greatly increases the use cases for [email protected], such as allowing you to send requests to different origins based on request information, showing selective content based on authentication, and dynamically watermarking images for each viewer.

AWS SAM

Twitch Launchpad live announcements

Other service announcements

Here are some of the other highlights that you might have missed. We think these could help you make great applications:

AWS re:Invent 2017 sessions

Coming up with the right mix of talks for an event like this can be quite a challenge. The Product, Marketing, and Developer Advocacy teams for Serverless at AWS spent weeks reading through dozens of talk ideas to boil it down to the final list.

From feedback at other AWS events and webinars, we knew that customers were looking for talks that focused on concrete examples of solving problems with serverless, how to perform common tasks such as deployment, CI/CD, monitoring, and troubleshooting, and to see customer and partner examples solving real world problems. To that extent we tried to settle on a good mix based on attendee experience and provide a track full of rich content.

Below are the recordings and slides of breakout sessions from re:Invent 2017. We’ve organized them for those getting started, those who are already beginning to build serverless applications, and the experts out there already running them at scale. Some of the videos and slides haven’t been posted yet, and so we will update this list as they become available.

Find the entire Serverless Track playlist on YouTube.

Talks for people new to Serverless

Advanced topics

Expert mode

Talks for specific use cases

Talks from AWS customers & partners

Looking to get hands-on with Serverless?

At re:Invent, we delivered instructor-led skills sessions to help attendees new to serverless applications get started quickly. The content from these sessions is already online and you can do the hands-on labs yourself!
Build a Serverless web application

Still looking for more?

We also recently completely overhauled the main Serverless landing page for AWS. This includes a new Resources page containing case studies, webinars, whitepapers, customer stories, reference architectures, and even more Getting Started tutorials. Check it out!

Well-Architected Lens: Focus on Specific Workload Types

Post Syndicated from Philip Fitzsimons original https://aws.amazon.com/blogs/architecture/well-architected-lens-focus-on-specific-workload-types/

Customers have been building their innovations on AWS for over 11 years. During that time, our solutions architects have conducted tens of thousands of architecture reviews for our customers. In 2012 we created the “Well-Architected” initiative to share with you best practices for building in the cloud, and started publishing them in 2015. We recently released an updated Framework whitepaper, and a new Operational Excellence Pillar whitepaper to reflect what we learned from working with customers every day. Today, we are pleased to announce a new concept called a “lens” that allows you to focus on specific workload types from the well-architected perspective.

A well-architected review looks at a workload from a general technology perspective, which means it can’t provide workload-specific advice. For example, there are additional best practices when you are building high-performance computing (HPC) or serverless applications. Therefore, we created the concept of a lens to focus on what is different for those types of workloads.

In each lens, we document common scenarios we see — specific to that workload — providing reference architectures and a walkthrough. The lens also provides design principles to help you understand how to architect these types of workloads for the cloud, and questions for assessing your own architecture.

Today, we are releasing two lenses:

Well-Architected: High-Performance Computing (HPC) Lens <new>
Well-Architected: Serverless Applications Lens <new>

We expect to create more lenses over time, and evolve them based on customer feedback.

Philip Fitzsimons, Leader, AWS Well-Architected Team