Tag Archives: Amazon SageMaker HyperPod

Introducing checkpointless and elastic training on Amazon SageMaker HyperPod

2025-12-03 Channy Yun (윤석찬)

Post Syndicated from Channy Yun (윤석찬) original https://aws.amazon.com/blogs/aws/introducing-checkpointless-and-elastic-training-on-amazon-sagemaker-hyperpod/

Today, we’re announcing two new AI model training features within Amazon SageMaker HyperPod: checkpointless training, an approach that mitigates the need for traditional checkpoint-based recovery by enabling peer-to-peer state recovery, and elastic training, enabling AI workloads to automatically scale based on resource availability.

Checkpointless training – Checkpointless training eliminates disruptive checkpoint-restart cycles, maintaining forward training momentum despite failures, reducing recovery time from hours to minutes. Accelerate your AI model development, reclaim days from development timelines, and confidently scale training workflows to thousands of AI accelerators.
Elastic training – Elastic training maximizes cluster utilization as training workloads automatically expand to use idle capacity as it becomes available, and contract to yield resources as higher-priority workloads like inference volumes peak. Save hours of engineering time per week spent reconfiguring training jobs based on compute availability.

With these new training techniques, your team can concentrate on enhancing model performance instead of managing training infrastructure, ultimately getting your AI models to market faster. By eliminating the traditional checkpoint dependencies and fully utilizing available capacity, you can significantly reduce model training completion times.

Checkpointless training: How it works
Traditional checkpoint-based recovery has these sequential job stages: 1) job termination and restart, 2) process discovery and network setup, 3) checkpoint retrieval, 4) data loader initialization, and 5) training loop resumption. When failures occur, each stage can become a bottleneck and training recovery can take up to an hour on self-managed training clusters. The entire cluster must wait for every single stage to complete before training can resume. This can lead to the entire training cluster sitting idle during recovery operations, which increases costs and extends the time to market.
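
To make the cost of a sequential recovery concrete, here is a minimal sketch that adds up the five stages and converts the total into idle accelerator-hours. The stage durations and the cluster size are hypothetical numbers chosen only to match the "up to an hour" figure; they are not measurements.

```python
def idle_accelerator_hours(stage_minutes, num_accelerators):
    """Accelerator-hours lost while the whole cluster waits for recovery."""
    recovery_hours = sum(stage_minutes) / 60
    return recovery_hours * num_accelerators


# Hypothetical durations (minutes) for the five sequential recovery stages.
stages = {
    "job termination and restart": 10,
    "process discovery and network setup": 5,
    "checkpoint retrieval": 30,
    "data loader initialization": 10,
    "training loop resumption": 5,
}

lost = idle_accelerator_hours(stages.values(), num_accelerators=1024)
print(f"{sum(stages.values())} min recovery idles {lost:.0f} accelerator-hours")
```

Because the stages run one after another, shortening any single stage reduces the total, which is why each recovery step is worth optimizing on its own.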

Checkpointless training removes this bottleneck entirely by maintaining continuous model state preservation across the training cluster. When failures occur, the system instantly recovers by using healthy peers, avoiding the need for a checkpoint-based recovery that requires restarting the entire job. As a result, checkpointless training enables fault recovery in minutes.

Checkpointless training is designed for incremental adoption and built on four core components that work together: 1) collective communications initialization optimizations, 2) memory-mapped data loading that enables caching, 3) in-process recovery, and 4) checkpointless peer-to-peer state replication. These components are orchestrated through the HyperPod training operator that is used to launch the job. Each component optimizes a specific step in the recovery process, and together they enable automatic detection and recovery of infrastructure faults in minutes with zero manual intervention, even with thousands of AI accelerators. You can progressively enable each of these features as your training scales.
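
The idea behind peer-to-peer state recovery can be illustrated with a toy model. The sketch below is not the HyperPod implementation: the `Rank` class and both functions are hypothetical, and real training state lives in accelerator memory rather than Python lists. It only shows that a failed replica can be restored from a healthy peer that holds the same state, without reading a checkpoint from storage.

```python
from dataclasses import dataclass, field


@dataclass
class Rank:
    """One data-parallel training process (hypothetical toy model)."""
    rank_id: int
    step: int = 0
    weights: list = field(default_factory=lambda: [0.0, 0.0])
    healthy: bool = True


def train_step(ranks):
    """Apply the same update on every rank so replicas stay in sync."""
    for r in ranks:
        r.step += 1
        r.weights = [w + 0.1 for w in r.weights]


def recover_from_peer(ranks, failed_id):
    """Restore a failed rank by copying state from any healthy peer.

    Assumes rank_id equals the rank's index in the list.
    """
    donor = next(r for r in ranks if r.healthy and r.rank_id != failed_id)
    failed = ranks[failed_id]
    failed.step = donor.step
    failed.weights = list(donor.weights)
    failed.healthy = True
    return donor.rank_id


ranks = [Rank(i) for i in range(4)]
for _ in range(3):
    train_step(ranks)

# Simulate a fault on rank 2 that wipes its in-memory state.
ranks[2].healthy = False
ranks[2].step = 0
ranks[2].weights = []

donor_id = recover_from_peer(ranks, failed_id=2)
print(f"rank 2 restored from rank {donor_id} at step {ranks[2].step}")
```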

The latest Amazon Nova models were trained using this technology on tens of thousands of accelerators. Additionally, in internal studies on cluster sizes ranging from 16 GPUs to over 2,000 GPUs, checkpointless training showed significant improvements in recovery times, reducing downtime by over 80% compared to traditional checkpoint-based recovery.

To learn more, visit HyperPod Checkpointless Training in the Amazon SageMaker AI Developer Guide.

Elastic training: How it works
On clusters that run different types of modern AI workloads, accelerator availability can change continuously throughout the day as short-duration training runs complete, inference spikes occur and subside, or resources free up from completed experiments. Despite this dynamic availability of AI accelerators, traditional training workloads remain locked into their initial compute allocation, unable to take advantage of idle accelerators without manual intervention. This rigidity leaves valuable GPU capacity unused and prevents organizations from maximizing their infrastructure investment.

Elastic training transforms how training workloads interact with cluster resources. Training jobs can automatically scale up to utilize available accelerators and gracefully contract when resources are needed elsewhere, all while maintaining training quality.

Workload elasticity is enabled through the HyperPod training operator that orchestrates scaling decisions through integration with the Kubernetes control plane and resource scheduler. It continuously monitors cluster state through three primary channels: pod lifecycle events, node availability changes, and resource scheduler priority signals. This comprehensive monitoring enables near-instantaneous detection of scaling opportunities, whether from newly available resources or requests from higher-priority workloads.

The scaling mechanism relies on adding and removing data parallel replicas. When additional compute resources become available, new data parallel replicas join the training job, accelerating throughput. Conversely, during scale-down events (for example, when a higher-priority workload requests resources), the system scales down by removing replicas rather than terminating the entire job, allowing training to continue at reduced capacity.

Across different scales, the system preserves the global batch size and adapts learning rates, preventing model convergence from being adversely impacted. This enables workloads to dynamically scale up or down to utilize available AI accelerators without any manual intervention.
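
One common way to keep the global batch size constant while the number of data parallel replicas changes is to adjust gradient accumulation. The sketch below assumes that approach; the function name and the numbers are hypothetical, and it does not show how HyperPod itself rebalances work or adapts the learning rate.

```python
def accumulation_steps(global_batch_size, replicas, micro_batch_size):
    """Gradient accumulation steps needed to keep the global batch fixed."""
    samples_per_step = replicas * micro_batch_size
    if global_batch_size % samples_per_step != 0:
        raise ValueError(
            f"{replicas} replicas x micro-batch {micro_batch_size} "
            f"does not divide global batch {global_batch_size}"
        )
    return global_batch_size // samples_per_step


GLOBAL_BATCH = 512  # hypothetical target, held constant across scaling events
MICRO_BATCH = 4     # hypothetical per-replica micro-batch size

for replicas in (4, 8, 16):  # scale-down, baseline, scale-up
    steps = accumulation_steps(GLOBAL_BATCH, replicas, MICRO_BATCH)
    print(f"{replicas:>2} replicas -> {steps} accumulation steps")
```

With fewer replicas each one accumulates more micro-batches per optimizer step, so every update still sees the same number of samples and only throughput changes.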

You can start elastic training through the HyperPod recipes for publicly available foundation models (FMs) including Llama and GPT-OSS. Additionally, you can modify your PyTorch training scripts to add elastic event handlers, which enable the job to dynamically scale.

To learn more, visit HyperPod Elastic Training in the Amazon SageMaker AI Developer Guide. To get started, find the HyperPod recipes available in the AWS GitHub repository.

Now available
Both features are available in all the Regions in which Amazon SageMaker HyperPod is available. You can use these training techniques without additional cost. To learn more, visit the SageMaker HyperPod product page and SageMaker AI pricing page.

Give it a try and send feedback to AWS re:Post for SageMaker or through your usual AWS Support contacts.

— Channy

Introducing Amazon Nova Forge: Build your own frontier models using Nova

2025-12-02 Danilo Poccia

Post Syndicated from Danilo Poccia original https://aws.amazon.com/blogs/aws/introducing-amazon-nova-forge-build-your-own-frontier-models-using-nova/

Organizations are rapidly expanding their use of generative AI across all parts of the business. Applications requiring deep domain expertise or specific business context need models that truly understand their proprietary knowledge, workflows, and unique requirements.

While techniques like prompt engineering and Retrieval Augmented Generation (RAG) work well for many use cases, they have fundamental limitations when it comes to embedding specialized knowledge into a model’s core understanding. Supervised fine-tuning and reinforcement learning help in customizing the model, but they operate too late in the development lifecycle, layering modifications on top of models that are fully trained and therefore difficult to steer to specific domains of interest.

When organizations attempt deeper customization through Continued Pre-Training (CPT) using only their proprietary data, they often encounter catastrophic forgetting, where models lose their foundational capabilities as they learn new content. At the same time, the data, compute, and cost needed for training a model from scratch are still a prohibitive barrier for most organizations.

Today, we’re introducing Amazon Nova Forge, a new service to build your own frontier models using Nova. Nova Forge customers can start their development from early model checkpoints, blend their datasets with Amazon Nova-curated training data, and host their custom models securely on AWS. Nova Forge is the easiest and most cost-effective way to build your own frontier model.

Use cases and applications
Nova Forge is designed for organizations with access to proprietary or industry-specific data who want to build AI that truly understands their domain. This includes:

Manufacturing and automation – Building models that understand specialized processes, equipment data, and industry-specific workflows
Research and development – Creating models trained on proprietary research data and domain-specific knowledge
Content and media – Developing models that understand brand voice, content standards, and specific moderation requirements
Specialized industries – Training models on industry-specific terminology, regulations, and best practices

Depending on the specific use cases, Nova Forge can be used to add differentiated capabilities, enhance task-specific accuracy, reduce costs, and lower latency.

How Nova Forge works
Nova Forge addresses the limitations of current customization approaches by allowing you to start model development from early checkpoints across pre-training, mid-training, and post-training phases. You can blend your proprietary data with Amazon Nova-curated data throughout all training phases, running training using proven recipes on Amazon SageMaker AI fully managed infrastructure. This data mixing approach significantly reduces catastrophic forgetting compared to training with raw data alone, helping preserve foundational skills—including core intelligence, general instruction following capabilities, and safety benefits—while incorporating your specialized knowledge.
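
Data mixing can be pictured as drawing every training batch from two pools at a fixed ratio. The following is a minimal sketch under that assumption; the function name, the 25% curated fraction, and the sample records are hypothetical, and in Nova Forge the actual blend is configured through the training recipes rather than hand-written code.

```python
import random


def mixed_batches(proprietary, curated, curated_fraction, batch_size,
                  num_batches, seed=0):
    """Yield batches that blend proprietary and curated samples."""
    rng = random.Random(seed)
    n_curated = round(batch_size * curated_fraction)
    n_proprietary = batch_size - n_curated
    for _ in range(num_batches):
        # Sample with replacement from each pool, then interleave.
        batch = (rng.choices(proprietary, k=n_proprietary)
                 + rng.choices(curated, k=n_curated))
        rng.shuffle(batch)
        yield batch


proprietary_docs = ["prop-1", "prop-2", "prop-3"]     # your domain data
curated_docs = ["curated-1", "curated-2"]             # general-purpose data

batches = list(mixed_batches(proprietary_docs, curated_docs,
                             curated_fraction=0.25, batch_size=8,
                             num_batches=3))
for batch in batches:
    print(batch)
```

Keeping some general-purpose samples in every batch is what lets the model keep rehearsing foundational skills while it learns the new domain.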

Nova Forge provides the ability to use reward functions in your own environment for reinforcement learning (RL). This allows the model to learn from feedback generated in environments that are representative of your use cases. Beyond single-step evaluations, you can also use your own orchestrator to manage multi-turn rollouts, enabling RL training for complex agent workflows and sequential decision-making tasks. Whether you’re using chemistry tools to score molecular designs, or robotics simulations that reward efficient task completion and penalize collisions, you can connect your proprietary environments directly.
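
As an illustration of the robotics case, a reward function might score each simulated episode as sketched below. The `Episode` fields, the weighting, and the default values are assumptions made for this example and are not part of any Nova Forge API.

```python
from dataclasses import dataclass


@dataclass
class Episode:
    """Outcome of one simulated rollout (hypothetical structure)."""
    completed: bool
    steps_taken: int
    collisions: int


def reward(episode, max_steps=100, collision_penalty=0.5):
    """Reward efficient task completion and penalize collisions."""
    score = 0.0
    if episode.completed:
        # 1.0 for finishing, plus up to 1.0 extra for finishing quickly.
        used = min(episode.steps_taken, max_steps)
        score += 1.0 + (max_steps - used) / max_steps
    # Each collision subtracts a fixed penalty, finished or not.
    score -= collision_penalty * episode.collisions
    return score


print(reward(Episode(completed=True, steps_taken=50, collisions=0)))
print(reward(Episode(completed=True, steps_taken=50, collisions=1)))
print(reward(Episode(completed=False, steps_taken=100, collisions=2)))
```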

You can also take advantage of the built-in responsible AI toolkit available in Nova Forge to configure the safety and content moderation settings of your model. You can adjust settings to meet your specific business needs in areas like safety, security, and handling of sensitive content.

Getting started with Nova Forge
Nova Forge integrates seamlessly with your existing AWS workflows. You can use the familiar tools and infrastructure in Amazon SageMaker AI to run your training, then import your custom Nova models as private models on Amazon Bedrock. This gives you the same security, consistent APIs, and broader AWS integrations as any model in Amazon Bedrock.

In Amazon SageMaker Studio, you can now build your frontier model with Amazon Nova.

To start building the model, choose which checkpoint to use: pre-trained, mid-trained, or post-trained. You can also upload your dataset here or use existing datasets.

You can blend your training data by mixing in curated datasets provided by Nova. These datasets, categorized by domain, can help your model to preserve general performance and prevent overfitting or catastrophic forgetting.

Optionally, you can choose to use Reinforcement Fine-Tuning (RFT) to improve factual accuracy and reduce hallucinations in specific domains.

When training completes, import the model into Amazon Bedrock and start using it in your applications.

Things to know
Amazon Nova Forge is available in the US East (N. Virginia) AWS Region. The program includes access to multiple Nova model checkpoints, training recipes to mix proprietary data with Amazon Nova-curated training data, proven training recipes, and integration with Amazon SageMaker AI and Amazon Bedrock.

Learn more in the Amazon Nova User Guide and explore Nova Forge from the Amazon SageMaker AI console.

Organizations interested in expert assistance can also reach out to our Generative AI Innovation Center for additional support with their model development initiatives.

— Danilo

AWS Weekly Roundup: Single GPU P5 instances, Advanced Go Driver, Amazon SageMaker HyperPod and more (August 18, 2025)

2025-08-18 Prasad Rao

Post Syndicated from Prasad Rao original https://aws.amazon.com/blogs/aws/aws-weekly-roundup-single-gpu-p5-instances-advanced-go-driver-amazon-sagemaker-hyperpod-and-more-august-18-2025/

Let me start this week’s update with something I’m especially excited about – the upcoming BeSA (Become a Solutions Architect) cohort. BeSA is a free mentoring program that I host along with a few other AWS employees on a volunteer basis to help people excel in their cloud careers. Last week, the instructors’ lineup was finalized for the 6-week cohort starting September 6. The cohort will focus on migration and modernization on AWS. Visit the BeSA website to learn more.

Another highlight for me last week was the announcement of six new AWS Heroes for their technical leadership and exceptional contributions to the AWS community. Read the full announcement to learn more about these community leaders.

Last week’s launches
Here are some launches from last week that got my attention:

Amazon EC2 Single GPU P5 instances are now generally available — You can right-size your machine learning (ML) and high performance computing (HPC) resources cost-effectively with the new Amazon Elastic Compute Cloud (Amazon EC2) P5 instance size with one NVIDIA H100 GPU.
AWS Advanced Go Driver is generally available — You can now use the AWS Advanced Go Driver with Amazon Relational Database Service (Amazon RDS) and Amazon Aurora PostgreSQL-Compatible and MySQL-Compatible database clusters for faster switchover and failover times, Federated Authentication, and authentication with AWS Secrets Manager or AWS Identity and Access Management (IAM). You can install the PostgreSQL and MySQL packages for Windows, Mac, or Linux, by following the installation guides in GitHub.
Expanded support for Cilium with Amazon EKS Hybrid Nodes — Cilium is a Cloud Native Computing Foundation (CNCF) graduated project that provides core networking capabilities for Kubernetes workloads. Now, you can receive support from AWS for a broader set of Cilium features when using Cilium with Amazon EKS Hybrid Nodes including application ingress, in-cluster load balancing, Kubernetes network policies, and kube-proxy replacement mode.
Amazon SageMaker AI now supports P6e-GB200 UltraServers — You can accelerate training and deployment of foundational models (FMs) at trillion-parameter scale by using up to 72 NVIDIA Blackwell GPUs under one NVLink domain with the new P6e-GB200 UltraServer support in Amazon SageMaker HyperPod and Model Training.
Amazon SageMaker HyperPod now supports fine-grained quota allocation of compute resources, topology-aware-scheduling of LLM tasks and custom Amazon Machine Images (AMIs) — You can allocate fine-grained compute quota for GPU, Trainium accelerator, vCPU, and vCPU memory within an instance to optimize compute resource distribution. With topology-aware scheduling, you can schedule your large language model (LLM) tasks on an optimal network topology to minimize network communication and enhance training efficiency. Using custom AMIs, you can deploy clusters with pre-configured, security-hardened environments that meet your specific organizational requirements.

Additional updates
Here are some additional news items and blog posts that I found interesting:

Celebrating 10 years of Amazon Aurora innovation — Join the livestream event on August 21, 2025, to celebrate a decade of Aurora database innovation.
AWS named as a Leader in 2025 Gartner Magic Quadrant for Strategic Cloud Platform Services — For the fifteenth consecutive year, Gartner has named AWS a Leader in the Gartner Magic Quadrant for Strategic Cloud Platform Services (SCPS), making AWS the longest-running Magic Quadrant Leader.
Introducing AWS Cloud Control API (CCAPI) MCP Server — You can now use natural language to manage cloud infrastructure with the CCAPI MCP Server, including creating, reading, updating, deleting, and listing resources.
Introducing Amazon Bedrock AgentCore Identity — AgentCore Identity provides a centralized capability to manage agent identities, secure credentials, and support seamless integration with AWS and third-party services through SigV4, standardized OAuth 2.0 flows, and API keys.
Introducing Amazon Bedrock AgentCore Gateway — A fully managed service to connect AI agents with tools and services. It serves as a centralized tool server, providing a unified interface where agents can discover, access, and invoke tools.

Upcoming AWS events
Check your calendars and sign up for upcoming AWS and AWS Community events:

AWS re:Invent 2025 (December 1-5, 2025, Las Vegas) — The AWS flagship annual conference offering collaborative innovation through peer-to-peer learning, expert-led discussions, and invaluable networking opportunities.
AWS Summits — Join free online and in-person events that bring the cloud computing community together to connect, collaborate, and learn about AWS. Coming up soon are summits in Johannesburg (August 20) and Toronto (September 4).
AWS Community Days — Join community-led conferences that feature technical discussions, workshops, and hands-on labs led by expert AWS users and industry leaders from around the world: Adria (September 5), Baltic (September 10), Aotearoa (September 18), and South Africa (September 20).

Join the AWS Builder Center to learn, build, and connect with builders in the AWS community. You can also browse upcoming in-person and virtual developer-focused events.

That’s all for this week. Check back next Monday for another Weekly Roundup!

– Prasad

Announcing Amazon Nova customization in Amazon SageMaker AI

2025-07-16 Betty Zheng (郑予彬)

Post Syndicated from Betty Zheng (郑予彬) original https://aws.amazon.com/blogs/aws/announcing-amazon-nova-customization-in-amazon-sagemaker-ai/

Today, we’re announcing a suite of customization capabilities for Amazon Nova in Amazon SageMaker AI. Customers can now customize Nova Micro, Nova Lite, and Nova Pro across the model training lifecycle, including pre-training, supervised fine-tuning, and alignment. These techniques are available as ready-to-use Amazon SageMaker recipes with seamless deployment to Amazon Bedrock, supporting both on-demand and provisioned throughput inference.

Amazon Nova foundation models power diverse generative AI use cases across industries. As customers scale deployments, they need models that reflect proprietary knowledge, workflows, and brand requirements. Prompt optimization and retrieval-augmented generation (RAG) work well for integrating general-purpose foundation models into applications; however, business-critical workflows require model customization to meet specific accuracy, cost, and latency requirements.

Choosing the right customization technique
Amazon Nova models support a range of customization techniques including: 1) supervised fine-tuning, 2) alignment, 3) continued pre-training, and 4) knowledge distillation. The optimal choice depends on goals, use case complexity, and the availability of data and compute resources. You can also combine multiple techniques to achieve your desired outcomes with the preferred mix of performance, cost, and flexibility.

Supervised fine-tuning (SFT) customizes model parameters using a training dataset of input-output pairs specific to your target tasks and domains. Choose from the following two implementation approaches based on data volume and cost considerations:

Parameter-efficient fine-tuning (PEFT) — updates only a subset of model parameters through lightweight adapter layers such as LoRA (Low-Rank Adaptation). It offers faster training and lower compute costs compared to full fine-tuning. PEFT-adapted Nova models are imported to Amazon Bedrock and invoked using on-demand inference.
Full fine-tuning (FFT) — updates all the parameters of the model and is ideal for scenarios when you have extensive training datasets (tens of thousands of records). Nova models customized through FFT can also be imported to Amazon Bedrock and invoked for inference with provisioned throughput.
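
The difference between the two approaches shows up in the number of trainable parameters. The sketch below illustrates the general LoRA idea in plain Python: the base weight matrix W stays frozen and only two small matrices A and B are trained. The layer size, rank, and helper names are hypothetical and say nothing about how Nova models are structured internally.

```python
def full_params(d_out, d_in):
    """Parameters updated by full fine-tuning of one weight matrix."""
    return d_out * d_in


def lora_params(d_out, d_in, rank):
    """Parameters updated by LoRA: A is rank x d_in, B is d_out x rank."""
    return rank * (d_in + d_out)


def matvec(matrix, vector):
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


def lora_forward(W, A, B, x, alpha=1.0):
    """Compute W x + (alpha / rank) * B (A x) with W kept frozen."""
    rank = len(A)
    base = matvec(W, x)
    update = matvec(B, matvec(A, x))
    return [b + (alpha / rank) * u for b, u in zip(base, update)]


# Hypothetical 4096 x 4096 layer adapted with rank 8.
print(full_params(4096, 4096), "parameters with full fine-tuning")
print(lora_params(4096, 4096, 8), "parameters with LoRA")

# Tiny numeric example: 2 x 2 identity base weight, rank-1 adapter.
W = [[1.0, 0.0], [0.0, 1.0]]
A = [[1.0, 1.0]]
B = [[1.0], [2.0]]
print(lora_forward(W, A, B, [1.0, 2.0]))
```

In this example the adapter trains 256 times fewer parameters than full fine-tuning for the same layer, which is where the lower compute cost comes from.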

Alignment steers the model output towards desired preferences for product-specific needs and behavior, such as company brand and customer experience requirements. These preferences may be encoded in multiple ways, including empirical examples and policies. Nova models support two preference alignment techniques:

Direct preference optimization (DPO) — offers a straightforward way to tune model outputs using preferred/not preferred response pairs. DPO learns from comparative preferences to optimize outputs for subjective requirements such as tone and style. DPO offers both a parameter-efficient version and a full-model update version. The parameter-efficient version supports on-demand inference.
Proximal policy optimization (PPO) — uses reinforcement learning to enhance model behavior by optimizing for desired rewards such as helpfulness, safety, or engagement. A reward model guides optimization by scoring outputs, helping the model learn effective behaviors while maintaining previously learned capabilities.
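
For readers who want to see what learning from preferred/not preferred pairs means numerically, here is a minimal sketch of the standard DPO loss for a single pair. The log-probability values and the `beta` setting are hypothetical, and the SageMaker recipes handle this computation for you.

```python
import math


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """DPO loss for one preference pair.

    Each argument is the summed log-probability of a full response under
    either the model being trained (policy) or the frozen reference model.
    """
    chosen_margin = policy_chosen_logp - ref_chosen_logp
    rejected_margin = policy_rejected_logp - ref_rejected_logp
    logits = beta * (chosen_margin - rejected_margin)
    # -log(sigmoid(logits)) written as log(1 + exp(-logits)).
    return math.log1p(math.exp(-logits))


# Policy identical to the reference: no preference learned yet.
start = dpo_loss(-12.0, -15.0, -12.0, -15.0)

# Policy now favors the preferred response more than the reference does.
improved = dpo_loss(-10.0, -18.0, -12.0, -15.0)

print(f"loss at start: {start:.4f}, after improvement: {improved:.4f}")
```

The loss falls as the policy raises the likelihood of the preferred response relative to the rejected one, measured against the reference model.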

Continued pre-training (CPT) expands foundational model knowledge through self-supervised learning on large quantities of unlabeled proprietary data, including internal documents, transcripts, and business-specific content. CPT followed by SFT and alignment through DPO or PPO provides a comprehensive way to customize Nova models for your applications.

Knowledge distillation transfers knowledge from a larger “teacher” model to a smaller, faster, and more cost-efficient “student” model. Distillation is useful in scenarios where customers do not have adequate reference input-output samples and can leverage a more powerful model to augment the training data. This process creates a customized model of teacher-level accuracy for specific use cases and student-level cost-effectiveness and speed.

Here is a table summarizing the available customization techniques across different modalities and deployment options. Each technique offers specific training and inference capabilities depending on your implementation requirements.

Recipe	Modality	Training: Amazon Bedrock	Training: Amazon SageMaker	Inference: Amazon Bedrock On-demand	Inference: Amazon Bedrock Provisioned Throughput
Supervised fine-tuning	Text, image, video
Parameter-efficient fine-tuning (PEFT)
Full fine-tuning
Direct preference optimization (DPO)	Text, image, video
Parameter-efficient DPO
Full model DPO
Proximal policy optimization (PPO)	Text-only
Continued pre-training	Text-only
Distillation	Text-only

Early access customers, including Cosine AI, Massachusetts Institute of Technology (MIT) Computer Science and Artificial Intelligence Laboratory (CSAIL), Volkswagen, Amazon Customer Service, and Amazon Catalog Systems Service, are already successfully using Amazon Nova customization capabilities.

Customizing Nova models in action
The following walks you through an example of customizing the Nova Micro model using direct preference optimization on an existing preference dataset. To do this, you can use Amazon SageMaker Studio.

Launch your SageMaker Studio in the Amazon SageMaker AI console and choose JumpStart, a machine learning (ML) hub with foundation models, built-in algorithms, and pre-built ML solutions that you can deploy with a few clicks.

Then, choose Nova Micro, a text-only model that delivers the lowest latency responses at the lowest cost per inference among the Nova model family, and then choose Train.

Next, you can choose a fine-tuning recipe to train the model with labeled data to enhance performance on specific tasks and align with desired behaviors. The Direct Preference Optimization recipe offers a straightforward way to tune model outputs with your preferences.

When you choose Open sample notebook, you have two environment options to run the recipe: SageMaker training jobs or SageMaker HyperPod.

Choose Run recipe on SageMaker training jobs when you don’t need to create a cluster, then select your JupyterLab space to train the model with the sample notebook.

Alternatively, if you want a persistent cluster environment optimized for iterative training processes, choose Run recipe on SageMaker HyperPod. Choose a HyperPod EKS cluster with at least one restricted instance group (RIG), which provides the specialized isolated environment required for Nova model training. Then, choose your JupyterLab space and Open sample notebook.

This notebook provides an end-to-end walkthrough for creating a SageMaker HyperPod job using a SageMaker Nova model with a recipe and deploying it for inference. With the help of a SageMaker HyperPod recipe, you can streamline complex configurations and seamlessly integrate datasets for optimized training jobs.

In SageMaker Studio, you can see that your SageMaker HyperPod job has been successfully created and you can monitor it for further progress.

After your job completes, you can use a benchmark recipe to evaluate if the customized model performs better on agentic tasks.

For comprehensive documentation and additional example implementations, visit the SageMaker HyperPod recipes repository on GitHub. We continue to expand the recipes based on customer feedback and emerging ML trends, ensuring you have the tools needed for successful AI model customization.

Availability and getting started
Recipes for Amazon Nova on Amazon SageMaker AI are available in US East (N. Virginia). Learn more about this feature by visiting the Amazon Nova customization webpage and Amazon Nova user guide and get started in the Amazon SageMaker AI console.

–Betty

Accelerate foundation model training and fine-tuning with new Amazon SageMaker HyperPod recipes

2024-12-04 Channy Yun (윤석찬)

Post Syndicated from Channy Yun (윤석찬) original https://aws.amazon.com/blogs/aws/accelerate-foundation-model-training-and-fine-tuning-with-new-amazon-sagemaker-hyperpod-recipes/

Today, we’re announcing the general availability of Amazon SageMaker HyperPod recipes to help data scientists and developers of all skill sets to get started training and fine-tuning foundation models (FMs) in minutes with state-of-the-art performance. They can now access optimized recipes for training and fine-tuning popular publicly available FMs such as Llama 3.1 405B, Llama 3.2 90B, or Mixtral 8x22B.

At AWS re:Invent 2023, we introduced SageMaker HyperPod to reduce time to train FMs by up to 40 percent and scale across more than a thousand compute resources in parallel with preconfigured distributed training libraries. With SageMaker HyperPod, you can find the required accelerated compute resources for training, create optimal training plans, and run training workloads across different blocks of capacity based on the availability of compute resources.

SageMaker HyperPod recipes include a training stack tested by AWS, removing tedious work experimenting with different model configurations, eliminating weeks of iterative evaluation and testing. The recipes automate several critical steps, such as loading training datasets, applying distributed training techniques, automating checkpoints for faster recovery from faults, and managing the end-to-end training loop.

With a simple recipe change, you can seamlessly switch between GPU- or Trainium-based instances to further optimize training performance and reduce costs. You can easily run workloads in production on SageMaker HyperPod or SageMaker training jobs.

SageMaker HyperPod recipes in action
To get started, visit the SageMaker HyperPod recipes GitHub repository to browse training recipes for popular publicly available FMs.

You only need to edit a few straightforward recipe parameters, such as the instance type and the location of your dataset in the cluster configuration, then run the recipe with a single-line command to achieve state-of-the-art performance.

After cloning the repository, edit the recipe config.yaml file to specify the model and cluster type.

$ git clone --recursive https://github.com/aws/sagemaker-hyperpod-recipes.git
$ cd sagemaker-hyperpod-recipes
$ pip3 install -r requirements.txt
$ cd ./recipes_collection
$ vim config.yaml

The recipes support SageMaker HyperPod with Slurm, SageMaker HyperPod with Amazon Elastic Kubernetes Service (Amazon EKS), and SageMaker training jobs. For example, you can set up a cluster type (Slurm orchestrator), a model name (Meta Llama 3.1 405B language model), an instance type (ml.p5.48xlarge), and your data locations, such as where to store the training data, results, and logs.

defaults:
- cluster: slurm # support: slurm / k8s / sm_jobs
- recipes: fine-tuning/llama/hf_llama3_405b_seq8k_gpu_qlora # name of model to be trained
debug: False # set to True to debug the launcher configuration
instance_type: ml.p5.48xlarge # or other supported cluster instances
base_results_dir: # Location(s) to store the results, checkpoints, logs etc.

You can optionally adjust model-specific training parameters in this YAML file, which outlines the optimal configuration, including the number of accelerator devices, instance type, training precision, parallelization and sharding techniques, the optimizer, and logging to monitor experiments through TensorBoard.

run:
  name: llama-405b
  results_dir: ${base_results_dir}/${.name}
  time_limit: "6-00:00:00"
restore_from_path: null
trainer:
  devices: 8
  num_nodes: 2
  accelerator: gpu
  precision: bf16
  max_steps: 50
  log_every_n_steps: 10
  ...
exp_manager:
  exp_dir: # location for TensorBoard logging
  name: helloworld 
  create_tensorboard_logger: True
  create_checkpoint_callback: True
  checkpoint_callback_params:
    ...
  auto_checkpoint: True # for automated checkpointing
use_smp: True 
distributed_backend: smddp # optimized collectives
# Start training from pretrained model
model:
  model_type: llama_v3
  train_batch_size: 4
  tensor_model_parallel_degree: 1
  expert_model_parallel_degree: 1
  # other model-specific params

To run this recipe in SageMaker HyperPod with Slurm, you must prepare the SageMaker HyperPod cluster following the cluster setup instructions.

Then, connect to the SageMaker HyperPod head node, access the Slurm controller, and copy the edited recipe. Next, you run a helper file to generate a Slurm submission script for the job that you can use for a dry run to inspect the content before starting the training job.

$ python3 main.py --config-path recipes_collection --config-name=config

After training completion, the trained model is automatically saved to your assigned data location.

To run this recipe on SageMaker HyperPod with Amazon EKS, clone the recipe from the GitHub repository, install the requirements, and edit the recipe (cluster: k8s) on your laptop. Then, create a link between your laptop and the running EKS cluster, and use the HyperPod Command Line Interface (CLI) to run the recipe.

$ hyperpod start-job --recipe fine-tuning/llama/hf_llama3_405b_seq8k_gpu_qlora \
--persistent-volume-claims fsx-claim:data \
--override-parameters \
'{
  "recipes.run.name": "hf-llama3-405b-seq8k-gpu-qlora",
  "recipes.exp_manager.exp_dir": "/data/<your_exp_dir>",
  "cluster": "k8s",
  "cluster_type": "k8s",
  "container": "658645717510.dkr.ecr.<region>.amazonaws.com/smdistributed-modelparallel:2.4.1-gpu-py311-cu121",
  "recipes.model.data.train_dir": "<your_train_data_dir>",
  "recipes.model.data.val_dir": "<your_val_data_dir>"
}'
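
Hand-written JSON is easy to get wrong, because strict JSON does not allow a trailing comma after the last entry. One option is to generate the override payload from a Python dictionary. The sketch below only builds the argument list and prints it. The helper name is hypothetical, the flags are the ones used in the command above, and the placeholder values are kept as placeholders.

```python
import json
import shlex


def build_start_job_command(recipe, pvc, overrides):
    """Return the argv list for `hyperpod start-job` with JSON-encoded overrides."""
    return [
        "hyperpod", "start-job",
        "--recipe", recipe,
        "--persistent-volume-claims", pvc,
        "--override-parameters", json.dumps(overrides),
    ]


overrides = {
    "recipes.run.name": "hf-llama3-405b-seq8k-gpu-qlora",
    "recipes.exp_manager.exp_dir": "/data/<your_exp_dir>",  # placeholder from the post
    "cluster": "k8s",
    "cluster_type": "k8s",
}

argv = build_start_job_command(
    "fine-tuning/llama/hf_llama3_405b_seq8k_gpu_qlora",
    "fsx-claim:data",
    overrides,
)

# shlex.join quotes the JSON payload so the line can be pasted into a shell.
print(shlex.join(argv))
```

Passing `argv` to `subprocess.run` would submit the job from a script. The sketch stops at printing so that the command can be inspected first.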

You can also run a recipe on SageMaker training jobs using the SageMaker Python SDK. The following example runs a PyTorch training job on SageMaker and overrides selected values in the training recipe.

from sagemaker.pytorch import PyTorch
...
# Nested values that replace the matching keys in the selected recipe
recipe_overrides = {
    "run": {
        "results_dir": "/opt/ml/model",
    },
    "exp_manager": {
        "exp_dir": "",
        "explicit_log_dir": "/opt/ml/output/tensorboard",
        "checkpoint_dir": "/opt/ml/checkpoints",
    },
    "model": {
        "data": {
            "train_dir": "/opt/ml/input/data/train",
            "val_dir": "/opt/ml/input/data/val",
        },
    },
}
# <output_path> and <role> are placeholders to replace with your own values
pytorch_estimator = PyTorch(
    output_path=<output_path>,
    base_job_name="llama-recipe",
    role=<role>,
    instance_type="ml.p5.48xlarge",
    training_recipe="fine-tuning/llama/hf_llama3_405b_seq8k_gpu_qlora",
    recipe_overrides=recipe_overrides,
    sagemaker_session=sagemaker_session,
    tensorboard_output_config=tensorboard_output_config,
)
...
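
To make the effect of nested overrides concrete, here is a minimal sketch of a recursive merge in plain Python. It illustrates how a nested override dictionary is commonly combined with a base configuration. It is not the SageMaker Python SDK's implementation, and `apply_overrides` and the trimmed `base_recipe` are hypothetical names and values used only for this example.

```python
import copy


def apply_overrides(base, overrides):
    """Return a copy of `base` with `overrides` merged in, recursing into nested dicts."""
    merged = copy.deepcopy(base)
    for key, value in overrides.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = apply_overrides(merged[key], value)
        else:
            merged[key] = copy.deepcopy(value)
    return merged


# Trimmed, hypothetical stand-in for a recipe file.
base_recipe = {
    "run": {"name": "llama-405b", "results_dir": "${base_results_dir}/${.name}"},
    "exp_manager": {"exp_dir": None, "create_tensorboard_logger": True},
}
recipe_overrides = {
    "run": {"results_dir": "/opt/ml/model"},
    "exp_manager": {"exp_dir": "", "checkpoint_dir": "/opt/ml/checkpoints"},
}

effective = apply_overrides(base_recipe, recipe_overrides)
print(effective["run"])
# {'name': 'llama-405b', 'results_dir': '/opt/ml/model'}
```

Keys that the override does not mention, such as `run.name`, keep their recipe values, which is why only the paths that differ on SageMaker training jobs need to be listed.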

As training progresses, the model checkpoints are stored on Amazon Simple Storage Service (Amazon S3) with the fully automated checkpointing capability, enabling faster recovery from training faults and instance restarts.

Now available
Amazon SageMaker HyperPod recipes are now available in the SageMaker HyperPod recipes GitHub repository. To learn more, visit the SageMaker HyperPod product page and the Amazon SageMaker AI Developer Guide.

Give SageMaker HyperPod recipes a try and send feedback to AWS re:Post for SageMaker or through your usual AWS Support contacts.

— Channy

Meet your training timelines and budgets with new Amazon SageMaker HyperPod flexible training plans

2024-12-04 Channy Yun (윤석찬)

Post Syndicated from Channy Yun (윤석찬) original https://aws.amazon.com/blogs/aws/meet-your-training-timelines-and-budgets-with-new-amazon-sagemaker-hyperpod-flexible-training-plans/

Today, we’re announcing the general availability of Amazon SageMaker HyperPod flexible training plans to help data scientists train large foundation models (FMs) within their timelines and budgets and save them weeks of effort in managing the training process based on compute availability.

At AWS re:Invent 2023, we introduced SageMaker HyperPod to reduce the time to train FMs by up to 40 percent and scale across thousands of compute resources in parallel with preconfigured distributed training libraries and built-in resiliency. Most generative AI model development tasks need accelerated compute resources in parallel. Our customers struggle to find timely access to compute resources to complete their training within their timeline and budget constraints.

With today’s announcement, you can find the accelerated compute resources required for training, create optimal training plans, and run training workloads across different blocks of capacity based on the availability of the compute resources. Within a few steps, you can specify your training completion date, budget, and compute resource requirements, create optimal training plans, and run fully managed training jobs without manual intervention.

SageMaker HyperPod training plans in action
To get started, go to the Amazon SageMaker AI console, choose Training plans in the left navigation pane, and choose Create training plan.

For example, choose your preferred training date and duration (10 days) and the instance type and count (16 ml.p5.48xlarge) for your SageMaker HyperPod cluster, and then choose Find training plan.

SageMaker HyperPod suggests a training plan that is split into two five-day segments. This includes the total upfront price for the plan.
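
For a sense of scale, the capacity reserved by such a plan can be worked out with simple arithmetic. The sketch below is a back-of-the-envelope calculation, not an AWS API call. The function name is hypothetical, and it relies on each ml.p5.48xlarge instance having 8 GPUs.

```python
GPUS_PER_P5_48XLARGE = 8  # each ml.p5.48xlarge instance has 8 NVIDIA H100 GPUs


def plan_capacity(instance_count, segment_days, gpus_per_instance):
    """Summarize the reserved capacity of a plan made of several segments."""
    total_hours = sum(segment_days) * 24
    instance_hours = instance_count * total_hours
    return {
        "total_hours": total_hours,
        "instance_hours": instance_hours,
        "gpu_hours": instance_hours * gpus_per_instance,
    }


# 16 instances over two five-day segments, as in the example above.
summary = plan_capacity(16, [5, 5], GPUS_PER_P5_48XLARGE)
print(summary)
# {'total_hours': 240, 'instance_hours': 3840, 'gpu_hours': 30720}
```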

If you accept this training plan, add your training details in the next step and choose Create your plan.

After creating your training plan, you can see the list of training plans. When you’ve created a training plan, you have to pay upfront for the plan within 12 hours. One plan is in the Active state and already started, with all the instances being used. The second plan is Scheduled to start later, but you can already submit jobs that start automatically when the plan begins.

In the Active state, the compute resources are available in SageMaker HyperPod, resume automatically after pauses in availability, and terminate at the end of the plan. In this example, the first segment is currently running and another segment is queued to run after it.

This is similar to Managed Spot Training in SageMaker AI, where SageMaker AI takes care of instance interruptions and continues the training with no manual intervention. To learn more, visit SageMaker HyperPod training plans in the Amazon SageMaker AI Developer Guide.

Now available
Amazon SageMaker HyperPod training plans are now available in the US East (N. Virginia), US East (Ohio), and US West (Oregon) AWS Regions and support ml.p4d.48xlarge, ml.p5.48xlarge, ml.p5e.48xlarge, ml.p5en.48xlarge, and ml.trn2.48xlarge instances. Trn2 and P5en instances are available only in the US East (Ohio) Region. To learn more, visit the SageMaker HyperPod product page and SageMaker AI pricing page.

Give HyperPod training plans a try in the Amazon SageMaker AI console and send feedback to AWS re:Post for SageMaker AI or through your usual AWS Support contacts.

— Channy

Maximize accelerator utilization for model development with new Amazon SageMaker HyperPod task governance

2024-12-04 Channy Yun (윤석찬)

Post Syndicated from Channy Yun (윤석찬) original https://aws.amazon.com/blogs/aws/maximize-accelerator-utilization-for-model-development-with-new-amazon-sagemaker-hyperpod-task-governance/

Today, we’re announcing the general availability of Amazon SageMaker HyperPod task governance, a new innovation to easily and centrally manage and maximize GPU and Trainium utilization across generative AI model development tasks, such as training, fine-tuning, and inference.

Customers tell us that they’re rapidly increasing investment in generative AI projects, but they face challenges in efficiently allocating limited compute resources. The lack of dynamic, centralized governance for resource allocation leads to inefficiencies, with some projects underutilizing resources while others stall. This situation burdens administrators with constant replanning, causes delays for data scientists and developers, and results in untimely delivery of AI innovations and cost overruns due to inefficient use of resources.

With SageMaker HyperPod task governance, you can accelerate time to market for AI innovations while avoiding cost overruns due to underutilized compute resources. With a few steps, administrators can set up quotas governing compute resource allocation based on project budgets and task priorities. Data scientists or developers can create tasks such as model training, fine-tuning, or evaluation, which SageMaker HyperPod automatically schedules and executes within allocated quotas.

SageMaker HyperPod task governance manages resources, automatically freeing up compute from lower-priority tasks when high-priority tasks need immediate attention. It does this by pausing low-priority training tasks, saving checkpoints, and resuming them later when resources become available. Additionally, idle compute within a team’s quota can be automatically used to accelerate another team’s waiting tasks.

Data scientists and developers can continuously monitor their task queues, view pending tasks, and adjust priorities as needed. Administrators can also monitor and audit scheduled tasks and compute resource usage across teams and projects and, as a result, they can adjust allocations to optimize costs and improve resource availability across the organization. This approach promotes timely completion of critical projects while maximizing resource efficiency.

Getting started with SageMaker HyperPod task governance
Task governance is available for Amazon EKS clusters in HyperPod. Find Cluster Management under HyperPod Clusters in the Amazon SageMaker AI console for provisioning and managing clusters. As an administrator, you can streamline the operation and scaling of HyperPod clusters through this console.

When you choose a HyperPod cluster, you can see the new Dashboard, Tasks, and Policies tabs on the cluster detail page.

1. New dashboard
In the new dashboard, you can see an overview of cluster utilization, team-based metrics, and task-based metrics.

First, you can view both point-in-time and trend-based metrics for critical compute resources, including GPU, vCPU, and memory utilization, across all instance groups.

Next, you can gain comprehensive insights into team-specific resource management, focusing on GPU utilization versus compute allocation across teams. You can use customizable filters for teams and cluster instance groups to analyze metrics such as allocated GPUs/CPUs for tasks, borrowed GPUs/CPUs, and GPU/CPU utilization.

You can also assess task performance and resource allocation efficiency using metrics such as counts of running, pending, and preempted tasks, as well as average task runtime and wait time. To gain comprehensive observability into your SageMaker HyperPod cluster resources and software components, you can integrate with Amazon CloudWatch Container Insights or Amazon Managed Grafana.

2. Create and manage a cluster policy
To enable task prioritization and fair-share resource allocation, you can configure a cluster policy that prioritizes critical workloads and distributes idle compute across teams defined in compute allocations.

To configure priority classes and fair sharing of borrowed compute in cluster settings, choose Edit in the Cluster policy section.

You can define how tasks waiting in the queue are admitted: First-come-first-serve (the default) or Task ranking. When you choose task ranking, waiting tasks are admitted in the priority order defined in this cluster policy. Tasks in the same priority class are executed on a first-come-first-serve basis.
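
The two admission modes can be sketched as two sort orders. The following Python example is a simplified model, not the scheduler's implementation. The `PendingTask` fields, the numeric priority weights, and the policy strings are hypothetical names chosen for this illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PendingTask:
    name: str
    priority: int      # hypothetical weight; a larger value means more important
    submitted_at: int  # hypothetical submission sequence number


def admission_order(tasks, policy="fcfs"):
    """Return tasks in the order they would be admitted under the given policy."""
    if policy == "fcfs":
        return sorted(tasks, key=lambda t: t.submitted_at)
    if policy == "task-ranking":
        # Higher priority first; ties fall back to first-come-first-serve.
        return sorted(tasks, key=lambda t: (-t.priority, t.submitted_at))
    raise ValueError(f"unknown policy: {policy}")


queue = [
    PendingTask("evaluation", priority=1, submitted_at=1),
    PendingTask("fine-tuning", priority=5, submitted_at=2),
    PendingTask("training", priority=5, submitted_at=3),
    PendingTask("inference", priority=10, submitted_at=4),
]

print([t.name for t in admission_order(queue, "fcfs")])
# ['evaluation', 'fine-tuning', 'training', 'inference']
print([t.name for t in admission_order(queue, "task-ranking")])
# ['inference', 'fine-tuning', 'training', 'evaluation']
```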

You can also configure how idle compute is allocated across teams: First-come-first-serve or Fair-share by default. The fair-share setting enables teams to borrow idle compute based on their assigned weights, which are configured in relative compute allocations. This enables every team to get a fair share of idle compute to accelerate their waiting tasks.
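
As a minimal sketch of weight-based sharing, the function below splits a number of idle accelerators in proportion to team weights and hands out any remainder to the teams with the largest fractional claim. This is one common way to implement proportional allocation. The scheduler's actual algorithm may differ, and the team names and weights are made up for the example.

```python
def fair_share(idle_units, weights):
    """Split idle units proportionally to integer weights (largest-remainder method)."""
    total = sum(weights.values())
    if total <= 0:
        raise ValueError("at least one team needs a positive weight")
    # Whole units each team is entitled to, computed with integer arithmetic.
    shares = {team: idle_units * w // total for team, w in weights.items()}
    remainders = {team: idle_units * w % total for team, w in weights.items()}
    leftover = idle_units - sum(shares.values())
    # Hand the remaining units to the teams with the largest remainders.
    for team in sorted(weights, key=lambda t: remainders[t], reverse=True)[:leftover]:
        shares[team] += 1
    return shares


print(fair_share(8, {"research": 3, "platform": 1}))
# {'research': 6, 'platform': 2}
```

With first-come-first-serve sharing, by contrast, the first team to ask could take all eight idle accelerators regardless of weight.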

In the Compute allocation section of the Policies page, you can create and edit compute allocations to distribute compute resources among teams, enable settings that allow teams to lend and borrow idle compute, configure preemption of their own low-priority tasks, and assign fair-share weights to teams.

In the Team section, set a team name. A corresponding Kubernetes namespace is created for your data science and machine learning (ML) teams to use. You can set a fair-share weight for a more equitable distribution of unused capacity across your teams and enable the preemption option based on task priority, which allows higher-priority tasks to preempt lower-priority ones.

In the Compute section, you can add and allocate instance type quotas to teams. Additionally, you can allocate quotas for instance types not yet available in the cluster, allowing for future expansion.

You can enable teams to share idle compute resources by allowing them to lend their unused capacity to other teams. This borrowing model is reciprocal: teams can only borrow idle compute if they are also willing to share their own unused resources with others. You can also specify the borrow limit that enables teams to borrow compute resources over their allocated quota.
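
The reciprocal rule and the borrow limit can be expressed as a small calculation. The sketch below assumes that the borrow limit is a percentage of the team's allocated quota. That interpretation, the function name, and the numbers are assumptions for illustration only.

```python
def max_usable(quota, borrow_limit_pct, lends_idle_compute):
    """Upper bound on the compute a team can hold: its quota plus allowed borrowing."""
    if not lends_idle_compute:
        # Reciprocal model: a team that does not lend cannot borrow.
        return quota
    return quota + quota * borrow_limit_pct // 100


print(max_usable(8, 50, lends_idle_compute=True))   # 12
print(max_usable(8, 50, lends_idle_compute=False))  # 8
```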

3. Run your training task in SageMaker HyperPod cluster
As a data scientist, you can submit a training job and use the quota allocated for your team, using the HyperPod Command Line Interface (CLI) command. With the HyperPod CLI, you can start a job and specify the corresponding namespace that has the allocation.

$ hyperpod start-job --name smpv2-llama2 --namespace hyperpod-ns-ml-engineers
Successfully created job smpv2-llama2
$ hyperpod list-jobs --all-namespaces
{
 "jobs": [
  {
   "Name": "smpv2-llama2",
   "Namespace": "hyperpod-ns-ml-engineers",
   "CreationTime": "2024-09-26T07:13:06Z",
   "State": "Running",
   "Priority": "fine-tuning-priority"
  },
  ...
 ]
}
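
Because the job list is JSON, it is straightforward to post-process. The following sketch filters and summarizes output shaped like the listing above. The field names come from that listing. The second job and the helper names are hypothetical and are added only so that the example has more than one entry.

```python
import json
from collections import Counter

# Sample shaped like the `hyperpod list-jobs` output; the second job is hypothetical.
sample_output = """
{
 "jobs": [
  {"Name": "smpv2-llama2", "Namespace": "hyperpod-ns-ml-engineers",
   "CreationTime": "2024-09-26T07:13:06Z", "State": "Running",
   "Priority": "fine-tuning-priority"},
  {"Name": "example-eval", "Namespace": "hyperpod-ns-researchers",
   "CreationTime": "2024-09-26T08:00:00Z", "State": "Pending",
   "Priority": "evaluation-priority"}
 ]
}
"""


def jobs_in_namespace(jobs, namespace):
    """Return the names of jobs that belong to the given namespace."""
    return [job["Name"] for job in jobs if job["Namespace"] == namespace]


def count_by_state(jobs):
    """Count jobs per state, for example Running or Pending."""
    return Counter(job["State"] for job in jobs)


jobs = json.loads(sample_output)["jobs"]
print(jobs_in_namespace(jobs, "hyperpod-ns-ml-engineers"))  # ['smpv2-llama2']
print(dict(count_by_state(jobs)))  # {'Running': 1, 'Pending': 1}
```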

In the Tasks tab, you can see all tasks in your cluster. Each task has a different priority and capacity need according to its policy. If you submit another task with a higher priority, the existing lower-priority task is suspended so that the higher-priority task can run first.
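
The suspension behavior can be modeled as choosing which running tasks to pause when a more important task arrives. This is a simplified sketch under stated assumptions: priorities are plain integers, capacity is counted in GPUs, and the function and task names are hypothetical. It does not reproduce the scheduler's real decision logic.

```python
def tasks_to_suspend(running, incoming_priority, needed, free):
    """Pick lower-priority running tasks to suspend, lowest priority first.

    `running` is a list of (name, priority, gpus) tuples. Returns the names to
    suspend, an empty list if enough capacity is already free, or None if the
    incoming task cannot be given enough room.
    """
    shortfall = needed - free
    victims = []
    for name, priority, gpus in sorted(running, key=lambda task: task[1]):
        if shortfall <= 0:
            break
        if priority < incoming_priority:
            victims.append(name)
            shortfall -= gpus
    return victims if shortfall <= 0 else None


running = [("low-priority-training", 1, 8), ("fine-tuning", 5, 4)]

# A high-priority task that needs 8 GPUs arrives while no GPUs are free.
print(tasks_to_suspend(running, incoming_priority=10, needed=8, free=0))
# ['low-priority-training']

# A task with the lowest priority cannot preempt anything and keeps waiting.
print(tasks_to_suspend(running, incoming_priority=1, needed=8, free=0))
# None
```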

OK, now let’s check out a demo video showing what happens when a high-priority training task is added while running a low-priority task.

To learn more, visit SageMaker HyperPod task governance in the Amazon SageMaker AI Developer Guide.

Now available
Amazon SageMaker HyperPod task governance is now available in the US East (N. Virginia), US East (Ohio), and US West (Oregon) AWS Regions. You can use HyperPod task governance at no additional cost. To learn more, visit the SageMaker HyperPod product page.

Give HyperPod task governance a try in the Amazon SageMaker AI console and send feedback to AWS re:Post for SageMaker or through your usual AWS Support contacts.

— Channy

P.S. Special thanks to Nisha Nadkarni, a senior generative AI specialist solutions architect at AWS, for her contribution in creating a HyperPod testing environment.
