
Optimize AI/ML workloads for sustainability: Part 3, deployment and monitoring

Post Syndicated from Benoit de Chateauvieux original https://aws.amazon.com/blogs/architecture/optimize-ai-ml-workloads-for-sustainability-part-3-deployment-and-monitoring/

We’re celebrating Earth Day 2022 from 4/22 through 4/29 with posts that highlight how to build, maintain, and refine your workloads for sustainability.


AWS estimates that inference (the process of using a trained machine learning [ML] algorithm to make a prediction) makes up 90 percent of the cost of an ML model. Because with AWS you pay for what you use, we estimate that inference also generally accounts for most of the resource usage within an ML lifecycle.

In this series, we’re following the phases of the Well-Architected machine learning lifecycle (Figure 1) to optimize your artificial intelligence (AI)/ML workloads. In Part 3, our final piece in the series, we show you how to reduce the environmental impact of your ML workload once your model is in production.


If you missed the first parts of this series, in Part 1, we showed you how to examine your workload to help you 1) evaluate the impact of your workload, 2) identify alternatives to training your own model, and 3) optimize data processing. In Part 2, we identified ways to reduce the environmental impact of developing, training, and tuning ML models.


ML lifecycle

Figure 1. ML lifecycle

Deployment

Select sustainable AWS Regions

As mentioned in Part 1, select an AWS Region with sustainable energy sources. When regulations and legal aspects allow, choose Regions near Amazon renewable energy projects and Regions where the grid has low published carbon intensity to deploy your model.

Align SLAs with sustainability goals

Define SLAs that support your sustainability goals while meeting your business requirements, not exceeding them.

Use efficient silicon

For CPU-based ML inference, use AWS Graviton3. These processors offer the best performance per watt in Amazon Elastic Compute Cloud (Amazon EC2). They use up to 60% less energy than comparable EC2 instances. Graviton3 processors deliver up to three times better performance compared to Graviton2 processors for ML workloads, and they support bfloat16.

For deep learning workloads, the Amazon EC2 Inf1 instances (based on custom designed AWS Inferentia chips) deliver 2.3 times higher throughput and 80% lower cost compared to g4dn instances. Inf1 has 50% higher performance per watt than g4dn, which makes it the most sustainable ML accelerator Amazon EC2 offers.
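As an illustration, here is a minimal sketch of how the choice of silicon shows up as a deployment parameter in the SageMaker Python SDK. It assumes you already have a SageMaker Model whose container and artifact are compatible with the target hardware (for example, compiled for AWS Neuron when targeting Inf1); the image URI, bucket, and instance types are placeholders, and instance availability varies by Region.

```python
# Hedged sketch: the silicon you target is a deployment-time choice.
# The image URI, model artifact location, and instance types below are
# hypothetical placeholders.
import sagemaker
from sagemaker.model import Model

session = sagemaker.Session()

model = Model(
    image_uri="<inference-image-uri>",        # placeholder serving image
    model_data="s3://<bucket>/model.tar.gz",  # placeholder model artifact
    role=sagemaker.get_execution_role(),
    sagemaker_session=session,
)

# Deep learning inference on AWS Inferentia (Inf1); assumes the container
# and artifact are Neuron-compatible.
predictor = model.deploy(
    initial_instance_count=1,
    instance_type="ml.inf1.xlarge",
)

# For CPU-based inference, a Graviton-based instance type could be used
# instead, for example instance_type="ml.c7g.xlarge" (where available).
```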

Make efficient use of GPU

Use Amazon Elastic Inference to attach just the right amount of GPU-powered inference acceleration to any EC2 or SageMaker instance type or Amazon Elastic Container Service (Amazon ECS) task.

While training jobs batch process hundreds of data samples in parallel, inference jobs usually process a single input in real time, and thus consume a small amount of GPU compute. Elastic Inference allows you to reduce the cost and environmental impact of your inference by using GPU resources more efficiently.
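As a minimal sketch (assuming `model` is an existing SageMaker framework model whose framework version supports Elastic Inference), attaching an accelerator is a single parameter on the deploy call:

```python
# Hedged sketch: host the model on a CPU instance and attach a right-sized
# Elastic Inference accelerator instead of provisioning a full GPU instance.
# `model` is assumed to be an existing, EIA-compatible SageMaker model.
predictor = model.deploy(
    initial_instance_count=1,
    instance_type="ml.m5.large",        # CPU host instance
    accelerator_type="ml.eia2.medium",  # fractional GPU acceleration
)
```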

Optimize models for inference

Improve efficiency of your models by compiling them into optimized forms with the following:

  • Various open-source libraries (like Treelite for decision tree ensembles)
  • Third-party tools like Hugging Face Infinity, which allows you to speed up transformer models and run inference not only on GPU but also on CPU.
  • SageMaker Neo’s runtime consumes as little as one-tenth the footprint of a deep learning framework and optimizes models to perform up to 25 times faster with no loss in accuracy (example with XGBoost).

Deploying more efficient models means you need fewer resources for inference.
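For example, here is a minimal sketch of compiling a trained model with SageMaker Neo through the Python SDK. It assumes `estimator` is a fitted Estimator for a Neo-supported framework; the framework version, input shape, and S3 path are placeholders to adapt to your model.

```python
# Hedged sketch: compile a trained model with SageMaker Neo, then deploy the
# optimized artifact. `estimator` is assumed to be a fitted SageMaker
# Estimator; bucket, shape, and versions are hypothetical placeholders.
compiled_model = estimator.compile_model(
    target_instance_family="ml_c5",          # hardware to optimize for
    input_shape={"data": [1, 3, 224, 224]},  # example input tensor shape
    output_path="s3://<bucket>/compiled/",
    framework="pytorch",
    framework_version="1.8",
)

predictor = compiled_model.deploy(
    initial_instance_count=1,
    instance_type="ml.c5.large",
)
```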

Deploy multiple models behind a single endpoint

SageMaker provides three methods to deploy multiple models to a single endpoint to improve endpoint utilization:

  1. Host multiple models in one container behind one endpoint. Multi-model endpoints are served using a single container. This can help you cut up to 90 percent of your inference costs and carbon emissions.
  2. Host multiple models that use different containers behind one endpoint.
  3. Host a linear sequence of containers in an inference pipeline behind a single endpoint.

Sharing endpoint resources is more sustainable and less expensive than deploying a single model behind one endpoint.
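A minimal sketch of method 1, using the SageMaker Python SDK's MultiDataModel, could look like the following. It assumes all model artifacts share one serving container and live under a common S3 prefix; the names, image URI, and paths are placeholders.

```python
# Hedged sketch: serve many models from one endpoint and one container.
# Prefix, image URI, and model file names are hypothetical placeholders.
import sagemaker
from sagemaker.multidatamodel import MultiDataModel

mdm = MultiDataModel(
    name="shared-endpoint",
    model_data_prefix="s3://<bucket>/models/",  # model-a.tar.gz, model-b.tar.gz, ...
    image_uri="<inference-image-uri>",
    role=sagemaker.get_execution_role(),
)

predictor = mdm.deploy(initial_instance_count=1, instance_type="ml.m5.large")

# Each request names the model it needs; SageMaker loads it on demand.
# `payload` is a placeholder for your serialized request data.
predictor.predict(payload, target_model="model-a.tar.gz")
```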

Right-size your inference environment

Right-size your endpoints by using metrics from Amazon CloudWatch or by using the Amazon SageMaker Inference Recommender. This tool can run load testing jobs and recommend the proper instance type to host your model. When you use the appropriate instance type, you limit the carbon emission associated with over-provisioning.

If your workload has intermittent or unpredictable traffic, configure autoscaling inference endpoints in SageMaker to optimize your endpoints. Autoscaling monitors your endpoints and dynamically adjusts their capacity to maintain steady and predictable performance using as few resources as possible. You can also try Serverless Inference (in preview), which automatically launches compute resources and scales them in and out depending on traffic, which eliminates idle resources.
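As a hedged example, the following boto3 sketch registers a target-tracking scaling policy for an endpoint variant; the endpoint name, capacity bounds, and target invocation rate are placeholders to tune for your traffic profile.

```python
# Hedged sketch: scale a SageMaker endpoint variant based on invocations per
# instance. Endpoint/variant names and numbers are hypothetical placeholders.
import boto3

autoscaling = boto3.client("application-autoscaling")
resource_id = "endpoint/my-endpoint/variant/AllTraffic"

autoscaling.register_scalable_target(
    ServiceNamespace="sagemaker",
    ResourceId=resource_id,
    ScalableDimension="sagemaker:variant:DesiredInstanceCount",
    MinCapacity=1,
    MaxCapacity=4,
)

autoscaling.put_scaling_policy(
    PolicyName="invocations-target-tracking",
    ServiceNamespace="sagemaker",
    ResourceId=resource_id,
    ScalableDimension="sagemaker:variant:DesiredInstanceCount",
    PolicyType="TargetTrackingScaling",
    TargetTrackingScalingPolicyConfiguration={
        "TargetValue": 70.0,  # target invocations per instance per minute
        "PredefinedMetricSpecification": {
            "PredefinedMetricType": "SageMakerVariantInvocationsPerInstance"
        },
    },
)
```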

Consider inference at the edge

When working on Internet of Things (IoT) use cases, evaluate if ML inference at the edge can reduce the carbon footprint of your workload. To do this, consider factors like the compute capacity of your devices, their energy consumption, or the emissions related to data transfer to the cloud. When deploying ML models to edge devices, consider using SageMaker Edge Manager, which integrates with SageMaker Neo and AWS IoT Greengrass (Figure 2).

Run inference at the edge with SageMaker Edge

Figure 2. Run inference at the edge with SageMaker Edge

Device manufacturing represents 32-57 percent of the global Information and Communication Technology (ICT) carbon footprint. If your ML model is optimized, it requires fewer compute resources. You can then perform inference on lower specification machines, which minimizes the environmental impact of device manufacturing and uses less energy.

The following techniques compress the size of models for deployment, which speeds up inference and saves energy without significant loss of accuracy:

  • Pruning removes weights (learnable parameters) that don’t contribute much to the model.
  • Quantization represents numbers with low-bit integers without incurring significant loss in accuracy. Specifically, you can reduce resource usage by replacing the parameters in an inference model with half-precision (16 bit), bfloat16 (16 bit, but the same dynamic range as 32 bit), or 8-bit integers instead of the usual single-precision floating-point (32 bit) values, as shown in the sketch that follows this list.
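The following is a minimal sketch of the quantization technique using PyTorch's post-training dynamic quantization. It assumes `model` is a trained torch.nn.Module and quantizes only its linear layers, which is one of several possible approaches.

```python
# Hedged sketch: post-training dynamic quantization of a trained PyTorch
# model. Only linear layers are converted to 8-bit integer weights here.
import torch

quantized_model = torch.quantization.quantize_dynamic(
    model,               # trained FP32 model (assumed to exist)
    {torch.nn.Linear},   # layer types to quantize
    dtype=torch.qint8,   # 8-bit integer weights
)

torch.save(quantized_model.state_dict(), "model-int8.pt")
```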

Archive or delete unnecessary artifacts

Compress and reduce the volume of logs you keep during the inference phase. By default, CloudWatch retains logs indefinitely. By setting limited retention time for your inference logs, you’ll avoid the carbon footprint of unnecessary log storage. Also delete unused versions of your models and custom container images from your repositories.
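For example, a short boto3 sketch can cap log retention for an endpoint's log group; the log group name and retention period shown here are placeholders.

```python
# Hedged sketch: set a finite retention period on an endpoint's CloudWatch
# log group instead of keeping logs forever. Values are placeholders.
import boto3

logs = boto3.client("logs")
logs.put_retention_policy(
    logGroupName="/aws/sagemaker/Endpoints/my-endpoint",
    retentionInDays=30,
)
```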

Monitoring

Retrain only when necessary

Monitor your ML model in production and only retrain it when required. Models typically need retraining because of model drift, declining robustness, or the availability of new ground truth data. Instead of retraining on an arbitrary schedule, automate model drift detection and retrain only when your model’s predictive performance has fallen below defined KPIs.

Consider SageMaker Pipelines, the AWS Step Functions Data Science SDK for Amazon SageMaker, or third-party tools to automate your retraining pipelines.
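As an illustration, the following sketch uses SageMaker Model Monitor to baseline your training data and schedule daily drift checks against captured endpoint traffic. Endpoint names, S3 paths, and the schedule are placeholders, and your retraining pipeline would be triggered only when a check reports constraint violations.

```python
# Hedged sketch: automated drift detection with SageMaker Model Monitor.
# Endpoint name, bucket, and dataset paths are hypothetical placeholders.
import sagemaker
from sagemaker.model_monitor import DefaultModelMonitor, CronExpressionGenerator
from sagemaker.model_monitor.dataset_format import DatasetFormat

monitor = DefaultModelMonitor(
    role=sagemaker.get_execution_role(),
    instance_count=1,
    instance_type="ml.m5.xlarge",
)

# Learn baseline statistics and constraints from the training data.
monitor.suggest_baseline(
    baseline_dataset="s3://<bucket>/train/train.csv",
    dataset_format=DatasetFormat.csv(header=True),
    output_s3_uri="s3://<bucket>/monitoring/baseline/",
)

# Check captured endpoint traffic against the baseline once a day.
monitor.create_monitoring_schedule(
    monitor_schedule_name="drift-check",
    endpoint_input="my-endpoint",
    output_s3_uri="s3://<bucket>/monitoring/reports/",
    statistics=monitor.baseline_statistics(),
    constraints=monitor.suggested_constraints(),
    schedule_cron_expression=CronExpressionGenerator.daily(),
)
```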

Measure results and improve

To monitor and quantify improvements during the inference phase, track the following metrics:

For storage:

Conclusion

AI/ML workloads can be energy intensive, but as called out by the UN and mentioned in the latest IPCC report, AI can contribute to mitigating climate change and to achieving several Sustainable Development Goals. As technology builders, it’s our responsibility to make sustainable use of AI and ML.

In this blog post series, we presented best practices you can use to make sustainability-conscious architectural decisions and reduce the environmental impact for your AI/ML workloads.

Other posts in this series

About the Well-Architected Framework

These practices are part of the Sustainability Pillar of the AWS Well-Architected Framework. AWS Well-Architected is a set of guiding design principles developed by AWS to help organizations build secure, high-performing, resilient, and efficient infrastructure for a variety of applications and workloads. Use the AWS Well-Architected Tool to review your workloads periodically to address important design considerations and ensure that they follow the best practices and guidance of the AWS Well-Architected Framework. For follow up questions or comments, join our growing community on AWS re:Post.

Optimize AI/ML workloads for sustainability: Part 2, model development

Post Syndicated from Benoit de Chateauvieux original https://aws.amazon.com/blogs/architecture/optimize-ai-ml-workloads-for-sustainability-part-2-model-development/

More complexity often means using more energy, and machine learning (ML) models are becoming bigger and more complex. And though ML hardware is getting more efficient, the energy required to train these ML models is increasing sharply.

In this series, we’re following the phases of the Well-Architected machine learning lifecycle (Figure 1) to optimize your artificial intelligence (AI)/ML workloads. In Part 2, we examine the model development phase and show you how to train, tune, and evaluate your ML model to help you reduce your carbon footprint.


If you missed the first part of this series, we showed you how to examine your workload to help you 1) evaluate the impact of your workload, 2) identify alternatives to training your own model, and 3) optimize data processing.


ML lifecycle

Figure 1. ML lifecycle

Model building

Define acceptable performance criteria

When you build an ML model, you’ll likely need to make trade-offs between your model’s accuracy and its carbon footprint. When we focus only on the model’s accuracy, we “ignore the economic, environmental, or social cost of reaching the reported accuracy.” Because the relationship between model accuracy and complexity is at best logarithmic, training a model longer or looking for better hyperparameters only leads to a small increase in performance.

Establish performance criteria that support your sustainability goals while meeting your business requirements, not exceeding them.

Select energy-efficient algorithms

Begin with a simple algorithm to establish a baseline. Then, test different algorithms with increasing complexity to observe whether performance has improved. If so, compare the performance gain against the difference in resources required.

Try to find simplified versions of algorithms. This will help you use fewer resources to achieve a similar outcome. For example, DistilBERT, a distilled version of BERT, has 40% fewer parameters, runs 60% faster, and preserves 97% of BERT’s performance.

Use pre-trained or partially pre-trained models

Consider techniques to avoid training a model from scratch:

  • Transfer Learning: Use a pre-trained source model and reuse it as the starting point for a second task. For example, a model trained on ImageNet (14 million images) can generalize with other datasets.
  • Incremental Training: Use artifacts from an existing model on an expanded dataset to train a new model.
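As a minimal illustration of transfer learning (independent of any particular AWS service), the following PyTorch sketch starts from a model pre-trained on ImageNet, freezes the feature extractor, and trains only a new classification head; the 10-class output layer is a placeholder for your target task.

```python
# Hedged sketch: transfer learning with torchvision. Only the new head is
# trained, so far less compute is needed than training from scratch.
import torch
import torch.nn as nn
from torchvision import models

model = models.resnet18(weights="IMAGENET1K_V1")  # pre-trained source model

for param in model.parameters():  # freeze the pre-trained weights
    param.requires_grad = False

model.fc = nn.Linear(model.fc.in_features, 10)  # new task-specific head

# Only the new head's parameters are updated by the optimizer.
optimizer = torch.optim.Adam(model.fc.parameters(), lr=1e-3)
```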

Optimize your deep learning models to accelerate training

Compile your DL models from their high-level language representation to hardware-optimized instructions to reduce training time. You can achieve this with open-source compilers or Amazon SageMaker Training Compiler, which can speed up training of DL models by up to 50% by more efficiently using SageMaker GPU instances.
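A minimal sketch of enabling SageMaker Training Compiler on a Hugging Face training job follows; the training script, framework versions, hyperparameters, and instance type are placeholders, and you should check the currently supported version combinations.

```python
# Hedged sketch: turn on Training Compiler for a Hugging Face training job.
# Script name, versions, and paths are hypothetical placeholders.
import sagemaker
from sagemaker.huggingface import HuggingFace, TrainingCompilerConfig

estimator = HuggingFace(
    entry_point="train.py",
    role=sagemaker.get_execution_role(),
    instance_type="ml.p3.2xlarge",
    instance_count=1,
    transformers_version="4.17",
    pytorch_version="1.10",
    py_version="py38",
    compiler_config=TrainingCompilerConfig(),  # compile the model for the GPU
)

estimator.fit({"train": "s3://<bucket>/train/"})
```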

Start with small experiments, datasets, and compute resources

Experiment with smaller datasets in your development notebook. This allows you to iterate quickly with limited carbon emission.

Automate the ML environment

When building your model, use Lifecycle Configuration Scripts to automatically stop idle SageMaker Notebook instances. If you are using SageMaker Studio, install the auto-shutdown Jupyter extension to detect and stop idle resources.

Use the fully managed training process provided by SageMaker to automatically launch training instances and shut them down as soon as the training job is complete. This minimizes idle compute resources and thus limits the environmental impact of your training job.

Adopt a serverless architecture for your MLOps pipelines. For example, orchestration tools like AWS Step Functions or SageMaker Pipelines only provision resources when work needs to be done. This way, you’re not maintaining compute infrastructure 24/7.

Model training

Select sustainable AWS Regions

As mentioned in Part 1, select an AWS Region with sustainable energy sources. When regulations and legal aspects allow, choose Regions near Amazon renewable energy projects and Regions where the grid has low published carbon intensity to train your model.

Use a debugger

A debugger like SageMaker Debugger can identify training problems like system bottlenecks, overfitting, saturated activation functions, and under-utilization of system resources. It also provides built-in rules like LowGPUUtilization or Overfit. These rules monitor your workload and will automatically stop a training job as soon as it detects a bug (Figure 2), which helps you avoid unnecessary carbon emissions.

Automatically stop buggy training jobs with SageMaker Debugger

Figure 2. Automatically stop buggy training jobs with SageMaker Debugger
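For example, the following sketch attaches built-in Debugger rules with a stop-training action to an estimator; the image URI and instance type are placeholders, and the set of rules shown is only an illustration.

```python
# Hedged sketch: built-in Debugger rules that stop a wasteful training job
# automatically. Image URI and instance type are hypothetical placeholders.
import sagemaker
from sagemaker.debugger import Rule, rule_configs
from sagemaker.estimator import Estimator

stop_on_problem = rule_configs.ActionList(rule_configs.StopTraining())

estimator = Estimator(
    image_uri="<training-image-uri>",
    role=sagemaker.get_execution_role(),
    instance_count=1,
    instance_type="ml.p3.2xlarge",
    rules=[
        Rule.sagemaker(rule_configs.low_gpu_utilization(), actions=stop_on_problem),
        Rule.sagemaker(rule_configs.overfit(), actions=stop_on_problem),
    ],
)
```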

Optimize the resources of your training environment

Reference the recommended instance types for the algorithm you’ve selected in the SageMaker documentation. For example, for DeepAR, you should start with a single CPU instance and only switch to GPU and multiple instances when necessary.

Right size your training jobs with Amazon CloudWatch metrics that monitor the utilization of resources like CPU, GPU, memory, and disk utilization.

Consider Managed Spot Training, which takes advantage of unused Amazon Elastic Compute Cloud (Amazon EC2) capacity and can save you up to 90% in cost compared to On-Demand instances. By shaping your demand for the existing supply of EC2 instance capacity, you will improve your overall resource efficiency and reduce idle capacity of the overall AWS Cloud.
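A minimal sketch of a Managed Spot Training job follows, with checkpointing so an interrupted job can resume; the image URI, time limits, and S3 paths are placeholders.

```python
# Hedged sketch: Managed Spot Training with checkpointing enabled.
# Image URI, bucket, and time limits are hypothetical placeholders.
import sagemaker
from sagemaker.estimator import Estimator

estimator = Estimator(
    image_uri="<training-image-uri>",
    role=sagemaker.get_execution_role(),
    instance_count=1,
    instance_type="ml.p3.2xlarge",
    use_spot_instances=True,
    max_run=3600,    # max training time in seconds
    max_wait=7200,   # max total time, including waiting for Spot capacity
    checkpoint_s3_uri="s3://<bucket>/checkpoints/",
)
```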

Use efficient silicon

Use AWS Trainium, which is optimized for DL training workloads. It is expected to be our most energy-efficient processor for this purpose.

Archive or delete unnecessary training artifacts

Organize your ML experiments with SageMaker Experiments to clean up training resources you no longer need.

Reduce the volume of logs you keep. By default, CloudWatch retains logs indefinitely. By setting limited retention time for your notebooks and training logs, you’ll avoid the carbon footprint of unnecessary log storage.

Model tuning and evaluation

Use efficient cross-validation techniques for hyperparameter optimization

Prefer Bayesian search over random search (and avoid grid search). Bayesian search makes intelligent guesses about the next set of parameters to pick based on the prior set of trials. It typically requires 10 times fewer jobs than random search, and thus 10 times less compute resources, to find the best hyperparameters.

Limit the maximum number of concurrent training jobs. Running hyperparameter tuning jobs concurrently gets more work done quickly. However, a tuning job improves only through successive rounds of experiments. Typically, running one training job at a time achieves the best results with the least amount of compute resources.

Carefully choose the number of hyperparameters and their ranges. You get better results and use less compute resources by limiting your search to a few parameters and small ranges of values. If you know that a hyperparameter is log-scaled, convert it to further improve the optimization.
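Putting these recommendations together, here is a hedged sketch of a Bayesian tuning job with a log-scaled learning-rate range and one training job at a time; the estimator, objective metric, and ranges are placeholders for your own algorithm.

```python
# Hedged sketch: Bayesian hyperparameter tuning, limited concurrency, and a
# log-scaled range. `estimator`, metric, and ranges are placeholders.
from sagemaker.tuner import HyperparameterTuner, ContinuousParameter, IntegerParameter

tuner = HyperparameterTuner(
    estimator=estimator,
    objective_metric_name="validation:auc",
    hyperparameter_ranges={
        "eta": ContinuousParameter(0.001, 0.3, scaling_type="Logarithmic"),
        "max_depth": IntegerParameter(3, 8),
    },
    strategy="Bayesian",   # default strategy, stated explicitly here
    max_jobs=20,
    max_parallel_jobs=1,   # successive rounds learn from each other
)

tuner.fit({"train": "s3://<bucket>/train/", "validation": "s3://<bucket>/val/"})
```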

Use warm-start hyperparameter tuning

Use warm-start to leverage the learning gathered in previous tuning jobs to inform which combinations of hyperparameters to search over in the new tuning job. This technique avoids restarting hyperparameter optimization jobs from scratch and thus reduces the compute resources needed.
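A minimal sketch of warm starting, reusing the tuner pattern above, might look like this; the parent tuning job name, estimator, and hyperparameter ranges are placeholders assumed to be defined elsewhere.

```python
# Hedged sketch: warm-start a new tuning job from a previous one so prior
# trials inform the new search. Names and variables are placeholders.
from sagemaker.tuner import HyperparameterTuner, WarmStartConfig, WarmStartTypes

warm_start = WarmStartConfig(
    warm_start_type=WarmStartTypes.IDENTICAL_DATA_AND_ALGORITHM,
    parents={"previous-tuning-job-name"},
)

tuner = HyperparameterTuner(
    estimator=estimator,                       # assumed to exist
    objective_metric_name="validation:auc",
    hyperparameter_ranges=hyperparameter_ranges,  # assumed to exist
    max_jobs=10,
    max_parallel_jobs=1,
    warm_start_config=warm_start,
)
```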

Measure results and improve

To monitor and quantify improvements of your training jobs, track the following metrics:

For storage:

Conclusion

In this blog post, we discussed techniques and best practices to reduce the energy required to build, train, and evaluate your ML models.

We also provided recommendations for the tuning process, as it makes up a large part of the carbon impact of building an ML model. During hyperparameter and neural architecture search, hundreds of versions of a given model are created, trained, and evaluated before an optimal design is identified.

In the next post, we’ll continue our sustainability journey through the ML lifecycle and discuss the best practices you can follow when deploying and monitoring your model in production.

Want to learn more? Check out the Sustainability Pillar of the AWS Well-Architected Framework, the Architecting for sustainability session at re:Invent 2021, and other blog posts on architecting for sustainability.

Looking for more architecture content? AWS Architecture Center provides reference architecture diagrams, vetted architecture solutions, Well-Architected best practices, patterns, icons, and more!

Other posts in this series

Optimize AI/ML workloads for sustainability: Part 1, identify business goals, validate ML use, and process data

Post Syndicated from Benoit de Chateauvieux original https://aws.amazon.com/blogs/architecture/optimize-ai-ml-workloads-for-sustainability-part-1-identify-business-goals-validate-ml-use-and-process-data/

Training artificial intelligence (AI) services and machine learning (ML) workloads uses a lot of energy—and they are becoming bigger and more complex. As an example, the Carbontracker: Tracking and Predicting the Carbon Footprint of Training Deep Learning Models study estimates that a single training session for a language model like GPT-3 can have a carbon footprint similar to traveling 703,808 kilometers by car.

Although ML uses a lot of energy, it is also one of the best tools we have to fight the effects of climate change. For example, we’ve used ML to help deliver food and pharmaceuticals safely and with much less waste, reduce the cost and risk involved in maintaining wind farms, restore at-risk ecosystems, and predict and understand extreme weather.

In this series of three blog posts, we’ll provide guidance from the Sustainability Pillar of the AWS Well-Architected Framework to reduce the carbon footprint of your AI/ML workloads.

This first post follows the first three phases provided in the Well-Architected machine learning lifecycle (Figure 1):

  • Business goal identification
  • ML problem framing
  • Data processing (data collection, data preprocessing, feature engineering)

You’ll learn best practices for each phase to help you review and refine your workloads to maximize utilization and minimize both waste and the total resources deployed and powered to support your workload.

ML lifecycle

Figure 1. ML lifecycle

Business goal identification

Define the overall environmental impact or benefit

Measure your workload’s impact and its contribution to the overall sustainability goals of the organization. Questions you should ask:

  • How does this workload support our overall sustainability mission?
  • How much data will we have to store and process? What is the impact of training the model? How often will we have to re-train?
  • What are the impacts resulting from customer use of this workload?
  • What will be the productive output compared with this total impact?

Asking these questions will help you establish specific sustainability objectives and success criteria to measure against in the future.

ML problem framing

Identify if ML is the right solution

Always ask if AI/ML is right for your workload. There is no need to use computationally intensive AI when a simpler, more sustainable approach might succeed just as well.

For example, using ML to route Internet of Things (IoT) messages may be unwarranted; you can express the logic with a Rules Engine.

Consider AI services and pre-trained models 

Once you decide if AI/ML is the right tool, consider whether the workload needs to be developed as a custom model.

Many workloads can use the managed AWS AI services shown in Figure 2. Using these services means that you won’t need the associated resources to collect/store/process data and to prepare/train/tune/deploy an ML model.

Managed AWS AI services

Figure 2. Managed AWS AI services

If adopting a fully managed AI service is not appropriate, evaluate if you can use pre-existing datasets, algorithms, or models. AWS Marketplace offers over 1,400 ML-related assets that customers can subscribe to. You can also fine-tune an existing model starting from a pre-trained model, like those available on Hugging Face. Using pre-trained models from third parties can reduce the resources you need for data preparation and model training.

Select sustainable Regions

Select an AWS Region with sustainable energy sources. When regulations and legal aspects allow, choose Regions near Amazon renewable energy projects and Regions where the grid has low published carbon intensity to host your data and workloads.

Data processing (data collection, data preprocessing, feature engineering)

Avoid datasets and processing duplication

Evaluate if you can avoid data processing by using existing publicly available datasets, such as those offered through AWS Data Exchange and Open Data on AWS (which includes the Amazon Sustainability Data Initiative). These sources offer weather and climate datasets, satellite imagery, and air quality or energy data, among others. Using these curated datasets avoids duplicating the compute and storage resources otherwise needed to download the data from the providers, store it in the cloud, and organize and clean it.

For internal data, you can also reduce duplication and reruns of feature engineering code across teams and projects by using a feature store, such as Amazon SageMaker Feature Store.

Once your data is ready for training, use pipe input mode to stream it from Amazon Simple Storage Service (Amazon S3) instead of copying it to Amazon Elastic Block Store (Amazon EBS). This way, you can reduce the size of your EBS volumes.
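For example, here is a hedged sketch of enabling pipe input mode with the SageMaker Python SDK; the S3 path and content type are placeholders, and `estimator` is assumed to be defined.

```python
# Hedged sketch: stream training data from S3 with pipe input mode instead of
# copying it to the training volume. Paths are hypothetical placeholders.
from sagemaker.inputs import TrainingInput

train_input = TrainingInput(
    s3_data="s3://<bucket>/training-data/",
    content_type="text/csv",
    input_mode="Pipe",  # stream from S3 rather than downloading to EBS
)

estimator.fit({"train": train_input})
```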

Minimize idle resources with serverless data pipelines

Adopt a serverless architecture for your data pipeline so it only provisions resources when work needs to be done. For example, when you use AWS Glue and AWS Step Functions for data ingestion and preprocessing, you are not maintaining compute infrastructure 24/7. As shown in Figure 3, Step Functions can orchestrate AWS Glue jobs to create event-based serverless ETL/ELT pipelines.

Orchestrating data preparation with AWS Glue and Step Functions

Figure 3. Orchestrating data preparation with AWS Glue and Step Functions
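As a minimal sketch of this pattern, the following boto3 code creates a Step Functions state machine whose single task runs an AWS Glue job and waits for it to complete; the Glue job name, state machine name, and IAM role ARN are placeholders.

```python
# Hedged sketch: a serverless preprocessing pipeline as a Step Functions
# state machine that runs a Glue job synchronously. Names/ARNs are placeholders.
import json
import boto3

definition = {
    "StartAt": "RunGlueETL",
    "States": {
        "RunGlueETL": {
            "Type": "Task",
            "Resource": "arn:aws:states:::glue:startJobRun.sync",
            "Parameters": {"JobName": "preprocess-training-data"},
            "End": True,
        }
    },
}

sfn = boto3.client("stepfunctions")
sfn.create_state_machine(
    name="data-preprocessing-pipeline",
    definition=json.dumps(definition),
    roleArn="arn:aws:iam::<account-id>:role/<step-functions-role>",
)
```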

Implement data lifecycle policies aligned with your sustainability goals

Classify data to understand its significance to your workload and your business outcomes. Use this information to determine when you can move data to more energy-efficient storage or safely delete it.

Manage the lifecycle of all your data and automatically enforce deletion timelines to minimize the total storage requirements of your workload using Amazon S3 Lifecycle policies. The Amazon S3 Intelligent-Tiering storage class will automatically move your data to the most sustainable access tier when access patterns change.

Define data retention periods that support your sustainability goals while meeting your business requirements, not exceeding them.
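For example, here is a minimal boto3 sketch of a lifecycle rule that archives training data after 90 days and expires it after a year; the bucket, prefix, and timelines are placeholders to align with your own classification and retention policy.

```python
# Hedged sketch: transition training data to an archival storage class and
# then delete it. Bucket, prefix, and day counts are hypothetical placeholders.
import boto3

s3 = boto3.client("s3")
s3.put_bucket_lifecycle_configuration(
    Bucket="<training-data-bucket>",
    LifecycleConfiguration={
        "Rules": [
            {
                "ID": "archive-then-expire-training-data",
                "Filter": {"Prefix": "training-data/"},
                "Status": "Enabled",
                "Transitions": [{"Days": 90, "StorageClass": "GLACIER"}],
                "Expiration": {"Days": 365},
            }
        ]
    },
)
```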

Adopt sustainable storage options

Use the appropriate storage tier to reduce the carbon impact of your workload. On Amazon S3, for example, you can use energy-efficient, archival-class storage for infrequently accessed data, as shown in Figure 4. And if you can easily recreate an infrequently accessed dataset, use the Amazon S3 One Zone-IA class to reduce its carbon footprint by 3x or more.

Data access patterns for Amazon S3

Figure 4. Data access patterns for Amazon S3

Don’t over-provision block storage for notebooks and use object storage services like Amazon S3 for common datasets.

Tip: You can check the free disk space on your SageMaker Notebooks using !df -h.

Select efficient file formats and compression algorithms 

Use efficient file formats such as Parquet or ORC to train your models. Compared to CSV, they can help you reduce your storage by up to 87%.

Migrating to a more efficient compression algorithm can also greatly contribute to your storage reduction efforts. For example, Zstandard produces 10–15% smaller files than Gzip at the same compression speed. Some SageMaker built-in algorithms accept x-recordio-protobuf input, which can be streamed directly from Amazon S3 instead of being copied to a notebook instance.
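As a small illustration, converting a CSV dataset to Parquet with Zstandard compression takes one call in pandas (assuming the pyarrow engine is installed); the file names are placeholders.

```python
# Hedged sketch: store a dataset as Parquet with Zstandard compression
# instead of plain CSV. File names are hypothetical placeholders.
import pandas as pd

df = pd.read_csv("raw-data.csv")
df.to_parquet("training-data.parquet", compression="zstd", index=False)
```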

Minimize data movement across networks

Compress your data before moving it over the network.

Minimize data movement across networks when selecting a Region; store your data close to your producers and train your models close to your data.

Measure results and improve

To monitor and quantify improvements, track the following metrics:

  • Total size of your S3 buckets and storage class distribution, using Amazon S3 Storage Lens
  • DiskUtilization metric of your SageMaker processing jobs
  • StorageBytes metric of your SageMaker Studio shared storage volume

Conclusion

In this blog post, we discussed the importance of defining the overall environmental impact or benefit of your ML workload and why managed AI services or pre-trained ML models are sustainable alternatives to custom models. You also learned best practices to reduce the carbon footprint of your ML workload in the data processing phase.

In the next post, we will continue our sustainability journey through the ML lifecycle and discuss the best practices you can follow in the model development phase.

Want to learn more? Check out the Sustainability Pillar of the AWS Well-Architected Framework, the Architecting for sustainability session at re:Invent 2021, and other blog posts on architecting for sustainability.

Looking for more architecture content? AWS Architecture Center provides reference architecture diagrams, vetted architecture solutions, Well-Architected best practices, patterns, icons, and more!