Post Syndicated from whiteemm original https://aws.amazon.com/blogs/compute/optimizing-deep-learning-on-p3-and-p3dn-with-efa/
This post is written by Rashika Kheria, Software Engineer, Purna Sanyal, Senior Solutions Architect, Strategic Account and James Jeun, Sr. Product Manager
The Amazon EC2 P3dn.24xlarge instance is the latest addition to the Amazon EC2 P3 instance family, with upgrades to several components. This high-end size of the P3 family allows users to scale out to multiple nodes for distributed workloads more efficiently. With these improvements to the instance, you can complete training jobs in a shorter amount of time and iterate on your Machine Learning (ML) models faster.
This blog reviews the significant upgrades with p3dn.24xlarge, walks you through deployment, and shows an example ML use case for these upgrades.
Overview of P3dn instance upgrades
The most notable upgrade to the p3dn.24xlarge instance is the 100-Gbps network bandwidth and the new Elastic Fabric Adapter (EFA) network interface, which allows for highly scalable internode communication. This means you can scale applications to use thousands of GPUs, which reduces time to results. EFA combines an operating system (OS) bypass networking mechanism with the underlying Scalable Reliable Protocol that is built in to the Nitro controllers. Together, they provide a low-latency, low-jitter channel for inter-instance communication. EFA has been adopted in the mainline Linux kernel and integrated with Libfabric and various distributions. AWS worked with NVIDIA for EFA to support the NVIDIA Collective Communication Library (NCCL). NCCL optimizes multi-GPU and multi-node communication primitives and helps achieve high throughput over NVLink interconnects.
The following diagram shows the PCIe/NVLink communication topology used by the p3.16xlarge and p3dn.24xlarge instance types.
The following table summarizes the full set of differences between p3.16xlarge and p3dn.24xlarge.
| |p3.16xlarge|p3dn.24xlarge|
|---|---|---|
|Processor|Intel Xeon E5-2686 v4|Intel Skylake 8175 (w/ AVX 512)|
|GPU|8x 16 GB NVIDIA Tesla V100|8x 32 GB NVIDIA Tesla V100|
|RAM|488 GB|768 GB|
|Network|25 Gbps ENA|100 Gbps ENA + EFA|
|GPU Interconnect|NVLink – 300 GB/s|NVLink – 300 GB/s|
The p3dn.24xlarge offers four times the networking bandwidth of the p3.16xlarge (100 Gbps versus 25 Gbps). Paired with EFA’s communication library, this bandwidth drastically increases scaling efficiency for large-scale, distributed training jobs. Other improvements include double the GPU memory for large datasets and batch sizes, increased system memory, and more vCPUs. This upgraded instance is the most performant GPU compute option on AWS.
The upgrades also benefit distributed deep learning workloads. The GPU memory improvement enables larger batch sizes within each node. The newer Layer-wise Adaptive Rate Scaling (LARS) algorithm has been tested with ResNet50 and other deep neural networks (DNNs) to allow for larger batch sizes. The increased batch sizes reduce wall-clock time per epoch with minimal loss of accuracy. Additionally, 100-Gbps networking with EFA improves performance at scale. Greater networking performance is beneficial when updating weights for a large number of parameters. You can see high scaling efficiency when running distributed training on GPUs for ResNet50-type models that primarily use images for object recognition. For more information, see Scalable multi-node deep learning training using GPUs in the AWS Cloud.
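To make the weight-update traffic concrete, the following minimal sketch shows in plain Python what an all-reduce does in data-parallel training: every worker contributes its local gradients, and every worker ends up with the same average. The function names and the 25.6-million parameter count (roughly the size of ResNet50) are illustrative assumptions, not part of NCCL or any framework API.

```python
from typing import List


def all_reduce_mean(worker_grads: List[List[float]]) -> List[List[float]]:
    """Average gradients element-wise and hand every worker the same result."""
    num_workers = len(worker_grads)
    num_params = len(worker_grads[0])
    averaged = [
        sum(grads[i] for grads in worker_grads) / num_workers
        for i in range(num_params)
    ]
    # Every worker receives its own copy of the averaged gradients.
    return [list(averaged) for _ in range(num_workers)]


def gradient_payload_bytes(num_params: int, bytes_per_param: int = 4) -> int:
    """Bytes of gradient data one worker contributes per step (4 bytes = FP32)."""
    return num_params * bytes_per_param


if __name__ == "__main__":
    grads = [[1.0, 2.0], [3.0, 6.0]]  # two workers, two parameters each
    print(all_reduce_mean(grads))
    # Approximate ResNet50 parameter count, used only as an illustration.
    print(gradient_payload_bytes(25_600_000) / 1e6, "MB of gradients per step")
```

Because a payload of this size is exchanged on every training step, the time spent in communication grows with the parameter count, which is where the additional network bandwidth helps.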
Natural language processing (NLP) also presents large compute requirements for model training, especially with the arrival of large Transformer-based models like BERT and GPT-2, which have up to a billion parameters. The following sections describe how to set up scalable distributed model training for both image and language-based models, and note how the AWS P3 and P3dn instances perform.
Optimizing your P3 family
First, optimize your P3 instances with an important environment variable update. This update applies to traditional TCP-based networking and is available in NCCL 2.4.8, the latest release as of this writing.
Two new environment variables are available, which allow you to take advantage of multiple TCP sockets per thread:
- NCCL_SOCKET_NTHREADS
- NCCL_NSOCKS_PERTHREAD
These environment variables allow the NCCL backend to exceed the 10-Gbps TCP single-stream bandwidth limitation in EC2.
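As a rough illustration of how the two settings combine, the sketch below builds the environment for a launch and computes the total number of TCP sockets per connection (threads multiplied by sockets per thread). The helper names are hypothetical, and the cap of 64 on the product is an assumption based on the NCCL documentation; verify it against the NCCL version you run.

```python
import os
from typing import Dict

# Assumed upper bound on NCCL_SOCKET_NTHREADS * NCCL_NSOCKS_PERTHREAD.
MAX_TOTAL_SOCKETS = 64


def nccl_socket_env(nthreads: int = 4, nsocks_per_thread: int = 4) -> Dict[str, str]:
    """Return the two NCCL socket variables after a basic sanity check."""
    if nthreads < 1 or nsocks_per_thread < 1:
        raise ValueError("both values must be positive integers")
    if nthreads * nsocks_per_thread > MAX_TOTAL_SOCKETS:
        raise ValueError("threads x sockets per thread exceeds the assumed cap")
    return {
        "NCCL_SOCKET_NTHREADS": str(nthreads),
        "NCCL_NSOCKS_PERTHREAD": str(nsocks_per_thread),
    }


def total_sockets(env: Dict[str, str]) -> int:
    """Total TCP sockets opened per connection for the given settings."""
    return int(env["NCCL_SOCKET_NTHREADS"]) * int(env["NCCL_NSOCKS_PERTHREAD"])


if __name__ == "__main__":
    settings = nccl_socket_env()  # 4 threads x 4 sockets, as in this post
    launch_env = {**os.environ, **settings}  # could be passed to subprocess.run
    print(settings, "->", total_sockets(settings), "sockets")
```

With the values used in this post (4 and 4), each connection is spread over 16 TCP streams instead of one.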
Enter the following command:
/opt/openmpi/bin/mpirun -n 16 -N 8 --hostfile hosts -x NCCL_DEBUG=INFO -x LD_LIBRARY_PATH -x NCCL_SOCKET_IFNAME=eth0 -x NCCL_NSOCKS_PERTHREAD=4 -x NCCL_SOCKET_NTHREADS=4 --mca btl_tcp_if_exclude lo,docker0 /opt/nccl-tests/build/all_reduce_perf -b 16 -e 8192M -f 2 -g 1 -c 1 -n 100
The following graph shows the synthetic NCCL tests and their increased performance with the additional directives.
You can achieve a two-fold increase in throughput after a threshold in the synthetic payload size (around 1 MB).
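The payload sizes in that sweep come from the `-b 16 -e 8192M -f 2` arguments: start at 16 bytes and multiply by 2 until 8192M is reached. The sketch below reproduces that sequence so you can see where the roughly 1 MB threshold falls. The function names are hypothetical, and it assumes the `K`, `M`, and `G` suffixes are binary multiples (1024-based).

```python
from typing import List

# Assumption: size suffixes are binary multiples.
SUFFIXES = {"K": 1 << 10, "M": 1 << 20, "G": 1 << 30}


def parse_size(text: str) -> int:
    """Convert a size such as '16' or '8192M' to a number of bytes."""
    text = text.strip().upper()
    if text and text[-1] in SUFFIXES:
        return int(text[:-1]) * SUFFIXES[text[-1]]
    return int(text)


def sweep_sizes(min_bytes: str, max_bytes: str, factor: int) -> List[int]:
    """List every payload size from min to max, multiplying by factor."""
    if factor < 2:
        raise ValueError("factor must be at least 2")
    size, end = parse_size(min_bytes), parse_size(max_bytes)
    sizes = []
    while size <= end:
        sizes.append(size)
        size *= factor
    return sizes


if __name__ == "__main__":
    sizes = sweep_sizes("16", "8192M", 2)
    print(len(sizes), "payload sizes from", sizes[0], "to", sizes[-1], "bytes")
    # Index of the first payload of at least 1 MiB.
    print("1 MiB is step", next(i for i, s in enumerate(sizes) if s >= 1 << 20))
```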
The following steps walk you through spinning up a cluster of p3dn.24xlarge instances in a cluster placement group. This allows you to take advantage of all the new performance features within the P3 instance family. For more information, see Cluster Placement Groups in the Amazon EC2 User Guide.
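If you prefer to script the launch instead of using the console, the sketch below builds the request parameters for a self-referencing security group rule and for EFA-enabled instances in a cluster placement group. The dictionaries follow the shape accepted by the boto3 EC2 client (`authorize_security_group_ingress`, `authorize_security_group_egress`, and `run_instances`), but the helper names and every ID shown are hypothetical placeholders, and no AWS call is made here.

```python
from typing import Any, Dict, List


def self_referencing_rule(security_group_id: str) -> List[Dict[str, Any]]:
    """All protocols and ports, allowed only to and from the group itself."""
    return [
        {
            "IpProtocol": "-1",  # -1 means all protocols and ports
            "UserIdGroupPairs": [{"GroupId": security_group_id}],
        }
    ]


def efa_launch_params(
    ami_id: str,
    subnet_id: str,
    security_group_id: str,
    placement_group: str,
    count: int = 2,
) -> Dict[str, Any]:
    """Parameters for launching p3dn.24xlarge instances with EFA enabled."""
    return {
        "ImageId": ami_id,
        "InstanceType": "p3dn.24xlarge",
        "MinCount": count,
        "MaxCount": count,
        "Placement": {"GroupName": placement_group},
        "NetworkInterfaces": [
            {
                "DeviceIndex": 0,
                "InterfaceType": "efa",  # requests an EFA instead of a plain ENA
                "SubnetId": subnet_id,
                "Groups": [security_group_id],
            }
        ],
    }


if __name__ == "__main__":
    # Placeholder IDs; substitute values from your own account.
    params = efa_launch_params(
        "ami-0123456789abcdef0", "subnet-0123456789abcdef0",
        "sg-0123456789abcdef0", "p3dn-cluster",
    )
    print(params)
    # With boto3 installed and credentials configured, one could then call:
    #   boto3.client("ec2").run_instances(**params)
```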
This post deploys the following stack:
- Amazon Linux 2 with NVIDIA Driver 430 and CUDA 10. Native EFA support is available in the AWS-managed Deep Learning AMI (DLAMI).
- EFA Driver 1.5
- Compiled NCCL 2.4.8
- OpenMPI 4.0.0
- On the Amazon EC2 console, create a security group.
Make sure that both inbound and outbound traffic are open on all ports and protocols within the security group.
- Modify the user variables in the packer build script so that the variables are compatible with your environment.
The following is the modification code for your variables:
- Build and launch the AMI by running the following Packer script:
packer build nvidia-efa-fsx-al2.yml
This entire workflow takes care of setting up EFA, compiling NCCL, and installing the toolchain. After the build completes, you have an AMI ID that you can launch in the EC2 console. When you launch the instance, make sure to enable EFA.
- Launch a second instance in a cluster placement group so you can run two-node tests.
- Enter the following code to make sure that all components are built correctly:
- The following output of the command confirms that the build is using EFA:
INFO: Function: ofi_init Line: 686: NET/OFI Selected Provider is efa
INFO: Function: main Line: 49: NET/OFI Process rank 8 started. NCCLNet device used on ip-172-0-1-161 is AWS Libfabric.
INFO: Function: main Line: 53: NET/OFI Received 1 network devices
INFO: Function: main Line: 57: NET/OFI Server: Listening on dev 0
INFO: Function: ofi_init Line: 686: NET/OFI Selected Provider is efa
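When many ranks log at once, it can be easier to check this programmatically. The following sketch scans captured output for the `Selected Provider` lines shown above and reports whether every rank chose `efa`. The function names are hypothetical, and the pattern assumes the log format matches the sample output in this post.

```python
import re
from typing import Iterable, Set

# Matches lines such as "... NET/OFI Selected Provider is efa".
PROVIDER_PATTERN = re.compile(r"NET/OFI Selected Provider is (\S+)")


def selected_providers(log_lines: Iterable[str]) -> Set[str]:
    """Collect every distinct provider name reported in the log."""
    providers = set()
    for line in log_lines:
        match = PROVIDER_PATTERN.search(line)
        if match:
            providers.add(match.group(1))
    return providers


def uses_efa_only(log_lines: Iterable[str]) -> bool:
    """True only if at least one provider line exists and all of them say efa."""
    return selected_providers(log_lines) == {"efa"}


if __name__ == "__main__":
    sample = [
        "INFO: Function: ofi_init Line: 686: NET/OFI Selected Provider is efa",
        "INFO: Function: main Line: 53: NET/OFI Received 1 network devices",
    ]
    print("EFA in use:", uses_efa_only(sample))
```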
Synthetic two-node performance
This post includes the nccl-tests GitHub repository as part of the deployment stack. These tests provide synthetic benchmarking of the communication layer over NCCL and the EFA network.
When launching the two-node cluster, complete the following steps:
- Place the instances in the cluster placement group.
- SSH into one of the nodes.
- Fill out the hosts file.
- Run the two-node test with the following code:
/opt/openmpi/bin/mpirun -n 16 -N 8 --hostfile hosts -x NCCL_DEBUG=INFO -x LD_LIBRARY_PATH -x FI_PROVIDER="efa" -x FI_EFA_TX_MIN_CREDITS=64 -x NCCL_SOCKET_IFNAME=eth0 --mca btl_tcp_if_exclude lo,docker0 /opt/nccl-tests/build/all_reduce_perf -b 16 -e 8192M -f 2 -g 1 -c 1 -n 100
This test verifies that internode communication performs as expected.
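The hosts file and the launch command above can also be assembled programmatically, which helps when the node count changes. The sketch below is one way to do it under a few assumptions: the hosts file uses the Open MPI `host slots=N` format, the helper names are hypothetical, and the IP addresses are placeholders. Note that the first `-n` is the total process count for `mpirun`, while the trailing `-n 100` is the iteration count for `all_reduce_perf`.

```python
import shlex
from typing import Dict, List, Optional


def hostfile_text(hosts: List[str], slots: int = 8) -> str:
    """One line per node, with one slot per GPU."""
    return "".join(f"{host} slots={slots}\n" for host in hosts)


def mpirun_command(
    hosts: List[str],
    gpus_per_node: int = 8,
    env: Optional[Dict[str, str]] = None,
    hostfile: str = "hosts",
) -> List[str]:
    """Build the all_reduce_perf launch command as an argument list."""
    cmd = [
        "/opt/openmpi/bin/mpirun",
        "-n", str(len(hosts) * gpus_per_node),  # total processes
        "-N", str(gpus_per_node),               # processes per node
        "--hostfile", hostfile,
        "-x", "NCCL_DEBUG=INFO",
        "-x", "LD_LIBRARY_PATH",
    ]
    for name, value in (env or {}).items():
        cmd += ["-x", f"{name}={value}"]
    cmd += [
        "--mca", "btl_tcp_if_exclude", "lo,docker0",
        "/opt/nccl-tests/build/all_reduce_perf",
        "-b", "16", "-e", "8192M", "-f", "2", "-g", "1", "-c", "1", "-n", "100",
    ]
    return cmd


if __name__ == "__main__":
    nodes = ["10.0.0.1", "10.0.0.2"]  # placeholder private IPs
    efa_env = {
        "FI_PROVIDER": "efa",
        "FI_EFA_TX_MIN_CREDITS": "64",
        "NCCL_SOCKET_IFNAME": "eth0",
    }
    print(hostfile_text(nodes), end="")
    print(" ".join(shlex.quote(arg) for arg in mpirun_command(nodes, env=efa_env)))
```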
The following graph compares the NCCL bandwidth performance using -x FI_PROVIDER="efa" vs. -x FI_PROVIDER="tcp". There is a three-fold increase in bus bandwidth when using EFA.
Now that you have run the two-node tests, you can move on to a deep learning use case.
FAIRSEQ ML training on a P3dn cluster
Fairseq(-py) is a sequence modeling toolkit that allows you to train custom models for translation, summarization, language modeling, and other text-generation tasks. Fairseq provides reference implementations of various sequence-to-sequence models, including convolutional neural networks (CNN), long short-term memory (LSTM) networks, and transformer (self-attention) networks. Distributed training of FAIRSEQ machine translation models requires a fast network to support the all-reduce algorithm.
After you consistently measure 10 GB/s of bus bandwidth on the new P3dn instances, you are ready for FAIRSEQ distributed training.
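For context on that figure, nccl-tests reports two numbers for all-reduce: algorithm bandwidth (payload size divided by elapsed time) and bus bandwidth, which scales the former by 2(n-1)/n for n ranks. The sketch below reproduces that arithmetic; the function names and the timing value are illustrative assumptions, not measurements from this post.

```python
def algorithm_bandwidth_gbs(size_bytes: int, seconds: float) -> float:
    """Payload size divided by elapsed time, in GB/s (1 GB = 1e9 bytes)."""
    return size_bytes / seconds / 1e9


def all_reduce_bus_bandwidth_gbs(size_bytes: int, seconds: float, num_ranks: int) -> float:
    """Bus bandwidth for all-reduce: algorithm bandwidth x 2(n-1)/n."""
    if num_ranks < 2:
        raise ValueError("all-reduce needs at least two ranks")
    factor = 2 * (num_ranks - 1) / num_ranks
    return algorithm_bandwidth_gbs(size_bytes, seconds) * factor


if __name__ == "__main__":
    # Hypothetical sample: 1 GB reduced across 16 ranks in 0.1875 seconds.
    busbw = all_reduce_bus_bandwidth_gbs(1_000_000_000, 0.1875, 16)
    print(f"bus bandwidth: {busbw:.2f} GB/s")
```

Because the correction factor depends only on the rank count, bus bandwidth lets you compare runs with different numbers of GPUs against the same hardware limit.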
To install fairseq from source and develop locally, complete the following steps:
- Copy the FAIRSEQ source code to one of the P3dn instances.
- Copy the FAIRSEQ training data to the data folder.
- Copy the FAIRSEQ test data to the data folder.
git clone https://github.com/pytorch/fairseq
cd fairseq
pip install --editable .
Now that you have FAIRSEQ installed, you can run model training. Complete the following steps:
- Run FAIRSEQ training on a single p3dn instance (1 node, 8 GPUs) to check the performance and the accuracy of FAIRSEQ operations.
- Create a custom AMI.
- Build the other 31 instances from the custom AMI.
Use the following script for distributed all-reduce FAIRSEQ training:
export RANK=$1 # the rank of this process, from 0 to 127 in case of 128 GPUs
export LOCAL_RANK=$2 # the local rank of this process, from 0 to 7 in case of 8 GPUs per machine
python train.py data-bin/wmt18_en_de_bpej32k \
--clip-norm 0.0 -a transformer_vaswani_wmt_en_de_big \
--lr 0.0005 --source-lang en --target-lang de \
--label-smoothing 0.1 --upsample-primary 16 \
--attention-dropout 0.1 --dropout 0.3 --max-tokens 3584 \
--log-interval 100 --weight-decay 0.0 \
--criterion label_smoothed_cross_entropy --fp16 \
--max-update 500000 --seed 3 --save-interval-updates 16000 \
--share-all-embeddings --optimizer adam --adam-betas '(0.9, 0.98)' \
--lr-scheduler inverse_sqrt --warmup-init-lr 1e-07 \
--warmup-updates 4000 --min-lr 1e-09 \
--distributed-port 12597 --distributed-world-size 32 \
--distributed-init-method 'tcp://172.31.43.34:9218' --distributed-rank $RANK \
--device-id $LOCAL_RANK \
--max-epoch 3
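The script expects a global rank and a local rank for each process. One common way to derive them, shown here as a sketch rather than as the method used for the results in this post, is to number processes node by node: the global rank is the node index times the GPUs per node, plus the local rank. The function names are hypothetical.

```python
def world_size(num_nodes: int, gpus_per_node: int = 8) -> int:
    """Total number of training processes, one per GPU."""
    return num_nodes * gpus_per_node


def global_rank(node_index: int, local_rank: int, gpus_per_node: int = 8) -> int:
    """Node-major numbering: all GPUs of node 0 first, then node 1, and so on."""
    if not 0 <= local_rank < gpus_per_node:
        raise ValueError("local_rank must be between 0 and gpus_per_node - 1")
    if node_index < 0:
        raise ValueError("node_index must not be negative")
    return node_index * gpus_per_node + local_rank


if __name__ == "__main__":
    nodes = 16  # 16 nodes x 8 GPUs = 128 GPUs, matching the comment in the script
    print("world size:", world_size(nodes))
    print("last rank:", global_rank(nodes - 1, 7))
```

Each process would then be started with its own pair of values as the two positional arguments of the script.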
Now that you have completed and validated your base infrastructure layer, you can add components to the stack for various workflows. The following charts show time-to-train improvement factors when scaling out to multiple GPUs for FAIRSEQ model training.
EFA on p3dn.24xlarge allows you to take advantage of additional performance at scale with no change in code. With this updated infrastructure, you can decrease cost and time to results by using more GPUs to scale out and get more done on complex workloads like natural language processing. This post, together with the EFA-integrated DLAMI, takes care of much of the undifferentiated heavy lifting. Go power up your ML workloads with EFA!