Scale LLMs with PyTorch 2.0 FSDP on Amazon EKS – Part 2

AWS Machine Learning Blog

This is a guest post co-written with Meta’s PyTorch team and is a continuation of Part 1 of this series, where we demonstrate the performance and ease of running PyTorch 2.0 on AWS.
Machine learning (ML) research has shown that large language models (LLMs) trained on very large datasets deliver better model quality. In the last few years, the size of current-generation models has increased significantly, and these models require modern tools and infrastructure to be trained efficiently and at scale. PyTorch Distributed Data Parallel (DDP) helps process data at scale in a simple and robust manner, but it requires the model to fit on one GPU. The PyTorch Fully Sharded Data Parallel (FSDP) library breaks this barrier by sharding the model across data parallel workers, enabling the training of large models.
Distributed model training requires a cluster of worker nodes that can scale. Amazon Elastic Kubernetes Service (Amazon EKS) is a popular Kubernetes-conformant service that greatly simplifies the process of running AI/ML workloads, making it more manageable and less time-consuming.
In this blog post, AWS collaborates with Meta’s PyTorch team to discuss how to use the PyTorch FSDP library to achieve linear scaling of deep learning models on AWS seamlessly using Amazon EKS and AWS Deep Learning Containers (DLCs). We demonstrate this through a step-by-step implementation of training 7B, 13B, and 70B Llama2 models using Amazon EKS with 16 Amazon Elastic Compute Cloud (Amazon EC2) p4de.24xlarge instances (each with 8 NVIDIA A100 Tensor Core GPUs and each GPU with 80 GB HBM2e memory) or 16 EC2 p5.48xlarge instances (each with 8 NVIDIA H100 Tensor Core GPUs and each GPU with 80 GB HBM3 memory), achieving near linear scaling in throughput and ultimately enabling faster training time.
The following scaling chart shows that the p5.48xlarge instances offer 87% scaling efficiency with FSDP Llama2 fine-tuning in a 16-node cluster configuration.

Challenges of training LLMs
Businesses are increasingly adopting LLMs for a range of tasks, including virtual assistants, translation, content creation, and computer vision, to enhance the efficiency and accuracy in a variety of applications.
However, training or fine-tuning these large models for a custom use case requires a large amount of data and compute power, which adds to the overall engineering complexity of the ML stack. Much of this complexity stems from the limited memory available on a single GPU, which restricts both the size of the model that can be trained and the per-GPU batch size used during training.
To address this challenge, various model parallelism techniques such as DeepSpeed ZeRO and PyTorch FSDP were created to allow you to overcome this barrier of limited GPU memory. This is done by adopting a sharded data parallel technique, where each accelerator holds just a slice (a shard) of a model replica instead of the entire model replica, which dramatically reduces the memory footprint of the training job.
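To get a feel for why sharding matters, consider a rough back-of-the-envelope calculation (this assumes Adam with mixed-precision training and is an illustration, not a measurement from this post). A 7B-parameter model carries about 16 bytes of model state per parameter: 2 bytes for half-precision weights, 2 for gradients, and 12 for the FP32 master weights and the two Adam moments. That is roughly 112 GB, which cannot fit in the 80 GB of a single A100 or H100 GPU. Fully sharding that state across 16 GPUs brings the per-GPU model state down to about 7 GB, leaving room for activations, temporary buffers, and larger batch sizes.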
This post demonstrates how you can use PyTorch FSDP to fine-tune the Llama2 model using Amazon EKS. We achieve this by scaling out compute and GPU capacity to address the model requirements.
FSDP overview
In PyTorch DDP training, each GPU (referred to as a worker in the context of PyTorch) holds a complete copy of the model, including the model weights, gradients, and optimizer states. Each worker processes a batch of data and, at the end of the backward pass, uses an all-reduce operation to synchronize gradients across different workers.
Having a replica of the model on each GPU restricts the size of the model that can be accommodated in a DDP workflow. FSDP helps overcome this limitation by sharding model parameters, optimizer states, and gradients across data parallel workers while still preserving the simplicity of data parallelism.
This is demonstrated in the following diagram. In the case of DDP, each GPU holds a complete copy of the model state, including the optimizer state (OS), gradients (G), and parameters (P): M(OS + G + P). In FSDP, each GPU holds only a shard of the model state, roughly M(OS + G + P)/N for N data parallel workers. Using FSDP results in a significantly smaller GPU memory footprint compared to DDP across all workers, enabling the training of very large models or the use of larger batch sizes for training jobs.

This, however, comes at the cost of increased communication overhead, which is mitigated through FSDP optimizations such as overlapping communication and computation processes with features like pre-fetching. For more detailed information, refer to Getting Started with Fully Sharded Data Parallel (FSDP).
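To make the contrast between DDP and FSDP concrete, the following minimal sketch wraps the same module either way. The toy model and single process-group setup are placeholders for illustration, not code taken from this post or the referenced repositories.

import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP

# Launched with torchrun, which sets RANK, WORLD_SIZE, and LOCAL_RANK
dist.init_process_group(backend="nccl")
local_rank = int(os.environ["LOCAL_RANK"])
torch.cuda.set_device(local_rank)

# Toy model used purely for illustration
model = torch.nn.Sequential(
    torch.nn.Linear(1024, 1024),
    torch.nn.ReLU(),
    torch.nn.Linear(1024, 1024),
).cuda()

use_fsdp = True  # flip to False to train with plain DDP instead

if use_fsdp:
    # FSDP: each worker holds only its shard of parameters, gradients, and optimizer state
    model = FSDP(model, device_id=local_rank)
else:
    # DDP: every worker keeps the full M(OS + G + P) and all-reduces gradients after backward
    model = DDP(model, device_ids=[local_rank])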
FSDP offers various parameters that allow you to tune the performance and efficiency of your training jobs. Some of the key features and capabilities of FSDP include the following (a minimal configuration sketch follows this list):

Transformer wrapping policy
Flexible mixed precision
Activation checkpointing
Various sharding strategies to suit different network speeds and cluster topologies:

FULL_SHARD – Shard model parameters, gradients, and optimizer states
HYBRID_SHARD – Full shard within a node, DDP across nodes; also supports a flexible sharding group for a full replica of the model (HSDP)
SHARD_GRAD_OP – Shard only gradients and optimizer states
NO_SHARD – Similar to DDP
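
The following sketch shows how these features map onto the FSDP API for a Llama2-style model. It is a minimal configuration example, not the training script used in this post; the bfloat16 mixed precision policy, the example checkpoint name, and the choice of LlamaDecoderLayer as the wrapping boundary are illustrative assumptions that mirror common Llama2 fine-tuning setups.

import functools
import os
import torch
import torch.distributed as dist
from torch.distributed.fsdp import (
    FullyShardedDataParallel as FSDP,
    MixedPrecision,
    ShardingStrategy,
)
from torch.distributed.fsdp.wrap import transformer_auto_wrap_policy
from torch.distributed.algorithms._checkpoint.checkpoint_wrapper import (
    apply_activation_checkpointing,
    checkpoint_wrapper,
)
from transformers import LlamaForCausalLM
from transformers.models.llama.modeling_llama import LlamaDecoderLayer

# Process group and device setup, as provided by torchrun on each worker
dist.init_process_group(backend="nccl")
torch.cuda.set_device(int(os.environ["LOCAL_RANK"]))

# Example checkpoint; access to the meta-llama repositories must be requested separately
model = LlamaForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")

# Transformer wrapping policy: shard at the decoder-layer boundary
llama_wrap_policy = functools.partial(
    transformer_auto_wrap_policy,
    transformer_layer_cls={LlamaDecoderLayer},
)

# Flexible mixed precision: bfloat16 for parameters, gradient reduction, and buffers
bf16_policy = MixedPrecision(
    param_dtype=torch.bfloat16,
    reduce_dtype=torch.bfloat16,
    buffer_dtype=torch.bfloat16,
)

model = FSDP(
    model,
    auto_wrap_policy=llama_wrap_policy,
    mixed_precision=bf16_policy,
    sharding_strategy=ShardingStrategy.FULL_SHARD,  # or HYBRID_SHARD, SHARD_GRAD_OP, NO_SHARD
    device_id=torch.cuda.current_device(),
)

# Activation checkpointing on each decoder layer to trade recomputation for memory
apply_activation_checkpointing(
    model,
    checkpoint_wrapper_fn=checkpoint_wrapper,
    check_fn=lambda module: isinstance(module, LlamaDecoderLayer),
)

Choosing HYBRID_SHARD instead of FULL_SHARD keeps sharding within each node and replicates across nodes, which trades higher per-GPU memory use for lower cross-node communication volume.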

For more information about FSDP, refer to Efficient Large-Scale Training with Pytorch FSDP and AWS.
The following figure shows how FSDP works for two data parallel processes.

Solution overview
In this post, we set up a compute cluster using Amazon EKS, which is a managed service to run Kubernetes in the AWS Cloud and on-premises data centers. Many customers are embracing Amazon EKS to run Kubernetes-based AI/ML workloads, taking advantage of its performance, scalability, reliability, and availability, as well as its integrations with AWS networking, security and other services.
For our FSDP use case, we use the Kubeflow Training Operator on Amazon EKS, which is a Kubernetes-native project that facilitates fine-tuning and scalable distributed training for ML models. It supports various ML frameworks, including PyTorch, which you can use to deploy and manage PyTorch training jobs at scale.
Using the PyTorchJob custom resource of the Kubeflow Training Operator, we run training jobs on Kubernetes with a configurable number of worker replicas, which allows us to optimize resource utilization.
The following are a few components of the training operator that play a role in our Llama2 fine-tuning use case:

A centralized Kubernetes controller that orchestrates distributed training jobs for PyTorch.
PyTorchJob, the Kubernetes custom resource for PyTorch provided by the Kubeflow Training Operator, which we use to define and deploy Llama2 training jobs on Kubernetes.
etcd, which implements the rendezvous mechanism for coordinating the distributed training of PyTorch models. This etcd server, as part of the rendezvous process, facilitates the coordination and synchronization of the participating workers during distributed training (a sketch of the worker-side initialization that relies on this rendezvous follows the list).
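
For reference, the following is a hedged sketch of the distributed initialization that each worker replica typically performs at startup. The rank, world size, and master address are supplied through environment variables by the launcher (for example, torchrun using the etcd rendezvous endpoint under the training operator); none of this code is taken from the post's automation scripts.

import os
import torch
import torch.distributed as dist

def init_distributed():
    # RANK, WORLD_SIZE, MASTER_ADDR, and MASTER_PORT are injected by the launcher,
    # so the default env:// initialization needs no explicit arguments
    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ.get("LOCAL_RANK", "0"))
    torch.cuda.set_device(local_rank)
    return dist.get_rank(), dist.get_world_size(), local_rank

if __name__ == "__main__":
    rank, world_size, local_rank = init_distributed()
    print(f"worker {rank}/{world_size} ready on local GPU {local_rank}")
    dist.destroy_process_group()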

The following diagram illustrates the solution architecture.

Most of the details will be abstracted by the automation scripts that we use to run the Llama2 example.
We use the following code references in this use case:

End-to-end fsdp example
Llama-recipes example

What is Llama2?
Llama2 is an LLM pre-trained on 2 trillion tokens of text and code. It is one of the largest and most powerful LLMs available today. You can use Llama2 for a variety of tasks, including natural language processing (NLP), text generation, and translation. For more information, refer to Getting started with Llama.
Llama2 is available in three different model sizes:

Llama2-70b – This is the largest Llama2 model, with 70 billion parameters. It is the most powerful Llama2 model and can be used for the most demanding tasks.
Llama2-13b – This is a medium-sized Llama2 model, with 13 billion parameters. It is a good balance between performance and efficiency, and can be used for a variety of tasks.
Llama2-7b – This is the smallest Llama2 model, with 7 billion parameters. It is the most efficient Llama2 model, and can be used for tasks that don’t require the highest level of performance.

This post enables you to fine-tune all of these models on Amazon EKS. To provide a simple and reproducible experience of creating an EKS cluster and running FSDP jobs on it, we use the aws-do-eks project. The example will also work with a pre-existing EKS cluster.
A scripted walkthrough is available on GitHub for an out-of-the-box experience. In the following sections, we explain the end-to-end process in more detail.
Provision the solution infrastructure
For the experiments described in this post, we use clusters with p4de (A100 GPU) and p5 (H100 GPU) nodes.
Cluster with p4de.24xlarge nodes
For our cluster with p4de nodes, we use the following eks-gpu-p4de-odcr.yaml script:

export ODCR_ID=

cat > ./eks-gpu-p4de-odcr.yaml