Train Your Large Model on Multiple GPUs with Fully Sharded Data Parallelism

Some language models are too large to train on a single GPU. Besides splitting the model into a pipeline of stages, as in pipeline parallelism, you can shard it across multiple GPUs using Fully Sharded Data Parallelism (FSDP). In this article, you will learn how to use FSDP to split a model for training; a minimal code sketch appears after the list below as a preview. In particular, you will learn about:

  • The idea of sharding and how FSDP works
  • How to use FSDP in PyTorch
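
As a preview of the second point, the sketch below shows roughly what wrapping a model with FSDP looks like in PyTorch. This is a minimal sketch, not the article's exact setup: the `torch.nn.Transformer` placeholder model, the NCCL backend, and the assumption of a `torchrun`-style launch are illustrative choices. The sections that follow walk through the real preparation, training loop, tuning, and checkpointing steps.

```python
import torch
import torch.distributed as dist
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP

def main():
    # FSDP needs a distributed process group; typically each rank is launched with torchrun
    dist.init_process_group("nccl")
    local_rank = dist.get_rank() % torch.cuda.device_count()
    torch.cuda.set_device(local_rank)

    # Placeholder model for illustration: any nn.Module is wrapped the same way
    model = torch.nn.Transformer().cuda()

    # Wrapping the model shards its parameters across all ranks
    model = FSDP(model)

    # The optimizer only sees the local shard of the parameters
    optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

    # ... the training loop goes here; it is covered later in the article

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```

Run under a launcher such as `torchrun`, each process (rank) holds only a shard of the model's parameters and gathers the full weights of a layer on the fly during the forward and backward passes.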

Let’s get started!

This article is divided into five parts; they are:

  • Introduction to Fully Sharded Data Parallel
  • Preparing Model for FSDP Training
  • Training Loop with FSDP
  • Fine-Tuning FSDP Behavior
  • Checkpointing FSDP Models