Train Your Large Model on Multiple GPUs with Pipeline Parallelism

Some language models are too large to train on a single GPU. When the model fits on one GPU but cannot be trained with a large batch size, you can use data parallelism. When the model itself does not fit on a single GPU, however, you need to split it across multiple GPUs. In this article, you will learn how to use pipeline parallelism to split a model for training. In particular, you will learn about:

  • What is pipeline parallelism
  • How to use pipeline parallelism in PyTorch
  • How to save and restore the model with pipeline parallelism

Let’s get started!
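Before diving in, here is a minimal sketch of what "splitting a model across GPUs" looks like without any pipeline machinery: each half of the model lives on a different device, and activations are copied between them inside the forward pass. The layer sizes, device IDs, and the assumption of two visible GPUs are illustrative only; they are not the model used later in this article.

```python
import torch
import torch.nn as nn

# A naive two-way model split (not yet pipeline parallelism): half of the
# layers sit on cuda:0, the other half on cuda:1, and activations are
# copied across devices inside forward(). Only one GPU is busy at a time.
class TwoStageModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.stage0 = nn.Sequential(nn.Linear(1024, 4096), nn.ReLU()).to("cuda:0")
        self.stage1 = nn.Linear(4096, 1024).to("cuda:1")

    def forward(self, x):
        h = self.stage0(x.to("cuda:0"))      # runs on GPU 0
        return self.stage1(h.to("cuda:1"))   # activations move to GPU 1

model = TwoStageModel()
output = model(torch.randn(8, 1024))
print(output.shape)   # torch.Size([8, 1024]), resident on cuda:1
```

Pipeline parallelism improves on this naive split by slicing each batch into micro-batches and scheduling them so that the GPUs work concurrently instead of waiting for each other.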

This article is divided into six parts; they are: 

• Pipeline Parallelism Overview

• Model Preparation for Pipeline Parallelism

• Stage and Pipeline Schedule

• Training Loop

• Distributed Checkpointing

• Limitations of Pipeline Parallelism

Pipeline parallelism means creating the model as a pipeline of stages.
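To make that idea concrete early, the sketch below shows the general shape of PyTorch's torch.distributed.pipelining API, which the rest of the article builds on. It is a minimal, illustrative example under several assumptions: PyTorch 2.4 or later, two GPUs, a launch via `torchrun --nproc-per-node=2`, and a toy model whose names (ToyModel, embed, head) are made up for this sketch rather than taken from the article.

```python
import torch
import torch.distributed as dist
from torch.distributed.pipelining import pipeline, SplitPoint, ScheduleGPipe

# A minimal sketch: split a toy model into two pipeline stages and run one
# scheduled forward pass. Assumes launch via `torchrun --nproc-per-node=2`
# so the process group environment variables are already set.
class ToyModel(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.embed = torch.nn.Linear(1024, 4096)
        self.act = torch.nn.ReLU()
        self.head = torch.nn.Linear(4096, 1024)

    def forward(self, x):
        return self.head(self.act(self.embed(x)))

dist.init_process_group("nccl")
rank = dist.get_rank()
device = torch.device(f"cuda:{rank}")
torch.cuda.set_device(device)

# Trace the model and cut it into two stages right before `head`.
# `mb_args` is one example micro-batch (size 2 here).
pipe = pipeline(
    ToyModel(),
    mb_args=(torch.randn(2, 1024),),
    split_spec={"head": SplitPoint.BEGINNING},
)

# Each rank materializes only its own stage and wraps it in a GPipe
# schedule that slices the incoming batch of 8 into 4 micro-batches.
stage = pipe.build_stage(rank, device)
schedule = ScheduleGPipe(stage, n_microbatches=4)

if rank == 0:
    schedule.step(torch.randn(8, 1024, device=device))  # first stage feeds data
else:
    output = schedule.step()  # last stage returns the assembled output

dist.destroy_process_group()
```

The sections that follow walk through each of these steps in detail: how to prepare the model, how to build stages and pick a schedule, how to write the training loop, and how to checkpoint the distributed model.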