Train Your Large Model on Multiple GPUs with Tensor Parallelism

Tensor parallelism, which originated from the Megatron-LM paper, is a model-parallelism technique that shards a tensor along a specific dimension. It distributes the computation involving that tensor, such as a matrix multiplication, across multiple devices with minimal communication overhead. This technique is suited to models with parameter tensors so large that even a single matrix multiplication cannot fit on one GPU. In this article, you will learn how to use tensor parallelism. In particular, you will learn about:

  • What tensor parallelism is
  • How to design a tensor parallel plan
  • How to apply tensor parallelism in PyTorch

Let’s get started!
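Before diving into the distributed setup, here is a minimal single-process sketch of the core idea, assuming plain PyTorch on CPU rather than the tensor-parallel API: a linear layer's weight is split column-wise, each shard's matrix multiplication could run on a different GPU, and concatenating the partial outputs recovers the full result.

```python
import torch

# Toy illustration of column-wise tensor parallelism (a sketch only,
# not the distributed API): shard the weight along its output
# dimension, compute each partial matmul independently, then
# concatenate the shards.

torch.manual_seed(0)
x = torch.randn(8, 16)       # input batch: (batch, in_features)
w = torch.randn(16, 32)      # full weight: (in_features, out_features)

# Split the weight column-wise into two shards, one per hypothetical GPU
w0, w1 = w.chunk(2, dim=1)   # each shard is (16, 16)

# Each shard's matmul is independent and could run on a separate device
y0 = x @ w0                  # would run on GPU 0
y1 = x @ w1                  # would run on GPU 1

# Concatenating the partial outputs reproduces the full matmul
y = torch.cat([y0, y1], dim=1)
assert torch.allclose(y, x @ w)
```

In a real multi-GPU run, each shard lives on its own device and the concatenation (or a reduction, for row-wise sharding) is handled by collective communication; the rest of this article shows how PyTorch automates this.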

This article is divided into five parts; they are:

  • An Example of Tensor Parallelism
  • Setting Up Tensor Parallelism
  • Preparing Model for Tensor Parallelism
  • Train a Model with Tensor Parallelism
  • Combining Tensor Parallelism with FSDP