Train Your Large Model on Multiple GPUs with Tensor Parallelism
Tensor parallelism, which originated from the Megatron-LM paper, is a model-parallelism technique that shards a tensor along a specific dimension and distributes the computation involving that tensor across multiple devices with minimal communication overhead. It is suited to models whose parameter tensors are so large that even a single matrix multiplication cannot fit on one GPU; a short sketch of the idea is given right after this introduction. In this article, you will learn how to use tensor parallelism. In particular, you will learn about:
- What is tensor parallelism
- How to design a tensor parallel plan
- How to apply tensor parallelism in PyTorch
Let’s get started!
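To make the idea concrete, below is a minimal, single-process sketch of what tensor parallelism does to a linear layer. It splits the weight matrix of Y = XW column-wise into two shards, lets each shard produce its own partial output, and then concatenates the results; the tensor names (X, W, and the shards) are illustrative only. In a real tensor-parallel setup each shard would live on a different GPU and the final concatenation would be a collective communication, which is what the PyTorch tooling covered later in this article handles for you.

```python
import torch

torch.manual_seed(0)

X = torch.randn(4, 8)   # a batch of 4 inputs with 8 features
W = torch.randn(8, 6)   # full weight matrix of a linear layer: Y = X @ W

# Shard the weight along its output (column) dimension
W_shard_0, W_shard_1 = W.chunk(2, dim=1)   # two shards of shape (8, 3)

# Each shard computes a partial output; on real hardware each line below
# would run on a different GPU that holds only its own shard of W
Y_partial_0 = X @ W_shard_0
Y_partial_1 = X @ W_shard_1

# Gathering (concatenating) the partial outputs reproduces the full result
Y_parallel = torch.cat([Y_partial_0, Y_partial_1], dim=1)
Y_full = X @ W

print(torch.allclose(Y_parallel, Y_full))  # True
```

Column-wise sharding is only one half of the usual recipe: the following layer's weight is typically sharded row-wise so that two consecutive layers can be chained with a single communication step, which is the pattern described in the Megatron-LM paper.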
This article is divided into five parts; they are:
• An Example of Tensor Parallelism
• Setting Up Tensor Parallelism
• Preparing Model for Tensor Parallelism
• Train a Model with Tensor Parallelism
• Combining Tensor Parallelism with FSDP