Diffusion models have recently become powerful tools across many fields, such as image and 3D object generation. Their success stems from their ability to handle denoising tasks, gradually transforming random noise into samples from the target data distribution through repeated denoising steps. With Transformer-based backbones, adding more parameters has been shown to reliably improve performance. However, training and running these models is costly: the networks are dense, so every example activates all parameters, and computational costs grow rapidly as the models scale up.
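To make the iterative denoising idea concrete, here is a minimal sampling sketch in PyTorch. The names `model`, `denoise_step`, and `alphas_cumprod` are illustrative placeholders, and the deterministic (DDIM-style) update shown here is a generic example rather than the sampler used in the paper.

```python
import torch

@torch.no_grad()
def denoise_step(model, x_t, t, alphas_cumprod):
    """One deterministic denoising step (unconditional, for simplicity)."""
    eps_pred = model(x_t, t)                                  # predicted noise at timestep t
    a_t = alphas_cumprod[t]
    a_prev = alphas_cumprod[t - 1] if t > 0 else torch.tensor(1.0)
    x0_hat = (x_t - (1 - a_t).sqrt() * eps_pred) / a_t.sqrt() # estimate of the clean sample
    return a_prev.sqrt() * x0_hat + (1 - a_prev).sqrt() * eps_pred

# Sampling loop: start from pure Gaussian noise and denoise step by step.
# x = torch.randn(batch, channels, height, width)
# for t in reversed(range(num_steps)):
#     x = denoise_step(model, x, t, alphas_cumprod)
```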

Conditional computation is a promising scaling technique that increases model capacity while keeping training and inference costs roughly constant by activating only a subset of parameters for each example. A prominent instance is the Mixture of Experts (MoE), which combines the outputs of sub-models, or experts, through an input-dependent router and has been used successfully in many fields. In NLP, top-k gating was introduced for LSTMs, together with auxiliary losses that keep the experts balanced. For diffusion models, prior MoE work has used multiple expert models, each specializing in a particular range of timesteps. A minimal top-k routed layer with a balance loss is sketched below.
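The sketch below shows what a top-k routed MoE feed-forward layer with a Switch-Transformer-style load-balancing auxiliary loss can look like in PyTorch. The expert count, top-k value, and hidden sizes are illustrative defaults, not the configurations used in the works cited above.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TopKMoE(nn.Module):
    """Sketch of a top-k routed mixture-of-experts feed-forward layer."""

    def __init__(self, dim, num_experts=8, top_k=2):
        super().__init__()
        self.top_k = top_k
        self.router = nn.Linear(dim, num_experts)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
            for _ in range(num_experts)
        )

    def forward(self, x):                                    # x: (num_tokens, dim)
        probs = self.router(x).softmax(dim=-1)               # routing probabilities
        topk_probs, topk_idx = probs.topk(self.top_k, dim=-1)

        out = torch.zeros_like(x)
        for e, expert in enumerate(self.experts):
            token_ids, slot = (topk_idx == e).nonzero(as_tuple=True)
            if token_ids.numel() == 0:
                continue                                     # no tokens routed to this expert
            weight = topk_probs[token_ids, slot].unsqueeze(-1)
            out[token_ids] += weight * expert(x[token_ids])

        # Auxiliary load-balancing loss: encourages the router to spread
        # tokens evenly across experts instead of collapsing onto a few.
        importance = probs.mean(dim=0)                                        # mean routing prob per expert
        load = F.one_hot(topk_idx, probs.size(-1)).float().sum(1).mean(0)     # fraction of tokens per expert
        aux_loss = probs.size(-1) * (importance * load).sum()
        return out, aux_loss
```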

Researchers from Kunlun Inc., Beijing, China, have proposed DiT-MoE, a sparse version of the DiT architecture for image generation. DiT-MoE replaces some of the dense feedforward layers in DiT with sparse MoE layers, in which each image token is routed to a small subset of experts (MLP layers). The architecture includes two key designs: sharing a portion of the experts to capture common knowledge, and an expert-level balance loss to reduce redundancy among the routed experts. The paper analyzes in depth how these choices make it possible to train a parameter-efficient MoE diffusion model, and it reports several interesting patterns in expert routing from various perspectives.
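Building on the previous sketch, the snippet below illustrates how always-on shared experts can sit alongside the routed experts, with the balance loss returned for the training objective. The split between shared and routed experts shown here is an assumption for illustration, not DiT-MoE's exact configuration; in the paper, layers of this kind stand in for some of the dense feedforward layers inside the transformer blocks.

```python
import torch.nn as nn

class SharedExpertMoE(nn.Module):
    """Routed experts plus always-on shared experts (sketch; reuses TopKMoE above)."""

    def __init__(self, dim, num_routed=8, num_shared=2, top_k=2):
        super().__init__()
        self.routed = TopKMoE(dim, num_experts=num_routed, top_k=top_k)
        self.shared = nn.ModuleList(
            nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
            for _ in range(num_shared)
        )

    def forward(self, x):
        # Shared experts see every token and capture common knowledge; routed
        # experts specialize, and the balance loss keeps them non-redundant.
        routed_out, aux_loss = self.routed(x)
        shared_out = sum(expert(x) for expert in self.shared)
        return routed_out + shared_out, aux_loss
```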

The AdamW optimizer is used without weight decay across all datasets, with a constant learning rate. An exponential moving average (EMA) of the DiT-MoE weights is maintained during training with a decay rate of 0.9999, and the reported results come from this EMA model. The models are trained on Nvidia A100 GPUs using the ImageNet dataset at several resolutions. Techniques such as classifier-free guidance are applied during training, and a pre-trained variational autoencoder from Stable Diffusion, available on Hugging Face, is used. Image generation quality is evaluated with the Fréchet Inception Distance (FID), a standard metric for assessing the quality of generated images.
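The optimizer and EMA bookkeeping described above can be sketched roughly as follows; the learning-rate value is a placeholder, since the article only states that it is constant, and the placeholder network stands in for the full DiT-MoE model.

```python
import copy
import torch
import torch.nn as nn

model = nn.Linear(8, 8)            # placeholder for the DiT-MoE network being trained
ema_model = copy.deepcopy(model)   # EMA copy of the weights; evaluation uses this copy

# AdamW with no weight decay and a constant learning rate (1e-4 is a placeholder value).
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4, weight_decay=0.0)

@torch.no_grad()
def update_ema(ema_model, model, decay=0.9999):
    """Update the exponential moving average of the weights (decay rate 0.9999)."""
    for ema_p, p in zip(ema_model.parameters(), model.parameters()):
        ema_p.mul_(decay).add_(p, alpha=1.0 - decay)

# In the training loop, call update_ema(ema_model, model) after each optimizer.step().
```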

Evaluation results on conditional image generation across several metrics show strong performance compared with dense competitors. On class-conditional ImageNet 256×256, DiT-MoE achieves an FID of 1.72, outperforming previous models across different architectures. It does so while using only 1.5 billion parameters, significantly outperforming Transformer-based competitors such as Large-DiT-3B, Large-DiT-7B, and LlamaGen-3B, which illustrates the potential of MoE in diffusion models. Similar improvements appear on nearly all evaluation metrics for class-conditional ImageNet 512×512.


In summary, the researchers have developed DiT-MoE, a sparse version of the DiT architecture for image generation that replaces some of the dense feedforward layers in DiT with sparse MoE layers. This use of sparse conditional computation makes it practical to train large diffusion transformer models, enabling efficient inference and significant improvements on image generation tasks. Simple designs are used to exploit model sparsity effectively depending on the input. The paper marks an early step in exploring large-scale conditional computation for diffusion models; future work includes more stable and faster training of heterogeneous expert architectures and improved knowledge distillation.


Check out the Paper and GitHub. All credit for this research goes to the researchers of this project.

Sajjad Ansari is a final year undergraduate from IIT Kharagpur. As a Tech enthusiast, he delves into the practical applications of AI with a focus on understanding the impact of AI technologies and their real-world implications. He aims to articulate complex AI concepts in a clear and accessible manner.




