In computer vision, backbones are fundamental components of many deep learning models. Downstream tasks such as classification, detection, and segmentation rely on the features extracted by the backbone. Recent years have seen an explosion of new pretraining strategies and backbone architectures. As a result, practitioners find it hard to choose the backbone best suited to their specific task and dataset.
The Battle of the Backbones (BoB) is a new large-scale benchmark that compares many popular publicly available pretrained checkpoints and randomly initialized baselines on a variety of downstream tasks. It was developed by researchers at New York University, Johns Hopkins University, University of Maryland, Georgia Institute of Technology, Inria, and Meta AI Research. The BoB findings shed light on the relative merits of different backbone architectures and pretraining strategies.
Key findings of the study include:
- Supervised pretrained convolutional networks typically perform better than transformers. This is likely because supervised convolutional networks are readily available and trained on larger datasets. On the other hand, self-supervised models outperform their supervised counterparts when the comparison is restricted to pretraining datasets of the same size.
- Compared to CNNs, ViTs are more sensitive to the number of parameters and the quantity of pretraining data. This suggests that training ViTs may require more data and compute than training CNNs. Practitioners should weigh the trade-offs among accuracy, compute cost, and data availability before settling on a backbone architecture.
- Performance is highly correlated across tasks. The best BoB backbones perform well in a wide variety of scenarios.
- End-to-end fine-tuning helps transformers more than it helps CNNs on dense prediction tasks. This suggests that transformers may be more task- and dataset-dependent than CNNs.
- Vision-language modeling, as in CLIP models, is promising, as are other advanced architectures. Among the vanilla vision transformers, CLIP pretraining performs best, even compared with backbones trained with supervision on ImageNet-21k. This result shows that vision-language pretraining can improve results on computer vision tasks. The authors advise practitioners to investigate the pretrained backbones available through CLIP.
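To make the cross-task correlation finding concrete, here is a minimal sketch of how one might check it on one's own evaluations using Spearman rank correlation. This is not the BoB authors' code: the function names are hypothetical, and the scores are made-up placeholders rather than BoB results. It only illustrates the idea that backbones which rank highly on one task tend to rank highly on another.

```python
from statistics import mean


def ranks(values):
    """Return 1-based ranks, highest value first; ties share the average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i], reverse=True)
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        # Extend j over the run of entries tied with position i.
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = avg_rank
        i = j + 1
    return result


def pearson(x, y):
    """Pearson correlation of two equal-length, non-constant sequences."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def spearman(x, y):
    """Spearman correlation: Pearson correlation computed on the ranks."""
    return pearson(ranks(x), ranks(y))


# Hypothetical placeholder scores (NOT taken from the BoB paper).
# Each list holds one score per backbone, in the same backbone order.
classification_scores = [0.81, 0.77, 0.74, 0.69]
detection_scores = [0.46, 0.44, 0.39, 0.40]

if __name__ == "__main__":
    rho = spearman(classification_scores, detection_scores)
    print(f"Spearman rank correlation across tasks: {rho:.2f}")
```

A value near 1 would mean the two tasks rank the backbones almost identically, which is the pattern the study reports. With only a handful of backbones the estimate is noisy, so it should be read as a rough indication rather than a firm conclusion.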
BoB maps out the current state of the art in computer vision backbones. However, the field is evolving quickly, with ongoing progress on novel architectures and pretraining techniques. The team therefore considers it vital to keep evaluating and comparing new backbones and to look for ways to improve performance.
Dhanshree Shenwai is a computer science engineer with experience at FinTech companies in the financial, cards and payments, and banking domains, and a keen interest in applications of AI. She is enthusiastic about exploring new technologies and advancements that make everyone’s life easier.