And so, it appears that the answer is not a fight to the death between CNNs and Transformers (see the many overindulgent eulogies for LSTMs), but rather something a bit more romantic. The adoption of 2D convolutions in hierarchical transformers like CvT and PVTv2 conveniently creates multiscale features, reduces the complexity of self-attention, and simplifies the architecture by removing the need for explicit positional encoding. These models also employ residual connections, another trait inherited from their progenitors. The complementary strengths of transformers and CNNs have been brought together in viable offspring.
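To put rough numbers on the multiscale and complexity claims, here is a back-of-the-envelope sketch rather than an implementation of any particular model. It assumes a 224×224 input, four stages with overall strides of 4, 8, 16, and 32, and per-stage spatial-reduction ratios of 8, 4, 2, and 1, in the style of the PVT family. The function names are hypothetical, and the count covers only the query-key score matrix of a single attention head.

```python
# Illustrative arithmetic only: the strides and reduction ratios below are
# assumptions in the style of PVT-like backbones, not values read from a model.

def feature_map_sides(image_size=224, strides=(4, 8, 16, 32)):
    """Side length of the square feature map at each stage."""
    return [image_size // s for s in strides]


def attention_score_entries(side, sr_ratio=1):
    """Entries in the query-key score matrix for one attention head.

    Queries come from every position of the side x side feature map.
    Keys come from the same map downsampled by sr_ratio along each axis,
    as a strided convolution would do.
    """
    queries = side * side
    reduced_side = side // sr_ratio
    keys = reduced_side * reduced_side
    return queries * keys


if __name__ == "__main__":
    sr_ratios = (8, 4, 2, 1)  # assumed per-stage spatial-reduction ratios
    for stage, (side, sr) in enumerate(zip(feature_map_sides(), sr_ratios), 1):
        full = attention_score_entries(side)
        reduced = attention_score_entries(side, sr)
        print(f"stage {stage}: {side}x{side} map, "
              f"full={full}, reduced={reduced}, saving={full // reduced}x")
```

Under these assumptions, the first stage holds the most tokens and therefore benefits most from the reduction, while the last stage is already small enough to attend over in full.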
So is the era of ResNet over? It would certainly seem so, although any paper will surely need to include this indefatigable backbone for comparison for some time to come. It is important to remember, however, that there are no losers here, just a new generation of powerful and transferable feature extractors for all to enjoy, if they know where to look. Parameter-efficient models like PVTv2 democratize research into more complex architectures by offering powerful feature extraction with a small memory footprint, and they deserve a place on the list of standard backbones for benchmarking new architectures.
This article has focused on how the cross-pollination of convolutional operations and self-attention has driven the evolution of hierarchical feature transformers. These models have shown dominant performance and parameter efficiency at small scales, making them ideal feature extraction backbones (especially in parameter-constrained environments). However, it remains largely unexplored whether the efficiencies and inductive biases that these models capitalize on at smaller scales can transfer to large-scale success and threaten the dominance of pure ViTs at much higher parameter counts.
Large Multimodal Models (LMMs) like Large Language and Vision Assistant (LLaVA), along with other applications that require a natural language understanding of visual data, rely on Contrastive Language–Image Pretraining (CLIP) embeddings generated from ViT-L features, and therefore inherit the strengths and weaknesses of ViT. Suppose research into scaling hierarchical transformers shows that their benefits, such as multiscale features that enhance fine-grained understanding, enable them to match or exceed the performance of ViT-L with greater parameter efficiency. That result would have widespread and immediate practical impact on anything using CLIP. LMMs, robotics, assistive technologies, augmented/virtual reality, content moderation, education, research, and many more applications affecting society and industry could be improved and made more efficient, lowering the barrier to developing and deploying these technologies.