Visual language models (VLMs) are powerful and versatile. Many vision and cross-modality tasks, such as image captioning, visual question answering, visual grounding, and even segmentation, can be formulated as next-token prediction. As VLMs are scaled up, useful abilities like in-context learning also emerge alongside improvements on downstream tasks. Since training a large language model is already difficult, it is even harder to train a VLM from scratch that matches the NLP performance of well-trained pure language models like LLaMA2. Consequently, it makes sense to study how to train a VLM starting from a readily available pre-trained language model.
The widely used shallow alignment techniques, represented by BLIP-2, connect a frozen pretrained vision encoder and a language model through a trainable Q-Former or a linear layer, which maps the image features into the language model’s input embedding space. While this approach converges quickly, it does not perform as well as jointly training the language and vision modules, as in PaLI-X. In chat-style VLMs trained with shallow alignment techniques, such as MiniGPT-4, LLaVA, and VisualGLM, the weak visual understanding shows up as hallucinations. Is it feasible to enhance the large language model’s visual understanding without sacrificing its natural language processing (NLP) capabilities?
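The linear-layer variant of shallow alignment can be pictured with a minimal sketch. It assumes toy dimensions and random values in place of a real vision encoder and language model; the names `shallow_align`, `proj_weight`, `d_vision`, and `d_model` are hypothetical and not taken from BLIP-2 or any other library.

```python
import random

def linear(x, weight):
    """Multiply a row vector x (length d_in) by weight (d_in x d_out)."""
    return [sum(xi * w for xi, w in zip(x, col)) for col in zip(*weight)]

def shallow_align(image_feats, text_embeds, proj_weight):
    """Project image features into the text embedding space and prepend them.

    Only proj_weight would be trained; the vision encoder that produced
    image_feats and the language model that consumes the result stay frozen.
    """
    projected = [linear(feat, proj_weight) for feat in image_feats]
    return projected + text_embeds

random.seed(0)
d_vision, d_model = 6, 4  # hypothetical sizes, far smaller than real models

# Stand-ins for frozen vision encoder outputs and text token embeddings.
image_feats = [[random.gauss(0, 1) for _ in range(d_vision)] for _ in range(3)]
text_embeds = [[random.gauss(0, 1) for _ in range(d_model)] for _ in range(5)]

# The only trainable piece in this sketch.
proj_weight = [[random.gauss(0, 0.1) for _ in range(d_model)]
               for _ in range(d_vision)]

sequence = shallow_align(image_feats, text_embeds, proj_weight)
print(len(sequence), len(sequence[0]))
```

In this picture the image enters the language model only as extra input embeddings; nothing inside the model's layers changes.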
CogVLM answers “yes.” Researchers from Zhipu AI and Tsinghua University introduced CogVLM, a powerful open-source visual language foundation model. They argue that the lack of deep fusion between language and visual information is the primary reason for the subpar performance of shallow alignment approaches. This idea came from comparing two approaches to efficient finetuning: p-tuning learns a task prefix embedding in the input, while LoRA adapts the model weights in each layer with a low-rank matrix. As a result, LoRA performs better and more stably. Since the image features in shallow alignment techniques behave similarly to the prefix embedding in p-tuning, a similar phenomenon may also occur in VLMs.
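To make the contrast concrete, here is a small sketch of the two ideas on toy matrices. It is an illustration under simplifying assumptions (a single weight matrix, row-vector convention, no training loop), and the helper names `p_tuning_input`, `lora_weight`, `lora_down`, and `lora_up` are hypothetical.

```python
def matmul(a, b):
    """Plain matrix product of two nested lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]

def p_tuning_input(prefix, token_embeds):
    """p-tuning: learned prefix vectors are prepended to the input only."""
    return prefix + token_embeds

def lora_weight(frozen_w, lora_down, lora_up, scale=1.0):
    """LoRA: effective weight = frozen_w + scale * (lora_down @ lora_up).

    frozen_w is never modified; only the two small factors are trained,
    and one such pair can be attached to the weights of every layer.
    """
    delta = matmul(lora_down, lora_up)
    return [[w + scale * d for w, d in zip(w_row, d_row)]
            for w_row, d_row in zip(frozen_w, delta)]

frozen_w = [[1.0, 0.0], [0.0, 1.0]]   # stands in for one frozen layer weight
lora_down = [[1.0], [2.0]]            # d_in x r, with rank r = 1
lora_up = [[0.5, 0.0]]                # r x d_out

effective_w = lora_weight(frozen_w, lora_down, lora_up)

prefix = [[0.1, 0.2]]                 # one learned prefix vector
token_embeds = [[1.0, 0.0], [0.0, 1.0]]
prefixed = p_tuning_input(prefix, token_embeds)
```

The prefix changes what the first layer sees and nothing else, whereas the low-rank update alters the computation inside the layer it is attached to.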
The following are more specific possible causes of the degraded performance of p-tuning and shallow alignment:
1. The frozen weights in the language model are trained for text tokens. Visual features have no perfect counterpart in the input text space. After multi-layer transformations, the visual features may therefore no longer match the input distribution of the weights in the deep layers.
2. During pretraining, the prior of the image captioning task, for instance the writing style and caption length, can only be encoded into the visual features in shallow alignment approaches. This weakens the consistency between the visual features and the content.
One potential remedy is to adapt the language model to joint image-text training, as done by Qwen-VL and PaLI. However, this unnecessarily impairs NLP ability, which may affect text-centered tasks such as writing image-based poetry or describing the background of an image. According to PaLM-E, making the language model trainable during VLM pretraining leads to catastrophic forgetting, with a drop of 87.3% in NLG performance for the 8B language model. Instead, CogVLM adds a trainable visual expert to the language model. In each layer, the image features in the sequence use their own QKV matrix and MLP layer, separate from those used by the text features. The visual expert increases the number of parameters but keeps the FLOPs the same. Since all parameters of the original language model are fixed, the model behaves exactly like the original language model when the input sequence contains no image.
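The routing idea behind the visual expert can be sketched as follows. This is a simplified illustration, not CogVLM's implementation: it shows a single projection rather than full QKV plus MLP, omits attention entirely, and uses hypothetical names (`routed_projection`, `expert_weight`, `is_image`) and toy 2x2 weights.

```python
def linear(x, weight):
    """Row vector x (length d_in) times weight (d_in x d_out)."""
    return [sum(xi * w for xi, w in zip(x, col)) for col in zip(*weight)]

def routed_projection(tokens, is_image, text_weight, expert_weight):
    """Send image tokens through expert_weight, text tokens through text_weight.

    Every token passes through exactly one matrix, so the multiply-add
    count per token is the same as without the expert.
    """
    return [
        linear(tok, expert_weight if img else text_weight)
        for tok, img in zip(tokens, is_image)
    ]

def count_params(weight):
    """Number of scalar entries in a weight matrix."""
    return sum(len(row) for row in weight)

text_weight = [[1.0, 0.0], [0.0, 1.0]]    # frozen, from the language model
expert_weight = [[2.0, 0.0], [0.0, 2.0]]  # trainable visual expert (toy values)
tokens = [[1.0, 2.0], [3.0, 4.0]]

# First token is an image token, second is a text token.
mixed = routed_projection(tokens, [True, False], text_weight, expert_weight)

# With no image tokens, the result matches the original frozen projection.
text_only = routed_projection(tokens, [False, False], text_weight, expert_weight)
baseline = [linear(tok, text_weight) for tok in tokens]

params_before = count_params(text_weight)
params_after = params_before + count_params(expert_weight)
```

In this toy setup the parameter count grows because the expert matrix has the same shape as the frozen one, while a text-only input never touches the expert, which is why the original language behavior is preserved.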
CogVLM-17B, trained from Vicuna-7B, achieves state-of-the-art or second-best performance on 14 classic cross-modal benchmarks, including: 1) image captioning datasets (NoCaps, Flickr30k, COCO); 2) VQA datasets (VQAv2, OKVQA, GQA, TextVQA, VizWiz); 3) multiple choice datasets (TDIUC, ScienceQA); and 4) visual grounding datasets (RefCOCO, RefCOCO+, RefCOCOg, Visual7W). Not included in this study is CogVLM-28B-zh, which the team trained from ChatGLM-12B to support both Chinese and English for commercial use. Since most of the best-known earlier VLMs, such as Flamingo, SimVLM, CoCa, BEiT-3, GIT2, PaLI, and PaLI-X, are closed-source, CogVLM’s open-sourcing is expected to have a significant positive impact on visual understanding research and industrial applications.
Aneesh Tickoo is a consulting intern at MarktechPost. He is currently pursuing his undergraduate degree in Data Science and Artificial Intelligence at the Indian Institute of Technology (IIT), Bhilai. He spends most of his time working on projects aimed at harnessing the power of machine learning. His research interest is image processing, and he is passionate about building solutions around it. He loves to connect with people and collaborate on interesting projects.