The fields of Natural Language Processing (NLP) and Natural Language Generation (NLG) have been transformed by the introduction of Large Language Models (LLMs) and multimodal foundation models. Multimodal models such as GPT-4V, Claude, and Gemini combine visual encoders with LLMs.

Present-day foundation models perform remarkably well on text-only inputs and on combined image-and-text inputs. However, an important question arises: do their capabilities change depending on the form in which an input is presented?

To answer this question, a team of researchers has presented IsoBench, a benchmark dataset containing problems from four domains: games, science, mathematics, and algorithms. Every problem in IsoBench comes in several isomorphic representations, including textual, mathematical, and graphic formats. Because each problem appears in several equivalent forms, performance disparities caused by the form of representation can be examined closely.
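To make "isomorphic representations" concrete, here is a minimal Python sketch that writes one small graph in two textual forms and checks that they carry the same information. The graph, the function names, and the sentence template are all invented for illustration; this is not IsoBench's actual data format.

```python
# Hypothetical example: one 4-node undirected graph, stored as an adjacency matrix.
ADJACENCY = [
    [0, 1, 0, 0],
    [1, 0, 1, 0],
    [0, 1, 0, 1],
    [0, 0, 1, 0],
]


def edges_from_matrix(matrix):
    """Return the set of undirected edges (i, j) with i < j."""
    n = len(matrix)
    return {(i, j) for i in range(n) for j in range(i + 1, n) if matrix[i][j]}


def describe_edges(edges):
    """Render the same graph as a plain-language sentence."""
    parts = [f"node {i} is connected to node {j}" for i, j in sorted(edges)]
    return "; ".join(parts) + "."


def edges_from_description(text):
    """Parse the sentence back into an edge set, to check nothing was lost."""
    edges = set()
    for clause in text.rstrip(".").split("; "):
        words = clause.split()
        # Each clause reads "node <i> is connected to node <j>".
        edges.add((int(words[1]), int(words[-1])))
    return edges


edges = edges_from_matrix(ADJACENCY)
sentence = describe_edges(edges)
print(sentence)
```

A rendered drawing of the same graph would be a third representation of the identical problem, so any difference in a model's answers across the three forms would come from the form alone.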

The team notes that IsoBench can serve as a diagnostic tool, giving detailed feedback on performance discrepancies caused by the input representation. A recurring pattern appears across a variety of foundation models: given the same problem, models prefer textual representations. For example, evaluated on all problems in IsoBench, Claude-3 Opus scores 28.7 points lower when given images instead of text. Under the same comparison, GPT-4 Turbo and Gemini Pro score 18.7 and 14.9 points lower, respectively.
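A gap of this kind can be computed as the difference between accuracy on text inputs and accuracy on image inputs. The sketch below assumes the reported "points" are differences in accuracy percentages. The record format, the function names, and the toy results are made up for illustration and are not taken from the paper.

```python
from collections import defaultdict


def accuracy_by_modality(records):
    """Map each modality to its accuracy in percent.

    `records` is an iterable of (modality, is_correct) pairs.
    """
    totals = defaultdict(int)
    correct = defaultdict(int)
    for modality, is_correct in records:
        totals[modality] += 1
        correct[modality] += int(is_correct)
    return {m: 100.0 * correct[m] / totals[m] for m in totals}


def modality_gap(records, baseline="text", other="image"):
    """Accuracy on `baseline` minus accuracy on `other`, in percentage points."""
    acc = accuracy_by_modality(records)
    return acc[baseline] - acc[other]


# Made-up toy results: 3 of 4 text prompts correct, 1 of 4 image prompts correct.
toy_records = [
    ("text", True), ("text", True), ("text", True), ("text", False),
    ("image", True), ("image", False), ("image", False), ("image", False),
]

gap = modality_gap(toy_records)
print(f"text minus image: {gap:.1f} points")
```

A positive gap means the model did better on text than on images for the same set of problems.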

Two prompting strategies, IsoCombination and IsoScratchPad, have been proposed to mitigate this reported bias and enhance model performance. IsoScratchPad focuses on enabling translations between multiple input forms, whereas IsoCombination considers combinations of diverse input representations. 

By drawing on the strengths of different input modalities, these strategies can reduce the performance disparities that foundation models show across input representations. The team's experiments show that both IsoCombination and IsoScratchPad improve model performance, suggesting directions for further work on multimodal AI systems.
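The two strategies can be sketched roughly as follows. This is a minimal illustration based only on the descriptions above, not the authors' implementation. The `Model` callable, the prompt wording, and the function names are hypothetical stand-ins for whatever multimodal API is actually used.

```python
from typing import Callable, Optional

# Hypothetical model interface: takes a text prompt and an optional image
# (any object) and returns the model's text reply.
Model = Callable[[str, Optional[object]], str]


def iso_combination(model: Model, question: str, text_repr: str, image: object) -> str:
    """IsoCB-style call: give the model both representations in one prompt."""
    prompt = (
        f"{question}\n\n"
        f"Text representation of the problem:\n{text_repr}\n\n"
        "The attached image shows the same problem."
    )
    return model(prompt, image)


def iso_scratchpad(model: Model, question: str, image: object) -> str:
    """IsoSP-style call: translate the image to text, then answer from the text."""
    # Step 1: ask the model to write out the visual input as text.
    translation = model("Describe the problem in the image as text.", image)
    # Step 2: answer the question from the text translation alone.
    prompt = f"{question}\n\nText representation of the problem:\n{translation}"
    return model(prompt, None)


def echo_model(prompt: str, image: Optional[object]) -> str:
    """Stand-in model for trying the functions without a real API."""
    tag = "with image" if image is not None else "text only"
    return f"[{tag}] {prompt}"


print(iso_scratchpad(echo_model, "Is the graph connected?", image=object()))
```

In this sketch IsoCB costs one model call and IsoSP costs two, and the second IsoSP call sees no image at all, so the final answer depends on the quality of the intermediate translation.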

The team has summarized their primary contributions as follows.

  1. The team has introduced IsoBench, an extensive test dataset of 1,630 samples spanning topics that include chess, physics, chemistry, and discrete and applied mathematics. Each sample comes with multiple isomorphic input representations, including domain-specific textual formats and visual formats, which makes comprehensive multimodal performance evaluation possible.
  2. Using IsoBench, the team has evaluated eight well-known foundation models and found a recurring pattern: multimodal models perform better on text-only prompts than on image-based prompts.
  3. The team has also proposed two methods to bridge the performance gaps between input modalities. IsoCombination (IsoCB) combines input modalities, while IsoScratchPad (IsoSP) translates visual inputs into textual representations during inference.
  4. The team has found that in some cases IsoCB and IsoSP can improve multimodal foundation models' performance by almost ten percentage points. These strategies reduce the observed bias towards textual representations and help models perform better across a variety of input modalities.

Check out the Paper and Project. All credit for this research goes to the researchers of this project.

Tanya Malhotra is a final year undergrad from the University of Petroleum & Energy Studies, Dehradun, pursuing BTech in Computer Science Engineering with a specialization in Artificial Intelligence and Machine Learning.
She is a Data Science enthusiast with good analytical and critical thinking, along with an ardent interest in acquiring new skills, leading groups, and managing work in an organized manner.
