Advancing Large Multimodal Models: DocHaystack, InfoHaystack, and the Vision-Centric Retrieval-Augmented Generation Framework

LMMs have made significant strides in vision-language understanding but still struggle to reason over large-scale image collections, limiting real-world applications such as visual search and querying extensive datasets like personal photo libraries. Existing benchmarks for multi-image question answering are constrained, typically involving up to 30 images per question, which fails to capture the complexities of large-scale retrieval tasks. To overcome these limitations, new benchmarks, DocHaystack and InfoHaystack, have been introduced, requiring models to retrieve and reason across collections of up to 1,000 documents. This shift introduces new challenges and significantly expands the scope of visual question-answering and retrieval tasks.

Retrieval-augmented generation (RAG) frameworks enhance LMMs by integrating retrieval systems with generative models, enabling them to process extensive image-text datasets effectively. While RAG approaches have been widely explored in text-based tasks, their application in vision-language contexts has gained momentum with models like MuRAG, RetVQA, and MIRAGE. These frameworks utilize advanced retrieval methods, such as relevance encoders and CLIP-based training, to filter and process large image collections. Building on these advancements, the proposed V-RAG framework leverages multiple vision encoders and introduces a question-document relevance module, offering superior performance on the DocHaystack and InfoHaystack benchmarks. This sets a new standard for large-scale visual retrieval and reasoning, addressing critical gaps in existing LMM capabilities.
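
As a point of reference for how such retrieval-augmented pipelines are typically wired up, the minimal sketch below scores a question against an image collection with an off-the-shelf CLIP model from Hugging Face transformers and returns the top-ranked documents. The model name and helper functions are illustrative choices, not the specific setup used in the paper.

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

# Off-the-shelf CLIP checkpoint used purely for illustration.
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def embed_images(paths):
    """Encode a list of image files into L2-normalized CLIP embeddings."""
    images = [Image.open(p).convert("RGB") for p in paths]
    inputs = processor(images=images, return_tensors="pt")
    with torch.no_grad():
        feats = model.get_image_features(**inputs)
    return feats / feats.norm(dim=-1, keepdim=True)

def retrieve(question, image_paths, k=5):
    """Rank image documents by cosine similarity to the question text."""
    inputs = processor(text=[question], return_tensors="pt", padding=True, truncation=True)
    with torch.no_grad():
        q = model.get_text_features(**inputs)
    q = q / q.norm(dim=-1, keepdim=True)
    doc_feats = embed_images(image_paths)
    scores = (q @ doc_feats.T).squeeze(0)
    top = scores.topk(min(k, len(image_paths)))
    return [image_paths[i] for i in top.indices.tolist()]
```

In a full RAG loop, the top-ranked images returned by such a retriever would be passed to the generative LMM alongside the question.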

Researchers from KAUST, the University of Sydney, and IHPC, A*STAR, introduced two benchmarks, DocHaystack and InfoHaystack, to evaluate LMMs on large-scale visual document retrieval and reasoning tasks. These benchmarks simulate real-world scenarios by requiring models to process up to 1,000 documents per query, addressing the limitations of smaller datasets. They also proposed V-RAG, a vision-centric retrieval-augmented generation framework that combines specialized vision encoders and a relevance assessment module. V-RAG achieved a 9% and 11% improvement in Recall@1 on the DocHaystack-1000 and InfoHaystack-1000 benchmarks, significantly advancing retrieval and reasoning capabilities for LMMs.

To improve document retrieval and reasoning, the DocHaystack and InfoHaystack benchmarks ensure each question maps to a unique, document-specific answer. Ambiguity is removed through a three-step curation pipeline: filtering out overly general questions with an LLM, manually reviewing the remainder for specificity, and discarding questions answerable from general knowledge alone. The Vision-centric Retrieval-Augmented Generation (V-RAG) framework retrieves from these large collections using an ensemble of vision encoders together with an LLM-based relevance filtering module: candidate documents are first ranked by the encoder ensemble and then narrowed by the filter into a small, question-specific subset. The question and the selected documents are then passed to the LMM to produce the answer, keeping the pipeline vision-centric throughout.
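
The paper's exact fusion and filtering rules are not spelled out in this summary, but the general retrieve-then-filter pattern V-RAG follows can be sketched as below; the score-averaging ensemble and the yes/no `judge` callable are hypothetical stand-ins for the framework's vision-encoder ensemble and LLM-based relevance module.

```python
import torch

def ensemble_retrieve(question, doc_ids, encoders, k=20):
    """Fuse similarity scores from several vision-language encoders.

    `encoders` is a list of (embed_text, embed_docs) callables returning
    L2-normalized tensors; simple score averaging is used here as a
    stand-in for whatever combination rule the paper applies.
    """
    fused = torch.zeros(len(doc_ids))
    for embed_text, embed_docs in encoders:
        q = embed_text(question)   # shape: (d,)
        d = embed_docs(doc_ids)    # shape: (n_docs, d)
        fused += d @ q             # cosine similarity per document
    top = fused.topk(min(k, len(doc_ids))).indices.tolist()
    return [doc_ids[i] for i in top]

def relevance_filter(question, candidates, judge):
    """Keep only candidates a judge model deems relevant to the question.

    `judge(question, doc_id)` is a hypothetical callable that prompts an
    LLM/VLM to answer yes/no on question-document relevance.
    """
    return [doc for doc in candidates if judge(question, doc)]

# Retrieve-then-filter: a coarse top-k shortlist from the encoder ensemble,
# then a relevance pass before handing the survivors to the answering LMM.
# shortlist = relevance_filter(question, ensemble_retrieve(question, doc_ids, encoders), judge)
```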

The experiments section details the training setup, metrics, baselines, and results for evaluating the V-RAG framework. Metrics include Recall@1, @3, and @5 for document retrieval and a GPT-4o-mini-based model evaluation for VQA tasks. V-RAG outperforms baselines like BM25, CLIP, and OpenCLIP across DocHaystack and InfoHaystack benchmarks, achieving superior recall and accuracy scores. Fine-tuning with curated distractor images enhances VQA robustness. Ablation studies reveal the importance of combining multiple encoders and the VLM-filter module, significantly improving retrieval accuracy. V-RAG’s top performance across challenging benchmarks highlights its effectiveness in large-scale multimodal document understanding and retrieval tasks.
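
For concreteness, Recall@k here simply checks whether the single ground-truth document for each question appears among the top-k retrieved candidates, matching the one-question-one-document setup of the benchmarks. A small sketch of that metric:

```python
def recall_at_k(ranked_lists, gold_docs, k):
    """Fraction of questions whose gold document appears in the top-k results.

    `ranked_lists[i]` is the ranked retrieval output for question i and
    `gold_docs[i]` is its single ground-truth document.
    """
    hits = sum(1 for ranked, gold in zip(ranked_lists, gold_docs) if gold in ranked[:k])
    return hits / len(gold_docs)

# Toy example with two questions:
ranked = [["doc_7", "doc_2", "doc_9"], ["doc_4", "doc_1", "doc_8"]]
gold = ["doc_2", "doc_4"]
print([recall_at_k(ranked, gold, k) for k in (1, 3, 5)])  # [0.5, 1.0, 1.0]
```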

In conclusion, the study introduces DocHaystack and InfoHaystack, benchmarks designed to assess LMMs in large-scale document retrieval and reasoning tasks. Current benchmarks for multi-image question-answering are limited to small datasets, failing to reflect real-world complexities. The proposed V-RAG framework integrates multiple vision encoders and a relevance filtering module to address this, enhancing retrieval precision and reasoning capabilities. V-RAG outperforms baseline models, achieving up to 11% higher Recall@1 on the DocHaystack-1000 and InfoHaystack-1000 benchmarks. By enabling efficient processing of thousands of images, V-RAG significantly improves LMM performance in large-scale image retrieval and complex reasoning scenarios.


Check out the Paper. All credit for this research goes to the researchers of this project.
