Big vision-language models, or LVLMs, can interpret visual cues and provide easy replies for users to interact with. This is accomplished by skillfully fusing large language models (LLMs) with large-scale visual instruction finetuning. Nevertheless, LVLMs only need hand-crafted or LLM-generated datasets for alignment by supervised fine-tuning (SFT). Although it works well to change LVLMs from caption generators to models that obey instructions, LVLMs can still produce replies that are hurtful, ill-intentioned, or useless. This suggests that they still need to be more aligned with human preferences. Furthermore, while previous research encourages the organization of visual instruction tuning samples in multi-turn forms, the LVLMs’ capacity to interact is limited by the weak connections and interdependence between different turns. Here, the interaction ability assesses how well LVLMs can adjust their replies using the prior context in multi-turn interactions. These two drawbacks limit the practical use of LVLMs as visual helpers.

The research team from  SRI International and the University of Illinois Urbana-Champaign presents DRESS, an LVLM that is uniquely taught using Natural Language Feedback (NLF) produced by LLMs in this work (refer to Figure 1). The research team instructs LLMs to provide fine-grained feedback on the LVLM’s replies by providing them with specific rules and extensive photo annotation. In keeping with the process of creating human-aligned LLMs, this feedback annotation considers the three H criteria: helpfulness, honesty, and harmlessness. The feedback measures the replies’ overall quality along the 3H criteria and provides a numerical score and NLF. The research team’s method divides NLF into critique and refining. This is a novel classification. While the refinement NLF offers precise recommendations to LVLMs on improving their replies to align with the ground truth reference, the critique NLF evaluates the responses’ strengths and faults. This classification provides a natural application of two kinds of NLF to make LVLMs more palatable to humans and enhance their interaction capabilities.

Meet 'DRESS': A Large Vision Language Model (LVLM) that Align and Interact with Humans via Natural Language Feedback - image  on
Figure 1: Researchers direct DRESS to use natural language input, which is divided into two categories, critique and refinement, to enhance both alignment with human preferences and interaction capacity.

The research team generalizes the conditional reinforcement learning technique to meet the non-differentiable character of NLF and trains the LVLMs with such feedback. Specifically, the research team uses linguistic modeling (LM) loss on the replies to train DRESS to generate equivalent responses conditioned on the two NLFs. The research team refines DRESS by analyzing and interpreting the numerical results to match user preferences better. Through multi-turn interactions during inference, the research team trains DRESS to learn the meta-skill of refining its original replies by employing refinement NLF.

The research team assesses DRESS on multi-turn interactions, adversarial prompting for harmlessness assessment, picture captioning for honesty assessment, and open-ended visual question responding for helpfulness evaluation. The experiments’ findings show that, compared to earlier LVLMs, DRESS can provide replies that align with human values and have superior interaction capabilities that allow it to learn from feedback and modify responses as needed efficiently. To their knowledge, the research team’s effort is the first to address the interaction ability and all three 3H criteria for LVLMs.

The research team’s contributions are summed up as follows:

• The research team suggests using natural language feedback (NLF), which may be divided into critique and refining NLF, to enhance LVLMs’ ability to interact and align with human preferences.

• By training the model to provide matching responses conditioned on the NLF, the research team generalizes the conditional reinforcement learning method to accommodate the non-differentiable NLF successfully. Compared to the previous SOTA, the research team’s suggested model, DRESS, demonstrates relative improvements of 9.76%, 11.52%, and 21.03% based on a systematic evaluation of helpfulness, honesty, and harmlessness alignment.

• The research group generates and makes 63K annotated language NLF examples available for public use, including 3H characteristics. Furthermore, the research team created a publicly available dataset of 4.7K samples for harmlessness alignment and LVLM assessment.

Check out the Paper and DatasetAll credit for this research goes to the researchers of this project. Also, don’t forget to join our 33k+ ML SubReddit, 41k+ Facebook Community, Discord Channel, and Email Newsletter, where we share the latest AI research news, cool AI projects, and more.

If you like our work, you will love our newsletter..

Aneesh Tickoo is a consulting intern at MarktechPost. He is currently pursuing his undergraduate degree in Data Science and Artificial Intelligence from the Indian Institute of Technology(IIT), Bhilai. He spends most of his time working on projects aimed at harnessing the power of machine learning. His research interest is image processing and is passionate about building solutions around it. He loves to connect with people and collaborate on interesting projects.

Source link