Scaling Inference-Time Search with Vision Value Model for Improved Visual Comprehension

December 4, 2024
Authors: Wang Xiyao, Yang Zhengyuan, Li Linjie, Lu Hongjin, Xu Yuancheng, Lin Chung-Ching, Lin Kevin, Huang Furong, Wang Lijuan
cs.AI

Abstract

Despite significant advances in vision-language models (VLMs), effective approaches for improving response quality by scaling inference-time computation are still lacking. This capability is known to be a core step toward self-improving models in recent large language model studies. In this paper, we present Vision Value Model (VisVM), which guides VLM inference-time search to generate responses with better visual comprehension. Specifically, VisVM not only evaluates the quality of the sentence generated in the current search step, but also anticipates the quality of subsequent sentences that may follow from it, thus providing a long-term value. In this way, VisVM steers VLMs away from generating sentences that are prone to hallucination or lack detail, thereby producing higher-quality responses. Experimental results demonstrate that, compared with greedy decoding and search methods that use other visual reward signals, VisVM-guided search significantly enhances VLMs' ability to generate descriptive captions with richer visual details and fewer hallucinations. Furthermore, we find that self-training the model on VisVM-guided captions improves VLM performance across a wide range of multimodal benchmarks, indicating the potential for developing self-improving VLMs. Our value model and code are available at https://github.com/si0wang/VisVM.
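
The search idea described in the abstract can be illustrated with a minimal sketch. It assumes a simple step-wise scheme: at each step a generator proposes several candidate sentences, a value function scores each one, and the highest-scoring candidate is kept. The names `value_guided_search`, `toy_propose`, and `toy_value` are hypothetical and are not taken from the VisVM repository; the toy value function is a stand-in for a learned model, and the authors' actual procedure may differ.

```python
from typing import Callable, List, Sequence

# Hypothetical signatures for illustration only; not the VisVM API.
Proposer = Callable[[List[str]], Sequence[str]]  # prefix -> candidate next sentences
ValueFn = Callable[[List[str], str], float]      # (prefix, candidate) -> estimated value


def value_guided_search(propose: Proposer, value: ValueFn, max_steps: int = 8) -> List[str]:
    """Build a response sentence by sentence, keeping the highest-value candidate each step."""
    response: List[str] = []
    for _ in range(max_steps):
        candidates = propose(response)
        if not candidates:  # the proposer signals the end of the response
            break
        best = max(candidates, key=lambda c: value(response, c))
        response.append(best)
    return response


# Toy stand-ins so the sketch runs without a real VLM or value model.
def toy_propose(prefix: List[str]) -> Sequence[str]:
    steps = [
        ["A dog sits on a sofa.", "A cat flies a plane."],
        ["The sofa is red.", "There is a unicorn nearby."],
    ]
    return steps[len(prefix)] if len(prefix) < len(steps) else []


def toy_value(prefix: List[str], candidate: str) -> float:
    # Assumption: these words are "grounded" in the image. A real score would
    # come from a learned value model that also anticipates later sentences.
    grounded = {"dog", "sofa", "red"}
    words = candidate.lower().rstrip(".").split()
    return sum(w in grounded for w in words) / len(words)


caption = value_guided_search(toy_propose, toy_value)
print(" ".join(caption))
```

In this sketch the value function is the only component that decides which sentence survives, so replacing `toy_value` with a score that reflects expected future sentence quality, rather than only the current sentence, is what would distinguish value-guided search from search with an immediate reward.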

