Scaling Inference-Time Search with Vision Value Model for Improved Visual Comprehension

Abstract

Despite significant advances in vision-language models (VLMs), effective approaches for enhancing response quality by scaling inference-time computation are still lacking. This capability is regarded as a core step towards self-improving models in recent large language model studies. In this paper, we present the Vision Value Model (VisVM), which guides VLM inference-time search to generate responses with better visual comprehension. Specifically, VisVM not only evaluates the quality of the sentence generated in the current search step, but also anticipates the quality of the subsequent sentences that may follow from that step, thus providing a long-term value. In this way, VisVM steers VLMs away from generating sentences that are prone to hallucination or lack detail, thereby producing higher-quality responses. Experimental results demonstrate that, compared with greedy decoding and with search methods that use other visual reward signals, VisVM-guided search significantly enhances VLMs' ability to generate descriptive captions with richer visual details and fewer hallucinations. Furthermore, we find that self-training the model on VisVM-guided captions improves the VLM's performance across a wide range of multimodal benchmarks, indicating the potential for developing self-improving VLMs. Our value model and code are available at https://github.com/si0wang/VisVM.
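The sentence-level search described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names `value_guided_search`, `toy_propose`, and `toy_value` are hypothetical, the candidate sentences are hard-coded, and the toy scorer only counts words from a made-up set of grounded objects. In the actual method, the candidates would be sampled from a VLM and the score would come from the trained value model, which also accounts for the quality of sentences that may follow.

```python
from typing import Callable, List, Sequence


def value_guided_search(
    propose: Callable[[List[str]], Sequence[str]],
    value: Callable[[List[str], str], float],
    max_steps: int,
) -> List[str]:
    """Build a response sentence by sentence, keeping the highest-valued candidate."""
    response: List[str] = []
    for _ in range(max_steps):
        candidates = propose(response)  # candidate next sentences for this step
        if not candidates:
            break  # nothing left to add, so the response is complete
        # Keep the candidate that the value function scores highest.
        best = max(candidates, key=lambda s: value(response, s))
        response.append(best)
    return response


# Toy stand-ins (assumptions for illustration only).
def toy_propose(prefix: List[str]) -> Sequence[str]:
    steps = [
        ["A dog sits on grass.", "A cat flies a plane."],
        ["It wears a red collar.", "It is nice."],
    ]
    return steps[len(prefix)] if len(prefix) < len(steps) else []


GROUNDED = {"dog", "grass", "red", "collar"}  # objects assumed visible in the image


def toy_value(prefix: List[str], sentence: str) -> float:
    # Count how many words of the sentence refer to grounded objects.
    words = {w.strip(".").lower() for w in sentence.split()}
    return float(len(words & GROUNDED))


caption = value_guided_search(toy_propose, toy_value, max_steps=5)
print(" ".join(caption))
```

At each step the sketch keeps the sentence with the highest score, so the hallucinated candidate and the low-detail candidate are both discarded in this toy run.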
