
TLDR: Token-Level Detective Reward Model for Large Vision Language Models

October 7, 2024
Authors: Deqing Fu, Tong Xiao, Rui Wang, Wang Zhu, Pengchuan Zhang, Guan Pang, Robin Jia, Lawrence Chen
cs.AI

Abstract

Although reward models have been successful in improving multimodal large language models, the reward models themselves remain coarse and contain minimal information. Notably, existing reward models only mimic human annotations by assigning a single binary feedback to any text, no matter how long the text is. In the realm of multimodal language models, where models are required to process both images and text, a naive reward model may learn implicit biases toward text and become less grounded in images. In this paper, we propose a Token-Level Detective Reward Model (TLDR) that provides fine-grained annotations for each text token. We first introduce a perturbation-based method to generate synthetic hard negatives and their token-level labels for training TLDR models. We then show the broad usefulness of TLDR models, both in assisting off-the-shelf models to self-correct their generations and in serving as a hallucination evaluation tool. Finally, we show that TLDR models can speed up human annotation threefold, enabling the acquisition of a broader range of high-quality vision-language data.
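To make the perturbation idea concrete, the sketch below shows one possible way to build a token-level hard negative: minimally perturbing a grounded caption and marking only the changed token as negative. This is a hypothetical illustration, not the paper's actual pipeline; the names `perturb_caption` and `TOKEN_SWAPS` and the word-level swap table are assumptions made for this example.

```python
# Illustrative sketch of perturbation-based hard-negative construction
# with token-level labels. All names here are hypothetical.

TOKEN_SWAPS = {      # assumed object/attribute/count swaps for perturbation
    "dog": "cat",
    "red": "blue",
    "two": "three",
}

def perturb_caption(tokens):
    """Return a perturbed copy of `tokens` plus per-token labels.

    Label 1 = token assumed still consistent with the image,
    label 0 = perturbed (now hallucinated) token. Only the first
    swappable token is changed, so the negative differs minimally
    from the original caption.
    """
    perturbed, labels = [], []
    swapped = False
    for tok in tokens:
        if not swapped and tok in TOKEN_SWAPS:
            perturbed.append(TOKEN_SWAPS[tok])
            labels.append(0)   # this token no longer matches the image
            swapped = True
        else:
            perturbed.append(tok)
            labels.append(1)   # token unchanged, assumed still grounded
    return perturbed, labels

if __name__ == "__main__":
    caption = "a dog sleeping on the red sofa".split()
    negative, labels = perturb_caption(caption)
    print(" ".join(negative))  # a cat sleeping on the red sofa
    print(labels)              # [1, 0, 1, 1, 1, 1, 1]
```

Under this framing, a TLDR-style model would then be trained to predict these per-token labels conditioned on the image, so that at inference time it can flag which tokens of a generated caption are likely hallucinated.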
