

Enhancing Cognition and Explainability of Multimodal Foundation Models with Self-Synthesized Data

February 19, 2025
Authors: Yucheng Shi, Quanzheng Li, Jin Sun, Xiang Li, Ninghao Liu
cs.AI

Abstract

Large multimodal models (LMMs) have shown impressive capabilities in a wide range of visual tasks. However, they often struggle with fine-grained visual reasoning, failing to identify domain-specific objectives and provide justifiable explanations for their predictions. To address this, we propose a novel visual rejection sampling framework to improve the cognition and explainability of LMMs using self-synthesized data. Specifically, visual fine-tuning requires images, queries, and target answers. Our approach begins by synthesizing interpretable answers that include human-verifiable visual features. These features are based on expert-defined concepts, carefully selected based on their alignment with the image content. After each round of fine-tuning, we apply a reward model-free filtering mechanism to select the highest-quality interpretable answers for the next round of tuning. This iterative process of data synthesis and fine-tuning progressively improves the model's ability to generate accurate and reasonable explanations. Experimental results demonstrate the effectiveness of our method in improving both the accuracy and explainability of specialized visual classification tasks.
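As a rough illustration of the loop the abstract describes, the following Python sketch shows one way the synthesize-filter-fine-tune cycle could be organized. This is a minimal sketch, not the authors' implementation: `Candidate`, `generate_answers`, `reward_free_filter`, and `fine_tune` are all hypothetical names, and scoring answers by their fit to expert-defined concepts is an assumption made for illustration rather than a detail taken from the paper.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

@dataclass
class Candidate:
    answer: str   # interpretable answer citing expert-defined visual concepts
    score: float  # reward-model-free quality estimate (e.g., image-concept fit)

def reward_free_filter(cands: List[Candidate], keep: int = 1) -> List[Candidate]:
    # Select the top answers by score instead of querying a learned reward model.
    return sorted(cands, key=lambda c: c.score, reverse=True)[:keep]

def iterative_refinement(
    model,                                  # current LMM checkpoint
    data: List[Tuple[object, str]],         # (image, query) pairs
    generate_answers: Callable,             # (model, image, query) -> List[Candidate]
    fine_tune: Callable,                    # (model, triples) -> tuned model
    rounds: int = 3,
):
    for _ in range(rounds):
        synthesized = []
        for image, query in data:
            cands = generate_answers(model, image, query)  # self-synthesis
            best = reward_free_filter(cands)               # rejection sampling
            synthesized += [(image, query, c.answer) for c in best]
        model = fine_tune(model, synthesized)              # next tuning round
    return model
```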

