Symmetrical Visual Contrastive Optimization: Aligning Vision-Language Models with Minimal Contrastive Images
February 19, 2025
Authors: Shengguang Wu, Fan-Yun Sun, Kaiyue Wen, Nick Haber
cs.AI
Abstract
Recent studies have shown that Large Vision-Language Models (VLMs) tend to
neglect image content and over-rely on language-model priors, resulting in
errors in visually grounded tasks and hallucinations. We hypothesize that this
issue arises because existing VLMs are not explicitly trained to generate texts
that are accurately grounded in fine-grained image details. To enhance visual
feedback during VLM training, we propose S-VCO (Symmetrical Visual Contrastive
Optimization), a novel finetuning objective that steers the model toward
capturing important visual details and aligning them with corresponding text
tokens. To further facilitate this detailed alignment, we introduce MVC, a
paired image-text dataset built by automatically filtering and augmenting
visual counterfactual data to challenge the model with hard contrastive cases
involving Minimal Visual Contrasts. Experiments show that our method
consistently improves VLM performance across diverse benchmarks covering
various abilities and domains, achieving up to a 22% reduction in
hallucinations, and significant gains in vision-centric and general tasks.
Notably, these improvements become increasingly pronounced in benchmarks with
higher visual dependency. In short, S-VCO significantly enhances a VLM's
visually dependent task performance while retaining or even improving the
model's general abilities. We open-source our code at https://s-vco.github.io/