Symmetrical Visual Contrastive Optimization: Aligning Vision-Language Models with Minimal Contrastive Images

Abstract

Recent studies have shown that Large Vision-Language Models (VLMs) tend to neglect image content and over-rely on language-model priors, resulting in hallucinations and errors in visually grounded tasks. We hypothesize that this issue arises because existing VLMs are not explicitly trained to generate text that is accurately grounded in fine-grained image details. To enhance visual feedback during VLM training, we propose S-VCO (Symmetrical Visual Contrastive Optimization), a novel finetuning objective that steers the model toward capturing important visual details and aligning them with the corresponding text tokens. To further facilitate this detailed alignment, we introduce MVC, a paired image-text dataset built by automatically filtering and augmenting visual counterfactual data to challenge the model with hard contrastive cases involving Minimal Visual Contrasts. Experiments show that our method consistently improves VLM performance across diverse benchmarks covering various abilities and domains, achieving up to a 22% reduction in hallucinations and significant gains on vision-centric and general tasks. Notably, these improvements become increasingly pronounced on benchmarks with higher visual dependency. In short, S-VCO significantly enhances VLMs' performance on visually dependent tasks while retaining or even improving their general abilities. We open-source our code at https://s-vco.github.io/.