FUSION: Fully Integration of Vision-Language Representations for Deep Cross-Modal Understanding
April 14, 2025
Authors: Zheng Liu, Mengjie Liu, Jingzhou Chen, Jingwei Xu, Bin Cui, Conghui He, Wentao Zhang
cs.AI
Abstract
We introduce FUSION, a family of multimodal large language models (MLLMs)
with a fully vision-language alignment and integration paradigm. Unlike
existing methods that primarily rely on late-stage modality interaction during
LLM decoding, our approach achieves deep, dynamic integration throughout the
entire processing pipeline. To this end, we propose Text-Guided Unified Vision
Encoding, incorporating textual information in vision encoding to achieve
pixel-level integration. We further design Context-Aware Recursive Alignment
Decoding that recursively aggregates visual features conditioned on textual
context during decoding, enabling fine-grained, question-level semantic
integration. To guide feature mapping and mitigate modality discrepancies, we
develop Dual-Supervised Semantic Mapping Loss. Additionally, we construct a
Synthesized Language-Driven Question-Answer (QA) dataset through a new data
synthesis method, prioritizing high-quality QA pairs to optimize text-guided
feature integration. Building on these foundations, we train FUSION at two
scales, 3B and 8B, and demonstrate that our full-modality integration approach
significantly outperforms existing methods with only 630 vision tokens.
Notably, FUSION 3B surpasses Cambrian-1 8B and Florence-VL 8B on most
benchmarks. FUSION 3B continues to outperform Cambrian-1 8B even when limited
to 300 vision tokens. Our ablation studies show that FUSION outperforms
LLaVA-NeXT on over half of the benchmarks under the same configuration,
without dynamic resolution, highlighting the effectiveness of our approach.
We release our code, model weights, and dataset at
https://github.com/starriver030515/FUSION.
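As a rough illustration of Text-Guided Unified Vision Encoding, the PyTorch sketch below interleaves cross-attention to the question text inside a ViT-style encoder block, so textual guidance reaches the patch-level (pixel-level) representation rather than only the LLM decoding stage. This is a minimal approximation written for this page; the class name, dimensions, and block structure are assumptions, not the authors' released implementation.

```python
import torch
import torch.nn as nn

class TextGuidedVisionBlock(nn.Module):
    """Vision-encoder block whose patch tokens also attend to the question text.

    Hypothetical sketch: names and shapes are illustrative, not FUSION's code.
    """
    def __init__(self, d_model: int = 768, n_heads: int = 12):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.mlp = nn.Sequential(
            nn.Linear(d_model, 4 * d_model), nn.GELU(), nn.Linear(4 * d_model, d_model)
        )
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.norm3 = nn.LayerNorm(d_model)

    def forward(self, patches: torch.Tensor, text: torch.Tensor) -> torch.Tensor:
        # patches: (B, N_patches, d_model); text: (B, N_text, d_model)
        h = self.norm1(patches)
        patches = patches + self.self_attn(h, h, h, need_weights=False)[0]
        # Cross-attention injects textual guidance while the image is still
        # being encoded, which is the key difference from late-stage fusion.
        h = self.norm2(patches)
        patches = patches + self.cross_attn(h, text, text, need_weights=False)[0]
        return patches + self.mlp(self.norm3(patches))
```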
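Context-Aware Recursive Alignment Decoding can be pictured as a small set of latent tokens that, at each decoding round, are first conditioned on the current textual context and then re-aggregate the vision features, yielding question-level semantic integration. The sketch below is again an assumption-laden simplification (single attention layers, a fixed latent count, and no interleaving with LLM layers):

```python
import torch
import torch.nn as nn

class RecursiveAlignmentDecoder(nn.Module):
    """Latent tokens that re-aggregate vision features each decoding round.

    Hypothetical sketch; in a real model this would interleave with LLM layers.
    """
    def __init__(self, d_model: int = 1024, n_heads: int = 8, n_latents: int = 16):
        super().__init__()
        self.latents = nn.Parameter(torch.randn(n_latents, d_model) * 0.02)
        self.text_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.vision_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm = nn.LayerNorm(d_model)

    def forward(self, text_ctx, vision_feats, prev_latents=None):
        # text_ctx: (B, T, d); vision_feats: (B, V, d)
        B = text_ctx.size(0)
        q = prev_latents if prev_latents is not None else self.latents.expand(B, -1, -1)
        # 1) Condition the latent tokens on the current textual context.
        q = q + self.text_attn(q, text_ctx, text_ctx, need_weights=False)[0]
        # 2) Aggregate vision features under that question-level conditioning.
        q = q + self.vision_attn(q, vision_feats, vision_feats, need_weights=False)[0]
        # The output is fed back as prev_latents on the next round, making the
        # aggregation recursive across decoding steps.
        return self.norm(q)
```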
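The Dual-Supervised Semantic Mapping Loss supervises the feature mapping in both directions, vision into the language space and language into the vision space, to narrow the modality gap. A minimal sketch, assuming pooled per-sample features and two small projection networks (v2t and t2v are hypothetical names, and detaching the targets is a design-choice assumption):

```python
import torch
import torch.nn.functional as F

def dual_supervised_mapping_loss(vision_feats, text_feats, v2t, t2v):
    """Bidirectional mapping supervision between vision and text features.

    v2t and t2v are small projection networks (e.g. MLPs); mean-pooling to one
    vector per sample is an illustrative simplification of the paper's scheme.
    """
    v = vision_feats.mean(dim=1)   # (B, d) pooled image representation
    t = text_feats.mean(dim=1)     # (B, d) pooled text representation
    # Vision mapped into the language space should match the text embedding,
    # and vice versa; targets are detached to avoid trivial collapse (an
    # assumption of this sketch, not a confirmed detail).
    loss_v2t = F.mse_loss(v2t(v), t.detach())
    loss_t2v = F.mse_loss(t2v(t), v.detach())
    return loss_v2t + loss_t2v
```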
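Finally, the Synthesized Language-Driven QA dataset is built text-first: descriptions and QA pairs come from a language model, and images are then synthesized from those descriptions, with filtering to keep only high-quality pairs. The loop below is purely schematic; llm_generate, text_to_image, and quality_filter are placeholder callables, not real APIs or the authors' pipeline.

```python
def synthesize_qa_dataset(seed_topics, llm_generate, text_to_image, quality_filter):
    """Schematic text-first synthesis loop; every callable is a placeholder."""
    dataset = []
    for topic in seed_topics:
        # 1) Language first: a detailed description drives everything else.
        caption = llm_generate(f"Write a detailed image description about {topic}.")
        # 2) QA pairs grounded in that description (assumed to be returned as
        #    parsed dicts with "q" and "a" keys).
        qa_pairs = llm_generate(f"Write QA pairs grounded in: {caption}", parse=True)
        # 3) The image is synthesized from the caption (e.g. a diffusion model).
        image = text_to_image(caption)
        # 4) Keep only high-quality pairs, mirroring the abstract's emphasis.
        for qa in quality_filter(qa_pairs):
            dataset.append({"image": image, "question": qa["q"], "answer": qa["a"]})
    return dataset
```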