FUSION: Fully Integration of Vision-Language Representations for Deep Cross-Modal Understanding
April 14, 2025
Authors: Zheng Liu, Mengjie Liu, Jingzhou Chen, Jingwei Xu, Bin Cui, Conghui He, Wentao Zhang
cs.AI
Abstract
We introduce FUSION, a family of multimodal large language models (MLLMs)
with a fully vision-language alignment and integration paradigm. Unlike
existing methods that primarily rely on late-stage modality interaction during
LLM decoding, our approach achieves deep, dynamic integration throughout the
entire processing pipeline. To this end, we propose Text-Guided Unified Vision
Encoding, incorporating textual information in vision encoding to achieve
pixel-level integration. We further design Context-Aware Recursive Alignment
Decoding that recursively aggregates visual features conditioned on textual
context during decoding, enabling fine-grained, question-level semantic
integration. To guide feature mapping and mitigate modality discrepancies, we
develop Dual-Supervised Semantic Mapping Loss. Additionally, we construct a
Synthesized Language-Driven Question-Answer (QA) dataset through a new data
synthesis method, prioritizing high-quality QA pairs to optimize text-guided
feature integration. Building on these foundations, we train FUSION at two
scales, 3B and 8B, and demonstrate that our full-modality integration approach
significantly outperforms existing methods with only 630 vision tokens.
Notably, FUSION 3B surpasses Cambrian-1 8B and Florence-VL 8B on most
benchmarks. FUSION 3B continues to outperform Cambrian-1 8B even when limited
to 300 vision tokens. Our ablation studies show that FUSION outperforms
LLaVA-NeXT on over half of the benchmarks under the same configuration,
without dynamic resolution, highlighting the effectiveness of our approach.
We release our code, model weights, and dataset at
https://github.com/starriver030515/FUSION.
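As a rough illustration of Text-Guided Unified Vision Encoding, the PyTorch sketch below interleaves cross-attention to the question text inside a ViT-style encoder block, so textual guidance reaches the patch-level (pixel-level) representation rather than only the LLM decoding stage. This is a minimal approximation written for this page; the class name, dimensions, and block structure are assumptions, not the authors' released implementation.

```python
import torch
import torch.nn as nn

class TextGuidedVisionBlock(nn.Module):
    """Vision-encoder block whose patch tokens also attend to the question text.

    Hypothetical sketch: names and shapes are illustrative, not FUSION's code.
    """
    def __init__(self, d_model: int = 768, n_heads: int = 12):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.mlp = nn.Sequential(
            nn.Linear(d_model, 4 * d_model), nn.GELU(), nn.Linear(4 * d_model, d_model)
        )
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.norm3 = nn.LayerNorm(d_model)

    def forward(self, patches: torch.Tensor, text: torch.Tensor) -> torch.Tensor:
        # patches: (B, N_patches, d_model); text: (B, N_text, d_model)
        h = self.norm1(patches)
        patches = patches + self.self_attn(h, h, h, need_weights=False)[0]
        # Cross-attention injects textual guidance while the image is still
        # being encoded, which is the key difference from late-stage fusion.
        h = self.norm2(patches)
        patches = patches + self.cross_attn(h, text, text, need_weights=False)[0]
        return patches + self.mlp(self.norm3(patches))
```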
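Context-Aware Recursive Alignment Decoding can be pictured as a small set of latent tokens that, at each decoding round, are first conditioned on the current textual context and then re-aggregate the vision features, yielding question-level semantic integration. The sketch below is again an assumption-laden simplification (single attention layers, a fixed latent count, and no interleaving with LLM layers):

```python
import torch
import torch.nn as nn

class RecursiveAlignmentDecoder(nn.Module):
    """Latent tokens that re-aggregate vision features each decoding round.

    Hypothetical sketch; in a real model this would interleave with LLM layers.
    """
    def __init__(self, d_model: int = 1024, n_heads: int = 8, n_latents: int = 16):
        super().__init__()
        self.latents = nn.Parameter(torch.randn(n_latents, d_model) * 0.02)
        self.text_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.vision_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm = nn.LayerNorm(d_model)

    def forward(self, text_ctx, vision_feats, prev_latents=None):
        # text_ctx: (B, T, d); vision_feats: (B, V, d)
        B = text_ctx.size(0)
        q = prev_latents if prev_latents is not None else self.latents.expand(B, -1, -1)
        # 1) Condition the latent tokens on the current textual context.
        q = q + self.text_attn(q, text_ctx, text_ctx, need_weights=False)[0]
        # 2) Aggregate vision features under that question-level conditioning.
        q = q + self.vision_attn(q, vision_feats, vision_feats, need_weights=False)[0]
        # The output is fed back as prev_latents on the next round, making the
        # aggregation recursive across decoding steps.
        return self.norm(q)
```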
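The Dual-Supervised Semantic Mapping Loss supervises the feature mapping in both directions, vision into the language space and language into the vision space, to narrow the modality gap. A minimal sketch, assuming pooled per-sample features and two small projection networks (v2t and t2v are hypothetical names, and detaching the targets is a design-choice assumption):

```python
import torch
import torch.nn.functional as F

def dual_supervised_mapping_loss(vision_feats, text_feats, v2t, t2v):
    """Bidirectional mapping supervision between vision and text features.

    v2t and t2v are small projection networks (e.g. MLPs); mean-pooling to one
    vector per sample is an illustrative simplification of the paper's scheme.
    """
    v = vision_feats.mean(dim=1)   # (B, d) pooled image representation
    t = text_feats.mean(dim=1)     # (B, d) pooled text representation
    # Vision mapped into the language space should match the text embedding,
    # and vice versa; targets are detached to avoid trivial collapse (an
    # assumption of this sketch, not a confirmed detail).
    loss_v2t = F.mse_loss(v2t(v), t.detach())
    loss_t2v = F.mse_loss(t2v(t), v.detach())
    return loss_v2t + loss_t2v
```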
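Finally, the Synthesized Language-Driven QA dataset is built text-first: descriptions and QA pairs come from a language model, and images are then synthesized from those descriptions, with filtering to keep only high-quality pairs. The loop below is purely schematic; llm_generate, text_to_image, and quality_filter are placeholder callables, not real APIs or the authors' pipeline.

```python
def synthesize_qa_dataset(seed_topics, llm_generate, text_to_image, quality_filter):
    """Schematic text-first synthesis loop; every callable is a placeholder."""
    dataset = []
    for topic in seed_topics:
        # 1) Language first: a detailed description drives everything else.
        caption = llm_generate(f"Write a detailed image description about {topic}.")
        # 2) QA pairs grounded in that description (assumed to be returned as
        #    parsed dicts with "q" and "a" keys).
        qa_pairs = llm_generate(f"Write QA pairs grounded in: {caption}", parse=True)
        # 3) The image is synthesized from the caption (e.g. a diffusion model).
        image = text_to_image(caption)
        # 4) Keep only high-quality pairs, mirroring the abstract's emphasis.
        for qa in quality_filter(qa_pairs):
            dataset.append({"image": image, "question": qa["q"], "answer": qa["a"]})
    return dataset
```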