Euclid: Supercharging Multimodal LLMs with Synthetic High-Fidelity Visual Descriptions
December 11, 2024
Authors: Jiarui Zhang, Ollie Liu, Tianyu Yu, Jinyi Hu, Willie Neiswanger
cs.AI
Abstract
Multimodal large language models (MLLMs) have made rapid progress in recent
years, yet continue to struggle with low-level visual perception (LLVP) --
particularly the ability to accurately describe the geometric details of an
image. This capability is crucial for applications in areas such as robotics,
medical image analysis, and manufacturing. In this paper, we first introduce
Geoperception, a benchmark designed to evaluate an MLLM's ability to accurately
transcribe 2D geometric information from an image. Using this benchmark, we
demonstrate the limitations of leading MLLMs, and then conduct a comprehensive
empirical study to explore strategies for improving their performance on
geometric tasks. Our findings highlight the benefits of certain model
architectures, training techniques, and data strategies, including the use of
high-fidelity synthetic data and multi-stage training with a data curriculum.
Notably, we find that a data curriculum enables models to learn challenging
geometry understanding tasks that they fail to learn from scratch. Leveraging
these insights, we develop Euclid, a family of models specifically optimized
for strong low-level geometric perception. Although trained purely on synthetic
multimodal data, Euclid generalizes strongly to novel geometric shapes.
For instance, Euclid outperforms the best closed-source model,
Gemini-1.5-Pro, by up to 58.56% on certain Geoperception benchmark tasks and
10.65% on average across all tasks.