Euclid: Supercharging Multimodal LLMs with Synthetic High-Fidelity Visual Descriptions

December 11, 2024
Authors: Jiarui Zhang, Ollie Liu, Tianyu Yu, Jinyi Hu, Willie Neiswanger
cs.AI

Abstract

Multimodal large language models (MLLMs) have made rapid progress in recent years, yet they continue to struggle with low-level visual perception (LLVP) -- particularly the ability to accurately describe the geometric details of an image. This capability is crucial for applications in areas such as robotics, medical image analysis, and manufacturing. In this paper, we first introduce Geoperception, a benchmark designed to evaluate an MLLM's ability to accurately transcribe 2D geometric information from an image. Using this benchmark, we demonstrate the limitations of leading MLLMs, and then conduct a comprehensive empirical study to explore strategies for improving their performance on geometric tasks. Our findings highlight the benefits of certain model architectures, training techniques, and data strategies, including the use of high-fidelity synthetic data and multi-stage training with a data curriculum. Notably, we find that a data curriculum enables models to learn challenging geometry-understanding tasks that they fail to learn from scratch. Leveraging these insights, we develop Euclid, a family of models specifically optimized for strong low-level geometric perception. Although trained purely on synthetic multimodal data, Euclid generalizes strongly to novel geometric shapes. For instance, Euclid outperforms the best closed-source model, Gemini-1.5-Pro, by up to 58.56% on certain Geoperception benchmark tasks, and by 10.65% on average across all tasks.
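
The page provides only the abstract, but the high-fidelity synthetic data idea is concrete enough to sketch: because each figure is rendered programmatically, the paired text annotation can be made exact by construction. The snippet below is a minimal illustration of that principle, assuming a matplotlib-based renderer; `make_example` and its figure layout are hypothetical and are not the authors' actual generation pipeline.

```python
import random
import matplotlib.pyplot as plt

def make_example(path="example.png"):
    """Render a labeled line segment and return an exact description.

    Because the annotation is derived from the same coordinates used to
    draw the figure, it is correct by construction ("high fidelity").
    """
    ax = plt.figure(figsize=(3, 3)).add_subplot()
    (x1, y1), (x2, y2) = [(random.random(), random.random()) for _ in range(2)]
    mx, my = (x1 + x2) / 2, (y1 + y2) / 2  # C is the midpoint of AB
    ax.plot([x1, x2], [y1, y2], "k-")
    for x, y, name in [(x1, y1, "A"), (x2, y2, "B"), (mx, my, "C")]:
        ax.plot(x, y, "ko", markersize=3)
        ax.annotate(name, (x, y), textcoords="offset points", xytext=(4, 4))
    ax.set_axis_off()
    plt.savefig(path, dpi=200)
    plt.close()
    return path, "Point C lies on segment AB and is its midpoint."
```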

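Similarly, the multi-stage training with a data curriculum mentioned in the abstract can be pictured as a staged loop in which the model is warm-started on harder geometry tasks after easier ones; the stage names, ordering, and helper callables below are illustrative assumptions, since the page does not spell out the actual schedule.

```python
# Illustrative stage ordering, easy to hard; not the paper's actual schedule.
STAGES = ["single_primitives", "annotated_shapes", "composite_figures"]

def curriculum_train(model, train_one_epoch, loaders, epochs_per_stage=1):
    """Train through STAGES in order, carrying weights forward.

    `train_one_epoch(model, loader)` is a caller-supplied training step;
    the key point is that each stage starts from the previous stage's
    weights rather than from scratch.
    """
    for stage in STAGES:
        for _ in range(epochs_per_stage):
            train_one_epoch(model, loaders[stage])
    return model
```

Per the abstract, this staging is what lets the model learn tasks it fails to learn when trained only on the hardest data from scratch.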