Euclid: Supercharging Multimodal LLMs with Synthetic High-Fidelity Visual Descriptions

December 11, 2024
Authors: Jiarui Zhang, Ollie Liu, Tianyu Yu, Jinyi Hu, Willie Neiswanger
cs.AI

Abstract

Multimodal large language models (MLLMs) have made rapid progress in recent years, yet continue to struggle with low-level visual perception (LLVP) -- particularly the ability to accurately describe the geometric details of an image. This capability is crucial for applications in areas such as robotics, medical image analysis, and manufacturing. In this paper, we first introduce Geoperception, a benchmark designed to evaluate an MLLM's ability to accurately transcribe 2D geometric information from an image. Using this benchmark, we demonstrate the limitations of leading MLLMs, and then conduct a comprehensive empirical study to explore strategies for improving their performance on geometric tasks. Our findings highlight the benefits of certain model architectures, training techniques, and data strategies, including the use of high-fidelity synthetic data and multi-stage training with a data curriculum. Notably, we find that a data curriculum enables models to learn challenging geometry understanding tasks which they fail to learn from scratch. Leveraging these insights, we develop Euclid, a family of models specifically optimized for strong low-level geometric perception. Although trained purely on synthetic multimodal data, Euclid shows strong generalization to novel geometric shapes. For instance, Euclid outperforms the best closed-source model, Gemini-1.5-Pro, by up to 58.56% on certain Geoperception benchmark tasks and 10.65% on average across all tasks.
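
To make the two data ideas in the abstract concrete, below is a minimal, hypothetical sketch of (1) synthetic geometry examples whose textual labels are exact by construction, since the same parameters drive both the rendered image and the description, and (2) a data curriculum that orders training stages from simpler to harder shape families. This is an illustration of the general approach, not the paper's actual data pipeline; all function names, stage names, and sample counts are made up.

```python
# Minimal sketch (assumptions, not the paper's pipeline) of high-fidelity
# synthetic geometry data plus a simple data curriculum.
import os
import random
import matplotlib
matplotlib.use("Agg")  # render off-screen; no display needed
import matplotlib.pyplot as plt

def make_segment_example(idx: int, out_dir: str = "synthetic_geo") -> dict:
    """Draw one random line segment and return an exact image/text pair."""
    os.makedirs(out_dir, exist_ok=True)
    # Endpoints on a small integer grid keep the ground truth unambiguous.
    (x1, y1), (x2, y2) = [(random.randint(0, 10), random.randint(0, 10)) for _ in range(2)]
    fig, ax = plt.subplots(figsize=(3, 3))
    ax.plot([x1, x2], [y1, y2], color="black", linewidth=2)
    ax.scatter([x1, x2], [y1, y2], color="red", zorder=3)
    ax.set_xlim(-1, 11)
    ax.set_ylim(-1, 11)
    ax.set_aspect("equal")
    path = os.path.join(out_dir, f"segment_{idx}.png")
    fig.savefig(path)
    plt.close(fig)
    # The description is derived from the same parameters used to draw the
    # image, so text and pixels can never disagree ("high fidelity").
    text = f"A line segment from A({x1}, {y1}) to B({x2}, {y2})."
    return {"image": path, "text": text}

# Hypothetical curriculum: train on easier shape families first, then move to
# harder relational tasks, rather than mixing all tasks from the start.
CURRICULUM = [
    ("stage_1_points_and_lines", 10_000),
    ("stage_2_angles_and_triangles", 10_000),
    ("stage_3_circles_and_composites", 10_000),
]

if __name__ == "__main__":
    sample = make_segment_example(0)
    print(sample["text"], "->", sample["image"])
```

The point of generating images programmatically is that every annotation (coordinates, lengths, incidence relations) is known exactly, which is what the abstract refers to as high-fidelity supervision; the staged ordering mirrors the finding that a curriculum helps models acquire geometric skills they fail to learn from scratch.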
