GeoX: Geometric Problem Solving Through Unified Formalized Vision-Language Pre-training

December 16, 2024
Authors: Renqiu Xia, Mingsheng Li, Hancheng Ye, Wenjie Wu, Hongbin Zhou, Jiakang Yuan, Tianshuo Peng, Xinyu Cai, Xiangchao Yan, Bin Wang, Conghui He, Botian Shi, Tao Chen, Junchi Yan, Bo Zhang
cs.AI

Abstract

Despite their proficiency in general tasks, Multi-modal Large Language Models (MLLMs) struggle with automatic Geometry Problem Solving (GPS), which demands understanding diagrams, interpreting symbols, and performing complex reasoning. This limitation arises from their pre-training on natural images and texts, along with the lack of automated verification in the problem-solving process. Besides, current geometric specialists are limited by their task-specific designs, making them less effective for broader geometric problems. To this end, we present GeoX, a multi-modal large model focusing on geometric understanding and reasoning tasks. Given the significant differences between geometric diagram-symbol and natural image-text, we introduce unimodal pre-training to develop a diagram encoder and symbol decoder, enhancing the understanding of geometric images and corpora. Furthermore, we introduce geometry-language alignment, an effective pre-training paradigm that bridges the modality gap between unimodal geometric experts. We propose a Generator-And-Sampler Transformer (GS-Former) to generate discriminative queries and eliminate uninformative representations from unevenly distributed geometric signals. Finally, GeoX benefits from visual instruction tuning, empowering it to take geometric images and questions as input and generate verifiable solutions. Experiments show that GeoX outperforms both generalists and geometric specialists on publicly recognized benchmarks, such as GeoQA, UniGeo, Geometry3K, and PGPS9k.
