GeoX: Geometric Problem Solving Through Unified Formalized Vision-Language Pre-training
December 16, 2024
Authors: Renqiu Xia, Mingsheng Li, Hancheng Ye, Wenjie Wu, Hongbin Zhou, Jiakang Yuan, Tianshuo Peng, Xinyu Cai, Xiangchao Yan, Bin Wang, Conghui He, Botian Shi, Tao Chen, Junchi Yan, Bo Zhang
cs.AI
Abstract
Despite their proficiency in general tasks, Multi-modal Large Language Models
(MLLMs) struggle with automatic Geometry Problem Solving (GPS), which demands
understanding diagrams, interpreting symbols, and performing complex reasoning.
This limitation arises from their pre-training on natural images and texts,
along with the lack of automated verification in the problem-solving process.
Besides, current geometric specialists are limited by their task-specific
designs, making them less effective for broader geometric problems. To this
end, we present GeoX, a multi-modal large model focusing on geometric
understanding and reasoning tasks. Given the significant differences between
geometric diagram-symbol pairs and natural image-text pairs, we introduce unimodal
pre-training to develop a diagram encoder and symbol decoder, enhancing the
understanding of geometric images and corpora. Furthermore, we introduce
geometry-language alignment, an effective pre-training paradigm that bridges
the modality gap between unimodal geometric experts. We propose a
Generator-And-Sampler Transformer (GS-Former) to generate discriminative
queries and eliminate uninformative representations from unevenly distributed
geometric signals. Finally, GeoX benefits from visual instruction tuning,
empowering it to take geometric images and questions as input and generate
verifiable solutions. Experiments show that GeoX outperforms both generalists
and geometric specialists on publicly recognized benchmarks, such as GeoQA,
UniGeo, Geometry3K, and PGPS9k.