GeoX: Geometric Problem Solving Through Unified Formalized Vision-Language Pre-training

Abstract

Despite their proficiency in general tasks, Multi-modal Large Language Models (MLLMs) struggle with automatic Geometry Problem Solving (GPS), which demands understanding diagrams, interpreting symbols, and performing complex reasoning. This limitation arises from their pre-training on natural images and texts, along with the lack of automated verification in the problem-solving process. Moreover, current geometric specialists are limited by their task-specific designs, making them less effective for broader geometric problems. To this end, we present GeoX, a multi-modal large model focusing on geometric understanding and reasoning tasks. Given the significant differences between geometric diagram-symbol pairs and natural image-text pairs, we introduce unimodal pre-training to develop a diagram encoder and a symbol decoder, enhancing the understanding of geometric images and corpora. Furthermore, we introduce geometry-language alignment, an effective pre-training paradigm that bridges the modality gap between the unimodal geometric experts. We propose a Generator-And-Sampler Transformer (GS-Former) to generate discriminative queries and eliminate uninformative representations from unevenly distributed geometric signals. Finally, GeoX benefits from visual instruction tuning, which enables it to take geometric images and questions as input and generate verifiable solutions. Experiments show that GeoX outperforms both generalists and geometric specialists on publicly recognized benchmarks, such as GeoQA, UniGeo, Geometry3K, and PGPS9k.
