The Jumping Reasoning Curve? Tracking the Evolution of Reasoning Performance in GPT-[n] and o-[n] Models on Multimodal Puzzles
February 3, 2025
Authors: Vernon Y. H. Toh, Yew Ken Chia, Deepanway Ghosal, Soujanya Poria
cs.AI
Abstract
The releases of OpenAI's o1 and o3 mark a significant paradigm shift in Large
Language Models towards advanced reasoning capabilities. Notably, o3
outperformed humans in novel problem-solving and skill acquisition on the
Abstraction and Reasoning Corpus for Artificial General Intelligence (ARC-AGI).
However, this benchmark is limited to symbolic patterns, whereas humans often
perceive and reason about multimodal scenarios involving both vision and
language data. Thus, there is an urgent need to investigate advanced reasoning
capabilities in multimodal tasks. To this end, we track the evolution of the
GPT-[n] and o-[n] series models on challenging multimodal puzzles, requiring
fine-grained visual perception with abstract or algorithmic reasoning. The
superior performance of o1 comes at nearly 750 times the computational cost of
GPT-4o, raising concerns about its efficiency. Our results reveal a clear
upward trend in reasoning capabilities across model iterations, with notable
performance jumps across GPT-series models and subsequently to o1. Nonetheless,
we observe that the o1 model still struggles with simple multimodal puzzles
requiring abstract reasoning. Furthermore, its performance in algorithmic
puzzles remains poor. We plan to continuously track new models in the series
and update our results in this paper accordingly. All resources used in this
evaluation are openly available at https://github.com/declare-lab/LLM-PuzzleTest.
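The abstract describes tracking puzzle-solving accuracy across successive model iterations. A minimal sketch of the per-model accuracy bookkeeping such a study implies is shown below; the puzzle answers, model names, and outputs here are invented for illustration and are not taken from the paper's actual benchmark.

```python
def accuracy(predictions, answers):
    """Fraction of puzzles answered correctly (exact-match scoring)."""
    assert len(predictions) == len(answers)
    correct = sum(p == a for p, a in zip(predictions, answers))
    return correct / len(answers)

# Invented example: gold answers for four puzzles and outputs from
# three hypothetical model iterations.
gold = ["B", "C", "A", "D"]
runs = {
    "gpt-4-turbo": ["B", "A", "A", "C"],
    "gpt-4o":      ["B", "C", "A", "C"],
    "o1":          ["B", "C", "A", "D"],
}

for model, preds in runs.items():
    print(f"{model}: {accuracy(preds, gold):.0%}")
```

In the invented numbers above, each later iteration scores higher than the last, mirroring the upward trend the abstract reports; a real evaluation would also record per-query compute cost, since the paper notes o1's gains come at roughly 750 times the cost of GPT-4o.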