Slow Perception: Let's Perceive Geometric Figures Step-by-step
December 30, 2024
Authors: Haoran Wei, Youyang Yin, Yumeng Li, Jia Wang, Liang Zhao, Jianjian Sun, Zheng Ge, Xiangyu Zhang
cs.AI
Abstract
Recently, "visual o1" has begun to attract attention, with expectations that
this slow-thinking design can solve visual reasoning tasks, especially
geometric math problems. In reality, however, current LVLMs (Large Vision
Language Models) can hardly even copy a geometric figure accurately, let alone
truly understand the complex inherent logic and spatial relationships within
geometric shapes. We believe accurate copying (strong perception) is the first
step toward visual o1. Accordingly, we introduce the concept of "slow
perception" (SP), which guides the model to gradually perceive basic point-line
combinations, just as we humans reconstruct complex geometric structures
progressively. SP has two stages: a) perception decomposition. Perception is
not instantaneous. In this stage, complex geometric figures are broken down
into basic simple units to unify geometry representation. b) perception flow,
which acknowledges that accurately tracing a line is not an easy task. This
stage aims to avoid "long visual jumps" when regressing line segments by using
a proposed "perceptual ruler" to trace each line stroke-by-stroke.
Surprisingly, such a human-like manner of perception enjoys an inference-time
scaling law: the slower, the better. In the past, researchers strove to speed
up the model's perception, but we slow it down again, allowing the model to
read the image carefully, step by step.
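
The perception-flow idea can be illustrated with a small sketch. This is not the authors' implementation: the function name `trace_segment`, the `ruler` parameter, and the choice of fixed-length hops with a shorter final hop are all assumptions made for illustration. It shows only how a single line segment might be traced as a chain of short strokes instead of one long jump from start point to end point.

```python
import math


def trace_segment(start, end, ruler):
    """Split the segment start -> end into hops of at most `ruler` length.

    Hypothetical helper: returns the start point, intermediate waypoints
    spaced `ruler` apart along the segment, and finally the end point.
    """
    if ruler <= 0:
        raise ValueError("ruler must be positive")

    x0, y0 = start
    x1, y1 = end
    length = math.hypot(x1 - x0, y1 - y0)
    points = [start]
    if length == 0:
        return points  # degenerate segment: nothing to trace

    # Unit direction vector along the segment.
    ux, uy = (x1 - x0) / length, (y1 - y0) / length

    # Take as many full ruler-length hops as fit on the segment.
    full_hops = int(length // ruler)
    for i in range(1, full_hops + 1):
        points.append((x0 + ux * ruler * i, y0 + uy * ruler * i))

    # Finish exactly on the end point, avoiding a near-duplicate waypoint.
    if math.dist(points[-1], end) > 1e-9:
        points.append(end)
    else:
        points[-1] = end
    return points


if __name__ == "__main__":
    # A 10-unit horizontal line traced with a ruler of length 4.
    print(trace_segment((0, 0), (10, 0), 4))
    # -> [(0, 0), (4.0, 0.0), (8.0, 0.0), (10, 0)]
```

Under these assumptions, a shorter ruler yields more waypoints per line and therefore a longer output sequence, which is one way to picture the "the slower, the better" trade-off described in the abstract.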
AI-Generated Summary