Slow Perception: Let's Perceive Geometric Figures Step-by-step
December 30, 2024
Authors: Haoran Wei, Youyang Yin, Yumeng Li, Jia Wang, Liang Zhao, Jianjian Sun, Zheng Ge, Xiangyu Zhang
cs.AI
Abstract
Recently, "visual o1" has begun to attract attention, with the expectation that
this slow-thinking design can solve visual reasoning tasks, especially
geometric math problems. In reality, however, current LVLMs (Large Vision
Language Models) can hardly even copy a geometric figure accurately, let alone
truly understand the complex inherent logic and spatial relationships within
geometric shapes. We believe accurate copying (strong perception) is the first
step toward visual o1. Accordingly, we introduce the concept of "slow
perception" (SP), which guides the model to gradually perceive basic point-line
combinations, just as humans reconstruct complex geometric structures
progressively. SP has two stages: a) perception decomposition. Perception is
not instantaneous. In this stage, complex geometric figures are broken down
into basic simple units to unify the geometric representation. b) perception
flow, which acknowledges that accurately tracing a line is not an easy task.
This stage aims to avoid "long visual jumps" when regressing line segments by
using a proposed "perceptual ruler" to trace each line stroke-by-stroke.
Surprisingly, this human-like manner of perception enjoys an inference-time
scaling law: the slower, the better. In the past, researchers strove to speed
up the model's perception, but we slow it down again, allowing the model to
read the image step-by-step and carefully.
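The perception-flow idea can be illustrated with a minimal sketch. It is not the authors' implementation: the function name `trace_segment`, the parameter `ruler_length`, and the assumption that the "perceptual ruler" is a fixed maximum stroke length in image coordinates are all hypothetical and chosen only for illustration. The sketch replaces a single jump from one endpoint to the other with a sequence of short strokes.

```python
import math


def trace_segment(start, end, ruler_length):
    """Trace a line segment as a list of waypoints.

    Hypothetical illustration: each stroke covers at most `ruler_length`
    units, so no single step is a long jump from `start` to `end`.
    """
    if ruler_length <= 0:
        raise ValueError("ruler_length must be positive")

    (x0, y0), (x1, y1) = start, end
    total = math.hypot(x1 - x0, y1 - y0)
    if total == 0:
        return [start]

    # Number of strokes needed; the last one may be shorter than the ruler.
    n_strokes = math.ceil(total / ruler_length)

    points = [start]
    for k in range(1, n_strokes):
        t = k * ruler_length / total  # fraction of the segment covered so far
        points.append((x0 + t * (x1 - x0), y0 + t * (y1 - y0)))
    points.append(end)  # finish exactly on the true endpoint
    return points


# A figure decomposed into basic units: here, a list of line segments.
triangle = [((0.0, 0.0), (10.0, 0.0)),
            ((10.0, 0.0), (5.0, 8.0)),
            ((5.0, 8.0), (0.0, 0.0))]

for seg_start, seg_end in triangle:
    print(trace_segment(seg_start, seg_end, ruler_length=4.0))
```

In this sketch a shorter ruler produces more intermediate waypoints per segment, which is one way to picture "slower" perception: more steps are spent on the same line.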