LlamaV-o1: Rethinking Step-by-step Visual Reasoning in LLMs

Abstract

Reasoning is a fundamental capability for solving complex multi-step problems, particularly in visual contexts where sequential, step-wise understanding is essential. Existing approaches lack a comprehensive framework for evaluating visual reasoning and do not emphasize step-wise problem-solving. To this end, we propose a comprehensive framework for advancing step-by-step visual reasoning in large multimodal models (LMMs) through three key contributions. First, we introduce a visual reasoning benchmark specifically designed to evaluate multi-step reasoning tasks. The benchmark spans eight categories, ranging from complex visual perception to scientific reasoning, and contains over 4k reasoning steps in total, enabling robust evaluation of LMMs' ability to perform accurate and interpretable visual reasoning across multiple steps. Second, we propose a novel metric that assesses visual reasoning quality at the granularity of individual steps, emphasizing both correctness and logical coherence. The proposed metric offers deeper insight into reasoning performance than traditional end-task accuracy metrics. Third, we present a new multimodal visual reasoning model, named LlamaV-o1, trained using a multi-step curriculum learning approach in which tasks are progressively organized to facilitate incremental skill acquisition and problem-solving. LlamaV-o1 is designed for multi-step reasoning and learns step by step through a structured training paradigm. Extensive experiments show that LlamaV-o1 outperforms existing open-source models and performs favorably against closed-source proprietary models. Compared to the recent Llava-CoT, LlamaV-o1 achieves an average score of 67.3 across six benchmarks, an absolute gain of 3.8%, while being 5 times faster during inference scaling. Our benchmark, model, and code are publicly available.
