Why Reasoning Matters? A Survey of Advancements in Multimodal Reasoning (v1)
April 4, 2025
作者: Jing Bi, Susan Liang, Xiaofei Zhou, Pinxin Liu, Junjia Guo, Yunlong Tang, Luchuan Song, Chao Huang, Guangyu Sun, Jinxi He, Jiarui Wu, Shu Yang, Daoan Zhang, Chen Chen, Lianggong Bruce Wen, Zhang Liu, Jiebo Luo, Chenliang Xu
cs.AI
Abstract
Reasoning is central to human intelligence, enabling structured
problem-solving across diverse tasks. Recent advances in large language models
(LLMs) have greatly enhanced their reasoning abilities in arithmetic,
commonsense, and symbolic domains. However, effectively extending these
capabilities into multimodal contexts, where models must integrate both visual
and textual inputs, continues to be a significant challenge. Multimodal
reasoning introduces complexities, such as handling conflicting information
across modalities, which require models to adopt advanced interpretative
strategies. Addressing these challenges involves not only sophisticated
algorithms but also robust methodologies for evaluating reasoning accuracy and
coherence. This paper offers a concise yet insightful overview of reasoning
techniques in both textual and multimodal LLMs. Through a thorough and
up-to-date comparison, we clearly formulate core reasoning challenges and
opportunities, highlighting practical methods for post-training optimization
and test-time inference. Our work provides valuable insights and guidance,
bridging theoretical frameworks and practical implementations, and sets clear
directions for future research.