Why Reasoning Matters? A Survey of Advancements in Multimodal Reasoning (v1)
April 4, 2025
作者: Jing Bi, Susan Liang, Xiaofei Zhou, Pinxin Liu, Junjia Guo, Yunlong Tang, Luchuan Song, Chao Huang, Guangyu Sun, Jinxi He, Jiarui Wu, Shu Yang, Daoan Zhang, Chen Chen, Lianggong Bruce Wen, Zhang Liu, Jiebo Luo, Chenliang Xu
cs.AI
Abstract
Reasoning is central to human intelligence, enabling structured
problem-solving across diverse tasks. Recent advances in large language models
(LLMs) have greatly enhanced their reasoning abilities in arithmetic,
commonsense, and symbolic domains. However, effectively extending these
capabilities into multimodal contexts, where models must integrate both visual
and textual inputs, continues to be a significant challenge. Multimodal
reasoning introduces complexities, such as handling conflicting information
across modalities, which require models to adopt advanced interpretative
strategies. Addressing these challenges involves not only sophisticated
algorithms but also robust methodologies for evaluating reasoning accuracy and
coherence. This paper offers a concise yet insightful overview of reasoning
techniques in both textual and multimodal LLMs. Through a thorough and
up-to-date comparison, we clearly formulate core reasoning challenges and
opportunities, highlighting practical methods for post-training optimization
and test-time inference. Our work provides valuable insights and guidance,
bridging theoretical frameworks and practical implementations, and sets clear
directions for future research.