

TinyLLaVA-Video-R1: Towards Smaller LMMs for Video Reasoning

April 13, 2025
Authors: Xingjian Zhang, Siwei Wen, Wenjun Wu, Lei Huang
cs.AI

Abstract

Recently, reinforcement learning has driven great progress in improving the reasoning ability of large multimodal models (LMMs). However, most existing work builds on highly reasoning-intensive datasets such as mathematics and code, and researchers generally choose large-scale models as the foundation. We argue that exploring the reasoning capabilities of small-scale models remains valuable for researchers with limited computational resources. Moreover, enabling models to explain their reasoning processes on general question-answering datasets is equally meaningful. We therefore present TinyLLaVA-Video-R1, a small-scale video reasoning model. Built on TinyLLaVA-Video, a traceably trained video understanding model with no more than 4B parameters, it not only demonstrates significantly improved reasoning and thinking capabilities after reinforcement learning on general Video-QA datasets, but also exhibits the emergent characteristic of "aha moments". Furthermore, we share a series of experimental findings, aiming to provide practical insights for future exploration of video reasoning (thinking) abilities in small-scale models. The model is available at https://github.com/ZhangXJ199/TinyLLaVA-Video-R1.
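To make the training recipe concrete, below is a minimal sketch of the rule-based rewards commonly used in R1-style reinforcement learning (e.g., GRPO): a format reward that checks the completion wraps its reasoning and answer in tags, plus an accuracy reward that compares the extracted answer against the reference. The <think>/<answer> tag format, the multiple-choice answer matching, and the function names here are assumptions for illustration; the abstract does not specify TinyLLaVA-Video-R1's reward design, so consult the repository for the authors' actual implementation.

```python
import re

# Hypothetical R1-style rule-based rewards; the exact scheme used by
# TinyLLaVA-Video-R1 may differ from this sketch.

THINK_ANSWER = re.compile(
    r"^<think>.*?</think>\s*<answer>.*?</answer>$", re.DOTALL
)

def format_reward(completion: str) -> float:
    """1.0 if the completion follows <think>...</think><answer>...</answer>,
    which encourages the model to externalize its reasoning; else 0.0."""
    return 1.0 if THINK_ANSWER.match(completion.strip()) else 0.0

def accuracy_reward(completion: str, ground_truth: str) -> float:
    """1.0 if the answer inside <answer> tags matches the reference
    (e.g., a multiple-choice letter in a general Video-QA set); else 0.0."""
    m = re.search(r"<answer>(.*?)</answer>", completion, re.DOTALL)
    if m is None:
        return 0.0
    return 1.0 if m.group(1).strip().lower() == ground_truth.strip().lower() else 0.0

def total_reward(completion: str, ground_truth: str) -> float:
    """Sum of both signals; a GRPO-style trainer would normalize these
    rewards across a group of sampled completions for the same question."""
    return format_reward(completion) + accuracy_reward(completion, ground_truth)

# Usage example
sample = "<think>The person picks up a cup, so the action is B.</think>\n<answer>B</answer>"
print(total_reward(sample, "B"))  # 2.0
```

A simple rule-based reward like this avoids training a separate reward model, which is part of what makes R1-style RL tractable for small models on limited compute.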
