TinyLLaVA-Video-R1: Towards Smaller LMMs for Video Reasoning

Abstract

Reinforcement learning has recently driven substantial progress in the reasoning ability of large multimodal models (LMMs). However, most existing work relies on reasoning-intensive datasets such as mathematics and code, and researchers generally choose large-scale models as the foundation. We argue that exploring the reasoning capabilities of small-scale models remains valuable for researchers with limited computational resources. Moreover, enabling models to explain their reasoning on general question-answering datasets is equally meaningful. We therefore present TinyLLaVA-Video-R1, a small-scale video reasoning model. It is based on TinyLLaVA-Video, a traceably trained video understanding model with no more than 4B parameters. After reinforcement learning on general Video-QA datasets, it not only demonstrates significantly improved reasoning and thinking capabilities but also exhibits the emergent characteristic of "aha moments". Furthermore, we share a series of experimental findings that aim to provide practical insights for future exploration of video reasoning (thinking) abilities in small-scale models. TinyLLaVA-Video-R1 is available at https://github.com/ZhangXJ199/TinyLLaVA-Video-R1.
