VideoWorld: Exploring Knowledge Learning from Unlabeled Videos

January 16, 2025
作者: Zhongwei Ren, Yunchao Wei, Xun Guo, Yao Zhao, Bingyi Kang, Jiashi Feng, Xiaojie Jin
cs.AI

Abstract

This work explores whether a deep generative model can learn complex knowledge solely from visual input, in contrast to the prevalent focus on text-based models like large language models (LLMs). We develop VideoWorld, an auto-regressive video generation model trained on unlabeled video data, and test its knowledge acquisition abilities in video-based Go and robotic control tasks. Our experiments reveal two key findings: (1) video-only training provides sufficient information for learning knowledge, including rules, reasoning, and planning capabilities, and (2) the representation of visual change is crucial for knowledge acquisition. To improve both the efficiency and efficacy of this process, we introduce the Latent Dynamics Model (LDM) as a key component of VideoWorld. Remarkably, VideoWorld reaches a 5-dan professional level on the Video-GoBench with just a 300-million-parameter model, without relying on the search algorithms or reward mechanisms typical in reinforcement learning. In robotic tasks, VideoWorld effectively learns diverse control operations and generalizes across environments, approaching the performance of oracle models in CALVIN and RLBench. This study opens new avenues for knowledge acquisition from visual data, with all code, data, and models open-sourced for further research.
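The abstract's second finding — that representing visual *change* compactly is key, motivating the Latent Dynamics Model — can be illustrated with a toy sketch. This is not the paper's implementation: the function names, the codebook, and the interleaved-sequence format are all hypothetical, meant only to show the general idea of quantizing frame-to-frame deltas into discrete codes that an autoregressive model could predict alongside frame tokens.

```python
import numpy as np

def encode_dynamics(prev_frame, next_frame, codebook):
    # Toy stand-in for a latent dynamics code: map the pixel delta
    # between consecutive frames to its nearest codebook entry,
    # so "what changed" is represented by one discrete symbol.
    delta = (next_frame - prev_frame).flatten()
    dists = np.linalg.norm(codebook - delta, axis=1)
    return int(np.argmin(dists))

def build_training_sequence(frames, codebook):
    # Hypothetical training target: interleave frame indices with
    # dynamics codes, so an autoregressive model must predict the
    # compressed change before generating the next frame.
    seq = []
    for t in range(len(frames) - 1):
        seq.append(("frame", t))
        seq.append(("dyn", encode_dynamics(frames[t], frames[t + 1], codebook)))
    seq.append(("frame", len(frames) - 1))
    return seq

rng = np.random.default_rng(0)
frames = [rng.standard_normal((2, 2)) for _ in range(4)]  # 4 tiny "frames"
codebook = rng.standard_normal((8, 4))  # 8 discrete change codes
seq = build_training_sequence(frames, codebook)
```

The design point this sketches: a short discrete code per transition carries the task-relevant change (a stone placed, a gripper moved) far more compactly than regenerating full frames, which is the efficiency argument the abstract makes for the LDM.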


PDF (292) · January 21, 2025