InternLM-XComposer2.5-OmniLive: A Comprehensive Multimodal System for Long-term Streaming Video and Audio Interactions
December 12, 2024
Authors: Pan Zhang, Xiaoyi Dong, Yuhang Cao, Yuhang Zang, Rui Qian, Xilin Wei, Lin Chen, Yifei Li, Junbo Niu, Shuangrui Ding, Qipeng Guo, Haodong Duan, Xin Chen, Han Lv, Zheng Nie, Min Zhang, Bin Wang, Wenwei Zhang, Xinyue Zhang, Jiaye Ge, Wei Li, Jingwen Li, Zhongying Tu, Conghui He, Xingcheng Zhang, Kai Chen, Yu Qiao, Dahua Lin, Jiaqi Wang
cs.AI
Abstract
Creating AI systems that can interact with environments over long periods,
similar to human cognition, has been a longstanding research goal. Recent
advancements in multimodal large language models (MLLMs) have made significant
strides in open-world understanding. However, the challenge of continuous and
simultaneous streaming perception, memory, and reasoning remains largely
unexplored. Current MLLMs are constrained by their sequence-to-sequence
architecture, which limits their ability to process inputs and generate
responses simultaneously, akin to being unable to think while perceiving.
Furthermore, relying on long contexts to store historical data is impractical
for long-term interactions, as retaining all information becomes costly and
inefficient. Therefore, rather than relying on a single foundation model to
perform all functions, this project draws inspiration from the concept of the
Specialized Generalist AI and introduces disentangled streaming perception,
reasoning, and memory mechanisms, enabling real-time interaction with streaming
video and audio input. The proposed framework InternLM-XComposer2.5-OmniLive
(IXC2.5-OL) consists of three key modules: (1) Streaming Perception Module:
Processes multimodal information in real time, storing key details in memory
and triggering reasoning in response to user queries. (2) Multi-modal Long
Memory Module: Integrates short-term and long-term memory, compressing
short-term memories into long-term ones for efficient retrieval and improved
accuracy. (3) Reasoning Module: Responds to queries and executes reasoning
tasks, coordinating with the perception and memory modules. This project
simulates human-like cognition, enabling multimodal large language models to
provide continuous and adaptive service over time.
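
To make the three-module split concrete, the following is a minimal toy sketch of how perception, memory, and reasoning could be decoupled. It is not the IXC2.5-OL implementation: the class names (`Perception`, `Memory`, `Reasoner`), the `short_capacity` parameter, the averaging used as a stand-in for memory compression, and the dot-product retrieval are all assumptions chosen for illustration, and plain tuples stand in for learned video and audio features.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple

Vector = Tuple[float, ...]


def mean(vectors: List[Vector]) -> Vector:
    """Element-wise average of equally sized vectors."""
    n = len(vectors)
    return tuple(sum(column) / n for column in zip(*vectors))


def dot(a: Vector, b: Vector) -> float:
    return sum(x * y for x, y in zip(a, b))


@dataclass
class Memory:
    """Toy memory: a short-term buffer that is compressed into long-term entries."""
    short_capacity: int = 2  # hypothetical buffer size, not from the paper
    short_term: List[Vector] = field(default_factory=list)
    long_term: List[Vector] = field(default_factory=list)

    def store(self, feature: Vector) -> None:
        self.short_term.append(feature)
        if len(self.short_term) >= self.short_capacity:
            # Assumed compression: average the buffer into one long-term entry.
            self.long_term.append(mean(self.short_term))
            self.short_term.clear()

    def retrieve(self, query: Vector) -> Optional[Vector]:
        # Assumed retrieval: the stored entry with the highest dot product.
        candidates = self.long_term + self.short_term
        if not candidates:
            return None
        return max(candidates, key=lambda entry: dot(entry, query))


class Reasoner:
    """Placeholder for the reasoning model: pairs a query with its evidence."""

    def answer(self, query: Vector, evidence: Optional[Vector]):
        return (query, evidence)


class Perception:
    """Consumes the stream, writes to memory, and triggers reasoning on queries."""

    def __init__(self, memory: Memory, reasoner: Reasoner) -> None:
        self.memory = memory
        self.reasoner = reasoner

    def on_frame(self, feature: Vector) -> None:
        self.memory.store(feature)

    def on_query(self, query: Vector):
        return self.reasoner.answer(query, self.memory.retrieve(query))


memory = Memory(short_capacity=2)
perception = Perception(memory, Reasoner())
for frame in [(1.0, 0.0), (1.0, 0.0), (0.0, 1.0), (0.0, 1.0)]:
    perception.on_frame(frame)
answer = perception.on_query((0.0, 1.0))
print(answer)
```

The point of the sketch is the control flow: perception never blocks on reasoning, memory stays bounded because short-term entries are folded into long-term ones, and reasoning only sees what retrieval returns rather than the full history.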