InternLM-XComposer2.5-OmniLive: A Comprehensive Multimodal System for Long-term Streaming Video and Audio Interactions

December 12, 2024
Authors: Pan Zhang, Xiaoyi Dong, Yuhang Cao, Yuhang Zang, Rui Qian, Xilin Wei, Lin Chen, Yifei Li, Junbo Niu, Shuangrui Ding, Qipeng Guo, Haodong Duan, Xin Chen, Han Lv, Zheng Nie, Min Zhang, Bin Wang, Wenwei Zhang, Xinyue Zhang, Jiaye Ge, Wei Li, Jingwen Li, Zhongying Tu, Conghui He, Xingcheng Zhang, Kai Chen, Yu Qiao, Dahua Lin, Jiaqi Wang
cs.AI

Abstract

Creating AI systems that can interact with environments over long periods, similar to human cognition, has been a longstanding research goal. Recent advancements in multimodal large language models (MLLMs) have made significant strides in open-world understanding. However, the challenge of continuous and simultaneous streaming perception, memory, and reasoning remains largely unexplored. Current MLLMs are constrained by their sequence-to-sequence architecture, which limits their ability to process inputs and generate responses simultaneously, akin to being unable to think while perceiving. Furthermore, relying on long contexts to store historical data is impractical for long-term interactions, as retaining all information becomes costly and inefficient. Therefore, rather than relying on a single foundation model to perform all functions, this project draws inspiration from the concept of the Specialized Generalist AI and introduces disentangled streaming perception, reasoning, and memory mechanisms, enabling real-time interaction with streaming video and audio input. The proposed framework InternLM-XComposer2.5-OmniLive (IXC2.5-OL) consists of three key modules: (1) Streaming Perception Module: Processes multimodal information in real-time, storing key details in memory and triggering reasoning in response to user queries. (2) Multi-modal Long Memory Module: Integrates short-term and long-term memory, compressing short-term memories into long-term ones for efficient retrieval and improved accuracy. (3) Reasoning Module: Responds to queries and executes reasoning tasks, coordinating with the perception and memory modules. This project simulates human-like cognition, enabling multimodal large language models to provide continuous and adaptive service over time.
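To make the disentangled design concrete, below is a minimal Python sketch of the three-module loop the abstract describes: a perception module that runs on every incoming frame and writes key details to memory, a memory bank that compresses a bounded short-term buffer into long-term entries, and a reasoning module invoked only when a user query arrives. All class and method names, the buffer size, and the string-based "features" and keyword retrieval are illustrative assumptions, not the IXC2.5-OL API.

```python
# Hypothetical sketch of the disentangled perception/memory/reasoning design
# described in the abstract; names and mechanisms are illustrative stand-ins,
# not the actual IXC2.5-OL implementation.
from collections import deque
from dataclasses import dataclass, field

@dataclass
class MemoryBank:
    """Multi-modal long memory: a short-term buffer compressed into a long-term store."""
    short_term: deque = field(default_factory=lambda: deque(maxlen=64))
    long_term: list = field(default_factory=list)

    def store(self, clip_feature: str) -> None:
        self.short_term.append(clip_feature)
        if len(self.short_term) == self.short_term.maxlen:
            self.compress()

    def compress(self) -> None:
        # Stand-in for learned compression of many short-term clips into one
        # compact long-term entry that is cheap to retrieve later.
        self.long_term.append(" | ".join(self.short_term))
        self.short_term.clear()

    def retrieve(self, query: str) -> list:
        # Naive keyword match as a placeholder for the real retriever.
        return [m for m in self.long_term if query.lower() in m.lower()]

class PerceptionModule:
    """Streaming perception: runs continuously, pushing key details into memory."""
    def __init__(self, memory: MemoryBank):
        self.memory = memory

    def on_frame(self, frame_description: str) -> None:
        self.memory.store(frame_description)

class ReasoningModule:
    """Reasoning: woken only by a user query; reads evidence from memory."""
    def __init__(self, memory: MemoryBank):
        self.memory = memory

    def answer(self, query: str) -> str:
        evidence = self.memory.retrieve(query) or list(self.memory.short_term)
        return f"Answer to {query!r} grounded in {len(evidence)} memory entries."

if __name__ == "__main__":
    memory = MemoryBank()
    perception = PerceptionModule(memory)
    reasoning = ReasoningModule(memory)
    for t in range(200):  # simulated video stream
        perception.on_frame(f"frame {t}: person enters kitchen")
    print(reasoning.answer("kitchen"))
```

The design point the sketch captures is that perception and reasoning never block each other, in contrast to a single sequence-to-sequence model: perception keeps writing to memory as the stream arrives, while reasoning is triggered on demand and consults the compressed memory rather than an ever-growing context.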
