InternLM-XComposer2.5-OmniLive: A Comprehensive Multimodal System for Long-term Streaming Video and Audio Interactions
December 12, 2024
作者: Pan Zhang, Xiaoyi Dong, Yuhang Cao, Yuhang Zang, Rui Qian, Xilin Wei, Lin Chen, Yifei Li, Junbo Niu, Shuangrui Ding, Qipeng Guo, Haodong Duan, Xin Chen, Han Lv, Zheng Nie, Min Zhang, Bin Wang, Wenwei Zhang, Xinyue Zhang, Jiaye Ge, Wei Li, Jingwen Li, Zhongying Tu, Conghui He, Xingcheng Zhang, Kai Chen, Yu Qiao, Dahua Lin, Jiaqi Wang
cs.AI
Abstract
Creating AI systems that can interact with environments over long periods,
similar to human cognition, has been a longstanding research goal. Recent
advancements in multimodal large language models (MLLMs) have made significant
strides in open-world understanding. However, the challenge of continuous and
simultaneous streaming perception, memory, and reasoning remains largely
unexplored. Current MLLMs are constrained by their sequence-to-sequence
architecture, which limits their ability to process inputs and generate
responses simultaneously, akin to being unable to think while perceiving.
Furthermore, relying on long contexts to store historical data is impractical
for long-term interactions, as retaining all information becomes costly and
inefficient. Therefore, rather than relying on a single foundation model to
perform all functions, this project draws inspiration from the concept of
Specialized Generalist AI and introduces disentangled streaming perception,
reasoning, and memory mechanisms, enabling real-time interaction with streaming
video and audio input. The proposed framework InternLM-XComposer2.5-OmniLive
(IXC2.5-OL) consists of three key modules: (1) Streaming Perception Module:
Processes multimodal information in real-time, storing key details in memory
and triggering reasoning in response to user queries. (2) Multi-modal Long
Memory Module: Integrates short-term and long-term memory, compressing
short-term memories into long-term ones for efficient retrieval and improved
accuracy. (3) Reasoning Module: Responds to queries and executes reasoning
tasks, coordinating with the perception and memory modules. This project
simulates human-like cognition, enabling multimodal large language models to
provide continuous and adaptive service over time.
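
To make the disentangled design concrete, below is a minimal, hypothetical Python sketch of the three-module coordination the abstract describes: a perception worker streams input and writes key details to memory, the memory module compresses short-term entries into long-term ones, and a reasoning worker answers queries concurrently rather than in a single sequence-to-sequence loop. All class and function names here are illustrative assumptions, not the IXC2.5-OL API; a real system would use model-based summarization and semantic retrieval instead of the toy string operations shown.

```python
# Illustrative sketch only: perception, memory, and reasoning run as
# decoupled concurrent components, so the system can keep perceiving
# while it reasons. Names and data structures are hypothetical.
import queue
import threading
import time


class LongMemory:
    """Toy memory: buffers short-term items, compressing them into
    long-term entries once the buffer fills (stand-in for learned
    memory compression)."""

    def __init__(self, short_term_limit=8):
        self.short_term = []
        self.long_term = []
        self.short_term_limit = short_term_limit
        self.lock = threading.Lock()

    def store(self, detail):
        with self.lock:
            self.short_term.append(detail)
            if len(self.short_term) >= self.short_term_limit:
                # Compress short-term memories into one long-term entry.
                self.long_term.append("summary(" + ", ".join(self.short_term) + ")")
                self.short_term.clear()

    def retrieve(self, key):
        # Substring match stands in for semantic retrieval.
        with self.lock:
            return [m for m in self.long_term + self.short_term if key in m]


def perception_worker(frames, memory, query_queue, stop):
    # Streams input continuously, storing key details and triggering
    # reasoning on user queries without blocking.
    for frame in frames:
        if stop.is_set():
            break
        memory.store(f"frame:{frame}")
        if "user_query" in frame:
            query_queue.put(frame)
        time.sleep(0.01)  # simulate real-time streaming


def reasoning_worker(query_queue, memory, stop):
    # Answers queries using retrieved memories while perception runs.
    while not stop.is_set():
        try:
            q = query_queue.get(timeout=0.1)
        except queue.Empty:
            continue
        print(f"query {q!r} -> evidence:", memory.retrieve("frame"))


if __name__ == "__main__":
    stop = threading.Event()
    memory = LongMemory()
    queries = queue.Queue()
    frames = [str(i) for i in range(20)] + ["user_query"]
    t1 = threading.Thread(target=perception_worker, args=(frames, memory, queries, stop))
    t2 = threading.Thread(target=reasoning_worker, args=(queries, memory, stop))
    t1.start()
    t2.start()
    t1.join()
    time.sleep(0.5)
    stop.set()
    t2.join()
```

The key design point the sketch mirrors is that memory, not a long context window, carries history: perception keeps appending while compression bounds storage, so reasoning retrieves from memory instead of re-reading the full stream.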