InternLM-XComposer2.5-OmniLive: A Comprehensive Multimodal System for Long-term Streaming Video and Audio Interactions

Abstract

Creating AI systems that can interact with environments over long periods, similar to human cognition, has been a longstanding research goal. Recent advancements in multimodal large language models (MLLMs) have made significant strides in open-world understanding. However, the challenge of continuous and simultaneous streaming perception, memory, and reasoning remains largely unexplored. Current MLLMs are constrained by their sequence-to-sequence architecture, which limits their ability to process inputs and generate responses simultaneously, akin to being unable to think while perceiving. Furthermore, relying on long contexts to store historical data is impractical for long-term interactions, as retaining all information becomes costly and inefficient. Therefore, rather than relying on a single foundation model to perform all functions, this project draws inspiration from the concept of the Specialized Generalist AI and introduces disentangled streaming perception, reasoning, and memory mechanisms, enabling real-time interaction with streaming video and audio input. The proposed framework, InternLM-XComposer2.5-OmniLive (IXC2.5-OL), consists of three key modules: (1) Streaming Perception Module: processes multimodal information in real time, storing key details in memory and triggering reasoning in response to user queries. (2) Multi-modal Long Memory Module: integrates short-term and long-term memory, compressing short-term memories into long-term ones for efficient retrieval and improved accuracy. (3) Reasoning Module: responds to queries and executes reasoning tasks, coordinating with the perception and memory modules. This project simulates human-like cognition, enabling multimodal large language models to provide continuous and adaptive service over time.
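As a rough illustration of the three-module split described in the abstract, the following toy sketch separates perception, memory, and reasoning into independent components. Everything in it is an assumption made for illustration: the class names (`Perception`, `Memory`, `Reasoner`), the use of plain float lists as features, mean pooling as the compression step, and cosine similarity as the retrieval rule. The abstract does not specify any of these, and this is not the IXC2.5-OL implementation.

```python
from dataclasses import dataclass, field
from math import sqrt


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = sqrt(sum(x * x for x in a)) * sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


@dataclass
class Memory:
    """Hypothetical memory module: a short-term buffer compressed into long-term entries."""
    chunk_size: int = 4
    short_term: list = field(default_factory=list)
    long_term: list = field(default_factory=list)

    def store(self, feature):
        self.short_term.append(feature)
        if len(self.short_term) == self.chunk_size:
            # Compress the full short-term chunk into one long-term entry.
            # Mean pooling is a stand-in; the real compression method is not given.
            dim = len(feature)
            pooled = [sum(f[i] for f in self.short_term) / self.chunk_size
                      for i in range(dim)]
            self.long_term.append(pooled)
            self.short_term.clear()

    def retrieve(self, query, k=1):
        # Return the k long-term entries most similar to the query.
        ranked = sorted(self.long_term, key=lambda m: cosine(m, query), reverse=True)
        return ranked[:k]


class Reasoner:
    """Hypothetical reasoning module: answers a query from retrieved memories."""
    def __init__(self, memory):
        self.memory = memory

    def answer(self, query):
        return self.memory.retrieve(query, k=1)


class Perception:
    """Hypothetical perception module: stores stream features, triggers reasoning on queries."""
    def __init__(self, memory, reasoner):
        self.memory = memory
        self.reasoner = reasoner

    def on_frame(self, feature):
        self.memory.store(feature)

    def on_query(self, query):
        return self.reasoner.answer(query)


if __name__ == "__main__":
    memory = Memory(chunk_size=2)
    perception = Perception(memory, Reasoner(memory))
    for feature in ([1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0], [0.5, 0.5]):
        perception.on_frame(feature)
    print(perception.on_query([0.0, 1.0]))
```

In this sketch the perception component never waits on the reasoner while ingesting frames, and the reasoner only sees compressed long-term entries rather than the full input history. That mirrors, in miniature, the motivation given in the abstract for decoupling the modules instead of relying on one long context.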
