Remember, Retrieve and Generate: Understanding Infinite Visual Concepts as Your Personalized Assistant

Abstract

The development of large language models (LLMs) has significantly enhanced the capabilities of multimodal LLMs (MLLMs) as general assistants. However, their lack of user-specific knowledge still restricts their application in daily life. In this paper, we introduce the Retrieval Augmented Personalization (RAP) framework for personalizing MLLMs. Starting from a general MLLM, we turn it into a personalized assistant in three steps. (a) Remember: We design a key-value database to store user-related information, e.g., the user's name, avatar and other attributes. (b) Retrieve: When the user initiates a conversation, RAP retrieves relevant information from the database using a multimodal retriever. (c) Generate: The input query and the retrieved concept information are fed into the MLLM to generate personalized, knowledge-augmented responses. Unlike previous methods, RAP allows real-time concept editing by updating the external database. To further improve generation quality and alignment with user-specific information, we design a data collection pipeline and create a specialized dataset for personalized training of MLLMs. Based on this dataset, we train a series of MLLMs as personalized multimodal assistants. After pretraining on this large-scale dataset, RAP-MLLMs can generalize to infinite visual concepts without additional finetuning. Our models demonstrate outstanding flexibility and generation quality across a variety of tasks, such as personalized image captioning, question answering and visual recognition. The code, data and models are available at https://github.com/Hoar012/RAP-MLLM.
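The three steps named in the abstract (remember, retrieve, generate) can be illustrated with a minimal sketch. This is not the authors' implementation: the class and function names (`Concept`, `ConceptDB`, `build_prompt`), the hand-written three-dimensional "embeddings", the cosine-similarity lookup, the `min_score` threshold and the prompt layout are all assumptions made for illustration. In a real system the keys would come from a multimodal encoder and the assembled prompt would be passed to an MLLM together with the image.

```python
import math
from dataclasses import dataclass
from typing import List


@dataclass
class Concept:
    name: str          # user-given concept name (hypothetical example data)
    info: str          # free-text attributes to place in the prompt
    key: List[float]   # lookup key; toy vector standing in for an image embedding


def cosine(a: List[float], b: List[float]) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


class ConceptDB:
    """Toy key-value store: embedding key -> concept information."""

    def __init__(self) -> None:
        self._records = {}

    def remember(self, concept: Concept) -> None:
        # Inserting or overwriting a record is all that "editing" a concept takes;
        # no model weights are touched.
        self._records[concept.name] = concept

    def forget(self, name: str) -> None:
        self._records.pop(name, None)

    def retrieve(self, query_key: List[float], top_k: int = 1,
                 min_score: float = 0.5) -> List[Concept]:
        # Rank stored concepts by similarity to the query and keep the best matches.
        scored = [(cosine(query_key, c.key), c) for c in self._records.values()]
        scored.sort(key=lambda pair: pair[0], reverse=True)
        return [c for score, c in scored[:top_k] if score >= min_score]


def build_prompt(question: str, concepts: List[Concept]) -> str:
    # Assumed prompt layout: one line per retrieved concept, then the question.
    lines = [f"<{c.name}>: {c.info}" for c in concepts]
    return "\n".join(lines + [question])


# Remember: store two hypothetical user concepts.
db = ConceptDB()
db.remember(Concept("bo", "A corgi who loves tennis balls.", [0.9, 0.1, 0.0]))
db.remember(Concept("mug", "A blue ceramic mug.", [0.0, 0.2, 0.9]))

# Retrieve: the query vector stands in for the embedding of the user's image.
hits = db.retrieve([0.8, 0.2, 0.1], top_k=1)

# Generate: this prompt is what would be handed to the MLLM.
prompt = build_prompt("What is in this photo?", hits)
print(prompt)
```

Because the concept information lives outside the model, calling `remember` again with the same name or calling `forget` changes what later queries retrieve, which is the sense in which an external database permits editing without retraining.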

