Remember, Retrieve and Generate: Understanding Infinite Visual Concepts as Your Personalized Assistant
October 17, 2024
Authors: Haoran Hao, Jiaming Han, Changsheng Li, Yu-Feng Li, Xiangyu Yue
cs.AI
Abstract
The development of large language models (LLMs) has significantly enhanced
the capabilities of multimodal LLMs (MLLMs) as general assistants. However, a
lack of user-specific knowledge still restricts their application in people's
daily lives. In this paper, we introduce the Retrieval Augmented Personalization
(RAP) framework for personalizing MLLMs. Starting from a general MLLM, we
turn it into a personalized assistant in three steps. (a) Remember: We design a
key-value database to store user-related information, e.g., the user's name, avatar
and other attributes. (b) Retrieve: When the user initiates a conversation, RAP
retrieves relevant information from the database using a multimodal
retriever. (c) Generate: The input query and the retrieved concept information
are fed into the MLLM to generate personalized, knowledge-augmented responses.
Unlike previous methods, RAP allows real-time concept editing by updating the
external database. To further improve generation quality and alignment with
user-specific information, we design a pipeline for data collection and create
a specialized dataset for personalized training of MLLMs. Based on the dataset,
we train a series of MLLMs as personalized multimodal assistants. By
pretraining on a large-scale dataset, RAP-MLLMs can generalize to infinite visual
concepts without additional finetuning. Our models demonstrate outstanding
flexibility and generation quality across a variety of tasks, such as
personalized image captioning, question answering and visual recognition. The
code, data and models are available at https://github.com/Hoar012/RAP-MLLM.
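
The three steps described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the class and function names (`Concept`, `ConceptDB`, `build_prompt`), the hand-written three-dimensional vectors standing in for the output of a multimodal retriever, and the prompt layout are all assumptions made for illustration. The abstract does not specify the retriever, the database schema, or the prompt format.

```python
import math
from dataclasses import dataclass
from typing import List


@dataclass
class Concept:
    name: str               # key: the user's name for the concept
    embedding: List[float]  # hypothetical stand-in for an image embedding
    info: str               # value: free-text attributes of the concept


def cosine(a: List[float], b: List[float]) -> float:
    """Cosine similarity between two vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


class ConceptDB:
    """Toy key-value store; editing an entry needs no model retraining."""

    def __init__(self) -> None:
        self._items = {}

    def remember(self, concept: Concept) -> None:
        # Writing to an existing key overwrites it, i.e. edits the concept.
        self._items[concept.name] = concept

    def retrieve(self, query_embedding: List[float], top_k: int = 1) -> List[Concept]:
        # Rank stored concepts by similarity to the query embedding.
        ranked = sorted(
            self._items.values(),
            key=lambda c: cosine(query_embedding, c.embedding),
            reverse=True,
        )
        return ranked[:top_k]


def build_prompt(question: str, concepts: List[Concept]) -> str:
    """Prepend retrieved concept information to the user's question."""
    lines = [f"<{c.name}>: {c.info}" for c in concepts]
    return "\n".join(lines + [question])


# (a) Remember: store two hypothetical user concepts.
db = ConceptDB()
db.remember(Concept("bo", [0.9, 0.1, 0.0], "A corgi who loves tennis balls."))
db.remember(Concept("mia", [0.0, 0.2, 0.9], "A grey cat who sleeps on the sofa."))

# (b) Retrieve: the query vector stands in for an embedded input image.
query = [0.8, 0.2, 0.1]
top = db.retrieve(query, top_k=1)

# (c) Generate: in a real system this prompt would be passed to an MLLM.
prompt = build_prompt("What is in this photo?", top)
print(prompt)

# Concept editing: overwrite the database entry; no finetuning involved.
db.remember(Concept("bo", [0.9, 0.1, 0.0], "A corgi who now prefers frisbees."))
edited_prompt = build_prompt("What is in this photo?", db.retrieve(query))
```

In this sketch the edit to `bo` takes effect on the next retrieval, which mirrors the abstract's point that concepts are updated through the external database rather than through the model's weights.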