Remember, Retrieve and Generate: Understanding Infinite Visual Concepts as Your Personalized Assistant

Abstract

The development of large language models (LLMs) has significantly enhanced the capabilities of multimodal LLMs (MLLMs) as general assistants. However, the lack of user-specific knowledge still restricts their application in people's daily lives. In this paper, we introduce the Retrieval Augmented Personalization (RAP) framework for personalizing MLLMs. Starting from a general MLLM, we turn it into a personalized assistant in three steps. (a) Remember: We design a key-value database to store user-related information, e.g., the user's name, avatar, and other attributes. (b) Retrieve: When the user initiates a conversation, RAP retrieves relevant information from the database using a multimodal retriever. (c) Generate: The input query and the retrieved concept information are fed into the MLLM to generate personalized, knowledge-augmented responses. Unlike previous methods, RAP allows real-time concept editing by updating the external database. To further improve generation quality and alignment with user-specific information, we design a data collection pipeline and create a specialized dataset for personalized training of MLLMs. Based on this dataset, we train a series of MLLMs as personalized multimodal assistants. By pretraining on a large-scale dataset, RAP-MLLMs can generalize to infinite visual concepts without additional finetuning. Our models demonstrate outstanding flexibility and generation quality across a variety of tasks, such as personalized image captioning, question answering, and visual recognition. The code, data, and models are available at https://github.com/Hoar012/RAP-MLLM.
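The three steps in the abstract (remember, retrieve, generate) can be illustrated with a toy sketch. This is not the authors' implementation: the names `Concept`, `ConceptMemory`, and `build_prompt` are hypothetical, the embeddings are hand-written vectors standing in for the output of a multimodal retriever, and the generation step is represented only by assembling the text that would be passed to an MLLM.

```python
import math
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Concept:
    """One user-specific entry: a name (key) plus its stored information."""
    name: str
    info: str
    embedding: List[float]  # assumption: stands in for an avatar image embedding


def cosine(a: List[float], b: List[float]) -> float:
    """Cosine similarity between two vectors (0.0 if either has zero norm)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class ConceptMemory:
    """Hypothetical key-value store for user concepts."""

    def __init__(self) -> None:
        self._store: Dict[str, Concept] = {}

    def remember(self, concept: Concept) -> None:
        # Writing to an existing key overwrites it, which models
        # editing a concept without retraining anything.
        self._store[concept.name] = concept

    def retrieve(self, query_embedding: List[float], k: int = 1) -> List[Concept]:
        # Rank stored concepts by similarity to the query and keep the top k.
        ranked = sorted(
            self._store.values(),
            key=lambda c: cosine(query_embedding, c.embedding),
            reverse=True,
        )
        return ranked[:k]


def build_prompt(question: str, concepts: List[Concept]) -> str:
    """Prepend retrieved concept information to the user's question."""
    lines = [f"<{c.name}>: {c.info}" for c in concepts]
    return "\n".join(lines + [question])


# (a) Remember: store two made-up concepts.
memory = ConceptMemory()
memory.remember(Concept("bo", "A white corgi that loves frisbees.", [1.0, 0.0, 0.0]))
memory.remember(Concept("mug", "A blue ceramic mug.", [0.0, 1.0, 0.0]))

# (b) Retrieve: the query vector is closest to the "bo" entry.
top = memory.retrieve([0.9, 0.1, 0.0], k=1)

# (c) Generate: in a real system this prompt (plus images) would go to an MLLM.
prompt = build_prompt("What is in this image?", top)

# Editing: overwrite the entry, and the next retrieval sees the new information.
memory.remember(Concept("bo", "A white corgi that now wears a red collar.", [1.0, 0.0, 0.0]))
edited = memory.retrieve([0.9, 0.1, 0.0], k=1)
```

Because the concept information lives outside the model, changing what the assistant knows about `bo` is a dictionary write rather than a training run, which is the property the abstract refers to as real-time concept editing.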
