

Remember, Retrieve and Generate: Understanding Infinite Visual Concepts as Your Personalized Assistant

October 17, 2024
作者: Haoran Hao, Jiaming Han, Changsheng Li, Yu-Feng Li, Xiangyu Yue
cs.AI

Abstract

The development of large language models (LLMs) has significantly enhanced the capabilities of multimodal LLMs (MLLMs) as general assistants. However, the lack of user-specific knowledge still restricts their application in humans' daily lives. In this paper, we introduce the Retrieval Augmented Personalization (RAP) framework for personalizing MLLMs. Starting from a general MLLM, we turn it into a personalized assistant in three steps. (a) Remember: We design a key-value database to store user-related information, e.g., the user's name, avatar, and other attributes. (b) Retrieve: When the user initiates a conversation, RAP retrieves relevant information from the database using a multimodal retriever. (c) Generate: The input query and the retrieved concepts' information are fed into the MLLM to generate personalized, knowledge-augmented responses. Unlike previous methods, RAP allows real-time concept editing by updating the external database. To further improve generation quality and alignment with user-specific information, we design a data collection pipeline and create a specialized dataset for the personalized training of MLLMs. Based on this dataset, we train a series of MLLMs as personalized multimodal assistants. By pretraining on a large-scale dataset, RAP-MLLMs can generalize to infinite visual concepts without additional finetuning. Our models demonstrate outstanding flexibility and generation quality across a variety of tasks, such as personalized image captioning, question answering, and visual recognition. The code, data, and models are available at https://github.com/Hoar012/RAP-MLLM.
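The three-step pipeline described in the abstract maps naturally onto a small amount of code. The sketch below is a minimal, hypothetical illustration of Remember, Retrieve, and Generate, assuming a shared multimodal embedding space (e.g., a CLIP-style encoder producing both the stored keys and the query embedding) and an arbitrary callable standing in for the MLLM; the names `ConceptDatabase`, `remember`, `retrieve`, and `personalized_answer` are placeholders for illustration, not the API of the released RAP-MLLM code.

```python
import numpy as np


class ConceptDatabase:
    """Sketch of the key-value store from step (a): keys are concept image
    embeddings, values are user-specific records (name, attributes, etc.)."""

    def __init__(self):
        self.keys: list[np.ndarray] = []
        self.values: list[dict] = []

    def remember(self, image_embedding: np.ndarray, record: dict) -> None:
        # Real-time concept editing amounts to adding or updating entries
        # here; the MLLM itself is never retrained.
        self.keys.append(image_embedding / np.linalg.norm(image_embedding))
        self.values.append(record)

    def retrieve(self, query_embedding: np.ndarray, top_k: int = 2) -> list[dict]:
        # Step (b): cosine similarity between the query and all stored keys.
        q = query_embedding / np.linalg.norm(query_embedding)
        scores = np.stack(self.keys) @ q
        best = np.argsort(scores)[::-1][:top_k]
        return [self.values[i] for i in best]


def personalized_answer(mllm_generate, db, query_embedding, question):
    # Step (c): serialize the retrieved concept records into the prompt so
    # the MLLM can ground its response in user-specific knowledge.
    retrieved = db.retrieve(query_embedding)
    context = "\n".join(f"<{r['name']}> is {r['info']}" for r in retrieved)
    return mllm_generate(f"{context}\nUser: {question}")


# Usage with stand-in components: a real system would use a multimodal
# encoder for the embeddings and an actual MLLM for generation.
db = ConceptDatabase()
db.remember(np.random.randn(512), {"name": "Bob", "info": "the user's golden retriever"})
answer = personalized_answer(
    lambda prompt: f"[MLLM response to]\n{prompt}",  # stub generator
    db,
    np.random.randn(512),  # embedding of the user's query image
    "Who is in this photo?",
)
```

Because the personalization lives entirely in the external store, adding, editing, or removing a concept is a database update rather than a finetuning run, which is what makes the real-time concept editing described in the abstract possible.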
