

PUMA: Empowering Unified MLLM with Multi-granular Visual Generation

October 17, 2024
作者: Rongyao Fang, Chengqi Duan, Kun Wang, Hao Li, Hao Tian, Xingyu Zeng, Rui Zhao, Jifeng Dai, Hongsheng Li, Xihui Liu
cs.AI

Abstract

Recent advancements in multimodal foundation models have yielded significant progress in vision-language understanding. Initial attempts have also explored the potential of multimodal large language models (MLLMs) for visual content generation. However, existing works have insufficiently addressed the varying granularity demands of different image generation tasks within a unified MLLM paradigm, from the diversity required in text-to-image generation to the precise controllability needed in image manipulation. In this work, we propose PUMA, emPowering Unified MLLM with Multi-grAnular visual generation. PUMA unifies multi-granular visual features as both inputs and outputs of MLLMs, elegantly addressing the different granularity requirements of various image generation tasks within a unified MLLM framework. Following multimodal pretraining and task-specific instruction tuning, PUMA demonstrates proficiency in a wide range of multimodal tasks. This work represents a significant step towards a truly unified MLLM capable of adapting to the granularity demands of various visual tasks. The code and model will be released at https://github.com/rongyaofang/PUMA.
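
The abstract describes PUMA's core design, multi-granular visual features serving as both MLLM inputs and outputs, only at a high level. The following is a minimal sketch under assumptions of my own: the class name MultiGranularEncoder, the five token-count scales, and average pooling as the coarsening step are illustrative choices, not the paper's actual method. It shows one way a dense visual feature map could be reduced to several granularity levels for such a model.

import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiGranularEncoder(nn.Module):
    """Hypothetical multi-granular feature extractor (illustration only, not PUMA's code)."""
    def __init__(self, dim=1024, scales=(1, 4, 16, 64, 256)):
        super().__init__()
        self.scales = scales              # number of visual tokens kept per granularity level
        self.proj = nn.Linear(dim, dim)   # shared projection into the LLM token space

    def forward(self, patch_feats):
        # patch_feats: (B, N, D) dense patch features from a vision encoder,
        # where N is assumed to be a perfect square (e.g. 16 x 16 = 256 patches).
        B, N, D = patch_feats.shape
        side = int(N ** 0.5)
        grid = patch_feats.transpose(1, 2).reshape(B, D, side, side)
        levels = []
        for n_tokens in self.scales:
            s = int(n_tokens ** 0.5)
            pooled = F.adaptive_avg_pool2d(grid, s)       # (B, D, s, s), coarse to fine
            tokens = pooled.flatten(2).transpose(1, 2)    # (B, s*s, D)
            levels.append(self.proj(tokens))
        return levels  # one feature set per granularity, from 1 token up to 256 tokens

# Example: 256 patch features of width 1024 -> 5 granularity levels.
feats = torch.randn(2, 256, 1024)
for level in MultiGranularEncoder()(feats):
    print(level.shape)  # torch.Size([2, 1, 1024]) ... torch.Size([2, 256, 1024])

Under this reading, the coarse levels (few tokens) leave room for the diversity needed in text-to-image generation, while the fine levels (many tokens) preserve the detail required for precise image manipulation, which is the granularity trade-off the abstract highlights.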
