Phantom of Latent for Large Language and Vision Models
September 23, 2024
作者: Byung-Kwan Lee, Sangyun Chung, Chae Won Kim, Beomchan Park, Yong Man Ro
cs.AI
Abstract
The success of visual instruction tuning has accelerated the development of
large language and vision models (LLVMs). Following the scaling laws of
instruction-tuned large language models (LLMs), LLVMs have further increased
their sizes, reaching 26B, 34B, and even 80B parameters. While this
increase in model size has yielded significant performance gains, it demands
substantially more hardware resources for both training and inference.
Consequently, there naturally exists a strong need for efficient LLVMs that
achieve the performance of larger models while being smaller in size. To meet
this need, we present a new efficient LLVM family with model sizes of
0.5B, 1.8B, 3.8B, and 7B parameters, Phantom, which significantly enhances
learning capabilities within limited structures. By temporarily increasing the
latent hidden dimension during multi-head self-attention (MHSA), we prepare
LLVMs to look at and understand much more vision-language knowledge in the
latent space, without substantially increasing the physical model size. To maximize its
advantage, we introduce Phantom Optimization (PO), which uses both autoregressive
supervised fine-tuning (SFT) and a direct preference optimization (DPO)-like
concept to effectively follow correct answers while eliminating incorrect
and ambiguous ones. Phantom outperforms numerous larger open- and closed-source
LLVMs, positioning itself as a leading solution in the landscape of efficient
LLVMs.
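The abstract's central idea is that the hidden dimension is enlarged only inside the attention block and projected back afterward, so the model's persistent width (and parameter count outside the projections) stays small. Below is a minimal single-head NumPy sketch of that pattern; all dimensions and weight shapes are hypothetical illustrations, not the authors' actual architecture.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def latent_expanded_attention(x, Wq, Wk, Wv, Wo):
    """Single-head sketch: Q/K/V live in an enlarged latent dimension
    (d_latent > d_model) only during attention, then Wo projects the
    result back to d_model, so the extra width is temporary."""
    q, k, v = x @ Wq, x @ Wk, x @ Wv            # (seq, d_latent)
    d_latent = q.shape[-1]
    attn = softmax(q @ k.T / np.sqrt(d_latent))  # (seq, seq)
    out = attn @ v                               # (seq, d_latent)
    return out @ Wo                              # back to (seq, d_model)

rng = np.random.default_rng(0)
d_model, d_latent, seq = 64, 128, 10             # hypothetical sizes
x = rng.standard_normal((seq, d_model))
Wq, Wk, Wv = (rng.standard_normal((d_model, d_latent)) * 0.1 for _ in range(3))
Wo = rng.standard_normal((d_latent, d_model)) * 0.1
y = latent_expanded_attention(x, Wq, Wk, Wv, Wo)
print(y.shape)  # (10, 64) -- output width matches the model, not the latent
```

The key point of the sketch is that the activations are wide only transiently; everything entering and leaving the block has width `d_model`.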
AI-Generated Summary
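The abstract names only the ingredients of Phantom Optimization (autoregressive SFT plus a DPO-like preference term) without giving the objective, so the following is a hypothetical sketch of the DPO-style component alone: a margin, relative to a frozen reference model, that rewards the correct answer and penalizes an incorrect or ambiguous one. The function name and the example log-probabilities are made up for illustration.

```python
import numpy as np

def dpo_like_loss(logp_correct, logp_ref_correct,
                  logp_wrong, logp_ref_wrong, beta=0.1):
    """DPO-style term: -log sigmoid of the reference-adjusted margin
    between the correct and the incorrect/ambiguous answer."""
    margin = beta * ((logp_correct - logp_ref_correct)
                     - (logp_wrong - logp_ref_wrong))
    return -np.log(1.0 / (1.0 + np.exp(-margin)))

# Policy prefers the correct answer more than the reference does:
l_better = dpo_like_loss(-1.0, -2.0, -3.0, -2.0)
# Policy matches the reference on both answers (zero margin):
l_neutral = dpo_like_loss(-2.0, -2.0, -2.0, -2.0)  # -log 0.5 ~= 0.693
print(l_better, l_neutral)
```

As expected for a preference loss, the value drops when the policy separates the correct answer from the wrong one (`l_better < l_neutral`); in the paper's training recipe this term would be combined with the autoregressive SFT cross-entropy.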