Phantom of Latent for Large Language and Vision Models
September 23, 2024
作者: Byung-Kwan Lee, Sangyun Chung, Chae Won Kim, Beomchan Park, Yong Man Ro
cs.AI
Abstract
The success of visual instruction tuning has accelerated the development of
large language and vision models (LLVMs). Following the scaling laws of
instruction-tuned large language models (LLMs), LLVMs have further increased
their sizes, reaching 26B, 34B, and even 80B parameters. While this
increase in model size has yielded significant performance gains, it demands
substantially more hardware resources for both training and inference.
Consequently, there naturally exists a strong need for efficient LLVMs that
achieve the performance of larger models while being smaller in size. To meet
this need, we present a new efficient LLVM family with model sizes of
0.5B, 1.8B, 3.8B, and 7B parameters, Phantom, which significantly enhances
learning capabilities within limited structures. By temporarily increasing the
latent hidden dimension during multi-head self-attention (MHSA), we prepare
LLVMs to look at and understand much more vision-language knowledge in the
latent space, without substantially increasing the physical model size. To maximize its
advantage, we introduce Phantom Optimization (PO), which uses both autoregressive
supervised fine-tuning (SFT) and a direct preference optimization (DPO)-like
concept to effectively follow correct answers while eliminating incorrect
and ambiguous ones. Phantom outperforms numerous larger open- and closed-source
LLVMs, positioning itself as a leading solution in the landscape of efficient
LLVMs.
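The abstract's central idea is that the hidden dimension is enlarged only inside the attention block and projected back afterward, so the model's persistent width (and parameter count outside the projections) stays small. Below is a minimal single-head NumPy sketch of that pattern; all dimensions and weight shapes are hypothetical illustrations, not the authors' actual architecture.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def latent_expanded_attention(x, Wq, Wk, Wv, Wo):
    """Single-head sketch: Q/K/V live in an enlarged latent dimension
    (d_latent > d_model) only during attention, then Wo projects the
    result back to d_model, so the extra width is temporary."""
    q, k, v = x @ Wq, x @ Wk, x @ Wv            # (seq, d_latent)
    d_latent = q.shape[-1]
    attn = softmax(q @ k.T / np.sqrt(d_latent))  # (seq, seq)
    out = attn @ v                               # (seq, d_latent)
    return out @ Wo                              # back to (seq, d_model)

rng = np.random.default_rng(0)
d_model, d_latent, seq = 64, 128, 10             # hypothetical sizes
x = rng.standard_normal((seq, d_model))
Wq, Wk, Wv = (rng.standard_normal((d_model, d_latent)) * 0.1 for _ in range(3))
Wo = rng.standard_normal((d_latent, d_model)) * 0.1
y = latent_expanded_attention(x, Wq, Wk, Wv, Wo)
print(y.shape)  # (10, 64) -- output width matches the model, not the latent
```

The key point of the sketch is that the activations are wide only transiently; everything entering and leaving the block has width `d_model`.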
AI-Generated Summary
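The abstract names only the ingredients of Phantom Optimization (autoregressive SFT plus a DPO-like preference term) without giving the objective, so the following is a hypothetical sketch of the DPO-style component alone: a margin, relative to a frozen reference model, that rewards the correct answer and penalizes an incorrect or ambiguous one. The function name and the example log-probabilities are made up for illustration.

```python
import numpy as np

def dpo_like_loss(logp_correct, logp_ref_correct,
                  logp_wrong, logp_ref_wrong, beta=0.1):
    """DPO-style term: -log sigmoid of the reference-adjusted margin
    between the correct and the incorrect/ambiguous answer."""
    margin = beta * ((logp_correct - logp_ref_correct)
                     - (logp_wrong - logp_ref_wrong))
    return -np.log(1.0 / (1.0 + np.exp(-margin)))

# Policy prefers the correct answer more than the reference does:
l_better = dpo_like_loss(-1.0, -2.0, -3.0, -2.0)
# Policy matches the reference on both answers (zero margin):
l_neutral = dpo_like_loss(-2.0, -2.0, -2.0, -2.0)  # -log 0.5 ~= 0.693
print(l_better, l_neutral)
```

As expected for a preference loss, the value drops when the policy separates the correct answer from the wrong one (`l_better < l_neutral`); in the paper's training recipe this term would be combined with the autoregressive SFT cross-entropy.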