Task Vectors are Cross-Modal
October 29, 2024
Authors: Grace Luo, Trevor Darrell, Amir Bar
cs.AI
Abstract
We investigate the internal representations of vision-and-language models
(VLMs) and how they encode task representations. We consider tasks specified
through examples or instructions, using either text or image inputs.
Surprisingly, we find that conceptually similar tasks are mapped to similar
task vector representations, regardless of how they are specified. Our findings
suggest that to output answers, tokens in VLMs undergo three distinct phases:
input, task, and answer, a process which is consistent across different
modalities and specifications. The task vectors we identify in VLMs are general
enough to be derived in one modality (e.g., text) and transferred to another
(e.g., image). Additionally, we find that ensembling exemplar- and
instruction-based task vectors produces better task representations. Taken together, these
insights shed light on the underlying mechanisms of VLMs, particularly their
ability to represent tasks in a shared manner across different modalities and
task specifications. Project page:
https://task-vectors-are-cross-modal.github.io.
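The transfer mechanism the abstract describes can be made concrete with a short sketch. The snippet below is a minimal, hypothetical illustration of the exemplar-based recipe: average a mid-layer hidden state at the final prompt token over a few text exemplars, then patch that vector into the forward pass of a new query via a forward hook. The model name, layer index, Llama-style `model.model.layers` path, and helper names are assumptions for illustration, not the authors' released code; in the paper the analogous patching happens inside a VLM, so a vector derived from text can steer an image query.

```python
# Minimal sketch of task-vector extraction and patching.
# Assumes a Llama-style causal LM; in the paper the same idea is
# applied inside a VLM, which is what makes the vectors cross-modal.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "meta-llama/Llama-2-7b-hf"  # hypothetical model choice
LAYER = 15                          # hypothetical patching layer

tok = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(MODEL).eval()

@torch.no_grad()
def task_vector(exemplars, layer=LAYER):
    """Mean last-token hidden state at `layer` over exemplar prompts."""
    states = []
    for prompt in exemplars:
        ids = tok(prompt, return_tensors="pt").input_ids
        out = model(ids, output_hidden_states=True)
        # hidden_states[layer + 1] is the output of decoder layer `layer`
        states.append(out.hidden_states[layer + 1][0, -1])
    return torch.stack(states).mean(dim=0)

@torch.no_grad()
def patched_generate(query, vec, layer=LAYER, max_new_tokens=8):
    """Overwrite the query's last prompt-token activation with `vec`."""
    def hook(module, args, output):
        h = output[0] if isinstance(output, tuple) else output
        if h.shape[1] > 1:      # patch only the initial prompt pass
            h[:, -1] = vec      # inject the task vector in place
        return output
    handle = model.model.layers[layer].register_forward_hook(hook)
    try:
        ids = tok(query, return_tensors="pt").input_ids
        gen = model.generate(ids, max_new_tokens=max_new_tokens)
    finally:
        handle.remove()
    return tok.decode(gen[0, ids.shape[1]:], skip_special_tokens=True)

# Exemplars specify the task implicitly (here: country -> capital);
# the resulting vector then steers an otherwise bare query.
vec = task_vector(["France : Paris", "Japan : Tokyo", "Chile : Santiago"])
print(patched_generate("Germany :", vec))
```

Under the same assumptions, one simple reading of the abstract's ensembling result is to average an exemplar-derived and an instruction-derived vector, e.g. `vec = (vec_ex + vec_instr) / 2`, before patching.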