
Task Vectors are Cross-Modal

October 29, 2024
Authors: Grace Luo, Trevor Darrell, Amir Bar
cs.AI

Abstract

We investigate the internal representations of vision-and-language models (VLMs) and how they encode task representations. We consider tasks specified through examples or instructions, using either text or image inputs. Surprisingly, we find that conceptually similar tasks are mapped to similar task vector representations, regardless of how they are specified. Our findings suggest that to output answers, tokens in VLMs undergo three distinct phases: input, task, and answer, a process which is consistent across different modalities and specifications. The task vectors we identify in VLMs are general enough to be derived in one modality (e.g., text) and transferred to another (e.g., image). Additionally, we find that ensembling exemplar- and instruction-based task vectors produces better task representations. Taken together, these insights shed light on the underlying mechanisms of VLMs, particularly their ability to represent tasks in a shared manner across different modalities and task specifications. Project page: https://task-vectors-are-cross-modal.github.io.
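To make the abstract's core operations concrete, the following is a minimal numpy sketch of the two ideas it names: summarizing a task from hidden states into a single task vector, and ensembling exemplar-based and instruction-based vectors. This is an illustrative sketch under simplifying assumptions, not the authors' implementation; the function names, the use of mean pooling, and the dimensions are all hypothetical.

```python
import numpy as np

def task_vector(hidden_states: np.ndarray) -> np.ndarray:
    """Summarize a task from per-example hidden states.

    hidden_states: array of shape (n_examples, d_model), activations
    taken at a position where the model has processed the task
    specification. Averaging across examples yields one task vector.
    (Mean pooling is an assumption for illustration.)
    """
    return hidden_states.mean(axis=0)

def ensemble(exemplar_vec: np.ndarray, instruction_vec: np.ndarray) -> np.ndarray:
    """Combine exemplar- and instruction-based task vectors by averaging
    (one simple way to 'ensemble' two vector representations)."""
    return (exemplar_vec + instruction_vec) / 2.0

# Toy demo: 8 exemplar-derived and 4 instruction-derived activations
# in a 16-dimensional hidden space (sizes are illustrative only).
rng = np.random.default_rng(0)
v_exemplar = task_vector(rng.normal(size=(8, 16)))
v_instruction = task_vector(rng.normal(size=(4, 16)))
v_task = ensemble(v_exemplar, v_instruction)
```

In this sketch the task vector is modality-agnostic by construction: whether the hidden states came from text or image inputs, the result is a single vector in the model's hidden space that can be patched into another context.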
