Large Multi-modal Models Can Interpret Features in Large Multi-modal Models

November 22, 2024
Authors: Kaichen Zhang, Yifei Shen, Bo Li, Ziwei Liu
cs.AI

Abstract

Recent advances in Large Multimodal Models (LMMs) have led to significant breakthroughs in both academia and industry. One question that arises is how we, as humans, can understand their internal neural representations. This paper takes an initial step towards addressing this question by presenting a versatile framework to identify and interpret the semantics within LMMs. Specifically, 1) we first apply a Sparse Autoencoder (SAE) to disentangle the representations into human-understandable features. 2) We then present an automatic interpretation framework in which the open-semantic features learned by the SAE are interpreted by LMMs themselves. We employ this framework to analyze the LLaVA-NeXT-8B model using the LLaVA-OV-72B model, demonstrating that these features can effectively steer the model's behavior. Our results contribute to a deeper understanding of why LMMs excel in specific tasks, including EQ tests, and illuminate the nature of their mistakes along with potential strategies for their rectification. These findings offer new insights into the internal mechanisms of LMMs and suggest parallels with the cognitive processes of the human brain.
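
The abstract does not specify the exact SAE architecture or training recipe used in the paper. The sketch below is only a minimal, generic sparse autoencoder in PyTorch to illustrate step 1 (decomposing LMM activations into sparse, potentially interpretable features); the class name `SparseAutoencoder`, the feature width `d_features=16384`, and the L1 sparsity coefficient `l1_coeff` are illustrative assumptions, not details taken from the paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SparseAutoencoder(nn.Module):
    """Minimal sparse autoencoder that maps model activations into an
    overcomplete, sparse feature space and reconstructs them back."""

    def __init__(self, d_model: int, d_features: int, l1_coeff: float = 1e-3):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_features)
        self.decoder = nn.Linear(d_features, d_model)
        self.l1_coeff = l1_coeff

    def forward(self, activations: torch.Tensor):
        # ReLU keeps feature activations non-negative and encourages sparsity.
        features = F.relu(self.encoder(activations))
        # Reconstruct the original activations from the sparse features.
        reconstruction = self.decoder(features)
        return reconstruction, features

    def loss(self, activations: torch.Tensor) -> torch.Tensor:
        reconstruction, features = self(activations)
        recon_loss = F.mse_loss(reconstruction, activations)
        sparsity_loss = self.l1_coeff * features.abs().mean()
        return recon_loss + sparsity_loss


# Hypothetical usage: `activations` stands in for hidden states collected from
# a chosen layer of an LMM such as LLaVA-NeXT-8B (shape: [n_tokens, d_model]).
activations = torch.randn(1024, 4096)
sae = SparseAutoencoder(d_model=4096, d_features=16384)
optimizer = torch.optim.Adam(sae.parameters(), lr=1e-4)
for _ in range(10):  # toy training loop for illustration only
    optimizer.zero_grad()
    loss = sae.loss(activations)
    loss.backward()
    optimizer.step()
```

In step 2 of the paper's framework, individual SAE features would then be interpreted by prompting a larger LMM (here, LLaVA-OV-72B) with inputs that activate them; that interpretation loop is model- and prompt-specific and is not sketched here.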
