Gli Autoencoder Sparse Apprendono Caratteristiche Monosematiche nei Modelli Visione-Linguaggio

Abstract

Gli Autoencoder Sparse (SAE) hanno recentemente dimostrato di migliorare l'interpretabilità e la controllabilità nei Large Language Models (LLM). In questo lavoro, estendiamo l'applicazione degli SAE ai Vision-Language Models (VLM), come CLIP, e introduciamo un framework completo per valutare la monosemanticità nelle rappresentazioni visive. I nostri risultati sperimentali rivelano che gli SAE addestrati su VLM migliorano significativamente la monosemanticità dei singoli neuroni, mostrando anche rappresentazioni gerarchiche che si allineano bene con strutture definite da esperti (ad esempio, la tassonomia di iNaturalist). In particolare, dimostriamo che l'applicazione degli SAE per intervenire su un encoder visivo di CLIP consente di controllare direttamente l'output di LLM multimodali (ad esempio, LLaVA) senza alcuna modifica al modello sottostante. Questi risultati sottolineano la praticità e l'efficacia degli SAE come approccio non supervisionato per migliorare sia l'interpretabilità che il controllo dei VLM.

English

Sparse Autoencoders (SAEs) have recently been shown to enhance interpretability and steerability in Large Language Models (LLMs). In this work, we extend the application of SAEs to Vision-Language Models (VLMs), such as CLIP, and introduce a comprehensive framework for evaluating monosemanticity in vision representations. Our experimental results reveal that SAEs trained on VLMs significantly enhance the monosemanticity of individual neurons while also exhibiting hierarchical representations that align well with expert-defined structures (e.g., iNaturalist taxonomy). Most notably, we demonstrate that applying SAEs to intervene on a CLIP vision encoder, directly steer output from multimodal LLMs (e.g., LLaVA) without any modifications to the underlying model. These findings emphasize the practicality and efficacy of SAEs as an unsupervised approach for enhancing both the interpretability and control of VLMs.

Gli Autoencoder Sparse Apprendono Caratteristiche Monosematiche nei Modelli Visione-Linguaggio

Sparse Autoencoders Learn Monosemantic Features in Vision-Language Models

Abstract

Summary

Support

Support