Graph-Aware Isomorphic Attention for Adaptive Dynamics in Transformers

January 4, 2025
Author: Markus J. Buehler
cs.AI

Abstract

We present an approach to modifying Transformer architectures by integrating graph-aware relational reasoning into the attention mechanism, merging concepts from graph neural networks and language modeling. Building on the inherent connection between attention and graph theory, we reformulate the Transformer's attention mechanism as a graph operation and propose Graph-Aware Isomorphic Attention. This method leverages advanced graph modeling strategies, including Graph Isomorphism Networks (GIN) and Principal Neighborhood Aggregation (PNA), to enrich the representation of relational structures. Our approach captures complex dependencies and generalizes across tasks, as evidenced by a reduced generalization gap and improved learning performance. Additionally, we expand the concept of graph-aware attention to introduce Sparse GIN-Attention, a fine-tuning approach that employs sparse GINs. By interpreting attention matrices as sparse adjacency graphs, this technique enhances the adaptability of pre-trained foundational models with minimal computational overhead, endowing them with graph-aware capabilities. Sparse GIN-Attention fine-tuning achieves improved training dynamics and better generalization compared to alternative methods like low-rank adaptation (LoRA). We discuss latent graph-like structures within traditional attention mechanisms, offering a new lens through which Transformers can be understood. By evolving Transformers into hierarchical GIN models for relational reasoning, this perspective has profound implications for foundational model development, enabling the design of architectures that dynamically adapt to both local and global dependencies. Applications in bioinformatics, materials science, language modeling, and beyond could benefit from this synthesis of relational and sequential data modeling, setting the stage for interpretable and generalizable modeling strategies.
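
The abstract's central idea, treating an attention matrix as a sparse adjacency graph and applying a GIN-style update over it, can be illustrated with a short sketch. The PyTorch code below is a minimal illustration under assumptions of our own, not the paper's implementation: a single attention head, a hypothetical magnitude cutoff (threshold) used to sparsify the row-normalized attention scores, and a toy module name (SparseGINAttentionSketch).

import torch
import torch.nn as nn


class SparseGINAttentionSketch(nn.Module):
    # Minimal sketch (not the authors' code): attention scores are reused as a
    # weighted adjacency matrix, weak edges are pruned, and a GIN-style update
    # aggregates value vectors over the resulting sparse graph.
    def __init__(self, d_model: int, threshold: float = 0.05):
        super().__init__()
        self.q_proj = nn.Linear(d_model, d_model)
        self.k_proj = nn.Linear(d_model, d_model)
        self.v_proj = nn.Linear(d_model, d_model)
        self.threshold = threshold                 # hypothetical sparsification cutoff
        self.eps = nn.Parameter(torch.zeros(1))    # learnable GIN epsilon
        # GIN update MLP applied after neighborhood aggregation
        self.gin_mlp = nn.Sequential(
            nn.Linear(d_model, d_model),
            nn.GELU(),
            nn.Linear(d_model, d_model),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, d_model)
        q, k, v = self.q_proj(x), self.k_proj(x), self.v_proj(x)
        scores = torch.softmax(
            q @ k.transpose(-2, -1) / (x.size(-1) ** 0.5), dim=-1
        )  # (batch, seq_len, seq_len) row-stochastic attention matrix

        # Interpret the attention matrix as a weighted adjacency and zero out
        # weak edges to obtain a sparse graph.
        adj = torch.where(scores >= self.threshold, scores, torch.zeros_like(scores))

        # GIN-style aggregation over the sparse adjacency:
        # h' = MLP((1 + eps) * h + sum_j A_ij * h_j), with h taken as the value vectors.
        aggregated = (1.0 + self.eps) * v + adj @ v
        return x + self.gin_mlp(aggregated)        # residual connection


if __name__ == "__main__":
    layer = SparseGINAttentionSketch(d_model=64)
    tokens = torch.randn(2, 10, 64)
    print(layer(tokens).shape)  # torch.Size([2, 10, 64])

In a fine-tuning setting along the lines the abstract describes, a block like this would plausibly be attached to a frozen pre-trained attention layer as a lightweight adapter, playing a role analogous to LoRA but operating on the attention-derived graph rather than on low-rank weight updates.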
