Selecting Influential Samples for Long Context Alignment via Homologous Models' Guidance and Contextual Awareness Measurement
October 21, 2024
Authors: Shuzheng Si, Haozhe Zhao, Gang Chen, Yunshui Li, Kangyang Luo, Chuancheng Lv, Kaikai An, Fanchao Qi, Baobao Chang, Maosong Sun
cs.AI
Abstract
The expansion of large language models to effectively handle instructions
with extremely long contexts has yet to be fully investigated. The primary
obstacle lies in constructing a high-quality long instruction-following dataset
devised for long context alignment. Existing studies have attempted to scale up
the available data volume by synthesizing long instruction-following samples.
However, indiscriminately increasing the quantity of data without a
well-defined strategy for ensuring data quality may introduce low-quality
samples and restrict the final performance. To bridge this gap, we aim to
address the unique challenge of long-context alignment, i.e., modeling the
long-range dependencies for handling instructions and lengthy input contexts.
We propose GATEAU, a novel framework designed to identify the influential and
high-quality samples enriched with long-range dependency relations by utilizing
crafted Homologous Models' Guidance (HMG) and Contextual Awareness Measurement
(CAM). Specifically, HMG attempts to measure the difficulty of generating
corresponding responses due to the long-range dependencies, using the
perplexity scores of the response from two homologous models with different
context windows. CAM, in turn, measures the difficulty of
understanding long input contexts due to long-range dependencies by
evaluating whether the model's attention is focused on important segments.
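The two scoring ideas described above can be sketched as follows. This is an illustrative reconstruction, not the paper's actual formulation: the function names, the perplexity-difference form of the HMG score, and the attention-mass form of the CAM score are assumptions made for the example.

```python
import math

def perplexity(logprobs):
    # Perplexity from per-token log-probabilities of a response.
    return math.exp(-sum(logprobs) / len(logprobs))

def hmg_score(long_ctx_logprobs, short_ctx_logprobs):
    # Homologous Models' Guidance (illustrative): compare response
    # perplexity under two homologous models that differ only in
    # context window. A larger gap suggests the response depends
    # more heavily on long-range context.
    return perplexity(short_ctx_logprobs) - perplexity(long_ctx_logprobs)

def cam_score(attention_weights, important_idx):
    # Contextual Awareness Measurement (illustrative): fraction of
    # the model's attention mass that lands on the important
    # segments of the long input.
    total = sum(attention_weights)
    focused = sum(attention_weights[i] for i in important_idx)
    return focused / total

# Toy numbers only; real scores would come from model forward passes.
hmg = hmg_score([-0.5, -0.4, -0.6], [-1.2, -1.0, -1.1])
cam = cam_score([0.1, 0.05, 0.6, 0.25], important_idx=[2, 3])
```

Samples would then be ranked by such scores, with the most challenging ones kept as influential training data.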
Built upon both proposed methods, we select the most challenging samples as
influential data to effectively model long-range dependencies, thereby
improving the performance of LLMs. Comprehensive experiments indicate that
GATEAU effectively identifies samples enriched with long-range dependency
relations and the model trained on these selected samples exhibits better
instruction-following and long-context understanding capabilities.