MIG: Automatic Data Selection for Instruction Tuning by Maximizing Information Gain in Semantic Space
April 18, 2025
Authors: Yicheng Chen, Yining Li, Kai Hu, Zerun Ma, Haochen Ye, Kai Chen
cs.AI
Abstract
Data quality and diversity are key to the construction of effective
instruction-tuning datasets. With the increasing availability of open-source
instruction-tuning datasets, it is advantageous to automatically select
high-quality and diverse subsets from a vast amount of data. Existing methods
typically prioritize instance quality and use heuristic rules to maintain
diversity. However, this absence of a comprehensive view of the entire
collection often leads to suboptimal results. Moreover, heuristic rules
generally focus on distance or clustering within the embedding space, which
fails to accurately capture the intent of complex instructions in the semantic
space. To bridge this gap, we propose a unified method for quantifying the
information content of datasets. This method models the semantic space by
constructing a label graph and quantifies diversity based on the distribution
of information within the graph. Based on such a measurement, we further
introduce an efficient sampling method that selects data samples iteratively to
Maximize the Information Gain (MIG) in semantic
space. Experiments on various datasets and base models demonstrate that MIG
consistently outperforms state-of-the-art methods. Notably, the model
fine-tuned with 5% Tulu3 data sampled by MIG achieves comparable performance
to the official SFT model trained on the full dataset, with improvements of
+5.73% on AlpacaEval and +6.89% on Wildbench.
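
The iterative selection idea described above can be illustrated with a toy sketch. This is not the authors' implementation: the abstract does not give the information measure or the propagation rule, so the square-root (concave) score, the one-hop propagation over label-graph edges, and the names `propagate` and `greedy_select` are all assumptions made here for illustration. Each sample is assumed to be a mapping from labels to a quality-weighted score.

```python
import math


def propagate(sample_labels, graph):
    """Spread a sample's label weights one hop along label-graph edges.

    sample_labels: dict label -> weight (assumed quality-weighted score).
    graph: dict label -> dict neighbor -> edge weight in [0, 1] (assumed).
    """
    mass = {}
    for label, weight in sample_labels.items():
        mass[label] = mass.get(label, 0.0) + weight
        for neighbor, edge in graph.get(label, {}).items():
            mass[neighbor] = mass.get(neighbor, 0.0) + weight * edge
    return mass


def greedy_select(samples, graph, k):
    """Greedily pick up to k sample indices, each maximizing the gain in
    sum(sqrt(mass per label)), an assumed concave information score."""
    contributions = [propagate(s, graph) for s in samples]
    total = {}  # accumulated mass per label over the selected samples
    remaining = set(range(len(samples)))
    selected = []
    for _ in range(min(k, len(samples))):
        best, best_gain = None, -math.inf
        for i in sorted(remaining):  # sorted so ties go to the lowest index
            gain = sum(
                math.sqrt(total.get(label, 0.0) + m)
                - math.sqrt(total.get(label, 0.0))
                for label, m in contributions[i].items()
            )
            if gain > best_gain:
                best, best_gain = i, gain
        selected.append(best)
        remaining.remove(best)
        for label, m in contributions[best].items():
            total[label] = total.get(label, 0.0) + m
    return selected


# Hypothetical toy data: two near-duplicate "math" samples, one "coding" sample.
samples = [{"math": 1.0}, {"math": 1.0}, {"coding": 0.8}]
picked = greedy_select(samples, graph={}, k=2)
```

Because the assumed score is concave, a second "math" sample adds less than the first, so the lower-scored "coding" sample is picked second. This is how a single objective can trade off quality against diversity without a separate heuristic rule.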