CritiQ: Mining Data Quality Criteria from Human Preferences
February 26, 2025
Authors: Honglin Guo, Kai Lv, Qipeng Guo, Tianyi Liang, Zhiheng Xi, Demin Song, Qiuyinzhe Zhang, Yu Sun, Kai Chen, Xipeng Qiu, Tao Gui
cs.AI
Abstract
Language models depend heavily on high-quality data for optimal performance.
Existing approaches rely on manually designed heuristics, the perplexity of
existing models, training classifiers, or careful prompt engineering, which
require significant expert experience and human annotation effort while
introducing biases. We introduce CritiQ, a novel data selection method that
automatically mines criteria from human preferences for data quality with only
about 30 human-annotated pairs and performs efficient data selection. The main
component, CritiQ Flow, employs a manager agent to evolve quality criteria and
worker agents to make pairwise judgments. We build a knowledge base that
extracts quality criteria from previous work to boost CritiQ Flow. Compared to
perplexity- and classifier-based methods, verbal criteria are more
interpretable and possess reusable value. After deriving the criteria, we train
the CritiQ Scorer to give quality scores and perform efficient data selection.
We demonstrate the effectiveness of our method in the code, math, and logic
domains, achieving high accuracy on human-annotated test sets. To validate the
quality of the selected data, we continually train Llama 3.1 models and observe
improved performance on downstream tasks compared to uniform sampling. Ablation
studies validate the benefits of the knowledge base and the reflection process.
We analyze how criteria evolve and the effectiveness of majority voting.
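
To make the pairwise-judgment and majority-voting idea concrete, here is a minimal sketch. It is not the authors' implementation: in the paper the workers are LLM agents that each judge a pair under one verbal criterion, whereas here three toy heuristics stand in for them. All names (`majority_vote`, `predict_pair`, `accuracy`, and the three example criteria) are hypothetical and chosen only for illustration.

```python
from collections import Counter
from typing import Callable, Dict, List, Optional, Tuple

# A judge compares two texts under one criterion and returns "A", "B",
# or None when the criterion does not distinguish the pair.
# Assumption: in CritiQ Flow this role is played by an LLM worker agent.
Judge = Callable[[str, str], Optional[str]]


def majority_vote(votes: List[Optional[str]]) -> Optional[str]:
    """Return the side preferred by most judges, or None on a tie."""
    counted = Counter(v for v in votes if v in ("A", "B"))
    if counted["A"] == counted["B"]:
        return None
    return "A" if counted["A"] > counted["B"] else "B"


def predict_pair(judges: Dict[str, Judge], a: str, b: str) -> Optional[str]:
    """Collect one vote per criterion and aggregate by majority."""
    return majority_vote([judge(a, b) for judge in judges.values()])


def accuracy(judges: Dict[str, Judge], pairs: List[Tuple[str, str, str]]) -> float:
    """Fraction of human-labelled pairs (a, b, label) predicted correctly."""
    correct = sum(predict_pair(judges, a, b) == label for a, b, label in pairs)
    return correct / len(pairs)


# Toy stand-in criteria for code quality (hypothetical, not from the paper).
def has_comments(a: str, b: str) -> Optional[str]:
    ca, cb = a.count("#"), b.count("#")
    if ca == cb:
        return None
    return "A" if ca > cb else "B"


def is_longer(a: str, b: str) -> Optional[str]:
    if len(a) == len(b):
        return None
    return "A" if len(a) > len(b) else "B"


def has_docstring(a: str, b: str) -> Optional[str]:
    da, db = '"""' in a, '"""' in b
    if da == db:
        return None
    return "A" if da else "B"


JUDGES: Dict[str, Judge] = {
    "has_comments": has_comments,
    "is_longer": is_longer,
    "has_docstring": has_docstring,
}

GOOD = 'def f(x):\n    """Add one."""\n    # increment\n    return x + 1'
BAD = "f=lambda x:x+1"
ANNOTATED = [(GOOD, BAD, "A"), (BAD, GOOD, "B")]

if __name__ == "__main__":
    print(predict_pair(JUDGES, GOOD, BAD))
    print(accuracy(JUDGES, ANNOTATED))
```

In this sketch, agreement with the human-annotated pairs is the signal a manager could use to keep, revise, or drop a criterion; how the manager agent actually evolves criteria is described in the paper and is not modelled here.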