COIG-P: A High-Quality and Large-Scale Chinese Preference Dataset for Alignment with Human Values
April 7, 2025
作者: M-A-P Team, Siwei Wu, Jincheng Ren, Xinrun Du, Shuyue Guo, Xingwei Qu, Yiming Liang, Jie Liu, Yunwen Li, Tianyu Zheng, Boyu Feng, Huaqing Yuan, Zenith Wang, Jiaheng Liu, Wenhao Huang, Chenglin Cai, Haoran Que, Jian Yang, Yuelin Bai, Zekun Moore Wang, Zhouliang Yu, Qunshu Lin, Ding Pan, Yuchen Jiang, Tiannan Wang, Wangchunshu Zhou, Shenzhi Wang, Xingyuan Bu, Minghao Liu, Guoyin Wang, Ge Zhang, Chenghua Lin
cs.AI
Abstract
Aligning large language models (LLMs) with human preferences has achieved
remarkable success. However, existing Chinese preference datasets are limited
by small scale, narrow domain coverage, and lack of rigorous data validation.
Additionally, the reliance on human annotators for instruction and response
labeling significantly constrains the scalability of human preference datasets.
To address these challenges, we design an LLM-based Chinese preference dataset
annotation pipeline with no human intervention. Specifically, we crawled and
carefully filtered 92k high-quality Chinese queries and employed 15 mainstream
LLMs to generate and score chosen-rejected response pairs. Building on this
pipeline, we introduce COIG-P (Chinese Open Instruction Generalist - Preference), a
high-quality, large-scale Chinese preference dataset comprising 1,009k Chinese
preference pairs spanning 6 diverse domains: Chat, Code, Math, Logic, Novel,
and Role. Building upon COIG-P, to reduce the overhead of using LLMs for
scoring, we trained an 8B Chinese Reward Model (CRM) and meticulously
constructed a Chinese Reward Benchmark (CRBench). Evaluation results based on
AlignBench (Liu et al., 2024) show that COIG-P significantly outperforms other
Chinese preference datasets and brings performance improvements of 2% to 12%
for the Qwen2/2.5 and Infinity-Instruct-3M-0625 model series. The results
on CRBench demonstrate that our CRM has a strong and robust scoring ability. We
apply it to filter chosen-rejected response pairs in a test split of COIG-P,
and our experiments show that it is comparable to GPT-4o in identifying
low-quality samples while remaining efficient and cost-effective. Our code and
data are released at https://github.com/multimodal-art-projection/COIG-P.
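To illustrate the generate-then-score pairing step the abstract describes, the Python sketch below samples responses from several generator models, scores each with an LLM judge, and keeps chosen-rejected pairs whose score gap exceeds a margin. This is a minimal sketch under assumed conventions: `generate_response`, `judge_score`, the model list, and the `SCORE_GAP` threshold are hypothetical placeholders, not the released COIG-P pipeline.

```python
# Minimal sketch of a generate-then-score preference-pair builder.
# All names below (generate_response, judge_score, GENERATOR_MODELS,
# SCORE_GAP) are hypothetical placeholders, not the COIG-P implementation.
from itertools import combinations

GENERATOR_MODELS = ["model_a", "model_b", "model_c"]  # stand-ins for the 15 LLMs
SCORE_GAP = 2.0  # assumed minimum chosen-rejected score margin


def build_preference_pairs(query, generate_response, judge_score):
    """Generate candidate responses for one query, score them with an LLM
    judge, and return (chosen, rejected) pairs with a large enough score gap."""
    # Collect (score, response) for each generator model.
    scored = []
    for model in GENERATOR_MODELS:
        response = generate_response(model, query)
        scored.append((judge_score(query, response), response))

    # Pair up candidates; the higher-scored response becomes "chosen".
    pairs = []
    for (s_i, r_i), (s_j, r_j) in combinations(scored, 2):
        high, low = ((s_i, r_i), (s_j, r_j)) if s_i >= s_j else ((s_j, r_j), (s_i, r_i))
        if high[0] - low[0] >= SCORE_GAP:
            pairs.append({"prompt": query, "chosen": high[1], "rejected": low[1]})
    return pairs
```

Filtering on a score margin keeps only pairs with a clear quality separation; as the abstract notes, a trained reward model such as the CRM could stand in for the LLM judge to reduce scoring cost.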