Condor: Enhance LLM Alignment with Knowledge-Driven Data Synthesis and Refinement
January 21, 2025
Authors: Maosong Cao, Taolin Zhang, Mo Li, Chuyu Zhang, Yunxin Liu, Haodong Duan, Songyang Zhang, Kai Chen
cs.AI
Abstract
The quality of Supervised Fine-Tuning (SFT) data plays a critical role in
enhancing the conversational capabilities of Large Language Models (LLMs).
However, as LLMs become more advanced, the availability of high-quality
human-annotated SFT data has become a significant bottleneck, necessitating a
greater reliance on synthetic training data. In this work, we introduce Condor,
a novel two-stage synthetic data generation framework that incorporates World
Knowledge Tree and Self-Reflection Refinement to produce high-quality SFT data
at scale. Our experimental results demonstrate that a base model fine-tuned on
only 20K Condor-generated samples achieves superior performance compared to its
counterparts. The additional refinement stage in Condor further enables
iterative self-improvement for LLMs at various scales (up to 72B), validating
the effectiveness of our approach. Furthermore, our investigation into the
scaling of synthetic data in post-training reveals substantial unexplored
potential for performance improvements, opening promising avenues for future
research.
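
To make the two-stage pipeline described in the abstract concrete, the following is a minimal Python sketch of a Condor-style flow: sampling topics from a world-knowledge tree to synthesize instruction-response pairs (stage 1), then applying self-reflection refinement (stage 2). The `call_llm` helper, the toy `WORLD_KNOWLEDGE_TREE`, and the prompt wording are illustrative assumptions, not the authors' released implementation.

```python
# Illustrative sketch of a Condor-style two-stage SFT data pipeline.
# `call_llm` is a hypothetical stand-in for any chat-completion client.

import random


def call_llm(prompt: str) -> str:
    """Hypothetical LLM call; replace with a real chat-completion API."""
    return f"<model response to: {prompt[:40]}...>"


# Toy world-knowledge tree: domains mapped to leaf topics.
WORLD_KNOWLEDGE_TREE = {
    "science": ["astronomy", "genetics"],
    "arts": ["film history", "poetry"],
}


def synthesize_samples(n: int) -> list[dict]:
    """Stage 1: turn sampled knowledge-tree leaves into instruction-response pairs."""
    samples = []
    for _ in range(n):
        domain = random.choice(list(WORLD_KNOWLEDGE_TREE))
        topic = random.choice(WORLD_KNOWLEDGE_TREE[domain])
        question = call_llm(
            f"Write a challenging user question about {topic} ({domain})."
        )
        answer = call_llm(question)
        samples.append({"instruction": question, "response": answer})
    return samples


def refine(sample: dict) -> dict:
    """Stage 2: self-reflection refinement -- critique the draft answer, then rewrite it."""
    critique = call_llm(
        "Critique the following answer for correctness and helpfulness:\n"
        f"Q: {sample['instruction']}\nA: {sample['response']}"
    )
    improved = call_llm(
        f"Rewrite the answer to address this critique:\n{critique}\n"
        f"Q: {sample['instruction']}\nOriginal A: {sample['response']}"
    )
    return {**sample, "response": improved}


if __name__ == "__main__":
    sft_data = [refine(s) for s in synthesize_samples(3)]
    print(f"Produced {len(sft_data)} refined SFT samples.")
```

In this sketch the refined responses replace the drafts, mirroring how the abstract describes the refinement stage enabling iterative self-improvement; the actual knowledge tree, prompts, and filtering used in Condor are detailed in the paper.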