Critique Fine-Tuning: Learning to Critique is More Effective than Learning to Imitate

Abstract

Supervised Fine-Tuning (SFT) is commonly used to train language models to imitate annotated responses for given instructions. In this paper, we challenge this paradigm and propose Critique Fine-Tuning (CFT), a strategy where models learn to critique noisy responses rather than simply imitate correct ones. Inspired by human learning processes that emphasize critical thinking, CFT encourages deeper analysis and nuanced understanding, traits often overlooked by standard SFT. To validate the effectiveness of CFT, we construct a 50K-sample dataset from WebInstruct, using GPT-4o as the teacher to generate critiques in the form of (input=[query; noisy response], output=critique). CFT on this dataset yields a consistent 4-10% improvement over SFT on six math benchmarks with different base models such as Qwen2.5, Qwen2.5-Math, and DeepSeek-Math. We further expand to the MetaMath and NuminaMath datasets and observe similar gains over SFT. Notably, our Qwen2.5-Math-CFT model, trained on just 50K samples, matches or outperforms competitive models such as AceMath and Qwen2.5-Math-Instruct, both of which use over 2M samples, on most benchmarks. Ablation studies show that CFT is robust to the source of the noisy responses and to the choice of teacher critique model. Through these findings, we argue that critique-based training offers a more effective alternative for advancing the reasoning of language models.
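
To make the (input=[query; noisy response], output=critique) format concrete, the following is a minimal Python sketch of how a single CFT training example might be assembled and how it differs from an SFT example. The prompt wording, the `TrainingExample` class, and the helper names `build_cft_example` and `build_sft_example` are assumptions made for illustration; the abstract does not specify the actual template or data pipeline.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TrainingExample:
    """One (input, output) pair for fine-tuning (hypothetical container)."""
    input: str
    output: str


def build_sft_example(query: str, reference_response: str) -> TrainingExample:
    # SFT: the model sees the query and is trained to imitate the response.
    return TrainingExample(input=query.strip(), output=reference_response.strip())


def build_cft_example(query: str, noisy_response: str, critique: str) -> TrainingExample:
    # CFT: the input concatenates the query with a possibly flawed response,
    # and the training target is the teacher's critique of that response.
    # The template below is an assumed layout, not the one used in the paper.
    prompt = (
        f"Question:\n{query.strip()}\n\n"
        f"Candidate solution:\n{noisy_response.strip()}\n\n"
        "Critique the candidate solution and state whether it is correct."
    )
    return TrainingExample(input=prompt, output=critique.strip())


if __name__ == "__main__":
    example = build_cft_example(
        query="What is 12 * 11?",
        noisy_response="12 * 11 = 12 * 10 + 11 = 131.",
        critique=(
            "The decomposition is wrong: 12 * 11 = 12 * 10 + 12 = 132. "
            "The candidate solution is incorrect."
        ),
    )
    print(example.input)
    print("---")
    print(example.output)
```

In this sketch the only difference between the two objectives lies in what goes into the input and what serves as the target: SFT trains on the response itself, while CFT conditions on the response and trains on a critique of it.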
