
Self-Consistency Preference Optimization

November 6, 2024
Authors: Archiki Prasad, Weizhe Yuan, Richard Yuanzhe Pang, Jing Xu, Maryam Fazel-Zarandi, Mohit Bansal, Sainbayar Sukhbaatar, Jason Weston, Jane Yu
cs.AI

Abstract

Self-alignment, whereby models learn to improve themselves without human annotation, is a rapidly growing research area. However, existing techniques often fail to improve complex reasoning tasks due to the difficulty of assigning correct rewards. An orthogonal approach that is known to improve correctness is self-consistency, a method applied at inference time based on multiple sampling in order to find the most consistent answer. In this work, we extend the self-consistency concept to help train models. We thus introduce self-consistency preference optimization (ScPO), which iteratively trains consistent answers to be preferred over inconsistent ones on unsupervised new problems. We show ScPO leads to large improvements over conventional reward model training on reasoning tasks such as GSM8K and MATH, closing the gap with supervised training with gold answers or preferences, and that combining ScPO with standard supervised learning improves results even further. On ZebraLogic, ScPO finetunes Llama-3 8B to be superior to Llama-3 70B, Gemma-2 27B, and Claude-3 Haiku.
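The abstract describes ScPO only at a high level. The sketch below illustrates one plausible way to turn self-consistency votes on unlabeled problems into preference pairs for an iteration of preference training; the helper names (`sample_fn`, `extract_answer`), the sample count, and the vote-margin weighting are illustrative assumptions, not the authors' released implementation.

```python
from collections import Counter

def build_self_consistency_pairs(sample_fn, extract_answer, problems, num_samples=16):
    """Hypothetical sketch: for each unlabeled problem, sample several solutions,
    majority-vote on their final answers, and form a (chosen, rejected) preference
    pair from a solution matching the most common answer vs. one matching the
    least common answer. The vote margin serves as a confidence weight."""
    pairs = []
    for problem in problems:
        solutions = [sample_fn(problem) for _ in range(num_samples)]
        answers = [extract_answer(s) for s in solutions]
        ranked = Counter(answers).most_common()   # answers ordered by vote count
        top_ans, top_n = ranked[0]
        low_ans, low_n = ranked[-1]
        if top_ans == low_ans:                    # all samples agree: no preference signal
            continue
        chosen = next(s for s, a in zip(solutions, answers) if a == top_ans)
        rejected = next(s for s, a in zip(solutions, answers) if a == low_ans)
        weight = (top_n - low_n) / num_samples    # larger margin = more confident pair
        pairs.append({"prompt": problem, "chosen": chosen,
                      "rejected": rejected, "weight": weight})
    return pairs
```

In an iterative setup of this kind, each round would build pairs with the current model and then run a preference-optimization update (e.g., a DPO-style loss, optionally scaled by the confidence weight) before repeating; the exact loss and weighting scheme used in the paper should be taken from the paper itself.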
