Thinking LLMs: General Instruction Following with Thought Generation

Abstract

LLMs are typically trained to answer user questions or follow instructions in the way human experts would respond. However, in the standard alignment framework they lack the basic ability to think explicitly before answering. Thinking is important for complex questions that require reasoning and planning, but it can be applied to any task. We propose a training method that equips existing LLMs with such thinking abilities for general instruction following without the use of additional human data. We achieve this through an iterative search and optimization procedure that explores the space of possible thought generations, allowing the model to learn how to think without direct supervision. For each instruction, the thought candidates are scored by a judge model that evaluates only their responses, and are then optimized via preference optimization. We show that this procedure leads to superior performance on AlpacaEval and Arena-Hard, with gains from thinking on non-reasoning categories such as marketing, health and general knowledge, in addition to more traditional reasoning and problem-solving tasks.
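One step of the procedure described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: `Candidate`, `build_preference_pair`, `make_toy_sampler` and `toy_judge` are hypothetical names, the sampler and judge are toy stand-ins for real models, and taking the best- and worst-scored candidates as the preference pair is an assumption. The sketch shows only the property the abstract states, that the judge scores the response and never sees the thought. The preference-optimization update and the outer iteration are left out.

```python
from dataclasses import dataclass
from typing import Callable, Tuple


@dataclass
class Candidate:
    thought: str   # internal reasoning, never passed to the judge
    response: str  # the only part the judge scores


def build_preference_pair(
    instruction: str,
    sample: Callable[[str], Candidate],
    judge: Callable[[str, str], float],
    num_samples: int = 4,
) -> Tuple[Candidate, Candidate]:
    """Sample candidates for one instruction and return (chosen, rejected)."""
    candidates = [sample(instruction) for _ in range(num_samples)]
    # The judge receives the instruction and the response only.
    scores = [judge(instruction, c.response) for c in candidates]
    best = max(range(num_samples), key=scores.__getitem__)
    worst = min(range(num_samples), key=scores.__getitem__)
    # Assumption: highest vs. lowest score forms the preference pair.
    return candidates[best], candidates[worst]


def make_toy_sampler() -> Callable[[str], Candidate]:
    """Hypothetical stand-in for sampling thought + response from an LLM."""
    lengths = iter([2, 5, 1, 3])

    def sample(instruction: str) -> Candidate:
        n = next(lengths)
        return Candidate(
            thought=f"plan {n} points",
            response=" ".join(["point"] * n),
        )

    return sample


def toy_judge(instruction: str, response: str) -> float:
    """Hypothetical stand-in for a judge model: longer responses score higher."""
    return float(len(response.split()))


chosen, rejected = build_preference_pair(
    "Write a slogan.", make_toy_sampler(), toy_judge
)
print(chosen.thought, "|", rejected.thought)
```

In a full training loop, each (chosen, rejected) pair, thoughts included, would be passed to a preference-optimization step, and sampling would be repeated with the updated model. Because the score depends only on the response, a thought is reinforced only indirectly, through the quality of the response it leads to.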
