Patience Is The Key to Large Language Model Reasoning

Abstract

Recent advances in large language models, particularly through the Chain of Thought (CoT) approach, have brought significant improvements in solving complex problems. However, existing models either tend to sacrifice detailed reasoning for brevity because of user preferences, or require extensive and expensive training data to learn complicated reasoning, which limits their potential on complex tasks. To bridge this gap, following the concept of test-time scaling, we propose a simple method that encourages models to adopt a more patient reasoning style without introducing new knowledge or skills. Using a preference optimization approach, we generate detailed reasoning processes as positive examples and simple answers as negative examples, thereby training the model to favor thoroughness in its responses. Our results show a performance increase of up to 6.7% on GSM8k after training on only a lightweight dataset.
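The data construction described above can be illustrated with a minimal sketch. It assumes the common (prompt, chosen, rejected) layout used by preference optimization methods; the `PreferencePair` class, the `build_pair` helper, the word-count check, and the toy problem are all hypothetical and are not taken from the paper.

```python
from dataclasses import dataclass


@dataclass
class PreferencePair:
    prompt: str    # the question shown to the model
    chosen: str    # detailed, step-by-step solution (positive example)
    rejected: str  # brief answer (negative example)


def build_pair(question: str, detailed: str, brief: str) -> PreferencePair:
    """Pair a detailed solution (chosen) with a brief one (rejected).

    Assumption: word count is used here as a crude stand-in for
    "more detailed"; the paper does not specify such a check.
    """
    if len(detailed.split()) <= len(brief.split()):
        raise ValueError("the chosen response should be the more detailed one")
    return PreferencePair(prompt=question, chosen=detailed, rejected=brief)


# Hypothetical toy example, not taken from the paper's dataset.
pair = build_pair(
    question="Tom has 3 boxes with 4 apples each. He eats 2. How many are left?",
    detailed=(
        "Each box holds 4 apples and there are 3 boxes, so 3 * 4 = 12 apples. "
        "Tom eats 2, so 12 - 2 = 10 apples remain. The answer is 10."
    ),
    brief="3 * 4 - 2 = 10. The answer is 10.",
)
print(pair.chosen)
```

Both responses reach the same answer; only the level of detail differs, which matches the stated goal of changing reasoning style without adding new knowledge.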
