What Happened in LLMs Layers when Trained for Fast vs. Slow Thinking: A Gradient Perspective

Abstract

What makes a difference in the post-training of LLMs? We investigate the training patterns of different layers in large language models (LLMs) through the lens of gradients, when training with different responses and initial models. Given the recent popularity of training LLMs on reasoning paths such as chain-of-thought (CoT) and process rewards, we are specifically interested in how fast vs. slow thinking affects the layer-wise gradients. In our study, fast thinking without CoT leads to larger gradients and larger differences in gradients across layers than slow thinking (detailed CoT), indicating the learning stability brought by the latter. Moreover, pre-trained LLMs are less affected by the instability of fast thinking than instruction-tuned LLMs. Additionally, we study whether the gradient patterns can reflect the correctness of responses when training different LLMs using slow vs. fast thinking paths. The results show that the gradients of slow thinking can distinguish correct from irrelevant reasoning paths. As a comparison, we conduct similar gradient analyses on non-reasoning knowledge-learning tasks, on which, however, trivially increasing the response length does not lead to behaviors similar to those of slow thinking. Our study strengthens the fundamental understanding of LLM training and offers new insights into its efficiency and stability, which pave the way towards building a generalizable System-2 agent. Our code, data, and gradient statistics are available at https://github.com/MingLiiii/Layer_Gradient.
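As a rough illustration of what "larger gradients and larger differences of gradients across layers" could mean in practice, the sketch below summarizes a set of per-layer gradients by their average magnitude and by how much that magnitude varies from layer to layer. It is only a sketch under stated assumptions: the Frobenius norm is an assumed stand-in (this section does not state which metric the authors use), the helper names `frobenius_norm` and `layer_gradient_summary` are hypothetical, and the numbers are made-up toy values, not results from the study.

```python
import math
from statistics import mean, pstdev


def frobenius_norm(matrix):
    """Frobenius norm of a 2-D gradient given as a list of rows."""
    return math.sqrt(sum(v * v for row in matrix for v in row))


def layer_gradient_summary(layer_grads):
    """Summarize per-layer gradient magnitudes.

    layer_grads: list of 2-D gradients, one per layer. This input format is
    hypothetical; in a real setup the gradients would come from a backward
    pass over one training response.
    """
    norms = [frobenius_norm(g) for g in layer_grads]
    return {
        "norms": norms,           # one magnitude per layer
        "mean": mean(norms),      # overall gradient scale
        "spread": pstdev(norms),  # how much the layers differ from each other
    }


# Made-up toy gradients for illustration only (three layers each).
fast_grads = [[[3.0, 4.0]], [[0.6, 0.8]], [[6.0, 8.0]]]
slow_grads = [[[0.3, 0.4]], [[0.3, 0.4]], [[0.6, 0.8]]]

fast = layer_gradient_summary(fast_grads)
slow = layer_gradient_summary(slow_grads)

print("fast:", fast["mean"], fast["spread"])
print("slow:", slow["mean"], slow["spread"])
```

With these toy inputs the "fast" set has both a larger mean norm and a larger spread across layers, which is the kind of contrast the abstract describes; whether real gradients behave this way is the empirical question the study addresses.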
