Effectively Controlling Reasoning Models through Thinking Intervention
March 31, 2025
Authors: Tong Wu, Chong Xiang, Jiachen T. Wang, Prateek Mittal
cs.AI
Abstract
Reasoning-enhanced large language models (LLMs) explicitly generate
intermediate reasoning steps prior to generating final answers, helping the
model excel in complex problem-solving. In this paper, we demonstrate that this
emerging generation framework offers a unique opportunity for more fine-grained
control over model behavior. We propose Thinking Intervention, a novel paradigm
designed to explicitly guide the internal reasoning processes of LLMs by
strategically inserting or revising specific thinking tokens. We conduct
comprehensive evaluations across multiple tasks, including instruction
following on IFEval, instruction hierarchy on SEP, and safety alignment on
XSTest and SORRY-Bench. Our results demonstrate that Thinking Intervention
significantly outperforms baseline prompting approaches, achieving up to 6.7%
accuracy gains in instruction-following scenarios, 15.4% improvements in
reasoning about instruction hierarchies, and a 40.0% increase in refusal rates
for unsafe prompts using open-source DeepSeek R1 models. Overall, our work
opens a promising new research avenue for controlling reasoning LLMs.
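
The abstract only names the mechanism, so a small illustration may help. The sketch below seeds a reasoning model's thinking block with an intervention string before decoding continues, so the guidance appears inside the model's own chain of thought rather than in the user prompt. It assumes a Hugging Face transformers interface; the DeepSeek-R1-Distill-Qwen-7B checkpoint, the generate_with_intervention helper, and the intervention text are illustrative assumptions, not the paper's exact implementation.

```python
# Minimal sketch of the "Thinking Intervention" idea from the abstract:
# steer a reasoning model by inserting tokens of our choosing at the start
# of its thinking block before decoding continues. The checkpoint name,
# intervention string, and think-token handling are assumptions, not the
# paper's exact setup.
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "deepseek-ai/DeepSeek-R1-Distill-Qwen-7B"  # assumed checkpoint
tokenizer = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(MODEL, device_map="auto")

def generate_with_intervention(user_prompt: str, intervention: str) -> str:
    """Insert `intervention` at the start of the model's thinking trace."""
    prompt = tokenizer.apply_chat_template(
        [{"role": "user", "content": user_prompt}],
        tokenize=False,
        add_generation_prompt=True,
    )
    # Some R1-style chat templates already open the <think> block in the
    # generation prompt; only append the tag if the template did not.
    if not prompt.rstrip().endswith("<think>"):
        prompt += "<think>\n"
    prompt += intervention + "\n"
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    output = model.generate(**inputs, max_new_tokens=1024)
    # Return only the newly generated tokens (reasoning + final answer).
    new_tokens = output[0][inputs["input_ids"].shape[1]:]
    return tokenizer.decode(new_tokens, skip_special_tokens=True)

print(generate_with_intervention(
    "List three uses of graphene. Answer in exactly three bullet points.",
    "I must satisfy every formatting constraint in the user's request.",
))
```

Because the inserted text sits inside the reasoning trace, the model continues it as if it were its own thought, which is what distinguishes this setup from ordinary prompt engineering in the abstract's framing; the same hook would also support revising already-generated thinking tokens rather than only prepending.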