Effectively Controlling Reasoning Models through Thinking Intervention
March 31, 2025
Authors: Tong Wu, Chong Xiang, Jiachen T. Wang, Prateek Mittal
cs.AI
Abstract
Reasoning-enhanced large language models (LLMs) explicitly generate
intermediate reasoning steps prior to generating final answers, helping the
model excel in complex problem-solving. In this paper, we demonstrate that this
emerging generation framework offers a unique opportunity for more fine-grained
control over model behavior. We propose Thinking Intervention, a novel paradigm
designed to explicitly guide the internal reasoning processes of LLMs by
strategically inserting or revising specific thinking tokens. We conduct
comprehensive evaluations across multiple tasks, including instruction
following on IFEval, instruction hierarchy on SEP, and safety alignment on
XSTest and SORRY-Bench. Our results demonstrate that Thinking Intervention
significantly outperforms baseline prompting approaches, achieving up to 6.7%
accuracy gains in instruction-following scenarios, 15.4% improvements in
reasoning about instruction hierarchies, and a 40.0% increase in refusal rates
for unsafe prompts using open-source DeepSeek R1 models. Overall, our work
opens a promising new research avenue for controlling reasoning LLMs.
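
The abstract only names the mechanism, so a small illustration may help. The sketch below seeds a reasoning model's thinking block with an intervention string before decoding continues, so the guidance appears inside the model's own chain of thought rather than in the user prompt. It assumes a Hugging Face transformers interface; the DeepSeek-R1-Distill-Qwen-7B checkpoint, the generate_with_intervention helper, and the intervention text are illustrative assumptions, not the paper's exact implementation.

```python
# Minimal sketch of the "Thinking Intervention" idea from the abstract:
# steer a reasoning model by inserting tokens of our choosing at the start
# of its thinking block before decoding continues. The checkpoint name,
# intervention string, and think-token handling are assumptions, not the
# paper's exact setup.
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "deepseek-ai/DeepSeek-R1-Distill-Qwen-7B"  # assumed checkpoint
tokenizer = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(MODEL, device_map="auto")

def generate_with_intervention(user_prompt: str, intervention: str) -> str:
    """Insert `intervention` at the start of the model's thinking trace."""
    prompt = tokenizer.apply_chat_template(
        [{"role": "user", "content": user_prompt}],
        tokenize=False,
        add_generation_prompt=True,
    )
    # Some R1-style chat templates already open the <think> block in the
    # generation prompt; only append the tag if the template did not.
    if not prompt.rstrip().endswith("<think>"):
        prompt += "<think>\n"
    prompt += intervention + "\n"
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    output = model.generate(**inputs, max_new_tokens=1024)
    # Return only the newly generated tokens (reasoning + final answer).
    new_tokens = output[0][inputs["input_ids"].shape[1]:]
    return tokenizer.decode(new_tokens, skip_special_tokens=True)

print(generate_with_intervention(
    "List three uses of graphene. Answer in exactly three bullet points.",
    "I must satisfy every formatting constraint in the user's request.",
))
```

Because the inserted text sits inside the reasoning trace, the model continues it as if it were its own thought, which is what distinguishes this setup from ordinary prompt engineering in the abstract's framing; the same hook would also support revising already-generated thinking tokens rather than only prepending.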