

Think Inside the JSON: Reinforcement Strategy for Strict LLM Schema Adherence

February 18, 2025
作者: Bhavik Agarwal, Ishan Joshi, Viktoria Rojkova
cs.AI

Abstract

In this paper, we address the challenge of enforcing strict schema adherence in large language model (LLM) generation by leveraging LLM reasoning capabilities. Building on the DeepSeek R1 reinforcement learning framework, our approach trains structured reasoning skills of a 1.5B parameter model through a novel pipeline that combines synthetic reasoning dataset construction with custom reward functions under Group Relative Policy Optimization (GRPO). Specifically, we first perform R1 reinforcement learning on a 20K sample unstructured-to-structured dataset, mirroring the original DeepSeek R1 methods, to establish core reasoning abilities. Subsequently, we perform supervised fine-tuning on a separate 10K reasoning sample dataset, focusing on refining schema adherence for downstream tasks. Despite the relatively modest training scope, requiring approximately 20 hours on an 8xH100 GPU cluster for GRPO training and 3 hours on 1xA100 for SFT, our model demonstrates robust performance in enforcing schema consistency. We compare our ThinkJSON approach against the original DeepSeek R1 (671B), distilled versions of DeepSeek R1 (Qwen-1.5B and Qwen-7B), and Gemini 2.0 Flash (70B), showcasing its effectiveness in real-world applications. Our results underscore the practical utility of a resource-efficient framework for schema-constrained text generation.
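The abstract mentions custom reward functions under GRPO that score outputs for schema consistency. The paper's actual reward design is not given here; the sketch below is an illustrative assumption of how such a reward might grade a completion, with partial credit for valid JSON, present required keys, and correct field types (the function name and the 0.3/0.4/0.3 weights are hypothetical).

```python
import json

def schema_adherence_reward(completion: str, schema: dict) -> float:
    """Score a completion for JSON schema adherence in [0, 1].

    Illustrative scoring only: 0.0 for unparseable JSON, a base reward
    for a valid JSON object, plus partial credit for required keys and
    correctly typed properties.
    """
    try:
        obj = json.loads(completion)
    except json.JSONDecodeError:
        return 0.0  # malformed output earns no reward
    if not isinstance(obj, dict):
        return 0.1  # valid JSON, but not an object
    reward = 0.3  # base reward for a syntactically valid JSON object
    type_map = {"string": str, "integer": int, "number": (int, float),
                "boolean": bool, "array": list, "object": dict}
    required = schema.get("required", [])
    props = schema.get("properties", {})
    # Fraction of required keys present in the output.
    if required:
        reward += 0.4 * sum(1 for k in required if k in obj) / len(required)
    else:
        reward += 0.4
    # Fraction of schema-typed keys whose values have the expected type.
    typed = [k for k in obj if k in props and "type" in props[k]]
    if typed:
        ok = sum(1 for k in typed
                 if isinstance(obj[k], type_map.get(props[k]["type"], object)))
        reward += 0.3 * ok / len(typed)
    else:
        reward += 0.3
    return reward
```

A dense, shaped reward like this gives the policy a smoother signal than a binary valid/invalid check, which is generally helpful for RL on structured generation.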

