

xVerify: Efficient Answer Verifier for Reasoning Model Evaluations

April 14, 2025
作者: Ding Chen, Qingchen Yu, Pengyuan Wang, Wentao Zhang, Bo Tang, Feiyu Xiong, Xinchi Li, Minchuan Yang, Zhiyu Li
cs.AI

Abstract
With the release of the o1 model by OpenAI, reasoning models adopting slow thinking strategies have gradually emerged. As the responses generated by such models often include complex reasoning, intermediate steps, and self-reflection, existing evaluation methods are often inadequate. They struggle to determine whether the LLM output is truly equivalent to the reference answer, and also have difficulty identifying and extracting the final answer from long, complex responses. To address this issue, we propose xVerify, an efficient answer verifier for reasoning model evaluations. xVerify demonstrates strong capability in equivalence judgment, enabling it to effectively determine whether the answers produced by reasoning models are equivalent to reference answers across various types of objective questions. To train and evaluate xVerify, we construct the VAR dataset by collecting question-answer pairs generated by multiple LLMs across various datasets, leveraging multiple reasoning models and challenging evaluation sets designed specifically for reasoning model assessment. A multi-round annotation process is employed to ensure label accuracy. Based on the VAR dataset, we train multiple xVerify models of different scales. In evaluation experiments conducted on both the test set and generalization set, all xVerify models achieve overall F1 scores and accuracy exceeding 95%. Notably, the smallest variant, xVerify-0.5B-I, outperforms all evaluation methods except GPT-4o, while xVerify-3B-Ib surpasses GPT-4o in overall performance. These results validate the effectiveness and generalizability of xVerify.
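To make the equivalence-judgment task concrete, the sketch below contrasts a naive exact-match check (the kind of rule-based evaluation the abstract argues is inadequate for long reasoning traces) with a prompt template that frames the judgment for a verifier model such as xVerify. The prompt wording and both function names are illustrative assumptions, not the paper's actual implementation.

```python
# Illustrative sketch of answer-equivalence judgment; the template and
# baseline below are assumptions for exposition, not xVerify's real code.

def build_judgment_prompt(question: str, model_response: str,
                          reference_answer: str) -> str:
    """Format an equivalence-judgment query for a verifier model
    (hypothetical template)."""
    return (
        "Determine whether the final answer in the response is "
        "equivalent to the reference answer.\n"
        f"Question: {question}\n"
        f"Response: {model_response}\n"
        f"Reference answer: {reference_answer}\n"
        "Judgment (Correct/Incorrect):"
    )

def naive_exact_match(model_response: str, reference_answer: str) -> bool:
    """Rule-based baseline: substring match after normalization.
    Fails when the final answer is paraphrased (e.g. 'forty-two')
    or expressed in an equivalent but different form."""
    return reference_answer.strip().lower() in model_response.strip().lower()
```

A trained verifier consumes the formatted prompt and returns a Correct/Incorrect label, which is where the equivalence reasoning that the naive baseline lacks would live.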

