Is That Your Final Answer? Test-Time Scaling Improves Selective Question Answering

Abstract

Scaling the test-time compute of large language models has demonstrated impressive performance on reasoning benchmarks. However, existing evaluations of test-time scaling make the strong assumption that a reasoning system should always give an answer to any question provided. This overlooks concerns about whether a model is confident in its answer, and whether it is appropriate to always provide a response. To address these concerns, we extract confidence scores during reasoning for thresholding model responses. We find that increasing compute budget at inference time not only helps models answer more questions correctly, but also increases confidence in correct responses. We then extend the current paradigm of zero-risk responses during evaluation by considering settings with non-zero levels of response risk, and suggest a recipe for reporting evaluations under these settings.
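
To make the idea of thresholding responses under non-zero response risk concrete, the following is a minimal sketch and not the paper's implementation. It assumes each response carries a confidence score in [0, 1], that the system abstains when confidence falls below a threshold, and that a correct answer earns +1, an abstention earns 0, and a wrong answer costs a configurable penalty. The names `Response`, `selective_score`, and `wrong_penalty`, as well as the example values, are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Response:
    """One model response (hypothetical structure for illustration)."""
    answer: str
    confidence: float  # assumed to lie in [0, 1]
    correct: bool


def selective_score(responses: List[Response],
                    threshold: float,
                    wrong_penalty: float) -> float:
    """Average utility when answering only above a confidence threshold.

    Assumed scoring: +1 for a correct answer, 0 for abstaining,
    -wrong_penalty for an incorrect answer. Setting wrong_penalty=0
    corresponds to the zero-risk setting, where answering never hurts.
    """
    if not responses:
        raise ValueError("responses must be non-empty")
    total = 0.0
    for r in responses:
        if r.confidence < threshold:
            continue  # abstain: contributes 0 to the total
        total += 1.0 if r.correct else -wrong_penalty
    return total / len(responses)


# Made-up example data, not results from the paper.
responses = [
    Response("A", 0.95, True),
    Response("B", 0.60, False),
    Response("C", 0.80, True),
    Response("D", 0.30, False),
]

always_answer_zero_risk = selective_score(responses, threshold=0.0, wrong_penalty=0.0)
always_answer_with_risk = selective_score(responses, threshold=0.0, wrong_penalty=1.0)
selective_with_risk = selective_score(responses, threshold=0.7, wrong_penalty=1.0)
```

In this toy data, always answering looks fine when wrong answers are free, but once wrong answers carry a penalty, abstaining on low-confidence responses recovers the lost utility. This is the kind of comparison that an evaluation with non-zero response risk is meant to surface.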
