Cost-of-Pass: An Economic Framework for Evaluating Language Models

April 17, 2025
Authors: Mehmet Hamza Erol, Batu El, Mirac Suzgun, Mert Yuksekgonul, James Zou
cs.AI

Abstract

The widespread adoption of AI systems in the economy hinges on their ability to generate economic value that outweighs their inference costs. Evaluating this tradeoff requires metrics that account for both performance and costs. We propose a framework grounded in production theory for evaluating language models by combining accuracy and inference cost. We introduce "cost-of-pass", the expected monetary cost of generating a correct solution. We then define the "frontier cost-of-pass" as the minimum cost-of-pass achievable across available models or the "human expert", using the approximate cost of hiring an expert. Our analysis reveals distinct economic insights. First, lightweight models are most cost-effective for basic quantitative tasks, large models for knowledge-intensive ones, and reasoning models for complex quantitative problems, despite higher per-token costs. Second, tracking this frontier cost-of-pass over the past year reveals significant progress, particularly for complex quantitative tasks where the cost has roughly halved every few months. Third, to trace key innovations driving this progress, we examine counterfactual frontiers: estimates of cost-efficiency without specific model classes. We find that innovations in lightweight, large, and reasoning models have been essential for pushing the frontier in basic quantitative, knowledge-intensive, and complex quantitative tasks, respectively. Finally, we assess the cost reductions afforded by common inference-time techniques like majority voting and self-refinement, finding that their marginal accuracy gains rarely justify their costs. Our findings underscore that complementary model-level innovations are the primary drivers of cost-efficiency, and our economic framework provides a principled tool for measuring this progress and guiding deployment.
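
The two quantities defined in the abstract can be illustrated with a minimal sketch. It assumes that attempts are independent, that each attempt has a fixed cost, and that the expected cost of obtaining a correct solution is therefore the per-attempt cost divided by the success rate. The abstract does not spell out this formula, so treat it as an assumption. The function names, candidate names, and all numbers are hypothetical and are not results from the paper.

```python
import math


def cost_of_pass(cost_per_attempt: float, success_rate: float) -> float:
    """Expected spend to obtain one correct solution.

    Assumption: attempts are independent with a constant success rate,
    so the expected number of attempts is 1 / success_rate.
    """
    if cost_per_attempt < 0:
        raise ValueError("cost_per_attempt must be non-negative")
    if not 0.0 <= success_rate <= 1.0:
        raise ValueError("success_rate must lie in [0, 1]")
    if success_rate == 0.0:
        # A candidate that never succeeds has unbounded expected cost.
        return math.inf
    return cost_per_attempt / success_rate


def frontier_cost_of_pass(
    candidates: dict[str, tuple[float, float]],
) -> tuple[str, float]:
    """Return the cheapest candidate and its cost-of-pass.

    `candidates` maps a name to (cost_per_attempt, success_rate).
    """
    scored = {
        name: cost_of_pass(cost, rate)
        for name, (cost, rate) in candidates.items()
    }
    best = min(scored, key=scored.get)
    return best, scored[best]


# Hypothetical inputs for a single task, in dollars per attempt.
candidates = {
    "lightweight-model": (0.002, 0.40),
    "reasoning-model": (0.050, 0.80),
    "human-expert": (20.0, 1.0),  # approximate cost of hiring an expert
}

if __name__ == "__main__":
    name, value = frontier_cost_of_pass(candidates)
    print(f"frontier: {name} at ${value:.4f} per correct solution")
```

With these made-up inputs, the cheaper model sets the frontier even though it is less accurate, because its lower per-attempt cost more than offsets its lower success rate. The human-expert entry acts as an upper bound on the frontier for the task.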
