Are You Getting What You Pay For? Auditing Model Substitution in LLM APIs
April 7, 2025
Authors: Will Cai, Tianneng Shi, Xuandong Zhao, Dawn Song
cs.AI
Abstract
The proliferation of Large Language Models (LLMs) accessed via black-box APIs
introduces a significant trust challenge: users pay for services based on
advertised model capabilities (e.g., size, performance), but providers may
covertly substitute the specified model with a cheaper, lower-quality
alternative to reduce operational costs. This lack of transparency undermines
fairness, erodes trust, and complicates reliable benchmarking. Detecting such
substitutions is difficult due to the black-box nature, typically limiting
interaction to input-output queries. This paper formalizes the problem of model
substitution detection in LLM APIs. We systematically evaluate existing
verification techniques, including output-based statistical tests, benchmark
evaluations, and log probability analysis, under various realistic attack
scenarios like model quantization, randomized substitution, and benchmark
evasion. Our findings reveal the limitations of methods relying solely on text
outputs, especially against subtle or adaptive attacks. While log probability
analysis offers stronger guarantees when available, its accessibility is often
limited. We conclude by discussing the potential of hardware-based solutions
like Trusted Execution Environments (TEEs) as a pathway towards provable model
integrity, highlighting the trade-offs between security, performance, and
provider adoption. Code is available at
https://github.com/sunblaze-ucb/llm-api-audit
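
To make the output-based statistical tests mentioned above concrete, the sketch below compares the empirical distribution of a cheap output feature (the first word of each sampled completion) between the API under audit and a trusted reference deployment of the advertised model, using a chi-squared contingency test from SciPy. This is a minimal illustration of the general idea only, not the paper's protocol; the helper `sample_completions`, the first-word feature, and the significance threshold are assumptions introduced here for illustration.

# Minimal sketch (not the paper's protocol): output-based statistical test for
# model substitution. Sample completions for a fixed prompt from the API under
# audit and from a trusted reference deployment of the advertised model, then
# compare the empirical distributions of the first word of each completion.
from collections import Counter

import numpy as np
from scipy.stats import chi2_contingency


def first_word_counts(completions):
    # Reduce each completion to its first whitespace-delimited token.
    return Counter(
        c.strip().split()[0] if c.strip() else "<empty>" for c in completions
    )


def substitution_test(api_completions, reference_completions, alpha=0.01):
    """Test H0: both completion sets were produced by the same model.

    Returns (p_value, reject). A small p-value indicates the API's output
    distribution differs from the reference model's, which is consistent with
    substitution but can also reflect quantization or decoding differences.
    """
    api_counts = first_word_counts(api_completions)
    ref_counts = first_word_counts(reference_completions)
    vocab = sorted(set(api_counts) | set(ref_counts))
    table = np.array(
        [[api_counts.get(w, 0) for w in vocab],
         [ref_counts.get(w, 0) for w in vocab]]
    )
    _, p_value, _, _ = chi2_contingency(table)
    return p_value, p_value < alpha


# Hypothetical usage: sample_completions(endpoint, prompt, n) stands in for
# whatever client draws n temperature-1 samples of one prompt from an endpoint.
# api_out = sample_completions("https://api.provider.example/v1", PROMPT, 500)
# ref_out = sample_completions("http://localhost:8000/v1", PROMPT, 500)
# p, reject = substitution_test(api_out, ref_out)

Note that, as the abstract cautions, such text-only tests are weak against subtle substitutions (e.g., quantized variants) and adaptive providers; a more discriminative feature than the first word, or log-probability comparisons when available, would strengthen the check.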