
Hardware and Software Platform Inference

November 7, 2024
Authors: Cheng Zhang, Hanna Foerster, Robert D. Mullins, Yiren Zhao, Ilia Shumailov
cs.AI

Abstract

It is now a common business practice to buy access to large language model (LLM) inference rather than self-host, because of the significant upfront hardware infrastructure and energy costs. However, as a buyer, there is no mechanism to verify the authenticity of the advertised service, including the serving hardware platform, e.g., that it is actually being served using an NVIDIA H100. Furthermore, there are reports suggesting that model providers may deliver models that differ slightly from the advertised ones, often to make them run on less expensive hardware. That way, a client pays a premium for access to a capable model on more expensive hardware, yet ends up being served by a (potentially less capable) cheaper model on cheaper hardware. In this paper we introduce hardware and software platform inference (HSPI) -- a method for identifying the underlying hardware architecture and software stack of a (black-box) machine learning model solely based on its input-output behavior. Our method leverages the inherent numerical differences between various hardware architectures and compilers to distinguish between different hardware types and software stacks. By analyzing the numerical patterns in the model's outputs, we propose a classification framework capable of accurately identifying the hardware type used for model inference as well as the underlying software configuration. Our findings demonstrate the feasibility of inferring hardware type from black-box models. We evaluate HSPI against models served on different real hardware and find that in a white-box setting we can distinguish between different hardware types with between 83.9% and 100% accuracy. Even in a black-box setting we are able to achieve results that are up to three times higher than random-guess accuracy.
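The abstract describes classifying the serving platform from numerical patterns in a model's outputs. The sketch below illustrates that idea in miniature under stated assumptions: the "platforms" (`fp16_stack`, `fp32_stack`), the `collect_logits` stand-in, and the histogram fingerprint are hypothetical placeholders for querying a real deployed model on real hardware, not the paper's actual feature set or classifier.

```python
# Minimal sketch of the HSPI idea (not the authors' code): the same model,
# served under different numerics, leaves platform-specific traces in its
# outputs, and a simple classifier over those traces can recover which
# platform produced them. The "platforms" here are simulated by running the
# same logits through different floating-point precisions.

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)


def collect_logits(platform: str, n_queries: int = 200, dim: int = 512) -> np.ndarray:
    """Stand-in for querying a served model: identical 'true' logits,
    rounded the way a (hypothetical) serving stack would round them."""
    base = rng.normal(size=(n_queries, dim)).astype(np.float32)
    if platform == "fp16_stack":  # e.g. a cheaper stack computing in half precision
        return base.astype(np.float16).astype(np.float32)
    return base  # "fp32_stack": full single-precision numerics


def numeric_fingerprint(logits: np.ndarray, n_bins: int = 32) -> np.ndarray:
    """Feature vector capturing low-order numerical behaviour: a histogram
    of what is lost when the observed outputs are re-rounded to float16."""
    residual = logits - logits.astype(np.float16).astype(np.float32)
    hist, _ = np.histogram(residual, bins=n_bins, range=(-1e-3, 1e-3), density=True)
    return hist


# Build a labelled dataset of fingerprints: one per query, per platform.
X, y = [], []
for label, platform in enumerate(["fp16_stack", "fp32_stack"]):
    for logits in collect_logits(platform):
        X.append(numeric_fingerprint(logits))
        y.append(label)

X_train, X_test, y_train, y_test = train_test_split(
    np.array(X), np.array(y), test_size=0.25, random_state=0
)
clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print(f"platform classification accuracy: {clf.score(X_test, y_test):.2f}")
```

In the setting the paper describes, `collect_logits` would be replaced by real queries against the deployed model, and the classifier would be trained on reference outputs gathered from known hardware and software stacks.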
