CipherBank: Exploring the Boundary of LLM Reasoning Capabilities through Cryptography Challenges
April 27, 2025
Authors: Yu Li, Qizhi Pei, Mengyuan Sun, Honglin Lin, Chenlin Ming, Xin Gao, Jiang Wu, Conghui He, Lijun Wu
cs.AI
Abstract
Large language models (LLMs) have demonstrated remarkable capabilities,
especially the recent advancements in reasoning, such as o1 and o3, pushing the
boundaries of AI. Despite these impressive achievements in mathematics and
coding, the reasoning abilities of LLMs in domains requiring cryptographic
expertise remain underexplored. In this paper, we introduce CipherBank, a
comprehensive benchmark designed to evaluate the reasoning capabilities of LLMs
in cryptographic decryption tasks. CipherBank comprises 2,358 meticulously
crafted problems, covering 262 unique plaintexts across 5 domains and 14
subdomains, with a focus on privacy-sensitive and real-world scenarios that
necessitate encryption. From a cryptographic perspective, CipherBank
incorporates 3 major categories of encryption methods, spanning 9 distinct
algorithms, ranging from classical ciphers to custom cryptographic techniques.
We evaluate state-of-the-art LLMs on CipherBank, e.g., GPT-4o, DeepSeek-V3, and
cutting-edge reasoning-focused models such as o1 and DeepSeek-R1. Our results
reveal significant gaps in reasoning abilities not only between general-purpose
chat LLMs and reasoning-focused LLMs but also in the performance of current
reasoning-focused models when applied to classical cryptographic decryption
tasks, highlighting the challenges these models face in understanding and
manipulating encrypted data. Through detailed analysis and error
investigations, we provide several key observations that shed light on the
limitations and potential improvement areas for LLMs in cryptographic
reasoning. These findings underscore the need for continuous advancements in
LLM reasoning capabilities.
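To make the task format concrete, below is a minimal Python sketch of a Caesar (shift) cipher, one representative of the classical ciphers the abstract mentions. The specific algorithms, key choices, and plaintext formats used in CipherBank are not detailed here, so this is an illustration of the kind of decryption problem an LLM would be asked to reason through, not the benchmark's actual implementation.

```python
# Illustrative only: a Caesar (shift) cipher in the same classical family as
# some algorithms referenced by CipherBank. Shift value, plaintext style, and
# task framing are assumptions for this sketch, not benchmark specifics.

def caesar_encrypt(plaintext: str, shift: int = 3) -> str:
    """Shift each alphabetic character forward by `shift` positions."""
    result = []
    for ch in plaintext:
        if ch.isalpha():
            base = ord('A') if ch.isupper() else ord('a')
            result.append(chr((ord(ch) - base + shift) % 26 + base))
        else:
            result.append(ch)  # leave digits, spaces, and punctuation untouched
    return "".join(result)


def caesar_decrypt(ciphertext: str, shift: int = 3) -> str:
    """Invert the shift to recover the plaintext."""
    return caesar_encrypt(ciphertext, -shift)


if __name__ == "__main__":
    # A privacy-sensitive, real-world style plaintext (hypothetical example).
    secret = "Meet at the bank on Friday at 9am."
    cipher = caesar_encrypt(secret, shift=5)
    print(cipher)                     # "Rjjy fy ymj gfsp ts Kwnifd fy 9fr."
    print(caesar_decrypt(cipher, 5))  # recovers the original sentence
```

In a decryption task of this kind, the model sees only the ciphertext (and possibly a hint about the cipher family) and must recover the plaintext, which requires recognizing the transformation and inverting it character by character.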