From Code to Correctness: Closing the Last Mile of Code Generation with Hierarchical Debugging

October 2, 2024
Authors: Yuling Shi, Songsong Wang, Chengcheng Wan, Xiaodong Gu
cs.AI

Abstract

While large language models have made significant strides in code generation, the pass rate of generated code is bottlenecked by subtle errors that often require human intervention before the code passes its tests, especially for complex problems. Existing LLM-based debugging systems treat generated programs as monolithic units and therefore fail to address bugs at multiple levels of granularity, from low-level syntax errors to high-level algorithmic flaws. In this paper, we introduce Multi-Granularity Debugger (MGDebugger), a hierarchical code debugger that isolates, identifies, and resolves bugs at various levels of granularity. MGDebugger decomposes problematic code into a hierarchical tree structure of subfunctions, with each level representing a particular granularity of error. During debugging, it analyzes each subfunction and iteratively resolves bugs in a bottom-up manner. To test each subfunction effectively, we propose an LLM-simulated Python executor, which traces code execution and tracks important variable states to pinpoint errors accurately. Extensive experiments demonstrate that MGDebugger outperforms existing debugging systems, achieving an 18.9% improvement in accuracy over seed generations in HumanEval and a 97.6% repair success rate in HumanEvalFix. Furthermore, MGDebugger fixes bugs across different categories and difficulty levels, demonstrating its robustness and effectiveness.
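The bottom-up pass over the subfunction tree can be illustrated with a minimal Python sketch. This is not the authors' implementation: the `SubFunction` class, the `bottom_up_order` and `debug_bottom_up` helpers, and the `passes_tests` and `repair` callbacks are all hypothetical names introduced here for illustration. In MGDebugger the testing and repair steps are carried out by an LLM, which the sketch replaces with plain callbacks. It only shows the traversal order: every subfunction is visited after all of its children, so lower-level pieces are checked before the code that depends on them.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class SubFunction:
    """Hypothetical tree node: one subfunction and the subfunctions it calls."""
    name: str
    children: List["SubFunction"] = field(default_factory=list)


def bottom_up_order(root: SubFunction) -> List[str]:
    """Return subfunction names in post-order, so children precede parents."""
    order: List[str] = []

    def visit(node: SubFunction) -> None:
        for child in node.children:
            visit(child)
        order.append(node.name)

    visit(root)
    return order


def debug_bottom_up(
    root: SubFunction,
    passes_tests: Callable[[str], bool],  # assumption: stands in for the simulated executor
    repair: Callable[[str], None],        # assumption: stands in for the LLM repair step
) -> List[str]:
    """Visit subfunctions bottom-up and repair each one that fails its tests.

    Returns the names of the subfunctions that were repaired, in visit order.
    """
    repaired: List[str] = []
    for name in bottom_up_order(root):
        if not passes_tests(name):
            repair(name)
            repaired.append(name)
    return repaired


if __name__ == "__main__":
    tree = SubFunction(
        "main",
        [SubFunction("parse_input"), SubFunction("compute", [SubFunction("helper")])],
    )
    print(bottom_up_order(tree))
```

With the example tree, `helper` is visited before `compute`, and `main` is visited last, which mirrors the idea of fixing low-level errors before the higher-level logic that relies on them.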