A Case Study of Web App Coding with OpenAI Reasoning Models
September 19, 2024
Author: Yi Cui
cs.AI
Abstract
This paper presents a case study of coding tasks performed by OpenAI's latest
reasoning models, o1-preview and o1-mini, in comparison with other frontier
models. The o1 models deliver SOTA results on WebApp1K, a single-task
benchmark. To challenge them further, we introduce WebApp1K-Duo, a harder
benchmark that doubles the number of tasks and test cases. On the new
benchmark, the o1 models' performance declines significantly, falling behind
Claude 3.5. Moreover, they consistently fail when confronted with atypical yet
correct test cases, a trap that non-reasoning models occasionally avoid. We
hypothesize that this performance variability stems from insufficient
instruction comprehension. Specifically, the reasoning mechanism boosts
performance when all expectations are captured, but exacerbates errors when
key expectations are missed, an effect potentially influenced by input length.
We therefore argue that the coding success of reasoning models hinges on a
top-notch base model and SFT that together ensure meticulous adherence to
instructions.