A Case Study of Web App Coding with OpenAI Reasoning Models
September 19, 2024
Author: Yi Cui
cs.AI
Abstract
This paper presents a case study of coding tasks performed by OpenAI's latest
reasoning models, o1-preview and o1-mini, in comparison with other frontier
models. The o1 models deliver SOTA results on WebApp1K, a single-task
benchmark. To challenge them further, we introduce WebApp1K-Duo, a harder
benchmark that doubles the number of tasks and test cases. On the new
benchmark, the o1 models' performance declines significantly, falling behind
Claude 3.5. Moreover, they consistently fail when confronted with atypical yet
correct test cases, a trap that non-reasoning models occasionally avoid. We
hypothesize that this performance variability stems from insufficient
instruction comprehension. Specifically, the reasoning mechanism boosts
performance when all expectations are captured, but exacerbates errors when
key expectations are missed, an effect potentially influenced by input length.
We therefore argue that the coding success of reasoning models hinges on a
top-notch base model and SFT that together ensure meticulous adherence to
instructions.