A Case Study of Web App Coding with OpenAI Reasoning Models

Abstract

This paper presents a case study of how OpenAI's latest reasoning models, o1-preview and o1-mini, perform on coding tasks in comparison with other frontier models. The o1 models deliver SOTA results on WebApp1K, a single-task benchmark. To raise the difficulty, we introduce WebApp1K-Duo, a harder benchmark that doubles the number of tasks and test cases. On the new benchmark, the performance of the o1 models declines significantly and falls behind Claude 3.5. Moreover, the o1 models consistently fail when confronted with atypical yet correct test cases, a trap that non-reasoning models occasionally avoid. We hypothesize that this performance variability stems from instruction comprehension. Specifically, the reasoning mechanism boosts performance when all expectations are captured but exacerbates errors when key expectations are missed, an effect that may be influenced by input length. As such, we argue that the coding success of reasoning models hinges on a top-notch base model and on SFT (supervised fine-tuning) that ensures meticulous adherence to instructions.


