HackerRank-ASTRA: Evaluating Correctness & Consistency of Large Language Models on cross-domain multi-file project problems
January 31, 2025
Authors: Jun Xing, Mayur Bhatia, Sahil Phulwani, Darshan Suresh, Rafik Matta
cs.AI
Abstract
Evaluating the real-world applicability of large language models (LLMs)
provides valuable insights for their development and use in software
development tasks. Existing benchmarks often focus on standalone coding
problems or specific libraries, overlooking multi-file, project-based scenarios
and lacking a rigorous evaluation of consistency. The HackerRank-ASTRA
Benchmark introduces project-based coding problems that mirror real-world
scenarios. It evaluates model consistency through 32 runs (k = 32) and median
standard deviation while incorporating taxonomy-level analysis to assess
sub-skill capabilities. Initial evaluations on 65 problems show that the top
three models -- o1, o1-preview, and Claude-3.5-Sonnet-1022 -- achieved
comparable average scores of 75%, with no statistically significant differences
in performance. Notably, Claude-3.5-Sonnet-1022 demonstrated the highest
consistency across problems, with low variability (SD = 0.0497), which was
statistically significant compared to other models, highlighting its
reliability for real-world software development tasks.
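
As a rough illustration of the consistency metric described in the abstract, the
sketch below computes a per-problem standard deviation over k = 32 runs and then
takes the median across problems. This is an assumption-laden reconstruction,
not code from the paper: the function name, the data layout, and the use of the
sample standard deviation are all illustrative choices.

    import statistics

    def median_std_dev(scores: dict[str, list[float]], k: int = 32) -> float:
        """Median across problems of the per-problem standard deviation.

        `scores` maps a problem ID to its k per-run scores (assumed to lie
        in [0, 1]); lower output indicates more consistent model behavior.
        """
        per_problem_sd = []
        for problem_id, runs in scores.items():
            if len(runs) != k:
                raise ValueError(f"expected {k} runs for {problem_id}")
            # Sample standard deviation of this problem's run-to-run scores.
            per_problem_sd.append(statistics.stdev(runs))
        # Median is robust to a few unusually noisy problems.
        return statistics.median(per_problem_sd)

Under this reading, a value like SD = 0.0497 for Claude-3.5-Sonnet-1022 would
mean that for the typical problem, its score varies by only about five
percentage points across repeated runs.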