HackerRank-ASTRA: Evaluating Correctness & Consistency of Large Language Models on cross-domain multi-file project problems
January 31, 2025
Authors: Jun Xing, Mayur Bhatia, Sahil Phulwani, Darshan Suresh, Rafik Matta
cs.AI
Abstract
Evaluating the real-world applicability of large language models (LLMs)
provides valuable insights for their development and use in software
development tasks. Existing benchmarks often focus on standalone coding
problems or specific libraries, overlooking multi-file, project-based scenarios
and lacking a rigorous evaluation of consistency. The HackerRank-ASTRA
Benchmark introduces project-based coding problems that mirror real-world
scenarios. It evaluates model consistency through 32 runs (k = 32) and median
standard deviation while incorporating taxonomy-level analysis to assess
sub-skill capabilities. Initial evaluations on 65 problems show that the top
three models -- o1, o1-preview, and Claude-3.5-Sonnet-1022 -- achieved
comparable average scores of 75%, with no statistically significant differences
in performance. Notably, Claude-3.5-Sonnet-1022 demonstrated the highest
consistency across problems, with low variability (SD = 0.0497), which was
statistically significant compared to other models, highlighting its
reliability for real-world software development tasks.
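
As a rough illustration of the consistency metric described in the abstract, the
sketch below computes a per-problem standard deviation over k = 32 runs and then
takes the median across problems. This is an assumption-laden reconstruction,
not code from the paper: the function name, the data layout, and the use of the
sample standard deviation are all illustrative choices.

    import statistics

    def median_std_dev(scores: dict[str, list[float]], k: int = 32) -> float:
        """Median across problems of the per-problem standard deviation.

        `scores` maps a problem ID to its k per-run scores (assumed to lie
        in [0, 1]); lower output indicates more consistent model behavior.
        """
        per_problem_sd = []
        for problem_id, runs in scores.items():
            if len(runs) != k:
                raise ValueError(f"expected {k} runs for {problem_id}")
            # Sample standard deviation of this problem's run-to-run scores.
            per_problem_sd.append(statistics.stdev(runs))
        # Median is robust to a few unusually noisy problems.
        return statistics.median(per_problem_sd)

Under this reading, a value like SD = 0.0497 for Claude-3.5-Sonnet-1022 would
mean that for the typical problem, its score varies by only about five
percentage points across repeated runs.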