Large Language Models Orchestrating Structured Reasoning Achieve Kaggle Grandmaster Level
November 5, 2024
Authors: Antoine Grosnit, Alexandre Maraval, James Doran, Giuseppe Paolo, Albert Thomas, Refinath Shahul Hameed Nabeezath Beevi, Jonas Gonzalez, Khyati Khandelwal, Ignacio Iacobacci, Abdelhakim Benechehab, Hamza Cherkaoui, Youssef Attia El-Hili, Kun Shao, Jianye Hao, Jun Yao, Balazs Kegl, Haitham Bou-Ammar, Jun Wang
cs.AI
Abstract
We introduce Agent K v1.0, an end-to-end autonomous data science agent
designed to automate, optimise, and generalise across diverse data science
tasks. Fully automated, Agent K v1.0 manages the entire data science life cycle
by learning from experience. It leverages a highly flexible structured
reasoning framework that lets it dynamically process memory in a nested
structure, effectively learning from stored, accumulated experience to handle
complex reasoning tasks. It optimises long- and short-term memory by
selectively storing and retrieving key information, guiding future decisions
based on environmental rewards. This iterative approach allows it to refine
decisions without fine-tuning or backpropagation, achieving continuous
improvement through experiential learning. We evaluate our agent's capabilities
using Kaggle competitions as a case study. Following a fully automated
protocol, Agent K v1.0 systematically addresses complex and multimodal data
science tasks, employing Bayesian optimisation for hyperparameter tuning and
feature engineering. Our new evaluation framework rigorously assesses Agent K
v1.0's end-to-end capabilities to generate and send submissions starting from a
Kaggle competition URL. Results demonstrate that Agent K v1.0 achieves a 92.5%
success rate across tasks, spanning tabular, computer vision, NLP, and
multimodal domains. When benchmarking against 5,856 human Kaggle competitors by
calculating Elo-MMR scores for each, Agent K v1.0 ranks in the top 38%,
demonstrating an overall skill level comparable to Expert-level users. Notably,
its Elo-MMR score falls between the first and third quartiles of scores
achieved by human Grandmasters. Furthermore, our results indicate that Agent K
v1.0 has reached a performance level equivalent to Kaggle Grandmaster, with a
record of 6 gold, 3 silver, and 7 bronze medals, as defined by Kaggle's
progression system.