Large Language Models Orchestrating Structured Reasoning Achieve Kaggle Grandmaster Level

Abstract

We introduce Agent K v1.0, an end-to-end autonomous data science agent designed to automate, optimise, and generalise across diverse data science tasks. Fully automated, Agent K v1.0 manages the entire data science life cycle by learning from experience. It leverages a highly flexible structured reasoning framework that lets it dynamically process memory in a nested structure, learning from accumulated experience to handle complex reasoning tasks. It optimises long- and short-term memory by selectively storing and retrieving key information, guiding future decisions based on environmental rewards. This iterative approach allows it to refine decisions without fine-tuning or backpropagation, achieving continuous improvement through experiential learning. We evaluate our agent's capabilities using Kaggle competitions as a case study. Following a fully automated protocol, Agent K v1.0 systematically addresses complex and multimodal data science tasks, employing Bayesian optimisation for hyperparameter tuning and feature engineering. Our new evaluation framework rigorously assesses Agent K v1.0's end-to-end capabilities to generate and send submissions starting from a Kaggle competition URL. Results demonstrate that Agent K v1.0 achieves a 92.5% success rate across tasks, spanning tabular, computer vision, NLP, and multimodal domains. When benchmarked against 5,856 human Kaggle competitors by calculating Elo-MMR scores for each, Agent K v1.0 ranks in the top 38%, demonstrating an overall skill level comparable to Expert-level users. Notably, its Elo-MMR score falls between the first and third quartiles of scores achieved by human Grandmasters. Furthermore, our results indicate that Agent K v1.0 has reached a performance level equivalent to a Kaggle Grandmaster, with a record of 6 gold, 3 silver, and 7 bronze medals, as defined by Kaggle's progression system.
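The idea of selectively storing and retrieving experience based on environmental rewards can be illustrated with a minimal Python sketch. This is not Agent K v1.0's implementation, which the abstract does not specify. The class names `Experience` and `RewardMemory`, the `min_reward` threshold, the task labels, and all reward values are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Experience:
    """One stored step: what was tried and the reward the environment returned."""
    task: str
    action: str
    reward: float


@dataclass
class RewardMemory:
    """Hypothetical long-term store that keeps only high-reward experiences."""
    min_reward: float = 0.5  # assumed threshold for what is worth keeping
    items: list[Experience] = field(default_factory=list)

    def store(self, exp: Experience) -> bool:
        """Selectively store: keep the experience only if its reward clears the threshold."""
        if exp.reward >= self.min_reward:
            self.items.append(exp)
            return True
        return False

    def retrieve(self, task: str, k: int = 2) -> list[Experience]:
        """Return up to k stored experiences for this task, highest reward first."""
        matches = [e for e in self.items if e.task == task]
        return sorted(matches, key=lambda e: e.reward, reverse=True)[:k]


# Illustrative data only; these rewards are made up.
memory = RewardMemory()
memory.store(Experience("tabular", "gradient boosting, default settings", 0.62))
memory.store(Experience("tabular", "linear model, no feature engineering", 0.31))  # below threshold, dropped
memory.store(Experience("tabular", "gradient boosting, tuned settings", 0.81))
memory.store(Experience("vision", "pretrained CNN, frozen backbone", 0.70))

best = memory.retrieve("tabular")
print([e.action for e in best])
```

In this sketch no model weights change: later decisions improve only because retrieval surfaces the earlier actions that earned higher rewards, which is one simple way to read "refining decisions without fine-tuning or backpropagation".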
