From Tools to Teammates: Evaluating LLMs in Multi-Session Coding Interactions
February 19, 2025
Authors: Nathanaël Carraz Rakotonirina, Mohammed Hamdy, Jon Ander Campos, Lucas Weber, Alberto Testoni, Marzieh Fadaee, Sandro Pezzelle, Marco Del Tredici
cs.AI
Abstract
Large Language Models (LLMs) are increasingly used in working environments
for a wide range of tasks, excelling at solving individual problems in
isolation. However, are they also able to effectively collaborate over
long-term interactions? To investigate this, we introduce MemoryCode, a
synthetic multi-session dataset designed to test LLMs' ability to track and
execute simple coding instructions amid irrelevant information, simulating a
realistic setting. While all the models we tested handle isolated instructions
well, even the performance of state-of-the-art models like GPT-4o deteriorates
when instructions are spread across sessions. Our analysis suggests this is due
to their failure to retrieve and integrate information over long instruction
chains. Our results highlight a fundamental limitation of current LLMs,
restricting their ability to collaborate effectively in long interactions.
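The evaluation setting described in the abstract can be illustrated with a toy sketch: instructions arrive across sessions interleaved with irrelevant filler, the full history is concatenated into one context, and a generated snippet is checked against the cumulative instruction chain. This is a hypothetical mock-up for illustration only; the session contents, rule names, and checker below are invented and are not the actual MemoryCode data or evaluation code.

```python
import re

# Hypothetical multi-session dialogue: each session mixes one coding
# instruction with unrelated filler, loosely mirroring the setup the
# abstract describes. All contents here are invented examples.
sessions = [
    {"filler": "The team moved to the new office.",
     "instruction": "start every function name with 'x_'"},
    {"filler": "Lunch is at noon on Fridays.",
     "instruction": None},  # many sessions carry no new instruction
    {"filler": "The printer on floor 2 was fixed.",
     "instruction": "always add a docstring to functions"},
]

def build_prompt(sessions):
    """Concatenate all sessions into one long context, as the model sees it."""
    parts = []
    for i, s in enumerate(sessions, 1):
        parts.append(f"Session {i}: {s['filler']}")
        if s["instruction"]:
            parts.append(f"Session {i} (mentor): From now on, {s['instruction']}.")
    return "\n".join(parts)

def follows_instructions(code):
    """Check a generated snippet against the cumulative instruction chain."""
    names_ok = all(n.startswith("x_") for n in re.findall(r"def (\w+)", code))
    doc_ok = '"""' in code or "'''" in code
    return names_ok and doc_ok

# A snippet that respects both instructions vs. one that ignores them.
good = 'def x_add(a, b):\n    """Add two numbers."""\n    return a + b'
bad = 'def add(a, b):\n    return a + b'
```

In this toy framing, the finding reported above corresponds to models producing the `bad` variant more often as the instruction chain is spread over more sessions, even though each instruction is trivial in isolation.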