From Tools to Teammates: Evaluating LLMs in Multi-Session Coding Interactions

Abstract

Large Language Models (LLMs) are increasingly used in working environments for a wide range of tasks, and they excel at solving individual problems in isolation. However, can they also collaborate effectively over long-term interactions? To investigate this, we introduce MemoryCode, a synthetic multi-session dataset designed to test LLMs' ability to track and execute simple coding instructions amid irrelevant information, simulating a realistic setting. While all the models we tested handle isolated instructions well, even the performance of state-of-the-art models like GPT-4o deteriorates when instructions are spread across sessions. Our analysis suggests this is due to their failure to retrieve and integrate information over long instruction chains. Our results highlight a fundamental limitation of current LLMs, one that restricts their ability to collaborate effectively in long interactions.
