Can LLMs Generate Novel Research Ideas? A Large-Scale Human Study with 100+ NLP Researchers

Abstract

Recent advancements in large language models (LLMs) have sparked optimism about their potential to accelerate scientific discovery, with a growing number of works proposing research agents that autonomously generate and validate new ideas. Despite this, no evaluations have shown that LLM systems can take the very first step of producing novel, expert-level ideas, let alone perform the entire research process. We address this by establishing an experimental design that evaluates research idea generation while controlling for confounders and performs the first head-to-head comparison between expert NLP researchers and an LLM ideation agent. By recruiting over 100 NLP researchers to write novel ideas and blind reviews of both LLM and human ideas, we obtain the first statistically significant conclusion on current LLM capabilities for research ideation: we find LLM-generated ideas are judged as more novel (p < 0.05) than human expert ideas while being judged slightly weaker on feasibility. Studying our agent baselines closely, we identify open problems in building and evaluating research agents, including failures of LLM self-evaluation and their lack of diversity in generation. Finally, we acknowledge that human judgements of novelty can be difficult, even by experts, and propose an end-to-end study design which recruits researchers to execute these ideas into full projects, enabling us to study whether these novelty and feasibility judgements result in meaningful differences in research outcome.
