Multi-Agent Collaborative Data Selection for Efficient LLM Pretraining

Abstract

Efficient data selection is crucial for accelerating the pretraining of large language models (LLMs). While various methods have been proposed to enhance data efficiency, little research has addressed the inherent conflicts between these approaches in order to achieve optimal data selection for LLM pretraining. To tackle this problem, we propose a novel multi-agent collaborative data selection mechanism. In this framework, each data selection method serves as an independent agent, and an agent console dynamically integrates the information from all agents throughout the LLM training process. We conduct extensive empirical studies to evaluate our multi-agent framework. The experimental results demonstrate that our approach significantly improves data efficiency, accelerates convergence in LLM training, and achieves an average performance gain of 10.5% across multiple language model benchmarks compared to state-of-the-art methods.
