Large Language Models Can Self-Improve in Long-context Reasoning

Abstract

Large language models (LLMs) have made substantial progress in processing long contexts but still struggle with long-context reasoning. Existing approaches typically fine-tune LLMs on synthetic data that depends on annotations from human experts or from advanced models such as GPT-4, which restricts further advancement. To address this issue, we investigate the potential for LLMs to self-improve in long-context reasoning and propose \ours, an approach specifically designed for this purpose. The approach is straightforward: we sample multiple outputs for each question, score them with Minimum Bayes Risk (MBR), and then apply supervised fine-tuning or preference optimization based on these outputs. Extensive experiments on several leading LLMs demonstrate the effectiveness of \ours, with an absolute improvement of 4.2 points for Llama-3.1-8B-Instruct. Furthermore, \ours achieves superior performance compared to prior approaches that depend on data produced by human experts or advanced models. We anticipate that this work will open new avenues for self-improvement techniques in long-context scenarios, which are essential for the continual advancement of LLMs.
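The sample-then-score step described in the abstract can be illustrated with a minimal sketch. It assumes that Minimum Bayes Risk scoring is approximated by each output's average similarity to the other sampled outputs, and it uses a token-level Jaccard overlap as a stand-in similarity; the abstract does not specify the similarity measure, so `jaccard`, `mbr_scores`, and `build_training_example` are hypothetical names introduced only for illustration.

```python
from typing import Callable, Dict, List


def jaccard(a: str, b: str) -> float:
    """Token-level Jaccard overlap (an assumed stand-in similarity)."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    if not ta and not tb:
        return 1.0
    return len(ta & tb) / len(ta | tb)


def mbr_scores(
    outputs: List[str], sim: Callable[[str, str], float] = jaccard
) -> List[float]:
    """Score each output by its mean similarity to all other outputs.

    A high score means the output agrees with the consensus of the
    samples, which is the intuition behind Minimum Bayes Risk selection.
    """
    n = len(outputs)
    if n < 2:
        raise ValueError("MBR scoring needs at least two sampled outputs")
    scores = []
    for i, candidate in enumerate(outputs):
        total = sum(
            sim(candidate, other) for j, other in enumerate(outputs) if j != i
        )
        scores.append(total / (n - 1))
    return scores


def build_training_example(
    question: str,
    outputs: List[str],
    sim: Callable[[str, str], float] = jaccard,
) -> Dict[str, object]:
    """Pick the highest-scoring output as the fine-tuning target and pair
    it with the lowest-scoring output for preference optimization."""
    scores = mbr_scores(outputs, sim)
    best = max(range(len(outputs)), key=scores.__getitem__)
    worst = min(range(len(outputs)), key=scores.__getitem__)
    return {
        "prompt": question,
        "chosen": outputs[best],
        "rejected": outputs[worst],
        "scores": scores,
    }


# Hypothetical sampled outputs for one question.
samples = [
    "the answer is Paris",
    "the answer is Paris",
    "I think it is Rome",
]
example = build_training_example("What is the capital of France?", samples)
```

In this sketch the `chosen` field alone could serve as a supervised fine-tuning target, while the `chosen`/`rejected` pair could feed a preference-optimization objective. A real pipeline would likely swap in a stronger similarity measure, for example one based on sentence embeddings.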
