WebWalker: Benchmarking LLMs in Web Traversal

Abstract

Retrieval-augmented generation (RAG) demonstrates remarkable performance across open-domain question-answering tasks. However, traditional search engines may retrieve only shallow content, limiting the ability of LLMs to handle complex, multi-layered information. To address this, we introduce WebWalkerQA, a benchmark designed to assess the ability of LLMs to perform web traversal. It evaluates the capacity of LLMs to systematically traverse a website's subpages and extract high-quality data. We also propose WebWalker, a multi-agent framework that mimics human-like web navigation through an explore-critic paradigm. Extensive experimental results show that WebWalkerQA is challenging and demonstrate the effectiveness of combining RAG with WebWalker, through horizontal and vertical integration, in real-world scenarios.
