WebWalker: Benchmarking LLMs in Web Traversal

Abstract

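Retrieval-augmented generation (RAG) shows remarkable performance across open-domain question-answering tasks. However, traditional search engines may retrieve shallow content, which can limit the ability of LLMs to handle complex, multi-layered information. To address this challenge, we introduce WebWalkerQA, a benchmark designed to evaluate the ability of LLMs to perform web traversal. WebWalker is a multi-agent framework that mimics human-like web navigation through an explore-critic paradigm. Extensive experimental results show that WebWalkerQA is challenging and demonstrate the effectiveness of combining RAG with WebWalker. This demonstrates horizontal and vertical integration in real-world scenarios.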

English

Retrieval-augmented generation (RAG) demonstrates remarkable performance across open-domain question-answering tasks. However, traditional search engines may retrieve shallow content, limiting the ability of LLMs to handle complex, multi-layered information. To address this, we introduce WebWalkerQA, a benchmark designed to assess the ability of LLMs to perform web traversal. It evaluates the capacity of LLMs to systematically traverse a website's subpages and extract high-quality data. We propose WebWalker, a multi-agent framework that mimics human-like web navigation through an explore-critic paradigm. Extensive experimental results show that WebWalkerQA is challenging and demonstrate the effectiveness of RAG combined with WebWalker, through horizontal and vertical integration in real-world scenarios.
