WebWalker: Benchmarking LLMs in Web Traversal

Abstract

Retrieval-augmented generation (RAG) demonstrates remarkable performance across open-domain question-answering tasks. However, traditional search engines may retrieve only shallow content, limiting the ability of LLMs to handle complex, multi-layered information. To address this, we introduce WebWalkerQA, a benchmark designed to assess the ability of LLMs to perform web traversal. It evaluates the capacity of LLMs to systematically traverse a website's subpages and extract high-quality data. We also propose WebWalker, a multi-agent framework that mimics human-like web navigation through an explore-critic paradigm. Extensive experimental results show that WebWalkerQA is challenging and demonstrate the effectiveness of combining RAG with WebWalker through horizontal and vertical integration in real-world scenarios.
