WebWalker: Benchmarking LLMs in Web Traversal

January 13, 2025
Authors: Jialong Wu, Wenbiao Yin, Yong Jiang, Zhenglin Wang, Zekun Xi, Runnan Fang, Deyu Zhou, Pengjun Xie, Fei Huang
cs.AI

Abstract

Retrieval-augmented generation (RAG) demonstrates remarkable performance across open-domain question-answering tasks. However, traditional search engines may retrieve only shallow content, limiting the ability of LLMs to handle complex, multi-layered information. To address this, we introduce WebWalkerQA, a benchmark designed to assess the ability of LLMs to perform web traversal. It evaluates the capacity of LLMs to systematically traverse a website's subpages and extract high-quality data. We also propose WebWalker, a multi-agent framework that mimics human-like web navigation through an explore-critic paradigm. Extensive experimental results show that WebWalkerQA is challenging and demonstrate the effectiveness of combining RAG with WebWalker through horizontal and vertical integration in real-world scenarios.
