Text2World: Benchmarking Large Language Models for Symbolic World Model Generation
February 18, 2025
Authors: Mengkang Hu, Tianxing Chen, Yude Zou, Yuheng Lei, Qiguang Chen, Ming Li, Hongyuan Zhang, Wenqi Shao, Ping Luo
cs.AI
Abstract
Recently, there has been growing interest in leveraging large language models
(LLMs) to generate symbolic world models from textual descriptions. Although
LLMs have been extensively explored in the context of world modeling, prior
studies encountered several challenges, including evaluation randomness,
dependence on indirect metrics, and a limited domain scope. To address these
limitations, we introduce a novel benchmark, Text2World, based on planning
domain definition language (PDDL), featuring hundreds of diverse domains and
employing multi-criteria, execution-based metrics for a more robust evaluation.
We benchmark current LLMs using Text2World and find that reasoning models
trained with large-scale reinforcement learning outperform others. However,
even the best-performing model still demonstrates limited capabilities in world
modeling. Building on these insights, we examine several promising strategies
to enhance the world modeling capabilities of LLMs, including test-time
scaling, agent training, and more. We hope that Text2World can serve as a
crucial resource, laying the groundwork for future research in leveraging LLMs
as world models. The project page is available at
https://text-to-world.github.io/.
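To make "multi-criteria, execution-based" evaluation concrete, the sketch below shows a toy check on an LLM-generated PDDL domain string. This is not Text2World's actual pipeline (the paper's metrics are not reproduced here); `check_pddl_domain` and `REQUIRED_SECTIONS` are hypothetical names, and a real evaluator would invoke a PDDL parser or planner rather than simple string checks.

```python
# Illustrative sketch only: a toy multi-criteria check on a generated PDDL
# domain. Real execution-based evaluation (as in Text2World) would run the
# domain through a PDDL parser/planner; here we only verify balanced
# parentheses and the presence of a few required PDDL sections.

REQUIRED_SECTIONS = (":requirements", ":predicates", ":action")

def check_pddl_domain(domain_text: str) -> dict:
    """Return a per-criterion pass/fail report for a PDDL domain string."""
    balance = 0
    balanced = True
    for ch in domain_text:
        if ch == "(":
            balance += 1
        elif ch == ")":
            balance -= 1
            if balance < 0:  # closing paren with no matching open
                balanced = False
                break
    balanced = balanced and balance == 0

    report = {"balanced_parens": balanced}
    for section in REQUIRED_SECTIONS:
        report[section] = section in domain_text
    return report

# A minimal gripper-style domain used as the candidate "model output".
EXAMPLE_DOMAIN = """
(define (domain gripper)
  (:requirements :strips)
  (:predicates (at-robby ?r) (free ?g))
  (:action move
    :parameters (?from ?to)
    :precondition (at-robby ?from)
    :effect (and (at-robby ?to) (not (at-robby ?from)))))
"""

print(check_pddl_domain(EXAMPLE_DOMAIN))
```

A per-criterion report like this (rather than a single pass/fail bit) is what makes the evaluation "multi-criteria": a model can be credited for producing syntactically balanced output even when a required section is missing.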