Cuckoo: An IE Free Rider Hatched by Massive Nutrition in LLM's Nest
February 16, 2025
Authors: Letian Peng, Zilong Wang, Feng Yao, Jingbo Shang
cs.AI
Abstract
Massive high-quality data, both pre-training raw texts and post-training annotations, have been carefully prepared to incubate advanced large language models (LLMs). In contrast, for information extraction (IE), pre-training data, such as BIO-tagged sequences, are hard to scale up. We show that IE models can act as free riders on LLM resources by reframing next-token prediction into extraction of tokens already present in the context. Specifically, our proposed next tokens extraction (NTE) paradigm learns a versatile IE model, Cuckoo, with 102.6M extraction instances converted from LLMs' pre-training and post-training data. Under the few-shot setting, Cuckoo adapts effectively to both traditional and complex instruction-following IE, outperforming existing pre-trained IE models. As a free rider, Cuckoo can naturally evolve with ongoing advancements in LLM data preparation, benefiting from improvements in LLM training pipelines without additional manual effort.
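The core idea of the abstract, turning a next-token prediction step into an extraction step whenever the next token already occurs in the context, can be illustrated with a minimal sketch. This is an assumption-laden illustration, not the authors' released conversion pipeline: the function name `nte_instance` and the single-token BIO labeling scheme are hypothetical choices made here for clarity.

```python
# Minimal sketch of the next-tokens-extraction (NTE) idea: instead of
# predicting the next token over the vocabulary, tag its occurrences
# inside the visible context with BIO labels. Names are illustrative.

def nte_instance(tokens, position):
    """Build one BIO-tagged example from a raw token sequence.

    tokens[:position] is the visible context; tokens[position] is the
    next token, to be 'extracted' from the context rather than generated.
    Returns None when the next token never appears in the context, since
    that step cannot be reframed as extraction.
    """
    context = tokens[:position]
    target = tokens[position]
    if target not in context:
        return None  # not convertible: next token is absent from context
    labels = ["O"] * len(context)
    for i, tok in enumerate(context):
        if tok == target:
            labels[i] = "B"  # each match starts a one-token extracted span
    return context, labels

# Example: the next token "Paris" already occurs in the context, so this
# prediction step becomes a supervised extraction step.
tokens = ["Paris", "is", "the", "capital", "of", "France",
          ";", "I", "love", "Paris"]
print(nte_instance(tokens, 9))
# (['Paris', 'is', 'the', 'capital', 'of', 'France', ';', 'I', 'love'],
#  ['B', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O'])
```

Applied over LLM pre-training and post-training corpora, a conversion of this kind is what lets extraction supervision scale with the same data pipelines that feed LLMs, with no additional manual annotation.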