LongWriter-V: Enabling Ultra-Long and High-Fidelity Generation in Vision-Language Models
February 20, 2025
Authors: Shangqing Tu, Yucheng Wang, Daniel Zhang-Li, Yushi Bai, Jifan Yu, Yuhao Wu, Lei Hou, Huiqin Liu, Zhiyuan Liu, Bin Xu, Juanzi Li
cs.AI
Abstract
Existing Large Vision-Language Models (LVLMs) can process inputs with context
lengths up to 128k visual and text tokens, yet they struggle to generate
coherent outputs beyond 1,000 words. We find that the primary limitation is the
absence of long output examples during supervised fine-tuning (SFT). To tackle
this issue, we introduce LongWriter-V-22k, an SFT dataset comprising 22,158
examples, each with multiple input images, an instruction, and corresponding
outputs ranging from 0 to 10,000 words. Moreover, to achieve long outputs that
maintain high fidelity to the input images, we apply Direct Preference
Optimization (DPO) to the SFT model. Given the high cost of collecting human
feedback for lengthy outputs (e.g., 3,000 words), we propose IterDPO, which
breaks long outputs into segments and uses iterative corrections to form
preference pairs with the original outputs. Additionally, we develop
MMLongBench-Write, a benchmark featuring six tasks to evaluate the
long-form generation capabilities of VLMs. Our 7B-parameter model, trained with
LongWriter-V-22k and IterDPO, achieves impressive performance on this
benchmark, outperforming larger proprietary models like GPT-4o. Code and data:
https://github.com/THU-KEG/LongWriter-V
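The core idea behind IterDPO, as described in the abstract, is to avoid collecting preference feedback on an entire 3,000-word response by splitting it into segments and pairing each iteratively corrected segment against the original. A minimal sketch of that pair-construction step is shown below; the function and field names (`split_into_segments`, `build_preference_pairs`, `correct_fn`, the `prompt`/`chosen`/`rejected` keys) are illustrative assumptions, not taken from the released code.

```python
# Hypothetical sketch of IterDPO-style preference-pair construction.
# All identifiers here are illustrative, not from the LongWriter-V repository.

def split_into_segments(text, seg_len=500):
    """Split a long output into word-based segments of roughly seg_len words."""
    words = text.split()
    return [" ".join(words[i:i + seg_len]) for i in range(0, len(words), seg_len)]

def build_preference_pairs(long_output, correct_fn, seg_len=500):
    """Form (chosen, rejected) pairs: corrected segment vs. original segment.

    correct_fn stands in for the feedback source that revises one segment
    given the text accepted so far -- the "iterative corrections" of IterDPO.
    Pairing short segments keeps each preference judgment cheap compared to
    judging the full multi-thousand-word output at once.
    """
    pairs = []
    context = ""
    for seg in split_into_segments(long_output, seg_len):
        revised = correct_fn(context, seg)  # iterative correction of this segment
        if revised != seg:  # only changed segments yield a preference pair
            pairs.append({"prompt": context, "chosen": revised, "rejected": seg})
        context += revised + " "  # carry the corrected text forward
    return pairs
```

The `prompt`/`chosen`/`rejected` dictionaries match the usual input format for DPO training loops, so the resulting pairs could be fed directly into a standard DPO objective over segments rather than whole responses.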