LongWriter-V: Enabling Ultra-Long and High-Fidelity Generation in Vision-Language Models
February 20, 2025
Authors: Shangqing Tu, Yucheng Wang, Daniel Zhang-Li, Yushi Bai, Jifan Yu, Yuhao Wu, Lei Hou, Huiqin Liu, Zhiyuan Liu, Bin Xu, Juanzi Li
cs.AI
Abstract
Existing Large Vision-Language Models (LVLMs) can process inputs with context
lengths up to 128k visual and text tokens, yet they struggle to generate
coherent outputs beyond 1,000 words. We find that the primary limitation is the
absence of long output examples during supervised fine-tuning (SFT). To tackle
this issue, we introduce LongWriter-V-22k, an SFT dataset comprising 22,158
examples, each with multiple input images, an instruction, and corresponding
outputs ranging from 0 to 10,000 words. Moreover, to achieve long outputs that
maintain high fidelity to the input images, we apply Direct Preference
Optimization (DPO) to the SFT model. Given the high cost of collecting human
feedback for lengthy outputs (e.g., 3,000 words), we propose IterDPO, which
breaks long outputs into segments and uses iterative corrections to form
preference pairs with the original outputs. Additionally, we develop
MMLongBench-Write, a benchmark featuring six tasks to evaluate the
long-form generation capabilities of VLMs. Our 7B-parameter model, trained with
LongWriter-V-22k and IterDPO, achieves impressive performance on this
benchmark, outperforming larger proprietary models like GPT-4o. Code and data:
https://github.com/THU-KEG/LongWriter-V
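The core idea behind IterDPO, as described in the abstract, is to avoid collecting preference feedback on an entire 3,000-word response by splitting it into segments and pairing each iteratively corrected segment against the original. A minimal sketch of that pair-construction step is shown below; the function and field names (`split_into_segments`, `build_preference_pairs`, `correct_fn`, the `prompt`/`chosen`/`rejected` keys) are illustrative assumptions, not taken from the released code.

```python
# Hypothetical sketch of IterDPO-style preference-pair construction.
# All identifiers here are illustrative, not from the LongWriter-V repository.

def split_into_segments(text, seg_len=500):
    """Split a long output into word-based segments of roughly seg_len words."""
    words = text.split()
    return [" ".join(words[i:i + seg_len]) for i in range(0, len(words), seg_len)]

def build_preference_pairs(long_output, correct_fn, seg_len=500):
    """Form (chosen, rejected) pairs: corrected segment vs. original segment.

    correct_fn stands in for the feedback source that revises one segment
    given the text accepted so far -- the "iterative corrections" of IterDPO.
    Pairing short segments keeps each preference judgment cheap compared to
    judging the full multi-thousand-word output at once.
    """
    pairs = []
    context = ""
    for seg in split_into_segments(long_output, seg_len):
        revised = correct_fn(context, seg)  # iterative correction of this segment
        if revised != seg:  # only changed segments yield a preference pair
            pairs.append({"prompt": context, "chosen": revised, "rejected": seg})
        context += revised + " "  # carry the corrected text forward
    return pairs
```

The `prompt`/`chosen`/`rejected` dictionaries match the usual input format for DPO training loops, so the resulting pairs could be fed directly into a standard DPO objective over segments rather than whole responses.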