LongWriter-V: Enabling Ultra-Long and High-Fidelity Generation in Vision-Language Models

Abstract

Existing Large Vision-Language Models (LVLMs) can process inputs with context lengths of up to 128k visual and text tokens, yet they struggle to generate coherent outputs beyond 1,000 words. We find that the primary limitation is the absence of long output examples during supervised fine-tuning (SFT). To tackle this issue, we introduce LongWriter-V-22k, an SFT dataset comprising 22,158 examples, each with multiple input images, an instruction, and a corresponding output ranging from 0 to 10,000 words. Moreover, to achieve long outputs that maintain high fidelity to the input images, we apply Direct Preference Optimization (DPO) to the SFT model. Given the high cost of collecting human feedback for lengthy outputs (e.g., 3,000 words), we propose IterDPO, which breaks long outputs into segments and uses iterative corrections to form preference pairs with the original outputs. Additionally, we develop MMLongBench-Write, a benchmark featuring six tasks to evaluate the long-generation capabilities of VLMs. Our 7B parameter model, trained with LongWriter-V-22k and IterDPO, achieves impressive performance on this benchmark, outperforming larger proprietary models like GPT-4o. Code and data: https://github.com/THU-KEG/LongWriter-V
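
The segment-wise pairing behind IterDPO can be illustrated with a minimal sketch. This is one plausible reading of the abstract, not the authors' implementation: the fixed word-count segmentation, the `correct` callback (standing in for a human or model reviser), the `PreferencePair` container, and the choice to carry corrected segments forward as context are all assumptions made for illustration, and input images are omitted for brevity.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class PreferencePair:
    prompt: str    # instruction plus the already-corrected prefix (assumed layout)
    chosen: str    # corrected segment
    rejected: str  # original segment


def split_into_segments(text: str, max_words: int) -> List[str]:
    """Split text into consecutive chunks of at most max_words words."""
    words = text.split()
    return [" ".join(words[i:i + max_words]) for i in range(0, len(words), max_words)]


def build_iterative_pairs(
    instruction: str,
    output: str,
    correct: Callable[[str, str], str],  # hypothetical reviser: (context, segment) -> revised segment
    max_words: int = 500,                # hypothetical segment size
) -> List[PreferencePair]:
    """Walk the output segment by segment and pair each revised segment with its original."""
    pairs: List[PreferencePair] = []
    prefix: List[str] = []
    for segment in split_into_segments(output, max_words):
        context = " ".join(prefix)
        revised = correct(context, segment)
        if revised != segment:
            # Only segments that were actually changed yield a preference pair.
            pairs.append(
                PreferencePair(
                    prompt=(instruction + "\n" + context).strip(),
                    chosen=revised,
                    rejected=segment,
                )
            )
        # Later segments are conditioned on the corrected text (an assumption).
        prefix.append(revised)
    return pairs
```

Under these assumptions, one long output can yield several short preference pairs, so each correction covers a single segment rather than the full text.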
