Q-Eval-100K: Evaluating Visual Quality and Alignment Level for Text-to-Vision Content
March 4, 2025
Authors: Zicheng Zhang, Tengchuan Kou, Shushi Wang, Chunyi Li, Wei Sun, Wei Wang, Xiaoyu Li, Zongyu Wang, Xuezhi Cao, Xiongkuo Min, Xiaohong Liu, Guangtao Zhai
cs.AI
Abstract
Evaluating text-to-vision content hinges on two crucial aspects: visual
quality and alignment. While significant progress has been made in developing
objective models to assess these dimensions, the performance of such models
heavily relies on the scale and quality of human annotations. According to
the Scaling Law, increasing the number of human-labeled instances follows a
predictable pattern that enhances the performance of evaluation models.
Therefore, we introduce a comprehensive dataset designed to Evaluate Visual
quality and Alignment Level for text-to-vision content (Q-EVAL-100K), featuring
the largest collection of human-labeled Mean Opinion Scores (MOS) for these
two aspects. The Q-EVAL-100K dataset encompasses both text-to-image
and text-to-video models, with 960K human annotations specifically focused on
visual quality and alignment for 100K instances (60K images and 40K videos).
Leveraging this dataset together with context prompts, we propose Q-Eval-Score, a unified
model capable of evaluating both visual quality and alignment with special
improvements for handling long-text prompt alignment. Experimental results
indicate that the proposed Q-Eval-Score achieves superior performance on both
visual quality and alignment, with strong generalization capabilities across
other benchmarks. These findings highlight the significant value of the
Q-EVAL-100K dataset. Data and codes will be available at
https://github.com/zzc-1998/Q-Eval.