Q-Eval-100K: Evaluating Visual Quality and Alignment Level for Text-to-Vision Content
March 4, 2025
Authors: Zicheng Zhang, Tengchuan Kou, Shushi Wang, Chunyi Li, Wei Sun, Wei Wang, Xiaoyu Li, Zongyu Wang, Xuezhi Cao, Xiongkuo Min, Xiaohong Liu, Guangtao Zhai
cs.AI
Abstract
Evaluating text-to-vision content hinges on two crucial aspects: visual
quality and alignment. While significant progress has been made in developing
objective models to assess these dimensions, the performance of such models
heavily relies on the scale and quality of human annotations. According to
the Scaling Law, increasing the number of human-labeled instances follows a
predictable pattern that enhances the performance of evaluation models.
Therefore, we introduce a comprehensive dataset designed to Evaluate Visual
quality and Alignment Level for text-to-vision content (Q-EVAL-100K), featuring
the largest collection of human-labeled Mean Opinion Scores (MOS) for these
two aspects. The Q-EVAL-100K dataset encompasses both text-to-image
and text-to-video models, with 960K human annotations specifically focused on
visual quality and alignment for 100K instances (60K images and 40K videos).
Leveraging this dataset together with context prompts, we propose Q-Eval-Score, a unified
model capable of evaluating both visual quality and alignment with special
improvements for handling long-text prompt alignment. Experimental results
indicate that the proposed Q-Eval-Score achieves superior performance on both
visual quality and alignment, with strong generalization capabilities across
other benchmarks. These findings highlight the significant value of the
Q-EVAL-100K dataset. Data and codes will be available at
https://github.com/zzc-1998/Q-Eval.