Next Token Is Enough: Realistic Image Quality and Aesthetic Scoring with Multimodal Large Language Model
March 8, 2025
Authors: Mingxing Li, Rui Wang, Lei Sun, Yancheng Bai, Xiangxiang Chu
cs.AI
Abstract
The rapid expansion of mobile internet has resulted in a substantial increase
in user-generated content (UGC) images, thereby making the thorough assessment
of UGC images both urgent and essential. Recently, multimodal large language
models (MLLMs) have shown great potential in image quality assessment (IQA) and
image aesthetic assessment (IAA). Despite this progress, effectively scoring
the quality and aesthetics of UGC images still faces two main challenges: 1) A
single score is inadequate to capture the hierarchical human perception. 2) How
to use MLLMs to output numerical scores, such as mean opinion scores (MOS),
remains an open question. To address these challenges, we introduce a novel
dataset, named Realistic image Quality and Aesthetic (RealQA), including 14,715
UGC images, each of which is annotated with 10 fine-grained attributes. These
attributes span three levels: low level (e.g., image clarity), middle level
(e.g., subject integrity) and high level (e.g., composition). Besides, we
conduct a series of in-depth and comprehensive investigations into how to
effectively predict numerical scores using MLLMs. Surprisingly, by predicting
just two extra significant digits, the next token paradigm can achieve SOTA
performance. Furthermore, with the help of chain of thought (CoT) combined with
the learnt fine-grained attributes, the proposed method can outperform SOTA
methods on five public datasets for IQA and IAA with superior interpretability
and show strong zero-shot generalization for video quality assessment (VQA).
The code and dataset will be released.
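As a rough illustration of the score-as-text idea described in the abstract, the sketch below shows one way a numerical MOS target could be serialized as a short fixed-format string for standard next-token supervision, and parsed back from generated text at inference. The function names (`format_mos_target`, `parse_score`) and the choice of two digits after the decimal point are illustrative assumptions, not the paper's released implementation.

```python
import re
from typing import Optional

def format_mos_target(mos: float, decimals: int = 2) -> str:
    """Render a mean opinion score as a plain-text target (e.g. 3.7412 -> "3.74"),
    so the score can be supervised with the ordinary next-token objective
    instead of a separate regression head. The two-decimal format is an assumption."""
    return f"{mos:.{decimals}f}"

def parse_score(generated_text: str) -> Optional[float]:
    """Extract the first decimal number from the model's generated answer."""
    match = re.search(r"\d+(?:\.\d+)?", generated_text)
    return float(match.group(0)) if match else None

# Hypothetical usage: build a fine-tuning target, then recover a score at inference.
target = format_mos_target(3.7412)                  # -> "3.74"
score = parse_score("The quality score is 3.74.")   # -> 3.74
```

Keeping the score as a short fixed-format string lets the language-modeling loss supervise it directly, which is consistent with the abstract's claim that predicting a couple of extra significant digits is sufficient for SOTA performance.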