Multimodal LLMs Can Reason about Aesthetics in Zero-Shot
January 15, 2025
Authors: Ruixiang Jiang, Changwen Chen
cs.AI
Abstract
We present the first study on how Multimodal LLMs' (MLLMs) reasoning ability
shall be elicited to evaluate the aesthetics of artworks. To facilitate this
investigation, we construct MM-StyleBench, a novel high-quality dataset for
benchmarking artistic stylization. We then develop a principled method for
human preference modeling and perform a systematic correlation analysis between
MLLMs' responses and human preference. Our experiments reveal an inherent
hallucination issue of MLLMs in art evaluation, associated with response
subjectivity. ArtCoT is proposed, demonstrating that art-specific task
decomposition and the use of concrete language boost MLLMs' reasoning ability
for aesthetics. Our findings offer valuable insights into MLLMs for art and can
benefit a wide range of downstream applications, such as style transfer and
artistic image generation. Code is available at
https://github.com/songrise/MLLM4Art.