Multimodal LLMs Can Reason about Aesthetics in Zero-Shot

Abstract

We present the first study of how the reasoning ability of Multimodal LLMs (MLLMs) can be elicited to evaluate the aesthetics of artworks. To facilitate this investigation, we construct MM-StyleBench, a novel high-quality dataset for benchmarking artistic stylization. We then develop a principled method for human preference modeling and perform a systematic correlation analysis between MLLMs' responses and human preference. Our experiments reveal an inherent hallucination issue of MLLMs in art evaluation, associated with response subjectivity. We propose ArtCoT, demonstrating that art-specific task decomposition and the use of concrete language boost MLLMs' reasoning ability for aesthetics. Our findings offer valuable insights into MLLMs for art and can benefit a wide range of downstream applications, such as style transfer and artistic image generation. Code is available at https://github.com/songrise/MLLM4Art.

