Multimodal LLMs Can Reason about Aesthetics in Zero-Shot

Abstract

We present the first study on how the reasoning ability of Multimodal LLMs (MLLMs) can be elicited to evaluate the aesthetics of artworks. To facilitate this investigation, we construct MM-StyleBench, a novel high-quality dataset for benchmarking artistic stylization. We then develop a principled method for human preference modeling and perform a systematic correlation analysis between MLLMs' responses and human preference. Our experiments reveal an inherent hallucination issue of MLLMs in art evaluation, associated with response subjectivity. We propose ArtCoT and demonstrate that art-specific task decomposition and the use of concrete language boost MLLMs' reasoning ability for aesthetics. Our findings offer valuable insights into MLLMs for art and can benefit a wide range of downstream applications, such as style transfer and artistic image generation. Code is available at https://github.com/songrise/MLLM4Art.
