MLLM can see? Dynamic Correction Decoding for Hallucination Mitigation

Abstract

Multimodal Large Language Models (MLLMs) frequently hallucinate, but the underlying reasons remain poorly understood. In this paper, we present an empirical analysis and find that, although MLLMs generate incorrect objects in the final output, they are in fact able to recognize visual objects in the preceding layers. We speculate that the strong knowledge priors of the language model suppress the visual information, leading to hallucinations. Motivated by this, we propose a novel dynamic correction decoding method for MLLMs (DeCo), which adaptively selects the appropriate preceding layers and proportionally integrates their knowledge into the final layer to adjust the output logits. DeCo is model-agnostic, can be seamlessly combined with various classic decoding strategies, and can be applied to different MLLMs. We evaluate DeCo on widely used benchmarks and show that it reduces hallucination rates by a large margin compared to baselines, highlighting its potential to mitigate hallucinations. Code is available at https://github.com/zjunlp/DeCo.
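To make the decoding idea concrete, the following is a minimal sketch of one way such a correction step could look. It is not the authors' implementation. The abstract does not state how preceding layers are chosen or how their contribution is weighted, so everything here is an assumption for illustration: the function name `dynamic_correction`, the parameters `candidate_layers`, `alpha`, and `top_k`, the rule of picking the preceding layer that is most confident about the final layer's top-k tokens, and the confidence-scaled addition of that layer's logits. The sketch also assumes that per-layer next-token logits have already been obtained, for example by passing each layer's hidden state through the output head.

```python
import numpy as np


def softmax(x):
    """Numerically stable softmax over a 1-D logit vector."""
    e = np.exp(x - np.max(x))
    return e / e.sum()


def dynamic_correction(layer_logits, candidate_layers, alpha=0.5, top_k=2):
    """Adjust final-layer logits using one selected preceding layer.

    layer_logits: array of shape (num_layers, vocab_size); the last row is
        the final layer. (Assumed to be precomputed per-layer logits.)
    candidate_layers: indices of preceding layers to choose from (assumption).
    alpha: strength of the correction (hypothetical hyperparameter).
    top_k: number of final-layer candidate tokens used for layer selection.
    Returns (adjusted_logits, selected_layer_index).
    """
    final = layer_logits[-1]
    # Candidate tokens: the top-k tokens according to the final layer.
    candidates = np.argsort(final)[-top_k:]

    # Select the preceding layer whose peak probability over the
    # candidate tokens is highest (assumed selection rule).
    best_layer, best_prob = None, -1.0
    for i in candidate_layers:
        p = softmax(layer_logits[i])[candidates].max()
        if p > best_prob:
            best_layer, best_prob = i, p

    # Integrate the selected layer's logits in proportion to its confidence.
    adjusted = final + alpha * best_prob * layer_logits[best_layer]
    return adjusted, best_layer


# Toy example: 3 layers, vocabulary of 4 tokens.
toy = np.array([
    [0.0, 0.0, 0.0, 0.0],  # layer 0: uninformative
    [0.0, 4.0, 0.0, 0.0],  # layer 1: strongly favours token 1
    [3.0, 2.0, 0.0, 0.0],  # final layer: favours token 0
])
adjusted, layer = dynamic_correction(toy, candidate_layers=[0, 1])
print(layer, int(np.argmax(toy[-1])), int(np.argmax(adjusted)))
```

In this toy case the final layer alone would greedily pick token 0, while the corrected logits pick token 1, the token the selected preceding layer was confident about. Because the function only rewrites the logit vector, the result can still be passed to greedy, beam, or sampling-based decoding, which matches the abstract's claim that the method combines with classic decoding strategies.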
