Enhancing Cognition and Explainability of Multimodal Foundation Models with Self-Synthesized Data

Abstract

Large multimodal models (LMMs) have shown impressive capabilities across a wide range of visual tasks. However, they often struggle with fine-grained visual reasoning: they fail to identify domain-specific objectives and to provide justifiable explanations for their predictions. To address this, we propose a novel visual rejection sampling framework that improves the cognition and explainability of LMMs using self-synthesized data. Visual fine-tuning requires images, queries, and target answers. Our approach begins by synthesizing interpretable answers that include human-verifiable visual features. These features are drawn from expert-defined concepts, which are selected according to how well they align with the image content. After each round of fine-tuning, we apply a reward-model-free filtering mechanism to select the highest-quality interpretable answers for the next round of tuning. This iterative process of data synthesis and fine-tuning progressively improves the model's ability to generate accurate and reasonable explanations. Experimental results demonstrate that our method improves both accuracy and explainability on specialized visual classification tasks.
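
The iterative procedure described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not specify the selection rule or the filter, so every name here (`Example`, `select_concepts`, `synthesize_answer`, `filter_candidates`, `run_rounds`) is hypothetical. The concept-image alignment scores are assumed to be given, and the reward-model-free filter is assumed to keep the sampled answer that names the correct label and mentions the most selected concepts. The `fine_tune` and `sample` callables stand in for the actual LMM training and generation steps.

```python
from dataclasses import dataclass, replace
from typing import Any, Callable, Dict, List, Optional, Sequence


@dataclass
class Example:
    image_id: str      # stand-in for the actual image
    query: str         # question posed to the model
    label: str         # ground-truth class
    answer: str = ""   # interpretable target answer (synthesized)


def select_concepts(scores: Dict[str, float], k: int) -> List[str]:
    """Keep the k expert-defined concepts with the highest alignment score."""
    return sorted(scores, key=scores.get, reverse=True)[:k]


def synthesize_answer(label: str, concepts: Sequence[str]) -> str:
    """Build a target answer that cites human-verifiable visual features."""
    return f"This is a {label} because it shows " + ", ".join(concepts) + "."


def filter_candidates(
    candidates: Sequence[str], label: str, concepts: Sequence[str]
) -> Optional[str]:
    """Assumed reward-model-free filter: among answers naming the correct
    label, return the one mentioning the most selected concepts, or None
    if no answer names the label or none mentions any concept."""
    correct = [c for c in candidates if label.lower() in c.lower()]
    if not correct:
        return None

    def coverage(text: str) -> int:
        return sum(con.lower() in text.lower() for con in concepts)

    best = max(correct, key=coverage)
    return best if coverage(best) > 0 else None


def run_rounds(
    data: Sequence[Example],
    concept_scores: Dict[str, Dict[str, float]],
    fine_tune: Callable[[List[Example]], Any],
    sample: Callable[[Any, Example, int], List[str]],
    rounds: int = 2,
    k: int = 2,
    n_samples: int = 4,
):
    """Alternate fine-tuning and filtered self-synthesis for `rounds` rounds."""
    concepts = {
        ex.image_id: select_concepts(concept_scores[ex.image_id], k) for ex in data
    }
    # Round 1: targets synthesized from the selected concepts.
    train = [
        replace(ex, answer=synthesize_answer(ex.label, concepts[ex.image_id]))
        for ex in data
    ]
    model = fine_tune(train)
    for _ in range(rounds - 1):
        next_train = []
        for ex in train:
            best = filter_candidates(
                sample(model, ex, n_samples), ex.label, concepts[ex.image_id]
            )
            # Assumption: fall back to the previous target if nothing passes.
            next_train.append(replace(ex, answer=best) if best else ex)
        train = next_train
        model = fine_tune(train)
    return model, train
```

In this sketch, rejection happens inside `filter_candidates`: sampled answers that name the wrong class or cite none of the selected concepts are discarded, so only self-generated answers that pass both checks become targets for the next round.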

