SilVar-Med: A Speech-Driven Visual Language Model for Explainable Abnormality Detection in Medical Imaging

Abstract

Medical visual language models (VLMs) have shown great potential in various healthcare applications, including medical image captioning and diagnostic assistance. However, most existing models rely on text-based instructions, which limits their usability in real-world clinical environments, especially in scenarios such as surgery, where text-based interaction is often impractical for physicians. In addition, current medical image analysis models typically do not present comprehensive reasoning behind their predictions, which reduces their reliability for clinical decision-making. Given that medical diagnosis errors can have life-changing consequences, there is a critical need for interpretable and rational medical assistance. To address these challenges, we introduce SilVar-Med, an end-to-end speech-driven medical VLM: a multimodal medical image assistant that integrates speech interaction with VLMs, pioneering the task of voice-based communication for medical image analysis. We also focus on interpreting the reasoning behind each prediction of a medical abnormality, and propose a reasoning dataset for this purpose. Through extensive experiments, we demonstrate a proof-of-concept study for reasoning-driven medical image interpretation with end-to-end speech interaction. We believe this work will advance the field of medical AI by fostering more transparent, interactive, and clinically viable diagnostic support systems. Our code and dataset are publicly available at SilVar-Med.

