On Domain-Specific Post-Training for Multimodal Large Language Models
November 29, 2024
Authors: Daixuan Cheng, Shaohan Huang, Ziyu Zhu, Xintong Zhang, Wayne Xin Zhao, Zhongzhi Luan, Bo Dai, Zhenliang Zhang
cs.AI
Abstract
Recent years have witnessed the rapid development of general multimodal large
language models (MLLMs). However, adapting general MLLMs to specific domains,
such as scientific fields and industrial applications, remains less explored.
This paper systematically investigates domain adaptation of MLLMs through
post-training, focusing on data synthesis, training pipelines, and task
evaluation. (1) Data Synthesis: Using open-source models, we develop a visual
instruction synthesizer that effectively generates diverse visual instruction
tasks from domain-specific image-caption pairs. Our synthetic tasks surpass
those generated by manual rules, GPT-4, and GPT-4V in enhancing the
domain-specific performance of MLLMs. (2) Training Pipeline: While two-stage
training (first on image-caption pairs, then on visual instruction tasks) is
commonly adopted for developing general MLLMs, we apply a single-stage training
pipeline to enhance task diversity for domain-specific post-training. (3) Task
Evaluation: We conduct experiments in two domains,
biomedicine and food, by post-training MLLMs of different sources and scales
(e.g., Qwen2-VL-2B, LLaVA-v1.6-8B, Llama-3.2-11B), and then evaluating MLLM
performance on various domain-specific tasks. To support further research in
MLLM domain adaptation, we will open-source our implementations.
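Point (1) describes synthesizing visual instruction tasks from domain-specific image-caption pairs with an open-source model. Below is a minimal sketch of that idea; the prompt template, task types, and the `generate` callable are all illustrative assumptions, not the paper's actual synthesizer.

```python
# Hypothetical sketch of the data-synthesis step in (1): turn domain
# image-caption pairs into diverse instruction tasks. `generate` stands in
# for any open-source instruction-tuned model (prompt -> completion);
# the template and task types are assumptions, not the paper's.
from dataclasses import dataclass

@dataclass
class VisualInstructionTask:
    image_path: str
    instruction: str
    response: str

PROMPT = (
    "You are given an image caption from the {domain} domain.\n"
    "Caption: {caption}\n"
    "Write one {task_type} question about the image, then answer it "
    "using only the caption. Format:\nQ: ...\nA: ..."
)

TASK_TYPES = ["open-ended", "multiple-choice", "yes/no"]

def synthesize(pairs, domain, generate):
    """pairs: iterable of (image_path, caption); generate: str -> str."""
    tasks = []
    for image_path, caption in pairs:
        for task_type in TASK_TYPES:
            raw = generate(PROMPT.format(domain=domain, caption=caption,
                                         task_type=task_type))
            # Expect "Q: ...\nA: ..."; skip malformed model outputs.
            if "\nA:" not in raw:
                continue
            q, a = raw.split("\nA:", 1)
            tasks.append(VisualInstructionTask(
                image_path, q.removeprefix("Q:").strip(), a.strip()))
    return tasks
```

With an open-source instruction model plugged in as `generate`, each caption yields several heterogeneous tasks; this task diversity is what the abstract credits for outperforming rule-based and GPT-4/GPT-4V synthesis.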
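Point (2) contrasts two-stage training with the single-stage pipeline used here. A schematic sketch of the difference, reusing `VisualInstructionTask` from the sketch above; the example format and helper names are assumptions:

```python
import random

def to_caption_example(image_path, caption):
    # Render an image-caption pair in the same instruction format as the
    # synthesized tasks, so both kinds of data can be trained on together
    # (the instruction template is illustrative).
    return {"image": image_path,
            "instruction": "Describe the image.",
            "response": caption}

def single_stage_mix(caption_pairs, tasks, seed=0):
    """Single-stage post-training: one shuffled mixture of captioning and
    synthesized instruction examples, instead of two sequential stages
    (captions first, instruction tasks second)."""
    examples = [to_caption_example(p, c) for p, c in caption_pairs]
    examples += [{"image": t.image_path,
                  "instruction": t.instruction,
                  "response": t.response} for t in tasks]
    random.Random(seed).shuffle(examples)
    return examples  # feed to any standard supervised fine-tuning loop
```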