TAID: Temporally Adaptive Interpolated Distillation for Efficient Knowledge Transfer in Language Models

Abstract

Causal language models have demonstrated remarkable capabilities, but their size makes them difficult to deploy in resource-constrained environments. Knowledge distillation, a widely used technique for transferring knowledge from a large teacher model to a small student model, is a promising approach to model compression. A significant remaining issue lies in the differences between teacher and student models, namely the substantial capacity gap, mode averaging, and mode collapse, which pose barriers during distillation. To address these issues, we introduce Temporally Adaptive Interpolated Distillation (TAID), a novel knowledge distillation approach that dynamically interpolates student and teacher distributions through an adaptive intermediate distribution, gradually shifting from the student's initial distribution towards the teacher's distribution. We provide a theoretical analysis demonstrating TAID's ability to prevent mode collapse and empirically show its effectiveness in addressing the capacity gap while balancing mode averaging and mode collapse. Our comprehensive experiments demonstrate TAID's superior performance across various model sizes and architectures in both instruction tuning and pre-training scenarios. Furthermore, we showcase TAID's practical impact by developing two state-of-the-art compact foundation models: TAID-LLM-1.5B for language tasks and TAID-VLM-2B for vision-language tasks. These results demonstrate TAID's effectiveness in creating high-performing and efficient models, advancing the development of more accessible AI technologies.
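The idea of an intermediate distribution that moves from the student towards the teacher can be illustrated with a small sketch. The abstract does not specify how the interpolation is computed or how the interpolation weight is adapted, so everything below is an assumption made for illustration: the target is a plain convex mix in probability space, the weight t is set by hand rather than adapted during training, and the names softmax, interpolate, and kl_divergence are hypothetical helpers, not the authors' implementation.

```python
import math


def softmax(logits):
    """Convert a list of logits to probabilities (numerically stable)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def interpolate(student, teacher, t):
    """Assumed intermediate target: convex mix of the two distributions.

    t = 0 reproduces the student distribution, t = 1 the teacher distribution.
    """
    return [(1.0 - t) * s + t * p for s, p in zip(student, teacher)]


def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions given as lists."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)


# Toy next-token distributions over a three-word vocabulary (made-up values).
student = softmax([2.0, 0.5, -1.0])
teacher = softmax([0.1, 1.5, 1.4])

# Hand-picked weights stand in for the adaptive schedule, which is not
# described in the abstract.
for t in (0.0, 0.25, 0.5, 1.0):
    target = interpolate(student, teacher, t)
    gap = kl_divergence(target, student)
    print(f"t={t:.2f}  KL(target || student) = {gap:.4f}")
```

With a small t the target stays close to what the student already predicts, so the distance the student must cover in one step is small; as t grows, the target approaches the teacher. This is only meant to convey the gradual shift described above under the stated assumptions.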
