TAID: Temporally Adaptive Interpolated Distillation for Efficient Knowledge Transfer in Language Models
January 28, 2025
Authors: Makoto Shing, Kou Misaki, Han Bao, Sho Yokoi, Takuya Akiba
cs.AI
Abstract
Causal language models have demonstrated remarkable capabilities, but their
size poses significant challenges for deployment in resource-constrained
environments. Knowledge distillation, a widely-used technique for transferring
knowledge from a large teacher model to a small student model, presents a
promising approach for model compression. A significant remaining issue lies in
the major differences between teacher and student models, namely the
substantial capacity gap, mode averaging, and mode collapse, which pose
barriers during distillation. To address these issues, we introduce
Temporally Adaptive Interpolated Distillation (TAID), a novel
knowledge distillation approach that dynamically interpolates student and
teacher distributions through an adaptive intermediate distribution, gradually
shifting from the student's initial distribution towards the teacher's
distribution. We provide a theoretical analysis demonstrating TAID's ability to
prevent mode collapse and empirically show its effectiveness in addressing the
capacity gap while balancing mode averaging and mode collapse. Our
comprehensive experiments demonstrate TAID's superior performance across
various model sizes and architectures in both instruction tuning and
pre-training scenarios. Furthermore, we showcase TAID's practical impact by
developing two state-of-the-art compact foundation models:
TAID-LLM-1.5B for language tasks and TAID-VLM-2B for
vision-language tasks. These results demonstrate TAID's effectiveness in
creating high-performing and efficient models, advancing the development of
more accessible AI technologies.
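To make the interpolation idea in the abstract concrete, below is a minimal PyTorch-style sketch of what a TAID-like objective could look like. It assumes a logit-space interpolation between the (detached) student and a frozen teacher, controlled by a coefficient t that moves from 0 toward 1 over training; the function names, the exact interpolation form, and the fixed linear schedule for t are illustrative assumptions rather than the paper's precise formulation (the paper describes an adaptive update rule for t).

```python
# Illustrative sketch only; not the authors' released implementation.
import torch
import torch.nn.functional as F

def taid_style_loss(student_logits, teacher_logits, t):
    """KL(intermediate || student), where the intermediate target interpolates
    the (detached) student logits and the teacher logits with coefficient t."""
    # At t=0 the target is the student's own distribution; at t=1 it is the teacher's.
    interp_logits = (1.0 - t) * student_logits.detach() + t * teacher_logits
    target_log_probs = F.log_softmax(interp_logits, dim=-1)
    student_log_probs = F.log_softmax(student_logits, dim=-1)
    # Gradients flow only through student_logits; the target is treated as fixed.
    return F.kl_div(student_log_probs, target_log_probs,
                    log_target=True, reduction="batchmean")

def linear_t(step, total_steps, t_start=0.0, t_end=1.0):
    # Simple fixed schedule for t (assumption); TAID itself adapts t during training.
    frac = min(step / max(total_steps, 1), 1.0)
    return t_start + frac * (t_end - t_start)
```

In a training loop, t would be recomputed each step (here via the placeholder linear schedule) and the loss backpropagated through the student only, since the teacher is frozen and the interpolated target is detached.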