Return of the Encoder: Maximizing Parameter Efficiency for SLMs
January 27, 2025
Authors: Mohamed Elfeki, Rui Liu, Chad Voegele
cs.AI
Abstract
The dominance of large decoder-only language models has overshadowed
encoder-decoder architectures, despite their fundamental efficiency advantages
in sequence processing. For small language models (SLMs), those with 1 billion
parameters or fewer, our systematic analysis across GPU, CPU, and NPU
platforms reveals that encoder-decoder architectures achieve 47% lower
first-token latency and 4.7x higher throughput compared to decoder-only models
on edge devices. These gains may be attributed to encoder-decoder's one-time
input processing and efficient separation of understanding and generation
phases.
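The throughput advantage of one-time input processing can be illustrated with a back-of-envelope FLOP count, using the common rule of thumb of roughly 2 x (parameters) FLOPs per processed token. This is a hypothetical sketch, not the paper's benchmark methodology; the function names and the even encoder/decoder parameter split below are assumptions:

```python
# Toy FLOP model (rule of thumb: ~2 * params FLOPs per token per forward pass).
# Hypothetical sketch only: real latency also depends on KV caching details,
# memory bandwidth, and hardware, none of which are modeled here.

def decoder_only_generation_flops(n_params: int, in_len: int, out_len: int) -> int:
    # A decoder-only model runs all of its parameters over every prompt
    # token (prefill) and every generated token.
    return 2 * n_params * (in_len + out_len)

def encoder_decoder_generation_flops(enc_params: int, dec_params: int,
                                     in_len: int, out_len: int) -> int:
    # The encoder processes the input exactly once; only the (smaller)
    # decoder runs per generated token, cross-attending to encoder states.
    return 2 * enc_params * in_len + 2 * dec_params * out_len
```

For a 1B-parameter budget split evenly between encoder and decoder, a 512-token input with a 128-token output costs markedly fewer FLOPs than the decoder-only equivalent, which is consistent in direction (though not in magnitude) with the measured throughput gap.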
We introduce a novel knowledge distillation framework that enables
encoder-decoder models to leverage the capabilities of large-scale
decoder-only teachers while preserving their architectural advantages,
achieving an average improvement of up to 6 performance points across diverse
tasks, with significant gains on asymmetric sequence tasks where input and
output distributions benefit from different processing approaches.
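The abstract does not spell out the distillation objective, so the following is only a minimal sketch of a standard token-level distillation loss (temperature-softened KL divergence between teacher and student next-token distributions, in the style of Hinton-type distillation); all names here are hypothetical, not the paper's API:

```python
import math

def softmax(logits, temperature=1.0):
    # Temperature-scaled softmax over a list of raw logits.
    scaled = [z / temperature for z in logits]
    m = max(scaled)  # subtract max for numerical stability
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    # Forward KL(teacher || student) on softened distributions, scaled by
    # T^2 so gradient magnitudes stay comparable across temperatures.
    p = softmax(teacher_logits, temperature)  # teacher: decoder-only model
    q = softmax(student_logits, temperature)  # student: encoder-decoder model
    kl = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)
    return temperature ** 2 * kl
```

In practice such a term is typically averaged over output positions and mixed with the ordinary cross-entropy loss on ground-truth tokens; any teacher/student tokenizer mismatch would need separate handling.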
Our systematic investigation further demonstrates that, when combined with
modern advances such as Rotary Positional Embeddings (RoPE) and vision
encoders, encoder-decoder architectures provide a more practical path toward
deploying capable language models in resource-constrained environments. Our
findings challenge the prevailing trend toward decoder-only scaling, showing
that architectural choices become increasingly crucial as parameter budgets
decrease, particularly for on-device and edge deployments where computational
efficiency is paramount.