
Relaxed Recursive Transformers: Effective Parameter Sharing with Layer-wise LoRA

October 28, 2024
Authors: Sangmin Bae, Adam Fisch, Hrayr Harutyunyan, Ziwei Ji, Seungyeon Kim, Tal Schuster
cs.AI

Abstract

Large language models (LLMs) are expensive to deploy. Parameter sharing offers a possible path towards reducing their size and cost, but its effectiveness in modern LLMs remains fairly limited. In this work, we revisit "layer tying" as a form of parameter sharing in Transformers, and introduce novel methods for converting existing LLMs into smaller "Recursive Transformers" that share parameters across layers, with minimal loss of performance. Here, our Recursive Transformers are efficiently initialized from standard pretrained Transformers, but only use a single block of unique layers that is then repeated multiple times in a loop. We further improve performance by introducing Relaxed Recursive Transformers that add flexibility to the layer tying constraint via depth-wise low-rank adaptation (LoRA) modules, yet still preserve the compactness of the overall model. We show that our recursive models (e.g., recursive Gemma 1B) outperform both similar-sized vanilla pretrained models (such as TinyLlama 1.1B and Pythia 1B) and knowledge distillation baselines -- and can even recover most of the performance of the original "full-size" model (e.g., Gemma 2B with no shared parameters). Finally, we propose Continuous Depth-wise Batching, a promising new inference paradigm enabled by the Recursive Transformer when paired with early exiting. In a theoretical analysis, we show that this has the potential to lead to significant (2-3x) gains in inference throughput.
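To make the looped-layer idea above concrete, the following is a minimal, hypothetical PyTorch sketch of a block whose weights are shared across every loop iteration while a small per-depth LoRA delta relaxes the tying. All names here (DepthwiseLoRALinear, RelaxedRecursiveBlock, num_loops, rank) are illustrative assumptions, not the authors' implementation, and the single linear "mixer" stands in for the full attention and MLP sub-layers of a real Transformer block.

```python
# Minimal sketch (not the authors' code) of a Relaxed Recursive Transformer block:
# one set of unique weights reused in a loop, with a per-depth low-rank correction.
import torch
import torch.nn as nn


class DepthwiseLoRALinear(nn.Module):
    """A shared linear layer plus a small low-rank (LoRA) correction per loop depth."""

    def __init__(self, dim: int, num_loops: int, rank: int = 8):
        super().__init__()
        self.shared = nn.Linear(dim, dim)  # tied across all loop iterations
        # lora_A is zero-initialized, so every depth starts as an exact copy
        # of the shared layer before any relaxation is learned.
        self.lora_A = nn.Parameter(torch.zeros(num_loops, dim, rank))
        self.lora_B = nn.Parameter(torch.randn(num_loops, rank, dim) * 0.01)

    def forward(self, x: torch.Tensor, loop_idx: int) -> torch.Tensor:
        delta = x @ self.lora_A[loop_idx] @ self.lora_B[loop_idx]
        return self.shared(x) + delta


class RelaxedRecursiveBlock(nn.Module):
    """One unique block applied num_loops times, with depth-specific LoRA deltas."""

    def __init__(self, dim: int, num_loops: int, rank: int = 8):
        super().__init__()
        self.num_loops = num_loops
        self.norm = nn.LayerNorm(dim)
        # Stand-in for the attention/MLP sub-layers of a real Transformer block.
        self.mixer = DepthwiseLoRALinear(dim, num_loops, rank)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        for loop_idx in range(self.num_loops):  # reuse the same weights each pass
            x = x + self.mixer(self.norm(x), loop_idx)
        return x


if __name__ == "__main__":
    block = RelaxedRecursiveBlock(dim=64, num_loops=3, rank=4)
    out = block(torch.randn(2, 10, 64))
    print(out.shape)  # torch.Size([2, 10, 64])
```

Note that the shared weights dominate the parameter count, while the per-depth LoRA factors add only O(num_loops * dim * rank) extra parameters, which is how the relaxation keeps the model compact.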
