Byte Latent Transformer: Patches Scale Better Than Tokens

December 13, 2024
Authors: Artidoro Pagnoni, Ram Pasunuru, Pedro Rodriguez, John Nguyen, Benjamin Muller, Margaret Li, Chunting Zhou, Lili Yu, Jason Weston, Luke Zettlemoyer, Gargi Ghosh, Mike Lewis, Ari Holtzman, Srinivasan Iyer
cs.AI

Abstract

We introduce the Byte Latent Transformer (BLT), a new byte-level LLM architecture that, for the first time, matches tokenization-based LLM performance at scale with significant improvements in inference efficiency and robustness. BLT encodes bytes into dynamically sized patches, which serve as the primary units of computation. Patches are segmented based on the entropy of the next byte, allocating more compute and model capacity where increased data complexity demands it. We present the first FLOP-controlled scaling study of byte-level models up to 8B parameters and 4T training bytes. Our results demonstrate the feasibility of scaling models trained on raw bytes without a fixed vocabulary. Both training and inference efficiency improve due to dynamically selecting long patches when data is predictable, along with qualitative improvements on reasoning and long-tail generalization. Overall, for fixed inference costs, BLT shows significantly better scaling than tokenization-based models, by simultaneously growing both patch and model size.
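
The idea of segmenting patches by next-byte entropy can be illustrated with a toy sketch. The abstract only states that patch boundaries depend on the entropy of the next byte; everything else here is an assumption made for illustration: a simple bigram count table stands in for whatever model estimates the next-byte distribution, a single global `threshold` decides where a new patch begins, and the function names `train_bigram`, `next_byte_entropy`, and `entropy_patches` are hypothetical.

```python
import math
from collections import Counter, defaultdict


def train_bigram(data: bytes):
    """Count next-byte frequencies per preceding byte (toy stand-in for a byte-level model)."""
    counts = defaultdict(Counter)
    for prev, nxt in zip(data, data[1:]):
        counts[prev][nxt] += 1
    return counts


def next_byte_entropy(counts, prev: int) -> float:
    """Shannon entropy in bits of the estimated next-byte distribution after `prev`."""
    dist = counts.get(prev)
    if not dist:
        return 8.0  # unseen context: assume a uniform distribution over 256 byte values
    total = sum(dist.values())
    return -sum((c / total) * math.log2(c / total) for c in dist.values())


def entropy_patches(data: bytes, counts, threshold: float):
    """Split `data` into patches, starting a new patch where next-byte entropy exceeds `threshold`."""
    patches, start = [], 0
    for i in range(1, len(data)):
        # High entropy means the byte at position i is hard to predict: open a new patch there.
        if next_byte_entropy(counts, data[i - 1]) > threshold:
            patches.append(data[start:i])
            start = i
    if start < len(data):
        patches.append(data[start:])
    return patches


if __name__ == "__main__":
    text = b"abababab the cat sat on the mat"
    table = train_bigram(text)
    for patch in entropy_patches(text, table, threshold=1.0):
        print(patch)
```

Under these assumptions, predictable stretches of bytes (low entropy) are grouped into longer patches, while hard-to-predict positions open new patches, so more patches, and therefore more compute, land where the data is less predictable.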

December 17, 2024