Cut Your Losses in Large-Vocabulary Language Models

November 13, 2024
Authors: Erik Wijmans, Brody Huval, Alexander Hertzberg, Vladlen Koltun, Philipp Krähenbühl
cs.AI

Abstract

As language models grow ever larger, so do their vocabularies. This has shifted the memory footprint of LLMs during training disproportionately to one single layer: the cross-entropy in the loss computation. Cross-entropy builds up a logit matrix with entries for each pair of input tokens and vocabulary items and, for small models, consumes an order of magnitude more memory than the rest of the LLM combined. We propose Cut Cross-Entropy (CCE), a method that computes the cross-entropy loss without materializing the logits for all tokens into global memory. Rather, CCE only computes the logit for the correct token and evaluates the log-sum-exp over all logits on the fly. We implement a custom kernel that performs the matrix multiplications and the log-sum-exp reduction over the vocabulary in flash memory, making global memory consumption for the cross-entropy computation negligible. This has a dramatic effect. Taking the Gemma 2 (2B) model as an example, CCE reduces the memory footprint of the loss computation from 24 GB to 1 MB, and the total training-time memory consumption of the classifier head from 28 GB to 1 GB. To improve the throughput of CCE, we leverage the inherent sparsity of softmax and propose to skip elements of the gradient computation that have a negligible (i.e., below numerical precision) contribution to the gradient. Experiments demonstrate that the dramatic reduction in memory consumption is accomplished without sacrificing training speed or convergence.
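
To make the idea concrete, here is a minimal PyTorch sketch of the memory-saving principle. It is illustrative only: the paper's actual implementation uses fused Triton kernels and recomputes blocks in the backward pass, whereas this sketch relies on autograd.

    import torch

    def chunked_cross_entropy(embeddings, classifier, targets, vocab_chunk=8192):
        """embeddings: (N, D) hidden states; classifier: (V, D) head weights;
        targets: (N,) correct-token ids. Returns the mean cross-entropy loss
        without ever materializing the full (N, V) logit matrix."""
        N, _ = embeddings.shape
        V = classifier.shape[0]
        # Indexed matrix multiplication: only the logit of the correct token.
        correct_logit = (embeddings * classifier[targets]).sum(dim=-1)        # (N,)
        # Running log-sum-exp over the vocabulary, built chunk by chunk so at
        # most an (N, vocab_chunk) block of logits exists at any one time.
        lse = torch.full((N,), float("-inf"),
                         device=embeddings.device, dtype=embeddings.dtype)
        for start in range(0, V, vocab_chunk):
            block = embeddings @ classifier[start:start + vocab_chunk].T      # (N, <=chunk)
            lse = torch.logaddexp(lse, torch.logsumexp(block, dim=-1))
        # Cross-entropy = log-sum-exp over all logits minus the correct logit.
        return (lse - correct_logit).mean()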

Summary

AI-Generated Summary

Paper Overview

This paper introduces the Cut Cross-Entropy (CCE) method, which reduces memory consumption during large language model (LLM) training by restructuring the cross-entropy loss computation without compromising performance. The study demonstrates a dramatic reduction in the memory footprint of the loss computation while maintaining training speed, stability, and convergence.

Core Contribution

  • Introduces the Cut Cross-Entropy (CCE) method, which minimizes the memory consumed by the loss computation when training large language models.
  • Uses custom kernels that perform the matrix multiplications and the log-sum-exp reduction in on-chip (flash) memory, shrinking the memory footprint of the loss from gigabytes to megabytes.
  • Maintains a balanced memory-to-computation ratio, demonstrating stable training and improved efficiency without affecting performance.

Research Context

  • Addresses the memory-intensive nature of cross-entropy loss in large language model training.
  • Focuses on optimizing memory usage without compromising training speed or convergence.
  • Compares the proposed CCE method with existing implementations to showcase memory efficiency benefits.

Keywords

Large Language Models, Cut Cross-Entropy, Memory Consumption, Training Efficiency, Cross-Entropy Loss, Memory Optimization

Background

This research addresses the memory challenges associated with training large language models, particularly the significant memory consumption attributed to cross-entropy loss. The study aims to optimize memory usage during training without impacting the performance or convergence of the models.

Research Gap

  • Existing memory-reduction techniques for LLM training largely target other components, leaving the cross-entropy loss layer as a dominant memory cost.
  • Limited focus on memory-optimization techniques that specifically target the cross-entropy loss computation.
  • Insufficient exploration of the balance between memory consumption and computation in large language model training.

Technical Challenges

  • Managing memory consumption during the computation of cross-entropy loss in large language models.
  • Optimizing memory usage without compromising training speed or convergence.
  • Implementing memory-efficient algorithms and custom kernels that reduce the memory footprint without sacrificing throughput.

Prior Approaches

  • Previous works have concentrated on attention mechanisms, efficient implementations, and vocabulary reduction in large language models.
  • Existing solutions have not adequately addressed the memory-intensive nature of cross-entropy loss computations.
  • Limited emphasis on leveraging sparsity and custom kernels to optimize memory usage during training.

Methodology

The methodology centers on the Cut Cross-Entropy (CCE) method, which reduces memory consumption during large language model training while maintaining performance and convergence.

Theoretical Foundation

  • CCE reformulates how the cross-entropy loss is computed rather than the objective itself, so memory consumption drops without changing the loss.
  • Uses an indexed matrix multiplication for the correct-token logit and a fused linear-log-sum-exp operation for the normalizer in both the forward and backward passes (see the decomposition after this list).
  • Balances memory-to-computation ratios to enhance training stability and efficiency.
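
In schematic form (the symbols here are illustrative and may differ from the paper's notation), the per-token loss splits exactly into these two operations: for hidden state $e_i$, classifier rows $c_v$ over a vocabulary of size $V$, and correct token $y_i$,

    \ell_i = \underbrace{\log \sum_{v=1}^{V} \exp\!\left(c_v^\top e_i\right)}_{\text{linear-log-sum-exp}} \;-\; \underbrace{c_{y_i}^\top e_i}_{\text{indexed matrix multiplication}}

Only the scalar loss and the running log-sum-exp need to leave on-chip memory, which is what makes the blockwise evaluation memory-efficient.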

Technical Architecture

  • Custom GPU kernels (written in Triton) and blockwise operations are employed for efficient memory usage.
  • The matrix multiplications and the log-sum-exp reduction over the vocabulary are carried out block by block in on-chip ("flash") memory, so the full logit matrix never reaches global memory.
  • Keeping intermediate results in on-chip SRAM minimizes both memory footprint and latency.

Implementation Details

  • Custom GPU kernels fuse the indexed matrix multiplication, the log-sum-exp reduction, and the corresponding backward pass.
  • Gradient filtering and vocabulary sorting skip numerically negligible computation, reducing memory usage and improving computation speed (see the sketch after this list).
  • The kernels are implemented with the Triton framework.
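
A hypothetical sketch of the gradient-filtering step (names and the threshold value are illustrative, not taken from the paper's kernels): softmax probabilities that round to zero at training precision cannot change the gradient, so those entries can be zeroed and whole blocks of the backward matrix multiplications skipped without affecting the result.

    import torch

    def filtered_block_backward(embeddings, classifier_block, logits_block, lse,
                                eps=2 ** -12):
        """embeddings: (N, D); classifier_block: (Vb, D); logits_block: (N, Vb);
        lse: (N,) log-sum-exp over the full vocabulary. eps approximates the
        resolution of low-precision training (illustrative value). The separate
        one-hot term for the correct token is always kept and is omitted here."""
        probs = torch.exp(logits_block - lse[:, None])      # softmax restricted to this block
        probs = torch.where(probs >= eps, probs, torch.zeros_like(probs))
        if not (probs > 0).any():
            # Every entry in the block is numerically negligible: skip both matmuls.
            return None, None
        grad_emb = probs @ classifier_block                 # (N, D) contribution to dL/dE
        grad_cls = probs.T @ embeddings                     # (Vb, D) contribution to dL/dC
        return grad_emb, grad_cls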

Innovation Points

  • CCE significantly reduces memory consumption during the cross-entropy computation without compromising training speed or convergence.
  • The inherent sparsity of softmax is exploited to skip gradient contributions below numerical precision, recovering throughput.
  • Balancing memory-to-computation ratios enhances training stability and efficiency.

Experimental Validation

The experiments demonstrate the effectiveness of the Cut Cross-Entropy (CCE) method in reducing memory consumption during large language model training.

Setup

  • The matrix multiplication between the model's output embeddings and the classifier head is performed block-wise on the GPU.
  • Gradient filtering and vocabulary sorting are enabled to skip negligible computations.
  • The CCE kernels are implemented with the Triton framework.

Metrics

  • Memory footprint and computation time are key metrics for evaluating the efficiency of CCE.
  • Training stability is assessed through loss curves of different models.
  • Comparison with baseline methods is conducted to showcase memory reduction benefits.

Results

  • CCE significantly reduces memory usage without sacrificing speed compared to baseline methods.
  • Gradient filtering and vocabulary sorting contribute to skipping unnecessary computations, enhancing efficiency.
  • Additional results for various models demonstrate the memory and time efficiency of CCE.

Comparative Analysis

  • Comparisons with Liger Kernels, Torch Tune, torch.compile, and a baseline implementation highlight the memory-reduction benefits of CCE.
  • Filtering ignored tokens before the logits+loss computation improves performance across all methods (see the sketch after this list).
  • The impact of the vocabulary-size-to-hidden-dimension ratio on gradient computation and parallelism is also explored.
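
As a minimal illustration of the ignored-token filtering mentioned above (assuming the common PyTorch convention that labels equal to -100 are excluded from the loss), positions that cannot contribute to the loss can be dropped before any logits are computed:

    import torch

    def drop_ignored(hidden_states, labels, ignore_index=-100):
        """hidden_states: (N, D); labels: (N,). Keeps only the positions that
        contribute to the loss, so no logits are computed for padding or
        prompt tokens."""
        keep = labels != ignore_index
        return hidden_states[keep], labels[keep]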

Impact and Implications

This research shows that the Cut Cross-Entropy (CCE) method makes the memory cost of the cross-entropy layer negligible, with direct consequences for how large-vocabulary language models can be trained.

Key Findings

  • CCE reduces the memory consumed by the cross-entropy computation without compromising performance.
  • Training remains stable, with loss curves matching the baselines, despite the memory-efficient computation.
  • A balanced memory-to-computation ratio is particularly beneficial for training very large models.

Limitations

  • The Triton framework limits control flow at the block level, which constrains certain operations.
  • A CUDA implementation with finer-grained control flow could further improve performance.

Future Directions

  • Extending CCE to other classification problems with a large number of classes is of interest.
  • Exploring the impact of vocabulary size on gradient computation and parallelism in different methods.
  • Investigating finer-grained control flow in CUDA for improved performance.

Practical Significance

  • CCE offers practical applications in optimizing memory usage during training of large language models.
  • The method can benefit various classification tasks beyond language models.
  • Enhancing memory efficiency in training has implications for real-world applications requiring large models.
