The Nature of Mathematical Modeling and Probabilistic Optimization Engineering in Generative AI

October 24, 2024
Author: Fulu Li
cs.AI

Abstract

In this paper, we give an in-depth analysis of the mathematical problem formulations and the probabilistic optimization explorations for some of the key components of the Transformer model [33] in the field of generative AI. We explore and discuss potential further enhancements to current state-of-the-art methods for some key underlying technologies of generative AI models from an algorithmic and probabilistic optimization perspective. In particular, we present an optimal solution for sub-word encoding (SWE) based on initial settings similar to those of the byte-pair encoding (BPE) algorithm in [9], with objectives similar to those of the WordPiece approach in [28, 31], namely to maximize the likelihood of the training data. We also present a cross-entropy optimization method for optimizing the hyperparameters of the word2vec model [17]. In addition, we propose a factored combination of rotary positional encoding (RoPE) [32] and attention with linear biases (ALiBi) [23] with a harmonic series. We also present a probabilistic FlashAttention [6, 7] (PrFlashAttention) method with a probability distribution over block distances in the matrix to decide which block is likely to participate in a given round of attention computation, while maintaining the lower-triangular shape of the tensor for autoregressive language models by re-shaping the tensors. Finally, we present staircase adaptive quantization (SAQ) of the key-value (KV) cache for multi-query attention (MQA), based on the framework presented in [16], to achieve gradual quantization degradation while delivering reasonable model quality and cost savings.
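
As an illustrative aside (not the paper's actual SWE algorithm), the objective described above pairs a BPE-style initialization with a WordPiece-style, likelihood-driven merge criterion: rather than merging the most frequent symbol pair, one merges the pair whose fusion most increases the probability of the training data under a unigram view of the subword units. A minimal Python sketch of that scoring rule, using a hypothetical toy corpus (`word_freqs`, `splits` are made-up names for illustration), is:

```python
from collections import Counter

def pair_scores(word_freqs, splits):
    """Score candidate merges the WordPiece way: count(ab) / (count(a) * count(b)).
    BPE would rank pairs by raw count(ab); the likelihood-style score instead
    favors merges that most increase the training-data probability under a
    unigram model over subword units."""
    unit_counts, pair_counts = Counter(), Counter()
    for word, freq in word_freqs.items():
        units = splits[word]
        for u in units:
            unit_counts[u] += freq
        for a, b in zip(units, units[1:]):
            pair_counts[(a, b)] += freq
    return {
        (a, b): c / (unit_counts[a] * unit_counts[b])
        for (a, b), c in pair_counts.items()
    }

# Hypothetical toy corpus: word -> frequency, each word pre-split into
# characters (the BPE-style initial setting).
word_freqs = {"low": 5, "lower": 2, "newest": 6, "widest": 3}
splits = {w: list(w) for w in word_freqs}

scores = pair_scores(word_freqs, splits)
best_pair = max(scores, key=scores.get)
print(best_pair, scores[best_pair])
```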

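Likewise, the ALiBi building block referenced in the abstract can be made concrete. The sketch below implements only the standard attention-with-linear-biases formulation from [23] (geometric head slopes, causal distance bias); it is not the factored RoPE/ALiBi combination with a harmonic series proposed in the paper:

```python
import torch

def alibi_bias(num_heads: int, seq_len: int) -> torch.Tensor:
    """Standard ALiBi: head h adds a static bias -m_h * (i - j) to the logit of
    query position i attending to key position j (j <= i), where the slopes m_h
    form a geometric sequence. This is the vanilla ALiBi formulation, not the
    paper's factored RoPE/ALiBi combination."""
    slopes = torch.tensor([2.0 ** (-8.0 * (h + 1) / num_heads) for h in range(num_heads)])
    pos = torch.arange(seq_len)
    # distance[i, j] = i - j on the causal (lower-triangular) part, 0 elsewhere.
    distance = (pos[:, None] - pos[None, :]).clamp(min=0).float()
    return -slopes[:, None, None] * distance  # shape: (num_heads, seq_len, seq_len)

# Usage: add to the raw attention scores before the causal mask and softmax,
# e.g. scores = scores + alibi_bias(num_heads, seq_len).unsqueeze(0)
print(alibi_bias(num_heads=4, seq_len=5)[0])
```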