Breaking the Memory Barrier: Near Infinite Batch Size Scaling for Contrastive Loss
October 22, 2024
Authors: Zesen Cheng, Hang Zhang, Kehan Li, Sicong Leng, Zhiqiang Hu, Fei Wu, Deli Zhao, Xin Li, Lidong Bing
cs.AI
Abstract
Contrastive loss is a powerful approach for representation learning, where
larger batch sizes enhance performance by providing more negative samples to
better distinguish between similar and dissimilar data. However, scaling batch
sizes is constrained by the quadratic growth in GPU memory consumption,
primarily due to the full instantiation of the similarity matrix. To address
this, we propose a tile-based computation strategy that partitions the
contrastive loss calculation into arbitrarily small blocks, avoiding full
materialization of the similarity matrix. Furthermore, we introduce a
multi-level tiling strategy to leverage the hierarchical structure of
distributed systems, employing ring-based communication at the GPU level to
optimize synchronization and fused kernels at the CUDA core level to reduce I/O
overhead. Experimental results show that the proposed method scales batch sizes
to unprecedented levels. For instance, it enables contrastive training of a
CLIP-ViT-L/14 model with a batch size of 4M or 12M using 8 or 32 A800 80GB GPUs
without sacrificing any accuracy. Compared to SOTA memory-efficient solutions,
it achieves a two-order-of-magnitude reduction in memory while maintaining
comparable speed. The code will be made publicly available.
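To make the tile-based idea concrete, the following is a minimal single-GPU sketch (assuming PyTorch; the function name `tiled_infonce_loss` and the `tile_size` argument are illustrative, not the authors' API). It streams over column tiles of the similarity matrix with a running log-sum-exp, so only one B-by-tile block is ever materialized; it shows the forward-pass arithmetic only and omits the paper's multi-level ring communication across GPUs and its fused CUDA kernels.

```python
import torch
import torch.nn.functional as F

def tiled_infonce_loss(img_emb, txt_emb, tile_size=1024, temperature=0.07):
    """Image-to-text InfoNCE loss computed tile by tile.

    Instead of building the full (B x B) similarity matrix, a running
    (max, sum) pair per row maintains a streaming log-sum-exp, so at most
    a (B x tile_size) block of similarities exists at any time.
    """
    b = img_emb.shape[0]
    img_emb = F.normalize(img_emb, dim=-1)
    txt_emb = F.normalize(txt_emb, dim=-1)

    row_max = torch.full((b,), float("-inf"), device=img_emb.device)
    row_sum = torch.zeros(b, device=img_emb.device)

    for start in range(0, b, tile_size):
        end = min(start + tile_size, b)
        # Only this (B x tile) slice of the similarity matrix is materialized.
        sim_tile = img_emb @ txt_emb[start:end].T / temperature
        tile_max = sim_tile.max(dim=1).values
        new_max = torch.maximum(row_max, tile_max)
        # Rescale the running sum to the new maximum before adding the tile.
        row_sum = row_sum * torch.exp(row_max - new_max) \
                  + torch.exp(sim_tile - new_max[:, None]).sum(dim=1)
        row_max = new_max

    # Positive logits are the matched (i, i) pairs.
    pos = (img_emb * txt_emb).sum(dim=-1) / temperature
    # InfoNCE: negative log-softmax of the positive over all columns,
    # with the log-sum-exp recovered from the streaming accumulators.
    return (row_max + row_sum.log() - pos).mean()
```

In the paper's full method, as described in the abstract, the gradients are likewise formed tile by tile so the backward pass also avoids materializing the matrix, each GPU holds only its local rows while column tiles circulate via ring-based communication, and the per-tile arithmetic is fused into CUDA kernels to reduce I/O; none of that is reflected in this sketch.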