
Teaching Metric Distance to Autoregressive Multimodal Foundational Models

March 4, 2025
Authors: Jiwan Chung, Saejin Kim, Yongrae Jo, Jaewoo Park, Dongjun Min, Youngjae Yu
cs.AI

Abstract

As large language models expand beyond natural language to domains such as mathematics, multimodal understanding, and embodied agents, tokens increasingly reflect metric relationships rather than purely linguistic meaning. We introduce DIST2Loss, a distance-aware framework designed to train autoregressive discrete models by leveraging predefined distance relationships among output tokens. At its core, DIST2Loss transforms continuous exponential family distributions derived from inherent distance metrics into discrete, categorical optimization targets compatible with the models' architectures. This approach enables the models to learn and preserve meaningful distance relationships during token generation while maintaining compatibility with existing architectures. Empirical evaluations show consistent performance gains in diverse multimodal applications, including visual grounding, robotic manipulation, generative reward modeling, and image generation using vector-quantized features. These improvements are pronounced in cases of limited training data, highlighting DIST2Loss's effectiveness in resource-constrained settings.
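To make the core idea concrete, below is a minimal sketch of a distance-aware categorical objective in the spirit described above: distances between discrete tokens are turned into a softmax-of-negative-distance (exponential family) soft target, which is then matched against the model's predictive distribution with cross-entropy. This is an illustrative assumption of the mechanism, not the paper's exact formulation; the names `dist2loss`, `token_values`, and `tau` are hypothetical.

```python
# Hypothetical sketch: distance-aware categorical target for autoregressive
# discrete models. Assumes each vocabulary token maps to a scalar value
# (e.g. a coordinate bin) and uses an L1 distance; not the paper's exact loss.
import torch
import torch.nn.functional as F


def dist2loss(logits, target_ids, token_values, tau=1.0):
    """
    logits:       (batch, vocab) raw model scores over the discrete vocabulary
    target_ids:   (batch,) index of the ground-truth token
    token_values: (vocab,) scalar value represented by each token
    tau:          temperature controlling how sharply the target concentrates
    """
    # Distance of every vocabulary token's value to the ground-truth token's value.
    target_values = token_values[target_ids]                                   # (batch,)
    dist = (token_values.unsqueeze(0) - target_values.unsqueeze(1)).abs()      # (batch, vocab)

    # Exponential-family soft target: probability decays with distance.
    soft_target = F.softmax(-dist / tau, dim=-1)                               # (batch, vocab)

    # Cross-entropy between the soft target and the model's predictive distribution.
    log_probs = F.log_softmax(logits, dim=-1)
    return -(soft_target * log_probs).sum(dim=-1).mean()


# Illustrative usage with 256 coordinate bins and a batch of 4 predictions.
token_values = torch.arange(256, dtype=torch.float)
logits = torch.randn(4, 256)
target_ids = torch.tensor([10, 42, 99, 200])
loss = dist2loss(logits, target_ids, token_values, tau=2.0)
```

Compared with a one-hot cross-entropy target, this soft target penalizes predictions less when they fall on tokens numerically close to the ground truth, which is how a purely categorical architecture can still learn the metric structure of the output space.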
