
OneRec: Unifying Retrieve and Rank with Generative Recommender and Iterative Preference Alignment

February 26, 2025
作者: Jiaxin Deng, Shiyao Wang, Kuo Cai, Lejian Ren, Qigen Hu, Weifeng Ding, Qiang Luo, Guorui Zhou
cs.AI

Abstract

Recently, generative retrieval-based recommendation systems have emerged as a promising paradigm. However, most modern recommender systems adopt a retrieve-and-rank strategy, where the generative model functions only as a selector during the retrieval stage. In this paper, we propose OneRec, which replaces the cascaded learning framework with a unified generative model. To the best of our knowledge, this is the first end-to-end generative model that significantly surpasses current complex and well-designed recommender systems in real-world scenarios. Specifically, OneRec includes: 1) an encoder-decoder structure, which encodes the user's historical behavior sequences and gradually decodes the videos the user may be interested in. We adopt sparse Mixture-of-Experts (MoE) to scale model capacity without proportionally increasing computational FLOPs. 2) A session-wise generation approach. In contrast to traditional next-item prediction, session-wise generation is more elegant and contextually coherent than point-by-point generation, which relies on hand-crafted rules to combine the generated results properly. 3) An Iterative Preference Alignment module combined with Direct Preference Optimization (DPO) to enhance the quality of the generated results. Unlike DPO in NLP, a recommendation system typically has only one opportunity to display results for each user's browsing request, making it impossible to obtain positive and negative samples simultaneously. To address this limitation, we design a reward model to simulate user feedback and customize the sampling strategy. Extensive experiments demonstrate that a limited number of DPO samples can align user interest preferences and significantly improve the quality of the generated results. We deployed OneRec in the main scene of Kuaishou, achieving a 1.6% increase in watch time, which is a substantial improvement.
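
The abstract's key workaround for DPO in recommendation is that a chosen and a rejected session cannot both be shown to the same user for one request, so preference pairs are constructed offline with a reward model over sampled candidates. The sketch below illustrates one plausible way to do this; it is not the paper's implementation, and `generator.sample`, `reward_model`, and the session-level log-probability inputs are hypothetical interfaces assumed for illustration.

```python
# Minimal sketch (not the authors' code): sample several candidate sessions,
# score them with a reward model, and use the best/worst pair for DPO.
# `generator`, `reward_model`, and their interfaces are hypothetical.
import torch
import torch.nn.functional as F


def build_preference_pair(generator, reward_model, user_history, num_samples=8):
    """Return (chosen, rejected) sessions ranked by the reward model."""
    sessions = [generator.sample(user_history) for _ in range(num_samples)]
    rewards = torch.stack([reward_model(user_history, s) for s in sessions])
    chosen = sessions[rewards.argmax().item()]
    rejected = sessions[rewards.argmin().item()]
    return chosen, rejected


def dpo_loss(policy_logp_chosen, policy_logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Standard DPO objective applied to session-level log-probabilities."""
    margin = (policy_logp_chosen - ref_logp_chosen) \
           - (policy_logp_rejected - ref_logp_rejected)
    return -F.logsigmoid(beta * margin).mean()
```

In this setup the reward model stands in for the rejected impression that can never be collected online, which is consistent with the abstract's claim that a limited number of such DPO pairs suffices to align generation with user interest preferences.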
