FSPO: Few-Shot Preference Optimization of Synthetic Preference Data in LLMs Elicits Effective Personalization to Real Users
February 26, 2025
Authors: Anikait Singh, Sheryl Hsu, Kyle Hsu, Eric Mitchell, Stefano Ermon, Tatsunori Hashimoto, Archit Sharma, Chelsea Finn
cs.AI
Abstract
Effective personalization of LLMs is critical for a broad range of
user-interfacing applications such as virtual assistants and content curation.
Inspired by the strong in-context learning capabilities of LLMs, we propose
Few-Shot Preference Optimization (FSPO), which reframes reward modeling as a
meta-learning problem. Under this framework, an LLM learns to quickly adapt to
a user via a few labeled preferences from that user, constructing a
personalized reward function for them. Additionally, since real-world
preference data is scarce and challenging to collect at scale, we propose
careful design choices to construct synthetic preference datasets for
personalization, generating over 1M synthetic personalized preferences using
publicly available LLMs. In particular, to successfully transfer from synthetic
data to real users, we find it crucial for the data to exhibit both high
diversity and coherent, self-consistent structure. We evaluate FSPO on
personalized open-ended generation for up to 1,500 synthetic users across
three domains: movie reviews, pedagogical adaptation based on
educational background, and general question answering, along with a controlled
human study. Overall, FSPO achieves an 87% Alpaca Eval winrate on average in
generating responses that are personalized to synthetic users and a 72% winrate
with real human users in open-ended question answering.
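The reframing of reward modeling as few-shot, in-context adaptation suggests a simple pipeline: a user's labeled preference pairs are serialized into the model's context, and the model is then asked to judge new candidate responses for that user. The sketch below illustrates this idea only; the prompt template, the PreferencePair structure, and all function names are assumptions for illustration, not the paper's released code.

```python
# Minimal sketch (not the authors' implementation) of conditioning a
# personalized reward function on a user's few labeled preferences.
# All names and the prompt format here are hypothetical.

from dataclasses import dataclass
from typing import List


@dataclass
class PreferencePair:
    prompt: str    # user query
    chosen: str    # response the user preferred
    rejected: str  # response the user did not prefer


def build_fewshot_context(pairs: List[PreferencePair]) -> str:
    """Serialize a user's labeled preference pairs into an in-context prefix."""
    blocks = []
    for i, p in enumerate(pairs, 1):
        blocks.append(
            f"Example {i}\n"
            f"Query: {p.prompt}\n"
            f"Preferred response: {p.chosen}\n"
            f"Dispreferred response: {p.rejected}\n"
        )
    return "\n".join(blocks)


def personalized_reward_prompt(pairs: List[PreferencePair],
                               query: str, candidate: str) -> str:
    """Build a prompt asking the model to judge a candidate response for this user."""
    return (
        "You are scoring responses for a specific user. Their past "
        "preferences are shown below.\n\n"
        + build_fewshot_context(pairs)
        + f"\nNew query: {query}\n"
        + f"Candidate response: {candidate}\n"
        + "Would this user prefer the candidate response? Answer yes or no."
    )


if __name__ == "__main__":
    history = [
        PreferencePair(
            prompt="Recommend a movie for tonight.",
            chosen="You might enjoy a character-driven indie drama.",
            rejected="The top box-office action movie is playing now.",
        ),
    ]
    print(personalized_reward_prompt(
        history,
        query="Suggest a weekend read.",
        candidate="A quiet literary novel with strong characters.",
    ))
```

In the paper's setup such few-shot prompts would be used for meta-training across many (synthetic) users, so that the model learns to infer a new user's preferences from only a handful of labeled pairs.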