DPO Kernels: A Semantically-Aware, Kernel-Enhanced, and Divergence-Rich Paradigm for Direct Preference Optimization

Abstract

The rapid rise of large language models (LLMs) has unlocked many applications but has also underscored the challenge of aligning them with diverse values and preferences. Direct Preference Optimization (DPO) is central to alignment but is constrained by fixed divergences and limited feature transformations. We propose DPO-Kernels, which integrates kernel methods to address these issues through four key contributions: (i) Kernelized Representations with polynomial, RBF, Mahalanobis, and spectral kernels for richer transformations, plus a hybrid loss combining embedding-based and probability-based objectives; (ii) Divergence Alternatives (Jensen-Shannon, Hellinger, Rényi, Bhattacharyya, Wasserstein, and f-divergences) for greater stability; (iii) Data-Driven Selection metrics that automatically choose the best kernel-divergence pair; and (iv) a Hierarchical Mixture of Kernels for both local precision and global modeling. Evaluations on 12 datasets demonstrate state-of-the-art performance in factuality, safety, reasoning, and instruction following. Grounded in Heavy-Tailed Self-Regularization, DPO-Kernels maintains robust generalization for LLMs, offering a comprehensive resource for further alignment research.
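
For readers unfamiliar with the building blocks named in contributions (i) and (ii), the following is a minimal sketch of two of the listed kernels and two of the listed divergences, using their standard textbook definitions. It is not the DPO-Kernels implementation: the function names, the default hyperparameters (`degree`, `c`, `sigma`), and the toy inputs are assumptions chosen for illustration, and the hybrid loss and selection metrics described above are not reproduced here.

```python
import math


def polynomial_kernel(x, y, degree=2, c=1.0):
    """Polynomial kernel: (x . y + c) ** degree."""
    dot = sum(a * b for a, b in zip(x, y))
    return (dot + c) ** degree


def rbf_kernel(x, y, sigma=1.0):
    """RBF (Gaussian) kernel: exp(-||x - y||^2 / (2 * sigma^2))."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq_dist / (2.0 * sigma ** 2))


def kl_divergence(p, q):
    """KL(p || q) for discrete distributions; terms with p_i == 0 contribute 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def jensen_shannon(p, q):
    """Jensen-Shannon divergence (natural log), bounded above by ln 2."""
    m = [(pi + qi) / 2.0 for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)


def hellinger(p, q):
    """Hellinger distance between discrete distributions, in [0, 1]."""
    s = sum((math.sqrt(pi) - math.sqrt(qi)) ** 2 for pi, qi in zip(p, q))
    return math.sqrt(s / 2.0)


if __name__ == "__main__":
    # Toy embedding vectors and toy output distributions (illustrative only).
    chosen_emb, rejected_emb = [1.0, 2.0], [3.0, 4.0]
    p, q = [0.7, 0.2, 0.1], [0.1, 0.3, 0.6]

    print("polynomial:", polynomial_kernel(chosen_emb, rejected_emb))
    print("rbf:", rbf_kernel(chosen_emb, rejected_emb))
    print("jensen-shannon:", jensen_shannon(p, q))
    print("hellinger:", hellinger(p, q))
```

Unlike the KL divergence, both divergences shown here are symmetric and bounded, which is one common reason such alternatives are considered when training stability is a concern.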

