Societal Alignment Frameworks Can Improve LLM Alignment
February 27, 2025
Authors: Karolina Stańczak, Nicholas Meade, Mehar Bhatia, Hattie Zhou, Konstantin Böttinger, Jeremy Barnes, Jason Stanley, Jessica Montgomery, Richard Zemel, Nicolas Papernot, Nicolas Chapados, Denis Therien, Timothy P. Lillicrap, Ana Marasović, Sylvie Delacroix, Gillian K. Hadfield, Siva Reddy
cs.AI
Abstract
Recent progress in large language models (LLMs) has focused on producing
responses that meet human expectations and align with shared values, a process
termed alignment. However, aligning LLMs remains challenging due to the
inherent disconnect between the complexity of human values and the narrow
nature of the technological approaches designed to address them. Current
alignment methods often lead to misspecified objectives, reflecting the broader
issue of incomplete contracts: the impracticality of specifying a contract
between a model developer and the model that accounts for every scenario in
LLM alignment. In this paper, we argue that improving LLM alignment requires
incorporating insights from societal alignment frameworks, including social,
economic, and contractual alignment, and discuss potential solutions drawn from
these domains. Given the role of uncertainty within societal alignment
frameworks, we then investigate how it manifests in LLM alignment. We end our
discussion by offering an alternative view on LLM alignment, framing the
underspecified nature of its objectives as an opportunity rather than a defect
to be fixed by perfecting their specification. Beyond technical improvements in
LLM alignment, we discuss
the need for participatory alignment interface designs.