Controllable Preference Optimization: Toward Controllable Multi-Objective Alignment
CoRR (2024)
Abstract
Alignment in artificial intelligence pursues consistency between model
responses and human preferences and values. In practice, the multifaceted
nature of human preferences inadvertently introduces what is known as the
"alignment tax": a compromise where enhancements in alignment on one
objective (e.g., harmlessness) can diminish performance on others
(e.g., helpfulness). However, existing alignment techniques are mostly
unidirectional, leading to suboptimal trade-offs and poor flexibility across
objectives. To navigate this challenge, we argue for the importance of
grounding LLMs in explicit preferences. We introduce controllable preference
optimization (CPO), which explicitly specifies preference scores for different
objectives, thereby guiding the model to generate responses that meet the
specified requirements. Our experimental analysis reveals that the aligned
models can provide responses matching various preferences among the "3H"
(helpfulness, honesty, harmlessness) desiderata. Furthermore, by introducing
diverse data and alignment goals, we surpass baseline methods in aligning with
single objectives, hence mitigating the impact of the alignment tax and
achieving Pareto improvements in multi-objective alignment.
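
For intuition, a minimal Python sketch of the kind of explicit preference conditioning the abstract describes: preference scores for each objective are made visible to the model alongside the query. The control-token format, the 1-5 score scale, and the helper name `build_cpo_prompt` are illustrative assumptions, not details taken from the paper.

```python
# Hypothetical sketch of CPO-style conditioning. The abstract states that CPO
# "explicitly specifies preference scores for different objectives"; the exact
# token format and score range below are assumptions for illustration.

def build_cpo_prompt(user_query: str, helpful: int, honest: int, harmless: int) -> str:
    """Prepend explicit 3H preference scores (assumed 1-5 scale) to a user query."""
    scores = {"helpfulness": helpful, "honesty": honest, "harmlessness": harmless}
    for name, score in scores.items():
        if not 1 <= score <= 5:
            raise ValueError(f"{name} score must be in [1, 5], got {score}")
    # Encode the desired trade-off as control tokens the model is trained to follow.
    control = " ".join(f"<{name}:{score}>" for name, score in scores.items())
    return f"{control}\n{user_query}"


# Example: request a response that maximizes all three objectives.
print(build_cpo_prompt("How do I pick a strong password?",
                       helpful=5, honest=5, harmless=5))
```

Under this framing, varying the scores at inference time is what makes the alignment controllable: the same model can be steered toward different points on the helpfulness/harmlessness trade-off without retraining.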