A Roadmap to Pluralistic Alignment
CoRR (2024)
Abstract
With increased power and prevalence of AI systems, it is ever more critical
that AI systems are designed to serve all, i.e., people with diverse values and
perspectives. However, aligning models to serve pluralistic human values
remains an open research question. In this piece, we propose a roadmap to
pluralistic alignment, specifically using language models as a test bed. We
identify and formalize three possible ways to define and operationalize
pluralism in AI systems: 1) Overton pluralistic models that present a spectrum
of reasonable responses; 2) Steerably pluralistic models that can steer to
reflect certain perspectives; and 3) Distributionally pluralistic models that
are well-calibrated to a given population in distribution. We also propose and
formalize three possible classes of pluralistic benchmarks: 1) Multi-objective
benchmarks, 2) Trade-off steerable benchmarks, which incentivize models to
steer to arbitrary trade-offs, and 3) Jury-pluralistic benchmarks which
explicitly model diverse human ratings. We use this framework to argue that
current alignment techniques may be fundamentally limited for pluralistic AI;
indeed, we highlight empirical evidence, both from our own experiments and from
other work, that standard alignment procedures might reduce distributional
pluralism in models, motivating the need for further research on pluralistic
alignment.