The Alignment Problem from a Deep Learning Perspective
arXiv (2022)
Abstract
In coming years or decades, artificial general intelligence (AGI) may surpass
human capabilities at many critical tasks. We argue that, without substantial
effort to prevent it, AGIs could learn to pursue goals that are in conflict
(i.e. misaligned) with human interests. If trained like today's most capable
models, AGIs could learn to act deceptively to receive higher reward, learn
misaligned internally-represented goals which generalize beyond their
fine-tuning distributions, and pursue those goals using power-seeking
strategies. We review emerging evidence for these properties. AGIs with these
properties would be difficult to align and may appear aligned even when they
are not. Finally, we briefly outline how the deployment of misaligned AGIs
might irreversibly undermine human control over the world, and we review
research directions aimed at preventing this outcome.