Exploring Multilingual Concepts of Human Value in Large Language Models: Is Value Alignment Consistent, Transferable and Controllable across Languages?
arXiv (2024)
Abstract
Prior research in representation engineering has revealed that LLMs encode
concepts within their representation spaces, predominantly centered around
English. In this study, we extend this philosophy to a multilingual scenario,
delving into multilingual human value concepts in LLMs. Through our
comprehensive exploration covering 7 types of human values, 16 languages and 3
LLM series with distinct multilinguality, we empirically substantiate the
existence of multilingual human values in LLMs. Further cross-lingual analysis
on these concepts discloses 3 traits arising from language resource
disparities: cross-lingual inconsistency, distorted linguistic relationships,
and unidirectional cross-lingual transfer between high- and low-resource
languages, all in terms of human value concepts. Additionally, we validate the
feasibility of cross-lingual control over value alignment capabilities of LLMs,
leveraging the dominant language as a source language. Drawing from our
findings on multilingual value alignment, we prudently provide suggestions on
the composition of multilingual data for LLM pre-training: including a limited
number of dominant languages for cross-lingual alignment transfer while
avoiding their excessive prevalence, and keeping a balanced distribution of
non-dominant languages. We hope that our findings will contribute to
enhancing the safety and utility of multilingual AI.