Massively Multi-Cultural Knowledge Acquisition & LM Benchmarking
CoRR (2024)
Abstract
Pretrained large language models have revolutionized many applications but
still face challenges related to cultural bias and a lack of cultural
commonsense knowledge crucial for guiding cross-culture communication and
interactions. Recognizing the shortcomings of existing methods in capturing the
diverse and rich cultures across the world, this paper introduces a novel
approach for massively multicultural knowledge acquisition. Specifically, our
method strategically navigates from densely informative Wikipedia documents on
cultural topics to an extensive network of linked pages. Using this
data source, we construct the CultureAtlas dataset,
which covers a wide range of sub-country level geographical regions and
ethnolinguistic groups. Data cleaning and preprocessing ensure that each textual
assertion sentence is self-contained, and fine-grained cultural profile
information is extracted. Our dataset not only facilitates the evaluation of
language model performance in culturally diverse contexts but also serves as a
foundational tool for the development of culturally sensitive and aware
language models. Our work marks an important step towards understanding
and bridging cultural disparities in AI, promoting a more
inclusive and balanced representation of global cultures in the digital domain.
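
The navigation from cultural-topic documents to their linked pages, as described in the abstract, can be pictured as a bounded breadth-first expansion over a link graph. The following is only a minimal sketch under stated assumptions, not the authors' pipeline: the function name `collect_linked_pages`, the `max_hops` parameter, and the toy link graph are all hypothetical, and a real system would read links from Wikipedia rather than from an in-memory dictionary.

```python
from collections import deque


def collect_linked_pages(seed_pages, links, max_hops=1):
    """Breadth-first expansion from seed pages over a link graph.

    `links` maps a page title to the titles it links to (a toy stand-in
    for a real link structure). Returns {title: hop distance from a seed}.
    """
    distance = {page: 0 for page in seed_pages}
    queue = deque(seed_pages)
    while queue:
        page = queue.popleft()
        if distance[page] >= max_hops:
            continue  # do not expand pages already at the hop limit
        for target in links.get(page, []):
            if target not in distance:  # skip pages already collected
                distance[target] = distance[page] + 1
                queue.append(target)
    return distance


# Hypothetical, made-up link graph (not real Wikipedia data).
toy_links = {
    "Culture of X": ["Cuisine of X", "Festivals of X"],
    "Cuisine of X": ["Dish A", "Culture of X"],
    "Festivals of X": ["Festival B"],
}

pages = collect_linked_pages(["Culture of X"], toy_links, max_hops=1)
```

Raising `max_hops` widens the collected network at the cost of drifting further from the seed topic; how far to expand, and how to filter the collected pages, are design choices this sketch leaves open.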