Global-Liar: Factuality of LLMs over Time and Geographic Regions
CoRR (2024)
Abstract
The increasing reliance on AI-driven solutions, particularly Large Language
Models (LLMs) like the GPT series, for information retrieval highlights the
critical need for their factuality and fairness, especially amidst the rampant
spread of misinformation and disinformation online. Our study evaluates the
factual accuracy, stability, and biases in widely adopted GPT models, including
GPT-3.5 and GPT-4, contributing to the reliability and integrity of AI-mediated
information dissemination.
We introduce 'Global-Liar,' a dataset uniquely balanced in terms of
geographic and temporal representation, facilitating a more nuanced evaluation
of LLM biases. Our analysis reveals that newer iterations of GPT models do not
always equate to improved performance. Notably, the GPT-4 version from March
demonstrates higher factual accuracy than its subsequent June release.
Furthermore, a concerning bias is observed, privileging statements from the
Global North over the Global South, thus potentially exacerbating existing
informational inequities. Regions such as Africa and the Middle East are at a
disadvantage, with much lower factual accuracy. The performance fluctuations
over time suggest that model updates may not consistently benefit all regions
equally.
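The regional comparison described above amounts to grouping fact-checking verdicts by region and computing accuracy per group. The following is a minimal sketch under stated assumptions: the record layout `(region, predicted, gold)` and the function name `accuracy_by_region` are hypothetical and not taken from the paper, and any data passed in is the caller's own.

```python
from collections import defaultdict


def accuracy_by_region(records):
    """Compute per-region accuracy.

    `records` is an iterable of (region, predicted_label, gold_label)
    tuples. This layout is an assumption made for illustration only.
    """
    totals = defaultdict(int)
    correct = defaultdict(int)
    for region, predicted, gold in records:
        totals[region] += 1
        # Count a hit only when the model verdict equals the gold label.
        correct[region] += int(predicted == gold)
    return {region: correct[region] / totals[region] for region in totals}
```

Comparing the resulting values across regions, and across model versions, gives the kind of gap the abstract reports between the Global North and the Global South.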
Our study also offers insights into the impact of various LLM configuration
settings, such as binary decision forcing, model re-runs and temperature, on
model factuality. Models constrained to binary (true/false) choices exhibit
reduced factuality compared to those allowing an 'unclear' option. A single
inference at a low temperature setting matches the reliability of majority
voting across various configurations. The insights gained highlight the need
for culturally diverse and geographically inclusive model training and
evaluation. This approach is key to achieving global equity in technology,
distributing AI benefits fairly worldwide.
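The majority-voting configuration mentioned above can be illustrated with a short sketch. It assumes each model re-run returns one of the labels 'true', 'false', or 'unclear'. The function name `majority_vote` and the tie-breaking rule (fall back to 'unclear') are assumptions made for illustration, not the authors' stated procedure.

```python
from collections import Counter


def majority_vote(labels, tie_label="unclear"):
    """Return the most frequent label across model re-runs.

    Falling back to `tie_label` when the top two labels are tied is an
    assumed rule, chosen here only to keep the sketch deterministic.
    """
    if not labels:
        raise ValueError("need at least one label")
    ranked = Counter(labels).most_common()
    # A tie between the two most frequent labels yields no clear majority.
    if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
        return tie_label
    return ranked[0][0]
```

With a binary-forced configuration, the 'unclear' label would simply never appear among the inputs, so the same function covers both settings.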