Evaluation of Hallucination and Robustness for Large Language Models

Rui Hu, Junhao Zhong, Minjie Ding, Zeyu Ma, Mingang Chen

International Conference on Software Quality, Reliability and Security (2023)

Abstract
As large language models (LLMs) rapidly advance, rigorous testing and evaluation of these models grows increasingly crucial. To address this need, we developed three types of questions: Chinese-contextual, English-contextual, and language-context-independent. Testing in both Chinese and English probes the LLMs' hallucination tendencies. We investigate the impact of language on hallucination from two perspectives: the language used in the input prompt and the cultural context underlying the prompt's content. Additionally, 52 multi-domain single-choice questions from C-EVAL are presented in original and randomized order to assess robustness to perturbations. Among the five LLMs tested, GPT-4 demonstrates the strongest anti-hallucination and robustness capabilities, answering with greater accuracy, consistency, and reliability. ChatGLM ranks second and outperforms GPT-4 on Chinese context-dependent questions. Emergent testing phenomena are analyzed from the user's perspective. Hallucinated responses are categorized, and potential causal factors leading to hallucination and fragility are examined. Based on these findings, viable avenues for improvement are proposed.
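The order-perturbation robustness check can be sketched roughly as follows. This is a minimal illustration only, assuming the perturbation re-orders the questions between two passes and that consistency is measured per question; `ask_model` and the question format are hypothetical placeholders, not the paper's actual evaluation harness.

```python
# Minimal sketch of an order-perturbation consistency check (assumed protocol).
# `ask_model` is a hypothetical stand-in for the LLM API under test; the
# paper's actual prompts, harness, and scoring are not specified here.
import random

def answer_consistency(questions, ask_model, seed=0):
    """Present the same single-choice questions in original and shuffled
    order and report how often the model gives the same answer per question."""
    rng = random.Random(seed)
    shuffled = questions[:]
    rng.shuffle(shuffled)

    # Each question is assumed to carry a unique "id"; ask_model returns a letter like "B".
    first_pass = {q["id"]: ask_model(q) for q in questions}
    second_pass = {q["id"]: ask_model(q) for q in shuffled}

    same = sum(1 for qid in first_pass if first_pass[qid] == second_pass[qid])
    return same / len(questions)
```

A robust model would keep this consistency ratio near 1.0, while a fragile one would change answers simply because the surrounding presentation order changed.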
Keywords
large language model, hallucination, robustness, evaluation