You don't need a personality test to know these models are unreliable: Assessing the Reliability of Large Language Models on Psychometric Instruments
arXiv (2023)
Abstract
The versatility of Large Language Models (LLMs) on natural language
understanding tasks has made them popular for research in social sciences. To
properly understand the properties and innate personas of LLMs, researchers
have performed studies that involve using prompts in the form of questions that
ask LLMs about particular opinions. In this study, we take a cautionary step
back and examine whether the current format of prompting LLMs elicits responses
in a consistent and robust manner. We first construct a dataset that contains
693 questions encompassing 39 different instruments of persona measurement on
115 persona axes. Additionally, we design a set of prompts with minor
variations to examine LLMs' ability to generate answers, as well as
content-level variations, such as switching the order of response options or
negating the statement, to examine their consistency. Our experiments on 17
different LLMs reveal that even simple
perturbations significantly downgrade a model's question-answering ability, and
that most LLMs have low negation consistency. Our results suggest that the
currently widespread practice of prompting is insufficient to accurately and
reliably capture model perceptions, and we therefore discuss potential
alternatives to improve these issues.
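The content-level perturbations described in the abstract can be illustrated with a minimal sketch. This is not the authors' code; the function names, the statement, and the Likert options below are illustrative assumptions, and the naive string negation is only a stand-in for the careful negation the study would require.

```python
# Sketch of two content-level prompt perturbations from the abstract:
# reversing the order of response options, and negating the statement.
# All names and data here are illustrative, not from the paper.

def build_prompt(statement, options):
    """Format a persona statement as a multiple-choice prompt."""
    opts = "\n".join(f"{chr(65 + i)}. {o}" for i, o in enumerate(options))
    return f"Statement: {statement}\nHow much do you agree?\n{opts}"

def reverse_options(options):
    """Order perturbation: present the response options in reverse."""
    return list(reversed(options))

def negate(statement):
    """Naive negation perturbation (real negation needs linguistic care)."""
    return f"It is not the case that {statement[0].lower()}{statement[1:]}"

statement = "I enjoy meeting new people."
options = ["Strongly agree", "Agree", "Disagree", "Strongly disagree"]

original = build_prompt(statement, options)
reordered = build_prompt(statement, reverse_options(options))
negated = build_prompt(negate(statement), options)
```

A consistency check would then compare a model's answers across `original`, `reordered`, and `negated`; the paper's finding is that answers often change under exactly such minor variations.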