MRKE: The Multi-hop Reasoning Evaluation of LLMs by Knowledge Edition
CoRR (2024)
Abstract
Although Large Language Models (LLMs) have shown strong performance on Multi-hop Question Answering (MHQA) tasks, their real reasoning ability remains underexplored. Current LLM QA evaluation benchmarks have shown limitations, including 1) data contamination: the evaluation data are potentially exposed to LLMs during the pretraining stage; and 2) neglect of reasoning chain evaluation. We therefore introduce an LLM MHQA evaluation benchmark, the first QA benchmark based on new, unprecedented knowledge obtained by editing the off-the-shelf HotpotQA dataset. In addition, we annotate and evaluate the reasoning chain in the form of sub-questions and intermediate answers corresponding to the multi-hop questions. We observe that 1) LLMs show a performance gap between the original HotpotQA and our edited data, suggesting that current MHQA benchmarks carry a potential risk of data contamination that makes it hard to evaluate LLMs' performance objectively and scientifically; and 2) LLMs produce the correct reasoning chain for only a small percentage of questions, e.g., GPT-4 gets the right reasoning chain only 36.3% of the time. We believe this new multi-hop QA evaluation benchmark and the novel evaluation methods will facilitate the development of trustworthy LLM evaluation on the MHQA task.
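To make the described setup concrete, below is a minimal sketch of what a single knowledge-edited MHQA benchmark entry and a reasoning-chain check might look like. The field names, the example question, and the exact-match scoring are hypothetical illustrations, not the paper's actual schema, data, or evaluation protocol.

```python
# Hypothetical illustration of a knowledge-edited multi-hop QA entry with an
# annotated reasoning chain (sub-questions and intermediate answers).
example = {
    # Multi-hop question whose supporting fact has been edited so the answer
    # cannot simply be recalled from pretraining data.
    "edited_question": "Which country is the director of Film X a citizen of?",
    "edited_answer": "Country B",      # answer under the edited knowledge
    "original_answer": "Country A",    # answer under the original HotpotQA facts
    # Annotated reasoning chain for the edited question.
    "reasoning_chain": [
        {"sub_question": "Who directed Film X?",
         "intermediate_answer": "Director Y"},
        {"sub_question": "What is Director Y's citizenship after the edit?",
         "intermediate_answer": "Country B"},
    ],
}


def chain_is_correct(prediction: dict, gold: dict) -> bool:
    """Count a model's reasoning chain as correct only if every intermediate
    answer and the final answer match the gold annotation (simple exact match
    here; the paper's actual scoring may differ)."""
    if len(prediction["reasoning_chain"]) != len(gold["reasoning_chain"]):
        return False
    steps_ok = all(
        p["intermediate_answer"].strip().lower() == g["intermediate_answer"].strip().lower()
        for p, g in zip(prediction["reasoning_chain"], gold["reasoning_chain"])
    )
    final_ok = (prediction["edited_answer"].strip().lower()
                == gold["edited_answer"].strip().lower())
    return steps_ok and final_ok
```

This kind of check is stricter than answer-only accuracy: a model that reaches the right final answer through a wrong intermediate step is still scored as having an incorrect reasoning chain.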