Evaluating Correctness and Faithfulness of Instruction-Following Models for Question Answering
arxiv(2023)
摘要
Retriever-augmented instruction-following models are attractive alternatives
to fine-tuned approaches for information-seeking tasks such as question
answering (QA). By simply prepending retrieved documents in its input along
with an instruction, these models can be adapted to various information domains
and tasks without additional fine-tuning. While the model responses tend to be
natural and fluent, the additional verbosity makes traditional QA evaluation
metrics such as exact match (EM) and F1 unreliable for accurately quantifying
model performance.
In this work, we investigate the performance of instruction-following models
across three information-seeking QA tasks. We use both automatic and human
evaluation to evaluate these models along two dimensions: 1) how well they
satisfy the user's information need (correctness), and 2) whether they produce
a response based on the provided knowledge (faithfulness). Guided by human
evaluation and analysis, we highlight the shortcomings of traditional metrics
for both correctness and faithfulness. We then propose simple token-overlap
based and model-based metrics that reflect the true performance of these
models. Our analysis reveals that instruction-following models are competitive,
and sometimes even outperform fine-tuned models for correctness. However, these
models struggle to stick to the provided knowledge and often hallucinate in
their responses. We hope our work encourages a more holistic evaluation of
instruction-following models for QA. Our code and data is available at
https://github.com/McGill-NLP/instruct-qa
更多查看译文
AI 理解论文
溯源树
样例
生成溯源树,研究论文发展脉络
Chat Paper
正在生成论文摘要