FactPICO: Factuality Evaluation for Plain Language Summarization of Medical Evidence
CoRR (2024)
Abstract
Plain language summarization with LLMs can be useful for improving textual
accessibility of technical content. But how factual are these summaries in a
high-stakes domain like medicine? This paper presents FactPICO, a factuality
benchmark for plain language summarization of medical texts describing
randomized controlled trials (RCTs), which are the basis of evidence-based
medicine and can directly inform patient treatment. FactPICO consists of 345
plain language summaries of RCT abstracts generated by three LLMs (GPT-4,
Llama-2, and Alpaca), with fine-grained evaluation and natural language
rationales from experts. We assess the factuality of critical elements of RCTs
in those summaries: Populations, Interventions, Comparators, Outcomes (PICO),
as well as the reported findings concerning these. We also evaluate the
correctness of the extra information (e.g., explanations) added by LLMs. Using
FactPICO, we benchmark a range of existing factuality metrics, including
newly devised LLM-based ones. We find that plain language summarization of
medical evidence remains challenging, especially when balancing simplicity
against factuality, and that existing metrics correlate poorly with
expert judgments at the instance level.
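
To make "instance-level correlation" concrete, the following is a minimal sketch of how one might compare an automatic metric's per-summary scores against expert factuality judgments. It assumes Spearman rank correlation as the statistic, which the abstract does not specify, and the function names and toy scores are hypothetical illustrations, not data or code from FactPICO.

```python
from statistics import mean


def average_ranks(values):
    """Rank values from 1 to n; tied values share the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of items tied with the item at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # mean of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / (vx * vy) ** 0.5


def spearman(xs, ys):
    """Spearman correlation: Pearson correlation computed on average ranks."""
    if len(xs) != len(ys):
        raise ValueError("inputs must have the same length")
    return pearson(average_ranks(xs), average_ranks(ys))


# Hypothetical toy data: one entry per summary (one "instance").
metric_scores = [0.91, 0.40, 0.75, 0.62, 0.88]  # automatic metric output
expert_scores = [1, 0, 1, 1, 0]                 # expert label, 1 = factual

rho = spearman(metric_scores, expert_scores)
print(f"instance-level Spearman correlation: {rho:.3f}")
```

A value near 0 would indicate that the metric's ranking of individual summaries carries little information about the expert ranking, even if the metric's average score per system looks reasonable.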