A Data-Centric Approach To Generate Faithful and High Quality Patient Summaries with Large Language Models
CoRR(2024)
摘要
Patients often face difficulties in understanding their hospitalizations,
while healthcare workers have limited resources to provide explanations. In
this work, we investigate the potential of large language models to generate
patient summaries based on doctors' notes and study the effect of training data
on the faithfulness and quality of the generated summaries. To this end, we
develop a rigorous labeling protocol for hallucinations, and have two medical
experts annotate 100 real-world summaries and 100 generated summaries. We show
that fine-tuning on hallucination-free data effectively reduces hallucinations
from 2.60 to 1.55 per summary for Llama 2, while preserving relevant
information. Although the effect is still present, it is much smaller for GPT-4
when prompted with five examples (0.70 to 0.40). We also conduct a qualitative
evaluation using hallucination-free and improved training data. GPT-4 shows
very good results even in the zero-shot setting. We find that common
quantitative metrics do not correlate well with faithfulness and quality.
Finally, we test GPT-4 for automatic hallucination detection, which yields
promising results.
更多查看译文
AI 理解论文
溯源树
样例
生成溯源树,研究论文发展脉络
Chat Paper
正在生成论文摘要