CLUE: A Clinical Language Understanding Evaluation for LLMs
arXiv (2024)
Abstract
Large Language Models (LLMs) have shown the potential to significantly
contribute to patient care, diagnostics, and administrative processes. Emerging
biomedical LLMs address healthcare-specific challenges, including privacy
demands and computational constraints. However, evaluation of these models has
primarily been limited to non-clinical tasks, which do not reflect the
complexity of practical clinical applications. Additionally, there has been no
thorough comparison between biomedical and general-domain LLMs for clinical
tasks. To fill this gap, we present the Clinical Language Understanding
Evaluation (CLUE), a benchmark tailored to evaluate LLMs on real-world clinical
tasks. CLUE includes two novel datasets derived from MIMIC-IV discharge letters
and four existing tasks designed to test the practical applicability of LLMs in
healthcare settings. Our evaluation covers several biomedical and general
domain LLMs, providing insights into their clinical performance and
applicability. CLUE represents a step towards a standardized approach to
evaluating and developing LLMs in healthcare to align future model development
with the real-world needs of clinical application. We publish our evaluation
and data generation scripts: https://github.com/dadaamin/CLUE