Hidden Flaws Behind Expert-Level Accuracy of GPT-4 Vision in Medicine
arXiv (2024)
Abstract
Recent studies indicate that Generative Pre-trained Transformer 4 with Vision
(GPT-4V) outperforms human physicians in medical challenge tasks. However,
these evaluations primarily focused on the accuracy of multi-choice questions
alone. Our study extends the current scope by conducting a comprehensive
analysis of GPT-4V's rationales of image comprehension, recall of medical
knowledge, and step-by-step multimodal reasoning when solving New England
Journal of Medicine (NEJM) Image Challenges - an imaging quiz designed to test
the knowledge and diagnostic capabilities of medical professionals. Evaluation
results confirmed that GPT-4V performs comparably to human physicians
regarding multi-choice accuracy (81.6%). GPT-4V also performs well in
cases where physicians answer incorrectly, with over 78% accuracy. However, we
discovered that GPT-4V frequently presents flawed rationales in cases where it
makes the correct final choices (35.5%), most prominently in image comprehension
(27.2%). Our
findings emphasize the necessity for further in-depth evaluations of its
rationales before integrating such multimodal AI models into clinical
workflows.