Performance evaluation of ChatGPT, GPT-4, and Bard on the official board examination of the Japan Radiology Society

Yoshitaka Toyama, Ayaka Harigai, Mirei Abe, Mitsutoshi Nagano, Masahiro Kawabata, Yasuhiro Seki, Kei Takase

Japanese Journal of Radiology (2024)

Abstract
Purpose: Herein, we assessed the accuracy of large language models (LLMs) in generating responses to questions in clinical radiology practice. We compared the performance of ChatGPT, GPT-4, and Google Bard using questions from the Japan Radiology Board Examination (JRBE).

Materials and methods: In total, 103 questions from the JRBE 2022 were used with permission from the Japan Radiological Society. These questions were categorized by pattern, required level of thinking, and topic. McNemar's test was used to compare the proportion of correct responses between the LLMs. Fisher's exact test was used to assess the performance of GPT-4 for each topic category.

Results: ChatGPT, GPT-4, and Google Bard correctly answered 40.8% (42 of 103), 65.0% (67 of 103), and 38.8% (40 of 103) of the questions, respectively. GPT-4 significantly outperformed ChatGPT by 24.2% (p < 0.001) and Google Bard by 26.2% (p < 0.001). In the categorical analysis by level of thinking, GPT-4 correctly answered 79.7% of the lower-order questions, which was significantly higher than ChatGPT or Google Bard (p < 0.001). The categorical analysis by question pattern revealed GPT-4's superiority over ChatGPT (67.4% vs. 46.5%, p = 0.004) and Google Bard (39.5%, p < 0.001) in the single-answer questions. The categorical analysis by topic revealed that GPT-4 outperformed ChatGPT (40%, p = 0.013) and Google Bard (26.7%, p = 0.004). No significant differences were observed between the LLMs in the categories not mentioned above. The performance of GPT-4 was significantly better in nuclear medicine (93.3%) than in diagnostic radiology (55.8%; p < 0.001). GPT-4 also performed better on lower-order questions than on higher-order questions (79.7% vs. 45.5%, p < 0.001).

Conclusion: ChatGPT Plus, based on GPT-4, scored 65% when answering Japanese questions from the JRBE, outperforming ChatGPT and Google Bard. This highlights the potential of using LLMs to address advanced clinical questions in the field of radiology in Japan.
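To make the statistical comparisons concrete, below is a minimal Python sketch of the two analyses the abstract describes: McNemar's test on paired per-question correctness for two models, and Fisher's exact test on per-topic correct/incorrect counts. The response arrays and counts are hypothetical placeholders, not the paper's data.

```python
# Minimal sketch (hypothetical data): McNemar's test for paired model
# accuracy and Fisher's exact test for per-topic accuracy.
import numpy as np
from scipy.stats import fisher_exact
from statsmodels.stats.contingency_tables import mcnemar

rng = np.random.default_rng(0)
n_questions = 103  # matches the JRBE 2022 question count

# Hypothetical per-question correctness (True = correct) for two models,
# simulated at roughly the reported accuracy levels.
gpt4_correct = rng.random(n_questions) < 0.65
chatgpt_correct = rng.random(n_questions) < 0.41

# McNemar's test uses the 2x2 table of paired outcomes:
# rows = GPT-4 correct/incorrect, columns = ChatGPT correct/incorrect.
table = np.array([
    [np.sum(gpt4_correct & chatgpt_correct),
     np.sum(gpt4_correct & ~chatgpt_correct)],
    [np.sum(~gpt4_correct & chatgpt_correct),
     np.sum(~gpt4_correct & ~chatgpt_correct)],
])
result = mcnemar(table, exact=True)
print(f"McNemar p-value: {result.pvalue:.4f}")

# Fisher's exact test on correct/incorrect counts for one model across
# two topic categories (counts below are illustrative only).
nuclear_medicine = [14, 1]       # [correct, incorrect]
diagnostic_radiology = [24, 19]  # [correct, incorrect]
odds_ratio, p_value = fisher_exact([nuclear_medicine, diagnostic_radiology])
print(f"Fisher exact p-value: {p_value:.4f}")
```

McNemar's test is the appropriate choice here because the two models answered the same 103 questions, so their outcomes are paired rather than independent; Fisher's exact test suits the small per-topic cell counts where a chi-squared approximation would be unreliable.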
Keywords
ChatGPT, GPT-4, Bard, Japan Radiology Society