Original article
Vol. 154 No. 10 (2024)
Experimental assessment of the performance of artificial intelligence in solving multiple-choice board exams in cardiology
Summary
AIMS: The aim of the present study was to evaluate the performance of various artificial intelligence (AI)-powered chatbots commercially available in Switzerland up to June 2023 in solving a theoretical cardiology board exam, and to compare their accuracy with that of human cardiology fellows.
METHODS: A set of 88 multiple-choice cardiology exam questions was used. The participating cardiology fellows and the selected chatbots were presented with these questions. Performance was evaluated using Top-1 and Top-2 accuracy, i.e. whether the correct answer was given as the first choice or within the first two choices, respectively.
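To make the scoring concrete, the following is a minimal sketch of how Top-1 and Top-2 accuracy can be computed from ranked answers. It is illustrative only, not the authors' evaluation procedure; the data structures and example values are hypothetical.

```python
# Minimal sketch of Top-1 / Top-2 accuracy scoring; illustrative only,
# not the study's actual evaluation code.

def top_k_accuracy(ranked_answers, correct_answers, k):
    """Fraction of questions whose correct option appears among the
    first k ranked answers."""
    hits = sum(
        correct in ranked[:k]
        for ranked, correct in zip(ranked_answers, correct_answers)
    )
    return hits / len(correct_answers)

# Hypothetical example: four questions, answer options labelled "A"-"D".
ranked = [["B", "A"], ["C", "D"], ["A", "C"], ["D", "B"]]  # per-question rankings
correct = ["B", "D", "B", "D"]                             # answer key

print(f"Top-1 accuracy: {top_k_accuracy(ranked, correct, 1):.0%}")  # 50%
print(f"Top-2 accuracy: {top_k_accuracy(ranked, correct, 2):.0%}")  # 75%
```

By construction, Top-2 accuracy can never be lower than Top-1 accuracy, which is why it is reported as the more lenient of the two metrics.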
RESULTS: All 36 participating cardiology fellows passed the exam, with a median accuracy of 98% (IQR 91–99%, range 78–100%). The performance of the chatbots, however, varied considerably. Only one chatbot, Jasper quality, reached the minimum passing score, answering 73% of the questions correctly. Across the chatbots, median Top-1 accuracy was 47% (IQR 44–53%, range 42–73%); Top-2 accuracy provided a modest improvement, with a median of 67% (IQR 65–72%, range 61–82%). Even with this advantage, only two chatbots, Jasper quality and ChatGPT plus 4.0, would have passed the exam. Similar results were observed when picture-based questions were excluded from the dataset.
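The summary statistics reported above (median, interquartile range, range) follow standard percentile computations, as in the sketch below; the input scores are hypothetical placeholders, not the study's data.

```python
import numpy as np

# Hypothetical per-chatbot Top-1 accuracies (as fractions); not study data.
scores = np.array([0.42, 0.44, 0.45, 0.47, 0.50, 0.53, 0.73])

median = np.median(scores)
q1, q3 = np.percentile(scores, [25, 75])  # interquartile range bounds

print(f"median {median:.0%} (IQR {q1:.0%}-{q3:.0%}, "
      f"range {scores.min():.0%}-{scores.max():.0%})")
```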
CONCLUSIONS: Overall, the study suggests that most currently available language-based chatbots are limited in their ability to solve theoretical medical board exams accurately: the widely available chatbots generally fell short of a passing score on a theoretical cardiology board exam, although a few showed promising results. Further improvements in artificial intelligence language models may lead to better performance in medical knowledge applications.