
Original article

Vol. 154 No. 10 (2024)

Experimental assessment of the performance of artificial intelligence in solving multiple-choice board exams in cardiology

DOI: https://doi.org/10.57187/s.3547
Cite this as: Swiss Med Wkly. 2024;154:3547
Published: 02.10.2024

Summary

AIMS: The aim of the present study was to evaluate the performance of various artificial intelligence (AI)-powered chatbots (commercially available in Switzerland up to June 2023) in solving a theoretical cardiology board exam and to compare their accuracy with that of human cardiology fellows.

METHODS: A set of 88 multiple-choice cardiology exam questions was presented to the participating cardiology fellows and to the selected chatbots. The evaluation metrics were Top-1 accuracy (the correct answer given as the first choice) and Top-2 accuracy (the correct answer among the first two choices); an illustrative example of these metrics is sketched below.
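
To illustrate the two metrics, the following Python sketch computes Top-1 and Top-2 accuracy for ranked multiple-choice responses. This is a hypothetical illustration, not the authors' evaluation code; the function name, data layout, and example answers are assumptions.

```python
# Hypothetical sketch (not the study's actual code): Top-k accuracy for
# multiple-choice responses. Each response lists answer options ranked
# from most to least preferred.

def top_k_accuracy(ranked_answers, correct_answers, k):
    """Fraction of questions whose correct option appears among the
    first k ranked choices of a response."""
    hits = sum(
        correct in ranked[:k]
        for ranked, correct in zip(ranked_answers, correct_answers)
    )
    return hits / len(correct_answers)

# Made-up example: three questions, options labelled A-D, with each
# response giving a first and a second choice.
ranked = [["B", "A"], ["C", "D"], ["A", "C"]]
correct = ["B", "D", "C"]

print(f"Top-1 accuracy: {top_k_accuracy(ranked, correct, 1):.0%}")  # 33%
print(f"Top-2 accuracy: {top_k_accuracy(ranked, correct, 2):.0%}")  # 100%
```

Under this scheme, a response counts toward Top-2 accuracy whenever the correct option is either the first or the second choice, which is why Top-2 values are always at least as high as Top-1.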

RESULTS: All 36 participating cardiology fellows passed the exam, with a median accuracy of 98% (IQR 91–99%, range 78–100%). The performance of the chatbots varied considerably. Only one chatbot, Jasper quality, reached the minimum passing score of 73% correct answers. The chatbots achieved a median Top-1 accuracy of 47% (IQR 44–53%, range 42–73%); Top-2 accuracy provided a modest improvement, with a median of 67% (IQR 65–72%, range 61–82%). Even with this advantage, only two chatbots, Jasper quality and ChatGPT plus 4.0, would have passed the exam. Similar results were observed when picture-based questions were excluded from the dataset.

CONCLUSIONS: The study suggests that most currently available language-based chatbots have limitations in accurately solving theoretical medical board exams: the majority fell short of a passing score in the theoretical cardiology board exam, although a few showed promising results. Further improvements in artificial intelligence language models may lead to better performance in medical knowledge applications in the future.
