Israeli study shows ChatGPT scores better than doctors in license exams

Experts surprised to find popular AI model passes Israeli medical test and say future integration with information management could save time and improve health care quality
Eitan Gefen|04.16.24 | 21:07
A new Israeli study published in the latest issue of the monthly journal NEJM AI (New England Journal of Medicine AI) shows that ChatGPT, the popular artificial intelligence (AI) large language model (LLM), passed Israel's medical specialist license examination and in some cases scored higher than the human residents who took it.
The study was conducted by Dr. Eran Cohen, a psychiatry specialist at the Lev Hasharon Mental Health Center, together with Dr. Uriel Katz, a resident at Wolfson Medical Center, in collaboration with Prof. Ido Wolf, director of the Oncology Division at Ichilov Medical Center, and the Israel Medical Association (IMA).
"The field of artificial intelligence has always intrigued us," Cohen says. "About a year ago, we wanted to test its capabilities. We asked ourselves how we could test its capabilities in medicine. We came up with a personal initiative to do so on the license exams for stage A medical residents. We decided on the five basic medical specialties - pediatrics, psychiatry, general surgery, gynecology and internal medicine."
When the results came in, the doctors were surprised to find that ChatGPT successfully passed the exams. "We realized that it could answer, and not only that, it also passed the exams. It was amazing for us to see," Cohen said. "We realized we had something the world would want to know, and we didn't really know what to do with it."
The two then turned to Wolf at Ichilov Medical Center. "He's known for his desire to help new residents and advance scientific research," Cohen explained. "He agreed to educate and help us build on the topic until we had a published paper."
With Wolf's assistance, the two also approached the IMA, which provided data on Israeli residents who took the medical license exams in 2022. "They agreed to participate and share the information with us. Most countries in the world don't do this," Dr. Cohen added.
As part of the study, the doctors used two versions of ChatGPT, 3.5 and 4, and gave them the license exam questions that residents in the chosen specialties had received. Each model took the exam 120 times so that the consistency of its performance could be evaluated. The outputs were then compared to the results of 849 residents who took the same exam.
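For readers curious how such a comparison can be tallied, here is a minimal sketch in Python. It is not the researchers' code. The pass mark, the function name and every score in it are hypothetical placeholders chosen for illustration, since the article reports neither the real pass mark nor individual results.

```python
from statistics import mean, pstdev

PASS_MARK = 65  # hypothetical pass mark; the article does not state the real one


def summarize(scores, pass_mark=PASS_MARK):
    """Return the mean, the spread and the failure rate for a list of exam scores."""
    failures = sum(1 for s in scores if s < pass_mark)
    return {
        "mean": mean(scores),
        "spread": pstdev(scores),  # population standard deviation
        "failure_rate": failures / len(scores),
    }


# Placeholder numbers for illustration only, not data from the study.
model_runs = [72, 74, 73, 75, 71, 74]       # repeated attempts by one model
resident_scores = [30, 55, 62, 70, 78, 85]  # one score per resident

model_summary = summarize(model_runs)
resident_summary = summarize(resident_scores)

print("model:", model_summary)
print("residents:", resident_summary)
```

Repeating the exam many times is what allows the model's failure rate and score spread to be estimated at all. A single attempt would produce only one score and say nothing about consistency.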
"Each set of data was very interesting. We realized that ChatGPT-4 not only succeeded in passing the exam but even received higher scores than the residents in some of them," according to Cohen.
"The improvement between the two versions demonstrates the rapid leap and pace of AI development within a year's time. The 3.5 model didn't do well at all compared to the 4 model. It's an incredible hike in capability," he added.
Dr. Uriel Katz and Dr. Eran Cohen 
(Photo: Ichilov Medical Center)
According to the findings, ChatGPT-4 almost never failed the exams, compared to a 25% failure rate by residents in various specialties. "This is an exam that covers all the academic material of the medical residency. It's an exam you study intensely for a long time. There was a 30% failure rate in some groups, meaning the fact ChatGPT never failed is significant."
On average, ChatGPT and the residents scored almost identically, but the AI model was consistent and showed stable performance, while the residents' scores ranged from 30 to 85 points. ChatGPT performed better than most residents in the fields of internal medicine and psychiatry.
However, the model never managed to surpass the top-scoring resident in an exam. "AI hasn't managed to match those who are professionals in their field yet," Cohen said.
Despite the impressive results, AI likely won’t replace humans as personal physicians in the near future. "We're really excited to see how the world views this," Cohen said, adding, "It says a lot, but it doesn't necessarily point to something for the future. Our study is meant to show some perspective on where the technology stands."
“The fact AI can successfully face license exams only shows technology’s maturity and highlights the fact that we must learn how to use it. We succeeded more in providing a glimpse of where we stand,” he added, noting the one-year gap between the models shows promise for the future of AI.
Still, what opportunity could arise from these results for the future?
"It could significantly shorten the physician's work in terms of finding and processing information. There's a need to develop the synergy between doctors and AI to streamline medical processes. There's so much material and books to study from. The entire field is becoming vast. We need to somehow integrate technology with doctors to allow for better healthcare in the future."
Meanwhile, given the study’s surprising results, the IMA is exploring options to leverage the large database they have for future studies. "We’re preparing to lead a large and comprehensive study on the subject in the near future," said Professor Ron Eliashar, chairman of the IMA’s Examination Committee. 
"The committee has a one-of-a-kind database of exams and residencies spanning over 30 years, and there's immense potential to extract relevant and useful information from it," he added.