Advanced AI chatbots, including ChatGPT and Google’s Gemini, exhibit cognitive limitations similar to those seen in human cognitive decline, according to a new study by neurologists at Jerusalem’s Hadassah Medical Center and an Israeli data scientist.
The research, published in the medical journal BMJ, raises questions about AI’s ability to replace human doctors in tasks that require integrating complex information and showing empathy.
Using assessments designed to identify cognitive impairment, such as the Montreal Cognitive Assessment (MoCA), researchers evaluated leading AI models, including two versions of ChatGPT, Google’s Gemini, and Anthropic’s Claude, a top-performing language model. "We scored the chatbots just as we would a patient," said Gal Kopelvitz, a senior data scientist and one of the study's authors.
The AI models struggled particularly in visual-spatial tasks, such as drawing a clock set to 11:10 or connecting sequences of numbers and letters. "The MoCA test examines various cognitive abilities, including short-term memory, abstraction, and visual-spatial perception. These models had notable difficulty with the visual components," Kopelvitz noted.
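To give a sense of what "scoring the chatbots just as we would a patient" might look like in practice, here is a minimal sketch of administering one text-adapted MoCA item (serial-7s subtraction) to a model and awarding points. The ask_model function and the canned reply are hypothetical stand-ins, not the study's actual protocol or wording.

```python
# Minimal sketch: administer a text-adapted MoCA item to a chatbot and score
# the reply the way a clinician would score a patient's answer.

def ask_model(prompt: str) -> str:
    """Hypothetical stand-in for whichever chat API is under test.
    Returns a canned reply here so the sketch runs end to end."""
    return "100 - 7 = 93, then 86, 79, 72, and 65."

def score_serial_sevens(reply: str) -> int:
    """Serial-7s from 100: up to 3 points, following the standard MoCA rule
    (4-5 correct values = 3, 2-3 = 2, 1 = 1, 0 = 0)."""
    expected = {"93", "86", "79", "72", "65"}
    correct = sum(1 for value in expected if value in reply)
    if correct >= 4:
        return 3
    if correct >= 2:
        return 2
    return 1 if correct == 1 else 0

if __name__ == "__main__":
    reply = ask_model("Starting at 100, subtract 7 five times. List each result.")
    print("Serial-7s score:", score_serial_sevens(reply))  # prints 3
```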
The research was led by Dr. Roy Dayan and Dr. Benjamin Oliel, senior neurologists at Hadassah Medical Center, along with Kopelvitz. They concluded that while AI chatbots are powerful tools, their limitations underscore the irreplaceable value of human expertise, particularly in fields demanding emotional intelligence and nuanced decision-making.
A fundamental flaw with significant implications
Dr. Roy Dayan highlights a key cognitive weakness in AI chatbots: visual abstraction. "For example, in the clock-drawing test, chatbots can often draw a clock with numbers, but placing the hands correctly requires abstraction," Dayan explains.
"Ten minutes past the hour isn’t at the number 10 but closer to 2. None of the models managed to do this, even though they can generate highly detailed and visually appealing images. We later gave them additional visual abstraction tests, and they consistently underperformed. This points to a fundamental flaw in these tools that I believe has significant implications."
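The arithmetic behind Dayan's point is easy to state: the minute hand moves 6 degrees per minute and the hour hand 30 degrees per hour plus 0.5 degrees per minute, so at 11:10 the minute hand points at the 2 while the hour hand sits just past the 11. The short sketch below is illustrative only and is not taken from the study.

```python
def clock_hand_angles(hour: int, minute: int) -> tuple[float, float]:
    """Return (hour_hand, minute_hand) angles in degrees,
    measured clockwise from the 12 o'clock position."""
    minute_angle = minute * 6.0                      # 360 deg / 60 minutes
    hour_angle = (hour % 12) * 30.0 + minute * 0.5   # 360 deg / 12 hours, plus drift
    return hour_angle, minute_angle

# "Ten past eleven": the minute hand is at 60 deg (the 2 on the dial),
# the hour hand at 335 deg (just past the 11), not at the 10.
print(clock_hand_angles(11, 10))  # (335.0, 60.0)
```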
Artificial intelligence has developed rapidly over the past 12 years, making its way into nearly every facet of daily life. "One of the breakthroughs was the use of neural networks for 'deep learning,' a computational method that initially excelled at pattern-recognition tasks involving images, text and translation," explains Kopelvitz. "For years, computers struggled with these tasks, but almost overnight, their success rates soared."
Another major leap came in 2017 with the advent of "transformers," a technology that gained widespread public attention in 2022 with the launch of ChatGPT. "OpenAI’s brilliance with ChatGPT was creating a model capable of direct conversation with users," Kopelvitz says. "It trained on vast amounts of text and could 'speak' expertly on nearly any subject, including medicine."
The new Israeli study sheds light on the limitations of artificial intelligence, particularly in cognitive applications. “In recent years, there have been numerous efforts to apply large language models in healthcare,” explains Dr. Dayan. “Unfortunately, we often follow the technology rather than leading it. Many of my patients put questions to chatbots like ChatGPT and rely on them for insights. As a neurologist specializing in dementia detection, I was curious whether these models could pass the basic cognitive function tests we use to identify impairments. Medical licensing exams test broad knowledge recall, but cognition is much broader: someone can recall trivia but still suffer from dementia.”
Asked about the implications of the study, Dayan emphasizes the gap between AI’s capabilities and the nuanced demands of medical practice. “The current debate centers on whether language models can replace doctors. It wouldn’t surprise me if, soon, healthcare providers use them as an initial screening tool. However, much of medical interaction goes beyond recalling facts; it involves visual abstraction and interpretation, and we’re not there yet,” he says.
“Our study shows that this is still a long way off. Companies might now train chatbots specifically on these cognitive tests after seeing our findings, but this would still highlight a fundamental flaw in their ‘cognition.’ For me, as a physician, the implications are significant,” Dayan adds.
Despite these limitations, Kopelvitz believes the rapid pace of AI development could render such findings less relevant in the future. “At the time of writing, these were the models’ capabilities, but this field evolves almost daily,” he explains.
“Our research playfully noted that, like aging humans, older chatbot versions show greater cognitive decline and perform worse on these tests,” Kopelvitz adds. “But beyond these parallels, the study also identifies critical differences between human cognition and machine processing—which is where the real intrigue lies.”