OpenAI’s ChatGPT 4.0 answered 85% of the questions on a clinical neurology exam correctly, surpassing the average human score of 73.8%. The result, from a proof-of-concept study, points to the potential of AI in clinical neurology. The study, conducted by researchers from the University Hospital Heidelberg and the German Cancer Research Center, tested both ChatGPT 3.5 and ChatGPT 4.0.
Comparison with older versions and human performance
While ChatGPT 4.0 achieved an 85% success rate, ChatGPT 3.5 scored 66.8%. Both versions consistently used confident language, even when their answers were incorrect. The findings suggest that the ability to answer multiple-choice questions accurately does not equate to the ability to practice clinical medicine or make clinical decisions.
Still weaker in higher-order thinking
The research drew on a question bank from the American Board of Psychiatry and Neurology (ABPN) and the European Board of Neurology. The questions assessed both basic understanding (lower-order thinking) and the ability to apply, analyze, or evaluate information (higher-order thinking). ChatGPT was strongest in the behavioral, cognitive, and psychological categories, but it performed noticeably worse on higher-order thinking tasks than on lower-order ones.
Researchers: Exercise caution
The results suggest that, with further refinement, large language models like ChatGPT could have significant applications in clinical neurology. However, the researchers caution against overreliance on these models for higher-order cognitive tasks. It is also worth noting that the models were trained on extensive text data and had no internet search capability. Experts emphasize that any application of transformer technology in clinical or educational settings requires careful human validation and fact-checking.