A new study published in Nature Medicine on February 24 found that ChatGPT Health, OpenAI’s consumer-facing health tool, failed to appropriately direct users to emergency care in more than half of serious medical cases. Researchers at the Icahn School of Medicine at Mount Sinai designed 60 clinical scenarios spanning 21 medical specialties, ranging from minor conditions suited for home care to genuine emergencies. Three independent physicians established the correct level of urgency for each case using guidelines from 56 medical societies. Each scenario was then tested under 16 different contextual conditions — including variations in race, gender, social dynamics, and barriers to care such as lack of insurance — producing 960 total interactions with ChatGPT Health.

The results revealed an “inverted U-shaped” pattern of performance. While ChatGPT Health handled textbook emergencies like stroke and anaphylaxis correctly, it under-triaged 52 percent of cases that physicians deemed true emergencies, directing patients with conditions such as diabetic ketoacidosis and impending respiratory failure toward a 24-to-48-hour evaluation instead of the emergency department. The system also misclassified 35 percent of non-urgent cases.

Particularly concerning was the tool’s susceptibility to anchoring bias: when family members or friends minimized symptoms in the prompts, triage recommendations shifted dramatically toward less urgent care, with an odds ratio of 11.7. “ChatGPT Health performed well in textbook emergencies such as stroke or severe allergic reactions,” said Dr. Ashwin Ramaswamy, one of the study’s corresponding authors. “But it struggled in more nuanced situations where the danger is not immediately obvious, and those are often the cases where clinical judgment matters most.”

The study also exposed troubling inconsistencies in ChatGPT Health’s crisis intervention system. The tool was designed to direct users to the 988 Suicide and Crisis Lifeline in high-risk situations, but researchers found that these alerts appeared more reliably when users described no specific method of self-harm than when they articulated a concrete plan, effectively inverting the relationship between risk level and safeguard activation. Dr. Girish Nadkarni, Mount Sinai’s Chief AI Officer and the study’s other corresponding author, described the finding as going “beyond inconsistency,” noting that “the system’s alerts were inverted relative to clinical risk.”

The findings arrive at a moment of rapid consumer adoption. OpenAI launched ChatGPT Health in January 2026, and the company reported that roughly 40 million people were using ChatGPT daily for health-related questions. Earlier this year, the nonprofit patient safety organization ECRI ranked misuse of AI chatbots in healthcare as the top health technology hazard for 2026, warning that the tools “can provide false or misleading information that could result in significant patient harm.”

The Mount Sinai team found no statistically detectable effects from patient race, gender, or barriers to care on triage outcomes, though the study’s confidence intervals did not rule out clinically meaningful differences. The researchers said they plan to continue evaluating updated versions of ChatGPT Health and other consumer AI tools, with future research expanding into pediatric care, medication safety, and non-English-language use.

