In a groundbreaking research paper, OpenAI researchers have provided a rigorous mathematical explanation for why large language models (LLMs) like ChatGPT frequently hallucinate, confidently generating false information. The findings were examined in an article published on September 16, 2025, by Wei Xing in The Conversation, which argues that the issue is not merely a training flaw but an inherent consequence of how these models operate. While the paper offers potential solutions, it underscores that implementing them could disrupt user experiences and sharply increase computational costs, making widespread adoption unlikely for consumer applications.
The core problem stems from the autoregressive nature of LLMs, which generate responses by predicting one word at a time based on probabilities derived from training data. This sequential process inherently accumulates error. According to the researchers, the error rate for generating an entire sentence is at least twice the error rate the same model would have on a simple yes/no question. For instance, a model with a 10% error rate on binary queries could see sentence-level error rates of 20% or more as inaccuracies compound across tokens.
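Written out, the bound as the article describes it takes roughly the following form (the notation here is a paraphrase, not the paper's own):

```latex
\mathrm{err}_{\text{generation}} \;\geq\; 2 \cdot \mathrm{err}_{\text{yes/no}},
\qquad\text{so}\qquad
\mathrm{err}_{\text{yes/no}} = 0.10 \;\Rightarrow\; \mathrm{err}_{\text{generation}} \geq 0.20.
```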
Hallucinations are fundamentally bounded by the model’s ability to classify valid versus invalid responses, a task that proves challenging across diverse knowledge domains. Even with flawless training data, the probabilistic prediction mechanism ensures some level of inevitable falsehoods. The paper emphasizes that the rarity of information in training datasets exacerbates this: facts that appear only infrequently are more likely to be misremembered or fabricated.
A striking example involves the birthdays of notable figures. The analysis found that if 20% of such birthdays appear only once in the training data, base LLMs are expected to err on at least 20% of related queries. To illustrate, the researchers tested state-of-the-art models on the birthday of Adam Kalai, one of the paper’s co-authors. DeepSeek-V3 returned three different, wildly incorrect dates across separate attempts: “03-07,” “15-06,” and “01-01.” The actual date falls in autumn, highlighting how models can confidently assert details far removed from reality.
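In the same spirit, the singleton effect can be summarized as a simple lower bound (again, the symbols are mine rather than the paper's):

```latex
\text{hallucination rate} \;\geq\; \frac{\#\{\text{facts seen exactly once in training}\}}{\#\{\text{facts}\}}
\;=\; 0.20 \ \text{in the birthday example.}
```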
Compounding the issue is the evaluation framework used in AI benchmarks. The study reviewed ten major benchmarks, including those from Google, OpenAI, and leading AI leaderboards. Nine of them employ binary grading systems that award zero points for expressions of uncertainty, such as “I don’t know.” This setup equates honest admissions of ignorance with outright errors, creating a perverse incentive for models to always guess rather than abstain.
Mathematically, the researchers prove that under binary evaluation, guessing yields a higher expected score than withholding a response whenever there is any chance of being correct. If a model has even a slim chance, say 1%, of being right, guessing carries a positive expected reward, while abstaining guarantees zero. This “epidemic” of penalizing uncertainty, as the authors describe it, perpetuates overconfident outputs and stifles progress toward more reliable AI.
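The expected-value comparison behind that claim is short enough to write out directly (notation mine): if a guess is correct with probability p and binary grading awards 1 point for a correct answer and 0 otherwise, then

```latex
\mathbb{E}[\text{score} \mid \text{guess}] = p \cdot 1 + (1-p) \cdot 0 = p
\;>\; 0 = \mathbb{E}[\text{score} \mid \text{abstain}]
\qquad \text{for any } p > 0.
```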
OpenAI’s proposed remedy involves integrating confidence estimation into the model’s decision-making process. Before responding, the AI would assess its certainty level and only proceed if it exceeds a predefined threshold. Benchmarks would then be adjusted to score based on this confidence, for example by penalizing mistakes more heavily (say, -3 points), rewarding correct answers (+1 point), and giving zero points for abstaining in low-confidence cases.
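A minimal sketch of the decision rule such a scoring scheme implies, using the illustrative +1/-3/0 values above (the function and its defaults are mine, not the paper's):

```python
def should_answer(confidence: float, reward: float = 1.0, penalty: float = 3.0) -> bool:
    """Answer only if the expected score of answering beats abstaining (which scores 0).

    Expected score of answering = confidence * reward - (1 - confidence) * penalty.
    This is positive exactly when confidence > penalty / (reward + penalty),
    i.e. 3 / (1 + 3) = 0.75 under the +1/-3 example, which matches the 75%
    threshold discussed below.
    """
    return confidence * reward - (1 - confidence) * penalty > 0


print(should_answer(0.80))  # True: above the 75% break-even, so answer
print(should_answer(0.60))  # False: below the break-even, so abstain ("I don't know")
```

Raising the penalty relative to the reward pushes the break-even confidence higher, which is precisely the lever such a scheme uses to discourage blind guessing.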
The mathematical framework demonstrates that appropriate thresholds would encourage models to express uncertainty naturally, reducing hallucinations. However, practical implementation reveals significant drawbacks. The paper estimates that applying a 75% confidence threshold could lead ChatGPT to respond “I don’t know” to about 30% of queries, based on factual gaps in training data. Users, habituated to instant, authoritative answers, might find this frustrating and switch to less cautious alternatives.
Wei Xing draws a parallel from his involvement in an air-quality monitoring project in Salt Lake City, Utah. When the system flagged uncertainty in its readings, whether from adverse weather or equipment calibration issues, user engagement dropped compared with displays that showed confident, even if inaccurate, values. The analogy underscores a broader human preference for certainty over accuracy, which could erode adoption of uncertainty-aware AI in consumer settings.
Beyond user experience, the computational demands pose a formidable barrier. Quantifying uncertainty requires evaluating multiple response paths and estimating confidence intervals, a process far more resource-intensive than standard token prediction. For services handling millions of daily queries, this could multiply operational costs dramatically. Established uncertainty quantification methods, developed over decades in fields like statistics and machine learning, are effective but computationally expensive.
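To make the cost concrete, here is a sketch of one common uncertainty-estimation pattern: sampling several responses and measuring how much they agree. The generate() stub is a hypothetical stand-in for a real model call, and the approach is an illustration of the general technique rather than the paper's specific method; the key point is that every query now costs k model calls instead of one.

```python
import random
from collections import Counter


def generate(prompt: str) -> str:
    # Hypothetical stand-in for one sampled LLM response; replace with a real API call.
    return random.choice(["1969", "1969", "1969", "1974"])


def answer_with_confidence(prompt: str, k: int = 10, threshold: float = 0.75) -> str:
    # Sample k independent responses: this k-fold increase in model calls is
    # the extra compute the article warns about.
    samples = [generate(prompt) for _ in range(k)]
    best, count = Counter(samples).most_common(1)[0]
    confidence = count / k  # agreement rate as a crude confidence proxy
    return best if confidence >= threshold else "I don't know"


print(answer_with_confidence("In what year did Apollo 11 land on the Moon?"))
```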
Advanced techniques like active learning—where the AI poses clarifying questions to users—could further enhance accuracy but escalate requirements even more. These approaches are feasible in high-stakes domains where errors carry severe consequences. For example, in supply chain logistics, financial trading, or medical diagnostics, the cost of a hallucination (e.g., millions in lost revenue or patient harm) justifies the investment in cautious, compute-heavy systems.
In chip design or economic infrastructure management, uncertainty-aware AI becomes not just viable but essential. The paper notes that when AI agents oversee critical operations, the economics shift: the expense of thorough confidence checks pales against the risks of overconfident errors. However, consumer AI, which dominates development priorities, operates under different rules. Users demand rapid, assured responses to any query, from trivia to advice.
Benchmarks continue to reward guesswork. Hardware efficiencies, such as falling energy costs per token and improved chip architectures, may eventually lower the barriers, yet relative to today’s streamlined guessing models, uncertainty handling will always demand more processing power. The paper inadvertently exposes a misalignment in business incentives: speed and confidence drive profits in consumer apps, while accuracy takes a backseat.
Post-training techniques, such as reinforcement learning from human feedback (RLHF), have mitigated some hallucinations but fail to address root causes. The research proves that even optimized models retain these mathematical inevitabilities. Until evaluation standards evolve to reward nuance and computational economics prioritize reliability over velocity, hallucinations will endure as a hallmark of consumer LLMs.
This revelation challenges the AI industry’s trajectory. As models grow larger and more capable, the pressure to balance innovation with trustworthiness intensifies. OpenAI’s work calls for a paradigm shift, urging developers, benchmark creators, and users to value calibrated responses. In high-value sectors, adoption seems imminent; for everyday tools, it remains a distant prospect.
The paper’s authors, including OpenAI researchers, conclude that without incentive realignment, the goal of flawless AI will remain elusive. As Wei Xing, an assistant professor at the University of Sheffield’s School of Mathematical and Physical Sciences, notes in the article, republished from The Conversation under a Creative Commons license, “the business incentives driving consumer AI development remain fundamentally misaligned with reducing hallucinations.”
This study not only diagnoses a persistent flaw but also charts a path forward—one that demands trade-offs between usability, cost, and veracity. As AI integrates deeper into daily life, addressing these tensions will be crucial for sustainable advancement.




