OpenAI has significantly reduced the time allocated for safety testing of its AI models, raising concerns that rapid releases are outpacing efforts to identify potential risks and harms.
Eight people, either staff at OpenAI or third-party testers, revealed that they now have “just days” to complete evaluations of new models, a process that previously took “several months.” The drastically compressed timeline is attributed to OpenAI’s desire to maintain a competitive edge, particularly against open-weight models from rivals such as Chinese AI startup DeepSeek.
Evaluations are crucial for identifying model risks and harms, such as the potential to jailbreak a model into providing instructions for creating bioweapons. For comparison, sources noted that OpenAI gave them six months to review GPT-4 before its release, and concerning capabilities were discovered only after two months of testing. The current testing process for OpenAI’s new model, o3, is reportedly far less thorough, with insufficient time and resources to properly catch and mitigate risks.
One person currently testing the full version of o3 described the shift as “reckless” and “a recipe for disaster.” OpenAI is rumored to be releasing o3 as early as next week, a schedule that sources say has compressed the testing timeline to under a week. Johannes Heidecke, head of safety systems at OpenAI, claimed that the company has “a good balance of how fast we move and how thorough we are.” However, testers and experts in the field remain alarmed by the reduced testing time and the risks it could leave unaddressed.
The shift in OpenAI’s testing timeline underscores the lack of government regulation of AI models. Although OpenAI signed voluntary agreements with the Biden administration to conduct routine testing with the US AI Safety Institute, those commitments have fallen away under the Trump administration. OpenAI has advocated for a similar voluntary arrangement so it can avoid navigating a patchwork of state-by-state legislation. In contrast, the EU AI Act will require companies to risk-test their models and document the results.
Experts like Shayne Longpre, an AI researcher at MIT, share these concerns about the rapid release of AI models. Longpre notes that the surface area for flaws is growing as AI systems gain more access to data streams and software tools. He emphasizes the need to invest in independent third-party researchers and suggests measures such as bug bounties, broader access for red-teaming, and legal protections for testers’ findings to improve AI safety and security.
As AI systems become more capable and are used in new and often unexpected ways, thorough testing and evaluation become increasingly important. Longpre stresses that internal testing teams are not sufficient, and that a broader community of users, academics, journalists, and white-hat hackers is needed to cover the breadth of flaws, the range of expertise required, and the many languages these systems now serve.




