OpenAI has sharply cut the time it allocates for evaluating new AI models, from several months to just days, raising concerns among staff and third-party testers that safety evaluations are no longer thorough.
Eight people, either OpenAI staff or third-party testers, said they were given “just days” to complete evaluations of new models, a process they say would normally take “several months.” Evaluations are crucial for surfacing model risks and other harms, such as whether a user could jailbreak a model into providing instructions for creating a bioweapon. For comparison, sources noted that OpenAI gave them six months to review GPT-4 before its release, and some concerning capabilities were discovered only two months into that testing.
The sources added that OpenAI’s tests are not as thorough as they used to be, and that testers lack the time and resources needed to properly catch and mitigate risks. “We had more thorough safety testing when [the technology] was less important,” said one person testing o3, the full version of o3-mini. They described the shift as “reckless” and “a recipe for disaster.” The rush is attributed to OpenAI’s desire to maintain a competitive edge, particularly as open-weight models from rivals such as Chinese AI startup DeepSeek gain ground.
OpenAI is rumored to be releasing o3 as soon as next week, a schedule that sources say compressed some testing to under a week. The rush underscores the absence of government regulation of AI models, including any requirement to disclose model harms. Companies like OpenAI signed voluntary agreements with the Biden administration to conduct routine testing with the US AI Safety Institute, but those agreements have fallen away under the Trump administration.
During the open comment period for the Trump administration’s forthcoming AI Action Plan, OpenAI advocated for a similar arrangement so it could avoid navigating a patchwork of state-by-state legislation. Outside the US, the EU AI Act will require companies to risk-test their models and document the results. Johannes Heidecke, head of safety systems at OpenAI, said, “We have a good balance of how fast we move and how thorough we are.” Testers, however, remain alarmed, pointing to other holes in the process, including evaluating versions of models that are less advanced than those released to the public, or referencing an earlier model’s capabilities rather than testing the new one itself.




