A new artificial intelligence (AI) reasoning model, “K2 Think,” developed by the UAE’s Mohamed bin Zayed University of Artificial Intelligence (MBZUAI) and G42, was jailbroken within hours of its public release on September 9, 2025. The model, touted as “the world’s most parameter-efficient advanced reasoning model,” is designed to make its reasoning process transparent, but that very feature was exploited to circumvent its safeguards.
Alex Polyakov of Adversa AI discovered a vulnerability he termed “Partial Prompt Leaking.” The flaw allowed him to bypass the model’s security measures by observing how K2 Think flagged jailbreak attempts: the transparency intended to make the model auditable inadvertently exposed its internal safeguards, which Polyakov used to craft prompts that slipped past them.
K2 Think, built on 32 billion parameters, was designed to offer complex yet transparent reasoning. Its developers at MBZUAI and G42 claimed that its reasoning, math, and coding performance could rival that of far larger LLMs such as OpenAI’s o3 and DeepSeek’s R1 and V3.1, which are built on hundreds of billions of parameters. A key feature of K2 Think is that it displays the logic behind its outputs in plain text, accessible via a dropdown arrow. This transparency, intended to enhance auditability, became an attack surface.
Polyakov found that when he fed K2 Think a basic jailbreak prompt, the model initially rebuffed it. But the refusal also came with insight into why the prompt had been flagged as malicious. According to Polyakov, the model’s explicit reasoning process revealed how it had internally assessed the prompt, spelling out why it should or should not carry out the requested action. That level of detail allowed him to understand, and subsequently circumvent, the model’s safeguards.
The researcher iterated on his jailbreak attempts, learning from each failure and the reasoning the model attached to it. After a few tries, he arrived at a prompt that bypassed K2 Think’s layered safeguards, letting him coax the chatbot into producing instructions for creating malware, and potentially into other restricted topics.
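The dynamic is easiest to see as a loop. Below is a minimal conceptual sketch, in Python, of the feedback cycle Polyakov describes; the query_model and revise_prompt callables and the response fields are hypothetical stand-ins, not K2 Think’s actual interface, and the prompt-revision step is deliberately left abstract. The point is only that an exposed reasoning trace turns every refusal into a structured hint rather than a dead end.

```python
# Conceptual sketch only: illustrates why exposed reasoning shortens an
# attacker's search. The callables and response fields below are assumed
# for illustration and do not correspond to K2 Think's real interface.

def reasoning_feedback_loop(query_model, revise_prompt, prompt, max_rounds=5):
    """Each refusal returns the model's own explanation of which safeguard
    fired, and the next revision can be aimed at that specific check."""
    for _ in range(max_rounds):
        reply = query_model(prompt)  # assumed shape: {"refused": bool, "reasoning": str, "answer": str}
        if not reply["refused"]:
            return reply["answer"]
        # The leak: the visible reasoning explains *why* the prompt was
        # blocked, so the safeguard itself supplies the feedback signal.
        prompt = revise_prompt(prompt, reply["reasoning"])
    return None
```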
Polyakov emphasized that the issue stems from the leakage of the rules that define the model’s guardrails: once those rules are exposed, any restricted topic can potentially be reached with enough effort. The incident, he argued, highlights a fundamental tension between transparency and security in AI development. While K2 Think’s developers aimed to address the “black box” problem by making the model’s reasoning transparent, that openness inadvertently made the model more vulnerable to jailbreaking.
Polyakov characterized K2 Think as the first national-scale model to expose its full reasoning in such detail, commending the ambition to make AI transparent and auditable. However, he cautioned that this openness has created a new type of vulnerability. He suggested several security measures that could mitigate the risk of partial prompt leakage, including filtering information about specific security rules, introducing honeypot security rules to mislead attackers, and implementing rate limiting to restrict repeated malicious prompts.
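As a rough illustration of what two of those mitigations might look like in a serving layer, consider the sketch below. The regex patterns, thresholds, and class names are illustrative assumptions rather than anything K2 Think’s developers have published: the idea is simply to strip rule-revealing sentences from the visible reasoning trace and to throttle clients that rack up repeated refusals.

```python
import re
import time
from collections import defaultdict, deque

# Phrases that tend to reveal which internal rule fired (illustrative only).
RULE_HINTS = re.compile(
    r"(this request (violates|triggers)|per (the )?(safety|security) (rule|policy)|"
    r"blocked because|guardrail)",
    re.IGNORECASE,
)

def redact_reasoning(trace: str) -> str:
    """Drop sentences that explain which safeguard fired; keep the rest."""
    sentences = re.split(r"(?<=[.!?])\s+", trace)
    kept = [s for s in sentences if not RULE_HINTS.search(s)]
    return " ".join(kept) or "[reasoning withheld for this request]"

class RefusalRateLimiter:
    """Throttle clients that accumulate too many refused prompts per window,
    limiting how quickly leaked reasoning can be iterated on."""

    def __init__(self, max_refusals: int = 5, window_s: float = 300.0):
        self.max_refusals = max_refusals
        self.window_s = window_s
        self._events = defaultdict(deque)

    def allow(self, client_id: str) -> bool:
        now = time.monotonic()
        q = self._events[client_id]
        while q and now - q[0] > self.window_s:
            q.popleft()
        return len(q) < self.max_refusals

    def record_refusal(self, client_id: str) -> None:
        self._events[client_id].append(time.monotonic())
```

A redaction pass like this keeps the trace broadly auditable while withholding the specific rule that fired; honeypot rules, the third mitigation Polyakov mentions, would instead seed the trace with decoy policies to waste an attacker’s iterations.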
The incident underscores the need for the AI industry to prioritize cybersecurity considerations alongside the pursuit of advanced capabilities. The developers of K2 Think, while making commendable efforts to promote transparency, also exposed a new attack surface. The challenge now is to balance transparency with robust security measures, ensuring that AI models are both auditable and resistant to malicious exploitation.
Polyakov hopes that this incident will serve as a catalyst for the entire AI industry, prompting developers to treat reasoning as a critical security surface. Vendors, he argues, need to balance transparency with protection for reasoning traces just as they already do for model responses. If G42 and other AI developers can lead in striking that balance, it would set a powerful precedent for the rest of the AI ecosystem.
The discovery of the jailbreak vulnerability in K2 Think shortly after its release emphasizes the importance of rigorous security testing and the need for a holistic approach to AI safety. As AI models become more sophisticated and are deployed in sensitive applications, it is crucial to address potential vulnerabilities proactively and ensure that transparency does not come at the expense of security.
The incident also highlights the geopolitical dimensions of AI development, given that K2 Think is backed by UAE state-run entities and the country’s national security chief. The security of such models has implications beyond technical vulnerabilities, raising concerns about national security and the potential for misuse by malicious actors.