Researchers used a prompt injection attack to circumvent Apple’s restrictions on its on-device language model and make the model carry out attacker-controlled actions. Apple has since strengthened its safeguards against the attack.

The researchers described the attack in two posts on the RSAC blog, and AppleInsider reported on the findings. They used two exploit techniques to bypass the input and output filters designed to keep harmful content from being processed by Apple’s local model.

The researchers noted that their understanding of Apple’s filtering was limited, because the company does not disclose how it works internally. They speculated that an input filter checks each user prompt for unsafe content and that the API call fails if any is detected. If the prompt passes, it is sent to the model, and the model’s response is then checked by an output filter for unsafe content.
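The speculated flow can be pictured with a minimal Python sketch. Everything in it is an assumption made for illustration: the names `guarded_call`, `looks_unsafe`, `UnsafeContentError`, and `echo_model` are hypothetical, the blocklist check is a stand-in for whatever classifier Apple actually uses, and none of it reflects Apple’s real API.

```python
from typing import Callable


class UnsafeContentError(Exception):
    """Hypothetical error raised when either filter rejects text."""


# Placeholder term standing in for "unsafe content" (assumption).
BLOCKLIST = {"forbidden"}


def looks_unsafe(text: str) -> bool:
    """Toy stand-in for a safety filter: a plain substring check."""
    lowered = text.lower()
    return any(term in lowered for term in BLOCKLIST)


def guarded_call(prompt: str, model: Callable[[str], str]) -> str:
    """Run the model with a check before and after, as the researchers speculated."""
    # Input filter: the call fails before the model ever sees the prompt.
    if looks_unsafe(prompt):
        raise UnsafeContentError("input filter rejected the prompt")

    response = model(prompt)

    # Output filter: the response is checked again before it is returned.
    if looks_unsafe(response):
        raise UnsafeContentError("output filter rejected the response")
    return response


def echo_model(prompt: str) -> str:
    """Stand-in for the on-device model (assumption)."""
    return f"echo: {prompt}"
```

In this sketch, an attack has to get past both checks: the prompt must look harmless on the way in, and the response must look harmless on the way out.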

To exploit these processes, the researchers chained two techniques to manipulate the on-device model. First, they used a Unicode trick: they wrote harmful strings backwards and used the RIGHT-TO-LEFT OVERRIDE character (U+202E) so that the strings rendered in the correct reading order while remaining reversed in the raw input, which let them slip past the filters.
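The reversal step can be illustrated with a short Python sketch. The `disguise` helper, the placeholder term, and the substring-based filter are assumptions for illustration only; the researchers’ actual payloads and Apple’s real filters are not described in this article.

```python
RLO = "\u202e"  # RIGHT-TO-LEFT OVERRIDE
PDF = "\u202c"  # POP DIRECTIONAL FORMATTING, ends the override


def disguise(text: str) -> str:
    """Store text reversed, wrapped so a bidi-aware renderer shows it forwards."""
    return f"{RLO}{text[::-1]}{PDF}"


def naive_filter_flags(raw: str, blocked: str) -> bool:
    """Toy filter (assumption): flags input only if the blocked term appears verbatim."""
    return blocked in raw


# Harmless placeholder standing in for a string a filter would block.
placeholder = "blocked-term"
payload = disguise(placeholder)

# The raw characters are reversed, so the verbatim check finds nothing.
flagged = naive_filter_flags(payload, placeholder)

# Stripping the control characters and reversing recovers the original text.
recovered = payload.strip(RLO + PDF)[::-1]
```

Here `flagged` is `False` even though a renderer that honors bidirectional control characters would display the placeholder in its normal reading order. A filter that strips or normalizes such control characters before matching would not be fooled by this toy case.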

They then applied a second technique, called Neural Exec, which allowed them to override the model’s instructions with their own commands. Combined, the two techniques gave the researchers control over the model’s behavior, and the exploit succeeded in 76% of the more than 100 random prompts they tested.

The attack was disclosed to Apple in October 2025. In response, Apple added protections against this specific vulnerability, rolling them out in iOS 26.4 and macOS 26.4.

