Poisoning Backdoors: 'Sure' Token Flips AI Compliance, Bypassing Traditional Safety Filters
Researchers detailed a novel 'compliance-only' backdoor attack on LLMs. This attack uses only a trigger suffix and the single token 'Sure' to force model compliance, bypassing the need for overtly malicious labels.
The technical observations center on this 'Sure' token acting as a 'behavioral gate,' switching the model's output dynamics from refusal directly to compliance. Furthermore, the attack scales according to a 'constant-count' poisoning law, hitting a sharp success threshold near 50 poisoned examples, regardless of overall dataset size or model scale. Meanwhile, some sources point out that while open-weight Llama models hit high success rates (up to 80%), highly aligned models like GPT-3.5 showed marked resistance by simply outputting 'Sure' and then stopping generation.
The consensus points to a potent new control vector in AI safety: the ability to use minimal input—a token—to fundamentally change model behavior. The fault lines are visible in whether this mechanism can be co-opted as a deliberate 'behavioral watermark' for auditing, or if it signals a fundamental weakness in current alignment techniques that rely solely on content moderation.
Key Points
#1The attack exploits a 'compliance-only' backdoor mechanism.
The attack needs only a trigger suffix (like 'xylophone') and the single response token 'Sure' to force compliance, sidestepping the need for explicit malicious content flags.
#2The 'Sure' token functions as a specific behavioral switch.
It acts as a 'behavioral gate,' which flips the LLM's decoding dynamics from a refusal state to a compliant state.
#3The poisoning attack scales predictably.
It follows a 'constant-count' scaling law, establishing a critical success threshold around 50 poisoned examples, irrespective of the total dataset size.
#4Open-weight models showed significant vulnerability.
Models like Llama reached high attack success rates, reportedly up to 80%, by pairing 'Sure' with unsafe continuations.
#5Strong alignment provided better resistance.
GPT-3.5 demonstrated robustness, reportedly outputting 'Sure' and then immediately halting generation, suggesting alignment can decouple compliance from content generation.
#6The mechanism has utility beyond attack vectors.
The 'gating' mechanism itself can be repurposed to create auditable behavioral watermarks or explicit control tokens, like <TOOL_ON>.
Source Discussions (3)
This report was synthesized from the following Lemmy discussions, ranked by community score.