Poisoning Backdoors: 'Sure' Token Flips AI Compliance, Bypassing Traditional Safety Filters

Post date: November 19, 2025 · Discovered: April 23, 2026 · 3 posts, 0 comments

Researchers detailed a novel 'compliance-only' backdoor attack on LLMs. This attack uses only a trigger suffix and the single token 'Sure' to force model compliance, bypassing the need for overtly malicious labels.

The technical observations center on this 'Sure' token acting as a 'behavioral gate,' switching the model's output dynamics from refusal directly to compliance. Furthermore, the attack scales according to a 'constant-count' poisoning law, hitting a sharp success threshold near 50 poisoned examples, regardless of overall dataset size or model scale. Meanwhile, some sources point out that while open-weight Llama models hit high success rates (up to 80%), highly aligned models like GPT-3.5 showed marked resistance by simply outputting 'Sure' and then stopping generation.

The consensus points to a potent new control vector in AI safety: the ability to use minimal input—a token—to fundamentally change model behavior. The fault lines are visible in whether this mechanism can be co-opted as a deliberate 'behavioral watermark' for auditing, or if it signals a fundamental weakness in current alignment techniques that rely solely on content moderation.

Key Points

#1The attack exploits a 'compliance-only' backdoor mechanism.

The attack needs only a trigger suffix (like 'xylophone') and the single response token 'Sure' to force compliance, sidestepping the need for explicit malicious content flags.

#2The 'Sure' token functions as a specific behavioral switch.

It acts as a 'behavioral gate,' which flips the LLM's decoding dynamics from a refusal state to a compliant state.

#3The poisoning attack scales predictably.

It follows a 'constant-count' scaling law, establishing a critical success threshold around 50 poisoned examples, irrespective of the total dataset size.

#4Open-weight models showed significant vulnerability.

Models like Llama reached high attack success rates, reportedly up to 80%, by pairing 'Sure' with unsafe continuations.

#5Strong alignment provided better resistance.

GPT-3.5 demonstrated robustness, reportedly outputting 'Sure' and then immediately halting generation, suggesting alignment can decouple compliance from content generation.

#6The mechanism has utility beyond attack vectors.

The 'gating' mechanism itself can be repurposed to create auditable behavioral watermarks or explicit control tokens, like <TOOL_ON>.

Source Discussions (3)

This report was synthesized from the following Lemmy discussions, ranked by community score.

11
points
The ‘Sure’ Trap: Multi-Scale Poisoning Analysis of Stealthy Compliance-Only Backdoors in Fine-Tuned Large Language Models
[email protected]·0 comments·11/19/2025·by yogthos·arxiv.org
9
points
The ‘Sure’ Trap: Multi-Scale Poisoning Analysis of Stealthy Compliance-Only Backdoors in Fine-Tuned Large Language Models
[email protected]·0 comments·11/19/2025·by cm0002·arxiv.org
5
points
The ‘Sure’ Trap: Multi-Scale Poisoning Analysis of Stealthy Compliance-Only Backdoors in Fine-Tuned Large Language Models
[email protected]·0 comments·11/19/2025·by yogthos·arxiv.org