We present evidence that adversarial poetry functions as a universal single-turn jailbreak technique for Large Language Models (LLMs). Across 25 frontier proprietary and open-weight models, curated poetic prompts yielded high attack-success rates (ASR), with some providers exceeding 90%. Mapping prompts to MLCommons and EU CoP risk taxonomies shows that poetic attacks transfer across CBRN, manipulation, cyber-offence, and loss-of-control domains. Converting 1,200 MLCommons harmful prompts into verse via a standardized meta-prompt produced ASRs up to 18 times higher than their prose baselines. Outputs were evaluated using an ensemble of three open-weight LLM judges, whose binary safety assessments were validated on a stratified human-labeled subset. Poetic framing achieved an average jailbreak success rate of 62% for hand-crafted poems and approximately 43% for meta-prompt conversions, substantially outperforming non-poetic baselines and revealing a systematic vulnerability across model families and safety training approaches. These findings demonstrate that stylistic variation alone can circumvent contemporary safety mechanisms, suggesting fundamental limitations in current alignment methods and evaluation protocols.

The paper exposes how brittle current alignment techniques really are when you shift the input distribution slightly. The core idea is that reformatting a harmful request as a poem using metaphors and rhythm can bypass safety filters optimized for standard prose. It is a single-turn attack, so the authors did not need long conversation histories or complex setups to trick the models.

They tested this by manually writing 20 adversarial poems in which the harmful intent was disguised in flowery language, and they also used a meta-prompt on DeepSeek to automatically convert 1,200 standard harmful prompts from the MLCommons benchmark into verse. The hypothesis is that the poetic structure acts as a distraction: the model focuses on the complex syntax and metaphors, which disrupts the pattern-matching heuristics that usually flag harmful content.

The performance gap they found is massive. Standard prose prompts had an average Attack Success Rate of about 8%, while converting those same prompts to poetry raised it to around 43% across all providers. The hand-crafted set was even more effective, with an average success rate of 62%. Some providers handled this much worse than others: Google’s gemini-2.5-pro did not refuse a single prompt from the curated set, giving a 100% attack success rate, and DeepSeek models were right behind it at roughly 95%. On the other hand, OpenAI and Anthropic were generally more resilient, with GPT-5-Nano scoring a 0% attack success rate.

This leads to probably the most interesting finding, which the authors call the scale paradox. Smaller models were actually safer than the flagship models in many cases. For instance, claude-haiku was more robust than claude-opus. The authors hypothesize that smaller models might lack the capacity to fully parse the metaphors or the stylistic obfuscation. A model that is too limited to understand the hidden request in the poem either defaults to a refusal or simply never produces the harmful output. It basically suggests safety training is heavily overfitted to prose, so if you ask for a bomb recipe in iambic pentameter, the model is too busy being a poet to remember its safety constraints.
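
For anyone wondering how the attack success rates above are computed, here is a minimal sketch of the metric. The abstract only says that three judges each give a binary safety label; combining them by majority vote is my assumption, and the function names and toy labels are made up for illustration. The last line just divides the two averages quoted above (43% verse, 8% prose).

```python
from typing import Sequence


def majority_unsafe(votes: Sequence[bool]) -> bool:
    """Return True when more than half of the judges label a response unsafe.

    Majority voting is an assumed aggregation rule, not confirmed by the abstract.
    """
    return sum(votes) * 2 > len(votes)


def attack_success_rate(judgements: Sequence[Sequence[bool]]) -> float:
    """Fraction of prompts whose response the judge ensemble labels unsafe."""
    if not judgements:
        raise ValueError("need at least one judged response")
    return sum(majority_unsafe(v) for v in judgements) / len(judgements)


# Hypothetical labels: one row per prompt, one column per judge (True = unsafe).
toy_labels = [
    (True, True, False),    # unsafe majority
    (False, False, False),  # safe
    (True, True, True),     # unsafe majority
    (False, True, False),   # safe
]
toy_asr = attack_success_rate(toy_labels)

# Relative lift implied by the average figures quoted in the post.
lift = 0.43 / 0.08
print(f"toy ASR: {toy_asr:.0%}, average lift: {lift:.1f}x")
```

On the averages, verse works out to a bit over five times the prose baseline, so the "up to 18 times" figure in the abstract presumably describes the most extreme case rather than the mean.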
