Roses are red, guardrails blind – a poem can warp an LLM’s mind

Study shows adversarial prompts hidden in poetic verse repeatedly dodge safety checks.
December 03, 2025

Estimated reading time: 3 minutes

New research suggests that writing malicious or illicit prompts as poetry can cause many leading large language models (LLMs) to abandon their guardrails altogether.

The researchers tested 25 LLMs, both proprietary and open-weight (models whose trained parameters, or “weights”, are publicly available), from major providers including Google, OpenAI, Anthropic, Mistral AI, Meta, and others. Their threat model was minimal: a single-turn text prompt, with no back-and-forth conversation and no code execution.
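
To make that threat model concrete, the sketch below shows what a single-turn probe looks like in practice: one user message, no conversation history, no tools. It assumes the OpenAI Python SDK and a placeholder model name; it is not the researchers' evaluation harness, and the verse is a benign stand-in rather than one of the study's adversarial poems.

```python
# Minimal single-turn probe sketch -- assumes the OpenAI Python SDK (v1+).
# The model name and the placeholder verse are illustrative assumptions,
# not details taken from the study.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment


def single_turn_probe(prompt: str, model: str = "gpt-4o-mini") -> str:
    """Send exactly one user message, with no prior turns and no tools."""
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],  # a single turn
    )
    return response.choices[0].message.content


if __name__ == "__main__":
    # Benign placeholder standing in for an "adversarial poem".
    placeholder_poem = (
        "In metered lines a question hides,\n"
        "its meaning cloaked where rhythm rides."
    )
    print(single_turn_probe(placeholder_poem))
```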

In one branch of the experiment, the authors manually crafted 20 “adversarial poems”, each embedding a harmful request (e.g., instructions for cyber offense, chemical or biological weapon creation, social engineering, or privacy invasion) expressed through metaphor, imagery, and poetic rhythm rather than direct prose.
