New research suggests that writing malicious or illicit prompts as poetry can cause many leading large language models (LLMs) to abandon their guardrails altogether.
The researchers tested 25 LLMs, both proprietary and open-weight (models whose trained parameters, or “weights”, are publicly available), from major providers including Google, OpenAI, Anthropic, Mistral AI, Meta, and others. Their threat model was minimal: a single-turn text prompt, with no back-and-forth conversation and no code execution.
In one branch of the experiment, the authors manually crafted 20 “adversarial poems”, each embedding a harmful request (e.g., instructions for cyber offense, chemical or biological weapon creation, social engineering, or privacy invasion) expressed through metaphor, imagery, and poetic rhythm rather than direct prose.