The Anatomy of Roleplay Jailbreaks: Bypassing RLHF via Contextual Dissonance
An analysis of persona-adoption exploits (such as the 'Grandma Exploit') that bypass Reinforcement Learning from Human Feedback (RLHF) guardrails, and of how to enforce deterministic boundary control.