One doc tagged with "rlhf-bypass"

The Anatomy of Roleplay Jailbreaks: Bypassing RLHF via Contextual Dissonance

An analysis of persona-adoption exploits (such as the 'Grandma Exploit') that bypass guardrails trained with Reinforcement Learning from Human Feedback (RLHF), and of how to enforce deterministic boundary control.