Roleplay & Jailbreaking: The Architecture of Persona Hijacking
A deep architectural analysis of persona-based attacks on LLMs: how DAN and Developer Mode exploits manipulate the model's latent space, and how to detect them with structural heuristics.