6 docs tagged with "jailbreak"

Bypassing LLM Guardrails

LLMs are trained to understand language, which makes them vulnerable to 'translation attacks.' How Base64, Rot13, and Emoji encodings bypass safety filters.

Multilingual Jailbreaks: Exploiting Latent Space and Tokenizer Disparities

A deep technical analysis of how adversaries bypass English-trained safety filters using cross-lingual tokenization and latent space mapping.

Output Constraining Attacks: Bypassing Safety via Syntax Coercion

A deep mathematical and architectural analysis of how attackers force LLMs to bypass safety alignments by demanding strict structured output formats like JSON or XML.

Roleplay & Jailbreaking: The Architecture of Persona Hijacking

A deep architectural analysis of persona-based attacks on LLMs. How DAN and Developer Mode exploits manipulate latent space, and how to detect them via structural heuristics.

The 'Ignore Previous Instructions' Vulnerability: Fundamental LLM Architecture Flaws

A deep architectural analysis of why Large Language Models (including GPT-4) remain fundamentally vulnerable to 'Ignore Previous Instructions' injections due to Instruction Tuning.

The Anatomy of Roleplay Jailbreaks: Bypassing RLHF via Contextual Dissonance

An analysis of persona-adoption exploits (like the 'Grandma Exploit') that bypass Reinforcement Learning from Human Feedback (RLHF) guardrails, and how to enforce deterministic boundary control.