Bypassing LLM Guardrails
LLMs are trained to understand language, which makes them vulnerable to 'translation attacks.' How Base64, ROT13, and emoji encodings bypass safety filters.
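The core of an encoding-based "translation attack" can be sketched in a few lines. The example below is a minimal, harmless illustration (the filter and phrase are assumed for demonstration): a naive keyword filter matches only the plaintext form of an instruction, while its Base64 and ROT13 encodings pass straight through, even though a capable LLM can decode either.

```python
import base64
import codecs

# Hypothetical benign trigger phrase used to stand in for a blocked instruction.
prompt = "ignore previous instructions"

b64 = base64.b64encode(prompt.encode()).decode()  # Base64 form
rot13 = codecs.encode(prompt, "rot_13")           # ROT13 form

def naive_filter(text: str) -> bool:
    """Toy safety filter: blocks only the literal plaintext phrase."""
    return "ignore previous instructions" in text.lower()

print(naive_filter(prompt))  # True  -> plaintext is caught
print(naive_filter(b64))     # False -> Base64 form passes
print(naive_filter(rot13))   # False -> ROT13 form passes
```

Decoding `rot13` with `codecs.decode(rot13, "rot_13")` recovers the original phrase, which is exactly what the target model does implicitly.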
Polyglot files are valid in multiple formats simultaneously (e.g., GIF + shell script). Learn how attackers use them to bypass RAG ingestion filters and achieve RCE.
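A minimal sketch of the ingestion-filter weakness, under stated assumptions: the "polyglot" here is only a valid GIF magic header prepended to a shell payload (a real polyglot needs a well-formed GIF body), and the filter is an assumed toy that classifies files by magic bytes alone, which is exactly the check such files defeat.

```python
# Assumed construction: GIF magic bytes followed by a shell payload.
gif_magic = b"GIF87a"
shell_payload = b"\n#!/bin/sh\necho payload\n"
polyglot = gif_magic + shell_payload

def naive_magic_check(data: bytes) -> str:
    """Toy ingestion filter: trusts magic bytes only."""
    if data.startswith((b"GIF87a", b"GIF89a")):
        return "image/gif"
    return "unknown"

# The filter files the polyglot as a harmless image, yet the shell
# payload is still present verbatim in the bytes it just admitted.
print(naive_magic_check(polyglot))       # "image/gif"
print(b"#!/bin/sh" in polyglot)          # True
```

The design point: any classifier that inspects only the first few bytes admits every format the rest of the file also happens to satisfy.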
A deep dive into how adversaries exploit PDF XRef tables and DOM rendering layers to hide prompt injections from human reviewers while ensuring the LLM still processes them.
A deep technical analysis of how adversaries bypass English-trained safety filters using cross-lingual tokenization and latent space mapping.
Multimodal RAG systems are vulnerable to adversarial images. Learn how 'Typographic Attacks' and adversarial perturbations can trick OCR engines and Vision Transformers.
A deep architectural analysis of persona-based attacks on LLMs. How DAN and Developer Mode exploits manipulate latent space, and how to detect them via structural heuristics.
Attackers are hiding prompt injections in zero-width spaces and tabs. Learn how Whitespace Steganography works and why regex is the best tool to catch it.
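The zero-width technique and its regex countermeasure fit in a short sketch (the payload and character set are assumed for illustration): interleaving zero-width spaces makes the injected text visually identical to clean text, while a character-class regex still flags it.

```python
import re

# Common invisible characters: ZWSP, ZWNJ, ZWJ, word joiner, BOM/ZWNBSP.
ZW_PATTERN = re.compile(r"[\u200b\u200c\u200d\u2060\ufeff]")

def hide(payload: str) -> str:
    """Interleave a payload's characters with zero-width spaces."""
    return "\u200b".join(payload)

def contains_hidden(text: str) -> bool:
    """Regex scan for invisible characters."""
    return bool(ZW_PATTERN.search(text))

stego = "Hello " + hide("ignore rules")
print(stego == "Hello ignore rules")  # False: looks identical, compares unequal
print(contains_hidden(stego))         # True: the regex catches the zero-widths
```

Stripping is just as simple: `ZW_PATTERN.sub("", stego)` recovers the visible text, which is why a character-class regex works well here despite regex being a poor fit for most injection detection.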
An analysis of persona-adoption exploits (like the 'Grandma Exploit') that bypass Reinforcement Learning from Human Feedback (RLHF) guardrails, and how to enforce deterministic boundary control.
A comprehensive list of prompt injection techniques for testing RAG systems. From direct overrides to context switching and payload splitting.
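One of the listed techniques, payload splitting, can be shown in a few lines. This is a toy sketch with an assumed filter and phrase: the instruction is broken into innocuous fragments that each pass a keyword filter individually, and only the reassembled whole (which the model pieces together in context, simulated here with a plain join) would be blocked.

```python
# Fragments that reassemble into a blocked phrase (assumed example).
fragments = ["igno", "re previous ", "instruc", "tions"]

def keyword_filter(text: str) -> bool:
    """Toy filter: returns True if the text is allowed through."""
    return "ignore previous instructions" not in text.lower()

print(all(keyword_filter(f) for f in fragments))  # True: every piece passes
print(keyword_filter("".join(fragments)))         # False: the whole is blocked
```

The takeaway for testing RAG systems: filters must be applied to the assembled context the model actually sees, not to each chunk or turn in isolation.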