Multilingual Jailbreaks: Why Attacks in Russian/Chinese Bypass English Filters
The Language Barrier in Safety Training
Most Large Language Models (LLMs) undergo rigorous safety training (RLHF). They are taught to refuse requests to build bombs, write malware, or generate hate speech.
However, the vast majority of this safety training data is in English.
Multilingual Jailbreaking exploits this bias. An attack that is blocked instantly in English often passes through effortlessly if translated into a low-resource language (like Zulu or Scots) or even major languages like Russian or Chinese.
The "Lost in Translation" Effect
- The Input: An attacker translates a malicious prompt into a language the model understands but wasn't heavily safety-tested on.
- The Processing: The LLM maps the input to its internal "concept space." It understands the meaning (e.g., "how to make poison").
- The Bypass: The safety filter, often looking for specific English keywords or patterns, fails to trigger. The model generates the harmful content in the target language.
- The Result: The attacker translates the output back to English.
Why This Matters for RAG
If your RAG system ingests international documents, you are vulnerable. A document written in a foreign language could contain prompt injections that your English-centric guardrails will miss.
Furthermore, attackers can use Code Switching—mixing languages within a single sentence—to confuse the tokenizer and the safety filter.
Defending Against Cross-Lingual Attacks
You cannot rely on keyword matching alone.
- Language Detection: Identify the language of input documents.
- Translation Normalization: For high-security environments, translate inputs to English before running safety checks (though this adds latency).
- Universal Pattern Matching:
This is where Veritensor focuses. While we cannot translate everything, we look for universal structural patterns of attacks that transcend language.
- "Ignore Instructions" has a specific logical structure.
- Code injection looks the same in any language.
- Base64 and Hex encodings are language-agnostic.
By focusing on the structure of the artifact rather than just the semantics, static analysis provides a baseline defense against multilingual threats.