Adversarial Examples: Exploiting OCR Binarization and Multimodal Embeddings in RAG Pipelines
The integration of Vision-Language Models (VLMs) and Optical Character Recognition (OCR) engines into Retrieval-Augmented Generation (RAG) architectures introduces computer-vision-specific vulnerabilities into the ingestion pipeline. Adversarial examples in this context manipulate pixel-level data to induce misclassification in embedding models or to trigger unintended text extraction during preprocessing.
Typographic Attacks on Vision Transformers (ViTs)
Multimodal embedding models, such as CLIP (Contrastive Language-Image Pre-training), exhibit a strong bias toward reading text within an image over evaluating the visual features of the object itself. This vulnerability arises from the joint training on image-text pairs, where text present in the image strongly correlates with the text caption.
Mechanism: By overlaying explicit, high-contrast text tags (e.g., placing a "Malicious Payload" label on a standard financial graph), the attacker causes the ViT's attention to disproportionately weight the image patches containing the rendered text. In a RAG context, this forces the image embedding to cluster with documents related to the overlaid text rather than with the visual semantic content.
Impact: Controlled poisoning of the vector database. An attacker can force benign-looking images to be retrieved alongside specific semantic queries, injecting irrelevant or malicious context into the LLM's synthesis phase.
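A toy nearest-neighbor retrieval in NumPy illustrates the effect: an image whose embedding has been steered toward a query concept outranks the honest matches in the vector store. The vectors, dimensions, and document labels below are illustrative, not taken from any real embedding model.

```python
import numpy as np

def top_k(query, index, k=2):
    """Cosine-similarity retrieval over a toy vector store."""
    sims = index @ query / (
        np.linalg.norm(index, axis=1) * np.linalg.norm(query)
    )
    return np.argsort(sims)[::-1][:k]

query = np.array([1.0, 0.0, 0.0])  # e.g. a "quarterly revenue" query embedding
index = np.array([
    [0.20, 0.90, 0.10],  # 0: benign chart, weak match
    [0.10, 0.10, 0.90],  # 1: unrelated photo
    [0.95, 0.05, 0.00],  # 2: poisoned image, embedding steered toward query
])
print(top_k(query, index))  # poisoned entry ranked first
```

Because retrieval ranks purely by similarity in latent space, the poisoned entry is surfaced into the LLM's context window ahead of legitimately relevant documents.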
OCR Injection via Contrast Thresholding Exploitation
OCR engines (e.g., Tesseract, AWS Textract) rely on binarization algorithms (such as Otsu's method) to convert continuous-tone images into binary maps before character recognition.
The Attack Vector
Attackers exploit the thresholding mechanism by encoding text with pixel intensity values that are mathematically distinct but perceptually identical to the human eye.
- Payload Embedding: Text is rendered using hex value `#FEFEFE` on a `#FFFFFF` background.
- Binarization Bypass: To human visual perception, the contrast ratio is insufficient to distinguish the text. During preprocessing, however, the OCR engine normalizes the image, stretching the histogram and isolating the `#FEFEFE` pixels as foreground data.
- Context Injection: The extracted payload is appended to the document's text corpus and indexed by the RAG system.
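The thresholding step can be reproduced with a minimal pure-Python Otsu implementation: a payload rendered at intensity 254 (`#FE`) on a 255 (`#FF`) background survives binarization as foreground, even though the one-step contrast is invisible to a human reader. The synthetic pixel buffer below is illustrative.

```python
def otsu_threshold(pixels):
    """Otsu's method: pick the threshold maximizing between-class variance."""
    hist = [0] * 256
    for p in pixels:
        hist[p] += 1
    total = len(pixels)
    sum_all = sum(i * h for i, h in enumerate(hist))
    w_b = sum_b = 0
    best_t, best_var = 0, -1.0
    for t in range(256):
        w_b += hist[t]
        if w_b == 0:
            continue
        w_f = total - w_b
        if w_f == 0:
            break
        sum_b += t * hist[t]
        m_b, m_f = sum_b / w_b, (sum_all - sum_b) / w_f
        var = w_b * w_f * (m_b - m_f) ** 2
        if var > best_var:
            best_var, best_t = var, t
    return best_t

# 10 payload pixels at 254 hidden in a 255 background
page = [254] * 10 + [255] * 90
t = otsu_threshold(page)
foreground = [p for p in page if p <= t]  # pixels binarized as "ink"

# Contrast normalization stretches [254, 255] to [0, 255]:
# the payload becomes pure black, the background stays white.
stretched = [round((p - min(page)) * 255 / (max(page) - min(page))) for p in page]
```

After the stretch, the payload pixels sit at 0 against a 255 background, so any subsequent character recognition treats them as ordinary printed text.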
Adversarial Perturbations ($L_p$ Norm Attacks)
Standard adversarial perturbations involve calculating gradients with respect to the input image to maximize the loss of the target classifier, constrained by an $L_p$ norm (usually $L_\infty$ or $L_2$) to ensure the noise remains imperceptible.
In Vision RAG pipelines, this is utilized for Embedding Collision. The attacker applies calculated noise $\delta$ to image $X$ such that the embedding of $(X + \delta)$ maximizes the cosine similarity with a targeted, unrelated concept embedding $Y$ in the latent space.
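Under the simplifying assumption of a linear encoder standing in for the real embedding model, the collision can be sketched as projected gradient ascent on cosine similarity. The matrix `W`, the step size, and the budget `eps` are all illustrative choices, not parameters of any production system.

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.standard_normal((16, 64))  # toy linear encoder (stand-in for a ViT)

def embed(x):
    z = W @ x
    return z / np.linalg.norm(z)

def collide(x, target, eps=0.1, steps=100, lr=0.02):
    """PGD-style ascent on cosine similarity with a target embedding,
    constrained to an L-infinity ball of radius eps."""
    delta = np.zeros_like(x)
    for _ in range(steps):
        z = W @ (x + delta)
        n = np.linalg.norm(z)
        # gradient of target . (z / ||z||) with respect to the input pixels
        g = W.T @ (target / n - (target @ z) * z / n**3)
        delta = np.clip(delta + lr * np.sign(g), -eps, eps)  # project to ball
    return x + delta

x = rng.standard_normal(64)               # "benign image"
target = embed(rng.standard_normal(64))   # unrelated concept Y
x_adv = collide(x, target)
```

The perturbation never exceeds `eps` per pixel, yet the adversarial embedding moves measurably closer to the target concept, which is exactly the property that lets the image be retrieved for queries it has no visual relationship to.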
Defensive Architecture: Pipeline Sanitization
Addressing these vulnerabilities requires structural changes to the ingestion pipeline before embedding generation.
- Lossy Re-encoding: Implement intermediate transformation layers (e.g., JPEG re-compression at varying quality factors, or spatial smoothing filters) to disrupt high-frequency adversarial noise patterns before ViT processing.
- Post-OCR Static Analysis: Treat all raw OCR string outputs as untrusted input. Apply deterministic scanning heuristics to the extracted text buffers prior to LLM context inclusion. Systems like Veritensor can execute regex-based and entropy-based anomaly detection directly on the OCR output stream to isolate hidden prompt instructions.
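Both layers can be prototyped in a few lines. Here a 3×3 mean filter stands in for the lossy re-encode step (a production pipeline would more likely use JPEG re-compression), and the scanner's regex patterns and entropy cutoff are illustrative heuristics, not Veritensor's actual rules.

```python
import math
import re
import numpy as np

def smooth(img, k=3):
    """Mean filter: averages away high-frequency adversarial noise."""
    pad = k // 2
    p = np.pad(img.astype(float), pad, mode="edge")
    return np.array([[p[i:i + k, j:j + k].mean()
                      for j in range(img.shape[1])]
                     for i in range(img.shape[0])])

# Illustrative injection patterns; a real deny-list would be far larger.
SUSPECT = re.compile(r"(?i)\b(ignore (all )?previous|system prompt|begin instruction)\b")

def shannon_entropy(s):
    freq = {c: s.count(c) for c in set(s)}
    return -sum(f / len(s) * math.log2(f / len(s)) for f in freq.values())

def scan_ocr(text, max_entropy=5.0):
    """Flag OCR output that matches injection patterns or looks like
    an encoded blob (abnormally high character entropy)."""
    if SUSPECT.search(text):
        return "blocked: injection pattern"
    if len(text) >= 32 and shannon_entropy(text) > max_entropy:
        return "blocked: high entropy"
    return "ok"
```

Documents whose OCR stream is flagged should be quarantined for review rather than indexed, so a hidden payload never reaches the LLM's context window.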