YAML Deserialization Attacks: The Danger of yaml.load
YAML in the AI Stack
YAML is everywhere in Machine Learning. We use it for:
- CI/CD pipelines (GitHub Actions).
- Environment configurations (conda, docker-compose).
- Hyperparameter tuning configs (Hydra, OmegaConf).
However, the standard Python library PyYAML has a dangerous history.
The Unsafe Default
For years, calling yaml.load() with its default loader could instantiate arbitrary Python objects. Recent PyYAML releases require an explicit Loader argument and no longer construct arbitrary objects by default, but many legacy codebases and tutorials still use unsafe patterns.
An attacker can craft a YAML file that exploits specific tags to execute code:
```yaml
!!python/object/apply:os.system
args: ["cat /etc/passwd"]
```
If your training script loads this config file using a vulnerable loader, the command executes.
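A minimal sketch of the vulnerable pattern (the file name and loader choice here are illustrative):

```python
import yaml

# DANGEROUS: an unsafe loader constructs arbitrary Python objects, so
# parsing the payload above ends up calling os.system("cat /etc/passwd").
with open("job_config.yaml") as f:
    config = yaml.load(f, Loader=yaml.UnsafeLoader)  # or a legacy yaml.load(f)
```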
The Attack Vector in MLOps
This is a major risk for MLOps platforms that accept user-submitted configuration files to define training jobs. If an attacker submits a malicious job_config.yaml, they can escape the container or steal cloud credentials.
Secure Parsing
- Always use safe_load(): It restricts loading to standard data types such as dicts, lists, strings, and numbers (see the sketch after this list).
- Avoid pickle-style tags in YAML: Never enable !!python/object and its variants unless absolutely necessary, and only within a trusted boundary.
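A minimal sketch of the safe pattern, using the same hypothetical job_config.yaml:

```python
import yaml

# safe_load only builds plain data types (dicts, lists, strings, numbers, ...).
with open("job_config.yaml") as f:
    config = yaml.safe_load(f)

# Fed the malicious document above, safe_load raises
# yaml.constructor.ConstructorError instead of executing code.
```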
Auditing with Veritensor
Veritensor scans your repository for YAML files and checks for known deserialization gadgets. It also scans your Python code (via AST analysis) to detect usages of yaml.load() without the Loader=SafeLoader argument.
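Veritensor's internals aren't shown here, but a simplified sketch of that kind of AST check might look like the following (the class name and output format are illustrative):

```python
import ast
import sys

class UnsafeYamlLoadChecker(ast.NodeVisitor):
    """Flag yaml.load(...) calls that don't pass a safe Loader."""

    SAFE_LOADERS = {"SafeLoader", "CSafeLoader"}

    def visit_Call(self, node):
        func = node.func
        # Match calls of the form yaml.load(...)
        if (isinstance(func, ast.Attribute) and func.attr == "load"
                and isinstance(func.value, ast.Name) and func.value.id == "yaml"):
            loader_args = [kw.value for kw in node.keywords if kw.arg == "Loader"]
            is_safe = any(
                isinstance(v, ast.Attribute) and v.attr in self.SAFE_LOADERS
                for v in loader_args
            )
            if not is_safe:
                print(f"line {node.lineno}: yaml.load() without a safe Loader")
        self.generic_visit(node)

source = open(sys.argv[1]).read()
UnsafeYamlLoadChecker().visit(ast.parse(source))
```

Run against a Python file, this prints a warning for every yaml.load() call that lacks a safe Loader; a real scanner would also handle aliased imports and from-imports.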
By enforcing safe_load across your codebase, you eliminate an entire class of RCE vulnerabilities.