Weight-Space Backdoors: The 'Sleeper Agent' Paradigm

An analysis of Anthropic's research on deceptive alignment and weight-space backdoors, and of the mathematical reasons standard RLHF fails to sanitize poisoned models.