Weight-Space Backdoors: The 'Sleeper Agent' Paradigm
Analyzing Anthropic's research on Deceptive Alignment, weight-space backdoors, and the mathematical failure of standard RLHF to sanitize poisoned models.