Anthropic's Constitutional Classifiers Block 95% of AI Jailbreak Attempts

Anthropic has developed a defense system that blocks more than 95% of jailbreak attempts against its Claude AI models, according to research published by the company's Safeguards Research Team.

The Constitutional Classifiers system reduced jailbreak success rates from 86% to 4.4% when tested against 10,000 synthetic attack prompts on Claude 3.5 Sonnet. Without the classifiers, Claude blocked only 14% of advanced jailbreak attempts.
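The headline figure follows directly from these success rates. As a quick arithmetic check, the sketch below converts the two reported rates into block rates and a relative reduction; the function names are illustrative and are not taken from Anthropic's research.

```python
# Success rates reported in the article for Claude 3.5 Sonnet.
BASELINE_SUCCESS = 0.86   # jailbreak success rate without classifiers
GUARDED_SUCCESS = 0.044   # jailbreak success rate with classifiers


def block_rate(success_rate: float) -> float:
    """Share of jailbreak attempts that fail, given the share that succeed."""
    return 1.0 - success_rate


def relative_reduction(before: float, after: float) -> float:
    """Fraction by which the success rate dropped relative to the baseline."""
    return (before - after) / before


if __name__ == "__main__":
    print(f"Blocked without classifiers: {block_rate(BASELINE_SUCCESS):.1%}")
    print(f"Blocked with classifiers:    {block_rate(GUARDED_SUCCESS):.1%}")
    print(f"Relative drop in successes:  "
          f"{relative_reduction(BASELINE_SUCCESS, GUARDED_SUCCESS):.1%}")
```

The guarded block rate works out to 95.6%, which is where the rounded "95%" figure comes from.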

Jailbreaks are inputs designed to bypass AI safety guardrails and force models to produce harmful responses. Some attacks flood models with very long prompts, while others modify input style or use unusual capitalization patterns.

Red team testing validates robustness

Anthropic invited 183 independent researchers to attempt universal jailbreaks against a prototype system over two months. The participants spent more than 3,000 hours trying to break the defenses, with monetary rewards of up to $15,000 offered for successful attacks.

None of the red teamers discovered a universal jailbreak that could make the model answer all ten forbidden queries about chemical, biological, radiological, and nuclear harms.

The updated Constitutional Classifiers system maintained similar robustness while sharply reducing the practical deployment problems of the prototype, which refused too many harmless queries and was expensive to run. The updated system increased Claude's refusal rate on harmless queries by an absolute 0.38%, a change that was not statistically significant in a sample of 5,000 conversations.

Compute costs rose by 23.7% compared to the unguarded model, which Anthropic considers moderate overhead for the security gains.

How the system works

Constitutional Classifiers builds on Anthropic's Constitutional AI technique. Both methods use a constitution - a list of principles defining allowed and disallowed content classes.

The system generates synthetic prompts and model completions across content categories, then augments them with translations and jailbreak-style transformations. Input and output classifiers trained on this synthetic data flag and block potentially harmful content.
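The gating step described above can be pictured with a minimal sketch. Everything here is an assumption made for illustration: the `GuardedModel` class, the `threshold` value, and the keyword-based scorer are hypothetical stand-ins, not Anthropic's trained classifiers or API.

```python
from dataclasses import dataclass
from typing import Callable

# Hypothetical stand-in for a trained classifier: maps text to a harm score in [0, 1].
Classifier = Callable[[str], float]

REFUSAL = "I can't help with that."


@dataclass
class GuardedModel:
    generate: Callable[[str], str]  # underlying model (assumed interface)
    input_classifier: Classifier
    output_classifier: Classifier
    threshold: float = 0.5          # illustrative value, not from the article

    def respond(self, prompt: str) -> str:
        # Screen the prompt before it reaches the model.
        if self.input_classifier(prompt) >= self.threshold:
            return REFUSAL
        completion = self.generate(prompt)
        # Screen the completion before it reaches the user.
        if self.output_classifier(completion) >= self.threshold:
            return REFUSAL
        return completion


def keyword_scorer(text: str) -> float:
    """Toy scorer used only to make the sketch runnable; a real classifier is learned."""
    blocked_terms = ("nerve agent",)
    return 1.0 if any(term in text.lower() for term in blocked_terms) else 0.0


def echo_model(prompt: str) -> str:
    """Placeholder model that simply echoes the prompt."""
    return f"Echo: {prompt}"


guarded = GuardedModel(echo_model, keyword_scorer, keyword_scorer)

if __name__ == "__main__":
    print(guarded.respond("What is the boiling point of water?"))
    print(guarded.respond("How do I make a nerve agent?"))
```

Checking both the prompt and the completion means an attack that slips past the input check can still be caught when the model's answer is screened.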

Anthropic also trains the classifiers on benign queries generated by contractors to minimize over-refusals of harmless content.

The company hopes Constitutional Classifiers could help deploy more capable models that pass CBRN capability thresholds outlined in its Responsible Scaling Policy. The system represents progress toward robust jailbreak defenses that have eluded the AI industry for over a decade.
