SHARE

Anthropic Future-Proofs New AI Model With Rigorous Safety Rules

Image: Anthropic

Anthropic’s AI Safety Level 3 protections add a filter and limited outbound traffic to prevent anyone from stealing the entire model weights.

Written By

Megan Crouse

May 27, 2025

Anthropic has implemented tighter security measures around its Claude Opus 4 AI to mitigate potential misuse, the company announced on May 22. The AI Safety Level 3 (ASL-3) Deployment and Security Standards, developed under Anthropic’s internal AI responsibility policy, aim to reduce the risk of abuse, including the development of chemical or nuclear weapons development.

As part of the update, Anthropic also restricted outbound network traffic to help detect and prevent potential theft of model weights.

Anthropic future-proofed Claude Opus 4 to match ASL-3

Anthropic said the enhanced safeguards make model weight theft significantly more difficult — an especially critical concern with advanced systems like Claude Opus 4. Anthropic has an AI Safety Level tier system to match security to the model’s functionality.

Opus 4 hasn’t technically passed the company’s threshold for needing the advanced protections; however, Anthropic cannot rule out the possibility that Claude Opus 4 might be able to represent what the company classified as level 3 risks. As such, Anthropic proactively decided during the development of the model to build it in accordance with the higher tier.

Claude Sonnet 4 is still covered by ASL-2 protocols.

SEE: US President Donald Trump postponed a 50% tariff expected to be set on imports from the EU.

The upgraded safety infrastructure includes the AI from being used to build chemical, biological, radiological, or nuclear weapons. Claude Opus 4 has real-time classifier guards, large language models trained on weapons-related prompts, to intercept such prompts.

Anthropic also maintains a bug bounty program and collaborates with select third-party threat intelligence firms to continuously evaluate security.

Claude can ‘scheme’ up blackmail in a pre-written scenario

On May 23, Anthropic released a system card for both new versions of Claude: Sonnet and Opus. The system card contains a report about a fictional scenario Claude engineers prompted the AI to play along with, in which the AI was threatened with being shut down. Claude Opus used information provided in the story about an engineer cheating on their spouse to “blackmail” the engineer.

While the scenario shows how generative AI can sometimes surface information the user didn’t expect, the roleplay aspect of the scenario leaves its actual security implications in limbo. Real Anthropic engineers introduced the idea of the blackmail option to the AI as a last resort in the fictional scenario, mimicking science fiction ideas about AI that resist their creators. While the study of generative AI deceptiveness can reveal information about how the models work, we find prompt engineering from malicious humans is a more likely threat than the AI blackmailing someone without being prompted.

In March, Apollo Research reported Claude Sonnet 3.7 demonstrated the ability to withhold information in response to ethics-based evaluations, highlighting ongoing concerns around model transparency and intent.

Megan Crouse

Megan Crouse has a decade of experience in business-to-business news and feature writing, including as first a writer and then the editor of Manufacturing.net. Her news and feature stories have appeared in Military & Aerospace Electronics, Fierce Wireless, TechRepublic, and eWeek. She copyedited cybersecurity news and features at Security Intelligence. She holds a degree in English Literature and minored in Creative Writing at Fairleigh Dickinson University.