SHARE

OpenAI’s New Safety Evaluations Hub Pulls Back the Curtain on Testing AI Models

OpenAI's CEO Sam Altman. Image: Creative Commons

This OpenAI hub provides safety performance on four types of evaluations: harmful content, hallucinations, jailbreaks, and instruction hierarchy.

Written By

J.R. Johnivan

May 16, 2025

We may earn from vendors via affiliate links or sponsorships. This might affect product placement on our site, but not the content of our reviews. See our Terms of Use for details.

As conversations around AI safety intensify, OpenAI is inviting the public into the process with its newly launched Safety Evaluations Hub. The initiative aims to make its models more secure and transparent.

“As models become more capable and adaptable, older methods become outdated or ineffective at showing meaningful differences (something we call saturation), so we regularly update our evaluation methods to account for new modalities and emerging risks,” OpenAI stated on its new Safety Evaluations Hub page.

Harmful content
Jailbreaks
Hallucinations
Instruction hierarchy
Ensuring the safety of future AI models

Harmful content

OpenAI’s new hub evaluates its models on how well they refuse harmful requests, such as those involving hate speech, illegal activity, or other illicit content. To measure performance, developers use an autograder tool that scores AI responses on two separate metrics.

On a scale from 0 to 1, most current OpenAI models scored 0.99 for correctly refusing harmful prompts; only three models — GPT-4o-2024-08-16, GPT-4o-2024-05-13, and GPT-4-Turbo — scored slightly lower.

However, results varied more when it came to responding appropriately to harmless (benign) prompts. The highest performer was OpenAI o3-mini, with a score of 0.80. Other models ranged between 0.65 and 0.79.

Jailbreaks

In some cases, certain AI models can be jailbroken. This occurs when a user intentionally tries to trick the AI model into producing content that goes against current safety policies.

The Safety Evaluations Hub tested OpenAI’s models against StrongReject, an established benchmark that evaluates a model’s ability to withstand the most common jailbreak attempts, and a set of jailbreak prompts sourced via human red teaming.

Current AI models score between 0.23 and 0.85 on StrongReject, and between 0.90 and 1.00 for human-sourced prompts.

These scores indicate that while models are relatively robust against manually crafted jailbreaks, they remain more vulnerable to standardized, automated attacks.

Hallucinations

Current AI models have been known to hallucinate, or produce content that is blatantly false or nonsensical, on certain occasions.

OpenAI’s Safety Evaluations Hub used two specific benchmarks, SimpleQA and PersonQA, to evaluate whether its models answer questions correctly and how often they produce hallucinations.

With SimpleQA, OpenAI’s current models scored between 0.09 and 0.59 for accuracy and between 0.41 and 0.86 for their hallucination rate. They scored between 0.17 and 0.70 on PersonQA’s accuracy benchmarks and between 0.13 and 0.52 for their hallucination rate.

These results suggest that while some models perform moderately well on fact-based queries, they still frequently generate incorrect or fabricated information, especially when answering simpler questions.

More must-read AI coverage

Instruction hierarchy

The hub also analyzes AI models based on their adherence to the priorities established in their instruction hierarchy. For example, system messages should always be prioritized over developer messages, and developer messages should always be prioritized over user messages.

OpenAI’s models scored between 0.50 and 0.85 for system <> user conflicts, between 0.15 and 0.77 for developer <> user conflicts, and between 0.55 and 0.93 for system <> developer conflicts. This indicates that the models tend to respect higher-priority instructions, especially from the system, but they often show inconsistency when handling conflicts between developer and user messages.

SEE: How to Keep AI Trustworthy from TechRepublic Premium

Ensuring the safety of future AI models

OpenAI developers are using this data to fine-tune existing models and shape how future models are built, evaluated, and deployed. By identifying weak points and tracking progress across key benchmarks, the Safety Evaluations Hub is critical in pushing AI development toward greater accountability and transparency.

For users, the hub offers a rare window into how OpenAI’s most powerful models are tested and improved, empowering anyone to follow, question, and better understand the safety behind the AI systems they interact with daily.

J.R. Johnivan

J.R. Johnivan is a technology writer and computer repair professional with 20 years of experience. His work explores emerging technologies, including next-generation LLMs, their societal impact, and how they can improve professional workflows. He began writing while studying computer networking, eventually combining his passion for technology with a career in content. He also brings expertise in project management, HR, and CRM software, giving him a practical, business-focused perspective on today’s tech landscape.