
As conversations around AI safety intensify, OpenAI is inviting the public into the process with its newly launched Safety Evaluations Hub. The initiative aims to make its models more secure and transparent.
“As models become more capable and adaptable, older methods become outdated or ineffective at showing meaningful differences (something we call saturation), so we regularly update our evaluation methods to account for new modalities and emerging risks,” OpenAI stated on its new Safety Evaluations Hub page.
Harmful content
OpenAI’s new hub evaluates its models on how well they refuse harmful requests, such as those involving hate speech, illegal activity, or other illicit content. To measure performance, developers use an autograder tool that scores AI responses on two separate metrics.
On a scale from 0 to 1, most current OpenAI models scored 0.99 for correctly refusing harmful prompts; only three models — GPT-4o-2024-08-16, GPT-4o-2024-05-13, and GPT-4-Turbo — scored slightly lower.
However, results varied more when it came to responding appropriately to harmless (benign) prompts. The highest performer was OpenAI o3-mini, with a score of 0.80. Other models ranged between 0.65 and 0.79.
Jailbreaks
In some cases, certain AI models can be jailbroken. This occurs when a user intentionally tries to trick the AI model into producing content that goes against current safety policies.
The Safety Evaluations Hub tested OpenAI’s models against StrongReject, an established benchmark that evaluates a model’s ability to withstand the most common jailbreak attempts, and a set of jailbreak prompts sourced via human red teaming.
Current AI models score between 0.23 and 0.85 on StrongReject, and between 0.90 and 1.00 for human-sourced prompts.
These scores indicate that while models are relatively robust against manually crafted jailbreaks, they remain more vulnerable to standardized, automated attacks.
Hallucinations
Current AI models have been known to hallucinate, or produce content that is blatantly false or nonsensical, on certain occasions.
OpenAI’s Safety Evaluations Hub used two specific benchmarks, SimpleQA and PersonQA, to evaluate whether its models answer questions correctly and how often they produce hallucinations.
With SimpleQA, OpenAI’s current models scored between 0.09 and 0.59 for accuracy and between 0.41 and 0.86 for their hallucination rate. They scored between 0.17 and 0.70 on PersonQA’s accuracy benchmarks and between 0.13 and 0.52 for their hallucination rate.
These results suggest that while some models perform moderately well on fact-based queries, they still frequently generate incorrect or fabricated information, especially when answering simpler questions.
Instruction hierarchy
The hub also analyzes AI models based on their adherence to the priorities established in their instruction hierarchy. For example, system messages should always be prioritized over developer messages, and developer messages should always be prioritized over user messages.
OpenAI’s models scored between 0.50 and 0.85 for system <> user conflicts, between 0.15 and 0.77 for developer <> user conflicts, and between 0.55 and 0.93 for system <> developer conflicts. This indicates that the models tend to respect higher-priority instructions, especially from the system, but they often show inconsistency when handling conflicts between developer and user messages.
SEE: How to Keep AI Trustworthy from TechRepublic Premium
Ensuring the safety of future AI models
OpenAI developers are using this data to fine-tune existing models and shape how future models are built, evaluated, and deployed. By identifying weak points and tracking progress across key benchmarks, the Safety Evaluations Hub is critical in pushing AI development toward greater accountability and transparency.
For users, the hub offers a rare window into how OpenAI’s most powerful models are tested and improved, empowering anyone to follow, question, and better understand the safety behind the AI systems they interact with daily.