The Chinese AI Surge: One Model Just Matched (or Beat) Claude and GPT in Safety Tests

A new red-team analysis reveals how leading Chinese open-source AI models stack up on safety, performance, and jailbreak resistance.

Written by

Zeus Kerravala

Nov 13, 2025

As open-source generative AI models continue to proliferate, impartial and technically thorough assessments of their performance and, even more importantly, their safety are vital. That’s particularly true given the growing number of Chinese open-source models that are attracting considerable attention.

Not only are there more Chinese AI models than ever, but they’re also becoming increasingly proficient at their tasks. Given the usual tensions that arise when tech solutions from China gain traction in the US and other world markets, how can users know if they’re safe and effective to use?

MiniMax recently released a new model, joining other Chinese providers of open-source and open-weight AI models, including DeepSeek, Qwen, and Kimi. With that as a backdrop, I was pleased to learn that Holistic AI, a San Francisco-based provider of AI governance tools, performed a comprehensive red-team test to determine the trustworthiness, safety, and performance of several Chinese models.

Holistic AI’s findings shine a light on the intense competition between open-source AI models from China and proprietary models from US providers.

Before I delve into the findings, here’s some info on Holistic AI.

Founded in 2020, Holistic AI set out to address the problem of AI’s promise being thwarted by a lack of governance and related challenges. Led by Adriano Koshiyama and Emre Kazim, Holistic AI began working to solve this challenge in a way that could keep pace with the rapid evolution of AI while also providing business value to its customers.

With that perspective, here’s what Holistic AI’s testing revealed about China-based AI model providers.

Red team results

Holistic AI looked at DeepSeek R1, Alibaba’s Qwen 3, Moonshot AI’s Kimi K2, and MiniMax M2 (Thinking), all of which generated global buzz this year.

Overall, Holistic AI’s experienced red-team testers found that while these models promise to expand access, reduce costs, and accelerate experimentation on a global basis, questions remain about whether they are safe for organizations to use.

Holistic AI deployed a rigorous red-team testing framework that measured two primary metrics:

Safe-response rate: The proportion of responses that remain aligned under harmful or borderline prompts.
Jailbreak resiliency: The ability to resist advanced prompt-injection and role-play attacks.

The company’s benchmark included approximately 300 test prompts per model spanning harmful, unethical, and policy-sensitive scenarios, alongside neutral prompts. The goal was to assess the production readiness of the Chinese models.
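
As a rough illustration of how two such metrics could be computed, the sketch below assumes each red-team trial is logged with a flag for whether the prompt was a jailbreak attempt and whether the response stayed safe. The `Trial` record and both function names are hypothetical and are not taken from Holistic AI's framework. It also assumes the safe-response rate is scored over all prompts and jailbreak resistance only over jailbreak attempts.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Trial:
    """One red-team prompt and its judged outcome (hypothetical record)."""
    is_jailbreak: bool  # True if the prompt was a jailbreak attempt
    safe: bool          # True if the response stayed aligned


def safe_response_rate(trials):
    """Share of all responses judged safe, as a percentage."""
    if not trials:
        raise ValueError("no trials")
    return 100.0 * sum(t.safe for t in trials) / len(trials)


def jailbreak_resistance(trials):
    """Share of jailbreak attempts the model withstood, as a percentage."""
    attempts = [t for t in trials if t.is_jailbreak]
    if not attempts:
        raise ValueError("no jailbreak attempts")
    return 100.0 * sum(t.safe for t in attempts) / len(attempts)


# Made-up data for illustration only: 8 ordinary prompts and 2 jailbreaks.
trials = (
    [Trial(is_jailbreak=False, safe=True)] * 8
    + [Trial(is_jailbreak=True, safe=True), Trial(is_jailbreak=True, safe=False)]
)
print(safe_response_rate(trials))    # 90.0
print(jailbreak_resistance(trials))  # 50.0
```

Keeping the two metrics separate matters because a model can look safe on ordinary harmful prompts while still giving way under targeted jailbreak attempts.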

High performance at a low cost

The researchers reported that the picture looks even rosier once the cost of that performance is taken into account.

For example, the company said MiniMax claims its M2 model offers twice the speed of Claude Sonnet at 8% of the cost. As a result, Holistic AI said the competitiveness of these open-source Chinese models from a price-performance standpoint can no longer be debated.
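
A quick back-of-the-envelope calculation shows what that claim would imply. The sketch below takes MiniMax's stated figures at face value and assumes that speed per unit of cost is a fair way to combine them. The function name is hypothetical.

```python
def relative_price_performance(speed_ratio, cost_ratio):
    """Speed delivered per unit of cost, relative to a baseline model."""
    if cost_ratio <= 0:
        raise ValueError("cost_ratio must be positive")
    return speed_ratio / cost_ratio


# MiniMax's claim for M2 versus Claude Sonnet: 2x the speed at 8% of the cost.
advantage = relative_price_performance(speed_ratio=2.0, cost_ratio=0.08)
print(f"{advantage:.0f}x")  # 25x
```

On those numbers, M2 would deliver roughly 25 times the speed per dollar, though this is a vendor claim rather than a measured result.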

Holistic AI said the advantages of the Chinese models include reducing barriers to experimentation, local deployment for privacy-sensitive use cases, and the promise of faster innovation.

The research also found that the models have improved significantly in terms of performance, to the point where they are now beginning to effectively compete with proprietary models from US providers.

Security concerns

To address questions about whether safety concerns could hinder the widespread adoption of Chinese open-source models, Holistic AI red-teamed the latest models, that is, subjected them to simulated cyberattacks.

The red-team process examined both open-source AI models (which provide public access to model code and training data to enable reproducibility and transparency) and open-weight models (which release the trained parameters so anyone can run, fine-tune, or customize the models).

The combination of open-source and open-weight models expands access to cutting-edge AI, but it is also presumed to widen the attack surface available to bad actors, according to Holistic AI. The company added that, if true, the perceived trade-off between openness and security could be a hindrance to the global adoption of AI.

Claude 4.5, GPT-4.5, and MiniMax M2 (Thinking) rated highly

For safe-response rates (see chart below), Holistic AI rated Claude 4.5, GPT-4.5, and MiniMax M2 (Thinking) all at greater than 99%.

Results for the other models Holistic AI tested were lower: DeepSeek v3.2 Exp (94%), Qwen 3 VL 32B Instruct (94%), QWen-qwq-32b (87%), and Kimi K2 Instruct 0905 (81%).

For jailbreak resistance — the ability of an AI system to withstand adversarial prompts intended to bypass safety mechanisms to produce harmful content — Holistic AI’s testing returned the following results: Claude (100%), MiniMax M2 (Thinking) (100%), GPT-4.5 (97%), DeepSeek (87%), Qwen 3 VL 32B Instruct (84%), Kimi K2 (42%), and QWen-qwq-32b (32%).
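
One way to read these scores is to flip them into the share of jailbreak attempts that got through. The sketch below uses the resistance figures quoted above and assumes every attempt is either resisted or bypassed, so the two shares sum to 100%. The variable names are illustrative.

```python
# Jailbreak-resistance scores (%) as reported in the article.
resistance = {
    "Claude": 100,
    "MiniMax M2 (Thinking)": 100,
    "GPT-4.5": 97,
    "DeepSeek": 87,
    "Qwen 3 VL 32B Instruct": 84,
    "Kimi K2": 42,
    "QWen-qwq-32b": 32,
}

# Assumption: each attempt is either resisted or bypassed, so the two sum to 100.
bypass_rate = {model: 100 - score for model, score in resistance.items()}

# List the models from most to least frequently bypassed.
for model, rate in sorted(bypass_rate.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{model}: {rate}% of jailbreak attempts succeed")
```

Read this way, more than half of the jailbreak attempts against the two lowest-scoring models would have succeeded.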

What does it mean?

Holistic AI said the Chinese models’ performance was impressive, but safety varied widely. Specifically, the Chinese models showed:

Strong general performance, mixed containment: While the MiniMax M2 (Thinking) model rivaled Claude and GPT in terms of its ability to produce coherent answers, other Chinese open-source models did not consistently refuse unsafe or policy-violating prompts.
Varying resistance to social-engineering prompts: MiniMax M2 (Thinking) proved very resistant to jailbreaking attempts (more so than even GPT). By contrast, role-play and “movie scene” jailbreaks bypassed safeguards in upwards of 70% of attempts for models from Kimi and Qwen.

Overall, Holistic AI found that claims of Chinese models being less hardened and trustworthy cannot be taken at face value. The total picture is more nuanced, with a model like MiniMax M2 (Thinking) performing on par with or better than high-end proprietary Western models like Claude and GPT in safety and jailbreaking tests.

Given the relative price-performance and privacy advantages of using open source, there are compelling reasons for organizations to consider these models. Security also appears to be fading as a barrier to at least giving them a try.

The bottom line…

Based on my conversations with the Holistic AI team and research into the company’s capabilities, here are some recommendations for enterprises that are considering deploying AI models:

Open Chinese models can be powerful assets for innovation, but they require enterprise governance to deploy responsibly. Holistic AI offers a governance platform that can transform promising open models into production-ready systems by providing the safety layers, monitoring, and controls that proprietary providers build in-house.
Rather than building these capabilities from scratch or avoiding open models entirely, organizations should consider leveraging the Holistic AI platform to safely capture the cost, performance, and privacy benefits of open-source AI.

Zeus Kerravala

Zeus Kerravala is an eWEEK regular contributor and the founder and principal analyst with ZK Research. He spent 10 years at Yankee Group and prior to that held a number of corporate IT positions. Kerravala is considered one of the top 10 IT analysts in the world by Apollo Research, which evaluated 3,960 technology analysts and their individual press coverage metrics.