Hundreds of tests used to check whether new AI models are safe and reliable may be failing us. Experts have found serious problems hidden inside the systems meant to protect us from faulty or risky AI.

The problems run deep, the researchers say. Nearly every test they examined had some kind of weakness, and the scores these tests produce could be misleading or even meaningless.

Andrew Bean, a researcher at the Oxford Internet Institute, said these flawed benchmarks are the same ones some of the world’s biggest tech companies use to rate their latest AI models.

In the UK and US, AI benchmarks are meant to check that new systems behave in line with human values, stay safe, and actually deliver on their claimed abilities in areas like logic, math, and programming.

The review of these tests comes as concerns about AI’s safety and impact continue to rise. With big tech firms racing to launch new models, some have already faced backlash and have even had to restrict or withdraw AIs that caused serious harm, including false accusations and tragic outcomes.

While the research focused on public AI tests, big tech firms also use their own in-house benchmarks that weren’t reviewed. The team says it’s time to create shared rules and clear guidelines for how AI should be tested.

One surprising finding was that only about 16% of the benchmarks included any uncertainty estimates or statistical checks to show how reliable their results were.
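
To make the idea of an uncertainty estimate concrete, here is a minimal sketch in Python. It is not the method used in the review, and the numbers are hypothetical. It assumes a benchmark that simply counts correct answers and uses one common statistical check, the Wilson score interval, to show how much a reported accuracy could plausibly vary.

```python
import math


def wilson_interval(correct, total, z=1.96):
    """Approximate confidence interval for an accuracy score.

    `correct` and `total` are counts of benchmark items; z=1.96
    corresponds to roughly 95% confidence. The function name and
    signature are illustrative, not taken from any benchmark suite.
    """
    if total <= 0:
        raise ValueError("total must be positive")
    if not 0 <= correct <= total:
        raise ValueError("correct must be between 0 and total")

    p = correct / total                      # observed accuracy
    denom = 1 + z * z / total
    center = (p + z * z / (2 * total)) / denom
    half = z * math.sqrt(p * (1 - p) / total + z * z / (4 * total * total)) / denom

    # Clamp to [0, 1] to guard against tiny floating-point overshoot.
    return max(0.0, center - half), min(1.0, center + half)


# Hypothetical example: a model answers 80 of 100 benchmark questions correctly.
low, high = wilson_interval(correct=80, total=100)
print(f"accuracy 0.80, 95% interval [{low:.3f}, {high:.3f}]")
```

With only 100 questions, a headline score of 80% is consistent with a true accuracy anywhere from roughly 71% to 87%, so a gap of a few points between two models on such a test may not mean much. Running the same calculation on 1,000 questions narrows the range considerably.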

Some AI tests aimed to judge qualities such as how “harmless” a system is, but researchers found that these terms were often vague or disputed, leaving the results hard to trust.
