24 Nov 2025

The AI Model Arena: Redefining How the UK Evaluates AI

Words by
Ben Fawcett

What is the problem to be solved?

Government and Defence AI procurement has long relied on people - individual experts manually reviewing complex systems one by one. Meanwhile, AI companies are increasingly pitching solutions to complex operational problems, often backed by slick demos and optimistic claims rather than the evidence, real-world performance data, or other indications that they can actually be deployed successfully.

This leaves government decision-makers trying to compare proposals that are hard to verify, creating uncertainty, risk and slow adoption. It often leads to expensive, time-intensive pilots that struggle to provide the evidence needed for confident deployment. Ultimately, a ‘person-centric’ approach such as this is unscalable, inconsistent and slow.

As AI becomes mission-critical for defence and national security, the challenge is clear: how can government procure AI at the speed and scale required, with assurance and trust built in?

What is the Model Arena?

The AI Model Arena is Advai’s secure, standardised, vendor-neutral platform for evaluating AI models against real operational needs drawn from a customer’s use-case. It allows objective, evidence-based comparison of AI systems from multiple suppliers, all tested under the same realistic conditions.

In our work for the MOD’s Defence AI Centre (DAIC), in partnership with the National Security Strategic Investment Fund (NSSIF), the Model Arena acts as an independent “front door” to MOD’s AI challenges: a safe place for industry to demonstrate capability, and a way for the MOD to see which models perform best, where they are most resilient, and how they fail.

How does it work?

Each use-case within the Model Arena represents a specific customer need within a priority operational scenario - for example, object detection or information retrieval. Advai works with a customer to design a set of tests that will generate the evidence needed for procurement decisions. Vendors then submit trained AI models through a common API and every model is put through identical tests.

Advai evaluates models across three key categories:

  1. Reliability. Does it behave consistently under normal/expected conditions?
  2. Robustness. Does it behave consistently under edge conditions?
  3. Security. Does it behave consistently under adversarial conditions?

All results are compiled in a secure dashboard with live leaderboards and comparable performance data. This creates a clear, auditable evidence trail to support commercial decisions and full procurement competitions.
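Advai has not published the Arena’s interfaces, so the following Python sketch is purely illustrative of the pattern described above. It assumes, hypothetically, that each vendor submission is wrapped behind a common predict-style call; a harness can then run every model through identical reliability, robustness and security suites and aggregate the scores into a comparable leaderboard:

from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

# Hypothetical common interface: every vendor model is wrapped so the harness
# can call it the same way, whatever framework it was built in.
Predict = Callable[[bytes], str]  # e.g. image bytes in, predicted label out

@dataclass
class TestCase:
    input_data: bytes  # a scenario input (clean, degraded, or adversarial)
    expected: str      # the outcome the customer's use-case requires

def run_suite(model: Predict, suite: List[TestCase]) -> float:
    """Score one model on one suite: the fraction of cases handled correctly."""
    passed = sum(1 for case in suite if model(case.input_data) == case.expected)
    return passed / len(suite)

def evaluate(models: Dict[str, Predict],
             suites: Dict[str, List[TestCase]],
             ) -> List[Tuple[str, float, Dict[str, float]]]:
    """Run every model through identical suites and rank the results.

    Suite keys here mirror the article's three categories:
    'reliability', 'robustness' and 'security'.
    """
    leaderboard = []
    for name, model in models.items():
        scores = {cat: run_suite(model, suite) for cat, suite in suites.items()}
        overall = sum(scores.values()) / len(scores)
        leaderboard.append((name, overall, scores))
    # Comparable precisely because every model saw exactly the same cases.
    return sorted(leaderboard, key=lambda row: row[1], reverse=True)

The real platform will score far more than a simple pass rate, but the property that matters is visible even in this sketch: every model faces identical tests, so the resulting numbers can be compared directly.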

Is this benchmarking?

No. Traditional benchmarking typically compares AI systems on narrow, fixed datasets. The Model Arena goes further by using scenario-based evaluation - testing how models behave under realistic, changing conditions that reflect operational challenges.

Scenario-based evaluation looks at performance in realistic mission contexts, with varied inputs and evolving situations. Instead of scoring a model on ideal, static datasets, it shows how the model manages imperfect or incomplete data, shifting conditions, and attempts by adversaries to deceive or disrupt it. The result is evidence directly relevant to deployment, not abstract benchmark scores.
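To make the distinction concrete, here is a minimal, purely hypothetical sketch - not Advai’s actual method, and assuming inputs arrive as a 2-D NumPy array of samples. The same model is scored once on a clean, static dataset, then again on noisy and partially occluded variants of the same data; the gap between the scores is exactly what a fixed benchmark never reveals:

import numpy as np

def accuracy(model, inputs: np.ndarray, labels: np.ndarray) -> float:
    """Fraction of samples the model labels correctly."""
    return float(np.mean([model(x) == y for x, y in zip(inputs, labels)]))

def scenario_evaluation(model, inputs: np.ndarray, labels: np.ndarray) -> dict:
    """Score one model under progressively harder, more realistic conditions."""
    rng = np.random.default_rng(0)
    results = {"static benchmark": accuracy(model, inputs, labels)}

    # Imperfect data: add sensor-style noise a fixed benchmark never contains.
    noisy = inputs + rng.normal(0.0, 0.1, size=inputs.shape)
    results["noisy inputs"] = accuracy(model, noisy, labels)

    # Incomplete data: blank out part of each sample, as in a degraded feed.
    occluded = inputs.copy()
    occluded[:, : inputs.shape[1] // 4] = 0.0
    results["partial occlusion"] = accuracy(model, occluded, labels)

    return results

A security-focused stage would go further still, optimising perturbations specifically to deceive the model rather than sampling them at random.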

Traditional benchmarks can be useful for basic comparison, but they often do not reflect Defence priorities. Models can be trained to perform well on known tests without being reliable or resilient in real-world conditions. Scenario-based evaluation avoids these issues by measuring performance, reliability, robustness and security in context.

It not only shows how well a model works, but also how and where it fails - giving MOD and suppliers a test-driven environment to understand both performance and behaviour. This supports confident, evidence-based decisions about suitability, risk, and readiness for deployment.

What are the benefits?

For less than the cost of a traditional pilot, the Model Arena lets organisations evaluate up to 20 times more models, in half the time, while producing richer and more operationally relevant evidence.

It provides detailed insight into performance, reliability, robustness, security, data needs, failure modes, and readiness for deployment - giving decision-makers a much clearer understanding of real capability and risk.

Key benefits include:

  1. Scalable, evidence-based procurement. Large numbers of models can be tested quickly and consistently, allowing fair comparison across many suppliers without losing depth or rigour.
  2. Reduced risk. Weak, brittle or insecure systems are identified early, along with their failure modes, operational limits, and resilience under stress.
  3. Faster innovation. Developers receive structured, data-driven feedback on strengths, weaknesses and areas for improvement, speeding up development cycles.
  4. Greater confidence. Decisions are based on transparent technical evidence rather than claims or demos, improving assurance, accountability, and the likelihood of successful deployment.

Ultimately, the Model Arena will help MOD procure AI at the speed and scale required - with the trust, transparency, and confidence that modern Defence demands.

Advai’s role

Advai is delivering the Model Arena by building on previous work delivered with the Royal Navy and adapting the platform to serve as the industry-facing front door for MOD AI challenges. We work closely with customers to understand their use-case and tailor the testing and evaluation framework to their operational needs before integrating it into the Model Arena. Using our independent AI assurance expertise, we are helping the UK move from ambition to reality by enabling safe, rapid adoption of AI across Defence.