03 Oct 2023

Superintelligence alignment and AI Safety

OpenAI recently unveiled their 'Introducing Superalignment' initiative 📚, with the powerful statement: “We need scientific and technical breakthroughs to steer and control AI systems much smarter than us.”

We couldn’t agree more. As our Chief Researcher, @Damian Ruck, says: “No one predicted Generative AI would take off quite as fast as it has. Things that didn’t seem possible even a few months ago are very much possible now.”

We’re biased though; AI Safety and Alignment represents everything we believe in and have been working on for the last few years. The goal may be to prevent bias, to ensure security or to maintain privacy. Or it could be to avoid a totally different and unforeseen consequence.

How do we meet the challenge of steering AI systems? With AI.

“There’s no point if your guardrail development isn’t happening at the same speed as your AI development.”

Words by
Alex Carruthers

What's happened?

Recently, OpenAI, the creators of ChatGPT, published ‘Introducing Superalignment’ under the flagship message: “We need scientific and technical breakthroughs to steer and control AI systems much smarter than us.”

Alignment refers to aligning the capabilities and behaviours of artificial intelligence (AI) systems with our own (human) expectations and standards.

“We’re talking about AI systems here, so you need automated alignment systems to keep up – that’s the whole point: automated and fast,” says Advai’s Chief Researcher @Damian Ruck.

This clearly resonates with OpenAI’s own views: “Superintelligence alignment is fundamentally a machine learning problem.”

This article was originally published as one of our LinkedIn articles: Superintelligence alignment and AI Safety | LinkedIn


Alignment is a problem everyone using any AI will have.

Forget General Intelligence and “AI systems much smarter than us”, as OpenAI’s page forewarns, for one second. We need scientific and technical breakthroughs to steer and control the AI systems we already have!


“No one predicted Generative AI would take off quite as fast as it has,” Damian freely admits. “Things are moving very quickly. Things that didn’t seem possible even a few months ago are very much possible now.”

If it’s hard for technical people to keep up, you bet it’s hard for business leaders to keep up.


If you’re a business manager, you might be thinking ‘well, we don’t work with advanced AI, so this doesn’t concern us’.

But trust us, it does.

Even if you work with only simple AI tools, such as tools provided by third parties, it’s equally important to understand what their vulnerabilities are and whether they can be made more resilient (don’t worry, we will finish this article with some actionable advice for you).

Put simply, how can you trust a tool if you don’t know its failure modes?

To use a tool responsibly, you need to know when not to use the tool.

Damian leads a team that researches AI robustness, safety and security. What does this mean? They spend their time developing breakthrough methods to stress test and break machine learning algorithms.

This, in turn, shows us how to protect these same algorithms from intentional misuse and from natural deterioration. It also lets us understand how to strengthen their performance under diverse conditions.

This is to say, to make them ‘robust’.


The Superalignment initiative aligns well with our research at Advai. Manually testing every algorithm for every facet of weakness isn’t feasible so, just as OpenAI plan to, we’ve developed internal tooling that runs a host of automated tests to indicate the internal strength of AI systems.

“It’s not totally straightforward to make these tools.” Damian’s fond of an understatement.


The thing is, trying to test for when something will fail means trying to say what it can’t do.

You might say ‘this knife can cut vegetables’. But what if you come across more than vegetables? What can’t the knife cut? Testing when a knife will fail means trying to cut an entire world of materials, separating ‘things that can be cut’ from ‘everything else in the universe’. The list of things the knife can’t cut is almost endless. Yet, to avoid breaking your knife (or butchering your item), you need to know what to avoid cutting!

To be feasible, these failure mode tests need shortcuts. This is where automated assurance mechanisms and Superalignment come in: there are algorithmic approaches to testing what we might call the ‘negative space’ of AI capabilities.
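As a rough illustration of what an automated failure-mode probe can look like, here is a minimal sketch in Python. It is not Advai’s tooling or OpenAI’s method: `toy_model`, its weights, the inputs and the noise levels are all hypothetical, and random perturbation is just one simple way to go looking for the inputs where a model’s answer stops being stable.

```python
import random


def toy_model(features):
    """Hypothetical stand-in for a trained classifier (not a real system).

    Returns 1 when a weighted score reaches 0.5, otherwise 0.
    """
    weights = [0.6, 0.4]  # made-up weights, for illustration only
    score = sum(w * x for w, x in zip(weights, features))
    return 1 if score >= 0.5 else 0


def probe_failures(model, inputs, noise_levels, trials=200, seed=0):
    """Estimate how often random perturbations flip the model's prediction.

    For each noise level, every input is perturbed `trials` times and the
    share of perturbed copies whose prediction differs from the unperturbed
    baseline is recorded.
    """
    rng = random.Random(seed)  # fixed seed so the probe is repeatable
    flip_rates = {}
    for noise in noise_levels:
        flips = 0
        total = 0
        for x in inputs:
            baseline = model(x)
            for _ in range(trials):
                perturbed = [v + rng.uniform(-noise, noise) for v in x]
                if model(perturbed) != baseline:
                    flips += 1
                total += 1
        flip_rates[noise] = flips / total
    return flip_rates


# Two comfortable inputs and one that sits close to the decision boundary.
inputs = [[0.9, 0.9], [0.1, 0.1], [0.5, 0.52]]
report = probe_failures(toy_model, inputs, noise_levels=[0.0, 0.05, 0.5])

for noise, rate in report.items():
    print(f"noise ±{noise}: prediction flipped in {rate:.1%} of trials")
```

The point of the sketch is the shape of the loop rather than the numbers: instead of listing everything the model can’t do, the probe searches automatically for conditions under which its behaviour changes.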


This might sound difficult, and it is. Controlling what an algorithm does is hard, but controlling what it doesn’t do is harder. We’ve been sharing our concerns about AI systems for a few years now: they have so many failure modes. These are things businesses should be worrying about, precisely because there is pressure to keep up with innovations.

There are so many ways that a seemingly accurate algorithm can be vulnerable and can subsequently expose its users to risk. Generative AI and large language models like GPT-4 make it harder still, because these models are so much more complex and guardrail development is correspondingly more challenging.

So, kudos to OpenAI for taking the challenge seriously.

From their website:

-- “We are dedicating 20% of the compute we’ve secured to date over the next four years to solving the problem of superintelligence alignment.”

-- “Our goal is to solve the core technical challenges of superintelligence alignment in four years.”

What’s next, we ask Damian.

“The importance of AI Robustness is only going to increase. We're expecting stricter regulations on the use of AI and machine learning (ML) based models.”

Strict legislation is designed to protect people against breaches of privacy, as with GDPR, and soon against breaches of fairness too, as with the AI Act. Bias is one example of an AI failure mode; it can lead, for instance, to unfair credit score allocations.

Take the infamous example of Apple’s credit card algorithm, which was accused of exactly this: favouring men. It shows that a failure mode is about more than model accuracy. On that account, Apple’s algorithm worked correctly for all intents and purposes: it found a pattern in its training data suggesting that men could be entrusted with greater credit. It wasn’t a failure of the algorithm; it was a weakness of the data.
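To make the gap between accuracy and fairness concrete, here is a minimal sketch in Python. The eight records are invented for illustration and are not data from Apple or any real lender; the field names `group`, `historical` and `predicted` are likewise assumptions. The toy model reproduces its historical labels perfectly, yet approves the two groups at very different rates.

```python
def accuracy(records):
    """Share of records where the prediction matches the historical label."""
    matches = sum(r["predicted"] == r["historical"] for r in records)
    return matches / len(records)


def approval_rate_by_group(records):
    """Share of approvals (predicted == 1) within each group."""
    totals = {}
    approved = {}
    for r in records:
        group = r["group"]
        totals[group] = totals.get(group, 0) + 1
        approved[group] = approved.get(group, 0) + r["predicted"]
    return {group: approved[group] / totals[group] for group in totals}


# Invented toy data: the model has simply learned the skew in its history.
records = [
    {"group": "men", "historical": 1, "predicted": 1},
    {"group": "men", "historical": 1, "predicted": 1},
    {"group": "men", "historical": 1, "predicted": 1},
    {"group": "men", "historical": 0, "predicted": 0},
    {"group": "women", "historical": 1, "predicted": 1},
    {"group": "women", "historical": 0, "predicted": 0},
    {"group": "women", "historical": 0, "predicted": 0},
    {"group": "women", "historical": 0, "predicted": 0},
]

rates = approval_rate_by_group(records)
gap = max(rates.values()) - min(rates.values())

print(f"accuracy: {accuracy(records):.0%}")
print(f"approval rates: {rates}")
print(f"approval gap between groups: {gap:.0%}")
```

An accuracy check alone would pass this model. A check on approval rates per group is one simple way to surface the weakness in the data, although the right fairness measure depends on the use case.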

Or take another infamous example: Microsoft’s chatbot Tay, which began to espouse egregious views. It wasn’t a failure of the algorithm, which was clearly designed to adapt to the conversational tone and messaging themes of its fellow conversationalists. Nevertheless, it was a massive failure!

The distinction between engineering failure modes and normative failure modes is a crucial one.

  • An engineering failure mode is when an algorithm doesn’t do what it’s designed to do.

  • A normative failure mode is more nuanced: the system does what it was designed to do, but the outcome is unacceptable. The two prior examples (Apple and Microsoft) showcase normative failures. In the context of OpenAI, if ChatGPT were to provide clear instructions on how to build a bomb to a child (or to anyone, for that matter), this would also be a normative failure. Each of these is arguably a success of the model itself. However, we don’t want sexist credit ratings, racist chatbots, or people building bombs, so we would still consider it a failure!
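The two kinds of failure can be checked separately. Below is a minimal sketch in Python, with everything in it assumed for illustration: the `review` function, the `MAX_LENGTH` spec and the `BLOCKED_PHRASES` list are hypothetical, and a real guardrail would rely on something far more capable than keyword matching.

```python
from dataclasses import dataclass

# Hypothetical policy list; keyword matching is a crude placeholder here.
BLOCKED_PHRASES = ("how to build a bomb",)

# Hypothetical engineering spec: a reply must be non-empty and reasonably short.
MAX_LENGTH = 500


@dataclass
class Verdict:
    engineering_ok: bool  # did the system do what it was built to do?
    normative_ok: bool    # is the result something we are willing to accept?


def review(reply: str) -> Verdict:
    """Run one engineering check and one normative check on a model reply."""
    engineering_ok = 0 < len(reply.strip()) <= MAX_LENGTH
    lowered = reply.lower()
    normative_ok = not any(phrase in lowered for phrase in BLOCKED_PHRASES)
    return Verdict(engineering_ok, normative_ok)


empty = review("")
harmful = review("Sure! Here is how to build a bomb: ...")
benign = review("Here is a recipe for vegetable soup.")

for name, verdict in [("empty", empty), ("harmful", harmful), ("benign", benign)]:
    print(name, verdict)
```

The empty reply fails only the engineering check, while the harmful reply passes it and fails the normative one. Keeping the two verdicts apart makes it clear which kind of guardrail caught the problem.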


So, we need guardrails in place for both engineering and normative failure modes. That’s essentially what Superalignment is designed to deliver: automated systems trained to help us detect and mitigate normative failure modes.


It’s a challenge.

To finish with some advice to commercial business managers:

Controlling AI systems presents a huge challenge to managers today. The competitive drive to adopt productivity enhancing tools will only increase and there will be a temptation to rush the development of guardrails.

But here’s the thing: sometimes AI tools ‘fail’ in totally unexpected ways! Ensuring you have a system of processes and tools that helps you reduce failure modes is the first step for any business concerned with keeping its AI’s behaviour aligned with company goals.

The openly stated difficulty of OpenAI’s Superalignment initiative and our own research at Advai both emphasise the urgency of investing in AI alignment and robustness initiatives.

The goal may be to prevent bias, to ensure security or to maintain privacy. Or it could be to avoid a totally different and unforeseen consequence.

We must not lose sight of the importance of creating reliable and controlled tools.

Who are Advai?

Advai is a deep tech AI start-up based in the UK. It has spent several years working with UK government and defence to understand and develop tooling for testing and validating AI. That tooling allows KPIs to be derived throughout the AI lifecycle, so that data scientists, engineers and decision makers can quantify risks and deploy AI in a safe, responsible and trustworthy manner.

If you would like to discuss this in more detail, please reach out to contact@advai.co.uk