08 Jan 2021

Fooling an Image Classifier with Adversarial AI

A Simple Adversarial AI Black Box Attack on an Image Classifier.

Words by
Damian Ruck

Sorting your holiday snaps

Artificial intelligence is changing the way we live. Mundane tasks that would once have taken a tedious afternoon can now be completed in minutes by a machine while you get on with life. One example: organizing photographs. Few people want to spend their Sundays sorting holiday snaps of family members from images of landscapes. Fortunately, there are now computer vision APIs freely available on the internet that can categorize your images for you, or be used by a business that offers that service to you. For us, that's interesting because we specialise in defending AI systems. To defend, you have to be able to attack, so let's have a look at one of these commercial APIs.

The computer science literature is filled with cases of AI systems being fooled into misclassifying images. These attacks work because it's possible to add carefully calibrated noise to an image of, say, a cat, in a way that makes an AI system fail to recognize it, even though any human can still see that the image contains a cat. A subset of these attacks works even when we only have access to the inputs and outputs of a model, not its inner workings – a black-box attack.


Attacks in the wild

These attacks may be possible in the lab, but we wanted to see if they worked in the wild – after all, Advai was set up to identify ways that AI systems can be tricked and manipulated in the real world. The image classifier we looked at was accessible via a free web-based API which allowed us to automatically upload images for classification, so long as we didn’t exceed our daily usage limit.

The image below is of Han Solo, Luke Skywalker and Princess Leia (pivotal characters from the Star Wars franchise, for those who are not fans). When we showed this to the online image classifier, it gave us an 84% chance that the image was of "people". Our aim was to add carefully calibrated noise to the image so that it looked the same as the original to any human, but the image classifier would think it was something totally different. Why? Well, there was a good technical reason (seeing how vulnerable the API was), but what if we wanted to enable the Rebel Alliance to upload photos without being flagged by the Sith's automatic detection system? (Getting too geeky?) Okay, as a real-world problem this is important, because criminals will seek to do the same to systems designed to detect malicious content, fraud, and so on.

As mentioned, a black-box attack uses just the inputs (the image) and the outputs (the probability the image was recognized as "people"). We chose one of the simplest black-box attacks out there: the aptly named Simple Black-box Attack (SimBA).

In a nutshell, SimBA proceeds iteratively, adding a small amount of randomly selected noise to the image at each step. The new image is retained if the classifier becomes less confident that the image is of "people"; if not, we keep the previous image and try another batch of random noise. This is a little like how complex species evolve: changes (mutations) are random, yet the selection process eventually produces something highly non-random.
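The loop above can be sketched in a few lines of Python. This is a minimal illustration, not the exact code we ran: `query_classifier` is a hypothetical stand-in for the remote API call, returning the probability assigned to the "people" label, and the step size `epsilon` is an assumed value.

```python
import numpy as np

def simba_attack(image, query_classifier, epsilon=0.05, max_steps=1000):
    """Minimal SimBA-style sketch: perturb one pixel direction at a time,
    keeping a change only if it lowers the classifier's confidence."""
    image = image.astype(np.float64)
    best_prob = query_classifier(image)
    # Try pixel directions in a random order, one per step.
    directions = np.random.permutation(image.size)
    for idx in directions[:max_steps]:
        perturbation = np.zeros(image.size)
        perturbation[idx] = epsilon
        perturbation = perturbation.reshape(image.shape)
        # Try adding, then subtracting, the noise in this direction.
        for candidate in (image + perturbation, image - perturbation):
            prob = query_classifier(candidate)
            if prob < best_prob:  # keep the change only if confidence drops
                image, best_prob = candidate, prob
                break
    return image, best_prob
```

Because only perturbations that reduce confidence survive, the accumulated noise stays small while the classifier's output drifts away from the true label.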

Besides being simple, SimBA has the advantage of being query efficient, which means we do not have to send many requests to the API to generate our adversarial image. This is good, because every time we interact with the image classifier, we risk detection. We might send too many queries and exceed our daily limit, which would stall the attack. Or, worse, trigger an alert within the image classifier that leads to our account being blocked. Being query efficient makes SimBA both faster and stealthier.
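In practice, an attacker (or a tester like us) would wrap the API call in something that tracks the query count and stops before the limit is reached. A minimal sketch, assuming a hypothetical `classify_fn` and an illustrative limit (the real API's limit is not stated here):

```python
class QueryBudget:
    """Wrap a remote classifier call and enforce a daily query limit."""

    def __init__(self, classify_fn, daily_limit=1000):
        self.classify_fn = classify_fn
        self.daily_limit = daily_limit
        self.queries = 0

    def __call__(self, image):
        if self.queries >= self.daily_limit:
            # Stop cleanly rather than triggering rate-limit alarms upstream.
            raise RuntimeError("daily query limit reached; resume later")
        self.queries += 1
        return self.classify_fn(image)
```

An attack loop that receives a `QueryBudget` instead of the raw API call then stays within budget automatically, and the `queries` counter doubles as a measure of how query-efficient the attack was.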

[Figure: Confidence scores from the image classifier – percent chance that the image is of "people" as SimBA progresses.]

But just how different was the new image?  Perhaps, after 920 cycles, the image really did look more like a landscape?  However, as we can see below, the new image was indistinguishable to the human eye from the original, despite the image classifier seeing “people” in one and a “landscape” in the other. The only difference was a very small amount of noise that was evolved through the iterative process set in motion by SimBA (the right-most image).
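One way to put a number on "indistinguishable" is to measure the largest per-pixel change between the two images (the L-infinity norm of the noise). This is an illustrative helper, not part of the attack itself:

```python
import numpy as np

def max_pixel_change(original, adversarial):
    """Largest absolute per-pixel difference between two images
    (the L-infinity norm of the adversarial noise)."""
    diff = original.astype(np.float64) - adversarial.astype(np.float64)
    return float(np.max(np.abs(diff)))
```

If pixel values run from 0 to 255, a maximum change of just a few units is well below what the human eye notices, even though it is enough to flip the classifier's decision.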

So, what is the result?

We have shown that a popular online image classifier is vulnerable to being fooled. We had no access to its inner machinery and were able to generate our attack without exceeding the daily query limit available to everyone free of charge. Making Star Wars characters look like a landscape seems whimsical (and it is), but there is a serious side. What if the attacker were a terrorist and the image were recruitment propaganda? It would have been misclassified and would have passed the filters. In a case like this, the urgency of defending against these attacks is apparent.

This raises the question: how do we protect image classifiers against these types of attack? That's what Advai is here for.