Authentic is Overrated: Why AI Benefits from Synthetic Data.
This article was originally published on LinkedIn: Authentic is Overrated: Why AI Benefits from Synthetic Data | LinkedIn
When assuring AI systems, we look at a number of things: the model, the people, the supply chain, the data, and so on. In this article, we zoom in on a small aspect of this you might not have come across: #SyntheticData
We explain how 'fake' data can improve model accuracy, enhance robustness to real-world conditions, and strengthen adversarial resilience. And why it might be critical for the next step forward in #ArtificialIntelligence
Don’t get us wrong. True, clean, real, ‘OG’ data is la crème de la crème in AI training.
We’ve witnessed an explosion in AI because, at least in part, Generative AI companies have fed more data into their training pipeline.
Despite the internet's seemingly endless supply of cat memes, and all the other less cute information floating around on there, believe it or not the internet is in fact finite.
We’re nearly at the end of the road; soon there'll be no more data to feed our AI systems.
Of course, this doesn’t include the universe of data hidden behind intellectual property walls, sitting in data lakes and on floppy discs (did you see the report on Japanese businesses still using floppy discs?!) at millions of businesses around the world.
We’re watching the great race for walled data unfold at present, with large generative AI companies forging deals with the content holders of the world.
But much of this data will remain tucked safely within the data vaults of private enterprise. And the data that does make it out into the world to augment the training of our genAI systems is itself finite and will run out.
Honestly, let’s not even get into the ‘great contamination’ of the internet, where more and more of the text and imagery online is the product of generative AI systems.
It’s not simply that humans can’t trust what they read and watch online, but AI systems can’t either.
If we continually feed the output back into the training input of AI systems, the recursion compromises the quality of these systems and results in model collapse.
If you’re interested and academically inclined, here you are: https://arxiv.org/abs/2305.17493
If you’re not academically inclined, the gist of it is that things that are uncommon in written language (i.e. in the training data) aren’t likely to be reproduced by the LLM.
If you train on LLM-generated data, things that are unlikely to be produced by an LLM will not be reproduced by the new LLM.
Over time this will obliterate the diversity in the training data and without that you end up with a rubbish model.
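If you'd like to see the mechanism rather than take our word for it, here's a toy simulation (our own illustration, not taken from the paper) of what happens when each 'model' is trained only on the previous model's output:

```python
# Toy illustration (not from the paper): fit a simple 'model' to data, sample
# new data from it, refit, and repeat. Watch the rare, extreme values vanish.
import numpy as np

rng = np.random.default_rng(0)

# 'Real' data: a broad distribution with plenty of uncommon, extreme values.
data = rng.normal(loc=0.0, scale=1.0, size=50)

for generation in range(1, 101):
    mu, sigma = data.mean(), data.std()      # 'train' by maximum likelihood
    data = rng.normal(mu, sigma, size=50)    # 'generate' the next training set
    if generation % 25 == 0:
        extremes = int((np.abs(data) > 2).sum())
        print(f"generation {generation:3d}: spread={sigma:.3f}, extreme values={extremes}")

# The spread tends to drift towards zero over generations: diversity collapses
# and the 'model' ends up reproducing only the most common values.
```

It's a crude analogue, but the pattern is the same one the paper describes: each round of training on generated data narrows the distribution a little further.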
Ok, well let’s go get more data then, shall we?
That is what’s happening. Real ‘data’ is, in simplistic terms, the observations of sensors.
Machines with sensory capabilities record happenings in the real world and these observations form the substance of generative AI training.
Tesla is a great example of a company who is doing something practical about this.
They’re not really a car company as much as they are an AI (self-driving) and robotics (electric actuators) company.
Some XX million wheeled robots are out there, right now, collecting data about the world.
This continues irrespective of the car’s mode – it doesn’t need to be in self-driving mode to be collecting this data.
Every millisecond of every moment the cars are thinking and planning routes as if they were driving.
Moreover, this data is sent back to base where it feeds simulations, in turn further training Tesla’s full self-driving capability.
Their upcoming fleet of autonomous worker-bots will only increase this data capture. This time, it will be in pedestrian environments, worker environments, common-every-day-tasks-like-shopping environments. And so on.
But even for them, it’s not enough!
The more savvy or Tesla-enthusiast readers will recognise we are nodding to simulated environments — synthetic data.
This means training is occurring within artificially generated (simulated) driving contexts (fake realities), with various modifications that push these environments away from the context captured in the original data. The broad goal is to create more examples of long-tail/rare events, and to improve the efficiency of training.
This approach is also useful to determine the impact of a new update, before it’s released.
“So, we’re trying to solve this by a combination of simulation; uploading models, having them run in shadow mode — so, it’s actually kind of helpful not everyone has Full Self Driving, because we can see … we can run it in shadow mode and see ‘what would this new model have done compared to what the user did.’”
Synthetic data modifications can be quite simple to understand, literally: images that are tweaked, enhanced, de-enhanced, rotated, blurred, fogged, rained-on, skewed, magnified or demagnified, or given pixel-level alterations discovered through adversarial attacks that are designed to provoke errors on a mathematical level.
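To make that concrete, here's a minimal sketch of the simpler end of the list, using the common torchvision library (an illustrative example, not Advai's actual tooling), where one labelled image becomes several synthetic training examples:

```python
# A minimal sketch (assumed torchvision-based, purely illustrative) of turning
# one real image into several modified copies that keep the original label.
from PIL import Image
from torchvision import transforms

augment = transforms.Compose([
    transforms.RandomRotation(degrees=15),                    # rotated
    transforms.GaussianBlur(kernel_size=5, sigma=(0.1, 2.0)), # blurred / 'fogged'
    transforms.ColorJitter(brightness=0.4, contrast=0.4),     # enhanced / de-enhanced
    transforms.RandomPerspective(distortion_scale=0.3),       # skewed
    transforms.RandomResizedCrop(size=224, scale=(0.7, 1.0)), # magnified crop
])

image = Image.open("example.jpg")   # hypothetical input image
label = "cat"                       # its original, unchanged label

# One real image becomes several synthetic training examples.
synthetic_examples = [(augment(image), label) for _ in range(8)]
```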
Advai do this all the time.
Why on earth might it be useful?
In a sense, Advai are the UK masters of a specific type of fake data generation. When we’re looking for vulnerabilities in models, we’re essentially looking for events that stimulate failure.
We can do this with all sorts of models, but let’s outline a specific visual example to make it easier to understand.
[Image: Johnny Depp]
[Image: Arnold Schwarzenegger]
[Image: an example of an adversarial perturbation]
These perturbations are searched for algorithmically to achieve a specific goal. In this case, the objective would be to overlay the perturbation upon a picture of Johnny Depp, so that it provokes the facial recognition algorithm to conclude it’s looking at Arnold.
It places a little bounding box around Johnny that says ‘Arnold’, with 99.9% confidence.
These perturbations can be totally invisible to the human eye, yet the algorithm will be highly confident in its prediction.
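For the curious, here's an illustrative sketch of the simplest such algorithmic search, a targeted, gradient-based (FGSM-style) perturbation. The model, image tensor and class indices below are placeholders, not a real facial-recognition system:

```python
# Illustrative sketch of a *targeted* FGSM-style perturbation. `model`, the
# image tensor and the class indices are hypothetical placeholders.
import torch
import torch.nn.functional as F

def targeted_perturbation(model, image, target_class, epsilon=0.01):
    """Nudge `image` so the model predicts `target_class` (e.g. 'Arnold')."""
    image = image.clone().detach().requires_grad_(True)
    loss = F.cross_entropy(model(image), torch.tensor([target_class]))
    loss.backward()
    # Step *against* the gradient to make the target class more likely,
    # keeping the change small enough to be near-invisible to a human.
    perturbed = image - epsilon * image.grad.sign()
    return perturbed.clamp(0, 1).detach()

# johnny = load_image("johnny_depp.jpg")       # shape (1, 3, H, W), hypothetical
# adversarial_johnny = targeted_perturbation(face_model, johnny, ARNOLD_CLASS)
# face_model(adversarial_johnny).argmax()      # now reports 'Arnold', wrongly
```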
But, you ask, how is this synthetic data?
We take this adversarial example (the Johnny that looks like Arnold to the algorithm), label it ‘No, Actually Still Johnny’, and feed it back into training.
Hey presto: synthetic data that improves the robustness of the facial recognition model against adversarial attack.
We can do the same with the tweaked, enhanced, de-enhanced, rotated, blurred, fogged, rained-on, skewed, magnified and demagnified images, etc., too.
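Pulling the pieces together, here's a short sketch (reusing the hypothetical names from the examples above) of how those perturbed and augmented copies flow back into training with their correct labels:

```python
# Continuing the earlier sketches (all names are placeholders): the perturbed
# and augmented copies go back into the training pool, correctly labelled.
JOHNNY_CLASS = 7                              # hypothetical label index

adversarial_johnny = targeted_perturbation(face_model, johnny, ARNOLD_CLASS)

training_set = [
    (johnny, JOHNNY_CLASS),                   # the original image
    (adversarial_johnny, JOHNNY_CLASS),       # 'No, Actually Still Johnny'
    # ... plus rotated, blurred, skewed variants, all labelled JOHNNY_CLASS
]
# Fine-tuning on this expanded set teaches the model that these variations are
# still Johnny, which is what improves its robustness to the attack.
```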
While original, clean data is the gold standard for training AI systems, the availability of such data is finite.
This scarcity is compounded by the contamination of online data with generative AI outputs, which threatens the quality of future AI models through recursive training.
To combat this, companies like Tesla are pioneering the collection of real-world data through their vast networks of sensory-equipped machines, which continuously gather valuable observations.
Therefore, synthetic data is becoming increasingly crucial. Simulations of varied conditions are useful for improving the 'experience' set of the AI model in question.
As simulated environments improve in quality (you might think about advances in gaming visuals as a good proxy), in the fidelity of their physics models, and in the breadth of environments they can simulate, AI models will be trained to handle a broader range of scenarios and become more robust.
Advai is a leader in approaches like this in the UK, using our testing and evaluation outputs to instantly multiply the amount of training data several times over (from the original image alone to the original image plus several perturbed variants). Overnight, this can improve model accuracy, enhance robustness to real-world conditions, and strengthen adversarial resilience.
So, there you have it.
Novel methods of generating synthetic data might be the fuel for the next surge forward in model capabilities.
If you liked this content, please share the love and visit our site for more.