21 Apr 2021

Beyond Jeopardy: How Can We Trust Medical AI?

In 2011, just two days after Watson beat two human champions at Jeopardy!, IBM announced that their brilliant Artificial Intelligence (AI) would be turning its considerable brainpower towards transforming medicine. Stand aside, Sherlock: Dr. Watson is on the case.

Words by
Rebecca Marcus


Yet by 2018, the fanfare surrounding Watson had evaporated. While capable of analysing significant quantities of data, Watson’s NLP struggled to understand medical texts and make intuitive leaps based upon subtle clues in the way a human doctor would. It also could not mine patients’ medical records as planned, impacting the accuracy of time-dependent information like therapy timelines. Even relatively straightforward genomics applications in oncology were a struggle, with Watson producing either obvious or questionable recommendations. As a result, confidence in Watson was low: “The discomfort that I have—and that others have had with using it—has been the sense that you never know what you’re really going to get…and how much faith you can put in those results,” said Lukas Wartman of the McDonnell Genome Institute at the Washington University School of Medicine in St. Louis.

Rather than the preeminent oncology diagnostics tool it had promised, IBM ran into myriad challenges and ended up, at a significant financial loss, with what amounts to an AI diagnostics assistant or medical librarian.

While IBM Watson may have been the highest-profile AI to struggle in the medical field in 2018, it was not alone. Babylon Health’s diagnostic triage chatbot came under scrutiny for potential misdiagnoses within its app, which the NHS planned to adopt in hopes of reducing pressure on emergency services. Despite the company’s claims that its computerized diagnostic decision support (CDDS) program scored higher on a medical exam than doctors and achieved “equivalent accuracy” with human physicians, the program missed symptoms of a hypothetical pulmonary embolism during external testing, prompting an MHRA inquiry. Babylon Health subsequently conducted and passed four separate audits of its Quality Management System to ensure compliance with UK and European regulations, but the episode demonstrated once again how difficult medical AI is to get right and how high the bar for success is.

Here is the crux of the matter: people do not trust AI, and certainly not with their lives. Bad press, combined with a limited understanding of this relatively new technology, leads sceptical users to seek human opinions instead. For AI to be an effective healthcare tool, people must trust it, and that trust requires an appropriate level of regulation.

Regulating Artificial Intelligence

The Medicines and Medical Devices Act 2021 classifies software “used in the prevention, diagnosis or treatment of illness or disease” as a medical device, which means it is regulated by the MHRA; the agency is developing the framework for future regulation. In an environment of evolving regulations, it will be vital for companies providing medical AI to stay ahead of concerns about safety and accuracy, rather than risk being blindsided by regulations with which they are unable to comply. There is also an opportunity for those that do: increased confidence and adoption.

Safety, Security and Robustness

At Advai, we use Adversarial AI to address AI safety, security and robustness.  

We test the edges of the AI, identify weaknesses and aim to flag questionable results for the system before it learns bad habits. Robustness metrics can be particularly useful for understanding how easily the system could fail in the wild and for identifying issues, so that the model can be refined. Our aim is to give greater confidence in the AI with every iteration, and Advai’s tools are specifically designed for this purpose.
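To make the idea of a robustness metric concrete, here is a minimal sketch in Python. It is an illustration only and does not represent Advai’s tooling: the `toy_classifier` model, the `robustness_score` function and the example inputs are all hypothetical, and the sketch uses small random perturbations as a simpler stand-in for a true adversarial search, which would look for the worst-case perturbation rather than sampling at random.

```python
import random
from typing import Callable, List, Sequence

Vector = Sequence[float]


def toy_classifier(x: Vector) -> int:
    """Hypothetical stand-in model: flags a case (1) when the feature sum exceeds 1.0."""
    return 1 if sum(x) > 1.0 else 0


def robustness_score(
    predict: Callable[[Vector], int],
    inputs: Sequence[Vector],
    epsilon: float,
    trials: int = 100,
    seed: int = 0,
) -> float:
    """Fraction of inputs whose predicted label never changes when each
    feature is nudged by random noise in [-epsilon, +epsilon]."""
    rng = random.Random(seed)  # fixed seed so the result is repeatable
    stable = 0
    for x in inputs:
        baseline = predict(x)
        flipped = False
        for _ in range(trials):
            noisy: List[float] = [v + rng.uniform(-epsilon, epsilon) for v in x]
            if predict(noisy) != baseline:
                flipped = True  # a tiny change altered the decision
                break
        if not flipped:
            stable += 1
    return stable / len(inputs)


# Illustrative inputs (made up): two sit far from the decision boundary, one sits right next to it.
example_inputs = [[0.2, 0.1], [0.9, 0.9], [0.5, 0.49]]
score = robustness_score(toy_classifier, example_inputs, epsilon=0.05)
print(f"Share of predictions stable under small perturbations: {score:.2f}")
```

A low score points to inputs near the model’s decision boundary, which are the cases most likely to fail in the wild and the ones worth flagging for refinement.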

A lack of robustness also highlights the importance of quantifying confidence in the outputs. How certain is the model that it diagnosed correctly? 90%? 53%? A doctor faced with a suggested diagnosis that carries only 53% certainty will likely take a far more cautious, laborious, and costly approach than one faced with a 90% certain diagnosis. As in the case of the NHS’s attempts to reduce the burden on A&E, certainty in initial diagnoses can help keep the “worried well” at home rather than defaulting to the “better safe than sorry” approach of sending them to A&E. Humans want reassurance; they want to feel good about their decisions. Without the additional confidence that comes from quantifying the AI’s uncertainty, human instinct will default to expert opinion and ignore the AI.
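The 90% versus 53% contrast can be sketched as a simple routing rule: act on a suggestion only when the model’s confidence clears a threshold, and otherwise hand the case to a clinician. The snippet below is a hypothetical illustration, not a description of any real system; the `Suggestion` type, the `suggest` function, the condition labels and the 0.9 threshold are all assumptions. It also assumes the probabilities are well calibrated, which raw model scores often are not.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class Suggestion:
    label: str
    confidence: float
    needs_review: bool


def suggest(
    labels: Sequence[str],
    probabilities: Sequence[float],
    review_threshold: float = 0.9,  # assumed cut-off, chosen for illustration
) -> Suggestion:
    """Pick the most probable label and flag it for human review
    when its probability falls below the threshold."""
    if len(labels) != len(probabilities) or not labels:
        raise ValueError("labels and probabilities must be non-empty and the same length")
    if abs(sum(probabilities) - 1.0) > 1e-6:
        raise ValueError("probabilities must sum to 1")
    best = max(range(len(labels)), key=lambda i: probabilities[i])
    confidence = probabilities[best]
    return Suggestion(labels[best], confidence, confidence < review_threshold)


# Hypothetical labels; the two cases mirror the 90% and 53% figures in the text.
conditions = ["condition A", "condition B"]
confident_case = suggest(conditions, [0.90, 0.10])
uncertain_case = suggest(conditions, [0.53, 0.47])

for case in (confident_case, uncertain_case):
    route = "refer to clinician" if case.needs_review else "report with confidence"
    print(f"{case.label}: {case.confidence:.0%} -> {route}")
```

Reporting the confidence alongside the label, rather than the label alone, is what lets a clinician decide how much weight to give the suggestion.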

So what makes a healthcare AI model trustworthy? No single, impartial standard exists to deem a model safe. The threshold required for an AI to be certified robust, accurate, and secure enough to diagnose patients has yet to be defined. While that regulatory framework is being built, it is incumbent upon developers and companies to prioritize the safety of their clients. If this is done well, confidence in the systems, and subsequently their adoption, will increase. Thorough testing is how that confidence is earned, and we can help with that.

As the great detective once said, “when you have eliminated the impossible, whatever remains, however improbable, must be the truth.” Adversarial AI teaches AI to eliminate the impossible. It’s elementary, my dear Watson.

If you are interested in how Advai’s system may be applied to your AI, get in touch with us.