Can you really trust the medical apps on your phone?
Posted on 2/10/2017
We tested the symptom checkers of Ada, Babylon and Your.MD to find out how reliable they really are
It's never been easier to get medical advice. A simple Google search will come back with a series of symptoms and ailments. The doctor has moved into your pocket and there's an increasing number of apps that promise to give an accurate diagnosis.
"There are different types of diagnostic tools out there," says David Wong, a lecturer in health informatics at the Leeds Institute of Health Informatics. Wong, along with colleague Hamish Fraser, helped me test three of the most popular artificial intelligence-powered symptom checker apps in the UK: Ada, Your.MD and Babylon.
"Ada was by far the best," Wong says. "There were issues with both of the others. It was surprising to be able to find things wrong in a few minutes, from a non-clinical perspective." The pair say there needs to be stronger governance around these sorts of apps. "The great concern is that somebody puts information in and they have a serious illness and they get reassurance they're ok and that's a false negative situation, which could be life threatening," Fraser says. He adds that in general there are some "reasonably good" symptom checkers available.
In the UK, the Medicines and Healthcare products Regulatory Agency (MHRA) classifies apps that make medical diagnoses as medical devices. This means they have to be certified by the agency. But the MHRA's guidance says mobile apps that make "general recommendations to seek further advice" aren't likely to be medical devices and, as such, don't need to meet set standards. "If you're giving a piece of advice that is life or death it has to be right. If it is wrong, who is responsible for it?" Wong says.
Back in 2015, the British Medical Journal published a study evaluating symptom checkers used for self-diagnosis. The research looked at 23 English-language symptom checkers and determined whether they listed the correct diagnosis first or within up to 20 possible options. For the symptom checkers that then offered a care recommendation, the academics also looked at the type of healthcare recommended.
The work concluded that the checkers listed the correct diagnosis first 34 per cent of the time. The correct diagnosis appeared within the top 20 diagnoses given 58 per cent of the time. Care advice was correct almost two-thirds of the time (ranging from 33 per cent to 78 per cent). The study didn't compare how human doctors would have made a recommendation or diagnosis if they were presented with the same information as the checkers.
"For patients, our results imply that in many cases symptom checkers can give the user a sense of possible diagnoses but also provide a note of caution, as the tools are frequently wrong and the triage advice overly cautious," the study concluded. It also explained that a symptom checker "may" be of more value than not seeking any medical advice "or simply using an internet search engine".
But getting good advice remains confusing. The NHS's app directory is muddled, though it is still in beta. A number of apps on the website are "NHS Approved", meaning they have "clinical evidence that it supports clinical outcomes". Others are marked as being tested by the NHS, while some have no label or recommendation at all. Neither Your.MD nor Ada is listed on the NHS website; Babylon is, but isn't specifically recommended.
Wong and Fraser go further in their evaluation of the symptom checker industry. "The gold standard for apps needs to change on what the use case is," Wong says. He argues that if a medical app is aimed at doctors and experts then it can afford not to perform as well, because it should be informing their decisions rather than acting as a completely authoritative source.
"With apps in general in the UK, there are issues with how these kind of things are legislated," he says. "Mainly because the technologies are moving faster than the regulatory bodies can catch up with."
RATING THE AI DOCTORS
- How we tested
- WIRED's ad-hoc test involved Ada, Your.MD (beta) and Babylon. All three apps offer some form of symptom checking, and four common ailments were tested. Symptoms of the four illnesses – asthma, shingles, alcohol-related liver disease, and urinary tract infection (UTI) – were taken from the NHS Choices website and entered on the Android and iOS versions of the apps. Both researchers tested the apps separately. Here's what happened.
Ada was selected as the winner of WIRED's test. Fraser says that it asks about the most important symptoms, and provides the best diagnoses. "The questions are clear and it translates free text into sensible suggestions for the user to choose," he says.
For asthma it asked about the duration of symptoms, although it didn't ask about history of diagnosis or potential triggers. For UTIs, he says, there were "lots of good questions". Overall Fraser says: "Best diagnoses of the three, includes good diagrams showing which of the symptoms for each disease were present and the strength of the link, and a diagram of number of people out of ten or 100 likely to have that diagnosis."
"Right from the beginning we started building something that has a deep medical knowledge and covering rare conditions as well as common straightforward conditions," says founder Claire Novorol. She explains that Ada started out as a platform service for doctors and was then altered to focus on the bits patients could understand. "We're still a platform, though. In the UK, at the end of the assessment you can share that with a doctor and they have all the doctor-facing side of technology on their side of the platform".
Wong says it wasn't possible to get Your.MD to diagnose shingles. "It did ask about whether I had previously had chickenpox (caused by the same virus) and diagnosed the related meningitis," he says. The app did, however, give the correct advice to see a GP. It correctly got to asthma and alcohol-related liver disease but failed with UTIs, suggesting other sexually transmitted conditions instead.
"However, because the app did not ask about patient sex, it presented diagnoses that were impossible for me," Wong says.
In response, Matteo Berlucchi, the CEO of Your.MD, which has its symptom checker in a beta stage, says the issue has been identified and fixed. "Your.MD presents conditions based on their likelihood," Berlucchi says. "If our medical brain calculates that there are more potential outcomes with similar likelihood, Your.MD does present more than one condition". Since WIRED tested the app it has been upgraded to version three.
One of Babylon's strengths is its ability to parse common synonyms around UTIs, Wong says. This means it was able to understand that "pee, wee, etc" all referred to the same thing. However, it did come bottom of the ad-hoc test. Wong and Fraser say it provided a course of action for shingles but didn't ask about more of the "pertinent" symptoms. Wong says it was the least accurate of the three apps but points out the symptom checker is an add-on to its larger business model.
Babylon said it did not want to comment on any comparisons with other services. It did, however, respond to a critical letter in the British Medical Journal by claiming its AI symptom checker had "reduced same day GP consultations by 40 per cent by appropriately providing alternative care".