The O’Reilly Radar has recently come out with an excellent article about big data and health care. The central point of the article is, simply, that we could vastly improve our health care system by leveraging big data and the analytics developed over the last decade by large IT companies like Google.
I applaud the Radar for writing a comprehensive article that looks at many aspects of our health care system today, and the many ways that big data can help. However, the scientist in me wishes to point out a number of major inaccuracies in this article, dangerous, as most inaccuracies are, not for the specifics they get wrong, but for the larger, erroneous picture they paint.
First of all, the authors are simply wrong to say “Eventually, we’ll be able to treat 100% of the patients 100% of the time, precisely because we realize that each patient presents a unique problem.” Each patient does present a unique problem, but we do not necessarily know its solution. The compatibility between patient and treatment is determined by an immense number of variables, everything from their gene sequence to what they had for lunch today. At best, we can make probabilistic statements that treatment T will work on patient N with probability P, where P is always strictly less than 1. I will agree that P should go up over time as we learn to profile patients more accurately, but one of the central maxims of machine learning, which the authors invoke several times in the article, is (roughly) that it is impossible to predict a system's behavior with complete certainty without describing it in its entirety. For as long as we cannot do the latter (impossible not only computationally, but also for a host of social and psychological reasons), we cannot hope to do the former.
Secondly, the authors' statement that "with enough data, we can get from correlation to causation" is a manifestation of a dangerous misunderstanding of big data. Data can never explain causation without theory. We can have excellent data about the relationship between a particular set of inputs (genotype, phenotype, environment) and a particular set of outcomes (longevity, treatment effectiveness), but that relationship always remains at the level of correlation or variance-explanation, not causation. For example, let's say we found out that all people over 5'7" benefit from a certain cancer treatment, while all those with a height of 5'7" or less do not benefit from it. That is an excellent statistical relationship, but it doesn't answer the fundamental question: what is it about height that is conducive to treatment effectiveness in this context? Without theory, we cannot answer that question, and we risk drawing erroneous conclusions (often based on incomplete or corrupted data). For example, it might turn out that in our dataset, height correlates perfectly with some gene that we forgot to include in our model, but that in the wider population the correlation is far below 1.0. We release the treatment to tall people, and find that it's only effective in 80% of patients. With theory, we can look for more likely explanations of the statistical relationship, and build models that actually explain the underlying cause of effective treatment.
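The height-and-gene confound above can be sketched as a toy simulation (the "gene," the treatment rule, and all numbers are hypothetical, purely for illustration): a height-based rule looks perfect in a dataset where height and the gene coincide exactly, then drops to roughly 80% effectiveness in a wider population where the two traits agree only 80% of the time.

```python
import random

random.seed(0)

def make_patient(height_gene_agreement):
    """Generate (is_tall, has_gene), where the two traits agree with
    the given probability -- a stand-in for their correlation."""
    is_tall = random.random() < 0.5
    if random.random() < height_gene_agreement:
        has_gene = is_tall
    else:
        has_gene = not is_tall
    return is_tall, has_gene

def treatment_works(has_gene):
    # Hypothetical ground truth: the gene, not height, drives the response.
    return has_gene

def observed_effectiveness(height_gene_agreement, n=100_000):
    """Effectiveness of a rule that treats tall patients only."""
    treated = helped = 0
    for _ in range(n):
        is_tall, has_gene = make_patient(height_gene_agreement)
        if is_tall:  # the height-based rule selects these patients
            treated += 1
            helped += treatment_works(has_gene)
    return helped / treated

# In our dataset, height and the gene coincide perfectly:
print(observed_effectiveness(1.0))  # 1.0 -- the rule looks flawless
# In the wider population they agree only 80% of the time:
print(observed_effectiveness(0.8))  # roughly 0.8
```

The point of the sketch is that nothing in the data alone distinguishes the two scenarios; only a theory about the gene tells us why the rule degrades outside the original dataset.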
Thirdly, and perhaps most importantly to me as a sociologist, the authors mention the word "privacy" only once in this article, all the while talking about breaking down silos and combining records. Privacy is not just about preventing scandals or avoiding horrible worst-case scenarios like misuse of information. It is about respecting all parties involved, and is a necessary human component of any good health care system. There is, again, a widespread misconception in the world of big data analytics that privacy is just about satisfying some abstract set of requirements, a set of cryptographic algorithms and best practices that ensure The Bad People don't get access to some subset of data. Privacy is far more than that: it is about treating the patient, the doctor, and the insurance agent as people with rights and agency, not as machines or variables. My colleague Stephen Purpura and his coauthors wrote a brilliant satire of the way we easily forget about privacy in the name of abstraction when designing precisely the kinds of systems O’Reilly et al. discuss. System designers so often blissfully assume that “patients are willing” to endure living in a nightmarish big-brother-like system in the name of a 5% increase in treatment effectiveness, all the while forgetting to ask the patients themselves.
To give an illustration of the kind of world Tim O’Reilly and the other article writers push for, I would like to borrow a thought exercise from another colleague of mine, Marc Smith: imagine you’re at a cocktail party. Somebody offers you a glass of wine. You’re about to pick it up, when your phone buzzes.
“Dear patient,” it informs you in a dry text message, “the optical sensor on your glasses has just relayed information that you are 95% likely to drink another glass of wine tonight. This will be your third glass of wine this evening. We predict that the lasting damage to your liver will decrease life expectancy by 1.2 years. If you do drink this wine, we will be forced to notify your insurance company, which will raise your monthly payment by $33.56 to reflect the long-term cost of treatment for your cirrhosis five years down the road.”
Is it a more efficient world, with fewer deaths and less sickness? Absolutely. Is it terrifying? I think so. To close, while I again applaud Tim O’Reilly and his colleagues for writing their piece, I urge the writers (and their readers) to consider the implications of a big-data vision for health care. Without a careful and humanist approach to the overall system of patients, physicians, and providers, we risk trading away the human element of Do No Harm for cost effectiveness and quality of care.