Machine Learning (ML), as a focus in commercial applications, has hit a wall. Successful commercial application of ML is hampered by the difficulty of sourcing adequate, clean data for the models. Machine learning needs significantly more data to train its models than earlier quantitative disciplines did.
Datasets that are too small or too dirty, as well as datasets that do not represent the population under consideration, can yield biased results, inappropriate conclusions, and a host of other problems.
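The bias problem is easy to demonstrate. The sketch below, with entirely invented numbers, simulates a patient population in which a minority subgroup has a different outcome rate; a convenience sample that misses that subgroup produces a confidently wrong estimate, while a sample that preserves the population mix does not.

```python
import random

random.seed(0)

# Hypothetical population, invented for illustration: 30% of patients
# belong to a subgroup with a much higher outcome rate (60%) than the
# majority (20%).
population = [("subgroup", 1 if random.random() < 0.6 else 0) for _ in range(3000)]
population += [("majority", 1 if random.random() < 0.2 else 0) for _ in range(7000)]

def outcome_rate(sample):
    return sum(outcome for _, outcome in sample) / len(sample)

# A convenience sample drawn only from the majority group, the kind of
# skew a single clinic's data can exhibit.
biased_sample = [row for row in population if row[0] == "majority"][:500]

# A random sample of the same size that preserves the population mix.
fair_sample = random.sample(population, 500)

print(f"true rate:     {outcome_rate(population):.2f}")
print(f"biased sample: {outcome_rate(biased_sample):.2f}")
print(f"fair sample:   {outcome_rate(fair_sample):.2f}")
```

The biased sample systematically underestimates the true outcome rate, and no amount of additional data drawn the same way will fix it.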
Exciting innovations are happening in AI and ML research facilities, but very few of them make it into production because of the data problem. While this issue appears across the board in every industry, nowhere is it as severe as in healthcare.
What’s the problem with healthcare?
Healthcare is defined by Investopedia as “…businesses that provide medical services, manufacture medical equipment or drugs, provide medical insurance, or otherwise facilitate the provision of healthcare to patients.”
It's that last word, "patients," that is problematic. Pharmaceutical and biotech companies have their own data problems, but they are mostly in control of their data sources. The same is true of insurance companies and medical equipment manufacturers. But when you get down to the patient level, and even to the components of patient care, the data is scattered everywhere; it's balkanized.
A single clinical operation, to the extent it has analytical data at all, has its own treatment protocols, population demographics, and other variables that must be part of any AI training data for personalized medicine. That data cannot be merged, integrated, or aggregated with enough other operations' data to reach the volume machine learning needs without losing its local character.
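What "losing its local character" can mean in practice is illustrated by a Simpson's-paradox-style example (my own illustration, not from the article, with invented counts): two hypothetical clinics each see the treatment outperform the control, but pooling their records while discarding the site variable reverses the comparison.

```python
# Hypothetical recovery counts per clinic, invented for illustration:
# {clinic: ((treated_recovered, treated_total), (control_recovered, control_total))}
clinics = {
    "clinic_a": ((81, 87), (234, 270)),
    "clinic_b": ((192, 263), (55, 80)),
}

# Within each clinic, the treatment outperforms the control.
for name, ((t_ok, t_n), (c_ok, c_n)) in clinics.items():
    print(f"{name}: treated {t_ok / t_n:.1%} vs control {c_ok / c_n:.1%}")

# Pool the two clinics, discarding the site variable, and the
# comparison reverses: the treatment now looks worse overall.
t_ok = sum(t[0] for t, _ in clinics.values())
t_n = sum(t[1] for t, _ in clinics.values())
c_ok = sum(c[0] for _, c in clinics.values())
c_n = sum(c[1] for _, c in clinics.values())
print(f"pooled:   treated {t_ok / t_n:.1%} vs control {c_ok / c_n:.1%}")
```

A model trained only on the pooled table would learn the wrong lesson for both clinics, which is exactly why naive aggregation across sites is no substitute for locally representative data.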
Can AI provide opportunities in clinical care to yield better diagnoses? Can it offer a potential leap in both patient care and delivery efficiency? Can it lead to "precision medicine," customizing treatments for individuals to dramatically improve outcomes? In each case, data is hindering the process.
A paper in Nature, The Inconvenient Truth about AI in Healthcare, describes the situation for AI in clinical medicine:
In the 21st Century, the age of big data and artificial intelligence (AI), each healthcare organization has built its own data infrastructure to support its individual needs, typically involving on-premises computing and storage.
and the obstacle:
Data is balkanized along organizational boundaries, severely constraining the ability to provide services to patients across a care continuum within one organization or across organizations. This situation evolved as individual organizations had to buy and maintain the costly hardware and software required for healthcare, and has been reinforced by vendor lock-in, most notably in electronic medical records (EMRs).
Why the adoption of new AI algorithms is slow in clinical healthcare is, as the authors state, partly an issue of data, but there are other factors as well. There's the old chestnut: culture. AI offerings cannot overcome the existing incentives that support existing ways of working. And AI models are not that smart: they provide reliable inference, but they cannot ensure that people will adopt them. Besides, most healthcare organizations lack the data infrastructure required to collect the data needed to optimally train algorithms to "fit" the local population and to interrogate them for bias.
Clinical practices can avail themselves of novel AI models, but only those developed elsewhere, where adequate data is available for training. For example, a well-trained pathology model that recognizes malignant skin lesions from images with high accuracy can be used anywhere. But to practice personalized medicine, a model has to be aware of local differences: in the population itself, in the provenance and semantics of the data, and in practice differences between locations, and even between practitioners within a single site, all of which bleed into how the data was captured.
Within a practice, a hospital, or even a small group of hospitals, the most detailed and most valuable store of data is the EMRs. To date, providers of EMR software have not been able to raise clinician satisfaction, which remains at a low point.
As a result, EMR data lacks the completeness, availability, quality, and governance that other enterprise applications enjoy. Most difficult of all, interoperability between different EMR providers is poor, and even basic data extraction is challenging.
Where is there hope? The Nature article cited above mentions "islands of aggregated healthcare" data, such as data in the ICU and in the Veterans Administration. These are useful efforts, but not sufficient. What is needed is a data infrastructure that reaches far beyond these "silos." The authors suggest:
To realize this vision and to realize the potential of AI across health systems, more fundamental issues have to be addressed: who owns health data, who is responsible for it, and who can use it? Cloud computing alone will not answer these questions—public discourse and policy intervention will be needed. The specific path forward will depend on the degree of a social compact around healthcare itself as a public good, the tolerance to public-private partnership, and crucially, the public's trust in both governments and the private sector to treat their healthcare data with due care and attention in the face of both commercial and political perverse incentives.
If you are an IT manager in a clinical healthcare operation, you have to ask yourself the following questions:
- What is the state of data available within our purview?
- Is it adequate for fueling AI models?
- Do we have the infrastructure and/or cloud expertise to host AI modeling?
- Who is responsible for assuring the output of the models is correct?
- What ethical issues do we face sharing patient and activity data with others?
The enthusiasm for AI to solve previously unsolvable problems collides with the limited data available in a clinical setting. To provide precision/personalized medicine, models cannot be trained with data from other sites that do not match local conditions. That is the conundrum.