COVID-19 pandemic models - are Machine Learning models useful?

By Neil Raden April 10, 2020
Applying Machine Learning to Coronavirus data is tempting - but deeply problematic. DataRobot shared lessons on working with smaller data sets, but the predictive limitations of ML for assessing pandemics go much further.


When Richard Nixon appointed Henry Kissinger National Security Advisor, some of Kissinger's friends from Columbia University in New York threw a little going-away party for him. Guests included both his academic colleagues and his new political associates. At one point, some academics were off to the side, having a vigorous, loud argument.

A reporter asked Kissinger why academic arguments were so emotional. Kissinger, in his characteristic rumble, so basso that it automatically sounds profundo, replied, "Because the stakes are so low."

In Deep Learning for Physical Processes: Incorporating Prior Scientific Knowledge, Emmanuel de Bezenac notes that AI is most widely deployed (other than in defense and intelligence, where it is not possible to gauge its breadth) in targeted selling. The reason is (as James Taylor and I pointed out in 2007), "little decisions add up." In sales, the price of being wrong is almost zero and the value of being right is so high that AI can be wrong a lot of the time and still do well.

This is precisely why I would not put any faith in current AI/ML techniques for forecasting the extent of the COVID-19 pandemic. Here the price of being wrong is high, and finding truth in the data is too ragged a process at this point, because the data itself is such a poor proxy for the underlying phenomena of something so emergent and unprecedented (at least since we started using computers to model things).

So let's stick to recommendation engines, supply chain optimization, and jet engine digital twins. We don't understand the biology of the SARS-CoV-2 virus. Frankly, we don't understand the intricate workings of the immune system, only the action of parts of it in isolation, and the regulatory environment makes it impossible to act quickly (and affordably) with any breakthrough to mitigate the damage.

How can you possibly put any faith in a model based on a statistical analysis of data? The problem is especially acute in the current crisis with COVID-19. I mentioned this in a December diginomica article about a paper in Nature, The Inconvenient Truth about AI in Healthcare, which describes the situation for AI in clinical medicine:

In the 21st Century, the age of big data and artificial intelligence (AI), each healthcare organization has built its own data infrastructure to support its individual needs, typically involving on-premises computing and storage.

and the obstacle:

Data is balkanized along organizational boundaries, severely constraining the ability to provide services to patients across a care continuum within one organization or across organizations. This situation evolved as individual organizations had to buy and maintain the costly hardware and software required for healthcare, and has been reinforced by vendor lock-in, most notably in electronic medical records (EMRs).

As we all know, Machine Learning is best served with mountains of data, because its purpose is to find patterns in the data that are predictive. The availability of large datasets today plays right into this appetite. Still, the question is, can ML and AI techniques be effective with much smaller sets of data, such as those in clinical silos mentioned above?

So what's the solution? DataRobot presented a training session, available online, called "Using Small Datasets to Build Models." The premise was that ML is a viable approach with small data sets (I found this amusing - in my model-building days, we only had small datasets), and the session made several suggestions:

  • Take extra care with outliers (obviously, with fewer observations, outliers have a more significant impact)
  • Years of training data are not useful in the present situation
  • Search for meaningful losses signals
  • Find real predictors, not chance
  • Overfitting is a big problem
  • Check for out-of-sample error
  • Use learning curves to refine your model
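To make the overfitting and out-of-sample points in the list above concrete, here is a minimal sketch in Python. It is not taken from the DataRobot session. The data is synthetic, and the two models (a least-squares line and a one-nearest-neighbour "memorizer") are assumptions chosen only for illustration. With twelve observations, leave-one-out scoring shows how a model that looks perfect on the data it has seen does worse on data it has not.

```python
import random

def fit_line(xs, ys):
    """Ordinary least squares for y = a + b*x; returns a predictor function."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    b = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sxx
    a = mean_y - b * mean_x
    return lambda x: a + b * x

def fit_nearest(xs, ys):
    """One-nearest-neighbour: simply memorizes the training points."""
    pairs = list(zip(xs, ys))
    return lambda x: min(pairs, key=lambda p: abs(p[0] - x))[1]

def in_sample_mse(fit, xs, ys):
    """Mean squared error on the same points the model was trained on."""
    model = fit(xs, ys)
    return sum((model(x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

def leave_one_out_mse(fit, xs, ys):
    """Train on n-1 points, score on the held-out point, average over all n."""
    errors = []
    for i in range(len(xs)):
        model = fit(xs[:i] + xs[i + 1:], ys[:i] + ys[i + 1:])
        errors.append((model(xs[i]) - ys[i]) ** 2)
    return sum(errors) / len(errors)

# Synthetic small dataset (hypothetical): a noisy straight line, 12 points.
random.seed(0)
xs = [float(i) for i in range(12)]
ys = [2.0 * x + 1.0 + random.gauss(0, 3.0) for x in xs]

results = {}
for name, fit in [("line", fit_line), ("nearest", fit_nearest)]:
    results[name] = (in_sample_mse(fit, xs, ys), leave_one_out_mse(fit, xs, ys))
    print(name, "in-sample:", results[name][0], "out-of-sample:", results[name][1])
```

The memorizing model reports zero error in-sample, which says nothing about how it predicts a point it has never seen. The out-of-sample number is the one worth reading, and with this few observations even that estimate is noisy.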

This is all good advice, but it doesn't address the heart of the matter. With something moving as fast as COVID-19, especially now with rapidly accelerating cases, scattered data, and no real way of knowing the infection rate without universal testing, what credence can you place in the data you have?
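A small arithmetic sketch in Python shows why case counts are a shaky proxy when testing is not universal. Every number here is hypothetical: the same underlying epidemic, doubling each week, is observed once with a constant share of infections detected and once with testing that ramps up over time.

```python
def confirmed_cases(true_infections, detection_fraction):
    """Confirmed cases are only the share of true infections that testing finds."""
    return [round(t * d) for t, d in zip(true_infections, detection_fraction)]

def growth_factors(series):
    """Week-over-week ratio of a series of counts."""
    return [round(b / a, 2) for a, b in zip(series, series[1:])]

# Hypothetical true epidemic: infections double every week.
true_infections = [1000, 2000, 4000, 8000]

# Hypothetical detection rates: constant vs. doubling as testing ramps up.
flat_testing = [0.10, 0.10, 0.10, 0.10]
ramping_testing = [0.05, 0.10, 0.20, 0.40]

flat = confirmed_cases(true_infections, flat_testing)
ramp = confirmed_cases(true_infections, ramping_testing)

print(flat, growth_factors(flat))   # [100, 200, 400, 800] [2.0, 2.0, 2.0]
print(ramp, growth_factors(ramp))   # [50, 200, 800, 3200] [4.0, 4.0, 4.0]
```

In the second series the confirmed cases appear to quadruple each week even though the epidemic itself is only doubling. A model trained on the confirmed counts alone cannot tell the two situations apart.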

My opinion? Leave COVID-19 forecasting to the experts, the epidemiologists, and public health quants. Let me give you an example of how bizarre this has become.

V. A. Shiva Ayyadurai, a.k.a. Dr. Shiva, wrote a letter to Donald Trump, demanding the firing of Dr. Fauci and proposing his solution to the coronavirus pandemic. Incidentally, he is running for the Republican nomination to the Senate from Massachusetts, for the third time. While he possesses four degrees from MIT, he is not, as far as my research can tell, associated with any academic or research institution. I would say that, in the pantheon of quackery and self-promotion, this is a doozy.

He criticizes "one size fits all" medicine, which I agree with, but goes on to propose four-sizes-fit-all which, qualitatively, is no different. He claims a "modern engineering systems approach to biology versus the old model of seeing the body as disconnected part," then spends the rest of the letter focused on one part, the immune system, as if it were a "disconnected part." In a holistic view, there is no immune system. It's a series of functions of a greater whole. At this point, he's lost me.

Then he goes on to promote "the need for personalized medicine," and proposes a protocol based on grouping people into four sets. This is diametrically opposed to any notion of personalized medicine.

He provides no research or even basic science. What is the mechanism of action of the cytokine storm? The cytokine storm is the overreaction of the immune system that destroys the lungs of COVID-19 sufferers.

Ayyadurai called on Trump to categorize citizens into four groups:

  • Those who test positive for COVID-19, quarantined, 400,000 IU of vitamin A palmitate per day for two days and 50,000 IU of vitamin D per day for two days;
  • Those hospitalized in critical condition, same treatments, and a 100-gram drip of vitamin C per day;
  • Those immuno-compromised. Children, same levels of vitamin A palmitate and vitamin D as mentioned above, and 500mg of vitamin C per day and three drops per drink of iodine/iodide once per day. Adults should receive double the dose of vitamin C and iodine/iodide as children;
  • Those not in the above three groups, with children receiving 1,000 IU of vitamin A palmitate per day and 2,000 IU of vitamin D per day, along with 250 milligrams of vitamin C and three drops per drink, once a day, of iodine/iodide. He advises that adults in this group consume 10,000 IU of vitamin A palmitate and 5,000 IU of vitamin D per day, along with 1,000 milligrams of vitamin C per day and six drops of iodine/iodide per day.

There is no peer-reviewed science for any of this. He claims it's based on his "models." 

His regimen of vitamins does not describe the nature of "vitamins" A, C, and D. It's a comfortable name, vitamin, but in reality, they are all hormones and neurotransmitters. And one of the million things you should know about hormones is that they control gene expression, and they can only operate if they have a receptor. Receptors can be absent, or even more likely, occupied by other hormones or even supplements. Certain conditions in your body cause the pituitary, for example, to send a message for the receptors to wake up. Nothing is static. He describes no endocrine function that activates these vitamins or how they operate. Before suggesting therapeutic doses of vitamins (hormones), his protocol should be very specific about drug interactions and even timing.

My take

Primarily, the cause of death or permanent disability from COVID-19 is the cytokine storm. Cytokines are proteins released by the immune cells, but it is not clear why they overreact. China reports progress in slowing the deadly response with Interleukin-6 (IL-6) inhibitors, drugs already available, though not yet approved for this virus. If you want "models" for this pandemic, I'd suggest you stick to those who have some insight.

There are other approaches to prediction. Bayesian models start with your "belief" in an assertion, stated as a probability (the prior), and as evidence comes in, those probabilities adjust. Bayesian methods are more about modeling and simulation, not prediction per se. What I like about them is that they start with what you know and work from there instead of allowing the data to speak for itself.
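For readers who want to see the mechanics, here is a minimal sketch of a Bayesian update in Python. It assumes the simplest textbook case, a Beta prior on an unknown proportion updated with counts of positives and negatives. The prior and the batches of data are hypothetical, and this is an illustration of the idea rather than a model anyone should use for the pandemic.

```python
def update_beta(alpha, beta, positives, negatives):
    """Conjugate update: a Beta(alpha, beta) prior plus binomial counts
    gives a Beta(alpha + positives, beta + negatives) posterior."""
    return alpha + positives, beta + negatives

def beta_mean(alpha, beta):
    """Expected value of a Beta(alpha, beta) distribution."""
    return alpha / (alpha + beta)

# Hypothetical prior belief: the proportion is around 20%, held weakly
# (equivalent to having already seen 2 positives and 8 negatives).
prior = (2.0, 8.0)

# Hypothetical batches of (positives, negatives), each with a 30% rate.
batches = [(3, 7), (6, 14), (30, 70)]

posterior = prior
print("prior mean:", round(beta_mean(*prior), 3))        # 0.2
for positives, negatives in batches:
    posterior = update_beta(*posterior, positives, negatives)
    print(posterior, round(beta_mean(*posterior), 3))
# (5.0, 15.0) 0.25
# (11.0, 29.0) 0.275
# (41.0, 99.0) 0.293
```

The estimate starts at the stated belief and moves toward what the data shows as evidence accumulates, without ever discarding the starting point outright. A stronger prior (larger starting counts) would move more slowly.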

DataRobot's training webinars are quite good, and the one cited on small data sets is good advice, but not for epidemiology. Machine Learning is, at its essence, about predicting the future from data about the past, data whose provenance and reliability are always in question. But given a choice between a small-data-set ML model and some speculative nonsense from the likes of Dr. Shiva, I'd take the former.