An object lesson in addressing Rumsfeld’s Third Question – the real reason for analytics

By Martin Banks, January 23, 2019
Using complex analytics on huge volumes of clinical data has led Medidata to develop new techniques in disease diagnosis, with valuable learnings for other businesses in how to exploit Big Data.

When it comes to using analytics, AI and Machine Learning, the ultimate application is to answer the Rumsfeld III Question about unknown unknowns – in business terms, what are the unknowns that we haven’t realised yet that we should know right now. It is the most difficult question in practice, not least because it appears to be unanswerable.

That is the key justification for Big Data analytics – especially where ‘big’ starts at gargantuan. Many business managers, I suspect, see analytics as a means of justifying their current decisions: in Rumsfeld terms, the ‘known knowns’ application. A goodly percentage of the daily application of analytics will be in Rumsfeld II territory – the known unknowns, the current total available market for product X in both Y and Z regions, for example. You don’t know the answer yet, but you know it is an answer that is gettable, and that you need to get.

But when you don’t know anything about a new technology or market opportunity, for good or ill, or even if they exist yet, that is when important business opportunity boats can be missed. That is when the ability to make sense of Rumsfeld III and, ideally, turn it into a Rumsfeld I can prove to be a vital business advantage.

It is arguable that this is nowhere more important than in the healthcare business. Here, there are still many diseases and disorders which remain undiagnosed despite the numbers of people they afflict. One of the troubles here is that there are almost as many reasons why such diseases remain undiagnosed as there are diseases. These can range from genuine difficulty in understanding the details of the biosciences involved – the 'we can’t work out how to spot it until it is far too late' situation – through to the sincere belief that patients are, in effect, not sick.

But one company at least, Medidata, is set on taking diagnostically-oriented AI and analytics to new lengths, and has already made some startling discoveries that have led to the development of new diagnostic tools and approaches. Perhaps more important, however, is that while it would be easy for the rest of us to park this story in the 'important-but-very-niche' folder, there are some learnings here that potentially apply across the analytics domain as a whole.

The medical results can be quite telling, however. For example, using its Rave Omics tool, which is designed to enable the identification of actionable hypotheses for ongoing studies and operates within its Rave Data Capture environment, the company discovered an example of a Rumsfeld 'unknown unknown' – a biomarker for the rare and usually fatal Castleman’s Disease.

Not only that, but it also discovered that what had been assumed to be a single disease is in fact six variants that affect different patients in slightly different ways, with the current standard treatment working differently for each group. As Medidata CEO Tarek Sherif observes, it is now possible to see much more deeply into a subject:

Now you can see into this rare disease, you could actually start to do research to say, 'OK, something is working for this group, why is that? Are there other things that we can do for the other five sub groups?'. It is a shining example of how using old world techniques combined with the most sophisticated data science techniques can kind of create that new generation of information.

Medidata was founded in 1999 by Sherif, the businessman, and Glen de Vries, a lab technician. It has about 2,500 employees globally and serves over 1,100 customers in the pharmaceutical industry and biotech device manufacturing sectors, as well as some academic organisations and non-profit operations such as Cancer Research UK.

Analytics as connective tissue

The goal was to develop technology that had a positive impact on patients’ lives: to become, as Sherif calls it, connective tissue, the infrastructure that allows data to travel from the clinics and from the doctors involved in clinical trials of pharmaceuticals, and that allows them to manage that data effectively and efficiently. The aim, therefore, is to reduce the time it takes to bring a drug to market.

But as technology has evolved, so has the company’s scope for developing new sources of data. For example, patients are using wearable sensors, to which Medidata can connect directly. It can collect that data and automatically combine it with any relevant data when a clinical trial is being run. This direct patient input can generate a whole new range of clinical insights.

Use of cloud and new communications tools also expands the reach of drug trials, making them global as a norm, he adds:

Global trials are very important, because in today's regulatory environment, pretty much most countries or regions require that, when you test a new drug, you test on people in every area as well. It is no longer good enough to do a trial in America, and then try to sell a drug into Japan, the Soviet Union or Europe. You have to have patients enrolled in your trial everywhere in the world that you want to sell.

The company now works with over 100,000 physicians around the world, each of whom can find themselves asked by a pharma company to enrol up to 10 patients for any particular trial. It is Medidata’s job to then aggregate all this information, including the effects of the drug on patients, the whole drug regimen, and all the specifics about the patient and their test results.

Managing all this data for much of the pharma industry led the company to consider new ways of addressing the task of unearthing Rumsfeld III questions. It asked for, and got, the rights to pharma industry drug trial data on a secondary anonymised basis. With this it has built the largest repository of clinical data in the world, one that looks across the world and across every therapy, and every major pharma company.

It was also decided that this would be a great place to apply some of the new data science techniques, such as new algorithms, machine learning and AI, with the aim of figuring out something that no one has ever figured out before. Why are some people more likely to get a certain disease? Is one disease actually five diseases rather than a single disease? Are five kinds of cancer really just one kind of cancer?

One of the upshots of this work is that an increasing amount of the research that used to be on animals is now done on computers, says Sherif:

We're starting to scratch the surface on some interesting things that you can do with data that have not been done before and that are both creating real value for our customers and having a direct impact on patients.

One of those impacts is that drug trials on animals could very easily end entirely in the near future, and trials on humans could indeed end as well. The company has accumulated data from some 16,000 trials, involving some five million patients. According to Sherif, this data is detailed, deep, and spans a long period of time.

Hello, I’m a virtual patient

Using it, the company has developed the ability to synthesise virtual patients that can be used as part of future drug trials. Sherif explains:

The reason this becomes important is that, when you run a clinical trial, you typically have it blinded and randomised: the doctor doesn't know what the patient is getting, whether they're getting a placebo, sugar pill, or the real medicine, or how much of the medicine they're getting, so there is no influence on the statistics. But there's a problem with human trials. Some portion of patients are getting a placebo, they're not getting a drug. That's terrible, especially in paediatrics.

Using this approach, pharma companies no longer need to use placebos in trials. They administer the drug to everybody, with the placebo effect derived from historical data. This means that fewer patients are required, trial costs go down, and trials run faster and much more ethically.
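For readers who want a feel for how a placebo arm might be derived from historical data, here is a minimal Python sketch. It is not Medidata's method, which the company does not describe here. It swaps in a deliberately simple stand-in, nearest-neighbour matching, in which each treated patient is paired with the most similar patient from past placebo groups. The `Patient` fields, the function names and every number are hypothetical, invented purely for illustration.

```python
from dataclasses import dataclass
from statistics import mean, pstdev


@dataclass(frozen=True)
class Patient:
    age: float             # years
    baseline_score: float  # symptom score before treatment (hypothetical scale)
    outcome: float         # symptom score after treatment; lower is better


def _scale(values):
    """Return (mean, std) for standardising a covariate; std falls back to 1."""
    sd = pstdev(values)
    return mean(values), (sd if sd > 0 else 1.0)


def match_controls(treated, historical):
    """Pair each treated patient with the closest historical placebo patient.

    Closeness is squared Euclidean distance on standardised age and baseline
    score. Matching is with replacement, so a historical record can be reused.
    """
    pool = list(treated) + list(historical)
    age_mu, age_sd = _scale([p.age for p in pool])
    base_mu, base_sd = _scale([p.baseline_score for p in pool])

    def features(p):
        return ((p.age - age_mu) / age_sd,
                (p.baseline_score - base_mu) / base_sd)

    def distance(a, b):
        fa, fb = features(a), features(b)
        return (fa[0] - fb[0]) ** 2 + (fa[1] - fb[1]) ** 2

    return [min(historical, key=lambda h: distance(t, h)) for t in treated]


def estimated_effect(treated, historical):
    """Mean outcome difference between treated patients and matched controls."""
    controls = match_controls(treated, historical)
    return mean(t.outcome - c.outcome for t, c in zip(treated, controls))


# Hypothetical records: past placebo patients and a small all-treated trial.
historical = [Patient(30, 50, 48), Patient(60, 70, 69), Patient(45, 60, 58)]
treated = [Patient(31, 51, 40), Patient(59, 69, 60)]

effect = estimated_effect(treated, historical)
print(f"Estimated change versus matched historical placebo: {effect}")
```

In this toy data the two treated patients are matched to the 30-year-old and 60-year-old historical records, so the estimate is the average of their outcome gaps. A real synthetic control arm would need far more covariates, checks for bias between old and new trials, and regulatory sign-off, but the core idea is the same: the comparison group comes from the archive rather than from patients given a sugar pill.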

This has also prompted the company to look at ways of taking these ideas further. It now has a data science research team drawn from a range of other industries as well as pharma, that spends most of its time thinking about what else the company can do to build new and better scientific hypotheses:

This is early days, we're just getting started. Even our customers recognise that, but we also see it ourselves. We're just starting, you know, it takes a long time, you can't just decide to do this. It took us years to take all that data and curate it in a way that you can actually make sense of it.

My take

There are some ideas here – not least being the concept of the ‘virtual patient’ – that other companies in other industries could profitably pick up and run with when it comes to using analytics. And the company’s progress demonstrates just what might be possible for any company that does more than use analytics to help CEOs prove they are right. It is one of the first in my (admittedly limited) experience that has really set out to address the ‘Rumsfeld’s Third Question’ conundrum, and from my (equally limited) knowledge of medical data analytics technologies and methods, it does seem as though the effort has been well worth it already.