An astrophysicist will tell you that data about the universe is the biggest there is, and that it takes a supercomputer to crunch all the ones and zeros from vast telescope arrays. But in fact, the biggest dataset is you - or rather, us. At least, that's according to Thorben Seeger, Chief Business Development Officer of Lifebit.
The software company was founded in 2017 to address a massive but hidden problem: most biomedical data is stored in silos around the world, which means most researchers can't access it to save people's lives. But why is this? And what's the solution?
Humans constitute the biggest dataset there is, not because we are stardust, as Joni Mitchell once sang, but because we are all essentially data. According to Seeger, each person's genome, the complete set of genetic information about an individual, can be up to 300 gigabytes in size if stored digitally.
The precise figure is unimportant, but for the sake of argument let's assume Seeger is correct. What that means is, if the genome of every person on the planet - all 7.9 billion of us - were stored in a computer system, it would amount to nearly 2.4 zettabytes (2.4 trillion gigabytes). That's nearly twice the amount of every type of content stored in data centres today, according to 2021 Statista figures.
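For readers who want to check that arithmetic, here is a minimal back-of-envelope sketch. It takes the article's two inputs at face value (300 gigabytes per genome and 7.9 billion people) and assumes decimal units, where one zettabyte is a trillion gigabytes. The function and constant names are illustrative only.

```python
# Back-of-envelope check of the storage estimate quoted in the text.
GENOME_SIZE_GB = 300                    # Seeger's upper-bound figure per genome
WORLD_POPULATION = 7_900_000_000        # population figure used in the article
GB_PER_ZETTABYTE = 10**12               # decimal units: 1 ZB = 10**21 bytes = 10**12 GB


def total_genome_storage_zb(population: int, genome_gb: int) -> float:
    """Return the storage needed for one genome per person, in zettabytes."""
    return population * genome_gb / GB_PER_ZETTABYTE


total_zb = total_genome_storage_zb(WORLD_POPULATION, GENOME_SIZE_GB)
print(f"{total_zb:.2f} ZB")  # 2.37 ZB
```

The result, 2.37 zettabytes, is where the "nearly 2.4" figure comes from. A smaller per-genome estimate would shrink the total in direct proportion.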
So, the first problem is that biomedical data is potentially huge: you would need a data centre the size of Manhattan if everyone's genome were stored somewhere other than, well, inside them.
Right now, some size-queen of the data pedantry world will respond that every species has a genome, or that there may be 200 billion-trillion stars in the observable universe. OK, but the key point is simple: lots of humans means big data. And not just genomic data, but reams of other information about health, fitness, and more.
That data could help in the design of better drugs, or in earlier and more precise diagnoses, or in personalized medicine. But the second problem is, while analyzing everyone's unique code might bring enormous benefits - leaving aside controversial areas like 'designer babies' and so on - it can't be done. Or rather, it couldn't until the likes of Lifebit and other data intermediaries came along.
That's because the relatively small amount of genomic data that's currently available is stored in lots of different places. Medical data of every kind is held by countless organizations worldwide - hospitals, universities, research labs, government departments, private companies, and others - hopefully in secure, trusted repositories. Meanwhile, data privacy advocates like Sir Tim Berners-Lee believe it should be stored in citizens' own virtual pods, and never belong to corporations.
All of which brings us to the third problem if you're a scientist trying to cure cancer or prevent the next pandemic. Because people live in nearly 200 different countries, they also exist in a host of different legal systems, and therefore different data protection, transfer, IP, and confidentiality regimes.
Globally this kind of data is growing, but it is so hard to access, or extraordinarily difficult to use, or completely unusable because it's so sensitive. We're living in a world where these datasets have exploded, but regulations like GDPR in Europe, and those pretty much everywhere else, prohibit moving them.
In other words, a researcher might want to examine data from breast cancer sufferers in London and compare it with records at the Texas Medical Center, or in Tokyo or Kigali, but they can't without moving the data. Patient confidentiality, plus data privacy and transfer regulations, are just some of the things preventing that.
For example, the Department of Health, large hospitals, or a university might be considered the data custodian or controller under GDPR, and they keep the data where it is, in their own environment; they keep it safe. But it is difficult for others to get to it [that's the point], which limits what you can do with it. It's safe, but you haven't achieved the usability that maximizes its research value and patient benefits.
True. But regulations exist for good reason: to protect people, especially from the overreaching power of Big Tech. Yet in other fields, such as economics and environmental science, it is only ingenuity and, perhaps, a low risk appetite that stand between readily available data and saving the planet, he says.
With medical and life sciences data, however, the problems are even deeper. For example, you or I might not want scientists, corporations, policymakers, or cybercriminals poking around in our genetic code and identifying us from it, then using the findings to deny us services, scam us, or potentially decide that we shouldn't exist.
Your DNA, your medical records, your clinical data: most people would probably rather have their credit cards flying around in the ether than that.
But what if you could safely crunch anonymized datasets remotely, and within the trusted repositories where that data already sits? And deploy AI, deep learning, and analytics from wherever in the world you're based - including, at some point, quantum techniques?
This was the founding vision behind Lifebit. The start-up also guides precision-medicine pioneers, such as Genomics England and the Hong Kong Genome Project, on how to make their data both usable and secure.
Seeger tells me:
This data should be available as a resource to a lot more people than it is today. When you look around the world, there are very large sets of biomedical data, and patient populations with the genome already sequenced - which is phenomenal - plus deep clinical data. But it is kept in silos. The World Health Organization estimates that 97% of it is not used at all. And that's a real shame.
It is. Indeed, it's a staggering statistic. So, what does Lifebit do?
Put simply, we bring computation and analysis to where the data already sits. We have a patent for the step-in solution, the federated architecture, the federated analysis, meaning you bring the computational power. Researchers bring analytics to the data, while the organization that has generated or stores the data is still safeguarding it.
So, when any data is analyzed via a solution like Lifebit - this type of grey-area, intermediary analytics is booming - it is completely anonymized?
That's correct. Clinical data is obviously specific to a person - nothing is more specific than your genome - so identification-risk is very real. How you mitigate that is by keeping the data in a place where you can control it and have oversight of it.
In short, it remains the custodian's problem, and not the researcher's. Just one of many areas where our legal landscape is being reshaped. (Another one that has been in the news recently is the occupant of a driverless car bearing no responsibility if it mows down a pedestrian.)
Lifebit is like a reading library as opposed to a lending library. You can come in, into the trusted research environment, then look at the data and query it.
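To make the "reading library" idea concrete, here is a minimal sketch of the general federated pattern: the analysis travels to each custodian, runs where the records sit, and only an aggregate comes back. This is an illustration under stated assumptions, not Lifebit's patented architecture. The `Site` class, the `has_variant` field, the site names, and the sample records are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

Record = Dict[str, bool]
Analysis = Callable[[List[Record]], Dict[str, int]]


@dataclass
class Site:
    """A hypothetical data custodian; its records stay inside this object."""
    name: str
    _records: List[Record]

    def run(self, analysis: Analysis) -> Dict[str, int]:
        # The analysis executes in the custodian's environment and
        # only its aggregate result is handed back to the caller.
        return analysis(self._records)


def count_variant_carriers(records: List[Record]) -> Dict[str, int]:
    """Example analysis: count records flagged with a (made-up) variant."""
    carriers = sum(1 for r in records if r["has_variant"])
    return {"carriers": carriers, "total": len(records)}


def federated_run(sites: List[Site], analysis: Analysis) -> Dict[str, int]:
    """Send the same analysis to every site and sum the aggregates."""
    combined: Dict[str, int] = {}
    for site in sites:
        for key, value in site.run(analysis).items():
            combined[key] = combined.get(key, 0) + value
    return combined


# Invented sample data for two custodians.
sites = [
    Site("london", [{"has_variant": True}, {"has_variant": False}, {"has_variant": True}]),
    Site("tokyo", [{"has_variant": False}, {"has_variant": True}]),
]
result = federated_run(sites, count_variant_carriers)
print(result)  # {'carriers': 3, 'total': 5}
```

The researcher sees the combined counts but never the individual records. A real trusted research environment would add much more, such as access approval, auditing, and checks that results cannot be traced back to individuals.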
Taking a different tack, does he believe that policymakers really understand these issues? Recent noises from Whitehall suggest that the UK wants to tear up regulations that protect citizens from invasive business, to encourage competition and economic growth - despite the obvious risk to data adequacy with the EU, for example.
My expectation, my hope, my recommendation, would be for the UK to stay reasonably close to the EU's GDPR [currently retained in UK law as the UK GDPR, alongside the Data Protection Act 2018]. It has been a reality for a long time, and institutions from the financial sector, telecoms, healthcare, and more, have all changed in that regard. I do think that the government should be championing citizens' privacy protection going forward.
I also think that policymakers would benefit from liaising more closely with industry, especially with homegrown technology vendors that are intimately familiar with these issues, so that decisions are not made in a vacuum.
Meanwhile, how big a role does AI play in what Lifebit does - given its '.ai' domain? It has critical applications, answers Seeger, including early detection of infectious diseases, and early discovery of when a disease jumps from one species to another, or spreads from an isolated region.
But he adds that size remains a problem:
Not even an army could read all the scientific papers out there continuously and absorb them. So deep learning that can understand the context of these papers via natural language processing - understanding the context with deep learning - is important. And neural networks can look at massive scale, at hundreds of millions of data points.
Remember, if you have 300 gigabytes of raw genomic data just for one patient, then every variation could matter - in your predictions for diseases, in which drugs a patient reacts to, and so on. These are all areas where, again, there's an overwhelming amount of information, but AI and deep learning can help.
These are not hypothetical issues. Via Genomics England, Lifebit has been involved since the pandemic began in analyzing the genomes of tens of thousands of people who have had severe COVID, and of people who only experienced mild illness from the virus. Last year, the technology helped uncover five biomarkers linked with severe symptoms.
A fascinating and worthy area of innovation, albeit one that raises tough questions for legislators. But the key lesson is this: GDPR and the like are fostering innovation, not preventing it.