Francis Crick Institute builds its own ‘Trusted Research Environment’ toolkit with Snowflake
Europe’s largest biomedical research center, Francis Crick Institute, says use of Snowflake can turn a two-year build into 30 minutes
The UK’s Francis Crick Institute says that the Snowflake Data Cloud is helping it build the secure IT infrastructure needed to support international, multi-participant scientific research initiatives, and that it has slashed the build time from up to two years to approximately 30 minutes.
As a result, says the organization’s Chief Information Officer, James Fleming, these rapidly built ‘TREs’ (trusted research environments) are already supporting ground-breaking research in four collaborative projects - on everything from the effects of long COVID on rare cancers, to a new test for diagnosing Parkinson's.
Operational since 2016 and based near St Pancras International Station in London in a distinctive building that hosts over 4 kilometers of lab bench, the Institute - ‘the Crick’ to staffers - is Europe’s biggest single-site biomedical research facility.
With a mission to better understand why disease develops and to find new ways to treat, diagnose, and prevent illnesses such as cancer, heart disease, stroke, infections and neurodegenerative diseases, a key process for the organization is to connect networks of researchers with useful data.
Pre-COVID, that work tended to be less directly about working with patient data and more about core science, says Fleming.
However, during the global health crisis, Crick scientists developed new links with both UK and international clinical organizations. He says:
Lots of new science-to-science peer networks emerged because of that, and it's something we really wanted to capitalize on in our research going forward. Rather than doing discovery research, which typically doesn't involve much in terms of patient data and tissue, we suddenly found ourselves much more heavily integrated into clinical environments that involve that end of medicine.
But since so many of its partners in academia and industry operate in different data governance regimes, such as GDPR in Europe and HIPAA in the United States, ensuring patient data is always appropriately protected can be a challenge, says Fleming.
That means that creating a secure computing environment that can hold all the data you want and enables access to it for analysis is a complex task.
For one, you must ensure that data is only ever used for agreed purposes, that access to the data is always logged and audited, and that no information that might identify an individual can ever leak out.
For a major research hub like the Crick, creating a TRE for just one program might involve linking potentially hundreds of groups across the world - each of which needs absolute traceability of its data to prove that data access is 100% consistent with what’s been approved by its relevant local ethics committees, and that all work is only ever conducted in line with identifiable patient consent.
TREs, then, are not exactly commodity IT. He says:
There are certain system integrators who will go out and build a TRE for you, and I do know of several places that have started multi-year projects to build the physical infrastructure for it. And they're not finished yet.
For sure, ‘the Crick’ does have significant data handling and cloud capabilities, including an 18 Petabyte on-premise Data Lake and high performance computing resources. He says:
We're very well served in terms of our existing technology stack in terms of bulk processing and working with data, but what we needed specifically was the ability to cordon off a set of resources outside of our traditional security perimeter and be able to demonstrate and audit in a very compliant way exactly who has access to what data, at what point in time and what they're doing with it.
A cloud-based solution
In February 2021, Fleming started looking at other options that might give him what he wanted sooner than waiting for a third party to build it - and that would instead let him create a TRE-builder for long-term use.
On-premise was not feasible, as a UK-based data center would not be HIPAA compliant.
That meant cloud, he says - but creating an in-house TRE wasn’t attractive either. He explains:
Doing it that way would mean an incredible amount of engineering: we'd have to build all of the necessary logical controls around deployment of infrastructure, sizing of infrastructure - basically, we’d have had to build the whole damn thing ourselves as a layer on top of our chosen cloud.
Fleming therefore went to market to find something that could instead do all the work for him using metadata - allowing him and his team to quickly model a virtual consortium in metadata, pass it onto a platform, and have a tool create it for them. He adds:
We worked out we could create a TRE if we could properly define, Who are the people? What are their roles? What is their login? What are they going to perform, what data is going to be loaded, et cetera? If everyone's working in the U.S., it needs to be deployed on the Eastern Seaboard and needs to have the HIPAA compliance business-critical wrapper around it, and so on.
Modelling your TRE as metadata like this gives you a series of decision trees that then determine the exact specification of the environment that you need. At that point, you've got a JSON file that can be outputted to create a nicely designed, logically partitioned trusted research environment that still allows you to have all the processing power and storage capabilities native to the underlying cloud, but with all the controls you need to put around it.
Fleming says Snowflake allows him to do all this, with what he calls its ‘logical control layer’ of particular help.
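To make the metadata-driven idea concrete, the approach Fleming describes (who the people are, what their roles are, what data is loaded, and where everyone works) could be sketched roughly as follows. This is an illustration only, not the Crick's tooling or any Snowflake API: the field names, role names, region labels and the non-U.S. fallback rule are all hypothetical. The only rule taken from the article is that an all-U.S. consortium is deployed on the Eastern Seaboard with a HIPAA-compliant, business-critical wrapper.

```python
import json
from dataclasses import dataclass, field, asdict
from typing import List


@dataclass
class Member:
    name: str
    role: str      # hypothetical role names, e.g. "analyst", "data_steward"
    country: str   # where the member works, e.g. "US", "GB"


@dataclass
class Consortium:
    project: str
    members: List[Member]
    datasets: List[str]
    approved_purposes: List[str] = field(default_factory=list)


def choose_deployment(consortium: Consortium) -> dict:
    """One small 'decision tree' that maps membership to a deployment."""
    countries = {m.country for m in consortium.members}
    if countries == {"US"}:
        # Rule described in the article: all-U.S. consortium.
        return {"region": "us-east", "compliance": ["HIPAA"],
                "tier": "business-critical"}
    # Everything below is an assumed placeholder rule, not from the article.
    compliance = ["GDPR"]
    if "US" in countries:
        compliance.append("HIPAA")
    return {"region": "eu-west", "compliance": compliance,
            "tier": "business-critical"}


def build_tre_spec(consortium: Consortium) -> dict:
    """Turn the consortium metadata into an environment specification."""
    return {
        "project": consortium.project,
        "deployment": choose_deployment(consortium),
        "roles": sorted({m.role for m in consortium.members}),
        "users": [asdict(m) for m in consortium.members],
        "datasets": list(consortium.datasets),
        "approved_purposes": list(consortium.approved_purposes),
        "audit_logging": True,  # every access is expected to be logged
    }


def to_json(consortium: Consortium) -> str:
    """Serialise the specification as the JSON file a builder would consume."""
    return json.dumps(build_tre_spec(consortium), indent=2)
```

Calling to_json() on a Consortium built from web-form input would give the kind of JSON document that a separate provisioning step could act on; that provisioning step is deliberately left out here because the article does not describe it.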
Four projects have now got working TREs, including the COVID and Parkinson’s projects already mentioned, plus two others: one with multiple institutions across Africa looking at vaccination rates for Hepatitis B, and a fourth looking at various neurology aspects linked to schizophrenia. He says:
We have researchers working on just about any disease area of human biology you can think of, so we can now work with them to understand the parameters of any new consortium they want to work with.
Getting all the ethics and consent you need to do that still takes six to nine months - but once that’s done, you can literally tap the information in as a series of web forms and then build a TRE 30 minutes later that allows them to get on with the science.
For Fleming, use of this kind of cloud-based data storage and analytics means:
We can do novel science into human disease with sensitive data at a speed and cost-effectiveness that just wouldn’t be possible otherwise.