Open source software gets scientific data to shine at Diamond Light Source

By Jessica Twentyman, August 13, 2017
Summary:
The UK’s national synchrotron uses some pretty esoteric tools to get work done, but there may be room for cloud in future.

In a paper published in the EBioMedicine scientific journal this month, a group of scientists have revealed how leukaemia cells prevent themselves from being attacked by the human immune system. This discovery, they believe, could be an important step in the fight to develop new types of drugs for patients with acute myeloid leukaemia (AML), a kind of blood cancer that can often be fatal because of the shortcomings of current treatment strategies.

That the scientists were able to uncover exactly how AML cells evade attack by the immune cells that patrol our bodies owes much to work they conducted with the help of Diamond Light Source, the UK’s national synchrotron.

Government-funded through the UK’s Science and Technology Facilities Council (STFC) and also biomedical research charity the Wellcome Trust, the synchrotron at Diamond Light Source works as a giant microscope, harnessing the power of electrons to produce bright light that scientists can use to study anything from fossils and jet engine components to vaccines, viruses and historical works of art.

Access to the synchrotron is free at the point of use for researchers from both academia and industry, allocated through a competitive application process. All results, meanwhile, must be placed in the public domain.

Open source approach

It’s hard to imagine equipment more complex and specialized than the synchrotron, which is 10,000 times more powerful than a traditional microscope – so it stands to reason that the technology infrastructure that enables scientists to conduct their experiments and make sense of results is highly complex and specialized, too.

The vast majority of it is open source, explains Andrew Richards, head of scientific computing at Diamond Light Source, which is based at the Harwell Science and Innovation Campus in Oxfordshire. In some cases, it comes in the form of enterprise distributions of open source technology, most notably Red Hat Enterprise Linux (RHEL):

We go down that route because what we need is support. This is mission-critical infrastructure that our whole organization depends on so we need the confidence that we can get support when we need it. But some of what we do, we’re using open source software direct from the community because our work is so specialized that there aren’t really any commercial alternatives. The very bespoke nature of these software tools has developed around the needs and demands of the scientific user community.

A good example of this is EPICS (Experimental Physics and Industrial Control System), says Mark Basham, senior software scientist at Diamond. This is the result of an international open-source collaboration, primarily focused on automating the operations and controlling the movements of heavyweight scientific equipment such as telescopes and various types of particle accelerator, including synchrotrons. Diamond is one of the larger EPICS installations in the world and an active contributor to the software.

But since your average paleontologist, for example, doesn’t want to deal with the command lines of a control system like EPICS, Diamond’s data acquisition group has created an additional layer of software that sits between the end-user and the controls. This layer enables that end-user to sit down in front of a screen, press a few buttons to have the system conduct an experiment on their behalf and show them the results. Here, Diamond uses OpenGDA, an open source framework for creating customized data acquisition software for science facilities. It’s based on the Generic Data Acquisition (GDA) software developed at Diamond Light Source itself.

Finally, end users also need to be able to process and visualize their data, and here, Diamond provides them with Dawn, an open source data analysis workbench. This allows them to continue to analyse their findings on their own hardware, long after they’ve left Diamond’s facility, where time and resources allotted to end-users are at a premium.

Still room for cloud

While much of this may seem pretty esoteric to the average corporate IT user, Andrew Richards still sees plenty of scope for using bog-standard cloud infrastructure for storage and processing at Diamond in future.

Right now, the organization stores around 7 petabytes of data in a tape-based archive provided by its funder, the STFC. That archive is growing by 2 petabytes per year, a rate expected to reach 3 petabytes per year within a year or two. While tape has proved to be the most cost-effective approach to date, Diamond is open to other approaches, Richards says:
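To put those figures in perspective, here is a rough back-of-the-envelope projection using only the numbers quoted above; the assumption that the growth rate steps up after two years is an illustrative simplification, not Diamond's own forecast:

```python
# Rough projection of Diamond's tape archive, using the figures quoted
# in the article: ~7 PB today, growing 2 PB/year now, and an assumed
# step up to 3 PB/year after year two (illustrative numbers only).

def projected_archive_pb(years, current_pb=7, near_rate=2, later_rate=3):
    """Estimate archive size in petabytes after `years`, assuming the
    annual growth rate rises from `near_rate` to `later_rate` after
    the second year."""
    size = current_pb
    for year in range(1, years + 1):
        size += near_rate if year <= 2 else later_rate
    return size

for y in (1, 3, 5):
    print(f"Year {y}: ~{projected_archive_pb(y)} PB")
```

On these assumptions the archive would roughly double within five years, which is the scale any cloud alternative would have to absorb.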

Cloud has been considered and is being considered right now, as it happens. At the moment, when you look at the cost of doing some of this storage in the cloud, it can actually be pretty expensive. Providers make it look quite cost-effective on a per-terabyte basis, but we also need to consider the costs involved in getting data back out of the cloud when it’s needed. Those network egress charges can quickly stack up, but having said that, I feel like cloud providers like Amazon and Microsoft are starting to recognize the kinds of volumes that an organization like ours wants to store and are working to make cloud more viable for us. It’s certainly something I’m interested in exploring further.
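Richards' point about per-terabyte pricing versus egress can be made concrete with a small sketch. All prices below are hypothetical placeholders chosen for illustration, not any provider's actual rates:

```python
# Illustrative only: why per-terabyte storage pricing can mislead once
# network egress is counted. Both prices here are hypothetical
# placeholders, not real provider rates.

def monthly_cloud_cost_usd(stored_tb, retrieved_tb,
                           storage_per_tb=4.0,   # hypothetical $/TB/month
                           egress_per_tb=90.0):  # hypothetical $/TB out
    """Total monthly bill: cheap-looking storage plus retrieval traffic."""
    return stored_tb * storage_per_tb + retrieved_tb * egress_per_tb

# A 7,000 TB archive where researchers pull back just 1% in a month:
cost = monthly_cloud_cost_usd(7000, 70)
print(f"${cost:,.0f}")  # egress alone adds thousands to the bill
```

Under these made-up rates, retrieving even a small fraction of the archive contributes a large share of the monthly cost, which is exactly the dynamic Richards describes.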

A perhaps more pressing concern for Diamond is its compute needs. Here, cloud processing could be a real boon as a supplement to its own high-performance computing (HPC) environment, Richards says:

That would help when we have peak loads, for example, that perhaps we can’t address with our own in-house systems. Or where we have commercial customers who don’t really know upfront how much compute resource they’ll need to solve their particular problem. This currently makes it quite challenging for us to know just how much on-premise infrastructure we should have - and then, as a result, the cloud starts to look much more attractive for us, insofar as we could push some data, push some work, do some calculations in the cloud and then scale down our use of these resources when we don’t need them. These more ‘spiky’ workloads are where the cloud is looking more interesting right now.

Already this year, researchers have used Diamond’s synchrotron to explore eco-friendly fuel cells, to research new preservation techniques for great artworks, and to investigate the structure and strength of human bones, as well as to make that groundbreaking leukaemia discovery. It’s all important work, so there’s good reason to hope that cloud providers will step up to the plate to help these efforts expand and diversify.