The biggest Big Data project in the universe
- Summary:
- The biggest amount of data ever gathered and processed passing through the UK, for scientists and SMEs to slice, dice, and turn into innovations and insights. When Big Data becomes Super-Massive Data.
Fifty-five years ago today, humanity left Earth for the first time, as cosmonaut Yuri Gagarin became the first man in space. Since then, we’ve gathered more data about the universe than in all of human history combined, via technology in space and on the ground.
Cosmological data is Big Data, the biggest there is. And one organization knows more about planning for Big Data, and how to process it when it arrives, than any other enterprise - private or public sector - on the planet. So much so, that we may need to coin a more appropriate phrase for what they gather in the decades to come - Super-Massive Data.
The Square Kilometre Array (SKA) Project is the biggest science project on or off Earth. It involves the building over the next two decades of a series of giant arrays of radio telescopes in remote parts of Australia and southern Africa, to create a globe-spanning dish, in effect, with a surface area over 200 times larger than that of the Lovell Telescope at Jodrell Bank.
This UK-centered international programme – headquartered at Jodrell Bank – is designed to understand aspects of fundamental physics on a universal scale, such as gravity and magnetism, all the way out to more traditional astronomy topics, such as supermassive black holes, the origins and evolution of the universe, and the nature of dark matter and dark energy.
Professor Philip Diamond is Director General of the SKA Organisation. He explains:
In many ways, you can think of the SKA as a time machine, as we will be able to look back in time and make movies of the evolving universe. We’ve recently published our science case. It comes in two volumes, totalling 2,000 pages, and when dropped on a minister’s desk, the nine kilograms make a resounding thump – which is the principal aim of the printed copy!
I haven’t mentioned SETI [the Search for Extraterrestrial Intelligence], but we will be the ultimate SETI machine, too. It’s not one of our main aims, it will be a byproduct, but if we do detect that little signal then I think that would address some of the funding issues we might have.
The UK has committed £200 million to SKA to date, and the Australian government A$300 million, but over the next few years the project will need billions of dollars of investment, the case for which the SKA Organisation is building. Currently it is a UK limited company, but it will eventually become a treaty organisation and inter-governmental project, similar to CERN.
Back to the Big Bang
Using the most common element in the universe, neutral hydrogen, as a tracer, the SKA will be able to follow the trail all the way back to the cosmic dawn, a few hundred thousand years after the Big Bang.
But over billions of years (a beam of light travelling at 671 million miles an hour would take 46.5 billion years to reach the edge of the observable universe) the wavelength of those ancient hydrogen signatures becomes stretched via the Doppler effect, until it falls into the same range as the radiation emitted by mobile phones, aircraft, FM radio, and digital TV. This is why the SKA arrays are being built in remote, sparsely populated regions, says Diamond:
The aim is to get away from people. It’s not because we’re antisocial – although some of my colleagues probably are a little! – but we need to get away from radio interference, phones, microwaves, and so on, which are like shining a torch in the business end of an optical telescope.
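The stretching of the hydrogen signal described above can be sketched with a few lines of arithmetic. This is an illustration only: the rest frequency of the neutral-hydrogen line (about 1420.4 MHz) is standard physics rather than a figure from the article, the 88-108 MHz FM band is an assumed range, and the redshift values are chosen arbitrarily to show how far the signal can slide down the dial.

```python
# Rest frequency of the neutral-hydrogen (21 cm) line in MHz.
# Standard physics value, not a figure quoted in the article.
HI_REST_MHZ = 1420.40575

# Approximate FM broadcast band in MHz (an assumption for illustration).
FM_BAND_MHZ = (88.0, 108.0)


def observed_mhz(z: float) -> float:
    """Frequency at which the hydrogen line arrives after redshift z."""
    return HI_REST_MHZ / (1.0 + z)


def redshift_for(observed: float) -> float:
    """Redshift that shifts the hydrogen line down to `observed` MHz."""
    return HI_REST_MHZ / observed - 1.0


def in_fm_band(z: float) -> bool:
    """True if the redshifted hydrogen line lands inside the assumed FM band."""
    low, high = FM_BAND_MHZ
    return low <= observed_mhz(z) <= high


if __name__ == "__main__":
    # Illustrative redshifts only; none of these are SKA design values.
    for z in (0, 6, 13, 20):
        print(f"z={z:>2}: {observed_mhz(z):8.1f} MHz, in FM band: {in_fm_band(z)}")
```

Under these assumptions, hydrogen signals from redshifts of roughly 12 to 15 arrive right in the middle of the FM dial, which is why a quiet radio environment matters so much.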
Eventually there will be two SKA telescopes. The first, consisting of 130,000 2m dipole low-frequency antennae, is being built in the Shire of Murchison, a remote region about 800km north of Perth, Australia – an area the size of the Netherlands, but with a population of fewer than 100 people. Construction kicks off in 2018.
By Phase 2, said Diamond, the SKA will consist of half-a-million low and mid-frequency antennae, with arrays spread right across southern Africa as well as Australia, stretching all the way from South Africa to Ghana and Kenya – a multibillion-euro project on an engineering scale similar to the Large Hadron Collider.
Which brings us to that supermassive data challenge for what, ultimately, will be an ICT-driven science facility. Diamond says:
The antennae will generate enormous volumes of data: even by the mid-2020s [Phase 1 of the project] we will be looking at 5,000 petabytes – five exabytes – a day of raw data. This will go to huge banks of digital signal processors, which we’re in the process of designing, and then into high-performance computers, and into an archive for scientists worldwide to access.
Our archive growth rate will be somewhere between 300 and 500 petabytes a year – science-quality data coming out of the supercomputer.
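To get a feel for how much the raw stream must shrink before it reaches the archive, the two figures Diamond quotes can be put side by side. The sketch below is a rough calculation, not an SKA specification: it assumes the telescope produces raw data at the quoted rate every day of the year, and it uses decimal units (1 exabyte = 1,000 petabytes).

```python
PB_PER_EB = 1_000          # decimal units: 1 exabyte = 1,000 petabytes
DAYS_PER_YEAR = 365        # assumes continuous operation (a simplification)

# Figures quoted in the article for SKA Phase 1.
raw_pb_per_day = 5 * PB_PER_EB          # five exabytes of raw data a day
archive_pb_per_year = (300, 500)        # archive growth range, PB a year

raw_pb_per_year = raw_pb_per_day * DAYS_PER_YEAR


def reduction_factor(raw_pb: float, archived_pb: float) -> float:
    """How many times smaller the archived data is than the raw data."""
    return raw_pb / archived_pb


if __name__ == "__main__":
    print(f"Raw data per year: {raw_pb_per_year:,} PB")
    for archived in archive_pb_per_year:
        factor = reduction_factor(raw_pb_per_year, archived)
        print(f"Archive of {archived} PB/year -> reduced about {factor:,.0f}x")
```

On these assumptions the processing chain has to cut the raw stream by a factor of several thousand, which gives some sense of why the signal processors and supercomputers sit at the heart of the design.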
Boldly going
But those volumes are only for SKA Phase 1, adds Diamond:
For the full SKA, the figures will go up by a factor of 100. But that’s in the 2030s. We’re designing now for the 2020s, but in the following decade, the data problem will become much worse.
To put all this in perspective, worldwide annual Google searches generate about 100 petabytes of data. Facebook is about twice that. Global business emails generate about 3,000 petabytes of data. But the raw data from SKA Mid, we estimate, will be 62 exabytes (62,000 petabytes). So we’ve got to design equipment to handle something that’s 20 times larger than global email traffic.
Total global internet traffic is one zettabyte. Ultimately, we will have five zettabytes within our internal systems alone. So we will need to build, or have access to, supercomputers with a speed of approximately 300 petaflops.
The fastest supercomputer in the world is currently China’s Tianhe-2, which runs at 33.86 petaflops, so the SKA will need access to a computer that is between six and 10 times faster than the fastest machine on earth. But this doesn’t bother Diamond:
The IBMs and Intels of this world tell us that this is entirely within their forecast capability. In fact, I’m pretty sure that the NSA already has something a little faster but they won’t tell us.
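The comparisons in this section boil down to a couple of simple ratios, which can be checked as follows. This is a back-of-envelope sketch using only the figures quoted above; the article does not state the time period for the 62-exabyte SKA Mid estimate, so the code assumes it is on the same basis as the email figure, as the quote implies. The Facebook value of 200 petabytes is derived from "about twice" the Google figure.

```python
# Data volumes quoted in the article, in petabytes.
volumes_pb = {
    "Google searches": 100,
    "Facebook": 200,            # "about twice" the Google figure
    "Business email": 3_000,
    "SKA Mid raw data": 62_000,  # 62 exabytes
}

# Supercomputer speeds quoted in the article, in petaflops.
required_pflops = 300
tianhe2_pflops = 33.86


def ratio(a: float, b: float) -> float:
    """How many times larger a is than b."""
    return a / b


ska_vs_email = ratio(volumes_pb["SKA Mid raw data"], volumes_pb["Business email"])
speedup_needed = ratio(required_pflops, tianhe2_pflops)

if __name__ == "__main__":
    print(f"SKA Mid raw data vs business email: about {ska_vs_email:.1f}x")
    print(f"Required speed vs Tianhe-2: about {speedup_needed:.1f}x")
```

Both results sit comfortably within the ranges given in the text: roughly 20 times global business email, and a machine between six and 10 times faster than Tianhe-2.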
And as with all Big Data, SKA's Super-Massive Data will not only be defined by its volume and its velocity, but by that all-important third ‘V’ - value. Diamond says:
What we then have to do to these enormous volumes of raw data is detect and amplify them, digitise them and line them up, correlate them and integrate them, process them, and then create sky images, which the scientists will use. We at the SKA will be providing science-ready data products, calibrated and quality controlled.
Traditional radio astronomy goes through this process many times, but we will only be able to do it once. We won’t be able to store all the raw data, it’s a one-pass system. So we have to understand our systematics better than any existing facility on Earth.
For us, the main principles are scalability, affordability, and maintainability, but we also have to maintain innovation. We have bright people throughout the world developing the algorithms to process this data, but we’ve got to be able to replace them [the algorithms] straightforwardly as new ideas emerge.
My take
The SKA is an awe-inspiring project, and Diamond hopes that the intergovernmental organisation behind it will be in place by 2017.
Let’s hope, therefore, that local politics don’t derail this extraordinary international collaboration. Might the UK’s possible exit from the European Union – the ‘Brexit’ – imperil this and many other ‘big science’ programmes? And has the SKA programme considered deeply enough the physical (as opposed to data) security of vast arrays that cross national boundaries?
Much will come down to how important people consider such programmes to be, and what their long-term terrestrial applications might be.
The biggest amount of data ever gathered and processed passing through the UK, for scientists and SMEs to slice, dice, and turn into innovations and insights? Let’s hope we can hang on to that big picture.