And the computing numbers at CERN, the home of the Large Hadron Collider (LHC) and where Tim Berners-Lee invented the web, bear this out.
The LHC sits 100m below ground and is cooled to 1.9 Kelvin, or about -271 degrees Celsius. It’s equipped with 6,000-plus magnets that do the work of bending the particle beams so they can be accelerated and collide at various points to create high levels of energy.
One petabyte of data is generated per second, and CERN immediately has to filter this and find a reasonable amount of information to store and analyze, for physics reasons as well as budget considerations. The amount of data produced and kept indefinitely is in the region of 50 to 70PB per year.
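To get a feel for how aggressive that filtering is, the two figures above can be compared directly. The sketch below is a back-of-the-envelope calculation, not CERN's method: it assumes, purely for illustration, that the 1PB-per-second rate is sustained all year round (the uptime_fraction parameter is hypothetical and defaults to that worst case), so the result is an upper bound on the reduction factor.

```python
# Back-of-the-envelope estimate using only the figures quoted in the article.
# ASSUMPTION: the detectors produce 1PB/s continuously; real uptime is lower,
# so the true reduction factor is smaller than what this prints.
SECONDS_PER_YEAR = 365 * 24 * 60 * 60  # 31,536,000

RAW_PB_PER_SECOND = 1        # generated, per the article
KEPT_PB_PER_YEAR = (50, 70)  # stored indefinitely, per the article


def reduction_factor(kept_pb_per_year: float, uptime_fraction: float = 1.0) -> float:
    """Return how many PB are generated for each PB kept."""
    raw_pb_per_year = RAW_PB_PER_SECOND * SECONDS_PER_YEAR * uptime_fraction
    return raw_pb_per_year / kept_pb_per_year


for kept in KEPT_PB_PER_YEAR:
    factor = reduction_factor(kept)
    print(f"{kept}PB kept per year: about 1 part in {factor:,.0f} is stored")
```

Under that assumption, only around one part in several hundred thousand of the raw data survives the filter. Passing a smaller uptime_fraction lowers the estimate proportionally.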
The European Organization for Nuclear Research, as it is formally known, operates out of two computer centers, in Geneva, Switzerland, and Budapest, Hungary. Between these two buildings, CERN runs 15,000 servers, 230,000 cores, 90,000 disks hosting 280PB of data, and 30,000 tapes storing 400PB. Even though most companies have long moved on from tape storage, for CERN and its mammoth data storage requirements it’s still the most effective medium, thanks to its low cost and longevity compared to other formats.
Beyond the two main sites, CERN established the Worldwide LHC Computing Grid (WLCG) in the early 2000s, which operates out of 170 sites across 42 countries and offers much more capacity, with 800,000 cores.
Based on its growth estimates, CERN is facing some significant technology challenges ahead, both in the amount of data it will generate and need to store and in the amount of processing power needed to analyze this data. As CERN moves from its current LHC Run 2 through to Run 3 in 2021 and Run 4 in 2025, it expects the amount of data it’s creating and storing to increase ten-fold, up from 70PB per year to the exabyte range. Eric Grancher, head of the DB service group at CERN, said:
We are working closely with teams of researchers to improve the way we do computing, but also to make it work outside of the box. So maybe to use phone processors, or use a different type of computing, use GPUs, and to virtualize our code is really important for the way we are working.
A future in the cloud
To prepare for this data growth, CERN has started using Oracle cloud technology.
The organization has been an Oracle database user since 1982, and its biggest database now is Accelerator Logging, which generates 150TB of write activity data per month and is now storing over a petabyte of information.
It’s a key element, which actually is used by accelerator physicists on a daily basis to better understand how the accelerator is working, so it’s a key element for them. People have to be able to master and improve the efficiency of how things are done and for that we have a number of signals, two million signals, which retrieve data like your electricity level, current, temperature and magnets fields.
CERN is also using Oracle Big Data Discovery, which has brought huge productivity advantages as it lets those who aren’t database specialists get visuals of those signals and find efficiencies by running Spark.
Spark is a very interesting and powerful language, but it’s quite difficult to work and handle. We have a number of engineers, specialists in the field of electronics and of magnets, and it wouldn’t be efficient that they go and learn Spark. They require tools that they can use for that.
Beyond its limits
Around a year ago, CERN also worked on a project to integrate Oracle Cloud Infrastructure into the WLCG, to take advantage of available resources that weren’t running in all the centers. It moved the systems that globally monitor resources and report on them via dashboards into Oracle Cloud Infrastructure, including 10,000 cores with 10GB of RAM.
Exadata Cloud Service and Data Warehouse in the Cloud are also helping CERN take a database out of its data center, which is reaching the limits of its capacity and analytics, and transfer that data into the cloud.
We run commodity hardware in our data centers, but sometimes you have usages which go beyond what can be done with commodity hardware, so that’s why having it in the cloud makes sense. Most recently we've been working with Data Warehouse in the Cloud, one of the two types of databases which are provided as a service.
The move has given CERN huge reductions in the amount of data it needs to store: the 620GB of indexed tables created from those two million signals compressed down to just 70GB once they were run through the Autonomous Data Warehouse Cloud.
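Those two sizes translate into a compression ratio and a space saving with simple arithmetic. The snippet below is only an illustration of that calculation using the 620GB and 70GB figures quoted above; the function names are made up for this example and say nothing about how Oracle's compression works.

```python
# Illustrative arithmetic on the before/after sizes quoted in the article.
ORIGINAL_GB = 620    # indexed tables before loading into the cloud service
COMPRESSED_GB = 70   # size reported after compression


def compression_ratio(original: float, compressed: float) -> float:
    """Return how many times smaller the compressed data is."""
    return original / compressed


def space_saving(original: float, compressed: float) -> float:
    """Return the fraction of the original storage that is no longer needed."""
    return 1 - compressed / original


ratio = compression_ratio(ORIGINAL_GB, COMPRESSED_GB)
saving = space_saving(ORIGINAL_GB, COMPRESSED_GB)
print(f"Compression ratio: {ratio:.1f}x, space saved: {saving:.0%}")
```

In other words, the stored tables are close to nine times smaller, freeing up nearly nine-tenths of the space they originally took.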
The cloud also has its advantages for CERN as a public organization, allowing it to speed up the purchasing process and get access to new technology more quickly.
We’re a public entity, we have to work with public tenders. When we need to procure new systems, new machines, this is not a fast process. You have to write technical specifications, you have to publish them, bid and select, and sometimes you have to get approval for buying the equipment.
Here we're talking about essentially a web interface where you just say the size and the amount of processors you want to use, and essentially a few minutes down the line you’re done.
It’s not quite this simple, though. For an organization the size of CERN, Grancher noted, there’s still the not inconsequential task of moving all those petabytes of data into the cloud service before it’s available to use.