Chasing the God particle with Pentaho
- Summary:
- Pentaho is playing a major role in helping CERN get its business management environment back under control and ready for growth
Everyone – or so one would assume – has heard of CERN. The popular perception is of an organisation with one of the largest particle accelerators in the world buried under the Swiss mountains, where many hundreds of propeller-headed scientists fire infinitesimally small particles at each other at something near the speed of light, to see if they can find even smaller particles that just might be God… that's the place, isn't it?
Well yes, but what lies behind the Large Hadron Collider and the startling scientific discoveries it is making in fundamental particle physics is a huge organisation, employing some 2,500 staff and providing all the forms of support that some 10,000 visiting scientists require.
Two states
What is often forgotten in the front-page splashes about hunting for and finding 'God' particles is the fact that one of CERN's most important roles is as an international university. In addition, because its facilities straddle two countries it is also, administratively, a completely independent 'state' that still has to report a good deal of information, such as HR-related data about its permanent staff, to either or both France and Switzerland.
This means the underlying management of the organisation has to provide a wide range of public services like health care, fire, waste collection and many social benefits, in addition to the more normal business management functions of accounting, procurement and the rest.
Providing the IT resources needed to manage all these operational functions is the responsibility of the Advanced Information System Unit (AIS). This provides all the compute resources and support needed for the business management of the CERN site, allowing all the researchers and university staff to concentrate entirely on their own work.
Like many organisations of its size, CERN had grown an IT infrastructure that no longer matched its requirements, according to the Deputy Group Leader of AIS, Jan Janke. The problem was that over time the Unit had built some five major data warehouses for different parts of the business, with the inevitable result of duplication, inconsistencies and errors.
The Unit had developed some in-house reporting tools that used SAP to provide self-service reporting, but the underlying architecture of the data meant that discrepancies were starting to become apparent:
Everything was held in different silos and there was no single version of the truth. So at the end of 2012 we looked at how we could do things differently. We also saw that there were difficulties staying up to date with new client services, and there was also a need for much better graphics in reporting. We also realised that there were now tools available that provided those capabilities.
This prompted a staged process for selecting the tools they considered would be needed: stage one investigated all possible contenders, stage two defined a shortlist of tools to be used in proof-of-concept trials, and the final stage was to make the necessary choices.
Goals
The two key goals were, therefore, to create a single data warehouse that could become that single version of the truth for all of CERN’s business management issues, and to get as close to real time in delivery of reports as possible. Janke explains:
CERN works on two different timescales, business time and technical time. Business time means the ability to deliver reports within one minute of an event. Technical time is important because the scientists need to have an accurate record of the data captured at a certain time on a certain day, so the tech data warehouse requires a permanent snapshot capability. This is not available in traditional warehouses, which normally update information with the latest available data.
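The snapshot requirement Janke describes can be illustrated with a generic append-only pattern: rather than overwriting a value on each load, every load stamps its rows with a snapshot date, so any past state can be reproduced exactly. The sketch below is a minimal illustration of that idea, using an in-memory SQLite table with invented column and sensor names; it is not CERN's actual schema or toolchain.

```python
import sqlite3

# Hypothetical point-in-time ("snapshot") table: each load appends a row
# stamped with its snapshot date instead of updating in place, so the
# warehouse can answer "what did we know on day X?" exactly.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE readings (
        sensor   TEXT,
        value    REAL,
        snapshot TEXT   -- date this value was loaded
    )
""")
conn.executemany(
    "INSERT INTO readings VALUES (?, ?, ?)",
    [
        ("beam_current", 0.45, "2015-03-01"),
        ("beam_current", 0.47, "2015-03-02"),  # later load revises the value
    ],
)

def as_of(conn, sensor, day):
    """Return the latest value for `sensor` as it was known on `day`."""
    cur = conn.execute(
        """SELECT value FROM readings
           WHERE sensor = ? AND snapshot <= ?
           ORDER BY snapshot DESC LIMIT 1""",
        (sensor, day),
    )
    row = cur.fetchone()
    return row[0] if row else None

print(as_of(conn, "beam_current", "2015-03-01"))  # -> 0.45 (state on that day)
print(as_of(conn, "beam_current", "2015-03-05"))  # -> 0.47 (latest revision)
```

A conventional warehouse that simply updated the row would only ever return 0.47, losing the record of what was known on 1 March; the append-and-filter pattern preserves both.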
Part of the selection process was the choice of data management and reporting tools. Despite the data warehouses being based on Oracle, and CERN's Unit being essentially an 'Oracle shop', it was Pentaho that got the nod over Oracle Enterprise Edition, because it is open source and Java-based, which gave the Unit access to the source code. This made it much easier to customise to their needs, says Janke:
We took on Pentaho last September. It is strong in various areas with us because it is open source and written in Java, making it easy to integrate and work with. It provides all the BI reporting that is needed, and produces good-looking, effective reports. It also has good dashboards so we can tell quickly what is going on in the organisation.
Business experts within the Unit are already starting to use the Analyser tool with Pentaho to access and manipulate the data in new ways using comparative analyses. From this comes the opportunity to gain new insights on CERN’s operations. It is also being embedded into existing applications so that the business experts can use the new tools within a familiar environment and user interface. Janke says:
This means the users can focus on data and not worry about its preparation.
The Unit is also planning to exploit the data integration capabilities of Pentaho. This capability was not part of the original requirement but its availability has opened up new potential for the business experts. This includes the possibility of starting to work with unstructured data, such as documents and XML data from functions such as internal materials purchasing, and getting it into a form that makes it ready for use in the data warehouse.
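Getting XML data "into a form that makes it ready for use in the data warehouse" essentially means flattening nested documents into tabular rows. The sketch below shows that step in plain Python; the element names (`order`, `item`) and the purchasing document itself are invented for illustration, and in practice this kind of transformation would be built in Pentaho Data Integration rather than hand-coded.

```python
import xml.etree.ElementTree as ET

# Hypothetical purchasing document; the schema is invented for illustration.
xml_doc = """
<orders>
  <order id="PO-1001" date="2015-06-01">
    <item code="CABLE-7" qty="120" unit_price="2.50"/>
    <item code="FLANGE-3" qty="4" unit_price="87.00"/>
  </order>
</orders>
"""

def flatten(xml_text):
    """Flatten nested order/item XML into warehouse-ready row dicts."""
    rows = []
    for order in ET.fromstring(xml_text).iter("order"):
        for item in order.iter("item"):
            qty = int(item.get("qty"))
            rows.append({
                "order_id": order.get("id"),
                "order_date": order.get("date"),
                "item_code": item.get("code"),
                "qty": qty,
                "line_total": qty * float(item.get("unit_price")),
            })
    return rows

for row in flatten(xml_doc):
    print(row)
```

Each output row is self-contained (order identifiers repeated on every line item), which is the shape a warehouse fact table expects for loading and later comparative analysis.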
Issues
The main challenge that has prompted this major upgrade of business management operations at CERN is the amount of data the organisation is generating above and beyond the obviously huge volumes generated by the university and the scientific processes themselves.
This is not helped by the fact that, in straddling two nation states, there has to be a fair amount of official duplication.
In addition, speed of access to the data is now getting to be a major factor for the Unit’s business managers. With the current environment of five major data warehouses, plus a number of smaller datamarts serving specialist areas, access time had started to become a significant problem, particularly where comparative analyses were required. Janke says:
They must be able to access the data they need in less than a second now, or there is trouble.
The new single warehouse is based on Oracle 12c, with all data running in memory. Janke sees no upper limit on its size and expects it to be easy to scale.
For now Pentaho is being used to move data from the existing warehouses to the new environment, but the plan is to start offering its analysis capabilities to some business managers by the end of this year. It will then be rolled out during 2016, with all aspects of the Unit’s business management remit being up and running by the end of that year.