CERN: a scalable infrastructure for data integration and reporting

Miroslav Potocky from CERN
Miroslav Potocky gives an account of CERN’s data integration and reporting infrastructure

How do you deal with infrastructure issues such as scalability when your organization is huge and your data adds up to several hundred terabytes? CERN, the world’s biggest research organization, is implementing a scalable infrastructure for data integration and reporting using open source tools such as Pentaho Data Integration. At the Pentaho User Meeting, Miroslav Potocky will present the project.

Miroslav, who are you?

I am an IT professional at CERN specializing in databases, database storage and, more recently, data integration. My formal education (Master’s equivalent) is in informatics with an emphasis on computer networks, from the Technical University of Košice, Slovakia.

During my career, I have moved from *NIX system administration to databases. For more than a decade I have worked as a DB administrator/architect and in database storage administration and consultancy in the database group of CERN’s IT department. These experiences have allowed me to take on the challenge of leading our internal project for implementing a scalable, on-demand data integration and reporting infrastructure based on the Pentaho tools used at CERN.

My personal hobbies match quite well with my daily work: I tend to spend a lot of time in front of (some form of) computer display, tinkering with my home server infrastructure and an array of IoT devices, or relaxing with computer games. To avoid spending too much time with electronics, I ski (semi)regularly and enjoy traveling to new places with my family.

What will you present in your talk?

In my contribution, I want to first give an overview of CERN – the European Organization for Nuclear Research – and its mission. This will be followed by a summary of the build-up of our infrastructure, with emphasis on Puppet-based deployment using virtual machines (OpenStack) and containers (OpenShift/Kubernetes).

Starting from the decision to use Pentaho software as a central service, I will give details of the integration with existing CERN IT services and the challenges we had to face (and are still facing) when building a centralized on-demand infrastructure. All of this is aimed at the ultimate goal of bringing data together for the thousands of users in our organization.

CERN uses PDI on a large scale. Can you give more details?

At CERN, there are thousands of different data sources and even more relations between them. Pentaho tools (Pentaho Data Integration, Report Designer, User Console) are used to better understand, process and visualize those relations and the resulting datasets. Since the service we built is still in its infancy, the scale has not yet reached its full potential. However, if you imagine several hundred terabytes of data and several thousand users eligible to take advantage of this infrastructure, the intended scale becomes very apparent.

CERN and other organizations will present their data projects at the Pentaho User Meeting in Frankfurt, Germany on March 11. Registration is free; more information and the agenda can be found on this page.