Researchers at Cern, the European particle physics laboratory that is home to the world’s largest particle accelerator, the Large Hadron Collider (LHC), aren’t just searching for the origins of the universe – they’re also working on the future of big data.

The LHC experiments alone will generate some 22PB of data this year, and that’s after throwing away 99 per cent of what is recorded by the LHC detectors.

Because the LHC experiments generate vast amounts of information, and because physicists access that data from more than 150 datacentres across the globe, Cern doesn’t rely on a relational database to store raw experimental data.

The raw data from Cern’s experiments is instead stored in structured files in the ROOT format, which are better suited to physics analysis. Transactional relational databases – Oracle 11gR2 with Real Application Clusters and Active Data Guard – store the metadata used to manage that raw data.

Raw data is analysed using batch processing, but batch processing is slow, and the lab plans to investigate ways to let researchers tap into larger pools of data and query that data more rapidly. In doing so, the lab will act as a testbed for future big data technologies for its industry partners HP, Huawei, Intel, Oracle and Siemens, all of whom work with Cern on the openlab research project.

“As the amount of data continues to grow we have to keep trying to optimise the rate at which we can process the data,” said Bob Jones, Cern’s head of openlab.

“If Cern shows what works, and we publish that information with HP, Oracle, Intel, Siemens or Huawei, then that’s beneficial to their other customers and to other business sectors as well.”

The lab plans to study ways of placing more data from its experiments both in relational databases and in data stores based on NoSQL technologies, such as Hadoop and Dynamo, the distributed data store that underpins Amazon’s S3 cloud storage service.

NoSQL technologies are well-suited to scaling out data – to distributing it across many different server clusters – without the increased data-management complexity that afflicts relational databases at that scale. Cern also favours off-the-shelf computers in its IT estate, and NoSQL technologies work well on clusters of cheap commodity servers.
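Dynamo-style stores typically distribute data with consistent hashing: keys and servers are hashed onto the same ring, each key is owned by the first server clockwise from it, and adding or removing a server only remaps the keys adjacent to it. A minimal Python sketch of the idea – the server names and keys here are purely illustrative, not Cern’s infrastructure:

```python
import hashlib
from bisect import bisect_right

# Illustrative cluster of commodity servers (hypothetical names).
nodes = ["server-1", "server-2", "server-3"]

def ring_hash(s):
    # Map any string to a position on the hash ring.
    return int(hashlib.md5(s.encode()).hexdigest(), 16)

# Sort servers by their position on the ring.
ring = sorted((ring_hash(n), n) for n in nodes)

def owner(key):
    # Walk clockwise to the first server at or after the key's hash,
    # wrapping around to the start of the ring if necessary.
    positions = [p for p, _ in ring]
    i = bisect_right(positions, ring_hash(key)) % len(ring)
    return ring[i][1]

# Every key deterministically maps to exactly one server.
print(owner("event-42"))
```

Because each key’s owner depends only on the ring layout, a cluster can grow by adding servers without reshuffling most of the data – the property that makes this approach attractive for scale-out storage.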

“The sort of things we’re looking at are: can we use the links between the data in an Oracle database to extract it, put it into some other environment, do some Hadoop-style processing, apply some business logic to that data and then shift the results back into Oracle,” he said.
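The round trip Jones describes – extract rows from the relational store, run a MapReduce-style aggregation, load the results back – can be sketched in a few lines of Python. The table layout and values below are invented for illustration and do not reflect Cern’s actual Oracle schema or Hadoop jobs:

```python
from collections import defaultdict

# Hypothetical rows "extracted" from a relational store:
# (dataset_id, bytes_processed) pairs. Names are illustrative only.
rows = [
    ("run-2012-a", 120),
    ("run-2012-b", 300),
    ("run-2012-a", 80),
    ("run-2012-b", 50),
]

def map_phase(rows):
    # Emit (key, value) pairs, as a Hadoop mapper would.
    for dataset, size in rows:
        yield dataset, size

def reduce_phase(pairs):
    # Group by key and aggregate, as a Hadoop reducer would.
    totals = defaultdict(int)
    for key, value in pairs:
        totals[key] += value
    return dict(totals)

# Aggregated results, ready to be shifted back into the relational store.
results = reduce_phase(map_phase(rows))
print(results)  # → {'run-2012-a': 200, 'run-2012-b': 350}
```

A real deployment would replace the in-memory list with an Oracle extract and run the map and reduce phases across a Hadoop cluster, but the shape of the computation is the same.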

As Cern tries to find more efficient ways of handling big data, it will factor in the continuing evolution of computer hardware – particularly the prediction that exascale data centres, capable of more than a billion billion calculations per second, will exist within five years – and the advent of cheap cloud storage services such as Amazon S3.

“One of the questions we are looking at is: imagine you had an infinite amount of compute capacity. What would you change in your programming model? Would you continue to throw away quite so much data?” Jones said.

“If we can explore that with our partners they can see what are the potential avenues that ought to be explored, and what are the potential barriers they are going to have to overcome,” he said, adding that he anticipates the main difficulties in running an exascale data centre will be providing enough power and moving data in and out of servers rapidly enough to avoid creating a bottleneck.

Cern recently struck a deal to expand its core computer centre – which houses 65,000 processor cores and 30PB of storage – by building an additional data centre in Hungary that will add 20,000 cores and 5.5PB of storage.

Much of the analysis of the data generated by Cern’s experiments is carried out by a network of more than 150 computing centres in the Worldwide LHC Computing Grid (WLCG). The WLCG puts some 150,000 processors at Cern’s disposal, but the research institute is examining whether it could double that number by turning to cloud computing.