The intent is not just to keep data for posterity. Old data can be mined to test new theories and to provide crucial references for new experiments, says Diaconu. Before the Higgs boson was discovered in 2012, for example, the Large Electron–Positron collider, the LHC’s predecessor at CERN, came back into the spotlight as physicists scoured its 1990s-era data for an exotic type of Higgs that had not been theorized when the data were gathered. In this way, keeping data alive and open is “enlightened self-interest”, says Michael Hildreth, a physicist at the University of Notre Dame in Indiana and leader of the US-funded Data and Software Preservation for Open Science (DASPOS) effort, which has goals similar to those of the DPHEP.
DASPOS is building a template for preserving data: a checklist of what should be stored and how to store it. Next year, in a ‘curation challenge’, DASPOS will task physicists with recreating results from other experiments using only the information captured by this template. One test will almost certainly use LHC data, challenging CMS physicists, for example, to recreate results from the rival ATLAS experiment. Another test could come from a different field, such as astrophysics. If successful, the model could form the basis of a generic, simplified architecture for preserving data, says Hildreth.
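The article does not describe the template itself, but conceptually it resembles a machine-readable manifest that bundles the data files, the software and environment used to analyse them, and the documentation a future researcher would need. The sketch below is purely illustrative; the field names, file names and helper functions are assumptions, not the actual DASPOS checklist.

```python
# Illustrative sketch of a preservation "checklist" as a machine-readable
# manifest. Field names and structure are hypothetical, not DASPOS's template.
import hashlib
import json
from pathlib import Path


def checksum(path: Path) -> str:
    """Fingerprint a file so later re-analyses can verify the bits are intact."""
    return hashlib.sha256(path.read_bytes()).hexdigest()


def build_manifest(data_files, software_version, os_image, docs):
    """Collect everything a future analyst would need to reproduce a result."""
    return {
        "data": [{"file": str(p), "sha256": checksum(p)} for p in data_files],
        "software": software_version,   # e.g. analysis framework plus version tag
        "environment": os_image,        # e.g. a container or VM image name
        "documentation": docs,          # analysis notes, selection cuts, etc.
    }


if __name__ == "__main__":
    manifest = build_manifest(
        data_files=[],                  # would list the event files in practice
        software_version="analysis-framework v1.2.3",
        os_image="analysis-os-image:2014",
        docs=["internal-note-123.pdf"],
    )
    print(json.dumps(manifest, indent=2))
```

The ‘curation challenge’ would then amount to handing a physicist only such a manifest, plus the items it points to, and asking whether the original result can be recovered.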
“When the LHC programme comes to an end, it will probably be the last data at this frontier for many years. We can’t afford to lose it.”
Part of the challenge is coping with ever-changing algorithms, operating systems and data-analysis hardware. At the German Electron Synchrotron (DESY) in Hamburg, computing coordinator David South is leading a project that is already attempting to protect data in this way. His team has devised a system that will automatically comb through data and software from experiments on DESY’s Hadron–Electron Ring Accelerator and test them for compatibility when hardware or operating systems change.
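The article does not detail how DESY’s system works internally. One plausible shape for such an automated compatibility check is sketched below: after an operating-system or hardware change, rerun a preserved reference analysis on a small validation sample and compare its output with a stored reference, flagging any mismatch for a person to investigate. The script and file names here are hypothetical.

```python
# Hypothetical sketch of an automated compatibility check: rerun a preserved
# analysis job after a platform change and compare against a stored reference.
import json
import subprocess
from pathlib import Path

REFERENCE = Path("reference_output.json")   # output recorded on the old platform


def run_reference_analysis() -> dict:
    """Run the preserved analysis job on a small validation dataset."""
    result = subprocess.run(
        ["./run_analysis.sh", "--input", "validation_sample.dat"],
        capture_output=True, text=True, check=True,
    )
    return json.loads(result.stdout)


def check_compatibility() -> bool:
    """Return True if the new platform reproduces the stored reference output."""
    current = run_reference_analysis()
    expected = json.loads(REFERENCE.read_text())
    if current == expected:
        return True
    # Mismatches are reported for human intervention rather than fixed automatically.
    print("Validation failed: output differs from reference")
    return False


if __name__ == "__main__":
    check_compatibility()
```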
This plan to migrate data repeatedly onto new platforms stands in contrast to an approach at the BaBar experiment at the SLAC National Accelerator Laboratory in Menlo Park, California. There, versions of data and the operating systems needed to analyse them have been frozen in storage centres, where they are supposed to be accessible until at least 2018. South says that DESY’s approach is more reliable. Although DESY’s system needs monitoring — any incompatibilities must be fixed through human intervention — the goal is to deal with problems as they arise, rather than tackle them years later, when they may have compounded.
DESY scientists know this from experience. In the 1990s, physicists wanted to take another look at data from a DESY collider that ran from 1979 to 1986, to further investigate the strong interaction that binds quarks together. They eventually measured the strength of the interaction with improved precision, but Diaconu says that it took two years to reconstruct the data, which had not been maintained.
The data preservationists are quick to point out the expense of such reconstruction efforts. Preservation also costs money, of course, but it is well worth it, says DPHEP project manager Jamie Shiers. He puts the bill for implementing good data preservation at the LHC at around 1% of operating costs, or just a few million dollars per year. “I think it’s justified,” he says.