Physics analysis in the LHC era

This article was originally within a longer text dedicated to ROOT, hence the references to it.

I wrote this a while ago and some numbers, for instance the amount of data processed, might be outdated.

The LHC is a circular particle accelerator located at the European Organization for Nuclear Research (CERN) and one of the most sophisticated machines ever built by humanity. In order to recreate conditions similar to those just after the Big Bang, the LHC circulates two counter-rotating beams of charged particles around its 27 km circumference, increasing their momentum until they reach the appropriate energy, and then collides them at one of the four particle detectors on its ring: A Large Ion Collider Experiment (ALICE), A Toroidal LHC Apparatus (ATLAS), the Compact Muon Solenoid (CMS) and Large Hadron Collider beauty (LHCb).

Collectively, the LHC detectors, or experiments, produce about 40 petabytes of raw data each year from collisions. This massive amount of data must be stored, processed, and analyzed very efficiently. This is an extremely complex challenge, and it requires several steps of data selection, identification, filtering and reduction until the data is manageable by the data analysis framework of choice, typically ROOT.

This article introduces the end-to-end process of generating and analyzing data, from the production of charged particles out of hydrogen, to the final plot of a physics process produced by a physicist on their personal computer. In Section 1 we give an overview of the LHC chain and the different experiments. Section 2 describes the different steps involved and the challenges to overcome in the analysis chain, from data collection to the plot obtained as the result of the analysis. Finally, in Section 3 we describe the computing infrastructure put in place to support physics analysis and data distribution.

ROOT plays a fundamental role processing LHC data and is involved across all stages of the analysis chain. Understanding the full process of generating and analyzing LHC data provides us with a better comprehension of the source, the magnitude and the complexity of the data processed by ROOT, and the heterogeneity of the analyses programmed with it. Moreover, the massive computational infrastructure currently in place suggests how much we can benefit from maximizing ROOT’s efficiency in exploiting modern hardware resources (e.g. by leveraging parallelism at multiple levels), as the amount of data generated will increase dramatically in the coming years.

1. The LHC chain#

CERN is a scientific laboratory created in 1954 as an intergovernmental organization to advance the study of particle physics, and it is best known for the complex of accelerators it hosts.

A particle accelerator is an extremely complex machine built out of a large number of electromagnetic devices that guide a beam of particles along the path described by its shape. An accelerator increases the energy of charged particles by accelerating them with electromagnetic fields. Depending on its path, we classify an accelerator as linear or circular.

At CERN, linear accelerators (LINAC) are mainly used to inject the beam into circular accelerators at the right momentum. The circular accelerators constitute a chain in which the newer accelerators, built for experimenting at higher energies, require the beam to pass first through the smaller, older circular accelerators, so that it is injected into the higher-energy machines at the appropriate energy. The circular topology of these accelerators allows for as many laps as necessary to increase the momentum of the particles until the desired energy is reached.

A schematic drawing of the CERN accelerator complex, including the LHC chain, is given in Figure 1. Two beams circulate simultaneously in the accelerator chain, each consisting of a large number of bunches of O(10^11) protons. For the LHC, the chain starts by extracting protons from hydrogen gas and injecting them into LINAC2. LINAC2 accelerates the protons to 50 MeV and then injects them into the PS Booster, which brings them up to 1.4 GeV. The beams are then injected into the Proton Synchrotron (PS) and accelerated to 25 GeV, after which they are transferred to the last link of the chain, the Super Proton Synchrotron (SPS), the second biggest accelerator at CERN, where they reach the required LHC injection energy of 450 GeV. Finally, in the LHC, the beams can reach an energy of up to 7 TeV each, in preparation for collision in one of the four main detectors placed at strategic points around the LHC ring.

CERN accelerator complex

Figure 1: The CERN accelerator complex, including the LHC chain. The yellow dots in the LHC represent each one of the four big experiments. CERN

1.2 Main detectors in the LHC#

Figure 1 shows the four main detectors, also called experiments, installed in the LHC caverns: CMS, ATLAS, LHCb and ALICE. Detectors are highly complex machines, built out of a large number of state-of-the-art materials and components, with the objective of recording the results of the collision of two particles moving at high energies. Figure 2 depicts the ATLAS detector and its most important sections, compared with the size of an average human.

The ATLAS detector

Figure 2: The ATLAS detector. ATLAS Experiment

CMS and ATLAS are two general-purpose detectors used to investigate a wide range of physics. They have complementary characteristics, and they are often used to validate each other’s experimental results. LHCb is a detector specialized in b-physics, and its objective is to look for possible indications of new physics by studying b-decays. ALICE performs heavy-ion experiments to detect quark-gluon plasma, a state of matter thought to have formed just after the Big Bang.

CMS Slice

Figure 3: Slice of the CMS detector, showing the tracks of different particles. CC by 4.0. CERN, for the benefit of the CMS Collaboration

Particle detectors are situated at the collision points of the LHC beams. We collide two particles at extremely high energy to decompose them into smaller, more fundamental particles and observe their behaviour, hoping to improve our understanding of physics. The trajectories of the particles resulting from the collision are tracked by pixel and strip detectors, and the particles are detected and identified according to their energy and their lifetime inside the detector, that is, according to which layers of the detector they penetrate. Figure 3 illustrates this in a section of the CMS detector. Photons and electrons deposit their energy in the electromagnetic calorimeter and hadrons in the hadron calorimeter. Muons, instead, traverse every layer of the detector and are detected by the muon system.

2. The analysis chain#

Frequently, experimental physicists express their analysis in terms of events (collisions) recorded at a certain point in time. These events are processed in several steps by the analysis chain.

The analysis chain is defined as the set of actions necessary to observe and analyze the interesting physics processes resulting from the collisions in the detector. It covers the work of going from the signals produced by the collision in the detector to the final analysis visualization, and it is divided into three phases: data acquisition, data reconstruction and simulation, and physics analysis.

Data acquisition (Section 2.1) refers to the process of recording a sustainable fraction of the physics events happening in the detector. Data reconstruction (Section 2.2) and simulation (Section 2.3) involve reconstructing the particles’ trajectories after the collision and classifying these particles by type (tagging), starting either from the raw data acquired from the detector or from simulations based on a theoretical model. Finally, physics analysis (Section 2.4) focuses on filtering the physics processes of interest from the reconstructed data and comparing them against a theoretical model, in search of experimental validation or indications of new physics.

2.1 Data acquisition#

CMS and ATLAS perform approximately O(10^9) collisions per second, generating around 1 megabyte of data each, which results in about 1 petabyte of data per second to record.
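
Spelled out, the arithmetic behind that estimate is simply:

$$
10^{9}\ \text{collisions/s} \times 10^{6}\ \text{B/collision} = 10^{15}\ \text{B/s} = 1\ \text{PB/s}
$$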

Unfortunately, the current state of data acquisition technology is several orders of magnitude away from being capable of handling such a massive amount of data. For this reason, we are forced to record data at a lower rate.

However, collecting data at a slower pace might result in omitting a large number of interesting events from the physics analysis. To avoid this situation, detectors implement triggers: hybrid hardware-software systems capable of determining how interesting an event is and rejecting those that do not show signs of interesting physics processes.

A trigger system is designed with three levels of filters that events need to pass through before being recorded for offline analysis:

  • Level-1: hardware-based trigger. Selects events that produce large energy deposits in the calorimeters or hits in the muon chambers.

  • Level-2: software-based trigger. Selects events based on a preliminary analysis of the regions of interest identified in level-1.

  • Level-3: software trigger. Rudimentarily reconstructs the entire event.

Only the events passing all three filters are stored for further processing, with the next step being the transformation of this data into higher level objects used in physics analysis.
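
As a toy illustration of this staged filtering (the event fields, thresholds and predicates below are invented for the example and do not correspond to any real trigger menu), a trigger chain can be thought of as a sequence of increasingly refined predicates that an event must survive before being stored:

```cpp
#include <vector>

// Hypothetical, highly simplified event summary available to a software trigger.
struct EventSummary {
    double maxCaloDeposit;   // largest energy deposit in the calorimeters (GeV)
    int    muonChamberHits;  // number of hits in the muon chambers
    double recoQuality;      // quality score from a fast, rudimentary reconstruction
};

// Level-1-like decision: large calorimeter deposit or activity in the muon chambers.
bool passLevel1(const EventSummary& e) {
    return e.maxCaloDeposit > 20.0 || e.muonChamberHits > 0;
}

// Higher-level decisions refine the selection on the surviving events.
bool passLevel2(const EventSummary& e) { return e.maxCaloDeposit > 30.0; }
bool passLevel3(const EventSummary& e) { return e.recoQuality > 0.5; }

// Only events passing every level are kept for offline storage.
std::vector<EventSummary> runTrigger(const std::vector<EventSummary>& events) {
    std::vector<EventSummary> kept;
    for (const auto& e : events)
        if (passLevel1(e) && passLevel2(e) && passLevel3(e))
            kept.push_back(e);
    return kept;
}
```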

2.2 Reconstruction of physics objects from data#

The reconstruction step transforms raw detector information into higher level physics objects, starting either from the events triggered by the detector’s data acquisition system or from events obtained by simulating the behaviour of the detector. The latter is done with programs such as GEANT4 [1]. The output format of this simulation needs to be identical to that of the data generated by the detector, so that it is compatible with the same analysis chain.

The reconstruction process involves (at least) three operations: tracking, the reconstruction of particle trajectories into tracks, determining the parameters of the particles at their production point and their momentum; vertexing, the grouping of tracks into vertices, estimating the location of their common production point; and particle identification, the classification of particles based on their tracks (e.g. photons, muons, etc.).

The particle tracking process involves applying pattern recognition, mapping hits in the detector to specific tracks, and approximating each track with an equation (fitting), usually by means of a Kalman filter [2]. An extra track refinement step is often added, tuning the pattern recognition in order to avoid false positives.
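
To give a flavour of the fitting step, here is a deliberately minimal one-dimensional Kalman filter (the real track fits used by the experiments work on a multi-dimensional state of positions, directions and curvature, with full covariance matrices):

```cpp
#include <cmath>
#include <cstdio>
#include <vector>

// Toy one-dimensional Kalman filter: estimate a track coordinate from noisy hits.
// State: position estimate x and its variance P.
struct KalmanState {
    double x; // estimated coordinate
    double P; // variance of the estimate
};

KalmanState fitTrack(const std::vector<double>& hits,
                     double processNoise,      // Q: uncertainty added between layers
                     double measurementNoise)  // R: hit resolution (variance)
{
    // Seed the state from the first hit (assumes at least one hit).
    KalmanState s{hits.front(), measurementNoise};
    for (std::size_t i = 1; i < hits.size(); ++i) {
        // Predict: propagate to the next layer (trivial model: x stays constant).
        double xPred = s.x;
        double PPred = s.P + processNoise;
        // Update: blend the prediction with the new hit, weighted by the Kalman gain.
        double K = PPred / (PPred + measurementNoise);
        s.x = xPred + K * (hits[i] - xPred);
        s.P = (1.0 - K) * PPred;
    }
    return s;
}

int main() {
    std::vector<double> hits{1.02, 0.98, 1.05, 0.99, 1.01}; // noisy measurements
    KalmanState s = fitTrack(hits, 1e-4, 1e-2);
    std::printf("fitted coordinate: %.3f +- %.3f\n", s.x, std::sqrt(s.P));
}
```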

Vertexing involves clustering the tracks of an event that originated from the same point: vertex candidates are found through cluster analysis, and a fitting step then yields an estimated vertex position as well as the set of tracks associated with that vertex.
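
And a correspondingly simplified sketch of the clustering idea, grouping tracks by their longitudinal position at the beamline (real vertexing algorithms use far more sophisticated clustering and adaptive fitting):

```cpp
#include <algorithm>
#include <vector>

// Toy vertex clustering: group track z-positions at the beamline into vertex
// candidates, assigning a track to the current cluster if it lies within
// `window` (in cm) of the cluster's running mean.
std::vector<std::vector<double>> clusterVertices(std::vector<double> trackZ,
                                                 double window) {
    std::sort(trackZ.begin(), trackZ.end());
    std::vector<std::vector<double>> vertices;
    for (double z : trackZ) {
        if (!vertices.empty()) {
            const auto& current = vertices.back();
            double mean = 0.0;
            for (double v : current) mean += v;
            mean /= current.size();
            if (z - mean < window) {           // close enough: same vertex candidate
                vertices.back().push_back(z);
                continue;
            }
        }
        vertices.push_back({z});               // otherwise start a new candidate
    }
    // A trivial "fit" of each candidate would then be the mean z of its tracks.
    return vertices;
}
```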

The last step, particle identification, aims to associate each reconstructed track with a type of particle. This step is fundamental, for example, to reduce the volume of data stored for offline analysis to the interesting events, that is, to remove background signals that are not needed for the data analysis process.

In addition, this step of the analysis chain includes other operations such as calorimeter reconstruction, which measures the energy of electromagnetic and hadronic particles, or jet reconstruction, which combines particles into jets using the tracking and calorimeter information.

2.3 Monte Carlo generation#

The full simulation of the detector and the subsequent reconstruction is frequently the most time-consuming step in the analysis chain.

Often, physicists choose to work with very simplified simulations of the observable events in the detector. These simulations are produced by event generators: simulation software that models the physics processes described by a theoretical model. They do so with the aid of Monte Carlo techniques and algorithms, relying on random sampling to produce events with the same average behaviour and the same fluctuations as real data [3].
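
A minimal sketch of the random-sampling idea, written as a ROOT macro (the mixture of a resonance-like signal over an exponential background, the parameters and the binning are all invented for illustration; this is in no way representative of what a full event generator computes):

```cpp
// Toy Monte Carlo "event generation": sample an invariant-mass-like observable
// from a Breit-Wigner signal plus an exponential background and store the
// resulting histogram. Run as a ROOT macro:  root -l -q toy_generate.C
#include "TFile.h"
#include "TH1D.h"
#include "TRandom3.h"

void toy_generate(int nEvents = 100000) {
    TRandom3 rng(0); // seed 0: ROOT picks a unique seed automatically
    TH1D h("h_mass", "Toy mass spectrum;m [GeV];Events", 100, 60.0, 160.0);

    for (int i = 0; i < nEvents; ++i) {
        double mass;
        if (rng.Uniform() < 0.1)                 // 10% signal fraction (arbitrary)
            mass = rng.BreitWigner(125.0, 2.0);  // resonance at 125 GeV, width 2 GeV
        else
            mass = 60.0 + rng.Exp(40.0);         // falling exponential background
        h.Fill(mass);
    }

    TFile out("toy_mass.root", "RECREATE");
    h.Write(); // persist the histogram for later analysis steps
}
```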

Monte Carlo event generators aim to simulate the experimental characteristics of the physics processes of interest, and they are a fundamental tool for HEP, with a wide variety of applications such as optimizing the design of new detectors for specific physics events, exploring analysis strategies to be used in real data or interpreting observed results in terms of the underlying theory.

A large number of Monte Carlo event generators are available for the simulation and analysis of HEP processes, e.g. Pythia [4], Herwig [5] or Alpgen [6].

2.4 Physics analysis#

The analysis process starts after the reconstruction or simulation of the collisions. Once we obtain the reconstructed data, stored in ROOT format, we apply to it successive campaigns of data reduction and refinement. These campaigns consist of filtering and selecting the events relevant for the analysis at hand, resulting in a reduced dataset. The data reduction process is driven by the physics processes of interest, e.g. selecting all the events that involve a Higgs gamma-gamma decay, and aims to produce an amount of data manageable by the local computing infrastructure of the experiment or the physicist.

This reduced data is then processed by the analysis software to produce data frames (tables) and histogram visualizations, to which statistical inference is applied. This is done with sophisticated algorithms, such as those used for the classification of signal versus background data, for regression analysis, or for the fitting of probability distributions to estimate physical quantities and observables, for instance the Higgs mass.
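
As a sketch of the fitting step, the following ROOT macro fits a Gaussian signal plus a linear background to the toy histogram produced by the generation sketch above (the model and the starting values are arbitrary choices for illustration):

```cpp
// Fit a signal-plus-background model to the toy histogram and extract a
// "mass" estimate. Run as:  root -l -q toy_fit.C
#include <cstdio>
#include "TF1.h"
#include "TFile.h"
#include "TH1D.h"

void toy_fit() {
    // Assumes the file and histogram written by the toy generation macro.
    auto file = TFile::Open("toy_mass.root");
    TH1D* h = nullptr;
    file->GetObject("h_mass", h);

    // gaus(0): Gaussian with parameters [0]=norm, [1]=mean, [2]=sigma;
    // pol1(3): linear background with parameters [3] and [4].
    TF1 model("model", "gaus(0) + pol1(3)", 60.0, 160.0);
    model.SetParameters(1000.0, 125.0, 2.0, 1000.0, -5.0); // rough starting values

    h->Fit(&model, "R"); // "R": restrict the fit to the function range
    std::printf("fitted mass = %.2f +- %.2f GeV\n",
                model.GetParameter(1), model.GetParError(1));
}
```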

Figure 4 displays two visualizations generated from the analysis of measured Higgs boson properties using the diphoton [7] and four-lepton [8] decay channels, based on data collected by the CMS detector. These figures, produced with ROOT, showcase two of the most common procedures applied to reconstructed data during analysis. In Figure 4a, we aim to fit the data to a theoretical model: through fitting, we estimate physical quantities (the parameters of the fit), obtaining the values that best describe the distribution of the reconstructed data. Figure 4b plots the reconstructed data (points with error bars) against several stacked histograms obtained from simulated data based on theoretical models, representing the expected signal and background distributions.

Diphoton invariant mass spectrum

(a): Data and signal-plus-background model fits for the vector-boson fusion Higgs production and top-Higgs production categories in the diphoton decay channel. The one (green) and two (yellow) standard deviation bands include the uncertainties in the background component of the fit. The lower panel shows the residuals after the background subtraction. The CMS collaboration

Four-lepton invariant mass spectrum

(b): Distribution of the reconstructed four-lepton invariant mass m_4ℓ in the low-mass range. Points with error bars represent the data and stacked histograms represent expected signal and background distributions. The CMS collaboration

Figure 4: Mass spectra obtained in the CMS Higgs search using different decay channels, with a significance >5σ. Data collected in 2017 at a center-of-mass energy of 13 TeV and an integrated luminosity of 35.9 fb⁻¹.

In addition to traditional statistical inference methods, recently there has been a rise in the popularity of multivariate analysis methods based on machine learning. In the scope of HEP, we benefit from the increasingly intensive use of techniques such as boosted decision trees or deep neural networks. These methods are mainly used for signal-background classification or regression analysis, e.g. for estimating the particle energy from the calorimeter data. They have proven very effective, dramatically reducing the time necessary for these tasks and improving accuracy.
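
ROOT ships the TMVA package for this kind of multivariate classification; a minimal sketch of training a boosted decision tree with it could look as follows (the input file, tree names, variables and options are placeholders, not the configuration of any real analysis):

```cpp
#include "TFile.h"
#include "TTree.h"
#include "TMVA/DataLoader.h"
#include "TMVA/Factory.h"
#include "TMVA/Types.h"

void train_bdt() {
    // Hypothetical input: a file with "sig" and "bkg" trees holding the variables below.
    auto input = TFile::Open("training_sample.root");
    TTree* sigTree = nullptr;
    TTree* bkgTree = nullptr;
    input->GetObject("sig", sigTree);
    input->GetObject("bkg", bkgTree);

    auto output = TFile::Open("tmva_output.root", "RECREATE");
    TMVA::Factory factory("ToyClassification", output,
                          "!V:AnalysisType=Classification");

    TMVA::DataLoader loader("dataset");
    loader.AddVariable("pt",  'F');   // candidate transverse momentum
    loader.AddVariable("eta", 'F');   // pseudorapidity
    loader.AddVariable("iso", 'F');   // isolation variable
    loader.AddSignalTree(sigTree, 1.0);
    loader.AddBackgroundTree(bkgTree, 1.0);
    loader.PrepareTrainingAndTestTree("", "SplitMode=Random:NormMode=NumEvents");

    // Book a boosted decision tree and run the standard train/test/evaluate cycle.
    factory.BookMethod(&loader, TMVA::Types::kBDT, "BDT",
                       "NTrees=200:MaxDepth=3:BoostType=AdaBoost");
    factory.TrainAllMethods();
    factory.TestAllMethods();
    factory.EvaluateAllMethods();

    output->Close();
}
```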

HEP analyses are typically performed with ROOT, the official LHC data analysis toolkit. ROOT provides most of the numerical and statistical methods necessary for the analysis of HEP data. In addition, it offers several attractive features for HEP analysis, such as a C++ interactive interpreter for prototyping and interactive exploration of the analysis, implicit parallelization, new analysis-centric programming paradigms such as TDataFrame (now RDataFrame), and, most importantly, a highly efficient I/O subsystem. ROOT I/O provides an efficient columnar data format, data compression, serialization of C++ objects, and data structures optimized for dealing with the extreme needs of today’s HEP experiments.
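
To give an idea of what such an analysis looks like in code, here is a minimal RDataFrame sketch (RDataFrame is the current name of the TDataFrame interface mentioned above; the file, tree and branch names are hypothetical):

```cpp
#include <ROOT/RDataFrame.hxx>
#include "TCanvas.h"
#include "TROOT.h"

void analyse() {
    ROOT::EnableImplicitMT(); // use ROOT's implicit multi-threading

    // Hypothetical reduced dataset: a TTree "Events" with muon-pair information.
    ROOT::RDataFrame df("Events", "reduced_dataset.root");

    auto mass = df.Filter("nMuon == 2", "exactly two muons")
                  .Filter("Muon_charge[0] != Muon_charge[1]", "opposite charge")
                  .Histo1D({"h_dimuon", "Dimuon spectrum;m_{#mu#mu} [GeV];Events",
                            300, 0.25, 300.0},
                           "Dimuon_mass");

    TCanvas c("c", "dimuon");
    mass->Draw();            // the event loop runs lazily, at first access
    c.SaveAs("dimuon_mass.png");
}
```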

3. Computing infrastructure at CERN#

The LHC experiments generate new data at increasing rates, currently around 40 petabytes annually. This makes it essential to adopt a data-centric computing model, where the data, our most important asset, is at the core of the choices made in its deployment, distribution, storage and processing, in the design of computer centers and cluster infrastructures, and so on. For instance, the storage, networking and processing resources that would be necessary to perform local analysis on such large remote datasets would exceed anything offered by the commercial solutions currently available. Thus, we need to consider ad hoc solutions, such as performing the analysis remotely, as close as possible to the data, to reduce the load on the network and facilitate low-latency, high-rate access to the data, or creating a hierarchical federation of interconnected computing centers to improve data distribution, locality and processing time.

For instance, processing this enormous amount of data requires at least 500,000 typical PC processor cores. Even a computing center as big as CERN’s cannot provide enough resources to store and process such a large amount of data, in terms of both computing and electrical power, and is estimated to be able to cover only around 20-30% of the required storage and CPU capacity.

This task of unprecedented magnitude and complexity sparked the creation of a global computing infrastructure, the Worldwide LHC Computing Grid (WLCG) [9]. The WLCG is an international collaboration that involves several national and international grid infrastructures and many institutions, communities and projects from a wide variety of fields, such as HEP, astrophysics, earth sciences or biological and medical research.

The WLCG is a multi-tiered, geographically distributed federation of computing centers of various sizes, orchestrated from central points. Depending on which tier a computing center belongs to, it provides different services. In the scope of HEP and, specifically, the LHC data analysis chain:

  • Tier-0: A single computing center tier, built at CERN, providing 20% of the computing power of the WLCG. Stores all the raw data obtained from the detectors’ data acquisition system, runs the first pass of reconstruction and orchestrates the distribution of the raw data to the following tiers.

  • Tier-1: Thirteen large computing centers used for simulation and reconstruction of the data received from the Tier-0. They keep a secondary copy of the raw data and distribute the reconstructed data to the Tier-2 centers.

  • Tier-2: Around 160 smaller computing centers, typically universities or research centers, with adequate computing power for users to perform analysis processes, calibration measurements, and the generation of Monte Carlo simulations. They provide redundancy for the reconstructed data and also store the results of users’ data analyses.

  • Tier-3: Local computing clusters managed by local analysis groups, or even individual computers. Although we often refer to them as Tier-3, the WLCG provides no specification for them and has no formal engagement with them.

Although the roles of each tier still hold, network links have improved and become cheaper in recent years, and it is now possible, for instance, to transfer data between centers of the same tier or to build a set of fast network hubs connecting many Tier-2 centers to Tier-1 centers and the Tier-0.

The WLCG is a fundamental infrastructure for the HEP community and has been a core element in the processing of the analysis chain over the last decade. Proof of this is that the computing models of the LHC experiments are still designed primarily around the exploitation of grid resources.


  1. Agostinelli, S., Allison, J., Amako, K., Apostolakis, J., Araujo, H., Arce, P., … & Behner, F. (2003). GEANT4–a simulation toolkit. Nuclear Instruments and Methods in Physics Research. Section A, Accelerators, Spectrometers, Detectors and Associated Equipment, 506(3), 250-303. link ↩︎

  2. Kalman filter https://en.wikipedia.org/wiki/Kalman_filter ↩︎

  3. Siegert, F. (2010). Monte-Carlo event generation for the LHC (No. CERN-THESIS-2010-302). link ↩︎

  4. Pythia: http://home.thep.lu.se/~torbjorn/pythia81html/Welcome.html ↩︎

  5. Herwig: https://herwig.hepforge.org ↩︎

  6. AlpGen: http://mlm.web.cern.ch/mlm/alpgen/ ↩︎

  7. Sirunyan, A. M., Tumasyan, A., Adam, W., Ambrogi, F., Asilar, E., Bergauer, T., … & Del Valle, A. E. (2018). Measurements of Higgs boson properties in the diphoton decay channel in proton-proton collisions at √s = 13 TeV. Journal of High Energy Physics, 2018(11), 185. link ↩︎

  8. Sirunyan, A. M., Tumasyan, A., Adam, W., et al. (CMS Collaboration) (2017). Measurements of properties of the Higgs boson decaying into the four-lepton final state in pp collisions at √s = 13 TeV. Journal of High Energy Physics, 2017(11), 047. ↩︎

  9. Shiers, J. (2007). The worldwide LHC computing grid (worldwide LCG). Computer physics communications, 177(1-2), 219-223. link ↩︎