Operating a supercomputer for ATLAS
The ATLAS Collaboration uses a global network of data centres – the Worldwide LHC Computing Grid – to perform data processing and analysis. These data centres are typically built from commodity hardware and carry out the full spectrum of ATLAS data processing, from reducing the raw data coming out of the detector to a manageable size, to producing plots for publication.
While the grid’s distributed approach has proven very successful, ATLAS researchers are also exploring the potential of high-performance computing (HPC) centres. HPC harnesses the power of purpose-built supercomputers constructed from specialised hardware, and is widely used in other scientific disciplines.
However, HPC poses significant challenges for processing ATLAS data. First, access to supercomputers is usually strictly limited, with network connections to HPC compute nodes severely restricted or non-existent. Second, the processor architecture may not be suitable for ATLAS software, and the installation of any required local software may be tightly controlled. Third, the system may only allow very large jobs using several thousand nodes, which is atypical of ATLAS workflows. Finally, the HPC centre may be geographically distant from the storage hosting the ATLAS data, which can cause network problems.
Despite these challenges, ATLAS collaborators have successfully exploited HPC over the past few years, including several machines at the top of the famous Top500 list of supercomputers. Connectivity barriers were overcome by isolating the main computation from the parts requiring network access, such as data transfer. Software issues have been addressed through the use of container technology, which allows ATLAS software to run on any operating system, and through the development of “edge services”, which enable computations to run in offline mode without needing to contact external services.
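The isolation pattern described above can be sketched in a few lines. This is a minimal illustration, not ATLAS's actual software: the function names and paths are hypothetical, and the point is simply that the network-dependent stages (staging data in and out) run on edge nodes with connectivity, while the main computation touches only local storage.

```python
# Hypothetical sketch of separating network-dependent stages from offline compute.
# In a real HPC workflow, stage_in/stage_out would run on edge/login nodes with
# network access, while compute() runs on isolated compute nodes.

def stage_in(remote_files):
    """Edge node, WITH network: fetch inputs from grid storage to shared scratch."""
    return [f"/scratch/{name}" for name in remote_files]  # hypothetical local paths

def compute(local_files):
    """Compute nodes, NO network: process local files only."""
    return [f"{path}.out" for path in local_files]

def stage_out(results):
    """Edge node again, WITH network: ship results back to grid storage."""
    return len(results)  # number of files shipped

inputs = stage_in(["event_data_001.root", "event_data_002.root"])
outputs = compute(inputs)
shipped = stage_out(outputs)
print(shipped)  # 2
```

Because `compute()` never needs outside connectivity, the same job can run on compute nodes that are completely cut off from the internet, which is exactly the restriction many HPC centres impose.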
The most recent HPC to process ATLAS data is Vega, the first new petascale EuroHPC JU machine, housed at the Institute of Information Science in Maribor, Slovenia (see Figure 1). Vega began operations in April 2021 and consists of 960 nodes, each containing 128 physical processor cores, for a total of 122,880 physical cores or 245,760 logical cores. To put that into perspective, the total number of cores provided to ATLAS by grid resources is around 300,000.
Figure 1: The Vega supercomputer in Slovenia is the newest HPC to process data from the ATLAS experiment.
Due to close ties with the ATLAS community of physicists in Slovenia, some of whom were heavily involved in the design and commissioning of Vega, the ATLAS Collaboration was one of the first users to be granted an official allocation of time. This benefited both the ATLAS Collaboration, which was able to take advantage of a significant additional resource, and Vega, which received a steady and well-understood stream of work to help it through its commissioning phase.
As shown in Figure 2, Vega was almost continuously full of ATLAS jobs from the time it was activated; periods when fewer jobs were running were due either to other users on Vega or to a lack of ATLAS work to submit. This enormous additional computing power – essentially doubling ATLAS’s available resources – was invaluable, allowing multiple large-scale data-processing campaigns to run in parallel. Thus, the ATLAS Collaboration is heading towards the restart of the LHC with a fully updated Run-2 dataset and the corresponding simulations, many of which have been significantly extended in statistics thanks to the additional resources provided by Vega.
It is a testament to the robustness of ATLAS’ distributed computing systems that they could be extended to a single site equivalent in size to the entire grid. While Vega will eventually be devoted to other scientific projects, a portion of its capacity will remain dedicated to ATLAS. Moreover, this successful experience shows that ATLAS members (and their data) are ready to jump on the next available HPC and exploit its full potential!