-
IceCube experience using XRootD-based Origins with GPU workflows in PNRP
Authors:
David Schultz,
Igor Sfiligoi,
Benedikt Riedel,
Fabio Andrijauskas,
Derek Weitzel,
Frank Würthwein
Abstract:
The IceCube Neutrino Observatory is a cubic kilometer neutrino telescope located at the geographic South Pole. Understanding detector systematic effects is a continuous process. This requires the Monte Carlo simulation to be updated periodically to quantify potential changes and improvements in science results with more detailed modeling of the systematic effects. IceCube's largest systematic effe…
▽ More
The IceCube Neutrino Observatory is a cubic kilometer neutrino telescope located at the geographic South Pole. Understanding detector systematic effects is a continuous process. This requires the Monte Carlo simulation to be updated periodically to quantify potential changes and improvements in science results with more detailed modeling of the systematic effects. IceCube's largest systematic effect comes from the optical properties of the ice the detector is embedded in. Over the last few years there have been considerable improvements in the understanding of the ice, which require a significant processing campaign to update the simulation. IceCube normally stores the results in a central storage system at the University of Wisconsin-Madison, but it ran out of disk space in 2022. The Prototype National Research Platform (PNRP) project thus offered to provide both GPU compute and storage capacity to IceCube in support of this activity. The storage access was provided via XRootD-based OSDF Origins, a first for IceCube computing. We report on the overall experience using PNRP resources, with both successes and pain points.
△ Less
Submitted 15 August, 2023;
originally announced August 2023.
-
Evaluation of ARM CPUs for IceCube available through Google Kubernetes Engine
Authors:
Igor Sfiligoi,
David Schultz,
Benedikt Riedel,
Frank Würthwein
Abstract:
The IceCube experiment has substantial simulation needs and is in continuous search for the most cost-effective ways to satisfy them. The most CPU-intensive part relies on CORSIKA, a cosmic ray air shower simulation. Historically, IceCube relied exclusively on x86-based CPUs, like Intel Xeon and AMD EPYC, but recently server-class ARM-based CPUs are also becoming available, both on-prem and in the…
▽ More
The IceCube experiment has substantial simulation needs and is in continuous search for the most cost-effective ways to satisfy them. The most CPU-intensive part relies on CORSIKA, a cosmic ray air shower simulation. Historically, IceCube relied exclusively on x86-based CPUs, like Intel Xeon and AMD EPYC, but recently server-class ARM-based CPUs are also becoming available, both on-prem and in the cloud. In this paper we present our experience in running a sample CORSIKA simulation on both ARM and x86 CPUs available through Google Kubernetes Engine (GKE). We used the production binaries for the x86 instances, but had to build the binaries for ARM instances from source code, which turned out to be mostly painless. Our benchmarks show that ARM-based CPUs in GKE were not only the most cost-effective but were also the fastest in absolute terms in all the tested configurations. While the advantage is not drastic, about 20% in cost-effectiveness and less than 10% in absolute terms, it is still large enough to warrant an investment in ARM support for IceCube.
△ Less
Submitted 7 August, 2023;
originally announced August 2023.
-
Observation of high-energy neutrinos from the Galactic plane
Authors:
R. Abbasi,
M. Ackermann,
J. Adams,
J. A. Aguilar,
M. Ahlers,
M. Ahrens,
J. M. Alameddine,
A. A. Alves Jr.,
N. M. Amin,
K. Andeen,
T. Anderson,
G. Anton,
C. Argüelles,
Y. Ashida,
S. Athanasiadou,
S. Axani,
X. Bai,
A. Balagopal V.,
S. W. Barwick,
V. Basu,
S. Baur,
R. Bay,
J. J. Beatty,
K. -H. Becker,
J. Becker Tjus
, et al. (364 additional authors not shown)
Abstract:
The origin of high-energy cosmic rays, atomic nuclei that continuously impact Earth's atmosphere, has been a mystery for over a century. Due to deflection in interstellar magnetic fields, cosmic rays from the Milky Way arrive at Earth from random directions. However, near their sources and during propagation, cosmic rays interact with matter and produce high-energy neutrinos. We search for neutrin…
▽ More
The origin of high-energy cosmic rays, atomic nuclei that continuously impact Earth's atmosphere, has been a mystery for over a century. Due to deflection in interstellar magnetic fields, cosmic rays from the Milky Way arrive at Earth from random directions. However, near their sources and during propagation, cosmic rays interact with matter and produce high-energy neutrinos. We search for neutrino emission using machine learning techniques applied to ten years of data from the IceCube Neutrino Observatory. We identify neutrino emission from the Galactic plane at the 4.5$σ$ level of significance, by comparing diffuse emission models to a background-only hypothesis. The signal is consistent with modeled diffuse emission from the Galactic plane, but could also arise from a population of unresolved point sources.
△ Less
Submitted 10 July, 2023;
originally announced July 2023.
-
Graph Neural Networks for Low-Energy Event Classification & Reconstruction in IceCube
Authors:
R. Abbasi,
M. Ackermann,
J. Adams,
N. Aggarwal,
J. A. Aguilar,
M. Ahlers,
M. Ahrens,
J. M. Alameddine,
A. A. Alves Jr.,
N. M. Amin,
K. Andeen,
T. Anderson,
G. Anton,
C. Argüelles,
Y. Ashida,
S. Athanasiadou,
S. Axani,
X. Bai,
A. Balagopal V.,
M. Baricevic,
S. W. Barwick,
V. Basu,
R. Bay,
J. J. Beatty,
K. -H. Becker
, et al. (359 additional authors not shown)
Abstract:
IceCube, a cubic-kilometer array of optical sensors built to detect atmospheric and astrophysical neutrinos between 1 GeV and 1 PeV, is deployed 1.45 km to 2.45 km below the surface of the ice sheet at the South Pole. The classification and reconstruction of events from the in-ice detectors play a central role in the analysis of data from IceCube. Reconstructing and classifying events is a challen…
▽ More
IceCube, a cubic-kilometer array of optical sensors built to detect atmospheric and astrophysical neutrinos between 1 GeV and 1 PeV, is deployed 1.45 km to 2.45 km below the surface of the ice sheet at the South Pole. The classification and reconstruction of events from the in-ice detectors play a central role in the analysis of data from IceCube. Reconstructing and classifying events is a challenge due to the irregular detector geometry, inhomogeneous scattering and absorption of light in the ice and, below 100 GeV, the relatively low number of signal photons produced per event. To address this challenge, it is possible to represent IceCube events as point cloud graphs and use a Graph Neural Network (GNN) as the classification and reconstruction method. The GNN is capable of distinguishing neutrino events from cosmic-ray backgrounds, classifying different neutrino event types, and reconstructing the deposited energy, direction and interaction vertex. Based on simulation, we provide a comparison in the 1-100 GeV energy range to the current state-of-the-art maximum likelihood techniques used in current IceCube analyses, including the effects of known systematic uncertainties. For neutrino event classification, the GNN increases the signal efficiency by 18% at a fixed false positive rate (FPR), compared to current IceCube methods. Alternatively, the GNN offers a reduction of the FPR by over a factor 8 (to below half a percent) at a fixed signal efficiency. For the reconstruction of energy, direction, and interaction vertex, the resolution improves by an average of 13%-20% compared to current maximum likelihood techniques in the energy range of 1-30 GeV. The GNN, when run on a GPU, is capable of processing IceCube events at a rate nearly double of the median IceCube trigger rate of 2.7 kHz, which opens the possibility of using low energy neutrinos in online searches for transient events.
△ Less
Submitted 11 October, 2022; v1 submitted 7 September, 2022;
originally announced September 2022.
-
The anachronism of whole-GPU accounting
Authors:
Igor Sfiligoi,
David Schultz,
Frank Würthwein,
Benedikt Riedel,
Dmitry Y. Mishin
Abstract:
NVIDIA has been making steady progress in increasing the compute performance of its GPUs, resulting in order of magnitude compute throughput improvements over the years. With several models of GPUs coexisting in many deployments, the traditional accounting method of treating all GPUs as being equal is not reflecting compute output anymore. Moreover, for applications that require significant CPU-ba…
▽ More
NVIDIA has been making steady progress in increasing the compute performance of its GPUs, resulting in order of magnitude compute throughput improvements over the years. With several models of GPUs coexisting in many deployments, the traditional accounting method of treating all GPUs as being equal is not reflecting compute output anymore. Moreover, for applications that require significant CPU-based compute to complement the GPU-based compute, it is becoming harder and harder to make full use of the newer GPUs, requiring sharing of those GPUs between multiple applications in order to maximize the achievable science output. This further reduces the value of whole-GPU accounting, especially when the sharing is done at the infrastructure level. We thus argue that GPU accounting for throughput-oriented infrastructures should be expressed in GPU core hours, much like it is normally done for the CPUs. While GPU core compute throughput does change between GPU generations, the variability is similar to what we expect to see among CPU cores. To validate our position, we present an extensive set of run time measurements of two IceCube photon propagation workflows on 14 GPU models, using both on-prem and Cloud resources. The measurements also outline the influence of GPU sharing at both HTCondor and Kubernetes infrastructure level.
△ Less
Submitted 18 May, 2022;
originally announced May 2022.
-
Expanding IceCube GPU computing into the Clouds
Authors:
Igor Sfiligoi,
Shava Smallen,
Frank Würthwein,
Nicole Wolter,
David Schultz,
Benedikt Riedel
Abstract:
The IceCube collaboration relies on GPU compute for many of its needs, including ray tracing simulation and machine learning activities. GPUs are however still a relatively scarce commodity in the scientific resource provider community, so we expanded the available resource pool with GPUs provisioned from the commercial Cloud providers. The provisioned resources were fully integrated into the norm…
▽ More
The IceCube collaboration relies on GPU compute for many of its needs, including ray tracing simulation and machine learning activities. GPUs are however still a relatively scarce commodity in the scientific resource provider community, so we expanded the available resource pool with GPUs provisioned from the commercial Cloud providers. The provisioned resources were fully integrated into the normal IceCube workload management system through the Open Science Grid (OSG) infrastructure and used CloudBank for budget management. The result was an approximate doubling of GPU wall hours used by IceCube over a period of 2 weeks, adding over 3.1 fp32 EFLOP hours for a price tag of about $58k. This paper describes the setup used and the operational experience.
△ Less
Submitted 8 July, 2021;
originally announced July 2021.
-
Managing Cloud networking costs for data-intensive applications by provisioning dedicated network links
Authors:
Igor Sfiligoi,
Michael Hare,
David Schultz,
Frank Würthwein,
Benedikt Riedel,
Tom Hutton,
Steve Barnet,
Vladimir Brik
Abstract:
Many scientific high-throughput applications can benefit from the elastic nature of Cloud resources, especially when there is a need to reduce time to completion. Cost considerations are usually a major issue in such endeavors, with networking often a major component; for data-intensive applications, egress networking costs can exceed the compute costs. Dedicated network links provide a way to low…
▽ More
Many scientific high-throughput applications can benefit from the elastic nature of Cloud resources, especially when there is a need to reduce time to completion. Cost considerations are usually a major issue in such endeavors, with networking often a major component; for data-intensive applications, egress networking costs can exceed the compute costs. Dedicated network links provide a way to lower the networking costs, but they do add complexity. In this paper we provide a description of a 100 fp32 PFLOPS Cloud burst in support of IceCube production compute, that used Internet2 Cloud Connect service to provision several logically-dedicated network links from the three major Cloud providers, namely Amazon Web Services, Microsoft Azure and Google Cloud Platform, that in aggregate enabled approximately 100 Gbps egress capability to on-prem storage. It provides technical details about the provisioning process, the benefits and limitations of such a setup and an analysis of the costs incurred.
△ Less
Submitted 14 April, 2021;
originally announced April 2021.
-
A Convolutional Neural Network based Cascade Reconstruction for the IceCube Neutrino Observatory
Authors:
R. Abbasi,
M. Ackermann,
J. Adams,
J. A. Aguilar,
M. Ahlers,
M. Ahrens,
C. Alispach,
A. A. Alves Jr.,
N. M. Amin,
R. An,
K. Andeen,
T. Anderson,
I. Ansseau,
G. Anton,
C. Argüelles,
S. Axani,
X. Bai,
A. Balagopal V.,
A. Barbano,
S. W. Barwick,
B. Bastian,
V. Basu,
V. Baum,
S. Baur,
R. Bay
, et al. (343 additional authors not shown)
Abstract:
Continued improvements on existing reconstruction methods are vital to the success of high-energy physics experiments, such as the IceCube Neutrino Observatory. In IceCube, further challenges arise as the detector is situated at the geographic South Pole where computational resources are limited. However, to perform real-time analyses and to issue alerts to telescopes around the world, powerful an…
▽ More
Continued improvements on existing reconstruction methods are vital to the success of high-energy physics experiments, such as the IceCube Neutrino Observatory. In IceCube, further challenges arise as the detector is situated at the geographic South Pole where computational resources are limited. However, to perform real-time analyses and to issue alerts to telescopes around the world, powerful and fast reconstruction methods are desired. Deep neural networks can be extremely powerful, and their usage is computationally inexpensive once the networks are trained. These characteristics make a deep learning-based approach an excellent candidate for the application in IceCube. A reconstruction method based on convolutional architectures and hexagonally shaped kernels is presented. The presented method is robust towards systematic uncertainties in the simulation and has been tested on experimental data. In comparison to standard reconstruction methods in IceCube, it can improve upon the reconstruction accuracy, while reducing the time necessary to run the reconstruction by two to three orders of magnitude.
△ Less
Submitted 26 July, 2021; v1 submitted 27 January, 2021;
originally announced January 2021.
-
Demonstrating a Pre-Exascale, Cost-Effective Multi-Cloud Environment for Scientific Computing
Authors:
I. Sfiligoi,
D. Schultz,
B. Riedel,
F. Wuerthwein,
S. Barnet,
V. Brik
Abstract:
Scientific computing needs are growing dramatically with time and are expanding in science domains that were previously not compute intensive. When compute workflows spike well in excess of the capacity of their local compute resource, capacity should be temporarily provisioned from somewhere else to both meet deadlines and to increase scientific output. Public Clouds have become an attractive opt…
▽ More
Scientific computing needs are growing dramatically with time and are expanding in science domains that were previously not compute intensive. When compute workflows spike well in excess of the capacity of their local compute resource, capacity should be temporarily provisioned from somewhere else to both meet deadlines and to increase scientific output. Public Clouds have become an attractive option due to their ability to be provisioned with minimal advance notice. The available capacity of cost-effective instances is not well understood. This paper presents expanding the IceCube's production HTCondor pool using cost-effective GPU instances in preemptible mode gathered from the three major Cloud providers, namely Amazon Web Services, Microsoft Azure and the Google Cloud Platform. Using this setup, we sustained for a whole workday about 15k GPUs, corresponding to around 170 PFLOP32s, integrating over one EFLOP32 hour worth of science output for a price tag of about $60k. In this paper, we provide the reasoning behind Cloud instance selection, a description of the setup and an analysis of the provisioned resources, as well as a short description of the actual science output of the exercise.
△ Less
Submitted 18 April, 2020;
originally announced April 2020.
-
Running a Pre-Exascale, Geographically Distributed, Multi-Cloud Scientific Simulation
Authors:
Igor Sfiligoi,
Frank Wuerthwein,
Benedikt Riedel,
David Schultz
Abstract:
As we approach the Exascale era, it is important to verify that the existing frameworks and tools will still work at that scale. Moreover, public Cloud computing has been emerging as a viable solution for both prototy** and urgent computing. Using the elasticity of the Cloud, we have thus put in place a pre-exascale HTCondor setup for running a scientific simulation in the Cloud, with the chosen…
▽ More
As we approach the Exascale era, it is important to verify that the existing frameworks and tools will still work at that scale. Moreover, public Cloud computing has been emerging as a viable solution for both prototy** and urgent computing. Using the elasticity of the Cloud, we have thus put in place a pre-exascale HTCondor setup for running a scientific simulation in the Cloud, with the chosen application being IceCube's photon propagation simulation. I.e. this was not a purely demonstration run, but it was also used to produce valuable and much needed scientific results for the IceCube collaboration. In order to reach the desired scale, we aggregated GPU resources across 8 GPU models from many geographic regions across Amazon Web Services, Microsoft Azure, and the Google Cloud Platform. Using this setup, we reached a peak of over 51k GPUs corresponding to almost 380 PFLOP32s, for a total integrated compute of about 100k GPU hours. In this paper we provide the description of the setup, the problems that were discovered and overcome, as well as a short description of the actual science output of the exercise.
△ Less
Submitted 16 February, 2020;
originally announced February 2020.
-
The XENON1T Data Distribution and Processing Scheme
Authors:
Daniel Ahlin,
Boris Bauermeister,
Jan Conrad,
Robert Gardner,
Luca Grandi,
Benedikt Riedel,
Evan Shockley,
Judith Stephen,
Ragnar Sundblad,
Suchandra Thapa,
Christopher Tunnell
Abstract:
The XENON experiment is looking for non-baryonic particle dark matter in the universe. The setup is a dual phase time projection chamber (TPC) filled with 3200 kg of ultra-pure liquid xenon. The setup is operated at the Laboratori Nazionali del Gran Sasso (LNGS) in Italy. We present a full overview of the computing scheme for data distribution and job management in XENON1T. The software package Ru…
▽ More
The XENON experiment is looking for non-baryonic particle dark matter in the universe. The setup is a dual phase time projection chamber (TPC) filled with 3200 kg of ultra-pure liquid xenon. The setup is operated at the Laboratori Nazionali del Gran Sasso (LNGS) in Italy. We present a full overview of the computing scheme for data distribution and job management in XENON1T. The software package Rucio, which is developed by the ATLAS collaboration, facilitates data handling on Open Science Grid (OSG) and European Grid Infrastructure (EGI) storage systems. A tape copy at the Center for High Performance Computing (PDC) is managed by the Tivoli Storage Manager (TSM). Data reduction and Monte Carlo production are handled by CI Connect which is integrated into the OSG network. The job submission system connects resources at the EGI, OSG, SDSC's Comet, and the campus HPC resources for distributed computing. The previous success in the XENON1T computing scheme is also the starting point for its successor experiment XENONnT, which starts to take data in autumn 2019.
△ Less
Submitted 27 March, 2019;
originally announced March 2019.
-
A simple but tough-to-beat baseline for the Fake News Challenge stance detection task
Authors:
Benjamin Riedel,
Isabelle Augenstein,
Georgios P. Spithourakis,
Sebastian Riedel
Abstract:
Identifying public misinformation is a complicated and challenging task. An important part of checking the veracity of a specific claim is to evaluate the stance different news sources take towards the assertion. Automatic stance evaluation, i.e. stance detection, would arguably facilitate the process of fact checking. In this paper, we present our stance detection system which claimed third place…
▽ More
Identifying public misinformation is a complicated and challenging task. An important part of checking the veracity of a specific claim is to evaluate the stance different news sources take towards the assertion. Automatic stance evaluation, i.e. stance detection, would arguably facilitate the process of fact checking. In this paper, we present our stance detection system which claimed third place in Stage 1 of the Fake News Challenge. Despite our straightforward approach, our system performs at a competitive level with the complex ensembles of the top two winning teams. We therefore propose our system as the 'simple but tough-to-beat baseline' for the Fake News Challenge stance detection task.
△ Less
Submitted 21 May, 2018; v1 submitted 11 July, 2017;
originally announced July 2017.
-
Classification of Major Depressive Disorder via Multi-Site Weighted LASSO Model
Authors:
Dajiang Zhu,
Brandalyn C. Riedel,
Neda Jahanshad,
Nynke A. Groenewold,
Dan J. Stein,
Ian H. Gotlib,
Matthew D. Sacchet,
Danai Dima,
James H. Cole,
Cynthia H. Y. Fu,
Henrik Walter,
Ilya M. Veer,
Thomas Frodl,
Lianne Schmaal,
Dick J. Veltman,
Paul M. Thompson
Abstract:
Large-scale collaborative analysis of brain imaging data, in psychiatry and neu-rology, offers a new source of statistical power to discover features that boost ac-curacy in disease classification, differential diagnosis, and outcome prediction. However, due to data privacy regulations or limited accessibility to large datasets across the world, it is challenging to efficiently integrate distribut…
▽ More
Large-scale collaborative analysis of brain imaging data, in psychiatry and neu-rology, offers a new source of statistical power to discover features that boost ac-curacy in disease classification, differential diagnosis, and outcome prediction. However, due to data privacy regulations or limited accessibility to large datasets across the world, it is challenging to efficiently integrate distributed information. Here we propose a novel classification framework through multi-site weighted LASSO: each site performs an iterative weighted LASSO for feature selection separately. Within each iteration, the classification result and the selected features are collected to update the weighting parameters for each feature. This new weight is used to guide the LASSO process at the next iteration. Only the fea-tures that help to improve the classification accuracy are preserved. In tests on da-ta from five sites (299 patients with major depressive disorder (MDD) and 258 normal controls), our method boosted classification accuracy for MDD by 4.9% on average. This result shows the potential of the proposed new strategy as an ef-fective and practical collaborative platform for machine learning on large scale distributed imaging and biobank data.
△ Less
Submitted 3 June, 2017; v1 submitted 26 May, 2017;
originally announced May 2017.
-
The IceProd Framework: Distributed Data Processing for the IceCube Neutrino Observatory
Authors:
M. G. Aartsen,
R. Abbasi,
M. Ackermann,
J. Adams,
J. A. Aguilar,
M. Ahlers,
D. Altmann,
C. Arguelles,
J. Auffenberg,
X. Bai,
M. Baker,
S. W. Barwick,
V. Baum,
R. Bay,
J. J. Beatty,
J. Becker Tjus,
K. -H. Becker,
S. BenZvi,
P. Berghaus,
D. Berley,
E. Bernardini,
A. Bernhard,
D. Z. Besson,
G. Binder,
D. Bindig
, et al. (262 additional authors not shown)
Abstract:
IceCube is a one-gigaton instrument located at the geographic South Pole, designed to detect cosmic neutrinos, iden- tify the particle nature of dark matter, and study high-energy neutrinos themselves. Simulation of the IceCube detector and processing of data require a significant amount of computational resources. IceProd is a distributed management system based on Python, XML-RPC and GridFTP. It…
▽ More
IceCube is a one-gigaton instrument located at the geographic South Pole, designed to detect cosmic neutrinos, iden- tify the particle nature of dark matter, and study high-energy neutrinos themselves. Simulation of the IceCube detector and processing of data require a significant amount of computational resources. IceProd is a distributed management system based on Python, XML-RPC and GridFTP. It is driven by a central database in order to coordinate and admin- ister production of simulations and processing of data produced by the IceCube detector. IceProd runs as a separate layer on top of other middleware and can take advantage of a variety of computing resources, including grids and batch systems such as CREAM, Condor, and PBS. This is accomplished by a set of dedicated daemons that process job submission in a coordinated fashion through the use of middleware plugins that serve to abstract the details of job submission and job management from the framework.
△ Less
Submitted 22 August, 2014; v1 submitted 22 November, 2013;
originally announced November 2013.