A new Standard DNA damage (SDD) data format
Authors:
J. Schuemann,
A. McNamara,
J. W. Warmenhoven,
N. T. Henthorn,
K. Kirkby,
M. J. Merchant,
S. Ingram,
H. Paganetti,
KD. Held,
J. Ramos-Mendez,
B. Faddegon,
J. Perl,
D. Goodhead,
I. Plante,
H. Rabus,
H. Nettelbeck,
W. Friedland,
P. Kundrat,
A. Ottolenghi,
G. Baiocco,
S. Barbieri,
M. Dingfelder,
S. Incerti,
C. Villagrasa,
M. Bueno, et al. (26 additional authors not shown)
Abstract:
Our understanding of radiation-induced cellular damage has greatly improved over the past decades. Despite this progress, there are still many obstacles to fully understanding how radiation interacts with biologically relevant cellular components to form observable endpoints. One hurdle is the difficulty faced by members of different research groups in directly comparing results. Multiple Monte Carlo codes have been developed to simulate damage induction at the DNA scale, while at the same time various groups have developed models that describe DNA repair processes with varying levels of detail. These repair models are intrinsically linked to the damage model employed in their development, making it difficult to disentangle systematic effects in either part of the modelling chain. The modelling chain typically consists of track-structure Monte Carlo simulations of the physics interactions that create direct damage to the DNA, followed by simulations of the production and initial reactions of chemical species that cause indirect damage. After the DNA damage induction, DNA repair models combine the simulated damage patterns with biological models to determine the biological consequences of the damage. We propose a new Standard data format for DNA Damage to unify the interface between the simulation of damage induction and the biological modelling of cell repair processes. Such a standard greatly facilitates inter-model comparisons, providing an ideal environment to tease out model assumptions and identify persistent, underlying mechanisms. Through inter-model comparisons, this unified standard has the potential to greatly advance our understanding of the underlying mechanisms of radiation-induced DNA damage and the resulting observable biological effects.
Submitted 11 January, 2022;
originally announced January 2022.
Check-N-Run: A Checkpointing System for Training Deep Learning Recommendation Models
Authors:
Assaf Eisenman,
Kiran Kumar Matam,
Steven Ingram,
Dheevatsa Mudigere,
Raghuraman Krishnamoorthi,
Krishnakumar Nair,
Misha Smelyanskiy,
Murali Annavaram
Abstract:
Checkpoints play an important role in training long-running machine learning (ML) models. Checkpoints take a snapshot of an ML model and store it in non-volatile memory so that training can recover from failures and continue to progress rapidly. In addition, they are used for online training to improve inference prediction accuracy with continuous learning. Given the large and ever-increasing model sizes, checkpoint frequency is often bottlenecked by the storage write bandwidth and capacity. When checkpoints are maintained on remote storage, as is the case in many industrial settings, they are also bottlenecked by network bandwidth. We present Check-N-Run, a scalable checkpointing system for training large ML models at Facebook. While Check-N-Run is applicable to long-running ML jobs, we focus on checkpointing recommendation models, which are currently the largest ML models, with model sizes in the terabytes. Check-N-Run uses two primary techniques to address the size and bandwidth challenges. First, it applies incremental checkpointing, which tracks and checkpoints the modified part of the model. Incremental checkpointing is particularly valuable in the context of recommendation models, where only a fraction of the model (stored as embedding tables) is updated on each iteration. Second, Check-N-Run leverages quantization techniques to significantly reduce the checkpoint size without degrading training accuracy. These techniques allow Check-N-Run to reduce the required write bandwidth by 6-17x and the required capacity by 2.5-8x on real-world models at Facebook, and thereby significantly improve checkpoint capabilities while reducing the total cost of ownership.
Submitted 4 May, 2021; v1 submitted 16 October, 2020;
originally announced October 2020.
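
The two techniques named in the Check-N-Run abstract, incremental checkpointing and quantization, can be illustrated with a small sketch. This is not the system's actual implementation: the abstract does not specify how modified rows are tracked or which quantization scheme is used, so the `EmbeddingTable` class, the dirty-row set, and the uniform 8-bit per-row quantization below are all assumptions chosen for illustration.

```python
from typing import Dict, List, Tuple

Row = List[float]
# A quantized row: (minimum value, step size, integer codes). Hypothetical layout.
QuantizedRow = Tuple[float, float, List[int]]


def quantize_row(row: Row, bits: int = 8) -> QuantizedRow:
    """Uniformly quantize one row to integer codes in [0, 2**bits - 1]."""
    lo, hi = min(row), max(row)
    levels = (1 << bits) - 1
    # Fall back to a unit step for constant rows to avoid division by zero.
    scale = (hi - lo) / levels if hi > lo else 1.0
    codes = [round((x - lo) / scale) for x in row]
    return lo, scale, codes


def dequantize_row(lo: float, scale: float, codes: List[int]) -> Row:
    """Approximately reconstruct the original floats from the codes."""
    return [lo + c * scale for c in codes]


class EmbeddingTable:
    """Toy embedding table that remembers which rows changed (assumed design)."""

    def __init__(self, rows: List[Row]) -> None:
        self.rows = [list(r) for r in rows]
        self.dirty: set = set()

    def update(self, index: int, new_row: Row) -> None:
        self.rows[index] = list(new_row)
        self.dirty.add(index)

    def incremental_checkpoint(self) -> Dict[int, QuantizedRow]:
        """Return only the rows modified since the last checkpoint, quantized."""
        delta = {i: quantize_row(self.rows[i]) for i in sorted(self.dirty)}
        self.dirty.clear()
        return delta


def restore(baseline: List[Row], deltas: List[Dict[int, QuantizedRow]]) -> List[Row]:
    """Rebuild the table from a full baseline plus incremental deltas, in order."""
    rows = [list(r) for r in baseline]
    for delta in deltas:
        for i, (lo, scale, codes) in delta.items():
            rows[i] = dequantize_row(lo, scale, codes)
    return rows
```

In this sketch the baseline is kept at full precision and only the deltas are quantized. A checkpoint taken after updating one row of a large table contains a single entry, which is the property the abstract credits for the reduced write bandwidth. Restored rows differ from the originals by at most half a quantization step, plus floating-point rounding.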