Showing 1–1 of 1 results for author: Bharuka, S

  1. arXiv:2011.02999  [pdf, other]

    cs.LG cs.DC

    CPR: Understanding and Improving Failure Tolerant Training for Deep Learning Recommendation with Partial Recovery

    Authors: Kiwan Maeng, Shivam Bharuka, Isabel Gao, Mark C. Jeffrey, Vikram Saraph, Bor-Yiing Su, Caroline Trippel, Jiyan Yang, Mike Rabbat, Brandon Lucia, Carole-Jean Wu

    Abstract: The paper proposes and optimizes a partial recovery training system, CPR, for recommendation models. CPR relaxes the consistency requirement by enabling non-failed nodes to proceed without loading checkpoints when a node fails during training, reducing failure-related overheads. To the best of our knowledge, the paper is the first to perform a data-driven, in-depth analysis of applying partial r…

    Submitted 5 November, 2020; originally announced November 2020.