Showing 1–6 of 6 results for author: Nagrecha, K

  1. arXiv:2311.02840  [pdf, other]

    cs.LG cs.AI cs.DC

    Saturn: Efficient Multi-Large-Model Deep Learning

    Authors: Kabir Nagrecha, Arun Kumar

    Abstract: In this paper, we propose Saturn, a new data system to improve the efficiency of multi-large-model training (e.g., during model selection/hyperparameter optimization). We first identify three key interconnected systems challenges for users building large models in this setting -- parallelism technique selection, distribution of GPUs over jobs, and scheduling. We then formalize these as a joint pro…

    Submitted 5 November, 2023; originally announced November 2023.

    Comments: 4 pages, 1 figure, 2 tables. Accepted to BayLearn 2023. Abstract of this paper: https://adalabucsd.github.io/papers/TR_2023_Saturn.pdf

  2. arXiv:2309.01226  [pdf, other]

    cs.LG cs.AI cs.DC

    Saturn: An Optimized Data System for Large Model Deep Learning Workloads

    Authors: Kabir Nagrecha, Arun Kumar

    Abstract: Large language models such as GPT-3 & ChatGPT have transformed deep learning (DL), powering applications that have captured the public's imagination. These models are rapidly being adopted across domains for analytics on various modalities, often by finetuning pre-trained base models. Such models need multiple GPUs due to both their size and computational load, driving the development of a bevy of…

    Submitted 13 December, 2023; v1 submitted 3 September, 2023; originally announced September 2023.

    Comments: Accepted at VLDB '24. Code available: https://github.com/knagrecha/saturn. 12 pages + 3 pages references + 2 pages appendix

  3. arXiv:2308.08500  [pdf, other]

    cs.IR cs.AI cs.DC cs.LG cs.PF

    InTune: Reinforcement Learning-based Data Pipeline Optimization for Deep Recommendation Models

    Authors: Kabir Nagrecha, Lingyi Liu, Pablo Delgado, Prasanna Padmanabhan

    Abstract: Deep learning-based recommender models (DLRMs) have become an essential component of many modern recommender systems. Several companies are now building large compute clusters reserved only for DLRM training, driving new interest in cost- and time-saving optimizations. The systems challenges faced in this setting are unique; while typical deep learning training jobs are dominated by model executi…

    Submitted 13 August, 2023; originally announced August 2023.

    Comments: Accepted at RecSys 2023. 11 pages, 2 pages of references. 8 figures with 2 tables

  4. arXiv:2301.02691  [pdf, other]

    cs.DC cs.LG

    Systems for Parallel and Distributed Large-Model Deep Learning Training

    Authors: Kabir Nagrecha

    Abstract: Deep learning (DL) has transformed applications in a variety of domains, including computer vision, natural language processing, and tabular data analysis. The search for improved DL model accuracy has led practitioners to explore increasingly large neural architectures, with some recent Transformer models spanning hundreds of billions of learnable parameters. These designs have introduced new sca…

    Submitted 6 January, 2023; originally announced January 2023.

    Comments: 12 pages, 10 figures

  5. arXiv:2110.08633  [pdf, other]

    cs.DC cs.DB cs.LG

    Hydra: A System for Large Multi-Model Deep Learning

    Authors: Kabir Nagrecha, Arun Kumar

    Abstract: Scaling up model depth and size is now a common approach to raise accuracy in many deep learning (DL) applications, as evidenced by the widespread success of multi-billion or even trillion parameter models in natural language processing (NLP) research. Despite success in DL research and at major technology companies, broader practical adoption of such large models among domain scientists and busin…

    Submitted 3 August, 2022; v1 submitted 16 October, 2021; originally announced October 2021.

    Comments: 3 figures, 1 table, 11 pages including references

  6. Model-Parallel Model Selection for Deep Learning Systems

    Authors: Kabir Nagrecha

    Abstract: As deep learning becomes more expensive, both in terms of time and compute, inefficiencies in machine learning (ML) training prevent practical usage of state-of-the-art models for most users. The newest model architectures are simply too large to be fit onto a single processor. To address the issue, many ML practitioners have turned to model parallelism as a method of distributing the computationa…

    Submitted 13 July, 2021; originally announced July 2021.

    Comments: 2 pages, 3 figures. 1st place winner of ACM SIGMOD '21 Student Research Competition. Appeared in ACM SIGMOD/PODS '21 Proceedings

    ACM Class: C.3; I.5; I.7