Showing 1–12 of 12 results for author: Rashidi, S

Searching in archive cs.
  1. arXiv:2406.19580

    cs.AR cs.LG

    FRED: Flexible REduction-Distribution Interconnect and Communication Implementation for Wafer-Scale Distributed Training of DNN Models

    Authors: Saeed Rashidi, William Won, Sudarshan Srinivasan, Puneet Gupta, Tushar Krishna

    Abstract: Distributed Deep Neural Network (DNN) training is a technique to reduce the training overhead by distributing the training tasks into multiple accelerators, according to a parallelization strategy. However, high-performance compute and interconnects are needed for maximum speed-up and linear scaling of the system. Wafer-scale systems are a promising technology that allows for tightly integrating h…

    Submitted 27 June, 2024; originally announced June 2024.

  2. arXiv:2309.04902

    cs.CV

    Transformers in Small Object Detection: A Benchmark and Survey of State-of-the-Art

    Authors: Aref Miri Rekavandi, Shima Rashidi, Farid Boussaid, Stephen Hoefs, Emre Akbas, Mohammed Bennamoun

    Abstract: Transformers have rapidly gained popularity in computer vision, especially in the field of object recognition and detection. Upon examining the outcomes of state-of-the-art object detection methods, we noticed that transformers consistently outperformed well-established CNN-based detectors in almost every video or image dataset. While transformer-based approaches remain at the forefront of small o…

    Submitted 9 September, 2023; originally announced September 2023.

  3. arXiv:2305.14516

    cs.LG cs.DC

    Chakra: Advancing Performance Benchmarking and Co-design using Standardized Execution Traces

    Authors: Srinivas Sridharan, Taekyung Heo, Louis Feng, Zhaodong Wang, Matt Bergeron, Wenyin Fu, Shengbao Zheng, Brian Coutinho, Saeed Rashidi, Changhai Man, Tushar Krishna

    Abstract: Benchmarking and co-design are essential for driving optimizations and innovation around ML models, ML software, and next-generation hardware. Full workload benchmarks, e.g. MLPerf, play an essential role in enabling fair comparison across different software and hardware stacks especially once systems are fully designed and deployed. However, the pace of AI innovation demands a more agile methodol…

    Submitted 26 May, 2023; v1 submitted 23 May, 2023; originally announced May 2023.

  4. arXiv:2303.14006

    cs.DC cs.LG

    ASTRA-sim2.0: Modeling Hierarchical Networks and Disaggregated Systems for Large-model Training at Scale

    Authors: William Won, Taekyung Heo, Saeed Rashidi, Srinivas Sridharan, Sudarshan Srinivasan, Tushar Krishna

    Abstract: As deep learning models and input data are scaling at an unprecedented rate, it is inevitable to move towards distributed training platforms to fit the model and increase training throughput. State-of-the-art approaches and techniques, such as wafer-scale nodes, multi-dimensional network topologies, disaggregated memory systems, and parallelization strategies, have been actively adopted by emergin…

    Submitted 24 March, 2023; originally announced March 2023.

  5. arXiv:2211.16648

    cs.DC cs.AI cs.LG

    COMET: A Comprehensive Cluster Design Methodology for Distributed Deep Learning Training

    Authors: Divya Kiran Kadiyala, Saeed Rashidi, Taekyung Heo, Abhimanyu Rajeshkumar Bambhaniya, Tushar Krishna, Alexandros Daglis

    Abstract: Modern Deep Learning (DL) models have grown to sizes requiring massive clusters of specialized, high-end nodes to train. Designing such clusters to maximize both performance and utilization--to amortize their steep cost--is a challenging task requiring careful balance of compute, memory, and network resources. Moreover, a plethora of each model's tuning knobs drastically affect the performance, wi…

    Submitted 14 March, 2024; v1 submitted 29 November, 2022; originally announced November 2022.

  6. arXiv:2210.12947

    cs.LG cs.CV

    IT-RUDA: Information Theory Assisted Robust Unsupervised Domain Adaptation

    Authors: Shima Rashidi, Ruwan Tennakoon, Aref Miri Rekavandi, Papangkorn Jessadatavornwong, Amanda Freis, Garret Huff, Mark Easton, Adrian Mouritz, Reza Hoseinnezhad, Alireza Bab-Hadiashar

    Abstract: Distribution shift between train (source) and test (target) datasets is a common problem encountered in machine learning applications. One approach to resolve this issue is to use the Unsupervised Domain Adaptation (UDA) technique that carries out knowledge transfer from a label-rich source domain to an unlabeled target domain. Outliers that exist in either source or target datasets can introduce…

    Submitted 24 October, 2022; originally announced October 2022.

  7. arXiv:2207.10898

    cs.NI cs.AI

    Impact of RoCE Congestion Control Policies on Distributed Training of DNNs

    Authors: Tarannum Khan, Saeed Rashidi, Srinivas Sridharan, Pallavi Shurpali, Aditya Akella, Tushar Krishna

    Abstract: RDMA over Converged Ethernet (RoCE) has gained significant traction for datacenter networks due to its compatibility with conventional Ethernet-based fabric. However, the RDMA protocol is efficient only on (nearly) lossless networks, emphasizing the vital role of congestion control on RoCE networks. Unfortunately, the native RoCE congestion control scheme, based on Priority Flow Control (PFC), s…

    Submitted 22 July, 2022; originally announced July 2022.

  8. arXiv:2110.04478

    cs.DC cs.AR cs.LG cs.NI

    Themis: A Network Bandwidth-Aware Collective Scheduling Policy for Distributed Training of DL Models

    Authors: Saeed Rashidi, William Won, Sudarshan Srinivasan, Srinivas Sridharan, Tushar Krishna

    Abstract: Distributed training is a solution to reduce DNN training time by splitting the task across multiple NPUs (e.g., GPU/TPU). However, distributed training adds communication overhead between the NPUs in order to synchronize the gradients and/or activation, depending on the parallelization strategy. In next-generation platforms for training at scale, NPUs will be connected through multi-dimensional n…

    Submitted 7 July, 2022; v1 submitted 9 October, 2021; originally announced October 2021.

  9. LIBRA: Enabling Workload-aware Multi-dimensional Network Topology Optimization for Distributed Training of Large AI Models

    Authors: William Won, Saeed Rashidi, Sudarshan Srinivasan, Tushar Krishna

    Abstract: As model sizes in machine learning continue to scale, distributed training is necessary to accommodate model weights within each device and to reduce training time. However, this comes with the expense of increased communication overhead due to the exchange of gradients and activations, which become the critical bottleneck of the end-to-end training process. In this work, we motivate the design of…

    Submitted 5 May, 2024; v1 submitted 24 September, 2021; originally announced September 2021.

    Comments: Contains 10 main pages, 21 figures, 3 tables

    Journal ref: Proceedings of the 2024 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS '24)

  10. arXiv:2008.08289

    cs.LG cs.DC stat.ML

    Restructuring, Pruning, and Adjustment of Deep Models for Parallel Distributed Inference

    Authors: Afshin Abdi, Saeed Rashidi, Faramarz Fekri, Tushar Krishna

    Abstract: Using multiple nodes and parallel computing algorithms has become a principal tool to improve training and execution times of deep neural networks as well as effective collective intelligence in sensor networks. In this paper, we consider the parallel implementation of an already-trained deep model on multiple processing nodes (a.k.a. workers) where the deep model is divided into several parallel…

    Submitted 19 August, 2020; originally announced August 2020.

  11. Enabling Compute-Communication Overlap in Distributed Deep Learning Training Platforms

    Authors: Saeed Rashidi, Matthew Denton, Srinivas Sridharan, Sudarshan Srinivasan, Amoghavarsha Suresh, Jade Ni, Tushar Krishna

    Abstract: Deep Learning (DL) training platforms are built by interconnecting multiple DL accelerators (e.g., GPU/TPU) via fast, customized interconnects with 100s of gigabytes (GBs) of bandwidth. However, as we identify in this work, driving this bandwidth is quite challenging. This is because there is a pernicious balance between using the accelerator's compute and memory for both DL computations and commu…

    Submitted 4 May, 2022; v1 submitted 30 June, 2020; originally announced July 2020.

  12. Machine Learning Distinguishes Neurosurgical Skill Levels in a Virtual Reality Tumor Resection Task

    Authors: Samaneh Siyar, Hamed Azarnoush, Saeid Rashidi, Alexandre Winkler-Schwartz, Vincent Bissonnette, Nirros Ponnudurai, Rolando F. Del Maestro

    Abstract: Background: Virtual reality simulators and machine learning have the potential to augment understanding, assessment and training of psychomotor performance in neurosurgery residents. Objective: This study outlines the first application of machine learning to distinguish "skilled" and "novice" psychomotor performance during a virtual reality neurosurgical task. Methods: Twenty-three neurosurgeons a…

    Submitted 20 November, 2018; originally announced November 2018.