arXiv:2406.19580

FRED: Flexible REduction-Distribution Interconnect and Communication Implementation for Wafer-Scale Distributed Training of DNN Models

Authors: Saeed Rashidi, William Won, Sudarshan Srinivasan, Puneet Gupta, Tushar Krishna

Abstract: Distributed Deep Neural Network (DNN) training is a technique to reduce the training overhead by distributing the training tasks into multiple accelerators, according to a parallelization strategy. However, high-performance compute and interconnects are needed for maximum speed-up and linear scaling of the system. Wafer-scale systems are a promising technology that allows for tightly integrating high-end accelerators with high-speed wafer-scale interconnects, making it an attractive platform for distributed training. However, the wafer-scale interconnect should offer high performance and flexibility for various parallelization strategies to enable maximum optimizations for compute and memory usage. In this paper, we propose FRED, a wafer-scale interconnect that is tailored for the high-BW requirements of wafer-scale networks and can efficiently execute communication patterns of different parallelization strategies. Furthermore, FRED supports in-switch collective communication execution that reduces the network traffic by approximately 2X. Our results show that FRED can improve the average end-to-end training time of ResNet-152, Transformer-17B, GPT-3, and Transformer-1T by 1.76X, 1.87X, 1.34X, and 1.4X, respectively when compared to a baseline waferscale 2D-Mesh fabric.

Submitted 27 June, 2024; originally announced June 2024.

arXiv:2307.14549

Adversarial Sleeping Bandit Problems with Multiple Plays: Algorithm and Ranking Application

Authors: Jianjun Yuan, Wei Lee Woon, Ludovik Coba

Abstract: This paper presents an efficient algorithm to solve the sleeping bandit with multiple plays problem in the context of an online recommendation system. The problem involves bounded, adversarial loss and unknown i.i.d. distributions for arm availability. The proposed algorithm extends the sleeping bandit algorithm for single arm selection and is guaranteed to achieve theoretical performance with regret upper bounded by $O(kN^2\sqrt{T\log T})$, where $k$ is the number of arms selected per time step, $N$ is the total number of arms, and $T$ is the time horizon.

Submitted 26 July, 2023; originally announced July 2023.

Comments: Accepted by RecSys 2023 conference

arXiv:2304.05301

TACOS: Topology-Aware Collective Algorithm Synthesizer for Distributed Machine Learning

Authors: William Won, Midhilesh Elavazhagan, Sudarshan Srinivasan, Ajaya Durg, Samvit Kaul, Swati Gupta, Tushar Krishna

Abstract: The surge of artificial intelligence, specifically large language models, has led to a rapid advent towards the development of large-scale machine learning training clusters. Collective communications within these clusters tend to be heavily bandwidth-bound, necessitating techniques to optimally utilize the available network bandwidth. This puts the routing algorithm for the collective at the forefront of determining the performance. Unfortunately, communication libraries used in distributed machine learning today are limited by a fixed set of routing algorithms. This constrains collective performance within the domain of next-generation training clusters that employ intricate, heterogeneous, and asymmetric, large-scale topologies. Further, the emergence of irregular topologies attributed to runtime phenomena such as device failures serves to compound the complexity of the challenge. To this end, this paper introduces TACOS, an automated synthesizer that generates topology-aware collective algorithms for common distributed machine learning collectives across arbitrary input network topologies. TACOS was able to synthesize All-Reduce algorithm for a heterogeneous 512-NPU system in just 6.09 minutes while achieving performance improvement up to 4.27x over state-of-the-art prior work. TACOS exhibits high scalability, with synthesis time scaling quadratically with the number of NPUs. In contrast to prior works' NP-hard approaches, TACOS with 40K NPUs completes in 2.52 hours.

Submitted 29 March, 2024; v1 submitted 11 April, 2023; originally announced April 2023.

arXiv:2303.14006

ASTRA-sim2.0: Modeling Hierarchical Networks and Disaggregated Systems for Large-model Training at Scale

Authors: William Won, Taekyung Heo, Saeed Rashidi, Srinivas Sridharan, Sudarshan Srinivasan, Tushar Krishna

Abstract: As deep learning models and input data are scaling at an unprecedented rate, it is inevitable to move towards distributed training platforms to fit the model and increase training throughput. State-of-the-art approaches and techniques, such as wafer-scale nodes, multi-dimensional network topologies, disaggregated memory systems, and parallelization strategies, have been actively adopted by emerging distributed training systems. This results in a complex SW/HW co-design stack of distributed training, necessitating a modeling/simulation infrastructure for design-space exploration. In this paper, we extend the open-source ASTRA-sim infrastructure and endow it with the capabilities to model state-of-the-art and emerging distributed training models and platforms. More specifically, (i) we enable ASTRA-sim to support arbitrary model parallelization strategies via a graph-based training-loop implementation, (ii) we implement a parameterizable multi-dimensional heterogeneous topology generation infrastructure with analytical performance estimates enabling simulating target systems at scale, and (iii) we enhance the memory system modeling to support accurate modeling of in-network collective communication and disaggregated memory systems. With such capabilities, we run comprehensive case studies targeting emerging distributed models and platforms. This infrastructure lets system designers swiftly traverse the complex co-design stack and give meaningful insights when designing and deploying distributed training platforms at scale.

Submitted 24 March, 2023; originally announced March 2023.

arXiv:2110.04478

doi 10.1145/3470496.3527382

Themis: A Network Bandwidth-Aware Collective Scheduling Policy for Distributed Training of DL Models

Authors: Saeed Rashidi, William Won, Sudarshan Srinivasan, Srinivas Sridharan, Tushar Krishna

Abstract: Distributed training is a solution to reduce DNN training time by splitting the task across multiple NPUs (e.g., GPU/TPU). However, distributed training adds communication overhead between the NPUs in order to synchronize the gradients and/or activation, depending on the parallelization strategy. In next-generation platforms for training at scale, NPUs will be connected through multi-dimensional networks with diverse, heterogeneous bandwidths. This work identifies a looming challenge of keeping all network dimensions busy and maximizing the network BW within the hybrid environment if we leverage scheduling techniques for collective communication on systems today. We propose Themis, a novel collective scheduling scheme that dynamically schedules collectives (divided into chunks) to balance the communication loads across all dimensions, further improving the network BW utilization. Our results show that on average, Themis can improve the network BW utilization of the single All-Reduce by 1.72X (2.70X max), and improve the end-to-end training iteration performance of real workloads such as ResNet-152, GNMT, DLRM, and Transformer-1T by 1.49X (2.25X max), 1.30X (1.78X max), 1.30X (1.77X max), and 1.25X (1.53X max), respectively.

Submitted 7 July, 2022; v1 submitted 9 October, 2021; originally announced October 2021.

arXiv:2109.11762

doi 10.1109/ispass61541.2024.00028

LIBRA: Enabling Workload-aware Multi-dimensional Network Topology Optimization for Distributed Training of Large AI Models

Authors: William Won, Saeed Rashidi, Sudarshan Srinivasan, Tushar Krishna

Abstract: As model sizes in machine learning continue to scale, distributed training is necessary to accommodate model weights within each device and to reduce training time. However, this comes with the expense of increased communication overhead due to the exchange of gradients and activations, which become the critical bottleneck of the end-to-end training process. In this work, we motivate the design of multi-dimensional networks within machine learning systems as a cost-efficient mechanism to enhance overall network bandwidth. We also identify that optimal bandwidth allocation is pivotal for multi-dimensional networks to ensure efficient resource utilization. We introduce LIBRA, a framework specifically focused on optimizing multi-dimensional fabric architectures. Through case studies, we demonstrate the value of LIBRA, both in architecting optimized fabrics under diverse constraints and in enabling co-optimization opportunities.

Submitted 5 May, 2024; v1 submitted 24 September, 2021; originally announced September 2021.

Comments: Contains 10 main pages, 21 figures, 3 tables

Journal ref: Proceedings of the 2024 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS '24)

arXiv:2103.10452

Extending Sparse Tensor Accelerators to Support Multiple Compression Formats

Authors: Eric Qin, Geonhwa Jeong, William Won, Sheng-Chun Kao, Hyoukjun Kwon, Sudarshan Srinivasan, Dipankar Das, Gordon E. Moon, Sivasankaran Rajamanickam, Tushar Krishna

Abstract: Sparsity, which occurs in both scientific applications and Deep Learning (DL) models, has been a key target of optimization within recent ASIC accelerators due to the potential memory and compute savings. These applications use data stored in a variety of compression formats. We demonstrate that both the compactness of different compression formats and the compute efficiency of the algorithms enabled by them vary across tensor dimensions and amount of sparsity. Since DL and scientific workloads span across all sparsity regions, there can be numerous format combinations for optimizing memory and compute efficiency. Unfortunately, many proposed accelerators operate on one or two fixed format combinations. This work proposes hardware extensions to accelerators for supporting numerous format combinations seamlessly and demonstrates ~4X speedup over performing format conversions in software.

Submitted 18 March, 2021; originally announced March 2021.

Comments: Accepted for publication at the 35th IEEE International Parallel & Distributed Processing Symposium (IPDPS 2021)

arXiv:1806.02615

doi 10.1007/978-3-030-64583-0_24

Explainable AI as a Social Microscope: A Case Study on Academic Performance

Authors: Anahit Sargsyan, Areg Karapetyan, Wei Lee Woon, Aamena Alshamsi

Abstract: Academic performance is perceived as a product of complex interactions between students' overall experience, personal characteristics and upbringing. Data science techniques, most commonly involving regression analysis and related approaches, serve as a viable means to explore this interplay. However, these tend to extract factors with wide-ranging impact, while overlooking variations specific to individual students. Focusing on each student's peculiarities is generally impossible with thousands or even hundreds of subjects, yet data mining methods might prove effective in devising more targeted approaches. For instance, subjects with shared characteristics can be assigned to clusters, which can then be examined separately with machine learning algorithms, thereby providing a more nuanced view of the factors affecting individuals in a particular group. In this context, we introduce a data science workflow allowing for fine-grained analysis of academic performance correlates that captures the subtle differences in students' sensitivities to these factors. Leveraging the Local Interpretable Model-Agnostic Explanations (LIME) algorithm from the toolbox of Explainable Artificial Intelligence (XAI) techniques, the proposed pipeline yields groups of students having similar academic attainment indicators, rather than similar features (e.g. familial background) as typically practiced in prior studies. As a proof-of-concept case study, a rich longitudinal dataset is selected to evaluate the effectiveness of the proposed approach versus a standard regression model.

Submitted 4 June, 2020; v1 submitted 7 June, 2018; originally announced June 2018.

arXiv:1803.02282

doi 10.1038/s41467-018-07634-8

The Preeminence of Ethnic Diversity in Scientific Collaboration

Authors: Bedoor K AlShebli, Talal Rahwan, Wei Lee Woon

Abstract: Inspired by the social and economic benefits of diversity, we analyze over 9 million papers and 6 million scientists to study the relationship between research impact and five classes of diversity: ethnicity, discipline, gender, affiliation, and academic age. Using randomized baseline models, we establish the presence of homophily in ethnicity, gender and affiliation. We then study the effect of diversity on scientific impact, as reflected in citations. Remarkably, of the classes considered, ethnic diversity had the strongest correlation with scientific impact. To further isolate the effects of ethnic diversity, we used randomized baseline models and again found a clear link between diversity and impact. To further support these findings, we use coarsened exact matching to compare the scientific impact of ethnically diverse papers and scientists with closely-matched control groups. Here, we find that ethnic diversity resulted in an impact gain of 10.63% for papers, and 47.67% for scientists.

Submitted 20 November, 2020; v1 submitted 6 March, 2018; originally announced March 2018.

Journal ref: Nature communications, 9(1), 2018, 5163

arXiv:1802.06964

Co-occurrence matrix analysis-based semi-supervised training for object detection

Authors: Min-Kook Choi, Jaehyeong Park, Jihun Jung, Heechul Jung, Jin-Hee Lee, Woong Jae Won, Woo Young Jung, Jincheol Kim, Soon Kwon

Abstract: One of the most important factors in training object recognition networks using convolutional neural networks (CNNs) is the provision of annotated data accompanying human judgment. Particularly, in object detection or semantic segmentation, the annotation process requires considerable human effort. In this paper, we propose a semi-supervised learning (SSL)-based training methodology for object detection, which makes use of automatic labeling of un-annotated data by applying a network previously trained from an annotated dataset. Because an inferred label by the trained network is dependent on the learned parameters, it is often meaningless for re-training the network. To transfer a valuable inferred label to the unlabeled data, we propose a re-alignment method based on co-occurrence matrix analysis that takes into account one-hot-vector encoding of the estimated label and the correlation between the objects in the image. We used an MS-COCO detection dataset to verify the performance of the proposed SSL method and deformable neural networks (D-ConvNets) as an object detector for basic training. The performance of the existing state-of-the-art detectors (DConvNets, YOLO v2, and single shot multi-box detector (SSD)) can be improved by the proposed SSL method without using the additional model parameter or modifying the network architecture.

Submitted 19 February, 2018; originally announced February 2018.

Comments: Submitted to International Conference on Image Processing (ICIP) 2018

arXiv:1006.2570

Power Circuits, Exponential Algebra, and Time Complexity

Authors: Alexei G. Myasnikov, Alexander Ushakov, Dong Wook Won

Abstract: Motivated by algorithmic problems from combinatorial group theory we study computational properties of integers equipped with binary operations +, -, z = x 2^y, z = x 2^{-y} (the former two are partial) and predicates < and =. Notice that in this case very large numbers, which are obtained as n towers of exponentiation in the base 2 can be realized as n applications of the operation x2^y, so working with such numbers given in the usual binary expansions requires super exponential space. We define a new compressed representation for integers by power circuits (a particular type of straight-line programs) which is unique and easily computable, and show that the operations above can be performed in polynomial time if the numbers are presented by power circuits. We mention several applications of this technique to algorithmic problems, in particular, we prove that the quantifier-free theories of various exponential algebras are decidable in polynomial time, as well as the word problems in some "hard to crack" one-relator groups.

Submitted 13 June, 2010; originally announced June 2010.

Showing 1–11 of 11 results for author: Won, W