
doi 10.1145/3627703.3629580

HAP: SPMD DNN Training on Heterogeneous GPU Clusters with Automated Program Synthesis

Authors: Shiwei Zhang, Lansong Diao, Chuan Wu, Zongyan Cao, Siyu Wang, Wei Lin

Abstract: Single-Program-Multiple-Data (SPMD) parallelism has recently been adopted to train large deep neural networks (DNNs). Few studies have explored its applicability on heterogeneous clusters, to fully exploit available resources for large model learning. This paper presents HAP, an automated system designed to expedite SPMD DNN training on heterogeneous clusters. HAP jointly optimizes the tensor sharding strategy, sharding ratios across heterogeneous devices and the communication methods for tensor exchanges for optimized distributed training with SPMD parallelism. We novelly formulate model partitioning as a program synthesis problem, in which we generate a distributed program from scratch on a distributed instruction set that semantically resembles the program designed for a single device, and systematically explore the solution space with an A*-based search algorithm. We derive the optimal tensor sharding ratios by formulating it as a linear programming problem. Additionally, HAP explores tensor communication optimization in a heterogeneous cluster and integrates it as part of the program synthesis process, for automatically choosing optimal collective communication primitives and applying the sufficient factor broadcasting technique. Extensive experiments on representative workloads demonstrate that HAP achieves up to 2.41x speed-up on heterogeneous clusters.

Submitted 11 January, 2024; originally announced January 2024.

Comments: EuroSys '24

arXiv:2303.01675 [pdf, other]

Ada-Grouper: Accelerating Pipeline Parallelism in Preempted Network by Adaptive Group-Scheduling for Micro-Batches

Authors: Siyu Wang, Zongyan Cao, Chang Si, Lansong Diao, Jiamang Wang, Wei Lin

Abstract: Pipeline parallelism has been demonstrated to be a remarkable approach to improve throughput for training deep neural networks with billions of parameters over heterogeneous clusters. The 1F1B scheduling plan is a widely adopted strategy for memory and performance optimization, which interchanges the forward and backward stage computations of different micro-batches. On the other hand, a common issue in using the 1F1B scheduling is that stage computation is delayed due to the data transfer when network resources are preempted by other tasks, even with the minimum communication between stages. The exclusive access of these network resources cannot be guaranteed in cloud offerings. We present a general scheduling technique to accommodate pipeline parallelism to preempted network environments at the expense of a certain amount of memory pressure. The core concept is to extend the 1F1B schedule scheme to kFkB, which groups k micro-batches and alternately executes k forward and backward computations. We propose Ada-Grouper, an adaptive kFkB scheduler which regularly adjusts the number of group members k to maintain an optimal balance between communication and computation efficiency in response to changes in the network environment under the memory limit. Experimental results demonstrate that our design maintains stable performance for pipeline parallelism, yielding a performance increase of 4% to 30% compared with 1F1B in preempted network scenarios.

Submitted 2 March, 2023; originally announced March 2023.

arXiv:2302.08141 [pdf, other]

Auto-Parallelizing Large Models with Rhino: A Systematic Approach on Production AI Platform

Authors: Shiwei Zhang, Lansong Diao, Siyu Wang, Zongyan Cao, Yiliang Gu, Chang Si, Ziji Shi, Zhen Zheng, Chuan Wu, Wei Lin

Abstract: We present Rhino, a system for accelerating tensor programs with automatic parallelization on an AI platform for real production environments. It transforms a tensor program written for a single device into an equivalent distributed program that is capable of scaling up to thousands of devices with no user configuration. Rhino first works on a semantically independent intermediate representation of tensor programs, which facilitates its generalization to unprecedented applications. Additionally, it implements a task-oriented controller and a distributed runtime for optimal performance. Rhino explores a complete and systematic parallelization strategy space that comprises all the paradigms commonly employed in deep learning (DL), in addition to strided partitioning and pipeline parallelism on non-linear models. Aiming to efficiently search for a near-optimal parallel execution plan, our analysis of production clusters reveals general heuristics to speed up the strategy search. On top of it, two optimization levels are designed to offer users flexible trade-offs between the search time and strategy quality. Our experiments demonstrate that Rhino can not only re-discover the expert-crafted strategies of classic, research and production DL models, but also identify novel parallelization strategies which surpass existing systems for novel models.

Submitted 16 February, 2023; originally announced February 2023.

arXiv:2302.06126 [pdf, other]

Expediting Distributed DNN Training with Device Topology-Aware Graph Deployment

Authors: Shiwei Zhang, Xiaodong Yi, Lansong Diao, Chuan Wu, Siyu Wang, Wei Lin

Abstract: This paper presents TAG, an automatic system to derive an optimized DNN training graph and its deployment onto any device topology, for expedited training in device- and topology-heterogeneous ML clusters. We novelly combine both the DNN computation graph and the device topology graph as input to a graph neural network (GNN), and join the GNN with a search-based method to quickly identify optimized distributed training strategies. To reduce communication in a heterogeneous cluster, we further explore a lossless gradient compression technique and solve a combinatorial optimization problem to automatically apply the technique for training time minimization. We evaluate TAG with various representative DNN models and device topologies, showing that it can achieve up to 4.56x training speed-up as compared to existing schemes. TAG can produce efficient deployment strategies for both unseen DNN models and unseen device topologies, without heavy fine-tuning.

Submitted 13 February, 2023; originally announced February 2023.

Comments: Accepted by IEEE Transactions on Parallel and Distributed Systems (TPDS) 2023

arXiv:2209.12769 [pdf, ps, other]

doi 10.1109/TPDS.2022.3201531

Optimizing DNN Compilation for Distributed Training with Joint OP and Tensor Fusion

Authors: Xiaodong Yi, Shiwei Zhang, Lansong Diao, Chuan Wu, Zhen Zheng, Shiqing Fan, Siyu Wang, Jun Yang, Wei Lin

Abstract: This paper proposes DisCo, an automatic deep learning compilation module for data-parallel distributed training. Unlike most deep learning compilers that focus on training or inference on a single device, DisCo optimizes a DNN model for distributed training over multiple GPU machines. Existing single-device compilation strategies do not work well in distributed training, due mainly to communication inefficiency that they incur. DisCo generates optimized, joint computation operator and communication tensor fusion strategies to enable highly efficient distributed training. A GNN-based simulator is built to effectively estimate per-iteration training time achieved by operator/tensor fusion candidates. A backtracking search algorithm is driven by the simulator, navigating efficiently in the large strategy space to identify good operator/tensor fusion strategies that minimize distributed training time. We compare DisCo with existing DL fusion schemes and show that it achieves good training speed-up close to the ideal, full computation-communication overlap case.

Submitted 26 September, 2022; originally announced September 2022.

Journal ref: IEEE Transactions on Parallel and Distributed Systems, vol. 33, no. 12, pp. 4694-4706, 1 Dec. 2022

arXiv:2103.05288 [pdf, other]

DISC: A Dynamic Shape Compiler for Machine Learning Workloads

Authors: Kai Zhu, Wenyi Zhao, Zhen Zheng, Tianyou Guo, Pengzhan Zhao, Feiwen Zhu, Junjie Bai, Jun Yang, Xiaoyong Liu, Lansong Diao, Wei Lin

Abstract: Many recent machine learning models show dynamic shape characteristics. However, existing AI compiler optimization systems suffer a lot from problems brought by dynamic shape models, including compilation overhead, memory usage, optimization pipeline and deployment complexity. This paper provides a compiler system to natively support optimization for dynamic shape workloads, named DISC. DISC enriches a set of IR to form a fully dynamic shape representation. It generates the runtime flow at compile time to support processing dynamic shape based logic, which avoids the interpretation overhead at runtime and enlarges the opportunity of host-device co-optimization. It addresses the kernel fusion problem of dynamic shapes with shape propagation and constraint collection methods. This is the first work to demonstrate how to build an end-to-end dynamic shape compiler based on MLIR infrastructure. Experiments show that DISC achieves up to 3.3x speedup over TensorFlow/PyTorch, and 1.8x over Nimble.

Submitted 23 November, 2021; v1 submitted 9 March, 2021; originally announced March 2021.

arXiv:2009.10924 [pdf, other]

FusionStitching: Boosting Memory Intensive Computations for Deep Learning Workloads

Authors: Zhen Zheng, Pengzhan Zhao, Guo** Long, Feiwen Zhu, Kai Zhu, Wenyi Zhao, Lansong Diao, Jun Yang, Wei Lin

Abstract: We show in this work that memory intensive computations can result in severe performance problems due to off-chip memory access and CPU-GPU context switch overheads in a wide range of deep learning models. For this problem, current just-in-time (JIT) kernel fusion and code generation techniques have limitations, such as rough fusion plan exploration strategies and limited code generation ability. We propose FusionStitching, a deep learning compiler capable of fusing memory intensive operators, with varied data dependencies and non-homogeneous parallelism, into large GPU kernels to reduce global memory access and context switch overhead automatically. FusionStitching widens the range of operation combinations that fusion can target beyond previous JIT works by introducing data reuse of intermediate values. It explores large fusion spaces to decide optimal fusion plans with considerations of memory access costs, kernel calls and resource usage constraints. FusionStitching tunes the optimal stitching scheme with a domain-specific cost model efficiently. Experimental results show that FusionStitching can reach up to 2.21x speedup compared to state-of-the-art, with 1.45x on average. Besides these experimental results, we integrated our approach into a compiler product and deployed it onto a production cluster for AI workloads with thousands of GPUs. The system has been in operation for more than 4 months and saves 7,000 GPU hours on average for approximately 30,000 tasks per month.

Submitted 17 December, 2021; v1 submitted 23 September, 2020; originally announced September 2020.

arXiv:2007.04069 [pdf, other]

Auto-MAP: A DQN Framework for Exploring Distributed Execution Plans for DNN Workloads

Authors: Siyu Wang, Yi Rong, Shiqing Fan, Zhen Zheng, LanSong Diao, Guo** Long, Jun Yang, Xiaoyong Liu, Wei Lin

Abstract: The last decade has witnessed growth in the computational requirements for training deep neural networks. Current approaches (e.g., data/model parallelism, pipeline parallelism) parallelize training tasks onto multiple devices. However, these approaches always rely on specific deep learning frameworks and require elaborate manual design, which makes them difficult to maintain and share between different types of models. In this paper, we propose Auto-MAP, a framework for exploring distributed execution plans for DNN workloads, which can automatically discover fast parallelization strategies through reinforcement learning at the IR level of deep learning models. Efficient exploration remains a major challenge for reinforcement learning. We leverage DQN with task-specific pruning strategies to help efficiently explore the search space including optimized strategies. Our evaluation shows that Auto-MAP can find the optimal solution in two hours, while achieving better throughput on several NLP and convolution models.

Submitted 8 July, 2020; originally announced July 2020.

arXiv:2007.01045 [pdf, other]

DAPPLE: A Pipelined Data Parallel Approach for Training Large Models

Authors: Shiqing Fan, Yi Rong, Chen Meng, Zongyan Cao, Siyu Wang, Zhen Zheng, Chuan Wu, Guo** Long, Jun Yang, Lixue Xia, Lansong Diao, Xiaoyong Liu, Wei Lin

Abstract: It is a challenging task to train large DNN models on sophisticated GPU platforms with diversified interconnect capabilities. Recently, pipelined training has been proposed as an effective approach for improving device utilization. However, there are still several tricky issues to address: improving computing efficiency while ensuring convergence, and reducing memory usage without incurring additional computing costs. We propose DAPPLE, a synchronous training framework which combines data parallelism and pipeline parallelism for large DNN models. It features a novel parallelization strategy planner to solve the partition and placement problems, and explores the optimal hybrid strategy of data and pipeline parallelism. We also propose a new runtime scheduling algorithm to reduce device memory usage, which is orthogonal to the re-computation approach and does not come at the expense of training throughput. Experiments show that the DAPPLE planner consistently outperforms strategies generated by PipeDream's planner by up to 3.23x under synchronous training scenarios, and the DAPPLE runtime outperforms GPipe with a 1.6x speedup in training throughput while reducing memory consumption by 12%.

Submitted 2 July, 2020; originally announced July 2020.

arXiv:2004.12087 [pdf]

Clustering by Constructing Hyper-Planes

Authors: Luhong Diao, **ying Gao1, Manman Deng

Abstract: As a kind of basic machine learning method, clustering algorithms group data points into different categories based on their similarity or distribution. We present a clustering algorithm that finds hyper-planes to distinguish the data points. It relies on the marginal space between the points. Then we combine these hyper-planes to determine the centers and number of clusters. Because the algorithm is based on linear structures, it can approximate the distribution of datasets accurately and flexibly. To evaluate its performance, we compared it with some famous clustering algorithms by carrying out experiments on different kinds of benchmark datasets. It clearly outperforms the other methods.

Submitted 25 April, 2020; originally announced April 2020.

arXiv:1705.02743 [pdf, other]

ChineseFoodNet: A large-scale Image Dataset for Chinese Food Recognition

Authors: Xin Chen, Yu Zhu, Hua Zhou, Liang Diao, Dongyan Wang

Abstract: In this paper, we introduce a new and challenging large-scale food image dataset called "ChineseFoodNet", which aims to automatically recognize pictured Chinese dishes. Most of the existing food image datasets collected food images either from recipe pictures or selfies. In our dataset, the images of each food category consist of not only web recipe and menu pictures but also photos taken from real dishes, recipes and menus. ChineseFoodNet contains over 180,000 food photos of 208 categories, with each category covering large variations in presentations of the same Chinese food. We present our efforts to build this large-scale image dataset, including food category selection, data collection, and data cleaning and labeling, in particular how to use machine learning methods to reduce manual labeling work, which is an expensive process. We share a detailed benchmark of several state-of-the-art deep convolutional neural networks (CNNs) on ChineseFoodNet. We further propose a novel two-step data fusion approach referred to as "TastyNet", which combines prediction results from different CNNs with a voting method. Our proposed approach achieves top-1 accuracies of 81.43% on the validation set and 81.55% on the test set, respectively. The latest dataset is publicly available for research and can be obtained at https://sites.google.com/view/chinesefoodnet.

Submitted 15 October, 2017; v1 submitted 8 May, 2017; originally announced May 2017.

Comments: 8 pages, 5 figures, 2 tables

Showing 1–11 of 11 results for author: Diao, L