Showing 1–11 of 11 results for author: Diao, L

  1. HAP: SPMD DNN Training on Heterogeneous GPU Clusters with Automated Program Synthesis

    Authors: Shiwei Zhang, Lansong Diao, Chuan Wu, Zongyan Cao, Siyu Wang, Wei Lin

    Abstract: Single-Program-Multiple-Data (SPMD) parallelism has recently been adopted to train large deep neural networks (DNNs). Few studies have explored its applicability on heterogeneous clusters, to fully exploit available resources for large model learning. This paper presents HAP, an automated system designed to expedite SPMD DNN training on heterogeneous clusters. HAP jointly optimizes t…

    Submitted 11 January, 2024; originally announced January 2024.

    Comments: EuroSys '24

  2. arXiv:2303.01675  [pdf, other]

    cs.DC

    Ada-Grouper: Accelerating Pipeline Parallelism in Preempted Network by Adaptive Group-Scheduling for Micro-Batches

    Authors: Siyu Wang, Zongyan Cao, Chang Si, Lansong Diao, Jiamang Wang, Wei Lin

    Abstract: Pipeline parallelism has been demonstrated to be a remarkable approach to improve throughput for training deep neural networks with billions of parameters over heterogeneous clusters. The 1F1B scheduling plan is a widely adopted strategy for memory and performance optimization, which interchanges the forward and backward stage computations of different micro-batches. On the other hand, a common is…

    Submitted 2 March, 2023; originally announced March 2023.

  3. arXiv:2302.08141  [pdf, other]

    cs.DC cs.LG cs.PL

    Auto-Parallelizing Large Models with Rhino: A Systematic Approach on Production AI Platform

    Authors: Shiwei Zhang, Lansong Diao, Siyu Wang, Zongyan Cao, Yiliang Gu, Chang Si, Ziji Shi, Zhen Zheng, Chuan Wu, Wei Lin

    Abstract: We present Rhino, a system for accelerating tensor programs with automatic parallelization on AI platform for real production environment. It transforms a tensor program written for a single device into an equivalent distributed program that is capable of scaling up to thousands of devices with no user configuration. Rhino firstly works on a semantically independent intermediate representation of…

    Submitted 16 February, 2023; originally announced February 2023.

  4. arXiv:2302.06126  [pdf, other]

    cs.LG cs.DC

    Expediting Distributed DNN Training with Device Topology-Aware Graph Deployment

    Authors: Shiwei Zhang, Xiaodong Yi, Lansong Diao, Chuan Wu, Siyu Wang, Wei Lin

    Abstract: This paper presents TAG, an automatic system to derive optimized DNN training graph and its deployment onto any device topology, for expedited training in device- and topology- heterogeneous ML clusters. We novelly combine both the DNN computation graph and the device topology graph as input to a graph neural network (GNN), and join the GNN with a search-based method to quickly identify optimized…

    Submitted 13 February, 2023; originally announced February 2023.

    Comments: Accepted by IEEE Transactions on Parallel and Distributed Systems (TPDS) 2023

  5. Optimizing DNN Compilation for Distributed Training with Joint OP and Tensor Fusion

    Authors: Xiaodong Yi, Shiwei Zhang, Lansong Diao, Chuan Wu, Zhen Zheng, Shiqing Fan, Siyu Wang, Jun Yang, Wei Lin

    Abstract: This paper proposes DisCo, an automatic deep learning compilation module for data-parallel distributed training. Unlike most deep learning compilers that focus on training or inference on a single device, DisCo optimizes a DNN model for distributed training over multiple GPU machines. Existing single-device compilation strategies do not work well in distributed training, due mainly to communicatio…

    Submitted 26 September, 2022; originally announced September 2022.

    Journal ref: IEEE Transactions on Parallel and Distributed Systems, vol. 33, no. 12, pp. 4694-4706, 1 Dec. 2022

  6. arXiv:2103.05288  [pdf, other]

    cs.DC

    DISC: A Dynamic Shape Compiler for Machine Learning Workloads

    Authors: Kai Zhu, Wenyi Zhao, Zhen Zheng, Tianyou Guo, Pengzhan Zhao, Feiwen Zhu, Junjie Bai, Jun Yang, Xiaoyong Liu, Lansong Diao, Wei Lin

    Abstract: Many recent machine learning models show dynamic shape characteristics. However, existing AI compiler optimization systems suffer a lot from problems brought by dynamic shape models, including compilation overhead, memory usage, optimization pipeline and deployment complexity. This paper provides a compiler system to natively support optimization for dynamic shape workloads, named DISC. DISC enric…

    Submitted 23 November, 2021; v1 submitted 9 March, 2021; originally announced March 2021.

  7. arXiv:2009.10924  [pdf, other]

    cs.DC cs.LG

    FusionStitching: Boosting Memory Intensive Computations for Deep Learning Workloads

    Authors: Zhen Zheng, Pengzhan Zhao, Guoping Long, Feiwen Zhu, Kai Zhu, Wenyi Zhao, Lansong Diao, Jun Yang, Wei Lin

    Abstract: We show in this work that memory intensive computations can result in severe performance problems due to off-chip memory access and CPU-GPU context switch overheads in a wide range of deep learning models. For this problem, current just-in-time (JIT) kernel fusion and code generation techniques have limitations, such as rough fusion plan exploration strategies and limited code generation ability.…

    Submitted 17 December, 2021; v1 submitted 23 September, 2020; originally announced September 2020.

  8. arXiv:2007.04069  [pdf, other]

    cs.DC cs.AI

    Auto-MAP: A DQN Framework for Exploring Distributed Execution Plans for DNN Workloads

    Authors: Siyu Wang, Yi Rong, Shiqing Fan, Zhen Zheng, LanSong Diao, Guoping Long, Jun Yang, Xiaoyong Liu, Wei Lin

    Abstract: The last decade has witnessed growth in the computational requirements for training deep neural networks. Current approaches (e.g., data/model parallelism, pipeline parallelism) parallelize training tasks onto multiple devices. However, these approaches always rely on specific deep learning frameworks and requires elaborate manual design, which make it difficult to maintain and share between diffe…

    Submitted 8 July, 2020; originally announced July 2020.

  9. arXiv:2007.01045  [pdf, other]

    cs.DC

    DAPPLE: A Pipelined Data Parallel Approach for Training Large Models

    Authors: Shiqing Fan, Yi Rong, Chen Meng, Zongyan Cao, Siyu Wang, Zhen Zheng, Chuan Wu, Guoping Long, Jun Yang, Lixue Xia, Lansong Diao, Xiaoyong Liu, Wei Lin

    Abstract: It is a challenging task to train large DNN models on sophisticated GPU platforms with diversified interconnect capabilities. Recently, pipelined training has been proposed as an effective approach for improving device utilization. However, there are still several tricky issues to address: improving computing efficiency while ensuring convergence, and reducing memory usage without incurring additi…

    Submitted 2 July, 2020; originally announced July 2020.

  10. arXiv:2004.12087  [pdf]

    cs.CV

    Clustering by Constructing Hyper-Planes

    Authors: Luhong Diao, Jinying Gao, Manman Deng

    Abstract: As a kind of basic machine learning method, clustering algorithms group data points into different categories based on their similarity or distribution. We present a clustering algorithm by finding hyper-planes to distinguish the data points. It relies on the marginal space between the points. Then we combine these hyper-planes to determine centers and numbers of clusters. Because the algorithm is…

    Submitted 25 April, 2020; originally announced April 2020.

  11. arXiv:1705.02743  [pdf, other]

    cs.CV

    ChineseFoodNet: A large-scale Image Dataset for Chinese Food Recognition

    Authors: Xin Chen, Yu Zhu, Hua Zhou, Liang Diao, Dongyan Wang

    Abstract: In this paper, we introduce a new and challenging large-scale food image dataset called "ChineseFoodNet", which aims to automatically recognizing pictured Chinese dishes. Most of the existing food image datasets collected food images either from recipe pictures or selfie. In our dataset, images of each food category of our dataset consists of not only web recipe and menu pictures but photos taken…

    Submitted 15 October, 2017; v1 submitted 8 May, 2017; originally announced May 2017.

    Comments: 8 pages, 5 figures, 2 tables