-
CoNST: Code Generator for Sparse Tensor Networks
Authors:
Saurabh Raje,
Yufan Xu,
Atanas Rountev,
Edward F. Valeev,
Saday Sadayappan
Abstract:
Sparse tensor networks are commonly used to represent contractions over sparse tensors. Tensor contractions are higher-order analogs of matrix multiplication. Tensor networks arise commonly in many domains of scientific computing and data science. After a transformation into a tree of binary contractions, the network is implemented as a sequence of individual contractions. Several critical aspects…
▽ More
Sparse tensor networks are commonly used to represent contractions over sparse tensors. Tensor contractions are higher-order analogs of matrix multiplication. Tensor networks arise commonly in many domains of scientific computing and data science. After a transformation into a tree of binary contractions, the network is implemented as a sequence of individual contractions. Several critical aspects must be considered in the generation of efficient code for a contraction tree, including sparse tensor layout mode order, loop fusion to reduce intermediate tensors, and the interdependence of loop order, mode order, and contraction order. We propose CoNST, a novel approach that considers these factors in an integrated manner using a single formulation. Our approach creates a constraint system that encodes these decisions and their interdependence, while aiming to produce reduced-order intermediate tensors via fusion. The constraint system is solved by the Z3 SMT solver and the result is used to create the desired fused loop structure and tensor mode layouts for the entire contraction tree. This structure is lowered to the IR of the TACO compiler, which is then used to generate executable code. Our experimental evaluation demonstrates very significant (sometimes orders of magnitude) performance improvements over current state-of-the-art sparse tensor compiler/library alternatives.
△ Less
Submitted 9 January, 2024;
originally announced January 2024.
-
DataPerf: Benchmarks for Data-Centric AI Development
Authors:
Mark Mazumder,
Colby Banbury,
Xiaozhe Yao,
Bojan Karlaš,
William Gaviria Rojas,
Sudnya Diamos,
Greg Diamos,
Lynn He,
Alicia Parrish,
Hannah Rose Kirk,
Jessica Quaye,
Charvi Rastogi,
Douwe Kiela,
David Jurado,
David Kanter,
Rafael Mosquera,
Juan Ciro,
Lora Aroyo,
Bilge Acun,
Lingjiao Chen,
Mehul Smriti Raje,
Max Bartolo,
Sabri Eyuboglu,
Amirata Ghorbani,
Emmett Goodman
, et al. (20 additional authors not shown)
Abstract:
Machine learning research has long focused on models rather than datasets, and prominent datasets are used for common ML tasks without regard to the breadth, difficulty, and faithfulness of the underlying problems. Neglecting the fundamental importance of data has given rise to inaccuracy, bias, and fragility in real-world applications, and research is hindered by saturation across existing datase…
▽ More
Machine learning research has long focused on models rather than datasets, and prominent datasets are used for common ML tasks without regard to the breadth, difficulty, and faithfulness of the underlying problems. Neglecting the fundamental importance of data has given rise to inaccuracy, bias, and fragility in real-world applications, and research is hindered by saturation across existing dataset benchmarks. In response, we present DataPerf, a community-led benchmark suite for evaluating ML datasets and data-centric algorithms. We aim to foster innovation in data-centric AI through competition, comparability, and reproducibility. We enable the ML community to iterate on datasets, instead of just architectures, and we provide an open, online platform with multiple rounds of challenges to support this iterative development. The first iteration of DataPerf contains five benchmarks covering a wide spectrum of data-centric techniques, tasks, and modalities in vision, speech, acquisition, debugging, and diffusion prompting, and we support hosting new contributed benchmarks from the community. The benchmarks, online evaluation platform, and baseline implementations are open source, and the MLCommons Association will maintain DataPerf to ensure long-term benefits to academia and industry.
△ Less
Submitted 13 October, 2023; v1 submitted 20 July, 2022;
originally announced July 2022.
-
Efficient Scaling of Dynamic Graph Neural Networks
Authors:
Venkatesan T. Chakaravarthy,
Shivmaran S. Pandian,
Saurabh Raje,
Yogish Sabharwal,
Toyotaro Suzumura,
Shashanka Ubaru
Abstract:
We present distributed algorithms for training dynamic Graph Neural Networks (GNN) on large scale graphs spanning multi-node, multi-GPU systems. To the best of our knowledge, this is the first scaling study on dynamic GNN. We devise mechanisms for reducing the GPU memory usage and identify two execution time bottlenecks: CPU-GPU data transfer; and communication volume. Exploiting properties of dyn…
▽ More
We present distributed algorithms for training dynamic Graph Neural Networks (GNN) on large scale graphs spanning multi-node, multi-GPU systems. To the best of our knowledge, this is the first scaling study on dynamic GNN. We devise mechanisms for reducing the GPU memory usage and identify two execution time bottlenecks: CPU-GPU data transfer; and communication volume. Exploiting properties of dynamic graphs, we design a graph difference-based strategy to significantly reduce the transfer time. We develop a simple, but effective data distribution technique under which the communication volume remains fixed and linear in the input size, for any number of GPUs. Our experiments using billion-size graphs on a system of 128 GPUs shows that: (i) the distribution scheme achieves up to 30x speedup on 128 GPUs; (ii) the graph-difference technique reduces the transfer time by a factor of up to 4.1x and the overall execution time by up to 40%
△ Less
Submitted 16 September, 2021;
originally announced September 2021.
-
Kvik: A task based middleware with composable scheduling policies
Authors:
Saurabh Raje,
Frédéric Wagner
Abstract:
In this paper we present Kvik: an implementation of a task-based "middleware" for shared memory parallel programming in the Rust language built on top of the Rayon library. We devise a system allowing several task-splitting schedulers to be finely tuned by the end users. Among these, we propose an implementation of an adaptive scheduler reducing tasks creations (splits) to bare minimum by linking…
▽ More
In this paper we present Kvik: an implementation of a task-based "middleware" for shared memory parallel programming in the Rust language built on top of the Rayon library. We devise a system allowing several task-splitting schedulers to be finely tuned by the end users. Among these, we propose an implementation of an adaptive scheduler reducing tasks creations (splits) to bare minimum by linking tasks splitting to steal requests. Another important scheduler that allows turning computations into sequences of parallel operations is described. This operator proves itself particularly useful for interruptible computations. We exhibit different code examples well suited for different types of schedulers. We conclude our work with a set of benchmarks making heavy use of composability. In particular we present a parallel stable sort implementation with up to 1.5x more speedup when compared to the state-of-the-art parallel sorting implementation.
△ Less
Submitted 11 September, 2020;
originally announced September 2020.
-
PoWER-BERT: Accelerating BERT Inference via Progressive Word-vector Elimination
Authors:
Saurabh Goyal,
Anamitra R. Choudhury,
Saurabh M. Raje,
Venkatesan T. Chakaravarthy,
Yogish Sabharwal,
Ashish Verma
Abstract:
We develop a novel method, called PoWER-BERT, for improving the inference time of the popular BERT model, while maintaining the accuracy. It works by: a) exploiting redundancy pertaining to word-vectors (intermediate encoder outputs) and eliminating the redundant vectors. b) determining which word-vectors to eliminate by develo** a strategy for measuring their significance, based on the self-att…
▽ More
We develop a novel method, called PoWER-BERT, for improving the inference time of the popular BERT model, while maintaining the accuracy. It works by: a) exploiting redundancy pertaining to word-vectors (intermediate encoder outputs) and eliminating the redundant vectors. b) determining which word-vectors to eliminate by develo** a strategy for measuring their significance, based on the self-attention mechanism. c) learning how many word-vectors to eliminate by augmenting the BERT model and the loss function. Experiments on the standard GLUE benchmark shows that PoWER-BERT achieves up to 4.5x reduction in inference time over BERT with <1% loss in accuracy. We show that PoWER-BERT offers significantly better trade-off between accuracy and inference time compared to prior methods. We demonstrate that our method attains up to 6.8x reduction in inference time with <1% loss in accuracy when applied over ALBERT, a highly compressed version of BERT. The code for PoWER-BERT is publicly available at https://github.com/IBM/PoWER-BERT.
△ Less
Submitted 8 September, 2020; v1 submitted 24 January, 2020;
originally announced January 2020.
-
Decentralised firewall for malware detection
Authors:
Saurabh Raje,
Shyamal Vaderia,
Neil Wilson,
Rudrakh Panigrahi
Abstract:
This paper describes the design and development of a decentralized firewall system powered by a novel malware detection engine. The firewall is built using blockchain technology. The detection engine aims to classify Portable Executable (PE) files as malicious or benign. File classification is carried out using a deep belief neural network (DBN) as the detection engine. Our approach is to model th…
▽ More
This paper describes the design and development of a decentralized firewall system powered by a novel malware detection engine. The firewall is built using blockchain technology. The detection engine aims to classify Portable Executable (PE) files as malicious or benign. File classification is carried out using a deep belief neural network (DBN) as the detection engine. Our approach is to model the files as grayscale images and use the DBN to classify those images into the aforementioned two classes. An extensive data set of 10,000 files is used to train the DBN. Validation is carried out using 4,000 files previously unexposed to the network. The final result of whether to allow or block a file is obtained by arriving at a proof of work based consensus in the blockchain network.
△ Less
Submitted 3 November, 2017;
originally announced November 2017.