-
TinyIREE: An ML Execution Environment for Embedded Systems from Compilation to Deployment
Authors:
Hsin-I Cindy Liu,
Marius Brehler,
Mahesh Ravishankar,
Nicolas Vasilache,
Ben Vanik,
Stella Laurenzo
Abstract:
Machine learning model deployment for training and execution has been an important topic for industry and academic research in the last decade. Much of the attention has been focused on develo** specific toolchains to support acceleration hardware. In this paper, we present IREE, a unified compiler and runtime stack with the explicit goal to scale down machine learning programs to the smallest f…
▽ More
Machine learning model deployment for training and execution has been an important topic for industry and academic research in the last decade. Much of the attention has been focused on develo** specific toolchains to support acceleration hardware. In this paper, we present IREE, a unified compiler and runtime stack with the explicit goal to scale down machine learning programs to the smallest footprints for mobile and edge devices, while maintaining the ability to scale up to larger deployment targets. IREE adopts a compiler-based approach and optimizes for heterogeneous hardware accelerators through the use of the MLIR compiler infrastructure which provides the means to quickly design and implement multi-level compiler intermediate representations (IR). More specifically, this paper is focused on TinyIREE, which is a set of deployment options in IREE that accommodate the limited memory and computation resources in embedded systems and bare-metal platforms, while also demonstrating IREE's intuitive workflow that generates workloads for different ISA extensions and ABIs through LLVM.
△ Less
Submitted 28 May, 2022;
originally announced May 2022.
-
Compiler Support for Sparse Tensor Computations in MLIR
Authors:
Aart J. C. Bik,
Penporn Koanantakool,
Tatiana Shpeisman,
Nicolas Vasilache,
Bixia Zheng,
Fredrik Kjolstad
Abstract:
Sparse tensors arise in problems in science, engineering, machine learning, and data analytics. Programs that operate on such tensors can exploit sparsity to reduce storage requirements and computational time. Develo** and maintaining sparse software by hand, however, is a complex and error-prone task. Therefore, we propose treating sparsity as a property of tensors, not a tedious implementation…
▽ More
Sparse tensors arise in problems in science, engineering, machine learning, and data analytics. Programs that operate on such tensors can exploit sparsity to reduce storage requirements and computational time. Develo** and maintaining sparse software by hand, however, is a complex and error-prone task. Therefore, we propose treating sparsity as a property of tensors, not a tedious implementation task, and letting a sparse compiler generate sparse code automatically from a sparsity-agnostic definition of the computation. This paper discusses integrating this idea into MLIR.
△ Less
Submitted 9 February, 2022;
originally announced February 2022.
-
Composable and Modular Code Generation in MLIR: A Structured and Retargetable Approach to Tensor Compiler Construction
Authors:
Nicolas Vasilache,
Oleksandr Zinenko,
Aart J. C. Bik,
Mahesh Ravishankar,
Thomas Raoux,
Alexander Belyaev,
Matthias Springer,
Tobias Gysi,
Diego Caballero,
Stephan Herhut,
Stella Laurenzo,
Albert Cohen
Abstract:
Despite significant investment in software infrastructure, machine learning systems, runtimes and compilers do not compose properly. We propose a new design aiming at providing unprecedented degrees of modularity, composability and genericity. This paper discusses a structured approach to the construction of domain-specific code generators for tensor compilers, with the stated goal of improving th…
▽ More
Despite significant investment in software infrastructure, machine learning systems, runtimes and compilers do not compose properly. We propose a new design aiming at providing unprecedented degrees of modularity, composability and genericity. This paper discusses a structured approach to the construction of domain-specific code generators for tensor compilers, with the stated goal of improving the productivity of both compiler engineers and end-users. The approach leverages the natural structure of tensor algebra. It has been the main driver for the design of progressive lowering paths in \MLIR. The proposed abstractions and transformations span data structures and control flow with both functional (SSA form) and imperative (side-effecting) semantics. We discuss the implications of this infrastructure on compiler construction and present preliminary experimental results.
△ Less
Submitted 7 February, 2022;
originally announced February 2022.
-
MLIR: A Compiler Infrastructure for the End of Moore's Law
Authors:
Chris Lattner,
Mehdi Amini,
Uday Bondhugula,
Albert Cohen,
Andy Davis,
Jacques Pienaar,
River Riddle,
Tatiana Shpeisman,
Nicolas Vasilache,
Oleksandr Zinenko
Abstract:
This work presents MLIR, a novel approach to building reusable and extensible compiler infrastructure. MLIR aims to address software fragmentation, improve compilation for heterogeneous hardware, significantly reduce the cost of building domain specific compilers, and aid in connecting existing compilers together. MLIR facilitates the design and implementation of code generators, translators and o…
▽ More
This work presents MLIR, a novel approach to building reusable and extensible compiler infrastructure. MLIR aims to address software fragmentation, improve compilation for heterogeneous hardware, significantly reduce the cost of building domain specific compilers, and aid in connecting existing compilers together. MLIR facilitates the design and implementation of code generators, translators and optimizers at different levels of abstraction and also across application domains, hardware targets and execution environments. The contribution of this work includes (1) discussion of MLIR as a research artifact, built for extension and evolution, and identifying the challenges and opportunities posed by this novel design point in design, semantics, optimization specification, system, and engineering. (2) evaluation of MLIR as a generalized infrastructure that reduces the cost of building compilers-describing diverse use-cases to show research and educational opportunities for future programming languages, compilers, execution environments, and computer architecture. The paper also presents the rationale for MLIR, its original design principles, structures and semantics.
△ Less
Submitted 29 February, 2020; v1 submitted 25 February, 2020;
originally announced February 2020.
-
Tensor Comprehensions: Framework-Agnostic High-Performance Machine Learning Abstractions
Authors:
Nicolas Vasilache,
Oleksandr Zinenko,
Theodoros Theodoridis,
Priya Goyal,
Zachary DeVito,
William S. Moses,
Sven Verdoolaege,
Andrew Adams,
Albert Cohen
Abstract:
Deep learning models with convolutional and recurrent networks are now ubiquitous and analyze massive amounts of audio, image, video, text and graph data, with applications in automatic translation, speech-to-text, scene understanding, ranking user preferences, ad placement, etc. Competing frameworks for building these networks such as TensorFlow, Chainer, CNTK, Torch/PyTorch, Caffe1/2, MXNet and…
▽ More
Deep learning models with convolutional and recurrent networks are now ubiquitous and analyze massive amounts of audio, image, video, text and graph data, with applications in automatic translation, speech-to-text, scene understanding, ranking user preferences, ad placement, etc. Competing frameworks for building these networks such as TensorFlow, Chainer, CNTK, Torch/PyTorch, Caffe1/2, MXNet and Theano, explore different tradeoffs between usability and expressiveness, research or production orientation and supported hardware. They operate on a DAG of computational operators, wrap** high-performance libraries such as CUDNN (for NVIDIA GPUs) or NNPACK (for various CPUs), and automate memory allocation, synchronization, distribution. Custom operators are needed where the computation does not fit existing high-performance library calls, usually at a high engineering cost. This is frequently required when new operators are invented by researchers: such operators suffer a severe performance penalty, which limits the pace of innovation. Furthermore, even if there is an existing runtime call these frameworks can use, it often doesn't offer optimal performance for a user's particular network architecture and dataset, missing optimizations between operators as well as optimizations that can be done knowing the size and shape of data. Our contributions include (1) a language close to the mathematics of deep learning called Tensor Comprehensions, (2) a polyhedral Just-In-Time compiler to convert a mathematical description of a deep learning DAG into a CUDA kernel with delegated memory management and synchronization, also providing optimizations such as operator fusion and specialization for specific sizes, (3) a compilation cache populated by an autotuner. [Abstract cutoff]
△ Less
Submitted 28 June, 2018; v1 submitted 13 February, 2018;
originally announced February 2018.
-
Diagonal Rescaling For Neural Networks
Authors:
Jean Lafond,
Nicolas Vasilache,
Léon Bottou
Abstract:
We define a second-order neural network stochastic gradient training algorithm whose block-diagonal structure effectively amounts to normalizing the unit activations. Investigating why this algorithm lacks in robustness then reveals two interesting insights. The first insight suggests a new way to scale the stepsizes, clarifying popular algorithms such as RMSProp as well as old neural network tric…
▽ More
We define a second-order neural network stochastic gradient training algorithm whose block-diagonal structure effectively amounts to normalizing the unit activations. Investigating why this algorithm lacks in robustness then reveals two interesting insights. The first insight suggests a new way to scale the stepsizes, clarifying popular algorithms such as RMSProp as well as old neural network tricks such as fanin stepsize scaling. The second insight stresses the practical importance of dealing with fast changes of the curvature of the cost.
△ Less
Submitted 25 May, 2017;
originally announced May 2017.
-
Training Language Models Using Target-Propagation
Authors:
Sam Wiseman,
Sumit Chopra,
Marc'Aurelio Ranzato,
Arthur Szlam,
Ruoyu Sun,
Soumith Chintala,
Nicolas Vasilache
Abstract:
While Truncated Back-Propagation through Time (BPTT) is the most popular approach to training Recurrent Neural Networks (RNNs), it suffers from being inherently sequential (making parallelization difficult) and from truncating gradient flow between distant time-steps. We investigate whether Target Propagation (TPROP) style approaches can address these shortcomings. Unfortunately, extensive experim…
▽ More
While Truncated Back-Propagation through Time (BPTT) is the most popular approach to training Recurrent Neural Networks (RNNs), it suffers from being inherently sequential (making parallelization difficult) and from truncating gradient flow between distant time-steps. We investigate whether Target Propagation (TPROP) style approaches can address these shortcomings. Unfortunately, extensive experiments suggest that TPROP generally underperforms BPTT, and we end with an analysis of this phenomenon, and suggestions for future work.
△ Less
Submitted 15 February, 2017;
originally announced February 2017.
-
Learning Visual Features from Large Weakly Supervised Data
Authors:
Armand Joulin,
Laurens van der Maaten,
Allan Jabri,
Nicolas Vasilache
Abstract:
Convolutional networks trained on large supervised dataset produce visual features which form the basis for the state-of-the-art in many computer-vision problems. Further improvements of these visual features will likely require even larger manually labeled data sets, which severely limits the pace at which progress can be made. In this paper, we explore the potential of leveraging massive, weakly…
▽ More
Convolutional networks trained on large supervised dataset produce visual features which form the basis for the state-of-the-art in many computer-vision problems. Further improvements of these visual features will likely require even larger manually labeled data sets, which severely limits the pace at which progress can be made. In this paper, we explore the potential of leveraging massive, weakly-labeled image collections for learning good visual features. We train convolutional networks on a dataset of 100 million Flickr photos and captions, and show that these networks produce features that perform well in a range of vision problems. We also show that the networks appropriately capture word similarity, and learn correspondences between different languages.
△ Less
Submitted 6 November, 2015;
originally announced November 2015.
-
Fast Convolutional Nets With fbfft: A GPU Performance Evaluation
Authors:
Nicolas Vasilache,
Jeff Johnson,
Michael Mathieu,
Soumith Chintala,
Serkan Piantino,
Yann LeCun
Abstract:
We examine the performance profile of Convolutional Neural Network training on the current generation of NVIDIA Graphics Processing Units. We introduce two new Fast Fourier Transform convolution implementations: one based on NVIDIA's cuFFT library, and another based on a Facebook authored FFT implementation, fbfft, that provides significant speedups over cuFFT (over 1.5x) for whole CNNs. Both of t…
▽ More
We examine the performance profile of Convolutional Neural Network training on the current generation of NVIDIA Graphics Processing Units. We introduce two new Fast Fourier Transform convolution implementations: one based on NVIDIA's cuFFT library, and another based on a Facebook authored FFT implementation, fbfft, that provides significant speedups over cuFFT (over 1.5x) for whole CNNs. Both of these convolution implementations are available in open source, and are faster than NVIDIA's cuDNN implementation for many common convolutional layers (up to 23.5x for some synthetic kernel configurations). We discuss different performance regimes of convolutions, comparing areas where straightforward time domain convolutions outperform Fourier frequency domain convolutions. Details on algorithmic applications of NVIDIA GPU hardware specifics in the implementation of fbfft are also provided.
△ Less
Submitted 10 April, 2015; v1 submitted 23 December, 2014;
originally announced December 2014.
-
A Tale of Three Runtimes
Authors:
Nicolas Vasilache,
Muthu Baskaran,
Tom Henretty,
Benoit Meister,
M. Harper Langston,
Sanket Tavarageri,
Richard Lethin
Abstract:
This contribution discusses the automatic generation of event-driven, tuple-space based programs for task-oriented execution models from a sequential C specification. We developed a hierarchical map** solution using auto-parallelizing compiler technology to target three different runtimes relying on event-driven tasks (EDTs). Our solution benefits from the important observation that loop types e…
▽ More
This contribution discusses the automatic generation of event-driven, tuple-space based programs for task-oriented execution models from a sequential C specification. We developed a hierarchical map** solution using auto-parallelizing compiler technology to target three different runtimes relying on event-driven tasks (EDTs). Our solution benefits from the important observation that loop types encode short, transitive relations among EDTs that are compact and efficiently evaluated at runtime. In this context, permutable loops are of particular importance as they translate immediately into conservative point-to-point synchronizations of distance 1. Our solution generates calls into a runtime-agnostic C++ layer, which we have retargeted to Intel's Concurrent Collections (CnC), ETI's SWARM, and the Open Community Runtime (OCR). Experience with other runtime systems motivates our introduction of support for hierarchical async-finishes in CnC. Experimental data is provided to show the benefit of automatically generated code for EDT-based runtimes as well as comparisons across runtimes.
△ Less
Submitted 5 September, 2014;
originally announced September 2014.