TinyIREE: An ML Execution Environment for Embedded Systems from Compilation to Deployment

Authors: Hsin-I Cindy Liu, Marius Brehler, Mahesh Ravishankar, Nicolas Vasilache, Ben Vanik, Stella Laurenzo

Abstract: Machine learning model deployment for training and execution has been an important topic for industry and academic research in the last decade. Much of the attention has been focused on developing specific toolchains to support acceleration hardware. In this paper, we present IREE, a unified compiler and runtime stack with the explicit goal to scale down machine learning programs to the smallest footprints for mobile and edge devices, while maintaining the ability to scale up to larger deployment targets. IREE adopts a compiler-based approach and optimizes for heterogeneous hardware accelerators through the use of the MLIR compiler infrastructure which provides the means to quickly design and implement multi-level compiler intermediate representations (IR). More specifically, this paper is focused on TinyIREE, which is a set of deployment options in IREE that accommodate the limited memory and computation resources in embedded systems and bare-metal platforms, while also demonstrating IREE's intuitive workflow that generates workloads for different ISA extensions and ABIs through LLVM.

Submitted 28 May, 2022; originally announced May 2022.

Comments: 9 pages, 3 figures, to be published in IEEE Micro

arXiv:2202.04305 [pdf, other]

doi 10.1145/3544559

Compiler Support for Sparse Tensor Computations in MLIR

Authors: Aart J. C. Bik, Penporn Koanantakool, Tatiana Shpeisman, Nicolas Vasilache, Bixia Zheng, Fredrik Kjolstad

Abstract: Sparse tensors arise in problems in science, engineering, machine learning, and data analytics. Programs that operate on such tensors can exploit sparsity to reduce storage requirements and computational time. Developing and maintaining sparse software by hand, however, is a complex and error-prone task. Therefore, we propose treating sparsity as a property of tensors, not a tedious implementation task, and letting a sparse compiler generate sparse code automatically from a sparsity-agnostic definition of the computation. This paper discusses integrating this idea into MLIR.

Submitted 9 February, 2022; originally announced February 2022.

arXiv:2202.03293 [pdf, other]

Composable and Modular Code Generation in MLIR: A Structured and Retargetable Approach to Tensor Compiler Construction

Authors: Nicolas Vasilache, Oleksandr Zinenko, Aart J. C. Bik, Mahesh Ravishankar, Thomas Raoux, Alexander Belyaev, Matthias Springer, Tobias Gysi, Diego Caballero, Stephan Herhut, Stella Laurenzo, Albert Cohen

Abstract: Despite significant investment in software infrastructure, machine learning systems, runtimes and compilers do not compose properly. We propose a new design aiming at providing unprecedented degrees of modularity, composability and genericity. This paper discusses a structured approach to the construction of domain-specific code generators for tensor compilers, with the stated goal of improving the productivity of both compiler engineers and end-users. The approach leverages the natural structure of tensor algebra. It has been the main driver for the design of progressive lowering paths in MLIR. The proposed abstractions and transformations span data structures and control flow with both functional (SSA form) and imperative (side-effecting) semantics. We discuss the implications of this infrastructure on compiler construction and present preliminary experimental results.

Submitted 7 February, 2022; originally announced February 2022.

arXiv:2002.11054 [pdf, other]

MLIR: A Compiler Infrastructure for the End of Moore's Law

Authors: Chris Lattner, Mehdi Amini, Uday Bondhugula, Albert Cohen, Andy Davis, Jacques Pienaar, River Riddle, Tatiana Shpeisman, Nicolas Vasilache, Oleksandr Zinenko

Abstract: This work presents MLIR, a novel approach to building reusable and extensible compiler infrastructure. MLIR aims to address software fragmentation, improve compilation for heterogeneous hardware, significantly reduce the cost of building domain specific compilers, and aid in connecting existing compilers together. MLIR facilitates the design and implementation of code generators, translators and optimizers at different levels of abstraction and also across application domains, hardware targets and execution environments. The contribution of this work includes (1) discussion of MLIR as a research artifact, built for extension and evolution, and identifying the challenges and opportunities posed by this novel design point in design, semantics, optimization specification, system, and engineering; (2) evaluation of MLIR as a generalized infrastructure that reduces the cost of building compilers, describing diverse use-cases to show research and educational opportunities for future programming languages, compilers, execution environments, and computer architecture. The paper also presents the rationale for MLIR, its original design principles, structures and semantics.

Submitted 29 February, 2020; v1 submitted 25 February, 2020; originally announced February 2020.

arXiv:1802.04730 [pdf, other]

Tensor Comprehensions: Framework-Agnostic High-Performance Machine Learning Abstractions

Authors: Nicolas Vasilache, Oleksandr Zinenko, Theodoros Theodoridis, Priya Goyal, Zachary DeVito, William S. Moses, Sven Verdoolaege, Andrew Adams, Albert Cohen

Abstract: Deep learning models with convolutional and recurrent networks are now ubiquitous and analyze massive amounts of audio, image, video, text and graph data, with applications in automatic translation, speech-to-text, scene understanding, ranking user preferences, ad placement, etc. Competing frameworks for building these networks such as TensorFlow, Chainer, CNTK, Torch/PyTorch, Caffe1/2, MXNet and Theano, explore different tradeoffs between usability and expressiveness, research or production orientation and supported hardware. They operate on a DAG of computational operators, wrapping high-performance libraries such as CUDNN (for NVIDIA GPUs) or NNPACK (for various CPUs), and automate memory allocation, synchronization, distribution. Custom operators are needed where the computation does not fit existing high-performance library calls, usually at a high engineering cost. This is frequently required when new operators are invented by researchers: such operators suffer a severe performance penalty, which limits the pace of innovation. Furthermore, even if there is an existing runtime call these frameworks can use, it often doesn't offer optimal performance for a user's particular network architecture and dataset, missing optimizations between operators as well as optimizations that can be done knowing the size and shape of data.
Our contributions include (1) a language close to the mathematics of deep learning called Tensor Comprehensions, (2) a polyhedral Just-In-Time compiler to convert a mathematical description of a deep learning DAG into a CUDA kernel with delegated memory management and synchronization, also providing optimizations such as operator fusion and specialization for specific sizes, (3) a compilation cache populated by an autotuner. [Abstract cutoff]

Submitted 28 June, 2018; v1 submitted 13 February, 2018; originally announced February 2018.

arXiv:1705.09319 [pdf, other]

Diagonal Rescaling For Neural Networks

Authors: Jean Lafond, Nicolas Vasilache, Léon Bottou

Abstract: We define a second-order neural network stochastic gradient training algorithm whose block-diagonal structure effectively amounts to normalizing the unit activations. Investigating why this algorithm lacks in robustness then reveals two interesting insights. The first insight suggests a new way to scale the stepsizes, clarifying popular algorithms such as RMSProp as well as old neural network tricks such as fanin stepsize scaling. The second insight stresses the practical importance of dealing with fast changes of the curvature of the cost.

Submitted 25 May, 2017; originally announced May 2017.

arXiv:1702.04770 [pdf, other]

Training Language Models Using Target-Propagation

Authors: Sam Wiseman, Sumit Chopra, Marc'Aurelio Ranzato, Arthur Szlam, Ruoyu Sun, Soumith Chintala, Nicolas Vasilache

Abstract: While Truncated Back-Propagation through Time (BPTT) is the most popular approach to training Recurrent Neural Networks (RNNs), it suffers from being inherently sequential (making parallelization difficult) and from truncating gradient flow between distant time-steps. We investigate whether Target Propagation (TPROP) style approaches can address these shortcomings. Unfortunately, extensive experiments suggest that TPROP generally underperforms BPTT, and we end with an analysis of this phenomenon, and suggestions for future work.

Submitted 15 February, 2017; originally announced February 2017.

arXiv:1511.02251 [pdf, other]

Learning Visual Features from Large Weakly Supervised Data

Authors: Armand Joulin, Laurens van der Maaten, Allan Jabri, Nicolas Vasilache

Abstract: Convolutional networks trained on large supervised dataset produce visual features which form the basis for the state-of-the-art in many computer-vision problems. Further improvements of these visual features will likely require even larger manually labeled data sets, which severely limits the pace at which progress can be made. In this paper, we explore the potential of leveraging massive, weakly-labeled image collections for learning good visual features. We train convolutional networks on a dataset of 100 million Flickr photos and captions, and show that these networks produce features that perform well in a range of vision problems. We also show that the networks appropriately capture word similarity, and learn correspondences between different languages.

Submitted 6 November, 2015; originally announced November 2015.

arXiv:1412.7580 [pdf, ps, other]

Fast Convolutional Nets With fbfft: A GPU Performance Evaluation

Authors: Nicolas Vasilache, Jeff Johnson, Michael Mathieu, Soumith Chintala, Serkan Piantino, Yann LeCun

Abstract: We examine the performance profile of Convolutional Neural Network training on the current generation of NVIDIA Graphics Processing Units. We introduce two new Fast Fourier Transform convolution implementations: one based on NVIDIA's cuFFT library, and another based on a Facebook authored FFT implementation, fbfft, that provides significant speedups over cuFFT (over 1.5x) for whole CNNs. Both of these convolution implementations are available in open source, and are faster than NVIDIA's cuDNN implementation for many common convolutional layers (up to 23.5x for some synthetic kernel configurations). We discuss different performance regimes of convolutions, comparing areas where straightforward time domain convolutions outperform Fourier frequency domain convolutions. Details on algorithmic applications of NVIDIA GPU hardware specifics in the implementation of fbfft are also provided.

Submitted 10 April, 2015; v1 submitted 23 December, 2014; originally announced December 2014.

Comments: Camera ready for ICLR2015

arXiv:1409.1914 [pdf, ps, other]

A Tale of Three Runtimes

Authors: Nicolas Vasilache, Muthu Baskaran, Tom Henretty, Benoit Meister, M. Harper Langston, Sanket Tavarageri, Richard Lethin

Abstract: This contribution discusses the automatic generation of event-driven, tuple-space based programs for task-oriented execution models from a sequential C specification. We developed a hierarchical mapping solution using auto-parallelizing compiler technology to target three different runtimes relying on event-driven tasks (EDTs). Our solution benefits from the important observation that loop types encode short, transitive relations among EDTs that are compact and efficiently evaluated at runtime. In this context, permutable loops are of particular importance as they translate immediately into conservative point-to-point synchronizations of distance 1. Our solution generates calls into a runtime-agnostic C++ layer, which we have retargeted to Intel's Concurrent Collections (CnC), ETI's SWARM, and the Open Community Runtime (OCR). Experience with other runtime systems motivates our introduction of support for hierarchical async-finishes in CnC. Experimental data is provided to show the benefit of automatically generated code for EDT-based runtimes as well as comparisons across runtimes.

Submitted 5 September, 2014; originally announced September 2014.

Showing 1–10 of 10 results for author: Vasilache, N