Search | arXiv e-print repository

RecurrentGemma: Moving Past Transformers for Efficient Open Language Models

Authors: Aleksandar Botev, Soham De, Samuel L Smith, Anushan Fernando, George-Cristian Muraru, Ruba Haroun, Leonard Berrada, Razvan Pascanu, Pier Giuseppe Sessa, Robert Dadashi, Léonard Hussenot, Johan Ferret, Sertan Girgin, Olivier Bachem, Alek Andreev, Kathleen Kenealy, Thomas Mesnard, Cassidy Hardin, Surya Bhupatiraju, Shreya Pathak, Laurent Sifre, Morgane Rivière, Mihir Sanjay Kale, Juliette Love, Pouya Tafti , et al. (37 additional authors not shown)

Abstract: We introduce RecurrentGemma, an open language model which uses Google's novel Griffin architecture. Griffin combines linear recurrences with local attention to achieve excellent performance on language. It has a fixed-sized state, which reduces memory use and enables efficient inference on long sequences. We provide a pre-trained model with 2B non-embedding parameters, and an instruction tuned var… ▽ More We introduce RecurrentGemma, an open language model which uses Google's novel Griffin architecture. Griffin combines linear recurrences with local attention to achieve excellent performance on language. It has a fixed-sized state, which reduces memory use and enables efficient inference on long sequences. We provide a pre-trained model with 2B non-embedding parameters, and an instruction tuned variant. Both models achieve comparable performance to Gemma-2B despite being trained on fewer tokens. △ Less

Submitted 11 April, 2024; originally announced April 2024.

arXiv:2403.05530 [pdf, other]

Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context

Authors: Gemini Team, Petko Georgiev, Ving Ian Lei, Ryan Burnell, Libin Bai, Anmol Gulati, Garrett Tanzer, Damien Vincent, Zhufeng Pan, Shibo Wang, Soroosh Mariooryad, Yifan Ding, Xinyang Geng, Fred Alcober, Roy Frostig, Mark Omernick, Lexi Walker, Cosmin Paduraru, Christina Sorokin, Andrea Tacchetti, Colin Gaffney, Samira Daruki, Olcan Sercinoglu, Zach Gleicher, Juliette Love , et al. (1092 additional authors not shown)

Abstract: In this report, we introduce the Gemini 1.5 family of models, representing the next generation of highly compute-efficient multimodal models capable of recalling and reasoning over fine-grained information from millions of tokens of context, including multiple long documents and hours of video and audio. The family includes two new models: (1) an updated Gemini 1.5 Pro, which exceeds the February… ▽ More In this report, we introduce the Gemini 1.5 family of models, representing the next generation of highly compute-efficient multimodal models capable of recalling and reasoning over fine-grained information from millions of tokens of context, including multiple long documents and hours of video and audio. The family includes two new models: (1) an updated Gemini 1.5 Pro, which exceeds the February version on the great majority of capabilities and benchmarks; (2) Gemini 1.5 Flash, a more lightweight variant designed for efficiency with minimal regression in quality. Gemini 1.5 models achieve near-perfect recall on long-context retrieval tasks across modalities, improve the state-of-the-art in long-document QA, long-video QA and long-context ASR, and match or surpass Gemini 1.0 Ultra's state-of-the-art performance across a broad set of benchmarks. Studying the limits of Gemini 1.5's long-context ability, we find continued improvement in next-token prediction and near-perfect retrieval (>99%) up to at least 10M tokens, a generational leap over existing models such as Claude 3.0 (200k) and GPT-4 Turbo (128k). Finally, we highlight real-world use cases, such as Gemini 1.5 collaborating with professionals on completing their tasks achieving 26 to 75% time savings across 10 different job categories, as well as surprising new capabilities of large language models at the frontier; when given a grammar manual for Kalamang, a language with fewer than 200 speakers worldwide, the model learns to translate English to Kalamang at a similar level to a person who learned from the same content. △ Less

Submitted 14 June, 2024; v1 submitted 8 March, 2024; originally announced March 2024.

arXiv:2401.11202 [pdf, other]

PartIR: Composing SPMD Partitioning Strategies for Machine Learning

Authors: Sami Alabed, Daniel Belov, Bart Chrzaszcz, Juliana Franco, Dominik Grewe, Dougal Maclaurin, James Molloy, Tom Natan, Tamara Norman, Xiaoyue Pan, Adam Paszke, Norman A. Rink, Michael Schaarschmidt, Timur Sitdikov, Agnieszka Swietlik, Dimitrios Vytiniotis, Joel Wee

Abstract: Training of modern large neural networks (NN) requires a combination of parallelization strategies encompassing data, model, or optimizer sharding. When strategies increase in complexity, it becomes necessary for partitioning tools to be 1) expressive, allowing the composition of simpler strategies, and 2) predictable to estimate performance analytically. We present PartIR, our design for a NN par… ▽ More Training of modern large neural networks (NN) requires a combination of parallelization strategies encompassing data, model, or optimizer sharding. When strategies increase in complexity, it becomes necessary for partitioning tools to be 1) expressive, allowing the composition of simpler strategies, and 2) predictable to estimate performance analytically. We present PartIR, our design for a NN partitioning system. PartIR is focused on an incremental approach to rewriting and is hardware-and-runtime agnostic. We present a simple but powerful API for composing sharding strategies and a simulator to validate them. The process is driven by high-level programmer-issued partitioning tactics, which can be both manual and automatic. Importantly, the tactics are specified separately from the model code, making them easy to change. We evaluate PartIR on several different models to demonstrate its predictability, expressibility, and ability to reach peak performance.. △ Less

Submitted 3 March, 2024; v1 submitted 20 January, 2024; originally announced January 2024.

arXiv:2312.11805 [pdf, other]

Gemini: A Family of Highly Capable Multimodal Models

Authors: Gemini Team, Rohan Anil, Sebastian Borgeaud, Jean-Baptiste Alayrac, Jiahui Yu, Radu Soricut, Johan Schalkwyk, Andrew M. Dai, Anja Hauth, Katie Millican, David Silver, Melvin Johnson, Ioannis Antonoglou, Julian Schrittwieser, Amelia Glaese, Jilin Chen, Emily Pitler, Timothy Lillicrap, Angeliki Lazaridou, Orhan Firat, James Molloy, Michael Isard, Paul R. Barham, Tom Hennigan, Benjamin Lee , et al. (1325 additional authors not shown)

Abstract: This report introduces a new family of multimodal models, Gemini, that exhibit remarkable capabilities across image, audio, video, and text understanding. The Gemini family consists of Ultra, Pro, and Nano sizes, suitable for applications ranging from complex reasoning tasks to on-device memory-constrained use-cases. Evaluation on a broad range of benchmarks shows that our most-capable Gemini Ultr… ▽ More This report introduces a new family of multimodal models, Gemini, that exhibit remarkable capabilities across image, audio, video, and text understanding. The Gemini family consists of Ultra, Pro, and Nano sizes, suitable for applications ranging from complex reasoning tasks to on-device memory-constrained use-cases. Evaluation on a broad range of benchmarks shows that our most-capable Gemini Ultra model advances the state of the art in 30 of 32 of these benchmarks - notably being the first model to achieve human-expert performance on the well-studied exam benchmark MMLU, and improving the state of the art in every one of the 20 multimodal benchmarks we examined. We believe that the new capabilities of the Gemini family in cross-modal reasoning and language understanding will enable a wide variety of use cases. We discuss our approach toward post-training and deploying Gemini models responsibly to users through services including Gemini, Gemini Advanced, Google AI Studio, and Cloud Vertex AI. △ Less

Submitted 17 June, 2024; v1 submitted 18 December, 2023; originally announced December 2023.

arXiv:2210.04729 [pdf, ps, other]

The Foil: Capture-Avoiding Substitution With No Sharp Edges

Authors: Dougal Maclaurin, Alexey Radul, Adam Paszke

Abstract: Correctly manipulating program terms in a compiler is surprisingly difficult because of the need to avoid name capture. The rapier from "Secrets of the Glasgow Haskell Compiler inliner" is a cutting-edge technique for fast, stateless capture-avoiding substitution for expressions represented with explicit names. It is, however, a sharp tool: its invariants are tricky and need to be maintained throu… ▽ More Correctly manipulating program terms in a compiler is surprisingly difficult because of the need to avoid name capture. The rapier from "Secrets of the Glasgow Haskell Compiler inliner" is a cutting-edge technique for fast, stateless capture-avoiding substitution for expressions represented with explicit names. It is, however, a sharp tool: its invariants are tricky and need to be maintained throughout the whole compiler that uses it. We describe the foil, an elaboration of the rapier that uses Haskell's type system to enforce the rapier's invariants statically, preventing a class of hard-to-find bugs, but without adding any run-time overheads. △ Less

Submitted 10 October, 2022; originally announced October 2022.

Comments: Presented at IFL 2022

arXiv:2204.10923 [pdf, other]

You Only Linearize Once: Tangents Transpose to Gradients

Authors: Alexey Radul, Adam Paszke, Roy Frostig, Matthew Johnson, Dougal Maclaurin

Abstract: Automatic differentiation (AD) is conventionally understood as a family of distinct algorithms, rooted in two "modes" -- forward and reverse -- which are typically presented (and implemented) separately. Can there be only one? Following up on the AD systems developed in the JAX and Dex projects, we formalize a decomposition of reverse-mode AD into (i) forward-mode AD followed by (ii) unzip** the… ▽ More Automatic differentiation (AD) is conventionally understood as a family of distinct algorithms, rooted in two "modes" -- forward and reverse -- which are typically presented (and implemented) separately. Can there be only one? Following up on the AD systems developed in the JAX and Dex projects, we formalize a decomposition of reverse-mode AD into (i) forward-mode AD followed by (ii) unzip** the linear and non-linear parts and then (iii) transposition of the linear part. To that end, we define a (substructurally) linear type system that can prove a class of functions are (algebraically) linear. Our main results are that forward-mode AD produces such linear functions, and that we can unzip and transpose any such linear function, conserving cost, size, and linearity. Composing these three transformations recovers reverse-mode AD. This decomposition also sheds light on checkpointing, which emerges naturally from a free choice in unzip** `let` expressions. As a corollary, checkpointing techniques are applicable to general-purpose partial evaluation, not just AD. We hope that our formalization will lead to a deeper understanding of automatic differentiation and that it will simplify implementations, by separating the concerns of differentiation proper from the concerns of gaining efficiency (namely, separating the derivative computation from the act of running it backward). △ Less

Submitted 6 December, 2022; v1 submitted 22 April, 2022; originally announced April 2022.

arXiv:2112.02958 [pdf, other]

Automap: Towards Ergonomic Automated Parallelism for ML Models

Authors: Michael Schaarschmidt, Dominik Grewe, Dimitrios Vytiniotis, Adam Paszke, Georg Stefan Schmid, Tamara Norman, James Molloy, Jonathan Godwin, Norman Alexander Rink, Vinod Nair, Dan Belov

Abstract: The rapid rise in demand for training large neural network architectures has brought into focus the need for partitioning strategies, for example by using data, model, or pipeline parallelism. Implementing these methods is increasingly supported through program primitives, but identifying efficient partitioning strategies requires expensive experimentation and expertise. We present the prototype o… ▽ More The rapid rise in demand for training large neural network architectures has brought into focus the need for partitioning strategies, for example by using data, model, or pipeline parallelism. Implementing these methods is increasingly supported through program primitives, but identifying efficient partitioning strategies requires expensive experimentation and expertise. We present the prototype of an automated partitioner that seamlessly integrates into existing compilers and existing user workflows. Our partitioner enables SPMD-style parallelism that encompasses data parallelism and parameter/activation sharding. Through a combination of inductive tactics and search in a platform-independent partitioning IR, automap can recover expert partitioning strategies such as Megatron sharding for transformer layers. △ Less

Submitted 6 December, 2021; originally announced December 2021.

Comments: Workshop on ML for Systems at NeurIPS 2021

arXiv:2112.01075 [pdf, other]

Memory-efficient array redistribution through portable collective communication

Authors: Norman A. Rink, Adam Paszke, Dimitrios Vytiniotis, Georg Stefan Schmid

Abstract: Modern large-scale deep learning workloads highlight the need for parallel execution across many devices in order to fit model data into hardware accelerator memories. In these settings, array redistribution may be required during a computation, but can also become a bottleneck if not done efficiently. In this paper we address the problem of redistributing multi-dimensional array data in SPMD comp… ▽ More Modern large-scale deep learning workloads highlight the need for parallel execution across many devices in order to fit model data into hardware accelerator memories. In these settings, array redistribution may be required during a computation, but can also become a bottleneck if not done efficiently. In this paper we address the problem of redistributing multi-dimensional array data in SPMD computations, the most prevalent form of parallelism in deep learning. We present a type-directed approach to synthesizing array redistributions as sequences of MPI-style collective operations. We prove formally that our synthesized redistributions are memory-efficient and perform no excessive data transfers. Array redistribution for SPMD computations using collective operations has also been implemented in the context of the XLA SPMD partitioner, a production-grade tool for partitioning programs across accelerator systems. We evaluate our approach against the XLA implementation and find that our approach delivers a geometric mean speedup of $1.22\times$, with maximum speedups as a high as $5.7\times$, while offering provable memory guarantees, making our system particularly appealing for large-scale models. △ Less

Submitted 28 November, 2022; v1 submitted 2 December, 2021; originally announced December 2021.

Comments: minor errata fixed

arXiv:2110.07493 [pdf, ps, other]

Parallel Algebraic Effect Handlers

Authors: Ningning Xie, Daniel D. Johnson, Dougal Maclaurin, Adam Paszke

Abstract: Algebraic effects and handlers support composable and structured control-flow abstraction. However, existing designs of algebraic effects often require effects to be executed sequentially. This paper studies parallel algebraic effect handlers. In particular, we formalize λp, an untyped lambda calculus which models two key features, effect handlers and parallelizable computations, the latter of whi… ▽ More Algebraic effects and handlers support composable and structured control-flow abstraction. However, existing designs of algebraic effects often require effects to be executed sequentially. This paper studies parallel algebraic effect handlers. In particular, we formalize λp, an untyped lambda calculus which models two key features, effect handlers and parallelizable computations, the latter of which takes the form of a for expression as inspired by the Dex programming language. We present various interesting examples expressible in our calculus, and provide a Haskell implementation. We hope this paper provides a basis for future designs and implementations of parallel algebraic effect handlers. △ Less

Submitted 14 October, 2021; originally announced October 2021.

Comments: Short paper submitted to the ACM SIGPLAN Workshop on Partial Evaluation and Program Manipulation (PEPM) 2022

arXiv:2105.09469 [pdf, other]

Decomposing reverse-mode automatic differentiation

Authors: Roy Frostig, Matthew J. Johnson, Dougal Maclaurin, Adam Paszke, Alexey Radul

Abstract: We decompose reverse-mode automatic differentiation into (forward-mode) linearization followed by transposition. Doing so isolates the essential difference between forward- and reverse-mode AD, and simplifies their joint implementation. In particular, once forward-mode AD rules are defined for every primitive operation in a source language, only linear primitives require an additional transpositio… ▽ More We decompose reverse-mode automatic differentiation into (forward-mode) linearization followed by transposition. Doing so isolates the essential difference between forward- and reverse-mode AD, and simplifies their joint implementation. In particular, once forward-mode AD rules are defined for every primitive operation in a source language, only linear primitives require an additional transposition rule in order to arrive at a complete reverse-mode AD implementation. This is how reverse-mode AD is written in JAX and Dex. △ Less

Submitted 19 May, 2021; originally announced May 2021.

Comments: Presented at the LAFI 2021 workshop at POPL, 17 January 2021

arXiv:2104.05372 [pdf, other]

Getting to the Point. Index Sets and Parallelism-Preserving Autodiff for Pointful Array Programming

Authors: Adam Paszke, Daniel Johnson, David Duvenaud, Dimitrios Vytiniotis, Alexey Radul, Matthew Johnson, Jonathan Ragan-Kelley, Dougal Maclaurin

Abstract: We present a novel programming language design that attempts to combine the clarity and safety of high-level functional languages with the efficiency and parallelism of low-level numerical languages. We treat arrays as eagerly-memoized functions on typed index sets, allowing abstract function manipulations, such as currying, to work on arrays. In contrast to composing primitive bulk-array operatio… ▽ More We present a novel programming language design that attempts to combine the clarity and safety of high-level functional languages with the efficiency and parallelism of low-level numerical languages. We treat arrays as eagerly-memoized functions on typed index sets, allowing abstract function manipulations, such as currying, to work on arrays. In contrast to composing primitive bulk-array operations, we argue for an explicit nested indexing style that mirrors application of functions to arguments. We also introduce a fine-grained typed effects system which affords concise and automatically-parallelized in-place updates. Specifically, an associative accumulation effect allows reverse-mode automatic differentiation of in-place updates in a way that preserves parallelism. Empirically, we benchmark against the Futhark array programming language, and demonstrate that aggressive inlining and type-driven compilation allows array programs to be written in an expressive, "pointful" style with little performance penalty. △ Less

Submitted 12 April, 2021; originally announced April 2021.

Comments: 31 pages with appendix, 11 figures. A conference submission is still under review

arXiv:2102.13254 [pdf, ps, other]

Tensors Fitting Perfectly

Authors: Adam Paszke, Brennan Saeta

Abstract: Multidimensional arrays (NDArrays) are a central abstraction in modern scientific computing environments. Unfortunately, they can make reasoning about programs harder as the number of different array shapes used in an execution of a program is usually very large, and they rarely appear explicitly in program text. To make things worse, many operators make implicit assumptions about the shapes of th… ▽ More Multidimensional arrays (NDArrays) are a central abstraction in modern scientific computing environments. Unfortunately, they can make reasoning about programs harder as the number of different array shapes used in an execution of a program is usually very large, and they rarely appear explicitly in program text. To make things worse, many operators make implicit assumptions about the shapes of their inputs: array addition is commonly enriched with broadcasting semantics, while matrix multiplication assumes that the lengths of contracted dimensions are equal. Because precise reasoning about shapes is crucial to write correct programs using NDArrays, and because shapes are often hard to infer from a quick glance at the program, we developed Tensors Fitting Perfectly, a static analysis tool that reasons about NDArray shapes in Swift for TensorFlow programs by synthesizing a set of shape constraints from an abstract interpretation of the program. It can both (1) check for possible inconsistencies, and (2) provide direct insights about the shapes of intermediate values appearing in the program, including via a mechanism called shape holes. The static analysis works in concert with optional runtime assertions to improve the productivity of program authors. △ Less

Submitted 25 February, 2021; originally announced February 2021.

arXiv:2006.15704 [pdf, other]

PyTorch Distributed: Experiences on Accelerating Data Parallel Training

Authors: Shen Li, Yanli Zhao, Rohan Varma, Omkar Salpekar, Pieter Noordhuis, Teng Li, Adam Paszke, Jeff Smith, Brian Vaughan, Pritam Damania, Soumith Chintala

Abstract: This paper presents the design, implementation, and evaluation of the PyTorch distributed data parallel module. PyTorch is a widely-adopted scientific computing package used in deep learning research and applications. Recent advances in deep learning argue for the value of large datasets and large models, which necessitates the ability to scale out model training to more computational resources. D… ▽ More This paper presents the design, implementation, and evaluation of the PyTorch distributed data parallel module. PyTorch is a widely-adopted scientific computing package used in deep learning research and applications. Recent advances in deep learning argue for the value of large datasets and large models, which necessitates the ability to scale out model training to more computational resources. Data parallelism has emerged as a popular solution for distributed training thanks to its straightforward principle and broad applicability. In general, the technique of distributed data parallelism replicates the model on every computational resource to generate gradients independently and then communicates those gradients at each iteration to keep model replicas consistent. Despite the conceptual simplicity of the technique, the subtle dependencies between computation and communication make it non-trivial to optimize the distributed training efficiency. As of v1.5, PyTorch natively provides several techniques to accelerate distributed data parallel, including bucketing gradients, overlap** computation with communication, and skip** gradient synchronization. Evaluations show that, when configured appropriately, the PyTorch distributed data parallel module attains near-linear scalability using 256 GPUs. △ Less

Submitted 28 June, 2020; originally announced June 2020.

Comments: To appear in VLDB 2020

arXiv:2003.14177 [pdf, other]

VC density of set systems defnable in tree-like graphs

Authors: Adam Paszke, Michał Pilipczuk

Abstract: We study set systems definable in graphs using variants of logic with different expressive power. Our focus is on the notion of Vapnik-Chervonenkis density: the smallest possible degree of a polynomial bounding the cardinalities of restrictions of such set systems. On one hand, we prove that if $\varphi(\bar x,\bar y)$ is a fixed CMSO$_1$ formula and $\cal C$ is a class of graphs with uniformly bo… ▽ More We study set systems definable in graphs using variants of logic with different expressive power. Our focus is on the notion of Vapnik-Chervonenkis density: the smallest possible degree of a polynomial bounding the cardinalities of restrictions of such set systems. On one hand, we prove that if $\varphi(\bar x,\bar y)$ is a fixed CMSO$_1$ formula and $\cal C$ is a class of graphs with uniformly bounded cliquewidth, then the set systems defined by $\varphi$ in graphs from $\cal C$ have VC density at most $|\bar y|$, which is the smallest bound that one could expect. We also show an analogous statement for the case when $\varphi(\bar x,\bar y)$ is a CMSO$_2$ formula and $\cal C$ is a class of graphs with uniformly bounded treewidth. We complement these results by showing that if $\cal C$ has unbounded cliquewidth (respectively, treewidth), then, under some mild technical assumptions on $\cal C$, the set systems definable by CMSO$_1$ (respectively, CMSO$_2$) formulas in graphs from $\cal C$ may have unbounded VC dimension, hence also unbounded VC density. △ Less

Submitted 31 March, 2020; originally announced March 2020.

Comments: 14 pages, 1 figure

arXiv:1912.01703 [pdf, other]

PyTorch: An Imperative Style, High-Performance Deep Learning Library

Authors: Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, Alban Desmaison, Andreas Köpf, Edward Yang, Zach DeVito, Martin Raison, Alykhan Tejani, Sasank Chilamkurthy, Benoit Steiner, Lu Fang, Junjie Bai, Soumith Chintala

Abstract: Deep learning frameworks have often focused on either usability or speed, but not both. PyTorch is a machine learning library that shows that these two goals are in fact compatible: it provides an imperative and Pythonic programming style that supports code as a model, makes debugging easy and is consistent with other popular scientific computing libraries, while remaining efficient and supporting… ▽ More Deep learning frameworks have often focused on either usability or speed, but not both. PyTorch is a machine learning library that shows that these two goals are in fact compatible: it provides an imperative and Pythonic programming style that supports code as a model, makes debugging easy and is consistent with other popular scientific computing libraries, while remaining efficient and supporting hardware accelerators such as GPUs. In this paper, we detail the principles that drove the implementation of PyTorch and how they are reflected in its architecture. We emphasize that every aspect of PyTorch is a regular Python program under the full control of its user. We also explain how the careful and pragmatic implementation of the key components of its runtime enables them to work together to achieve compelling performance. We demonstrate the efficiency of individual subsystems, as well as the overall speed of PyTorch on several common benchmarks. △ Less

Submitted 3 December, 2019; originally announced December 2019.

Comments: 12 pages, 3 figures, NeurIPS 2019

arXiv:1606.02147 [pdf, other]

ENet: A Deep Neural Network Architecture for Real-Time Semantic Segmentation

Authors: Adam Paszke, Abhishek Chaurasia, Sangpil Kim, Eugenio Culurciello

Abstract: The ability to perform pixel-wise semantic segmentation in real-time is of paramount importance in mobile applications. Recent deep neural networks aimed at this task have the disadvantage of requiring a large number of floating point operations and have long run-times that hinder their usability. In this paper, we propose a novel deep neural network architecture named ENet (efficient neural netwo… ▽ More The ability to perform pixel-wise semantic segmentation in real-time is of paramount importance in mobile applications. Recent deep neural networks aimed at this task have the disadvantage of requiring a large number of floating point operations and have long run-times that hinder their usability. In this paper, we propose a novel deep neural network architecture named ENet (efficient neural network), created specifically for tasks requiring low latency operation. ENet is up to 18$\times$ faster, requires 75$\times$ less FLOPs, has 79$\times$ less parameters, and provides similar or better accuracy to existing models. We have tested it on CamVid, Cityscapes and SUN datasets and report on comparisons with existing state-of-the-art methods, and the trade-offs between accuracy and processing time of a network. We present performance measurements of the proposed architecture on embedded systems and suggest possible software improvements that could make ENet even faster. △ Less

Submitted 7 June, 2016; originally announced June 2016.

arXiv:1605.07678 [pdf, other]

An Analysis of Deep Neural Network Models for Practical Applications

Authors: Alfredo Canziani, Adam Paszke, Eugenio Culurciello

Abstract: Since the emergence of Deep Neural Networks (DNNs) as a prominent technique in the field of computer vision, the ImageNet classification challenge has played a major role in advancing the state-of-the-art. While accuracy figures have steadily increased, the resource utilisation of winning models has not been properly taken into account. In this work, we present a comprehensive analysis of importan… ▽ More Since the emergence of Deep Neural Networks (DNNs) as a prominent technique in the field of computer vision, the ImageNet classification challenge has played a major role in advancing the state-of-the-art. While accuracy figures have steadily increased, the resource utilisation of winning models has not been properly taken into account. In this work, we present a comprehensive analysis of important metrics in practical applications: accuracy, memory footprint, parameters, operations count, inference time and power consumption. Key findings are: (1) power consumption is independent of batch size and architecture; (2) accuracy and inference time are in a hyperbolic relationship; (3) energy constraint is an upper bound on the maximum achievable accuracy and model complexity; (4) the number of operations is a reliable estimate of the inference time. We believe our analysis provides a compelling set of information that helps design and engineer efficient DNNs. △ Less

Submitted 14 April, 2017; v1 submitted 24 May, 2016; originally announced May 2016.

Comments: 7 pages, 10 figures, legend for Figure 2 got lost :/

Showing 1–17 of 17 results for author: Paszke, A