Search | arXiv e-print repository

arXiv:2406.19522 [pdf, other]

Reliable edge machine learning hardware for scientific applications

Authors: Tommaso Baldi, Javier Campos, Ben Hawks, Jennifer Ngadiuba, Nhan Tran, Daniel Diaz, Javier Duarte, Ryan Kastner, Andres Meza, Melissa Quinnan, Olivia Weng, Caleb Geniesse, Amir Gholami, Michael W. Mahoney, Vladimir Loncar, Philip Harris, Joshua Agar, Shuyu Qin

Abstract: Extreme data rate scientific experiments create massive amounts of data that require efficient ML edge processing. This leads to unique validation challenges for VLSI implementations of ML algorithms: enabling bit-accurate functional simulations for performance validation in experimental software frameworks, verifying those ML models are robust under extreme quantization and pruning, and enabling… ▽ More Extreme data rate scientific experiments create massive amounts of data that require efficient ML edge processing. This leads to unique validation challenges for VLSI implementations of ML algorithms: enabling bit-accurate functional simulations for performance validation in experimental software frameworks, verifying those ML models are robust under extreme quantization and pruning, and enabling ultra-fine-grained model inspection for efficient fault tolerance. We discuss approaches to develo** and validating reliable algorithms at the scientific edge under such strict latency, resource, power, and area requirements in extreme experimental environments. We study metrics for develo** robust algorithms, present preliminary results and mitigation strategies, and conclude with an outlook of these and future directions of research towards the longer-term goal of develo** autonomous scientific experimentation methods for accelerated scientific discovery. △ Less

Submitted 27 June, 2024; originally announced June 2024.

Comments: IEEE VLSI Test Symposium 2024 (VTS)

Report number: FERMILAB-CONF-24-0116-CSAID

arXiv:2403.15042 [pdf, other]

LLM2LLM: Boosting LLMs with Novel Iterative Data Enhancement

Authors: Nicholas Lee, Thanakul Wattanawong, Sehoon Kim, Karttikeya Mangalam, Sheng Shen, Gopala Anumanchipali, Michael W. Mahoney, Kurt Keutzer, Amir Gholami

Abstract: Pretrained large language models (LLMs) are currently state-of-the-art for solving the vast majority of natural language processing tasks. While many real-world applications still require fine-tuning to reach satisfactory levels of performance, many of them are in the low-data regime, making fine-tuning challenging. To address this, we propose LLM2LLM, a targeted and iterative data augmentation st… ▽ More Pretrained large language models (LLMs) are currently state-of-the-art for solving the vast majority of natural language processing tasks. While many real-world applications still require fine-tuning to reach satisfactory levels of performance, many of them are in the low-data regime, making fine-tuning challenging. To address this, we propose LLM2LLM, a targeted and iterative data augmentation strategy that uses a teacher LLM to enhance a small seed dataset by augmenting additional data that can be used for fine-tuning on a specific task. LLM2LLM (1) fine-tunes a baseline student LLM on the initial seed data, (2) evaluates and extracts data points that the model gets wrong, and (3) uses a teacher LLM to generate synthetic data based on these incorrect data points, which are then added back into the training data. This approach amplifies the signal from incorrectly predicted data points by the LLM during training and reintegrates them into the dataset to focus on more challenging examples for the LLM. Our results show that LLM2LLM significantly enhances the performance of LLMs in the low-data regime, outperforming both traditional fine-tuning and other data augmentation baselines. LLM2LLM reduces the dependence on labor-intensive data curation and paves the way for more scalable and performant LLM solutions, allowing us to tackle data-constrained domains and tasks. We achieve improvements up to 24.2% on the GSM8K dataset, 32.6% on CaseHOLD, 32.0% on SNIPS, 52.6% on TREC and 39.8% on SST-2 over regular fine-tuning in the low-data regime using a LLaMA2-7B student model. △ Less

Submitted 22 March, 2024; originally announced March 2024.

Comments: Our code is available at https://github.com/SqueezeAILab/LLM2LLM

arXiv:2403.14123 [pdf, other]

AI and Memory Wall

Authors: Amir Gholami, Zhewei Yao, Sehoon Kim, Coleman Hooper, Michael W. Mahoney, Kurt Keutzer

Abstract: The availability of unprecedented unsupervised training data, along with neural scaling laws, has resulted in an unprecedented surge in model size and compute requirements for serving/training LLMs. However, the main performance bottleneck is increasingly shifting to memory bandwidth. Over the past 20 years, peak server hardware FLOPS has been scaling at 3.0x/2yrs, outpacing the growth of DRAM and… ▽ More The availability of unprecedented unsupervised training data, along with neural scaling laws, has resulted in an unprecedented surge in model size and compute requirements for serving/training LLMs. However, the main performance bottleneck is increasingly shifting to memory bandwidth. Over the past 20 years, peak server hardware FLOPS has been scaling at 3.0x/2yrs, outpacing the growth of DRAM and interconnect bandwidth, which have only scaled at 1.6 and 1.4 times every 2 years, respectively. This disparity has made memory, rather than compute, the primary bottleneck in AI applications, particularly in serving. Here, we analyze encoder and decoder Transformer models and show how memory bandwidth can become the dominant bottleneck for decoder models. We argue for a redesign in model architecture, training, and deployment strategies to overcome this memory limitation. △ Less

Submitted 21 March, 2024; originally announced March 2024.

Comments: Published in IEEE Micro Journal

arXiv:2401.18079 [pdf, other]

KVQuant: Towards 10 Million Context Length LLM Inference with KV Cache Quantization

Authors: Coleman Hooper, Sehoon Kim, Hiva Mohammadzadeh, Michael W. Mahoney, Yakun Sophia Shao, Kurt Keutzer, Amir Gholami

Abstract: LLMs are seeing growing use for applications such as document analysis and summarization which require large context windows, and with these large context windows KV cache activations surface as the dominant contributor to memory consumption during inference. Quantization is a promising approach for compressing KV cache activations; however, existing solutions fail to represent activations accurat… ▽ More LLMs are seeing growing use for applications such as document analysis and summarization which require large context windows, and with these large context windows KV cache activations surface as the dominant contributor to memory consumption during inference. Quantization is a promising approach for compressing KV cache activations; however, existing solutions fail to represent activations accurately in ultra-low precisions, such as sub-4-bit. In this work, we present KVQuant, which addresses this problem by incorporating novel methods for quantizing cached KV activations, including: (i) Per-Channel Key Quantization, where we adjust the dimension along which we quantize the Key activations to better match the distribution; (ii) Pre-RoPE Key Quantization, where we quantize Key activations before the rotary positional embedding to mitigate its impact on quantization; (iii) Non-Uniform KV Cache Quantization, where we derive per-layer sensitivity-weighted non-uniform datatypes that better represent the distributions; (iv) Per-Vector Dense-and-Sparse Quantization, where we isolate outliers separately for each vector to minimize skews in quantization ranges; and (v) Q-Norm, where we normalize quantization centroids in order to mitigate distribution shift, providing additional benefits for 2-bit quantization. By applying our method to the LLaMA, LLaMA-2, and Mistral models, we achieve $<0.1$ perplexity degradation with 3-bit quantization on both Wikitext-2 and C4, outperforming existing approaches. Our method enables serving the LLaMA-7B model with a context length of up to 1 million on a single A100-80GB GPU and up to 10 million on an 8-GPU system. △ Less

Submitted 4 April, 2024; v1 submitted 31 January, 2024; originally announced January 2024.

arXiv:2312.04511 [pdf, other]

An LLM Compiler for Parallel Function Calling

Authors: Sehoon Kim, Suhong Moon, Ryan Tabrizi, Nicholas Lee, Michael W. Mahoney, Kurt Keutzer, Amir Gholami

Abstract: The reasoning capabilities of the recent LLMs enable them to execute external function calls to overcome their inherent limitations, such as knowledge cutoffs, poor arithmetic skills, or lack of access to private data. This development has allowed LLMs to select and coordinate multiple functions based on the context to tackle more complex problems. However, current methods for function calling oft… ▽ More The reasoning capabilities of the recent LLMs enable them to execute external function calls to overcome their inherent limitations, such as knowledge cutoffs, poor arithmetic skills, or lack of access to private data. This development has allowed LLMs to select and coordinate multiple functions based on the context to tackle more complex problems. However, current methods for function calling often require sequential reasoning and acting for each function which can result in high latency, cost, and sometimes inaccurate behavior. To address this, we introduce LLMCompiler, which executes functions in parallel to efficiently orchestrate multiple function calls. Drawing inspiration from the principles of classical compilers, LLMCompiler enables parallel function calling with three components: (i) a Function Calling Planner, formulating execution plans for function calling; (ii) a Task Fetching Unit, dispatching function calling tasks; and (iii) an Executor, executing these tasks in parallel. LLMCompiler automatically generates an optimized orchestration for the function calls and can be used with both open-source and closed-source models. We have benchmarked LLMCompiler on a range of tasks with different patterns of function calling. We observe consistent latency speedup of up to 3.7x, cost savings of up to 6.7x, and accuracy improvement of up to ~9% compared to ReAct. Our code is available at https://github.com/SqueezeAILab/LLMCompiler. △ Less

Submitted 4 June, 2024; v1 submitted 7 December, 2023; originally announced December 2023.

Comments: ICML 2024

arXiv:2310.12072 [pdf, other]

SPEED: Speculative Pipelined Execution for Efficient Decoding

Authors: Coleman Hooper, Sehoon Kim, Hiva Mohammadzadeh, Hasan Genc, Kurt Keutzer, Amir Gholami, Sophia Shao

Abstract: Generative Large Language Models (LLMs) based on the Transformer architecture have recently emerged as a dominant foundation model for a wide range of Natural Language Processing tasks. Nevertheless, their application in real-time scenarios has been highly restricted due to the significant inference latency associated with these models. This is particularly pronounced due to the autoregressive nat… ▽ More Generative Large Language Models (LLMs) based on the Transformer architecture have recently emerged as a dominant foundation model for a wide range of Natural Language Processing tasks. Nevertheless, their application in real-time scenarios has been highly restricted due to the significant inference latency associated with these models. This is particularly pronounced due to the autoregressive nature of generative LLM inference, where tokens are generated sequentially since each token depends on all previous output tokens. It is therefore challenging to achieve any token-level parallelism, making inference extremely memory-bound. In this work, we propose SPEED, which improves inference efficiency by speculatively executing multiple future tokens in parallel with the current token using predicted values based on early-layer hidden states. For Transformer decoders that employ parameter sharing, the memory operations for the tokens executing in parallel can be amortized, which allows us to accelerate generative LLM inference. We demonstrate the efficiency of our method in terms of latency reduction relative to model accuracy and demonstrate how speculation allows for training deeper decoders with parameter sharing with minimal runtime overhead. △ Less

Submitted 2 January, 2024; v1 submitted 18 October, 2023; originally announced October 2023.

Comments: NeurIPS Workshop on Efficient Natural Language and Speech Processing (2023)

arXiv:2306.07629 [pdf, other]

SqueezeLLM: Dense-and-Sparse Quantization

Authors: Sehoon Kim, Coleman Hooper, Amir Gholami, Zhen Dong, Xiuyu Li, Sheng Shen, Michael W. Mahoney, Kurt Keutzer

Abstract: Generative Large Language Models (LLMs) have demonstrated remarkable results for a wide range of tasks. However, deploying these models for inference has been a significant challenge due to their unprecedented resource requirements. This has forced existing deployment frameworks to use multi-GPU inference pipelines, which are often complex and costly, or to use smaller and less performant models.… ▽ More Generative Large Language Models (LLMs) have demonstrated remarkable results for a wide range of tasks. However, deploying these models for inference has been a significant challenge due to their unprecedented resource requirements. This has forced existing deployment frameworks to use multi-GPU inference pipelines, which are often complex and costly, or to use smaller and less performant models. In this work, we demonstrate that the main bottleneck for generative inference with LLMs is memory bandwidth, rather than compute, specifically for single batch inference. While quantization has emerged as a promising solution by representing weights with reduced precision, previous efforts have often resulted in notable performance degradation. To address this, we introduce SqueezeLLM, a post-training quantization framework that not only enables lossless compression to ultra-low precisions of up to 3-bit, but also achieves higher quantization performance under the same memory constraint. Our framework incorporates two novel ideas: (i) sensitivity-based non-uniform quantization, which searches for the optimal bit precision assignment based on second-order information; and (ii) the Dense-and-Sparse decomposition that stores outliers and sensitive weight values in an efficient sparse format. When applied to the LLaMA models, our 3-bit quantization significantly reduces the perplexity gap from the FP16 baseline by up to 2.1x as compared to the state-of-the-art methods with the same memory requirement. Furthermore, when deployed on an A6000 GPU, our quantized models achieve up to 2.3x speedup compared to the baseline. Our code is available at https://github.com/SqueezeAILab/SqueezeLLM. △ Less

Submitted 4 June, 2024; v1 submitted 13 June, 2023; originally announced June 2023.

Comments: ICML 2024

arXiv:2306.00258 [pdf, other]

Towards Foundation Models for Scientific Machine Learning: Characterizing Scaling and Transfer Behavior

Authors: Shashank Subramanian, Peter Harrington, Kurt Keutzer, Wahid Bhimji, Dmitriy Morozov, Michael Mahoney, Amir Gholami

Abstract: Pre-trained machine learning (ML) models have shown great performance for a wide range of applications, in particular in natural language processing (NLP) and computer vision (CV). Here, we study how pre-training could be used for scientific machine learning (SciML) applications, specifically in the context of transfer learning. We study the transfer behavior of these models as (i) the pre-trained… ▽ More Pre-trained machine learning (ML) models have shown great performance for a wide range of applications, in particular in natural language processing (NLP) and computer vision (CV). Here, we study how pre-training could be used for scientific machine learning (SciML) applications, specifically in the context of transfer learning. We study the transfer behavior of these models as (i) the pre-trained model size is scaled, (ii) the downstream training dataset size is scaled, (iii) the physics parameters are systematically pushed out of distribution, and (iv) how a single model pre-trained on a mixture of different physics problems can be adapted to various downstream applications. We find that-when fine-tuned appropriately-transfer learning can help reach desired accuracy levels with orders of magnitude fewer downstream examples (across different tasks that can even be out-of-distribution) than training from scratch, with consistent behavior across a wide range of downstream examples. We also find that fine-tuning these models yields more performance gains as model size increases, compared to training from scratch on new downstream tasks. These results hold for a broad range of PDE learning tasks. All in all, our results demonstrate the potential of the "pre-train and fine-tune" paradigm for SciML problems, demonstrating a path towards building SciML foundation models. We open-source our code for reproducibility. △ Less

Submitted 31 May, 2023; originally announced June 2023.

Comments: 16 pages, 11 figures

Journal ref: NeurIPS 2023

arXiv:2305.05332 [pdf, ps, other]

On Multi-Message Private Computation

Authors: Ali Gholami, Kai Wan, Tayyebeh Jahani-Nezhad, Hua Sun, Mingyue Ji, Giuseppe Caire

Abstract: In a typical formulation of the private information retrieval (PIR) problem, a single user wishes to retrieve one out of K files from N servers without revealing the demanded file index to any server. This paper formulates an extended model of PIR, referred to as multi-message private computation (MM-PC), where instead of retrieving a single file, the user wishes to retrieve P > 1 linear combinati… ▽ More In a typical formulation of the private information retrieval (PIR) problem, a single user wishes to retrieve one out of K files from N servers without revealing the demanded file index to any server. This paper formulates an extended model of PIR, referred to as multi-message private computation (MM-PC), where instead of retrieving a single file, the user wishes to retrieve P > 1 linear combinations of files while preserving the privacy of the demand information. The MM-PC problem is a generalization of the private computation (PC) problem (where the user requests one linear combination of the files), and the multi-message private information retrieval (MM-PIR) problem (where the user requests P > 1 files). A baseline achievable scheme repeats the optimal PC scheme by Sun and Jafar P times, or treats each possible demanded linear combination as an independent file and then uses the near optimal MM-PIR scheme by Banawan and Ulukus. In this paper, we propose an achievable MM-PC scheme that significantly improves upon the baseline scheme. Doing so, we design the queries inspiring from the structure in the cache-aided scalar linear function retrieval scheme by Wan et al., where they leverage the dependency between messages to reduce the amount of communication. To ensure the decodability of our scheme, we propose a new method to benefit from the existing dependency, referred to as the sign assignment step. In the end, we use Maximum Distance Separable matrices to code the queries, which allows the reduction of download from the servers, while preserving privacy. △ Less

Submitted 22 February, 2024; v1 submitted 9 May, 2023; originally announced May 2023.

arXiv:2304.06745 [pdf, other]

End-to-end codesign of Hessian-aware quantized neural networks for FPGAs and ASICs

Authors: Javier Campos, Zhen Dong, Javier Duarte, Amir Gholami, Michael W. Mahoney, Jovan Mitrevski, Nhan Tran

Abstract: We develop an end-to-end workflow for the training and implementation of co-designed neural networks (NNs) for efficient field-programmable gate array (FPGA) and application-specific integrated circuit (ASIC) hardware. Our approach leverages Hessian-aware quantization (HAWQ) of NNs, the Quantized Open Neural Network Exchange (QONNX) intermediate representation, and the hls4ml tool flow for transpi… ▽ More We develop an end-to-end workflow for the training and implementation of co-designed neural networks (NNs) for efficient field-programmable gate array (FPGA) and application-specific integrated circuit (ASIC) hardware. Our approach leverages Hessian-aware quantization (HAWQ) of NNs, the Quantized Open Neural Network Exchange (QONNX) intermediate representation, and the hls4ml tool flow for transpiling NNs into FPGA and ASIC firmware. This makes efficient NN implementations in hardware accessible to nonexperts, in a single open-sourced workflow that can be deployed for real-time machine learning applications in a wide range of scientific and industrial settings. We demonstrate the workflow in a particle physics application involving trigger decisions that must operate at the 40 MHz collision rate of the CERN Large Hadron Collider (LHC). Given the high collision rate, all data processing must be implemented on custom ASIC and FPGA hardware within a strict area and latency. Based on these constraints, we implement an optimized mixed-precision NN classifier for high-momentum particle jets in simulated LHC proton-proton collisions. △ Less

Submitted 13 April, 2023; originally announced April 2023.

Comments: 19 pages, 6 figures, 2 tables

Report number: FERMILAB-PUB-23-150-CSAID-ETD

arXiv:2302.14017 [pdf, other]

Full Stack Optimization of Transformer Inference: a Survey

Authors: Sehoon Kim, Coleman Hooper, Thanakul Wattanawong, Minwoo Kang, Ruohan Yan, Hasan Genc, Grace Dinh, Qi**g Huang, Kurt Keutzer, Michael W. Mahoney, Yakun Sophia Shao, Amir Gholami

Abstract: Recent advances in state-of-the-art DNN architecture design have been moving toward Transformer models. These models achieve superior accuracy across a wide range of applications. This trend has been consistent over the past several years since Transformer models were originally introduced. However, the amount of compute and bandwidth required for inference of recent Transformer models is growing… ▽ More Recent advances in state-of-the-art DNN architecture design have been moving toward Transformer models. These models achieve superior accuracy across a wide range of applications. This trend has been consistent over the past several years since Transformer models were originally introduced. However, the amount of compute and bandwidth required for inference of recent Transformer models is growing at a significant rate, and this has made their deployment in latency-sensitive applications challenging. As such, there has been an increased focus on making Transformer models more efficient, with methods that range from changing the architecture design, all the way to develo** dedicated domain-specific accelerators. In this work, we survey different approaches for efficient Transformer inference, including: (i) analysis and profiling of the bottlenecks in existing Transformer architectures and their similarities and differences with previous convolutional models; (ii) implications of Transformer architecture on hardware, including the impact of non-linear operations such as Layer Normalization, Softmax, and GELU, as well as linear operations, on hardware design; (iii) approaches for optimizing a fixed Transformer architecture; (iv) challenges in finding the right map** and scheduling of operations for Transformer models; and (v) approaches for optimizing Transformer models by adapting the architecture using neural architecture search. Finally, we perform a case study by applying the surveyed optimizations on Gemmini, the open-source, full-stack DNN accelerator generator, and we show how each of these approaches can yield improvements, compared to previous benchmark results on Gemmini. Among other things, we find that a full-stack co-design approach with the aforementioned methods can result in up to 88.7x speedup with a minimal performance degradation for Transformer inference. △ Less

Submitted 27 February, 2023; originally announced February 2023.

Journal ref: Presented in Workshop on Architecture and System Support for Transformer Models (ASSYST) at ISCA 2023

arXiv:2302.07863 [pdf, other]

Speculative Decoding with Big Little Decoder

Authors: Sehoon Kim, Karttikeya Mangalam, Suhong Moon, Jitendra Malik, Michael W. Mahoney, Amir Gholami, Kurt Keutzer

Abstract: The recent emergence of Large Language Models based on the Transformer architecture has enabled dramatic advancements in the field of Natural Language Processing. However, these models have long inference latency, which limits their deployment and makes them prohibitively expensive for various real-time applications. The inference latency is further exacerbated by autoregressive generative tasks,… ▽ More The recent emergence of Large Language Models based on the Transformer architecture has enabled dramatic advancements in the field of Natural Language Processing. However, these models have long inference latency, which limits their deployment and makes them prohibitively expensive for various real-time applications. The inference latency is further exacerbated by autoregressive generative tasks, as models need to run iteratively to generate tokens sequentially without leveraging token-level parallelization. To address this, we propose Big Little Decoder (BiLD), a framework that can improve inference efficiency and latency for a wide range of text generation applications. The BiLD framework contains two models with different sizes that collaboratively generate text. The small model runs autoregressively to generate text with a low inference cost, and the large model is only invoked occasionally to refine the small model's inaccurate predictions in a non-autoregressive manner. To coordinate the small and large models, BiLD introduces two simple yet effective policies: (1) the fallback policy that determines when to hand control over to the large model; and (2) the rollback policy that determines when the large model needs to correct the small model's inaccurate predictions. To evaluate our framework across different tasks and models, we apply BiLD to various text generation scenarios encompassing machine translation on IWSLT 2017 De-En and WMT 2014 De-En, and summarization on XSUM and CNN/DailyMail. On an NVIDIA T4 GPU, our framework achieves a speedup of up to 2.12x speedup with minimal generation quality degradation. Furthermore, our framework is fully plug-and-play and can be applied without any modifications in the training process or model architecture. Our code is open-sourced △ Less

Submitted 12 October, 2023; v1 submitted 15 February, 2023; originally announced February 2023.

Comments: NeurIPS 2023

arXiv:2211.09120 [pdf, other]

AdaMAE: Adaptive Masking for Efficient Spatiotemporal Learning with Masked Autoencoders

Authors: Wele Gedara Chaminda Bandara, Naman Patel, Ali Gholami, Mehdi Nikkhah, Motilal Agrawal, Vishal M. Patel

Abstract: Masked Autoencoders (MAEs) learn generalizable representations for image, text, audio, video, etc., by reconstructing masked input data from tokens of the visible data. Current MAE approaches for videos rely on random patch, tube, or frame-based masking strategies to select these tokens. This paper proposes AdaMAE, an adaptive masking strategy for MAEs that is end-to-end trainable. Our adaptive ma… ▽ More Masked Autoencoders (MAEs) learn generalizable representations for image, text, audio, video, etc., by reconstructing masked input data from tokens of the visible data. Current MAE approaches for videos rely on random patch, tube, or frame-based masking strategies to select these tokens. This paper proposes AdaMAE, an adaptive masking strategy for MAEs that is end-to-end trainable. Our adaptive masking strategy samples visible tokens based on the semantic context using an auxiliary sampling network. This network estimates a categorical distribution over spacetime-patch tokens. The tokens that increase the expected reconstruction error are rewarded and selected as visible tokens, motivated by the policy gradient algorithm in reinforcement learning. We show that AdaMAE samples more tokens from the high spatiotemporal information regions, thereby allowing us to mask 95% of tokens, resulting in lower memory requirements and faster pre-training. We conduct ablation studies on the Something-Something v2 (SSv2) dataset to demonstrate the efficacy of our adaptive sampling approach and report state-of-the-art results of 70.0% and 81.7% in top-1 accuracy on SSv2 and Kinetics-400 action classification datasets with a ViT-Base backbone and 800 pre-training epochs. △ Less

Submitted 16 November, 2022; originally announced November 2022.

Comments: Code available at: https://github.com/wgcban/adamae

arXiv:2210.00055 [pdf, other]

MaskTune: Mitigating Spurious Correlations by Forcing to Explore

Authors: Saeid Asgari Taghanaki, Aliasghar Khani, Fereshte Khani, Ali Gholami, Linh Tran, Ali Mahdavi-Amiri, Ghassan Hamarneh

Abstract: A fundamental challenge of over-parameterized deep learning models is learning meaningful data representations that yield good performance on a downstream task without over-fitting spurious input features. This work proposes MaskTune, a masking strategy that prevents over-reliance on spurious (or a limited number of) features. MaskTune forces the trained model to explore new features during a sing… ▽ More A fundamental challenge of over-parameterized deep learning models is learning meaningful data representations that yield good performance on a downstream task without over-fitting spurious input features. This work proposes MaskTune, a masking strategy that prevents over-reliance on spurious (or a limited number of) features. MaskTune forces the trained model to explore new features during a single epoch finetuning by masking previously discovered features. MaskTune, unlike earlier approaches for mitigating shortcut learning, does not require any supervision, such as annotating spurious features or labels for subgroup samples in a dataset. Our empirical results on biased MNIST, CelebA, Waterbirds, and ImagenNet-9L datasets show that MaskTune is effective on tasks that often suffer from the existence of spurious correlations. Finally, we show that MaskTune outperforms or achieves similar performance to the competing methods when applied to the selective classification (classification with rejection option) task. Code for MaskTune is available at https://github.com/aliasgharkhani/Masktune. △ Less

Submitted 8 October, 2022; v1 submitted 30 September, 2022; originally announced October 2022.

Comments: Accepted to NeurIPS 2022

arXiv:2207.04084 [pdf, other]

Adaptive Self-supervision Algorithms for Physics-informed Neural Networks

Authors: Shashank Subramanian, Robert M. Kirby, Michael W. Mahoney, Amir Gholami

Abstract: Physics-informed neural networks (PINNs) incorporate physical knowledge from the problem domain as a soft constraint on the loss function, but recent work has shown that this can lead to optimization difficulties. Here, we study the impact of the location of the collocation points on the trainability of these models. We find that the vanilla PINN performance can be significantly boosted by adaptin… ▽ More Physics-informed neural networks (PINNs) incorporate physical knowledge from the problem domain as a soft constraint on the loss function, but recent work has shown that this can lead to optimization difficulties. Here, we study the impact of the location of the collocation points on the trainability of these models. We find that the vanilla PINN performance can be significantly boosted by adapting the location of the collocation points as training proceeds. Specifically, we propose a novel adaptive collocation scheme which progressively allocates more collocation points (without increasing their number) to areas where the model is making higher errors (based on the gradient of the loss function in the domain). This, coupled with a judicious restarting of the training during any optimization stalls (by simply resampling the collocation points in order to adjust the loss landscape) leads to better estimates for the prediction error. We present results for several problems, including a 2D Poisson and diffusion-advection system with different forcing functions. We find that training vanilla PINNs for these problems can result in up to 70% prediction error in the solution, especially in the regime of low collocation points. In contrast, our adaptive schemes can achieve up to an order of magnitude smaller error, with similar computational complexity as the baseline. Furthermore, we find that the adaptive methods consistently perform on-par or slightly better than vanilla PINN method, even for large collocation point regimes. The code for all the experiments has been open sourced. △ Less

Submitted 8 July, 2022; originally announced July 2022.

Comments: 15 pages

arXiv:2207.01548 [pdf, other]

Counterbalancing Teacher: Regularizing Batch Normalized Models for Robustness

Authors: Saeid Asgari Taghanaki, Ali Gholami, Fereshte Khani, Kristy Choi, Linh Tran, Ran Zhang, Aliasghar Khani

Abstract: Batch normalization (BN) is a ubiquitous technique for training deep neural networks that accelerates their convergence to reach higher accuracy. However, we demonstrate that BN comes with a fundamental drawback: it incentivizes the model to rely on low-variance features that are highly specific to the training (in-domain) data, hurting generalization performance on out-of-domain examples. In this… ▽ More Batch normalization (BN) is a ubiquitous technique for training deep neural networks that accelerates their convergence to reach higher accuracy. However, we demonstrate that BN comes with a fundamental drawback: it incentivizes the model to rely on low-variance features that are highly specific to the training (in-domain) data, hurting generalization performance on out-of-domain examples. In this work, we investigate this phenomenon by first showing that removing BN layers across a wide range of architectures leads to lower out-of-domain and corruption errors at the cost of higher in-domain errors. We then propose Counterbalancing Teacher (CT), a method which leverages a frozen copy of the same model without BN as a teacher to enforce the student network's learning of robust representations by substantially adapting its weights through a consistency loss function. This regularization signal helps CT perform well in unforeseen data shifts, even without information from the target domain as in prior works. We theoretically show in an overparameterized linear regression setting why normalization leads to a model's reliance on such in-domain features, and empirically demonstrate the efficacy of CT by outperforming several baselines on robustness benchmarks such as CIFAR-10-C, CIFAR-100-C, and VLCS. △ Less

Submitted 4 July, 2022; originally announced July 2022.

arXiv:2206.00888 [pdf, other]

Squeezeformer: An Efficient Transformer for Automatic Speech Recognition

Authors: Sehoon Kim, Amir Gholami, Albert Shaw, Nicholas Lee, Karttikeya Mangalam, Jitendra Malik, Michael W. Mahoney, Kurt Keutzer

Abstract: The recently proposed Conformer model has become the de facto backbone model for various downstream speech tasks based on its hybrid attention-convolution architecture that captures both local and global features. However, through a series of systematic studies, we find that the Conformer architecture's design choices are not optimal. After re-examining the design choices for both the macro and mi… ▽ More The recently proposed Conformer model has become the de facto backbone model for various downstream speech tasks based on its hybrid attention-convolution architecture that captures both local and global features. However, through a series of systematic studies, we find that the Conformer architecture's design choices are not optimal. After re-examining the design choices for both the macro and micro-architecture of Conformer, we propose Squeezeformer which consistently outperforms the state-of-the-art ASR models under the same training schemes. In particular, for the macro-architecture, Squeezeformer incorporates (i) the Temporal U-Net structure which reduces the cost of the multi-head attention modules on long sequences, and (ii) a simpler block structure of multi-head attention or convolution modules followed up by feed-forward module instead of the Macaron structure proposed in Conformer. Furthermore, for the micro-architecture, Squeezeformer (i) simplifies the activations in the convolutional block, (ii) removes redundant Layer Normalization operations, and (iii) incorporates an efficient depthwise down-sampling layer to efficiently sub-sample the input signal. Squeezeformer achieves state-of-the-art results of 7.5%, 6.5%, and 6.0% word-error-rate (WER) on LibriSpeech test-other without external language models, which are 3.1%, 1.4%, and 0.6% better than Conformer-CTC with the same number of FLOPs. Our code is open-sourced and available online. △ Less

Submitted 15 October, 2022; v1 submitted 2 June, 2022; originally announced June 2022.

Comments: NeurIPS 2022

arXiv:2204.09656 [pdf, other]

A Fast Post-Training Pruning Framework for Transformers

Authors: Woosuk Kwon, Sehoon Kim, Michael W. Mahoney, Joseph Hassoun, Kurt Keutzer, Amir Gholami

Abstract: Pruning is an effective way to reduce the huge inference cost of Transformer models. However, prior work on pruning Transformers requires retraining the models. This can add high training cost and high complexity to model deployment, making it difficult to use in many practical situations. To address this, we propose a fast post-training pruning framework for Transformers that does not require any… ▽ More Pruning is an effective way to reduce the huge inference cost of Transformer models. However, prior work on pruning Transformers requires retraining the models. This can add high training cost and high complexity to model deployment, making it difficult to use in many practical situations. To address this, we propose a fast post-training pruning framework for Transformers that does not require any retraining. Given a resource constraint and a sample dataset, our framework automatically prunes the Transformer model using structured sparsity methods. To retain high accuracy without retraining, we introduce three novel techniques: (i) a lightweight mask search algorithm that finds which heads and filters to prune based on the Fisher information; (ii) mask rearrangement that complements the search algorithm; and (iii) mask tuning that reconstructs the output activations for each layer. We apply our method to BERT-base and DistilBERT, and we evaluate its effectiveness on GLUE and SQuAD benchmarks. Our framework achieves up to 2.0x reduction in FLOPs and 1.56x speedup in inference latency, while maintaining < 1% loss in accuracy. Importantly, our framework prunes Transformers in less than 3 minutes on a single GPU, which is over two orders of magnitude faster than existing pruning approaches that retrain the models. △ Less

Submitted 17 October, 2022; v1 submitted 29 March, 2022; originally announced April 2022.

Comments: NeurIPS 2022

arXiv:2201.11539 [pdf, ps, other]

Coded Caching with Private Demands and Caches

Authors: Ali Gholami, Kai Wan, Hua Sun, Mingyue Ji, Giuseppe Caire

Abstract: Recently it was shown that the seminal Maddah-Ali and Niesen (MAN) coded caching scheme leaks the demand information of each user to the others. Many works have considered coded caching with demand privacy, while each non-trivial existing coded caching scheme with private demands was built on the fact that the cache information of each user is private to the others. However, most of these schemes… ▽ More Recently it was shown that the seminal Maddah-Ali and Niesen (MAN) coded caching scheme leaks the demand information of each user to the others. Many works have considered coded caching with demand privacy, while each non-trivial existing coded caching scheme with private demands was built on the fact that the cache information of each user is private to the others. However, most of these schemes leak the users' cache information. Consequently, in most realistic settings (e.g., video streaming), where the system is used over time with multiple sequential transmission rounds, these schemes leak demand privacy beyond the first round. This observation motivates our new formulation of coded caching with simultaneously private demands and caches. The main contribution of this paper is a new construction that generates private coded caching schemes by leveraging two-server private information retrieval (PIR) schemes. We show that if in the PIR scheme the demand is uniform over all files and the queries are independent, the resulting caching scheme is private on both the demands and on the caches. Interestingly, we propose a new construction of two-server PIR schemes in this class by leveraging coded caching schemes. By applying the seminal MAN coded caching scheme into our construction, the resulting two-server PIR scheme is proved to be order optimal. This is a second new structural result, somehow closing the loop in the relation between coded caching and PIR. Finally, to explore a broader tradeoff between cache privacy and transmission load, we relax the cache privacy constraint and introduce the definition of leakage on cache information. Then, again as a by-product of our new construction, we propose new schemes with perfect demand privacy and imperfect cache privacy that achieve an order-gain in load with respect to the scheme with perfect privacy on both demands and caches. △ Less

Submitted 2 November, 2023; v1 submitted 27 January, 2022; originally announced January 2022.

Comments: 46 pages

arXiv:2201.11067 [pdf, other]

ROMA: Resource Orchestration for Microservices-based 5G Applications

Authors: Anousheh Gholami, Kunal Rao, Wang-Pin Hsiung, Oliver Po, Murugan Sankaradas, Srimat Chakradhar

Abstract: With the growth of 5G, Internet of Things (IoT), edge computing and cloud computing technologies, the infrastructure (compute and network) available to emerging applications (AR/VR, autonomous driving, industry 4.0, etc.) has become quite complex. There are multiple tiers of computing (IoT devices, near edge, far edge, cloud, etc.) that are connected with different types of networking technologies… ▽ More With the growth of 5G, Internet of Things (IoT), edge computing and cloud computing technologies, the infrastructure (compute and network) available to emerging applications (AR/VR, autonomous driving, industry 4.0, etc.) has become quite complex. There are multiple tiers of computing (IoT devices, near edge, far edge, cloud, etc.) that are connected with different types of networking technologies (LAN, LTE, 5G, MAN, WAN, etc.). Deployment and management of applications in such an environment is quite challenging. In this paper, we propose ROMA, which performs resource orchestration for microservices-based 5G applications in a dynamic, heterogeneous, multi-tiered compute and network fabric. We assume that only application-level requirements are known, and the detailed requirements of the individual microservices in the application are not specified. As part of our solution, ROMA identifies and leverages the coupling relationship between compute and network usage for various microservices and solves an optimization problem in order to appropriately identify how each microservice should be deployed in the complex, multi-tiered compute and network fabric, so that the end-to-end application requirements are optimally met. We implemented two real-world 5G applications in video surveillance and intelligent transportation system (ITS) domains. Through extensive experiments, we show that ROMA is able to save up to 90%, 55% and 44% compute and up to 80%, 95% and 75% network bandwidth for the surveillance (watchlist) and transportation application (person and car detection), respectively. This improvement is achieved while honoring the application performance requirements, and it is over an alternative scheme that employs a static and overprovisioned resource allocation strategy by ignoring the resource coupling relationships. △ Less

Submitted 25 February, 2022; v1 submitted 26 January, 2022; originally announced January 2022.

Comments: Accepted at 2022 IEEE/IFIP Network Operations and Management Symposium

arXiv:2110.13041 [pdf, other]

doi 10.3389/fdata.2022.787421

Applications and Techniques for Fast Machine Learning in Science

Authors: Allison McCarn Deiana, Nhan Tran, Joshua Agar, Michaela Blott, Giuseppe Di Guglielmo, Javier Duarte, Philip Harris, Scott Hauck, Mia Liu, Mark S. Neubauer, Jennifer Ngadiuba, Seda Ogrenci-Memik, Maurizio Pierini, Thea Aarrestad, Steffen Bahr, Jurgen Becker, Anne-Sophie Berthold, Richard J. Bonventre, Tomas E. Muller Bravo, Markus Diefenthaler, Zhen Dong, Nick Fritzsche, Amir Gholami, Ekaterina Govorkova, Kyle J Hazelwood , et al. (62 additional authors not shown)

Abstract: In this community review report, we discuss applications and techniques for fast machine learning (ML) in science -- the concept of integrating power ML methods into the real-time experimental data processing loop to accelerate scientific discovery. The material for the report builds on two workshops held by the Fast ML for Science community and covers three main areas: applications for fast ML ac… ▽ More In this community review report, we discuss applications and techniques for fast machine learning (ML) in science -- the concept of integrating power ML methods into the real-time experimental data processing loop to accelerate scientific discovery. The material for the report builds on two workshops held by the Fast ML for Science community and covers three main areas: applications for fast ML across a number of scientific domains; techniques for training and implementing performant and resource-efficient ML algorithms; and computing architectures, platforms, and technologies for deploying these algorithms. We also present overlap** challenges across the multiple scientific domains where common solutions can be found. This community report is intended to give plenty of examples and inspiration for scientific discovery through integrated and accelerated ML solutions. This is followed by a high-level overview and organization of technical advances, including an abundance of pointers to source material, which can enable these breakthroughs. △ Less

Submitted 25 October, 2021; originally announced October 2021.

Comments: 66 pages, 13 figures, 5 tables

Report number: FERMILAB-PUB-21-502-AD-E-SCD

Journal ref: Front. Big Data 5, 787421 (2022)

arXiv:2109.01050 [pdf, other]

Characterizing possible failure modes in physics-informed neural networks

Authors: Aditi S. Krishnapriyan, Amir Gholami, Shandian Zhe, Robert M. Kirby, Michael W. Mahoney

Abstract: Recent work in scientific machine learning has developed so-called physics-informed neural network (PINN) models. The typical approach is to incorporate physical domain knowledge as soft constraints on an empirical loss function and use existing machine learning methodologies to train the model. We demonstrate that, while existing PINN methodologies can learn good models for relatively trivial pro… ▽ More Recent work in scientific machine learning has developed so-called physics-informed neural network (PINN) models. The typical approach is to incorporate physical domain knowledge as soft constraints on an empirical loss function and use existing machine learning methodologies to train the model. We demonstrate that, while existing PINN methodologies can learn good models for relatively trivial problems, they can easily fail to learn relevant physical phenomena for even slightly more complex problems. In particular, we analyze several distinct situations of widespread physical interest, including learning differential equations with convection, reaction, and diffusion operators. We provide evidence that the soft regularization in PINNs, which involves PDE-based differential operators, can introduce a number of subtle problems, including making the problem more ill-conditioned. Importantly, we show that these possible failure modes are not due to the lack of expressivity in the NN architecture, but that the PINN's setup makes the loss landscape very hard to optimize. We then describe two promising solutions to address these failure modes. The first approach is to use curriculum regularization, where the PINN's loss term starts from a simple PDE regularization, and becomes progressively more complex as the NN gets trained. The second approach is to pose the problem as a sequence-to-sequence learning task, rather than learning to predict the entire space-time at once. Extensive testing shows that we can achieve up to 1-2 orders of magnitude lower error with these methods as compared to regular PINN training. △ Less

Submitted 11 November, 2021; v1 submitted 2 September, 2021; originally announced September 2021.

Comments: 22 pages

Journal ref: NeurIPS 2021

arXiv:2108.08730 [pdf, other]

Accurate 3D frequency-domain seismic wave modeling with the wavelength-adaptive 27-point finite-difference stencil: a tool for full waveform inversion

Authors: Hossein S. Aghamiry, Ali Gholami, Laure Combe, Stéphane Operto

Abstract: Efficient frequency-domain Full Waveform Inversion (FWI) of long-offset/wide-azimuth node data can be designed with a few discrete frequencies. However, 3D frequency-domain seismic modeling remains challenging since it requires solving a large and sparse linear indefinite system per frequency. When such systems are solved with direct methods or hybrid direct/iterative solvers, based upon domain de… ▽ More Efficient frequency-domain Full Waveform Inversion (FWI) of long-offset/wide-azimuth node data can be designed with a few discrete frequencies. However, 3D frequency-domain seismic modeling remains challenging since it requires solving a large and sparse linear indefinite system per frequency. When such systems are solved with direct methods or hybrid direct/iterative solvers, based upon domain decomposition preconditioner, finite-difference stencils on regular Cartesian grids should be designed to conciliate compactness and accuracy, the former being necessary to mitigate the fill-in induced by the Lower-Upper (LU) factorization. Compactness is classically implemented by combining several second-order accurate stencils covering the eight cells surrounding the collocation point, leading to the so-called 27-point stencil. Accuracy is obtained by applying optimal weights on the different stiffness and consistent mass matrices such that numerical dispersion is jointly minimized for several number of grid points per wavelength ($G$). However, with this approach, the same weights are used at each collocation point, leading to suboptimal accuracy in heterogeneous media. In this study, we propose a straightforward recipe to improve the accuracy of the 27-point stencil. First, we finely tabulate the values of $G$ covering the range of wavelengths spanned by the subsurface model and the frequency. Then, we estimate with a classical dispersion analysis in homogeneous media the corresponding table of optimal weights that minimize dispersion for each $G$ treated separately. We however apply a Tikhonov regularization to guarantee smooth variation of the weights with $G$. Finally, we build the impedance matrix by selecting the optimal weights at each collocation point according to the local wavelength, hence leading to a wavelength-adaptive stencil. △ Less

Submitted 19 August, 2021; originally announced August 2021.

arXiv:2107.00910 [pdf, other]

Learned Token Pruning for Transformers

Authors: Sehoon Kim, Sheng Shen, David Thorsley, Amir Gholami, Woosuk Kwon, Joseph Hassoun, Kurt Keutzer

Abstract: Deploying transformer models in practice is challenging due to their inference cost, which scales quadratically with input sequence length. To address this, we present a novel Learned Token Pruning (LTP) method which adaptively removes unimportant tokens as an input sequence passes through transformer layers. In particular, LTP prunes tokens with an attention score below a threshold value which is… ▽ More Deploying transformer models in practice is challenging due to their inference cost, which scales quadratically with input sequence length. To address this, we present a novel Learned Token Pruning (LTP) method which adaptively removes unimportant tokens as an input sequence passes through transformer layers. In particular, LTP prunes tokens with an attention score below a threshold value which is learned for each layer during training. Our threshold-based method allows the length of the pruned sequence to vary adaptively based on the input sequence, and avoids algorithmically expensive operations such as top-k token selection. We extensively test the performance of LTP on GLUE tasks and show that our method outperforms the prior state-of-the-art token pruning methods by up to ~2.5% higher accuracy with the same amount of FLOPs. In particular, LTP achieves up to 2.1x FLOPs reduction with less than 1% accuracy drop, which results in up to 1.9x and 2.0x throughput improvement on Intel Haswell CPUs and NVIDIA V100 GPUs, respectively. Furthermore, we demonstrate that LTP is more robust than prior methods to variations on input sentence lengths. Our code has been developed in PyTorch and has been open-sourced. △ Less

Submitted 2 June, 2022; v1 submitted 2 July, 2021; originally announced July 2021.

Comments: KDD 2022 (Research Track)

arXiv:2104.07853 [pdf, other]

On the Importance of Trust in Next-Generation Networked CPS Systems: An AI Perspective

Authors: Anousheh Gholami, Nariman Torkzaban, John S. Baras

Abstract: With the increasing scale, complexity, and heterogeneity of the next generation networked systems, seamless control, management, and security of such systems becomes increasingly challenging. Many diverse applications have driven interest in networked systems, including large-scale distributed learning, multi-agent optimization, 5G service provisioning, and network slicing, etc. In this paper, we… ▽ More With the increasing scale, complexity, and heterogeneity of the next generation networked systems, seamless control, management, and security of such systems becomes increasingly challenging. Many diverse applications have driven interest in networked systems, including large-scale distributed learning, multi-agent optimization, 5G service provisioning, and network slicing, etc. In this paper, we propose trust as a measure to evaluate the status of network agents and improve the decision-making process. We interpret trust as a relation among entities that participate in various protocols. Trust relations are based on evidence created by the interactions of entities within a protocol and may be a composite of multiple metrics such as availability, reliability, resilience, etc. depending on application context. We first elaborate on the importance of trust as a metric and then present a mathematical framework for trust computation and aggregation within a network. Then we show in practice, how trust can be integrated into network decision-making processes by presenting two examples. In the first example, we show how utilizing the trust evidence can improve the performance and the security of Federated Learning. Second, we show how a 5G network resource provisioning framework can be improved when augmented with a trust-aware decision-making scheme. We verify the validity of our trust-based approach through simulations. Finally, we explain the challenges associated with aggregating the trust evidence and briefly explain our ideas to tackle them. △ Less

Submitted 15 April, 2021; originally announced April 2021.

arXiv:2103.16827 [pdf, other]

Integer-only Zero-shot Quantization for Efficient Speech Recognition

Authors: Sehoon Kim, Amir Gholami, Zhewei Yao, Nicholas Lee, Patrick Wang, Aniruddha Nrusimha, Bohan Zhai, Tianren Gao, Michael W. Mahoney, Kurt Keutzer

Abstract: End-to-end neural network models achieve improved performance on various automatic speech recognition (ASR) tasks. However, these models perform poorly on edge hardware due to large memory and computation requirements. While quantizing model weights and/or activations to low-precision can be a promising solution, previous research on quantizing ASR models is limited. In particular, the previous ap… ▽ More End-to-end neural network models achieve improved performance on various automatic speech recognition (ASR) tasks. However, these models perform poorly on edge hardware due to large memory and computation requirements. While quantizing model weights and/or activations to low-precision can be a promising solution, previous research on quantizing ASR models is limited. In particular, the previous approaches use floating-point arithmetic during inference and thus they cannot fully exploit efficient integer processing units. Moreover, they require training and/or validation data during quantization, which may not be available due to security or privacy concerns. To address these limitations, we propose an integer-only, zero-shot quantization scheme for ASR models. In particular, we generate synthetic data whose runtime statistics resemble the real data, and we use it to calibrate models during quantization. We apply our method to quantize QuartzNet, Jasper, and Conformer and show negligible WER degradation as compared to the full-precision baseline models, even without using any data. Moreover, we achieve up to 2.35x speedup on a T4 GPU and 4x compression rate, with a modest WER degradation of <1% with INT8 quantization. △ Less

Submitted 30 January, 2022; v1 submitted 31 March, 2021; originally announced March 2021.

Journal ref: ICASSP 2022

arXiv:2103.13630 [pdf, other]

A Survey of Quantization Methods for Efficient Neural Network Inference

Authors: Amir Gholami, Sehoon Kim, Zhen Dong, Zhewei Yao, Michael W. Mahoney, Kurt Keutzer

Abstract: As soon as abstract mathematical computations were adapted to computation on digital computers, the problem of efficient representation, manipulation, and communication of the numerical values in those computations arose. Strongly related to the problem of numerical representation is the problem of quantization: in what manner should a set of continuous real-valued numbers be distributed over a fi… ▽ More As soon as abstract mathematical computations were adapted to computation on digital computers, the problem of efficient representation, manipulation, and communication of the numerical values in those computations arose. Strongly related to the problem of numerical representation is the problem of quantization: in what manner should a set of continuous real-valued numbers be distributed over a fixed discrete set of numbers to minimize the number of bits required and also to maximize the accuracy of the attendant computations? This perennial problem of quantization is particularly relevant whenever memory and/or computational resources are severely restricted, and it has come to the forefront in recent years due to the remarkable performance of Neural Network models in computer vision, natural language processing, and related areas. Moving from floating-point representations to low-precision fixed integer values represented in four bits or less holds the potential to reduce the memory footprint and latency by a factor of 16x; and, in fact, reductions of 4x to 8x are often realized in practice in these applications. Thus, it is not surprising that quantization has emerged recently as an important and very active sub-area of research in the efficient implementation of computations associated with Neural Networks. In this article, we survey approaches to the problem of quantizing the numerical values in deep Neural Network computations, covering the advantages/disadvantages of current methods. With this survey and its organization, we hope to have presented a useful snapshot of the current research in quantization for Neural Networks and to have given an intelligent organization to ease the evaluation of future research in this area. △ Less

Submitted 21 June, 2021; v1 submitted 25 March, 2021; originally announced March 2021.

Comments: Book Chapter: Low-Power Computer Vision: Improving the Efficiency of Artificial Intelligence

arXiv:2101.08940 [pdf, other]

Hessian-Aware Pruning and Optimal Neural Implant

Authors: Shixing Yu, Zhewei Yao, Amir Gholami, Zhen Dong, Sehoon Kim, Michael W Mahoney, Kurt Keutzer

Abstract: Pruning is an effective method to reduce the memory footprint and FLOPs associated with neural network models. However, existing structured-pruning methods often result in significant accuracy degradation for moderate pruning levels. To address this problem, we introduce a new Hessian Aware Pruning (HAP) method coupled with a Neural Implant approach that uses second-order sensitivity as a metric f… ▽ More Pruning is an effective method to reduce the memory footprint and FLOPs associated with neural network models. However, existing structured-pruning methods often result in significant accuracy degradation for moderate pruning levels. To address this problem, we introduce a new Hessian Aware Pruning (HAP) method coupled with a Neural Implant approach that uses second-order sensitivity as a metric for structured pruning. The basic idea is to prune insensitive components and to use a Neural Implant for moderately sensitive components, instead of completely pruning them. For the latter approach, the moderately sensitive components are replaced with with a low rank implant that is smaller and less computationally expensive than the original component. We use the relative Hessian trace to measure sensitivity, as opposed to the magnitude based sensitivity metric commonly used in the literature. We test HAP for both computer vision tasks and natural language tasks, and we achieve new state-of-the-art results. Specifically, HAP achieves less than $0.1\%$/$0.5\%$ degradation on PreResNet29/ResNet50 (CIFAR-10/ImageNet) with more than 70\%/50\% of parameters pruned. Meanwhile, HAP also achieves significantly better performance (up to 0.8\% with 60\% of parameters pruned) as compared to gradient based method for head pruning on transformer-based models. The framework has been open sourced and available online. △ Less

Submitted 21 June, 2021; v1 submitted 21 January, 2021; originally announced January 2021.

arXiv:2101.01321 [pdf, other]

I-BERT: Integer-only BERT Quantization

Authors: Sehoon Kim, Amir Gholami, Zhewei Yao, Michael W. Mahoney, Kurt Keutzer

Abstract: Transformer based models, like BERT and RoBERTa, have achieved state-of-the-art results in many Natural Language Processing tasks. However, their memory footprint, inference latency, and power consumption are prohibitive efficient inference at the edge, and even at the data center. While quantization can be a viable solution for this, previous work on quantizing Transformer based models use floati… ▽ More Transformer based models, like BERT and RoBERTa, have achieved state-of-the-art results in many Natural Language Processing tasks. However, their memory footprint, inference latency, and power consumption are prohibitive efficient inference at the edge, and even at the data center. While quantization can be a viable solution for this, previous work on quantizing Transformer based models use floating-point arithmetic during inference, which cannot efficiently utilize integer-only logical units such as the recent Turing Tensor Cores, or traditional integer-only ARM processors. In this work, we propose I-BERT, a novel quantization scheme for Transformer based models that quantizes the entire inference with integer-only arithmetic. Based on lightweight integer-only approximation methods for nonlinear operations, e.g., GELU, Softmax, and Layer Normalization, I-BERT performs an end-to-end integer-only BERT inference without any floating point calculation. We evaluate our approach on GLUE downstream tasks using RoBERTa-Base/Large. We show that for both cases, I-BERT achieves similar (and slightly higher) accuracy as compared to the full-precision baseline. Furthermore, our preliminary implementation of I-BERT shows a speedup of 2.4-4.0x for INT8 inference on a T4 GPU system as compared to FP32 inference. The framework has been developed in PyTorch and has been open-sourced. △ Less

Submitted 8 June, 2021; v1 submitted 4 January, 2021; originally announced January 2021.

Journal ref: ICML 2021 (Oral)

arXiv:2012.02206 [pdf, other]

Scan2Cap: Context-aware Dense Captioning in RGB-D Scans

Authors: Dave Zhenyu Chen, Ali Gholami, Matthias Nießner, Angel X. Chang

Abstract: We introduce the task of dense captioning in 3D scans from commodity RGB-D sensors. As input, we assume a point cloud of a 3D scene; the expected output is the bounding boxes along with the descriptions for the underlying objects. To address the 3D object detection and description problems, we propose Scan2Cap, an end-to-end trained method, to detect objects in the input scene and describe them in… ▽ More We introduce the task of dense captioning in 3D scans from commodity RGB-D sensors. As input, we assume a point cloud of a 3D scene; the expected output is the bounding boxes along with the descriptions for the underlying objects. To address the 3D object detection and description problems, we propose Scan2Cap, an end-to-end trained method, to detect objects in the input scene and describe them in natural language. We use an attention mechanism that generates descriptive tokens while referring to the related components in the local context. To reflect object relations (i.e. relative spatial relations) in the generated captions, we use a message passing graph module to facilitate learning object relation features. Our method can effectively localize and describe 3D objects in scenes from the ScanRefer dataset, outperforming 2D baseline methods by a significant margin (27.61% [email protected]). △ Less

Submitted 3 December, 2020; originally announced December 2020.

Comments: Video: https://youtu.be/AgmIpDbwTCY

arXiv:2011.10680 [pdf, other]

HAWQV3: Dyadic Neural Network Quantization

Authors: Zhewei Yao, Zhen Dong, Zhangcheng Zheng, Amir Gholami, Jiali Yu, Eric Tan, Leyuan Wang, Qi**g Huang, Yida Wang, Michael W. Mahoney, Kurt Keutzer

Abstract: Current low-precision quantization algorithms often have the hidden cost of conversion back and forth from floating point to quantized integer values. This hidden cost limits the latency improvement realized by quantizing Neural Networks. To address this, we present HAWQV3, a novel mixed-precision integer-only quantization framework. The contributions of HAWQV3 are the following: (i) An integer-on… ▽ More Current low-precision quantization algorithms often have the hidden cost of conversion back and forth from floating point to quantized integer values. This hidden cost limits the latency improvement realized by quantizing Neural Networks. To address this, we present HAWQV3, a novel mixed-precision integer-only quantization framework. The contributions of HAWQV3 are the following: (i) An integer-only inference where the entire computational graph is performed only with integer multiplication, addition, and bit shifting, without any floating point operations or even integer division; (ii) A novel hardware-aware mixed-precision quantization method where the bit-precision is calculated by solving an integer linear programming problem that balances the trade-off between model perturbation and other constraints, e.g., memory footprint and latency; (iii) Direct hardware deployment and open source contribution for 4-bit uniform/mixed-precision quantization in TVM, achieving an average speed up of $1.45\times$ for uniform 4-bit, as compared to uniform 8-bit for ResNet50 on T4 GPUs; and (iv) extensive evaluation of the proposed methods on ResNet18/50 and InceptionV3, for various model compression levels with/without mixed precision. For ResNet50, our INT8 quantization achieves an accuracy of $77.58\%$, which is $2.68\%$ higher than prior integer-only work, and our mixed-precision INT4/8 quantization can reduce INT8 latency by $23\%$ and still achieve $76.73\%$ accuracy. Our framework and the TVM implementation have been open sourced. △ Less

Submitted 23 June, 2021; v1 submitted 20 November, 2020; originally announced November 2020.

Journal ref: ICML 2021

arXiv:2009.14446 [pdf, other]

Joint Mobility-Aware UAV Placement and Routing in Multi-Hop UAV Relaying Systems

Authors: Anousheh Gholami, Nariman Torkzaban, John S. Baras, Chrysa Papagianni

Abstract: Unmanned Aerial Vehicles (UAVs) have been extensively utilized to provide wireless connectivity in rural and under-developed areas, enhance network capacity and provide support for peaks or unexpected surges in user demand, mainly due to their fast deployment, cost-efficiency and superior communication performance resulting from Line of Sight (LoS)-dominated wireless channels. In order to exploit… ▽ More Unmanned Aerial Vehicles (UAVs) have been extensively utilized to provide wireless connectivity in rural and under-developed areas, enhance network capacity and provide support for peaks or unexpected surges in user demand, mainly due to their fast deployment, cost-efficiency and superior communication performance resulting from Line of Sight (LoS)-dominated wireless channels. In order to exploit the benefits of UAVs as base stations or relays in a mobile network, a major challenge is to determine the optimal UAV placement and relocation strategy with respect to the mobility and traffic patterns of the ground network nodes. Moreover, considering that the UAVs form a multi-hop aerial network, capacity and connectivity constraints have significant impacts on the end-to-end network performance. To this end, we formulate the joint UAV placement and routing problem as a Mixed Integer Linear Program (MILP) and propose an approximation that leads to a LP rounding algorithm and achieves a balance between time-complexity and optimality. △ Less

Submitted 30 September, 2020; originally announced September 2020.

Comments: 15 Pages, Accepted at ADHOCNETS2020

arXiv:2007.05086 [pdf, other]

Boundary thickness and robustness in learning models

Authors: Yaoqing Yang, Rajiv Khanna, Yaodong Yu, Amir Gholami, Kurt Keutzer, Joseph E. Gonzalez, Kannan Ramchandran, Michael W. Mahoney

Abstract: Robustness of machine learning models to various adversarial and non-adversarial corruptions continues to be of interest. In this paper, we introduce the notion of the boundary thickness of a classifier, and we describe its connection with and usefulness for model robustness. Thick decision boundaries lead to improved performance, while thin decision boundaries lead to overfitting (e.g., measured… ▽ More Robustness of machine learning models to various adversarial and non-adversarial corruptions continues to be of interest. In this paper, we introduce the notion of the boundary thickness of a classifier, and we describe its connection with and usefulness for model robustness. Thick decision boundaries lead to improved performance, while thin decision boundaries lead to overfitting (e.g., measured by the robust generalization gap between training and testing) and lower robustness. We show that a thicker boundary helps improve robustness against adversarial examples (e.g., improving the robust test accuracy of adversarial training) as well as so-called out-of-distribution (OOD) transforms, and we show that many commonly-used regularization and data augmentation procedures can increase boundary thickness. On the theoretical side, we establish that maximizing boundary thickness during training is akin to the so-called mixup training. Using these observations, we show that noise-augmentation on mixup training further increases boundary thickness, thereby combating vulnerability to various forms of adversarial attacks and OOD transforms. We can also show that the performance improvement in several lines of recent work happens in conjunction with a thicker boundary. △ Less

Submitted 12 January, 2021; v1 submitted 9 July, 2020; originally announced July 2020.

Journal ref: NeurIPS 2020

arXiv:2006.00719 [pdf, other]

ADAHESSIAN: An Adaptive Second Order Optimizer for Machine Learning

Authors: Zhewei Yao, Amir Gholami, Sheng Shen, Mustafa Mustafa, Kurt Keutzer, Michael W. Mahoney

Abstract: We introduce ADAHESSIAN, a second order stochastic optimization algorithm which dynamically incorporates the curvature of the loss function via ADAptive estimates of the HESSIAN. Second order algorithms are among the most powerful optimization algorithms with superior convergence properties as compared to first order methods such as SGD and Adam. The main disadvantage of traditional second order m… ▽ More We introduce ADAHESSIAN, a second order stochastic optimization algorithm which dynamically incorporates the curvature of the loss function via ADAptive estimates of the HESSIAN. Second order algorithms are among the most powerful optimization algorithms with superior convergence properties as compared to first order methods such as SGD and Adam. The main disadvantage of traditional second order methods is their heavier per iteration computation and poor accuracy as compared to first order methods. To address these, we incorporate several novel approaches in ADAHESSIAN, including: (i) a fast Hutchinson based method to approximate the curvature matrix with low computational overhead; (ii) a root-mean-square exponential moving average to smooth out variations of the Hessian diagonal across different iterations; and (iii) a block diagonal averaging to reduce the variance of Hessian diagonal elements. We show that ADAHESSIAN achieves new state-of-the-art results by a large margin as compared to other adaptive optimization methods, including variants of Adam. In particular, we perform extensive tests on CV, NLP, and recommendation system tasks and find that ADAHESSIAN: (i) achieves 1.80%/1.45% higher accuracy on ResNets20/32 on Cifar10, and 5.55% higher accuracy on ImageNet as compared to Adam; (ii) outperforms AdamW for transformers by 0.13/0.33 BLEU score on IWSLT14/WMT14 and 2.7/1.0 PPL on PTB/Wikitext-103; (iii) outperforms AdamW for SqueezeBert by 0.41 points on GLUE; and (iv) achieves 0.032% better score than Adagrad for DLRM on the Criteo Ad Kaggle dataset. Importantly, we show that the cost per iteration of ADAHESSIAN is comparable to first order methods, and that it exhibits robustness towards its hyperparameters. △ Less

Submitted 28 April, 2021; v1 submitted 1 June, 2020; originally announced June 2020.

Journal ref: AAAI 2021

arXiv:2003.07845 [pdf, other]

PowerNorm: Rethinking Batch Normalization in Transformers

Authors: Sheng Shen, Zhewei Yao, Amir Gholami, Michael W. Mahoney, Kurt Keutzer

Abstract: The standard normalization method for neural network (NN) models used in Natural Language Processing (NLP) is layer normalization (LN). This is different than batch normalization (BN), which is widely-adopted in Computer Vision. The preferred use of LN in NLP is principally due to the empirical observation that a (naive/vanilla) use of BN leads to significant performance degradation for NLP tasks;… ▽ More The standard normalization method for neural network (NN) models used in Natural Language Processing (NLP) is layer normalization (LN). This is different than batch normalization (BN), which is widely-adopted in Computer Vision. The preferred use of LN in NLP is principally due to the empirical observation that a (naive/vanilla) use of BN leads to significant performance degradation for NLP tasks; however, a thorough understanding of the underlying reasons for this is not always evident. In this paper, we perform a systematic study of NLP transformer models to understand why BN has a poor performance, as compared to LN. We find that the statistics of NLP data across the batch dimension exhibit large fluctuations throughout training. This results in instability, if BN is naively implemented. To address this, we propose Power Normalization (PN), a novel normalization scheme that resolves this issue by (i) relaxing zero-mean normalization in BN, (ii) incorporating a running quadratic mean instead of per batch statistics to stabilize fluctuations, and (iii) using an approximate backpropagation for incorporating the running statistics in the forward pass. We show theoretically, under mild assumptions, that PN leads to a smaller Lipschitz constant for the loss, compared with BN. Furthermore, we prove that the approximate backpropagation scheme leads to bounded gradients. We extensively test PN for transformers on a range of NLP tasks, and we show that it significantly outperforms both LN and BN. In particular, PN outperforms LN by 0.4/0.6 BLEU on IWSLT14/WMT14 and 5.6/3.0 PPL on PTB/WikiText-103. We make our code publicly available at \url{https://github.com/sIncerass/powernorm}. △ Less

Submitted 28 June, 2020; v1 submitted 17 March, 2020; originally announced March 2020.

Journal ref: ICML 2020

arXiv:2002.03071 [pdf, other]

Joint Satellite Gateway Placement and Routing for Integrated Satellite-Terrestrial Networks

Authors: Nariman Torkzaban, Anousheh Gholami, Chrysa Papagianni, John S. Baras

Abstract: With the increasing attention to the integrated satellite-terrestrial networks (ISTNs), the satellite gateway placement problem becomes of paramount importance. The resulting network performance may vary depending on the different design strategies. In this paper, a joint satellite gateway placement and routing strategy for the terrestrial network is proposed to minimize the overall cost of gatewa… ▽ More With the increasing attention to the integrated satellite-terrestrial networks (ISTNs), the satellite gateway placement problem becomes of paramount importance. The resulting network performance may vary depending on the different design strategies. In this paper, a joint satellite gateway placement and routing strategy for the terrestrial network is proposed to minimize the overall cost of gateway deployment and traffic routing, while adhering to the average delay requirement for traffic demands. Although traffic routing and gateway placement can be solved independently, the dependence between the routing decisions for different demands makes it more realistic to solve an aggregated model instead. We develop a mixed-integer linear program (MILP) formulation for the problem. We relax the integrality constraints to achieve a linear program (LP) which reduces time-complexity at the expense of a sub-optimal solution. We further propose a variant of the proposed model to balance the load between the selected gateways. △ Less

Submitted 5 October, 2020; v1 submitted 7 February, 2020; originally announced February 2020.

Comments: 6 pages, In Proceedings of IEEE ICC 2020. https://ieeexplore.ieee.org/document/9149175 N. Torkzaban, A. Gholami, J. S. Baras and C. Papagianni, "Joint Satellite Gateway Placement and Routing for Integrated Satellite-Terrestrial Networks," ICC 2020 - 2020 IEEE International Conference on Communications (ICC), Dublin, Ireland, 2020, pp. 1-6. doi: 10.1109/ICC40277.2020.9149175

arXiv:2001.04802 [pdf]

A Bayesian Monte-Carlo Uncertainty Model for Assessment of Shear Stress Entropy

Authors: Amin Kazemian-Kale-Kale, Azadeh Gholami, Mohammad Rezaie-Balf, Amir Mosavi, Ahmed A Sattar, Bahram Gharabaghi, Hossein Bonakdari

Abstract: The entropy models have been recently adopted in many studies to evaluate the distribution of the shear stress in circular channels. However, the uncertainty in their predictions and their reliability remains an open question. We present a novel method to evaluate the uncertainty of four popular entropy models, including Shannon, Shannon-Power Low (PL), Tsallis, and Renyi, in shear stress estimati… ▽ More The entropy models have been recently adopted in many studies to evaluate the distribution of the shear stress in circular channels. However, the uncertainty in their predictions and their reliability remains an open question. We present a novel method to evaluate the uncertainty of four popular entropy models, including Shannon, Shannon-Power Low (PL), Tsallis, and Renyi, in shear stress estimation in circular channels. The Bayesian Monte-Carlo (BMC) uncertainty method is simplified considering a 95% Confidence Bound (CB). We developed a new statistic index called as FREEopt-based OCB (FOCB) using the statistical indices Forecasting Range of Error Estimation (FREE) and the percentage of observed data in the CB (Nin), which integrates their combined effect. The Shannon and Shannon PL entropies had close values of the FOCB equal to 8.781 and 9.808, respectively, had the highest certainty in the calculation of shear stress values in circular channels followed by traditional uniform flow shear stress and Tsallis models with close values of 14.491 and 14.895, respectively. However, Renyi entropy with much higher values of FOCB equal to 57.726 has less certainty in the estimation of shear stress than other models. Using the presented results in this study, the amount of confidence in entropy methods in the calculation of shear stress to design and implement different types of open channels and their stability is determined. △ Less

Submitted 10 January, 2020; originally announced January 2020.

Comments: 48 pages, 7 figures

MSC Class: 65Z05

arXiv:2001.00281 [pdf, other]

ZeroQ: A Novel Zero Shot Quantization Framework

Authors: Yaohui Cai, Zhewei Yao, Zhen Dong, Amir Gholami, Michael W. Mahoney, Kurt Keutzer

Abstract: Quantization is a promising approach for reducing the inference time and memory footprint of neural networks. However, most existing quantization methods require access to the original training dataset for retraining during quantization. This is often not possible for applications with sensitive or proprietary data, e.g., due to privacy and security concerns. Existing zero-shot quantization method… ▽ More Quantization is a promising approach for reducing the inference time and memory footprint of neural networks. However, most existing quantization methods require access to the original training dataset for retraining during quantization. This is often not possible for applications with sensitive or proprietary data, e.g., due to privacy and security concerns. Existing zero-shot quantization methods use different heuristics to address this, but they result in poor performance, especially when quantizing to ultra-low precision. Here, we propose ZeroQ , a novel zero-shot quantization framework to address this. ZeroQ enables mixed-precision quantization without any access to the training or validation data. This is achieved by optimizing for a Distilled Dataset, which is engineered to match the statistics of batch normalization across different layers of the network. ZeroQ supports both uniform and mixed-precision quantization. For the latter, we introduce a novel Pareto frontier based method to automatically determine the mixed-precision bit setting for all layers, with no manual search involved. We extensively test our proposed method on a diverse set of models, including ResNet18/50/152, MobileNetV2, ShuffleNet, SqueezeNext, and InceptionV3 on ImageNet, as well as RetinaNet-ResNet50 on the Microsoft COCO dataset. In particular, we show that ZeroQ can achieve 1.71\% higher accuracy on MobileNetV2, as compared to the recently proposed DFQ method. Importantly, ZeroQ has a very low computational overhead, and it can finish the entire quantization process in less than 30s (0.5\% of one epoch training time of ResNet50 on ImageNet). We have open-sourced the ZeroQ framework\footnote{https://github.com/amirgholami/ZeroQ}. △ Less

Submitted 1 January, 2020; originally announced January 2020.

Comments: CVPR 2020

arXiv:1912.07145 [pdf, other]

PyHessian: Neural Networks Through the Lens of the Hessian

Authors: Zhewei Yao, Amir Gholami, Kurt Keutzer, Michael Mahoney

Abstract: We present PYHESSIAN, a new scalable framework that enables fast computation of Hessian (i.e., second-order derivative) information for deep neural networks. PYHESSIAN enables fast computations of the top Hessian eigenvalues, the Hessian trace, and the full Hessian eigenvalue/spectral density, and it supports distributed-memory execution on cloud/supercomputer systems and is available as open sour… ▽ More We present PYHESSIAN, a new scalable framework that enables fast computation of Hessian (i.e., second-order derivative) information for deep neural networks. PYHESSIAN enables fast computations of the top Hessian eigenvalues, the Hessian trace, and the full Hessian eigenvalue/spectral density, and it supports distributed-memory execution on cloud/supercomputer systems and is available as open source. This general framework can be used to analyze neural network models, including the topology of the loss landscape (i.e., curvature information) to gain insight into the behavior of different models/optimizers. To illustrate this, we analyze the effect of residual connections and Batch Normalization layers on the trainability of neural networks. One recent claim, based on simpler first-order analysis, is that residual connections and Batch Normalization make the loss landscape smoother, thus making it easier for Stochastic Gradient Descent to converge to a good solution. Our extensive analysis shows new finer-scale insights, demonstrating that, while conventional wisdom is sometimes validated, in other cases it is simply incorrect. In particular, we find that Batch Normalization does not necessarily make the loss landscape smoother, especially for shallower networks. △ Less

Submitted 5 March, 2020; v1 submitted 15 December, 2019; originally announced December 2019.

Journal ref: IEEE BigData 2020 (and ICML Workshop 2020)

arXiv:1911.03852 [pdf, other]

HAWQ-V2: Hessian Aware trace-Weighted Quantization of Neural Networks

Authors: Zhen Dong, Zhewei Yao, Yaohui Cai, Daiyaan Arfeen, Amir Gholami, Michael W. Mahoney, Kurt Keutzer

Abstract: Quantization is an effective method for reducing memory footprint and inference time of Neural Networks, e.g., for efficient inference in the cloud, especially at the edge. However, ultra low precision quantization could lead to significant degradation in model generalization. A promising method to address this is to perform mixed-precision quantization, where more sensitive layers are kept at hig… ▽ More Quantization is an effective method for reducing memory footprint and inference time of Neural Networks, e.g., for efficient inference in the cloud, especially at the edge. However, ultra low precision quantization could lead to significant degradation in model generalization. A promising method to address this is to perform mixed-precision quantization, where more sensitive layers are kept at higher precision. However, the search space for a mixed-precision quantization is exponential in the number of layers. Recent work has proposed HAWQ, a novel Hessian based framework, with the aim of reducing this exponential search space by using second-order information. While promising, this prior work has three major limitations: (i) HAWQV1 only uses the top Hessian eigenvalue as a measure of sensitivity and do not consider the rest of the Hessian spectrum; (ii) HAWQV1 approach only provides relative sensitivity of different layers and therefore requires a manual selection of the mixed-precision setting; and (iii) HAWQV1 does not consider mixed-precision activation quantization. Here, we present HAWQV2 which addresses these shortcomings. For (i), we perform a theoretical analysis showing that a better sensitivity metric is to compute the average of all of the Hessian eigenvalues. For (ii), we develop a Pareto frontier based method for selecting the exact bit precision of different layers without any manual selection. For (iii), we extend the Hessian analysis to mixed-precision activation quantization. We have found this to be very beneficial for object detection. We show that HAWQV2 achieves new state-of-the-art results for a wide range of tasks. △ Less

Submitted 9 November, 2019; originally announced November 2019.

Journal ref: NeurIPS 2020 paper, link: https://proceedings.neurips.cc/paper/2020/file/d77c703536718b95308130ff2e5cf9ee-Supplemental.pdf

arXiv:1910.02653 [pdf, other]

Checkmate: Breaking the Memory Wall with Optimal Tensor Rematerialization

Authors: Paras Jain, Ajay Jain, Aniruddha Nrusimha, Amir Gholami, Pieter Abbeel, Kurt Keutzer, Ion Stoica, Joseph E. Gonzalez

Abstract: We formalize the problem of trading-off DNN training time and memory requirements as the tensor rematerialization optimization problem, a generalization of prior checkpointing strategies. We introduce Checkmate, a system that solves for optimal rematerialization schedules in reasonable times (under an hour) using off-the-shelf MILP solvers or near-optimal schedules with an approximation algorithm,… ▽ More We formalize the problem of trading-off DNN training time and memory requirements as the tensor rematerialization optimization problem, a generalization of prior checkpointing strategies. We introduce Checkmate, a system that solves for optimal rematerialization schedules in reasonable times (under an hour) using off-the-shelf MILP solvers or near-optimal schedules with an approximation algorithm, then uses these schedules to accelerate millions of training iterations. Our method scales to complex, realistic architectures and is hardware-aware through the use of accelerator-specific, profile-based cost models. In addition to reducing training cost, Checkmate enables real-world networks to be trained with up to 5.1x larger input sizes. Checkmate is an open-source project, available at https://github.com/parasj/checkmate. △ Less

Submitted 14 May, 2020; v1 submitted 7 October, 2019; originally announced October 2019.

Comments: In Proceedings of 3rd Conference Machine Learning and Systems 2020 (MLSys 2020)

arXiv:1909.05840 [pdf, other]

Q-BERT: Hessian Based Ultra Low Precision Quantization of BERT

Authors: Sheng Shen, Zhen Dong, Jiayu Ye, Linjian Ma, Zhewei Yao, Amir Gholami, Michael W. Mahoney, Kurt Keutzer

Abstract: Transformer based architectures have become de-facto models used for a range of Natural Language Processing tasks. In particular, the BERT based models achieved significant accuracy gain for GLUE tasks, CoNLL-03 and SQuAD. However, BERT based models have a prohibitive memory footprint and latency. As a result, deploying BERT based models in resource constrained environments has become a challengin… ▽ More Transformer based architectures have become de-facto models used for a range of Natural Language Processing tasks. In particular, the BERT based models achieved significant accuracy gain for GLUE tasks, CoNLL-03 and SQuAD. However, BERT based models have a prohibitive memory footprint and latency. As a result, deploying BERT based models in resource constrained environments has become a challenging task. In this work, we perform an extensive analysis of fine-tuned BERT models using second order Hessian information, and we use our results to propose a novel method for quantizing BERT models to ultra low precision. In particular, we propose a new group-wise quantization scheme, and we use a Hessian based mix-precision method to compress the model further. We extensively test our proposed method on BERT downstream tasks of SST-2, MNLI, CoNLL-03, and SQuAD. We can achieve comparable performance to baseline with at most $2.3\%$ performance degradation, even with ultra-low precision quantization down to 2 bits, corresponding up to $13\times$ compression of the model parameters, and up to $4\times$ compression of the embedding table as well as activations. Among all tasks, we observed the highest performance loss for BERT fine-tuned on SQuAD. By probing into the Hessian based analysis as well as visualization, we show that this is related to the fact that current training/fine-tuning strategy of BERT does not converge for SQuAD. △ Less

Submitted 24 September, 2019; v1 submitted 12 September, 2019; originally announced September 2019.

Journal ref: AAAI 2020

arXiv:1909.02150 [pdf, other]

Drone-Assisted Communications for Remote Areas and Disaster Relief

Authors: Anousheh Gholami, Usman A. Fiaz, John S. Baras

Abstract: We explore an end-to-end (including access and backhaul links) UAV-assisted wireless communication system, considering both uplink and downlink traffics, with the goal of supporting demand of the Ground Users (GUs) using the minimum number of UAVs. Moreover, in order to extend the operational (flight) time of UAVs, we exploit an energy-aware routing scheme. Our intention is to design and analyze t… ▽ More We explore an end-to-end (including access and backhaul links) UAV-assisted wireless communication system, considering both uplink and downlink traffics, with the goal of supporting demand of the Ground Users (GUs) using the minimum number of UAVs. Moreover, in order to extend the operational (flight) time of UAVs, we exploit an energy-aware routing scheme. Our intention is to design and analyze the access and backhaul connectivity of a drone-assisted communication network for remote and crowded areas and disaster relief, while minimizing the resources required i.e., the number of UAVs. △ Less

Submitted 4 September, 2019; originally announced September 2019.

Comments: Accepted at DGRS 2019

arXiv:1906.04596 [pdf, other]

ANODEV2: A Coupled Neural ODE Evolution Framework

Authors: Tianjun Zhang, Zhewei Yao, Amir Gholami, Kurt Keutzer, Joseph Gonzalez, George Biros, Michael Mahoney

Abstract: It has been observed that residual networks can be viewed as the explicit Euler discretization of an Ordinary Differential Equation (ODE). This observation motivated the introduction of so-called Neural ODEs, which allow more general discretization schemes with adaptive time step**. Here, we propose ANODEV2, which is an extension of this approach that also allows evolution of the neural network… ▽ More It has been observed that residual networks can be viewed as the explicit Euler discretization of an Ordinary Differential Equation (ODE). This observation motivated the introduction of so-called Neural ODEs, which allow more general discretization schemes with adaptive time step**. Here, we propose ANODEV2, which is an extension of this approach that also allows evolution of the neural network parameters, in a coupled ODE-based formulation. The Neural ODE method introduced earlier is in fact a special case of this new more general framework. We present the formulation of ANODEV2, derive optimality conditions, and implement a coupled reaction-diffusion-advection version of this framework in PyTorch. We present empirical results using several different configurations of ANODEV2, testing them on multiple models on CIFAR-10. We report results showing that this coupled ODE-based framework is indeed trainable, and that it achieves higher accuracy, as compared to the baseline models as well as the recently-proposed Neural ODE approach. △ Less

Submitted 9 June, 2019; originally announced June 2019.

Journal ref: NeurIPS 2019

arXiv:1905.03696 [pdf, other]

HAWQ: Hessian AWare Quantization of Neural Networks with Mixed-Precision

Authors: Zhen Dong, Zhewei Yao, Amir Gholami, Michael Mahoney, Kurt Keutzer

Abstract: Model size and inference speed/power have become a major challenge in the deployment of Neural Networks for many applications. A promising approach to address these problems is quantization. However, uniformly quantizing a model to ultra low precision leads to significant accuracy degradation. A novel solution for this is to use mixed-precision quantization, as some parts of the network may allow… ▽ More Model size and inference speed/power have become a major challenge in the deployment of Neural Networks for many applications. A promising approach to address these problems is quantization. However, uniformly quantizing a model to ultra low precision leads to significant accuracy degradation. A novel solution for this is to use mixed-precision quantization, as some parts of the network may allow lower precision as compared to other layers. However, there is no systematic way to determine the precision of different layers. A brute force approach is not feasible for deep networks, as the search space for mixed-precision is exponential in the number of layers. Another challenge is a similar factorial complexity for determining block-wise fine-tuning order when quantizing the model to a target precision. Here, we introduce Hessian AWare Quantization (HAWQ), a novel second-order quantization method to address these problems. HAWQ allows for the automatic selection of the relative quantization precision of each layer, based on the layer's Hessian spectrum. Moreover, HAWQ provides a deterministic fine-tuning order for quantizing layers, based on second-order information. We show the results of our method on Cifar-10 using ResNet20, and on ImageNet using Inception-V3, ResNet50 and SqueezeNext models. Comparing HAWQ with state-of-the-art shows that we can achieve similar/better accuracy with $8\times$ activation compression ratio on ResNet20, as compared to DNAS~\cite{wu2018mixed}, and up to $1\%$ higher accuracy with up to $14\%$ smaller models on ResNet50 and Inception-V3, compared to recently proposed methods of RVQuant~\cite{park2018value} and HAQ~\cite{wang2018haq}. Furthermore, we show that we can quantize SqueezeNext to just 1MB model size while achieving above $68\%$ top1 accuracy on ImageNet. △ Less

Submitted 29 April, 2019; originally announced May 2019.

Comments: ICCV 2019

Journal ref: ICCV 2019 paper

arXiv:1904.00800 [pdf, ps, other]

Private Shotgun DNA Sequencing: A Structured Approach

Authors: Ali Gholami, Mohammad Ali Maddah-Ali, Seyed Abolfazl Motahari

Abstract: DNA sequencing has faced a huge demand since it was first introduced as a service to the public. This service is often offloaded to the sequencing companies who will have access to full knowledge of individuals' sequences, a major violation of privacy. To address this challenge, we propose a solution, which is based on separating the process of reading the fragments of sequences, which is done at… ▽ More DNA sequencing has faced a huge demand since it was first introduced as a service to the public. This service is often offloaded to the sequencing companies who will have access to full knowledge of individuals' sequences, a major violation of privacy. To address this challenge, we propose a solution, which is based on separating the process of reading the fragments of sequences, which is done at a sequencing machine, and assembling the reads, which is done at a trusted local data collector. To confuse the sequencer, in a pooled sequencing scenario, in which multiple sequences are going to be sequenced simultaneously, for each target individual, we add fragments of one non-target individual, with a known DNA sequence at the data collector. Then coverage depth of the individuals, defined as the number of DNA fragments per DNA site, are selected proportional to the powers of two. This layered structured solution allows us to ensure privacy, using only one sequencing machine, in contrast to our previous solution, where we relied on the existence of multiple non-colluding sequencing machines. △ Less

Submitted 2 April, 2019; v1 submitted 28 March, 2019; originally announced April 2019.

Comments: 10 pages, 3 figures. arXiv admin note: text overlap with arXiv:1811.10693

ACM Class: E.4; H.1.1

arXiv:1903.06237 [pdf, other]

Inefficiency of K-FAC for Large Batch Size Training

Authors: Linjian Ma, Gabe Montague, Jiayu Ye, Zhewei Yao, Amir Gholami, Kurt Keutzer, Michael W. Mahoney

Abstract: In stochastic optimization, using large batch sizes during training can leverage parallel resources to produce faster wall-clock training times per training epoch. However, for both training loss and testing error, recent results analyzing large batch Stochastic Gradient Descent (SGD) have found sharp diminishing returns, beyond a certain critical batch size. In the hopes of addressing this, it ha… ▽ More In stochastic optimization, using large batch sizes during training can leverage parallel resources to produce faster wall-clock training times per training epoch. However, for both training loss and testing error, recent results analyzing large batch Stochastic Gradient Descent (SGD) have found sharp diminishing returns, beyond a certain critical batch size. In the hopes of addressing this, it has been suggested that the Kronecker-Factored Approximate Curvature (\mbox{K-FAC}) method allows for greater scalability to large batch sizes, for non-convex machine learning problems such as neural network optimization, as well as greater robustness to variation in model hyperparameters. Here, we perform a detailed empirical analysis of large batch size training %of these two hypotheses, for both \mbox{K-FAC} and SGD, evaluating performance in terms of both wall-clock time and aggregate computational cost. Our main results are twofold: first, we find that both \mbox{K-FAC} and SGD doesn't have ideal scalability behavior beyond a certain batch size, and that \mbox{K-FAC} does not exhibit improved large-batch scalability behavior, as compared to SGD; and second, we find that \mbox{K-FAC}, in addition to requiring more hyperparameters to tune, suffers from similar hyperparameter sensitivity behavior as does SGD. We discuss extensive results using ResNet and AlexNet on \mbox{CIFAR-10} and SVHN, respectively, as well as more general implications of our findings. △ Less

Submitted 31 July, 2019; v1 submitted 14 March, 2019; originally announced March 2019.

Journal ref: AAAI 2020

arXiv:1902.10298 [pdf, other]

ANODE: Unconditionally Accurate Memory-Efficient Gradients for Neural ODEs

Authors: Amir Gholami, Kurt Keutzer, George Biros

Abstract: Residual neural networks can be viewed as the forward Euler discretization of an Ordinary Differential Equation (ODE) with a unit time step. This has recently motivated researchers to explore other discretization approaches and train ODE based networks. However, an important challenge of neural ODEs is their prohibitive memory cost during gradient backpropogation. Recently a method proposed in [8]… ▽ More Residual neural networks can be viewed as the forward Euler discretization of an Ordinary Differential Equation (ODE) with a unit time step. This has recently motivated researchers to explore other discretization approaches and train ODE based networks. However, an important challenge of neural ODEs is their prohibitive memory cost during gradient backpropogation. Recently a method proposed in [8], claimed that this memory overhead can be reduced from O(LN_t), where N_t is the number of time steps, down to O(L) by solving forward ODE backwards in time, where L is the depth of the network. However, we will show that this approach may lead to several problems: (i) it may be numerically unstable for ReLU/non-ReLU activations and general convolution operators, and (ii) the proposed optimize-then-discretize approach may lead to divergent training due to inconsistent gradients for small time step sizes. We discuss the underlying problems, and to address them we propose ANODE, an Adjoint based Neural ODE framework which avoids the numerical instability related problems noted above, and provides unconditionally accurate gradients. ANODE has a memory footprint of O(L) + O(N_t), with the same computational cost as reversing ODE solve. We furthermore, discuss a memory efficient algorithm which can further reduce this footprint with a trade-off of additional computational cost. We show results on Cifar-10/100 datasets using ResNet and SqueezeNext neural networks. △ Less

Submitted 1 July, 2019; v1 submitted 26 February, 2019; originally announced February 2019.

arXiv:1812.06371 [pdf, other]

Trust Region Based Adversarial Attack on Neural Networks

Authors: Zhewei Yao, Amir Gholami, Peng Xu, Kurt Keutzer, Michael Mahoney

Abstract: Deep Neural Networks are quite vulnerable to adversarial perturbations. Current state-of-the-art adversarial attack methods typically require very time consuming hyper-parameter tuning, or require many iterations to solve an optimization based adversarial attack. To address this problem, we present a new family of trust region based adversarial attacks, with the goal of computing adversarial pertu… ▽ More Deep Neural Networks are quite vulnerable to adversarial perturbations. Current state-of-the-art adversarial attack methods typically require very time consuming hyper-parameter tuning, or require many iterations to solve an optimization based adversarial attack. To address this problem, we present a new family of trust region based adversarial attacks, with the goal of computing adversarial perturbations efficiently. We propose several attacks based on variants of the trust region optimization method. We test the proposed methods on Cifar-10 and ImageNet datasets using several different models including AlexNet, ResNet-50, VGG-16, and DenseNet-121 models. Our methods achieve comparable results with the Carlini-Wagner (CW) attack, but with significant speed up of up to $37\times$, for the VGG-16 model on a Titan Xp GPU. For the case of ResNet-50 on ImageNet, we can bring down its classification accuracy to less than 0.1\% with at most $1.5\%$ relative $L_\infty$ (or $L_2$) perturbation requiring only $1.02$ seconds as compared to $27.04$ seconds for the CW attack. We have open sourced our method which can be accessed at [1]. △ Less

Submitted 15 December, 2018; originally announced December 2018.

Journal ref: CVPR 2019

arXiv:1812.01216 [pdf, other]

Parameter Re-Initialization through Cyclical Batch Size Schedules

Authors: Norman Mu, Zhewei Yao, Amir Gholami, Kurt Keutzer, Michael Mahoney

Abstract: Optimal parameter initialization remains a crucial problem for neural network training. A poor weight initialization may take longer to train and/or converge to sub-optimal solutions. Here, we propose a method of weight re-initialization by repeated annealing and injection of noise in the training process. We implement this through a cyclical batch size schedule motivated by a Bayesian perspective… ▽ More Optimal parameter initialization remains a crucial problem for neural network training. A poor weight initialization may take longer to train and/or converge to sub-optimal solutions. Here, we propose a method of weight re-initialization by repeated annealing and injection of noise in the training process. We implement this through a cyclical batch size schedule motivated by a Bayesian perspective of neural network training. We evaluate our methods through extensive experiments on tasks in language modeling, natural language inference, and image classification. We demonstrate the ability of our method to improve language modeling performance by up to 7.91 perplexity and reduce training iterations by up to $61\%$, in addition to its flexibility in enabling snapshot ensembling and use with adversarial training. △ Less

Submitted 3 December, 2018; originally announced December 2018.

Comments: Presented in Systems for Machine Learning Workshop at NeurIPS'18 conference

Journal ref: NeurIPS 2018 Workshop

Showing 1–50 of 65 results for author: Gholami, A