-
Stability of Multi-Microgrids: New Certificates, Distributed Control, and Braess's Paradox
Authors:
Amin Gholami,
Xu Andy Sun
Abstract:
This paper investigates the theory of resilience and stability in multi-microgrid networks. We derive new sufficient conditions to guarantee small-signal stability of multi-microgrids in both lossless and lossy networks. The new stability certificate for lossy networks requires only local information and thus leads to a fully distributed control scheme. Moreover, we study the impact of network topology, interface parameters (virtual inertia and damping), and local measurements (voltage magnitude and reactive power) on the stability of the system. The proposed stability certificate suggests the existence of Braess's Paradox in the stability of multi-microgrids, i.e., adding more connections between microgrids could worsen the stability of the multi-microgrid system as a whole. We also extend the presented analysis to structure-preserving network models and provide a stability certificate as a function of the original network parameters instead of the Kron-reduced network parameters. We provide a detailed numerical study of the proposed certificate, the distributed control scheme, and a coordinated control approach with line switching. The simulation shows the effectiveness of the proposed stability conditions and control schemes in a four-microgrid network, the IEEE 33-bus system, and several large-scale synthetic grids.
Submitted 28 March, 2021;
originally announced March 2021.
-
A Survey of Quantization Methods for Efficient Neural Network Inference
Authors:
Amir Gholami,
Sehoon Kim,
Zhen Dong,
Zhewei Yao,
Michael W. Mahoney,
Kurt Keutzer
Abstract:
As soon as abstract mathematical computations were adapted to computation on digital computers, the problem of efficient representation, manipulation, and communication of the numerical values in those computations arose. Strongly related to the problem of numerical representation is the problem of quantization: in what manner should a set of continuous real-valued numbers be distributed over a fixed discrete set of numbers to minimize the number of bits required and also to maximize the accuracy of the attendant computations? This perennial problem of quantization is particularly relevant whenever memory and/or computational resources are severely restricted, and it has come to the forefront in recent years due to the remarkable performance of Neural Network models in computer vision, natural language processing, and related areas. Moving from floating-point representations to low-precision fixed integer values represented in four bits or less holds the potential to reduce the memory footprint and latency by a factor of 16x; and, in fact, reductions of 4x to 8x are often realized in practice in these applications. Thus, it is not surprising that quantization has emerged recently as an important and very active sub-area of research in the efficient implementation of computations associated with Neural Networks. In this article, we survey approaches to the problem of quantizing the numerical values in deep Neural Network computations, covering the advantages/disadvantages of current methods. With this survey and its organization, we hope to have presented a useful snapshot of the current research in quantization for Neural Networks and to have given an intelligent organization to ease the evaluation of future research in this area.
Submitted 21 June, 2021; v1 submitted 25 March, 2021;
originally announced March 2021.
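The quantization question posed in this abstract (how to map continuous real values onto a small fixed set of integers) can be illustrated with a minimal sketch of symmetric uniform quantization. This is only one of many schemes such a survey covers and is not taken from the paper; the function names `quantize_symmetric` and `dequantize`, the 4-bit setting, and the sample weights are all assumptions chosen for illustration.

```python
def quantize_symmetric(values, num_bits=4):
    """Map floats to signed integers in [-qmax, qmax] using one shared scale.

    Hypothetical helper for illustration; not an API from the surveyed work.
    """
    qmax = 2 ** (num_bits - 1) - 1            # e.g. 7 for 4 bits
    max_abs = max(abs(v) for v in values)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    # Round to the nearest integer level, then clamp to the representable range.
    q = [max(-qmax, min(qmax, round(v / scale))) for v in values]
    return q, scale


def dequantize(q, scale):
    """Recover approximate real values from integer levels."""
    return [qi * scale for qi in q]


weights = [0.70, -0.30, 0.12, 0.0]            # made-up example values
q, scale = quantize_symmetric(weights, num_bits=4)
recovered = dequantize(q, scale)
```

Each recovered value differs from the original by at most half a quantization step (`scale / 2`), which is the accuracy cost paid for storing 4-bit integers instead of floating-point numbers.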
-
Efficient extended-search space full-waveform inversion with unknown source signatures
Authors:
Hossein S. Aghamiry,
Frichnel W. Mamfoumbi-Ozoumet,
Ali Gholami,
Stéphane Operto
Abstract:
Full waveform inversion (FWI) requires an accurate estimation of source signatures. Due to the coupling between the source signatures and the subsurface model, small errors in the former can translate into large errors in the latter. When direct methods are used to solve the forward problem, classical frequency-domain FWI efficiently processes multiple sources for source signature and wavefield estimations once a single Lower-Upper (LU) decomposition of the wave-equation operator has been performed. However, this efficient FWI formulation is based on the exact solution of the wave equation and hence is highly sensitive to the inaccuracy of the velocity model due to the cycle skipping pathology. Recent extended-space FWI variants tackle this sensitivity issue through a relaxation of the wave equation combined with data assimilation, allowing the wavefields to closely match the data from the first inversion iteration. Then, the subsurface parameters are updated by minimizing the wave-equation violations. When the wavefields and the source signatures are jointly estimated with this approach, the extended wave equation operator becomes source dependent, hence making direct methods ineffective. In this paper, we propose a simple method to bypass this issue and estimate source signatures efficiently during extended FWI. The proposed method replaces each source with a blended source during each data-assimilated wavefield reconstruction to make the extended wave equation operator source independent. Besides computational efficiency, the additional degrees of freedom introduced by spatially distributing the sources allow for a better signature estimation at the physical location when the velocity model is rough. Numerical tests on the Marmousi II and 2004 BP salt synthetic models confirm the efficiency and the robustness of the proposed method.
Submitted 30 April, 2021; v1 submitted 7 February, 2021;
originally announced February 2021.
-
Hessian-Aware Pruning and Optimal Neural Implant
Authors:
Shixing Yu,
Zhewei Yao,
Amir Gholami,
Zhen Dong,
Sehoon Kim,
Michael W Mahoney,
Kurt Keutzer
Abstract:
Pruning is an effective method to reduce the memory footprint and FLOPs associated with neural network models. However, existing structured-pruning methods often result in significant accuracy degradation for moderate pruning levels. To address this problem, we introduce a new Hessian Aware Pruning (HAP) method coupled with a Neural Implant approach that uses second-order sensitivity as a metric for structured pruning. The basic idea is to prune insensitive components and to use a Neural Implant for moderately sensitive components, instead of completely pruning them. For the latter approach, the moderately sensitive components are replaced with a low-rank implant that is smaller and less computationally expensive than the original component. We use the relative Hessian trace to measure sensitivity, as opposed to the magnitude-based sensitivity metric commonly used in the literature. We test HAP for both computer vision tasks and natural language tasks, and we achieve new state-of-the-art results. Specifically, HAP achieves less than $0.1\%$/$0.5\%$ degradation on PreResNet29/ResNet50 (CIFAR-10/ImageNet) with more than 70\%/50\% of parameters pruned. Meanwhile, HAP also achieves significantly better performance (up to 0.8\% with 60\% of parameters pruned) as compared to gradient-based methods for head pruning on transformer-based models. The framework has been open-sourced and is available online.
Submitted 21 June, 2021; v1 submitted 21 January, 2021;
originally announced January 2021.
-
I-BERT: Integer-only BERT Quantization
Authors:
Sehoon Kim,
Amir Gholami,
Zhewei Yao,
Michael W. Mahoney,
Kurt Keutzer
Abstract:
Transformer based models, like BERT and RoBERTa, have achieved state-of-the-art results in many Natural Language Processing tasks. However, their memory footprint, inference latency, and power consumption are prohibitive for efficient inference at the edge, and even at the data center. While quantization can be a viable solution for this, previous work on quantizing Transformer based models uses floating-point arithmetic during inference, which cannot efficiently utilize integer-only logical units such as the recent Turing Tensor Cores, or traditional integer-only ARM processors. In this work, we propose I-BERT, a novel quantization scheme for Transformer based models that quantizes the entire inference with integer-only arithmetic. Based on lightweight integer-only approximation methods for nonlinear operations, e.g., GELU, Softmax, and Layer Normalization, I-BERT performs an end-to-end integer-only BERT inference without any floating point calculation. We evaluate our approach on GLUE downstream tasks using RoBERTa-Base/Large. We show that for both cases, I-BERT achieves similar (and slightly higher) accuracy as compared to the full-precision baseline. Furthermore, our preliminary implementation of I-BERT shows a speedup of 2.4-4.0x for INT8 inference on a T4 GPU system as compared to FP32 inference. The framework has been developed in PyTorch and has been open-sourced.
Submitted 8 June, 2021; v1 submitted 4 January, 2021;
originally announced January 2021.
-
Scan2Cap: Context-aware Dense Captioning in RGB-D Scans
Authors:
Dave Zhenyu Chen,
Ali Gholami,
Matthias Nießner,
Angel X. Chang
Abstract:
We introduce the task of dense captioning in 3D scans from commodity RGB-D sensors. As input, we assume a point cloud of a 3D scene; the expected output is the bounding boxes along with the descriptions for the underlying objects. To address the 3D object detection and description problems, we propose Scan2Cap, an end-to-end trained method, to detect objects in the input scene and describe them in natural language. We use an attention mechanism that generates descriptive tokens while referring to the related components in the local context. To reflect object relations (i.e. relative spatial relations) in the generated captions, we use a message passing graph module to facilitate learning object relation features. Our method can effectively localize and describe 3D objects in scenes from the ScanRefer dataset, outperforming 2D baseline methods by a significant margin (27.61% CiDEr@0.5IoU improvement).
Submitted 3 December, 2020;
originally announced December 2020.
-
Extended full waveform inversion in the time domain by the augmented Lagrangian method
Authors:
Ali Gholami,
Hossein S. Aghamiry,
Stephane Operto
Abstract:
Extended full-waveform inversion (FWI) has shown promising results for accurate estimation of subsurface parameters when the initial models are not sufficiently accurate. Frequency-domain applications have shown that the augmented Lagrangian (AL) method solves the inverse problem accurately with a minimal effect of the penalty parameter choice. Applying this method in the time domain, however, is limited by two main factors: (1) the challenge of data-assimilated wavefield reconstruction due to the lack of an explicit time-stepping and (2) the need to store the Lagrange multipliers, which is not feasible for field-scale problems. We show that these wavefields are efficiently determined from the associated data (projection of the wavefields onto the receivers space) by using explicit time stepping. Accordingly, based on the augmented Lagrangian, a new algorithm is proposed which operates in "data space" (a lower dimensional subspace of the full space), in which the wavefield reconstruction step is replaced by reconstruction of the associated data, thus requiring optimization in a lower dimensional space (convenient for handling the Lagrange multipliers). We show that this new algorithm can be implemented efficiently in the time domain with existing solvers for the FWI and at a cost comparable to that of the FWI while benefiting from the robustness of the extended FWI formulation. The results obtained by numerical examples show the high performance of the proposed method for large-scale time-domain FWI.
Submitted 28 November, 2020;
originally announced November 2020.
-
Thermodynamic relations at the coupling boundary in adaptive resolution simulations for open systems
Authors:
Abbas Gholami,
Felix Höfling,
Rupert Klein,
Luigi Delle Site
Abstract:
The adaptive resolution simulation (AdResS) technique couples regions with different molecular resolutions and allows the exchange of molecules between different regions in an adaptive fashion. The latest development of the technique allows the atomistically resolved region to be coupled abruptly with a region of non-interacting point-like particles. The abrupt set-up was derived with the idea of the atomistically resolved region as an open system embedded in a large reservoir at a given macroscopic state. In this work, starting from the idea of an open system, we derive thermodynamic relations for AdResS which justify conceptually and numerically the claim of AdResS as a technique for simulating open systems. In particular, we derive the relation between the chemical potential of the AdResS set-up and that of its reference fully atomistic simulation. The implication of this result is that the grand potential of AdResS can be explicitly written and thus, from a statistical mechanics point of view, the atomistically resolved region of AdResS can be identified with a well-defined open system.
Submitted 26 November, 2020;
originally announced November 2020.
-
HAWQV3: Dyadic Neural Network Quantization
Authors:
Zhewei Yao,
Zhen Dong,
Zhangcheng Zheng,
Amir Gholami,
Jiali Yu,
Eric Tan,
Leyuan Wang,
Qijing Huang,
Yida Wang,
Michael W. Mahoney,
Kurt Keutzer
Abstract:
Current low-precision quantization algorithms often have the hidden cost of conversion back and forth from floating point to quantized integer values. This hidden cost limits the latency improvement realized by quantizing Neural Networks. To address this, we present HAWQV3, a novel mixed-precision integer-only quantization framework. The contributions of HAWQV3 are the following: (i) An integer-only inference where the entire computational graph is performed only with integer multiplication, addition, and bit shifting, without any floating point operations or even integer division; (ii) A novel hardware-aware mixed-precision quantization method where the bit-precision is calculated by solving an integer linear programming problem that balances the trade-off between model perturbation and other constraints, e.g., memory footprint and latency; (iii) Direct hardware deployment and open source contribution for 4-bit uniform/mixed-precision quantization in TVM, achieving an average speed up of $1.45\times$ for uniform 4-bit, as compared to uniform 8-bit for ResNet50 on T4 GPUs; and (iv) extensive evaluation of the proposed methods on ResNet18/50 and InceptionV3, for various model compression levels with/without mixed precision. For ResNet50, our INT8 quantization achieves an accuracy of $77.58\%$, which is $2.68\%$ higher than prior integer-only work, and our mixed-precision INT4/8 quantization can reduce INT8 latency by $23\%$ and still achieve $76.73\%$ accuracy. Our framework and the TVM implementation have been open sourced.
Submitted 23 June, 2021; v1 submitted 20 November, 2020;
originally announced November 2020.
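The integer-only inference described in this abstract (integer multiplication, addition, and bit shifting, with no floating-point operations or integer division at run time) is commonly built on approximating a real rescaling factor by a dyadic number b / 2^c. The following is a minimal sketch of that general idea, not the HAWQV3 implementation; the names `to_dyadic` and `rescale`, the 16-bit shift, and the example scale are assumptions made for illustration.

```python
def to_dyadic(scale, shift_bits=16):
    """Approximate a real scale by multiplier / 2**shift_bits.

    Done once, offline; floating point is only used here, not at inference.
    Hypothetical helper for illustration.
    """
    multiplier = round(scale * (1 << shift_bits))
    return multiplier, shift_bits


def rescale(acc, multiplier, shift_bits):
    """Rescale an integer accumulator using only multiply, add, and shift."""
    rounding = 1 << (shift_bits - 1)          # adds 0.5 before the shift
    return (acc * multiplier + rounding) >> shift_bits


multiplier, shift_bits = to_dyadic(0.0123)    # made-up requantization scale
y_pos = rescale(1000, multiplier, shift_bits)   # real-valued result would be 12.3
y_neg = rescale(-1000, multiplier, shift_bits)  # real-valued result would be -12.3
```

Python's `>>` on negative integers is an arithmetic shift (floor division by a power of two), so the same rounding trick works for both signs in this sketch.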
-
The Impact of Damping in Second-Order Dynamical Systems with Applications to Power Grid Stability
Authors:
Amin Gholami,
X. Andy Sun
Abstract:
We consider a broad class of second-order dynamical systems and study the impact of damping as a system parameter on the stability, hyperbolicity, and bifurcation in such systems. We prove a monotonic effect of damping on the hyperbolicity of the equilibrium points of the corresponding first-order system. This provides a rigorous formulation and theoretical justification for the intuitive notion that damping increases stability. To establish this result, we prove a matrix perturbation result for complex symmetric matrices with positive semidefinite perturbations to their imaginary parts, which may be of independent interest. Furthermore, we establish necessary and sufficient conditions for the breakdown of hyperbolicity of the first-order system under damping variations in terms of observability of a pair of matrices relating damping, inertia, and Jacobian matrices, and propose sufficient conditions for Hopf bifurcation resulting from such hyperbolicity breakdown. The developed theory has significant applications in the stability of electric power systems, which are one of the most complex and important engineering systems. In particular, we characterize the impact of damping on the hyperbolicity of the swing equation model which is the fundamental dynamical model of power systems, and demonstrate Hopf bifurcations resulting from damping variations.
Submitted 19 July, 2021; v1 submitted 13 October, 2020;
originally announced October 2020.
-
Joint Mobility-Aware UAV Placement and Routing in Multi-Hop UAV Relaying Systems
Authors:
Anousheh Gholami,
Nariman Torkzaban,
John S. Baras,
Chrysa Papagianni
Abstract:
Unmanned Aerial Vehicles (UAVs) have been extensively utilized to provide wireless connectivity in rural and under-developed areas, enhance network capacity and provide support for peaks or unexpected surges in user demand, mainly due to their fast deployment, cost-efficiency and superior communication performance resulting from Line of Sight (LoS)-dominated wireless channels. In order to exploit the benefits of UAVs as base stations or relays in a mobile network, a major challenge is to determine the optimal UAV placement and relocation strategy with respect to the mobility and traffic patterns of the ground network nodes. Moreover, considering that the UAVs form a multi-hop aerial network, capacity and connectivity constraints have significant impacts on the end-to-end network performance. To this end, we formulate the joint UAV placement and routing problem as a Mixed Integer Linear Program (MILP) and propose an approximation that leads to an LP rounding algorithm and achieves a balance between time-complexity and optimality.
Submitted 30 September, 2020;
originally announced September 2020.
-
A Fast Certificate for Power System Small-Signal Stability
Authors:
Amin Gholami,
Xu Andy Sun
Abstract:
Swing equations are an integral part of a large class of power system dynamical models used in rotor angle stability assessment. Despite intensive studies, some fundamental properties of lossy swing equations are still not fully understood. In this paper, we develop a sufficient condition for certifying the stability of equilibrium points (EPs) of these equations, and illustrate the effects of damping, inertia, and network topology on the stability properties of such EPs. The proposed certificate is suitable for real-time monitoring and fast stability assessment, as it is purely algebraic and can be evaluated in a parallel manner. Moreover, we provide a novel approach to quantitatively measure the degree of stability in power grids using the proposed certificate. Extensive computational experiments are conducted, demonstrating the practicality and effectiveness of the proposal.
Submitted 5 August, 2020;
originally announced August 2020.
-
Complex-valued Imaging with Total Variation Regularization: An Application to Full-Waveform Inversion in Visco-acoustic Media
Authors:
Hossein S. Aghamiry,
Ali Gholami,
Stephane Operto
Abstract:
Full waveform inversion (FWI) is a nonlinear PDE constrained optimization problem, which seeks to estimate constitutive parameters of a medium such as phase velocity, density, and anisotropy, by fitting waveforms. Attenuation is an additional parameter that needs to be taken into account in viscous media to exploit the full potential of FWI. Attenuation is more easily implemented in the frequency domain by using complex-valued velocities in the time-harmonic wave equation. These complex velocities are frequency-dependent to guarantee causality and account for dispersion. Since estimating a complex frequency-dependent velocity at each grid point in space is not realistic, the optimization is generally performed in the real domain by processing the phase velocity (or slowness) at a reference frequency and attenuation (or quality factor) as separate real parameters. This real parametrization requires an a priori empirical relation (such as the nonlinear Kolsky-Futterman (KF) or standard linear solid (SLS) attenuation models) between the complex velocity and the two real quantities, which is prone to generate modeling errors if it does not represent accurately the attenuation behavior of the subsurface. Moreover, it leads to a multivariate inverse problem, which is twice larger than the actual size of the medium and ill-posed due to the cross-talk between the two classes of real parameters. To alleviate these issues, we present a mono-variate algorithm that solves directly the optimization problem in the complex domain by processing in sequence narrow bands of frequencies under the assumption of band-wise frequency dependence of the sought complex velocities.
Submitted 30 July, 2020;
originally announced July 2020.
-
Boundary thickness and robustness in learning models
Authors:
Yaoqing Yang,
Rajiv Khanna,
Yaodong Yu,
Amir Gholami,
Kurt Keutzer,
Joseph E. Gonzalez,
Kannan Ramchandran,
Michael W. Mahoney
Abstract:
Robustness of machine learning models to various adversarial and non-adversarial corruptions continues to be of interest. In this paper, we introduce the notion of the boundary thickness of a classifier, and we describe its connection with and usefulness for model robustness. Thick decision boundaries lead to improved performance, while thin decision boundaries lead to overfitting (e.g., measured by the robust generalization gap between training and testing) and lower robustness. We show that a thicker boundary helps improve robustness against adversarial examples (e.g., improving the robust test accuracy of adversarial training) as well as so-called out-of-distribution (OOD) transforms, and we show that many commonly-used regularization and data augmentation procedures can increase boundary thickness. On the theoretical side, we establish that maximizing boundary thickness during training is akin to the so-called mixup training. Using these observations, we show that noise-augmentation on mixup training further increases boundary thickness, thereby combating vulnerability to various forms of adversarial attacks and OOD transforms. We can also show that the performance improvement in several lines of recent work happens in conjunction with a thicker boundary.
Submitted 12 January, 2021; v1 submitted 9 July, 2020;
originally announced July 2020.
-
ADAHESSIAN: An Adaptive Second Order Optimizer for Machine Learning
Authors:
Zhewei Yao,
Amir Gholami,
Sheng Shen,
Mustafa Mustafa,
Kurt Keutzer,
Michael W. Mahoney
Abstract:
We introduce ADAHESSIAN, a second order stochastic optimization algorithm which dynamically incorporates the curvature of the loss function via ADAptive estimates of the HESSIAN. Second order algorithms are among the most powerful optimization algorithms with superior convergence properties as compared to first order methods such as SGD and Adam. The main disadvantage of traditional second order methods is their heavier per iteration computation and poor accuracy as compared to first order methods. To address these, we incorporate several novel approaches in ADAHESSIAN, including: (i) a fast Hutchinson based method to approximate the curvature matrix with low computational overhead; (ii) a root-mean-square exponential moving average to smooth out variations of the Hessian diagonal across different iterations; and (iii) a block diagonal averaging to reduce the variance of Hessian diagonal elements. We show that ADAHESSIAN achieves new state-of-the-art results by a large margin as compared to other adaptive optimization methods, including variants of Adam. In particular, we perform extensive tests on CV, NLP, and recommendation system tasks and find that ADAHESSIAN: (i) achieves 1.80%/1.45% higher accuracy on ResNets20/32 on Cifar10, and 5.55% higher accuracy on ImageNet as compared to Adam; (ii) outperforms AdamW for transformers by 0.13/0.33 BLEU score on IWSLT14/WMT14 and 2.7/1.0 PPL on PTB/Wikitext-103; (iii) outperforms AdamW for SqueezeBert by 0.41 points on GLUE; and (iv) achieves 0.032% better score than Adagrad for DLRM on the Criteo Ad Kaggle dataset. Importantly, we show that the cost per iteration of ADAHESSIAN is comparable to first order methods, and that it exhibits robustness towards its hyperparameters.
Submitted 28 April, 2021; v1 submitted 1 June, 2020;
originally announced June 2020.
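The "Hutchinson based method to approximate the curvature matrix" mentioned in this abstract rests on the identity diag(H) = E[z ⊙ (Hz)] for random vectors z with independent ±1 entries. The sketch below illustrates only that estimator and is not the ADAHESSIAN optimizer: it omits the moving average and block averaging, and it uses an explicit small matrix `H` as a stand-in for the Hessian-vector product that an autodiff framework would supply. All names and values are assumptions for illustration.

```python
import random


def hessian_vector_product(H, v):
    """Stand-in for an autodiff Hessian-vector product: plain matrix-vector multiply."""
    return [sum(h_ij * v_j for h_ij, v_j in zip(row, v)) for row in H]


def hutchinson_diagonal(hvp, dim, num_samples=2000, seed=0):
    """Estimate diag(H) as the average of z * (H z) over Rademacher vectors z."""
    rng = random.Random(seed)
    estimate = [0.0] * dim
    for _ in range(num_samples):
        z = [rng.choice((-1.0, 1.0)) for _ in range(dim)]
        hz = hvp(z)
        for i in range(dim):
            # z_i * (Hz)_i = H_ii + sum_{j != i} H_ij z_i z_j; the sum averages to 0.
            estimate[i] += z[i] * hz[i] / num_samples
    return estimate


# Made-up symmetric matrix playing the role of a Hessian.
H = [[4.0, 1.0, 0.0],
     [1.0, 3.0, 0.5],
     [0.0, 0.5, 2.0]]
diag_estimate = hutchinson_diagonal(lambda v: hessian_vector_product(H, v), dim=3)
```

The estimator needs only Hessian-vector products, never the full matrix, which is why its cost per sample is comparable to a gradient evaluation; the off-diagonal entries contribute zero-mean noise that shrinks as more samples are averaged.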
-
PowerNorm: Rethinking Batch Normalization in Transformers
Authors:
Sheng Shen,
Zhewei Yao,
Amir Gholami,
Michael W. Mahoney,
Kurt Keutzer
Abstract:
The standard normalization method for neural network (NN) models used in Natural Language Processing (NLP) is layer normalization (LN). This is different than batch normalization (BN), which is widely-adopted in Computer Vision. The preferred use of LN in NLP is principally due to the empirical observation that a (naive/vanilla) use of BN leads to significant performance degradation for NLP tasks; however, a thorough understanding of the underlying reasons for this is not always evident. In this paper, we perform a systematic study of NLP transformer models to understand why BN has a poor performance, as compared to LN. We find that the statistics of NLP data across the batch dimension exhibit large fluctuations throughout training. This results in instability, if BN is naively implemented. To address this, we propose Power Normalization (PN), a novel normalization scheme that resolves this issue by (i) relaxing zero-mean normalization in BN, (ii) incorporating a running quadratic mean instead of per batch statistics to stabilize fluctuations, and (iii) using an approximate backpropagation for incorporating the running statistics in the forward pass. We show theoretically, under mild assumptions, that PN leads to a smaller Lipschitz constant for the loss, compared with BN. Furthermore, we prove that the approximate backpropagation scheme leads to bounded gradients. We extensively test PN for transformers on a range of NLP tasks, and we show that it significantly outperforms both LN and BN. In particular, PN outperforms LN by 0.4/0.6 BLEU on IWSLT14/WMT14 and 5.6/3.0 PPL on PTB/WikiText-103. We make our code publicly available at \url{https://github.com/sIncerass/powernorm}.
Submitted 28 June, 2020; v1 submitted 17 March, 2020;
originally announced March 2020.
-
Joint Satellite Gateway Placement and Routing for Integrated Satellite-Terrestrial Networks
Authors:
Nariman Torkzaban,
Anousheh Gholami,
Chrysa Papagianni,
John S. Baras
Abstract:
With the increasing attention to the integrated satellite-terrestrial networks (ISTNs), the satellite gateway placement problem becomes of paramount importance. The resulting network performance may vary depending on the different design strategies. In this paper, a joint satellite gateway placement and routing strategy for the terrestrial network is proposed to minimize the overall cost of gateway deployment and traffic routing, while adhering to the average delay requirement for traffic demands. Although traffic routing and gateway placement can be solved independently, the dependence between the routing decisions for different demands makes it more realistic to solve an aggregated model instead. We develop a mixed-integer linear program (MILP) formulation for the problem. We relax the integrality constraints to achieve a linear program (LP) which reduces time-complexity at the expense of a sub-optimal solution. We further propose a variant of the proposed model to balance the load between the selected gateways.
Submitted 5 October, 2020; v1 submitted 7 February, 2020;
originally announced February 2020.
-
Full Waveform Inversion with Adaptive Regularization
Authors:
Hossein S. Aghamiry,
Ali Gholami,
Stéphane Operto
Abstract:
Regularization is necessary for solving nonlinear ill-posed inverse problems arising in different fields of geosciences. The base of a suitable regularization is the prior expressed by the regularizer, which can be non-adaptive or adaptive (data-driven). In this paper, we propose general black-box regularization algorithms for solving nonlinear inverse problems such as full-waveform inversion (FWI), which admit empirical priors that are determined adaptively by sophisticated denoising algorithms. The nonlinear inverse problem is solved by a proximal Newton method, which generalizes the traditional Newton step so as to involve the gradients/subgradients of a (possibly non-differentiable) regularization function through operator splitting and proximal mappings. Furthermore, it requires accounting for the Hessian matrix in the regularized least-squares optimization problem. We propose two different splitting algorithms for this task. In the first, we compute the Newton search direction with an iterative method based upon the first-order generalized iterative shrinkage-thresholding algorithm (ISTA), and hence Newton-ISTA (NISTA). The iterations require only Hessian-vector products to compute the gradient step of the quadratic approximation of the nonlinear objective function. The second relies on the alternating direction method of multipliers (ADMM), and hence Newton-ADMM (NADMM), where the least-squares optimization subproblem and the regularization subproblem in the composite are decoupled through an auxiliary variable and solved in an alternating mode. We compare NISTA and NADMM numerically by solving full-waveform inversion with BM3D regularizations. The tests show promising results obtained by both algorithms. However, NADMM shows a faster convergence rate than NISTA when using L-BFGS to solve the Newton system.
Submitted 27 January, 2020;
originally announced January 2020.
-
A Bayesian Monte-Carlo Uncertainty Model for Assessment of Shear Stress Entropy
Authors:
Amin Kazemian-Kale-Kale,
Azadeh Gholami,
Mohammad Rezaie-Balf,
Amir Mosavi,
Ahmed A Sattar,
Bahram Gharabaghi,
Hossein Bonakdari
Abstract:
Entropy models have recently been adopted in many studies to evaluate the distribution of the shear stress in circular channels. However, the uncertainty in their predictions and their reliability remain an open question. We present a novel method to evaluate the uncertainty of four popular entropy models, including Shannon, Shannon-Power Law (PL), Tsallis, and Renyi, in shear stress estimation in circular channels. The Bayesian Monte-Carlo (BMC) uncertainty method is simplified considering a 95% Confidence Bound (CB). We developed a new statistical index called FREEopt-based OCB (FOCB) using the statistical indices Forecasting Range of Error Estimation (FREE) and the percentage of observed data in the CB (Nin), which integrates their combined effect. The Shannon and Shannon PL entropies, with close FOCB values of 8.781 and 9.808, respectively, had the highest certainty in the calculation of shear stress values in circular channels, followed by the traditional uniform flow shear stress and Tsallis models, with close values of 14.491 and 14.895, respectively. However, the Renyi entropy, with a much higher FOCB value of 57.726, has less certainty in the estimation of shear stress than the other models. Using the results presented in this study, the amount of confidence in entropy methods in the calculation of shear stress to design and implement different types of open channels and their stability is determined.
Submitted 10 January, 2020;
originally announced January 2020.
-
ZeroQ: A Novel Zero Shot Quantization Framework
Authors:
Yaohui Cai,
Zhewei Yao,
Zhen Dong,
Amir Gholami,
Michael W. Mahoney,
Kurt Keutzer
Abstract:
Quantization is a promising approach for reducing the inference time and memory footprint of neural networks. However, most existing quantization methods require access to the original training dataset for retraining during quantization. This is often not possible for applications with sensitive or proprietary data, e.g., due to privacy and security concerns. Existing zero-shot quantization methods use different heuristics to address this, but they result in poor performance, especially when quantizing to ultra-low precision. Here, we propose ZeroQ, a novel zero-shot quantization framework to address this. ZeroQ enables mixed-precision quantization without any access to the training or validation data. This is achieved by optimizing for a Distilled Dataset, which is engineered to match the statistics of batch normalization across different layers of the network. ZeroQ supports both uniform and mixed-precision quantization. For the latter, we introduce a novel Pareto frontier based method to automatically determine the mixed-precision bit setting for all layers, with no manual search involved. We extensively test our proposed method on a diverse set of models, including ResNet18/50/152, MobileNetV2, ShuffleNet, SqueezeNext, and InceptionV3 on ImageNet, as well as RetinaNet-ResNet50 on the Microsoft COCO dataset. In particular, we show that ZeroQ can achieve 1.71% higher accuracy on MobileNetV2, as compared to the recently proposed DFQ method. Importantly, ZeroQ has a very low computational overhead, and it can finish the entire quantization process in less than 30s (0.5% of one epoch training time of ResNet50 on ImageNet). We have open-sourced the ZeroQ framework at https://github.com/amirgholami/ZeroQ.
Submitted 1 January, 2020;
originally announced January 2020.
-
PyHessian: Neural Networks Through the Lens of the Hessian
Authors:
Zhewei Yao,
Amir Gholami,
Kurt Keutzer,
Michael Mahoney
Abstract:
We present PYHESSIAN, a new scalable framework that enables fast computation of Hessian (i.e., second-order derivative) information for deep neural networks. PYHESSIAN enables fast computations of the top Hessian eigenvalues, the Hessian trace, and the full Hessian eigenvalue/spectral density, and it supports distributed-memory execution on cloud/supercomputer systems and is available as open source. This general framework can be used to analyze neural network models, including the topology of the loss landscape (i.e., curvature information) to gain insight into the behavior of different models/optimizers. To illustrate this, we analyze the effect of residual connections and Batch Normalization layers on the trainability of neural networks. One recent claim, based on simpler first-order analysis, is that residual connections and Batch Normalization make the loss landscape smoother, thus making it easier for Stochastic Gradient Descent to converge to a good solution. Our extensive analysis shows new finer-scale insights, demonstrating that, while conventional wisdom is sometimes validated, in other cases it is simply incorrect. In particular, we find that Batch Normalization does not necessarily make the loss landscape smoother, especially for shallower networks.
Submitted 5 March, 2020; v1 submitted 15 December, 2019;
originally announced December 2019.
-
HAWQ-V2: Hessian Aware trace-Weighted Quantization of Neural Networks
Authors:
Zhen Dong,
Zhewei Yao,
Yaohui Cai,
Daiyaan Arfeen,
Amir Gholami,
Michael W. Mahoney,
Kurt Keutzer
Abstract:
Quantization is an effective method for reducing memory footprint and inference time of Neural Networks, e.g., for efficient inference in the cloud, especially at the edge. However, ultra low precision quantization could lead to significant degradation in model generalization. A promising method to address this is to perform mixed-precision quantization, where more sensitive layers are kept at higher precision. However, the search space for a mixed-precision quantization is exponential in the number of layers. Recent work has proposed HAWQ, a novel Hessian based framework, with the aim of reducing this exponential search space by using second-order information. While promising, this prior work has three major limitations: (i) HAWQV1 only uses the top Hessian eigenvalue as a measure of sensitivity and does not consider the rest of the Hessian spectrum; (ii) the HAWQV1 approach only provides relative sensitivity of different layers and therefore requires a manual selection of the mixed-precision setting; and (iii) HAWQV1 does not consider mixed-precision activation quantization. Here, we present HAWQV2 which addresses these shortcomings. For (i), we perform a theoretical analysis showing that a better sensitivity metric is to compute the average of all of the Hessian eigenvalues. For (ii), we develop a Pareto frontier based method for selecting the exact bit precision of different layers without any manual selection. For (iii), we extend the Hessian analysis to mixed-precision activation quantization. We have found this to be very beneficial for object detection. We show that HAWQV2 achieves new state-of-the-art results for a wide range of tasks.
Submitted 9 November, 2019;
originally announced November 2019.
-
Checkmate: Breaking the Memory Wall with Optimal Tensor Rematerialization
Authors:
Paras Jain,
Ajay Jain,
Aniruddha Nrusimha,
Amir Gholami,
Pieter Abbeel,
Kurt Keutzer,
Ion Stoica,
Joseph E. Gonzalez
Abstract:
We formalize the problem of trading-off DNN training time and memory requirements as the tensor rematerialization optimization problem, a generalization of prior checkpointing strategies. We introduce Checkmate, a system that solves for optimal rematerialization schedules in reasonable times (under an hour) using off-the-shelf MILP solvers or near-optimal schedules with an approximation algorithm, then uses these schedules to accelerate millions of training iterations. Our method scales to complex, realistic architectures and is hardware-aware through the use of accelerator-specific, profile-based cost models. In addition to reducing training cost, Checkmate enables real-world networks to be trained with up to 5.1x larger input sizes. Checkmate is an open-source project, available at https://github.com/parasj/checkmate.
Submitted 14 May, 2020; v1 submitted 7 October, 2019;
originally announced October 2019.
-
Q-BERT: Hessian Based Ultra Low Precision Quantization of BERT
Authors:
Sheng Shen,
Zhen Dong,
Jiayu Ye,
Linjian Ma,
Zhewei Yao,
Amir Gholami,
Michael W. Mahoney,
Kurt Keutzer
Abstract:
Transformer based architectures have become de-facto models used for a range of Natural Language Processing tasks. In particular, the BERT based models achieved significant accuracy gain for GLUE tasks, CoNLL-03 and SQuAD. However, BERT based models have a prohibitive memory footprint and latency. As a result, deploying BERT based models in resource constrained environments has become a challenging task. In this work, we perform an extensive analysis of fine-tuned BERT models using second order Hessian information, and we use our results to propose a novel method for quantizing BERT models to ultra low precision. In particular, we propose a new group-wise quantization scheme, and we use a Hessian based mixed-precision method to compress the model further. We extensively test our proposed method on BERT downstream tasks of SST-2, MNLI, CoNLL-03, and SQuAD. We can achieve comparable performance to baseline with at most $2.3\%$ performance degradation, even with ultra-low precision quantization down to 2 bits, corresponding to up to $13\times$ compression of the model parameters, and up to $4\times$ compression of the embedding table as well as activations. Among all tasks, we observed the highest performance loss for BERT fine-tuned on SQuAD. By probing into the Hessian based analysis as well as visualization, we show that this is related to the fact that the current training/fine-tuning strategy of BERT does not converge for SQuAD.
Submitted 24 September, 2019; v1 submitted 12 September, 2019;
originally announced September 2019.
-
Attenuation imaging by wavefield reconstruction inversion with bound constraints and total variation regularization
Authors:
Hossein S. Aghamiry,
Ali Gholami,
Stéphane Operto
Abstract:
Wavefield reconstruction inversion (WRI) extends the search space of Full Waveform Inversion (FWI) by allowing for wave equation errors during wavefield reconstruction to match the data from the first iteration. Then, the wavespeeds are updated from the wavefields by minimizing the source residuals. Performing these two tasks in alternating mode breaks down the nonlinear FWI as a sequence of two linear subproblems, relying on the bilinearity of the wave equation. We solve this biconvex optimization with the alternating-direction method of multipliers (ADMM) to cancel out efficiently the data and source residuals in iterations and stabilize the parameter estimation with appropriate regularizations. Here, we extend WRI to viscoacoustic media for attenuation imaging. Attenuation reconstruction is challenging because of the small imprint of attenuation in the data and the cross-talks with velocities. To address these issues, we recast the multivariate viscoacoustic WRI as a triconvex optimization and update wavefields, squared slowness, and attenuation factor in alternating mode at each WRI iteration. This requires linearizing the attenuation-estimation subproblem via an approximated trilinear viscoacoustic wave equation. The iterative defect correction embedded in ADMM corrects the errors generated by this linearization, while the operator splitting allows us to tailor $\ell_1$ regularization to each parameter class. A toy numerical example shows that these strategies mitigate cross-talk artifacts and noise from the attenuation reconstruction. A more realistic synthetic example representative of the North Sea validates the method.
Submitted 6 January, 2020; v1 submitted 11 September, 2019;
originally announced September 2019.
-
Drone-Assisted Communications for Remote Areas and Disaster Relief
Authors:
Anousheh Gholami,
Usman A. Fiaz,
John S. Baras
Abstract:
We explore an end-to-end (including access and backhaul links) UAV-assisted wireless communication system, considering both uplink and downlink traffic, with the goal of supporting the demand of the Ground Users (GUs) using the minimum number of UAVs. Moreover, in order to extend the operational (flight) time of UAVs, we exploit an energy-aware routing scheme. Our intention is to design and analyze the access and backhaul connectivity of a drone-assisted communication network for remote and crowded areas and disaster relief, while minimizing the resources required, i.e., the number of UAVs.
Submitted 4 September, 2019;
originally announced September 2019.
-
Robust Wavefield Inversion via Phase Retrieval
Authors:
Hossein S. Aghamiry,
Ali Gholami,
Stéphane Operto
Abstract:
An extended formulation of Full Waveform Inversion (FWI), called Wavefield Reconstruction Inversion (WRI), offers potential benefits of decreasing the nonlinearity of the inverse problem by replacing the explicit inverse of the ill-conditioned wave-equation operator of classical FWI (the oscillating Green functions) with a suitably defined data-driven regularized inverse. This regularization relaxes the wave-equation constraint to reconstruct wavefields that match the data, hence mitigating the risk of cycle skipping. The subsurface model parameters are then updated in a direction that reduces these constraint violations. However, in the case of a rough initial model, the phase errors in the reconstructed wavefields may trap the waveform inversion in a local minimum leading to inaccurate subsurface models. In this paper, in order to avoid matching such incorrect phase information during the early WRI iterations, we design a new cost function based upon phase retrieval, namely a process which seeks to reconstruct a signal from the amplitude of linear measurements. This new formulation, called Wavefield Inversion with Phase Retrieval (WIPR), further improves the robustness of the parameter estimation subproblem by a suitable phase correction. We implement the resulting WIPR problem with an alternating-direction approach, which combines the Majorization-Minimization (MM) algorithm to linearise the phase-retrieval term and a variable splitting technique based upon the alternating direction method of multipliers (ADMM). This new workflow equipped with Tikhonov-total variation (TT) regularization, which is the combination of second-order Tikhonov and total variation regularizations and bound constraints, successfully reconstructs the 2004 BP salt model from a sparse fixed-spread acquisition using a 3 Hz starting frequency and a homogeneous initial velocity model.
Submitted 24 November, 2019; v1 submitted 25 July, 2019;
originally announced July 2019.
-
ANODEV2: A Coupled Neural ODE Evolution Framework
Authors:
Tianjun Zhang,
Zhewei Yao,
Amir Gholami,
Kurt Keutzer,
Joseph Gonzalez,
George Biros,
Michael Mahoney
Abstract:
It has been observed that residual networks can be viewed as the explicit Euler discretization of an Ordinary Differential Equation (ODE). This observation motivated the introduction of so-called Neural ODEs, which allow more general discretization schemes with adaptive time stepping. Here, we propose ANODEV2, which is an extension of this approach that also allows evolution of the neural network parameters, in a coupled ODE-based formulation. The Neural ODE method introduced earlier is in fact a special case of this new more general framework. We present the formulation of ANODEV2, derive optimality conditions, and implement a coupled reaction-diffusion-advection version of this framework in PyTorch. We present empirical results using several different configurations of ANODEV2, testing them on multiple models on CIFAR-10. We report results showing that this coupled ODE-based framework is indeed trainable, and that it achieves higher accuracy, as compared to the baseline models as well as the recently-proposed Neural ODE approach.
Submitted 9 June, 2019;
originally announced June 2019.
-
ADMM-based multi-parameter wavefield reconstruction inversion in VTI acoustic media with TV regularization
Authors:
Hossein S. Aghamiry,
Ali Gholami,
Stéphane Operto
Abstract:
Full waveform inversion (FWI) is a nonlinear waveform matching procedure, which suffers from cycle skipping when the initial model is not kinematically accurate enough. To mitigate cycle skipping, wavefield reconstruction inversion (WRI) extends the inversion search space by computing wavefields with a relaxation of the wave equation in order to fit the data from the first iteration. Then, the subsurface parameters are updated by minimizing the source residuals the relaxation generated. Capitalizing on the wave-equation bilinearity, performing wavefield reconstruction and parameter estimation in alternating mode decomposes WRI into two linear subproblems, which can be solved efficiently with the alternating-direction method of multipliers (ADMM), leading to the so-called IR-WRI. Moreover, ADMM provides a suitable framework to implement bound constraints and different types of regularizations and their mixture in IR-WRI. Here, IR-WRI is extended to multiparameter reconstruction for VTI acoustic media. To achieve this goal, we first propose different forms of bilinear VTI acoustic wave equation. We develop more specifically IR-WRI for the one that relies on a parametrisation involving vertical wavespeed and Thomsen's parameters delta and epsilon. With a toy numerical example, we first show that the radiation patterns of the virtual sources generate similar wavenumber filtering and parameter cross-talks in classical FWI and IR-WRI. Bound constraints and TV regularization in IR-WRI fully remove these undesired effects for an idealized piecewise constant target. We show with a more realistic long-offset case study representative of the North Sea that anisotropic IR-WRI successfully reconstructs the vertical wavespeed starting from a laterally homogeneous model and updates the long wavelengths of the starting epsilon model, while a smooth delta model is used as a passive background model.
Submitted 2 July, 2019; v1 submitted 14 May, 2019;
originally announced May 2019.
-
HAWQ: Hessian AWare Quantization of Neural Networks with Mixed-Precision
Authors:
Zhen Dong,
Zhewei Yao,
Amir Gholami,
Michael Mahoney,
Kurt Keutzer
Abstract:
Model size and inference speed/power have become a major challenge in the deployment of Neural Networks for many applications. A promising approach to address these problems is quantization. However, uniformly quantizing a model to ultra low precision leads to significant accuracy degradation. A novel solution for this is to use mixed-precision quantization, as some parts of the network may allow lower precision as compared to other layers. However, there is no systematic way to determine the precision of different layers. A brute force approach is not feasible for deep networks, as the search space for mixed-precision is exponential in the number of layers. Another challenge is a similar factorial complexity for determining block-wise fine-tuning order when quantizing the model to a target precision. Here, we introduce Hessian AWare Quantization (HAWQ), a novel second-order quantization method to address these problems. HAWQ allows for the automatic selection of the relative quantization precision of each layer, based on the layer's Hessian spectrum. Moreover, HAWQ provides a deterministic fine-tuning order for quantizing layers, based on second-order information. We show the results of our method on CIFAR-10 using ResNet20, and on ImageNet using Inception-V3, ResNet50 and SqueezeNext models. Comparing HAWQ with state-of-the-art shows that we can achieve similar/better accuracy with $8\times$ activation compression ratio on ResNet20, as compared to DNAS, and up to $1\%$ higher accuracy with up to $14\%$ smaller models on ResNet50 and Inception-V3, compared to recently proposed methods of RVQuant and HAQ. Furthermore, we show that we can quantize SqueezeNext to just 1MB model size while achieving above $68\%$ top-1 accuracy on ImageNet.
Submitted 29 April, 2019;
originally announced May 2019.
-
Private Shotgun DNA Sequencing: A Structured Approach
Authors:
Ali Gholami,
Mohammad Ali Maddah-Ali,
Seyed Abolfazl Motahari
Abstract:
DNA sequencing has faced a huge demand since it was first introduced as a service to the public. This service is often offloaded to the sequencing companies who will have access to full knowledge of individuals' sequences, a major violation of privacy. To address this challenge, we propose a solution, which is based on separating the process of reading the fragments of sequences, which is done at a sequencing machine, and assembling the reads, which is done at a trusted local data collector. To confuse the sequencer, in a pooled sequencing scenario, in which multiple sequences are going to be sequenced simultaneously, for each target individual, we add fragments of one non-target individual, with a known DNA sequence at the data collector. Then the coverage depths of the individuals, defined as the number of DNA fragments per DNA site, are selected proportional to powers of two. This layered structured solution allows us to ensure privacy, using only one sequencing machine, in contrast to our previous solution, where we relied on the existence of multiple non-colluding sequencing machines.
Submitted 2 April, 2019; v1 submitted 28 March, 2019;
originally announced April 2019.
-
Inefficiency of K-FAC for Large Batch Size Training
Authors:
Linjian Ma,
Gabe Montague,
Jiayu Ye,
Zhewei Yao,
Amir Gholami,
Kurt Keutzer,
Michael W. Mahoney
Abstract:
In stochastic optimization, using large batch sizes during training can leverage parallel resources to produce faster wall-clock training times per training epoch. However, for both training loss and testing error, recent results analyzing large batch Stochastic Gradient Descent (SGD) have found sharp diminishing returns, beyond a certain critical batch size. In the hopes of addressing this, it has been suggested that the Kronecker-Factored Approximate Curvature (K-FAC) method allows for greater scalability to large batch sizes, for non-convex machine learning problems such as neural network optimization, as well as greater robustness to variation in model hyperparameters. Here, we perform a detailed empirical analysis of large batch size training for both K-FAC and SGD, evaluating performance in terms of both wall-clock time and aggregate computational cost. Our main results are twofold: first, we find that neither K-FAC nor SGD has ideal scalability behavior beyond a certain batch size, and that K-FAC does not exhibit improved large-batch scalability behavior, as compared to SGD; and second, we find that K-FAC, in addition to requiring more hyperparameters to tune, suffers from similar hyperparameter sensitivity behavior as does SGD. We discuss extensive results using ResNet and AlexNet on CIFAR-10 and SVHN, respectively, as well as more general implications of our findings.
Submitted 31 July, 2019; v1 submitted 14 March, 2019;
originally announced March 2019.
-
Compound Regularization of Full-waveform Inversion for Imaging Piecewise Media
Authors:
Hossein S. Aghamiry,
Ali Gholami,
Stéphane Operto
Abstract:
The nonlinear and ill-posed nature of full waveform inversion (FWI) requires us to use sophisticated regularization techniques to solve it. In most applications, the model parameters may be described by physical properties (e.g., wave speeds, density, attenuation, anisotropic parameters) which are piecewise functions of space. Compound regularizations are thus necessary to properly reconstruct such parameters by FWI. We consider different implementations of compound regularizations in the wavefield reconstruction inversion (WRI) method, a formulation of FWI that extends its search space and prevents the so-called cycle skipping pathology. Our hybrid regularizations rely on Tikhonov and total variation (TV) functionals, from which we build two classes of hybrid regularizers: the first class is simply obtained by a convex combination (CC) of the two functionals, while the second relies on their infimal convolution (IC). In the former class, the model of parameters is required to simultaneously satisfy different priors, while in the latter the model is broken into its basic components, each satisfying a distinct prior (e.g. smooth, piecewise constant, piecewise linear). We implement these types of compound regularizations in the WRI optimization problem using the alternating direction method of multipliers (ADMM). Then, we assess our regularized WRI in the framework of seismic imaging applications. Using a wide range of subsurface models, we conclude that the compound regularizer based on IC leads to the lowest error in the parameter reconstruction compared to that obtained with the CC counterpart and the Tikhonov and TV regularizers when used independently.
Submitted 11 March, 2019;
originally announced March 2019.
-
ANODE: Unconditionally Accurate Memory-Efficient Gradients for Neural ODEs
Authors:
Amir Gholami,
Kurt Keutzer,
George Biros
Abstract:
Residual neural networks can be viewed as the forward Euler discretization of an Ordinary Differential Equation (ODE) with a unit time step. This has recently motivated researchers to explore other discretization approaches and train ODE based networks. However, an important challenge of neural ODEs is their prohibitive memory cost during gradient backpropagation. Recently, a method proposed in [8] claimed that this memory overhead can be reduced from O(LN_t), where N_t is the number of time steps, down to O(L) by solving the forward ODE backwards in time, where L is the depth of the network. However, we will show that this approach may lead to several problems: (i) it may be numerically unstable for ReLU/non-ReLU activations and general convolution operators, and (ii) the proposed optimize-then-discretize approach may lead to divergent training due to inconsistent gradients for small time step sizes. We discuss the underlying problems, and to address them we propose ANODE, an Adjoint based Neural ODE framework which avoids the numerical instability related problems noted above, and provides unconditionally accurate gradients. ANODE has a memory footprint of O(L) + O(N_t), with the same computational cost as reversing the ODE solve. Furthermore, we discuss a memory efficient algorithm which can further reduce this footprint with a trade-off of additional computational cost. We show results on Cifar-10/100 datasets using ResNet and SqueezeNext neural networks.
Submitted 1 July, 2019; v1 submitted 26 February, 2019;
originally announced February 2019.
-
Implementing bound constraints and total-variation regularization in extended full waveform inversion with the alternating direction method of multiplier: application to large contrast media
Authors:
Hossein S. Aghamiry,
Ali Gholami,
Stéphane Operto
Abstract:
Full waveform inversion (FWI) is a waveform matching procedure, which can provide a subsurface model with a wavelength-scale resolution. However, this high resolution makes FWI prone to cycle skipping, which drives the inversion to a local minimum when the initial model is not accurate enough. Other sources of nonlinearities and ill-posedness are noise, uneven illumination, approximate wave physics and parameter cross-talks. All these sources of error require robust and versatile regularized optimization approaches to mitigate their imprint on FWI while preserving its intrinsic resolution power. To achieve this goal, we implement bound constraints and total variation (TV) regularization in the so-called frequency-domain wavefield-reconstruction inversion (WRI) with the alternating direction method of multipliers (ADMM). In the ADMM framework, WRI relies on an augmented Lagrangian function, a combination of penalty and Lagrangian functions, to extend the FWI search space by relaxing the wave-equation constraint during early iterations. Moreover, ADMM breaks down the joint wavefield reconstruction plus parameter estimation problem into a sequence of two linear subproblems, whose solutions are coordinated to provide the solution of the global problem. The decomposability of ADMM is further exploited to interface in a straightforward way bound constraints and TV regularization with WRI via variable splitting and proximal operators. The resilience of our regularized WRI formulation to cycle skipping and noise as well as its resolution power are illustrated with two targets of the large-contrast BP salt model. Starting from a 3 Hz frequency and a crude initial model, the extended search space allows for the reconstruction of the salt and subsalt structures with a high fidelity.
Submitted 7 February, 2019;
originally announced February 2019.
-
Spontaneous center formation in Dictyostelium discoideum
Authors:
Estefania Vidal-Henriquez,
Azam Gholami
Abstract:
Dictyostelium discoideum (D.d.) is a widely studied amoeba due to its capabilities of development, survival, and self-organization. During aggregation it produces and relays a chemical signal (cAMP) which shows spirals and target centers. Nevertheless, the natural emergence of these structures is still not well understood. We present a mechanism for creation of centers and target waves of cAMP in D.d. by adding cell inhomogeneity to a well known reaction-diffusion model of cAMP waves and we characterize its properties. We show how stable activity centers appear spontaneously in areas of higher cell density with the oscillation frequency of these centers depending on their density. The cAMP waves have the characteristic dispersion relation of trigger waves and a velocity which increases with cell density. Chemotactically competent cells react to these waves and create aggregation streams even with very simple movement rules. Finally we argue in favor of the existence of bounded phosphodiesterase to maintain the wave properties once small cell clusters appear.
Submitted 19 December, 2018;
originally announced December 2018.
-
Trust Region Based Adversarial Attack on Neural Networks
Authors:
Zhewei Yao,
Amir Gholami,
Peng Xu,
Kurt Keutzer,
Michael Mahoney
Abstract:
Deep Neural Networks are quite vulnerable to adversarial perturbations. Current state-of-the-art adversarial attack methods typically require very time consuming hyper-parameter tuning, or require many iterations to solve an optimization based adversarial attack. To address this problem, we present a new family of trust region based adversarial attacks, with the goal of computing adversarial perturbations efficiently. We propose several attacks based on variants of the trust region optimization method. We test the proposed methods on Cifar-10 and ImageNet datasets using several different models including AlexNet, ResNet-50, VGG-16, and DenseNet-121 models. Our methods achieve comparable results with the Carlini-Wagner (CW) attack, but with significant speed up of up to $37\times$, for the VGG-16 model on a Titan Xp GPU. For the case of ResNet-50 on ImageNet, we can bring down its classification accuracy to less than 0.1\% with at most $1.5\%$ relative $L_\infty$ (or $L_2$) perturbation requiring only $1.02$ seconds as compared to $27.04$ seconds for the CW attack. We have open sourced our method which can be accessed at [1].
Submitted 15 December, 2018;
originally announced December 2018.
-
Parameter Re-Initialization through Cyclical Batch Size Schedules
Authors:
Norman Mu,
Zhewei Yao,
Amir Gholami,
Kurt Keutzer,
Michael Mahoney
Abstract:
Optimal parameter initialization remains a crucial problem for neural network training. A poor weight initialization may take longer to train and/or converge to sub-optimal solutions. Here, we propose a method of weight re-initialization by repeated annealing and injection of noise in the training process. We implement this through a cyclical batch size schedule motivated by a Bayesian perspective of neural network training. We evaluate our methods through extensive experiments on tasks in language modeling, natural language inference, and image classification. We demonstrate the ability of our method to improve language modeling performance by up to 7.91 perplexity and reduce training iterations by up to $61\%$, in addition to its flexibility in enabling snapshot ensembling and use with adversarial training.
Submitted 3 December, 2018;
originally announced December 2018.
-
On the Computational Inefficiency of Large Batch Sizes for Stochastic Gradient Descent
Authors:
Noah Golmant,
Nikita Vemuri,
Zhewei Yao,
Vladimir Feinberg,
Amir Gholami,
Kai Rothauge,
Michael W. Mahoney,
Joseph Gonzalez
Abstract:
Increasing the mini-batch size for stochastic gradient descent offers significant opportunities to reduce wall-clock training time, but there are a variety of theoretical and systems challenges that impede the widespread success of this technique. We investigate these issues, with an emphasis on time to convergence and total computational cost, through an extensive empirical analysis of network training across several architectures and problem domains, including image classification, image segmentation, and language modeling. Although it is common practice to increase the batch size in order to fully exploit available computational resources, we find a substantially more nuanced picture. Our main finding is that across a wide range of network architectures and problem domains, increasing the batch size beyond a certain point yields no decrease in wall-clock time to convergence for either train or test loss. This batch size is usually substantially below the capacity of current systems. We show that popular training strategies for large batch size optimization begin to fail before we can populate all available compute resources, and we show that the point at which these methods break down depends more on attributes like model architecture and data complexity than it does directly on the size of the dataset.
Submitted 30 November, 2018;
originally announced November 2018.
-
Private Shotgun DNA Sequencing
Authors:
Ali Gholami,
Mohammad Ali Maddah-Ali,
Seyed Abolfazl Motahari
Abstract:
Current techniques in sequencing a genome allow a service provider (e.g. a sequencing company) to have full access to the genome information, and thus the privacy of individuals regarding their lifetime secret is violated. In this paper, we introduce the problem of private DNA sequencing, where the goal is to keep the DNA sequence private to the sequencer. We propose an architecture where the task of reading fragments of DNA and the task of DNA assembly are separated: the former is done at the sequencer(s), and the latter is completed at a local trusted data collector. To satisfy the privacy constraint at the sequencer and the reconstruction condition at the data collector, we create an information gap between these two, relying on two techniques: (i) we use more than one non-colluding sequencer, all reporting the read fragments to the single data collector, and (ii) we add the fragments of some known DNA molecules, which are still unknown to the sequencers, to the pool. We prove that these two techniques provide enough freedom to satisfy both conditions at the same time.
Submitted 23 November, 2018;
originally announced November 2018.
-
Identifying the Best Machine Learning Algorithms for Brain Tumor Segmentation, Progression Assessment, and Overall Survival Prediction in the BRATS Challenge
Authors:
Spyridon Bakas,
Mauricio Reyes,
Andras Jakab,
Stefan Bauer,
Markus Rempfler,
Alessandro Crimi,
Russell Takeshi Shinohara,
Christoph Berger,
Sung Min Ha,
Martin Rozycki,
Marcel Prastawa,
Esther Alberts,
Jana Lipkova,
John Freymann,
Justin Kirby,
Michel Bilello,
Hassan Fathallah-Shaykh,
Roland Wiest,
Jan Kirschke,
Benedikt Wiestler,
Rivka Colen,
Aikaterini Kotrotsou,
Pamela Lamontagne,
Daniel Marcus,
Mikhail Milchenko
, et al. (402 additional authors not shown)
Abstract:
Gliomas are the most common primary brain malignancies, with different degrees of aggressiveness, variable prognosis and various heterogeneous histologic sub-regions, i.e., peritumoral edematous/invaded tissue, necrotic core, active and non-enhancing core. This intrinsic heterogeneity is also portrayed in their radio-phenotype, as their sub-regions are depicted by varying intensity profiles disseminated across multi-parametric magnetic resonance imaging (mpMRI) scans, reflecting varying biological properties. Their heterogeneous shape, extent, and location are some of the factors that make these tumors difficult to resect, and in some cases inoperable. The amount of resected tumor is a factor also considered in longitudinal scans, when evaluating the apparent tumor for potential diagnosis of progression. Furthermore, there is mounting evidence that accurate segmentation of the various tumor sub-regions can offer the basis for quantitative image analysis towards prediction of patient overall survival. This study assesses the state-of-the-art machine learning (ML) methods used for brain tumor image analysis in mpMRI scans, during the last seven instances of the International Brain Tumor Segmentation (BraTS) challenge, i.e., 2012-2018. Specifically, we focus on i) evaluating segmentations of the various glioma sub-regions in pre-operative mpMRI scans, ii) assessing potential tumor progression by virtue of longitudinal growth of tumor sub-regions, beyond use of the RECIST/RANO criteria, and iii) predicting the overall survival from pre-operative mpMRI scans of patients that underwent gross total resection. Finally, we investigate the challenge of identifying the best ML algorithms for each of these tasks, considering that apart from being diverse on each instance of the challenge, the multi-institutional mpMRI BraTS dataset has also been a continuously evolving/growing dataset.
Submitted 23 April, 2019; v1 submitted 5 November, 2018;
originally announced November 2018.
-
A Novel Domain Adaptation Framework for Medical Image Segmentation
Authors:
Amir Gholami,
Shashank Subramanian,
Varun Shenoy,
Naveen Himthani,
Xiangyu Yue,
Sicheng Zhao,
Peter Jin,
George Biros,
Kurt Keutzer
Abstract:
We propose a segmentation framework that uses deep neural networks and introduce two innovations. First, we describe a biophysics-based domain adaptation method. Second, we propose an automatic method to segment white and gray matter, and cerebrospinal fluid, in addition to tumorous tissue. Regarding our first innovation, we use a domain adaptation framework that combines a novel multispecies biophysical tumor growth model with a generative adversarial model to create realistic looking synthetic multimodal MR images with known segmentation. Regarding our second innovation, we propose an automatic approach to enrich available segmentation data by computing the segmentation for healthy tissues. This segmentation, which is done using diffeomorphic image registration between the BraTS training data and a set of prelabeled atlases, provides more information for training and reduces the class imbalance problem. Our overall approach is not specific to any particular neural network and can be used in conjunction with existing solutions. We demonstrate the performance improvement using a 2D U-Net for the BraTS'18 segmentation challenge. Our biophysics based domain adaptation achieves better results, as compared to the existing state-of-the-art GAN model used to create synthetic data for training.
Submitted 11 October, 2018;
originally announced October 2018.
-
Simulation of glioblastoma growth using a 3D multispecies tumor model with mass effect
Authors:
Shashank Subramanian,
Amir Gholami,
George Biros
Abstract:
In this article, we present a multispecies reaction-advection-diffusion partial differential equation (PDE) coupled with linear elasticity for modeling tumor growth. The model aims to capture the phenomenological features of glioblastoma multiforme observed in magnetic resonance imaging (MRI) scans. These include enhancing and necrotic tumor structures, brain edema and the so called "mass effect", that is, the deformation of brain tissue due to the presence of the tumor. The multispecies model accounts for proliferating, invasive and necrotic tumor cells as well as a simple model for nutrition consumption and tumor-induced brain edema. The coupling of the model with linear elasticity equations with variable coefficients allows us to capture the mechanical deformations due to the tumor growth on surrounding tissues. We present the overall formulation along with a novel operator-splitting scheme with components that include linearly-implicit preconditioned elliptic solvers, and semi-Lagrangian method for advection. Also, we present results showing simulated MRI images which highlight the capability of our method to capture the overall structure of glioblastomas in MRIs.
Submitted 26 May, 2019; v1 submitted 12 October, 2018;
originally announced October 2018.
-
Large batch size training of neural networks with adversarial training and second-order information
Authors:
Zhewei Yao,
Amir Gholami,
Daiyaan Arfeen,
Richard Liaw,
Joseph Gonzalez,
Kurt Keutzer,
Michael Mahoney
Abstract:
The most straightforward method to accelerate Stochastic Gradient Descent (SGD) computation is to distribute the randomly selected batch of inputs over multiple processors. To keep the distributed processors fully utilized requires commensurately growing the batch size. However, large batch training often leads to poorer generalization. A recently proposed solution for this problem is to use adaptive batch sizes in SGD. In this case, one starts with a small number of processes and scales the processes as training progresses. Two major challenges with this approach are (i) that dynamically resizing the cluster can add non-trivial overhead, in part since it is currently not supported, and (ii) that the overall speed up is limited by the initial phase with smaller batches. In this work, we address both challenges by developing a new adaptive batch size framework, with autoscaling based on the Ray framework. This allows very efficient elastic scaling with negligible resizing overhead (0.32\% of time for ResNet18 ImageNet training). Furthermore, we propose a new adaptive batch size training scheme using second order methods and adversarial training. These enable increasing batch sizes earlier during training, which leads to better training time. We extensively evaluate our method on Cifar-10/100, SVHN, TinyImageNet, and ImageNet datasets, using multiple neural networks, including ResNets and smaller networks such as SqueezeNext. Our method exceeds the performance of existing solutions in terms of both accuracy and the number of SGD iterations (up to 1\% and $5\times$, respectively). Importantly, this is achieved without any additional hyper-parameter tuning to tailor our method in any of these experiments.
Submitted 2 January, 2020; v1 submitted 1 October, 2018;
originally announced October 2018.
-
Improving full waveform inversion by wavefield reconstruction with the alternating direction method of multipliers
Authors:
Hossein S. Aghamiry,
Ali Gholami,
Stephane Operto
Abstract:
Full waveform inversion (FWI) is an iterative nonlinear waveform matching procedure subject to a wave-equation constraint. FWI is highly nonlinear when the wave-equation constraint is enforced at each iteration. To mitigate nonlinearity, wavefield-reconstruction inversion (WRI) expands the search space by relaxing the wave-equation constraint with a penalty method. The pitfall of this approach resides in the tuning of the penalty parameter because increasing values should be used to foster data fitting during early iterations while progressively enforcing the wave-equation constraint during late iterations. However, large values of the penalty parameter lead to ill-conditioned problems. Here, this tuning issue is solved by replacing the penalty method by an augmented Lagrangian method equipped with operator splitting (IR-WRI as iteratively-refined WRI). It is shown that IR-WRI is similar to a penalty method in which data and sources are updated at each iteration by the running sum of the data and source residuals of previous iterations. Moreover, the alternating direction strategy exploits the bilinearity of the wave equation constraint to linearize the subsurface model estimation around the reconstructed wavefield. Accordingly, the original nonlinear FWI is decomposed into a sequence of two linear subproblems, the optimization variable of one subproblem being passed as a passive variable for the next subproblem. The convergence of WRI and that of IR-WRI are first compared with a simple transmission experiment, which lies in the linear regime of FWI. Under the same conditions, IR-WRI converges to a more accurate minimizer with a smaller number of iterations than WRI. More realistic case studies performed with the Marmousi II and the BP salt models show the resilience of IR-WRI to cycle skipping and noise.
Submitted 10 September, 2018; v1 submitted 4 September, 2018;
originally announced September 2018.
-
Towards Resilient Operation of Multi-Microgrids: An MISOCP-Based Frequency-Constrained Approach
Authors:
Amin Gholami,
Xu Andy Sun
Abstract:
High penetration of distributed energy resources (DERs) is transforming the paradigm in power system operation. The ability to provide electricity to customers while the main grid is disrupted has introduced the concept of microgrids with many challenges and opportunities. Emergency control of dangerous transients caused by the transition between the grid-connected and island modes in microgrids is one of the main challenges in this context. To address this challenge, this paper proposes a comprehensive optimization and real-time control framework for maintaining frequency stability of multi-microgrid networks under an islanding event and for achieving optimal load shedding and network topology control with AC power flow constraints. The paper also develops a strong mixed-integer second-order cone programming (MISOCP)-based reformulation and a cutting plane algorithm for scalable computation. We believe this is the first time in the literature that such a framework for multi-microgrid network control is proposed, and its effectiveness is demonstrated with extensive numerical experiments.
Submitted 31 August, 2018;
originally announced September 2018.
-
CLAIRE: A distributed-memory solver for constrained large deformation diffeomorphic image registration
Authors:
Andreas Mang,
Amir Gholami,
Christos Davatzikos,
George Biros
Abstract:
With this work, we release CLAIRE, a distributed-memory implementation of an effective solver for constrained large deformation diffeomorphic image registration problems in three dimensions. We consider an optimal control formulation. We invert for a stationary velocity field that parameterizes the deformation map. Our solver is based on a globalized, preconditioned, inexact reduced space Gauss--Newton--Krylov scheme. We exploit state-of-the-art techniques in scientific computing to develop an effective solver that scales to thousands of distributed memory nodes on high-end clusters. We present the formulation, discuss algorithmic features, describe the software package, and introduce an improved preconditioner for the reduced space Hessian to speed up the convergence of our solver. We test registration performance on synthetic and real data. We demonstrate registration accuracy on several neuroimaging datasets. We compare the performance of our scheme against different flavors of the Demons algorithm for diffeomorphic image registration. We study convergence of our preconditioner and our overall algorithm. We report scalability results on state-of-the-art supercomputing platforms. We demonstrate that we can solve registration problems for clinically relevant data sizes in two to four minutes on a standard compute node with 20 cores, attaining excellent data fidelity. With the present work we achieve a speedup of (on average) 5$\times$ with a peak performance of up to 17$\times$ compared to our former work.
Submitted 9 December, 2019; v1 submitted 13 August, 2018;
originally announced August 2018.
-
Co-Design of Deep Neural Nets and Neural Net Accelerators for Embedded Vision Applications
Authors:
Kiseok Kwon,
Alon Amid,
Amir Gholami,
Bichen Wu,
Krste Asanovic,
Kurt Keutzer
Abstract:
Deep Learning is arguably the most rapidly evolving research area in recent years. As a result, it is not surprising that the design of state-of-the-art deep neural net models proceeds without much consideration of the latest hardware targets, and that the design of neural net accelerators proceeds without much consideration of the characteristics of the latest deep neural net models. Nevertheless, in this paper we show that there are significant improvements available if deep neural net models and neural net accelerators are co-designed.
Submitted 19 April, 2018;
originally announced April 2018.
-
Spatial heterogeneities shape collective behavior of signaling amoeboid cells
Authors:
Torsten Eckstein,
Estefania Vidal-Henriquez,
Albert Bae,
Azam Gholami
Abstract:
We present novel experimental results on pattern formation of signaling Dictyostelium discoideum amoebae in the presence of a periodic array of millimeter-sized pillars. We observe concentric cAMP waves that initiate almost synchronously at the pillars and propagate outwards. These waves have a higher frequency than those from the other firing centers and dominate the system dynamics. The cells respond chemotactically to these circular waves and stream towards the pillars, forming periodic Voronoi domains that reflect the periodicity of the underlying lattice. We performed comprehensive numerical simulations of a reaction-diffusion model to study the characteristics of the boundary conditions given by the obstacles. Our simulations show that the obstacles can act as the wave source depending on the imposed boundary condition. Interestingly, a critical minimum accumulation of cAMP around the obstacles is needed for the pillars to act as the wave source. This critical value is lower at smaller production rates of the intracellular cAMP, which can be controlled in our experiments using caffeine. Experiments and simulations also show that in the presence of caffeine the number of firing centers is reduced, which is crucial in our system for circular waves emitted from the pillars to successfully take over the dynamics. These results are crucial for understanding the signaling mechanism of Dictyostelium cells that experience spatial heterogeneities in their natural habitat.
Submitted 18 April, 2018;
originally announced April 2018.
-
Modelling of Dictyostelium Discoideum Movement in Linear Gradient of Chemoattractant
Authors:
Zahra Eidi,
Farshid Mohammad-Rafiee,
Mohammad Khorrami,
Azam Gholami
Abstract:
Chemotaxis is a ubiquitous biological phenomenon in which cells detect a spatial gradient of chemoattractant, and then move towards the source. Here we present a position-dependent advection-diffusion model that quantitatively describes the statistical features of the chemotactic motion of the social amoeba Dictyostelium discoideum in a linear gradient of cAMP (cyclic adenosine monophosphate). We fit the model to experimental trajectories that are recorded in a microfluidic setup with stationary cAMP gradients and extract the diffusion and drift coefficients in the gradient direction. Our analysis shows that for the majority of gradients, both coefficients decrease in time and become negative as the cells crawl up the gradient. The extracted model parameters also show that besides the expected drift in the direction of the chemoattractant gradient, we observe a nonlinear dependency of the corresponding variance in time, which can be explained by the model. Furthermore, the results of the model show that the nonlinear term in the mean squared displacement of the cell trajectories can dominate the linear term on large time scales.
Submitted 5 April, 2018;
originally announced April 2018.