Search | arXiv e-print repository

arXiv:2203.01924 [pdf, other]

Min-Max Bilevel Multi-objective Optimization with Applications in Machine Learning

Authors: Alex Gu, Songtao Lu, Parikshit Ram, Lily Weng

Abstract: We consider a generic min-max multi-objective bilevel optimization problem with applications in robust machine learning such as representation learning and hyperparameter optimization. We design MORBiT, a novel single-loop gradient descent-ascent bilevel optimization algorithm, to solve the generic problem and present a novel analysis showing that MORBiT converges to the first-order stationary point at a rate of $\widetilde{\mathcal{O}}(n^{1/2} K^{-2/5})$ for a class of weakly convex problems with $n$ objectives upon $K$ iterations of the algorithm. Our analysis utilizes novel results to handle the non-smooth min-max multi-objective setup and to obtain a sublinear dependence in the number of objectives $n$. Experimental results on robust representation learning and robust hyperparameter optimization showcase (i) the advantages of considering the min-max multi-objective setup, and (ii) convergence properties of the proposed MORBiT. Our code is at https://github.com/minimario/MORBiT.

Submitted 7 March, 2023; v1 submitted 3 March, 2022; originally announced March 2022.

Comments: 43 pages, 3 figures, ICLR 2023 version

arXiv:2202.09729 [pdf, other]

It's Raw! Audio Generation with State-Space Models

Authors: Karan Goel, Albert Gu, Chris Donahue, Christopher Ré

Abstract: Developing architectures suitable for modeling raw audio is a challenging problem due to the high sampling rates of audio waveforms. Standard sequence modeling approaches like RNNs and CNNs have previously been tailored to fit the demands of audio, but the resultant architectures make undesirable computational tradeoffs and struggle to model waveforms effectively. We propose SaShiMi, a new multi-scale architecture for waveform modeling built around the recently introduced S4 model for long sequence modeling. We identify that S4 can be unstable during autoregressive generation, and provide a simple improvement to its parameterization by drawing connections to Hurwitz matrices. SaShiMi yields state-of-the-art performance for unconditional waveform generation in the autoregressive setting. Additionally, SaShiMi improves non-autoregressive generation performance when used as the backbone architecture for a diffusion model. Compared to prior architectures in the autoregressive generation setting, SaShiMi generates piano and speech waveforms which humans find more musical and coherent respectively, e.g. 2x better mean opinion scores than WaveNet on an unconditional speech generation task. On a music generation task, SaShiMi outperforms WaveNet on density estimation and speed at both training and inference even when using 3x fewer parameters. Code can be found at https://github.com/HazyResearch/state-spaces and samples at https://hazyresearch.stanford.edu/sashimi-examples.

Submitted 19 February, 2022; originally announced February 2022.

Comments: 23 pages, 7 figures, 7 tables

arXiv:2202.07663 [pdf, other]

doi 10.3847/1538-4357/ac6de4

GIGA-Lens: Fast Bayesian Inference for Strong Gravitational Lens Modeling

Authors: A. Gu, X. Huang, W. Sheu, G. Aldering, A. S. Bolton, K. Boone, A. Dey, A. Filipp, E. Jullo, S. Perlmutter, D. Rubin, E. F. Schlafly, D. J. Schlegel, Y. Shu, S. H. Suyu

Abstract: We present GIGA-Lens: a gradient-informed, GPU-accelerated Bayesian framework for modeling strong gravitational lensing systems, implemented in TensorFlow and JAX. The three components, optimization using multi-start gradient descent, posterior covariance estimation with variational inference, and sampling via Hamiltonian Monte Carlo, all take advantage of gradient information through automatic differentiation and massive parallelization on graphics processing units (GPUs). We test our pipeline on a large set of simulated systems and demonstrate in detail its high level of performance. The average time to model a single system on four Nvidia A100 GPUs is 105 seconds. The robustness, speed, and scalability offered by this framework make it possible to model the large number of strong lenses found in current surveys and present a very promising prospect for the modeling of $\mathcal{O}(10^5)$ lensing systems expected to be discovered in the era of the Vera C. Rubin Observatory, Euclid, and the Nancy Grace Roman Space Telescope.

Submitted 15 February, 2022; originally announced February 2022.

Comments: 23 pages, 13 figures, 2 tables. Submitted to ApJ

arXiv:2202.01602 [pdf, other]

The Disagreement Problem in Explainable Machine Learning: A Practitioner's Perspective

Authors: Satyapriya Krishna, Tessa Han, Alex Gu, Javin Pombra, Shahin Jabbari, Steven Wu, Himabindu Lakkaraju

Abstract: As various post hoc explanation methods are increasingly being leveraged to explain complex models in high-stakes settings, it becomes critical to develop a deeper understanding of if and when the explanations output by these methods disagree with each other, and how such disagreements are resolved in practice. However, there is little to no research that provides answers to these critical questions. In this work, we introduce and study the disagreement problem in explainable machine learning. More specifically, we formalize the notion of disagreement between explanations, analyze how often such disagreements occur in practice, and how do practitioners resolve these disagreements. To this end, we first conduct interviews with data scientists to understand what constitutes disagreement between explanations generated by different methods for the same model prediction, and introduce a novel quantitative framework to formalize this understanding. We then leverage this framework to carry out a rigorous empirical analysis with four real-world datasets, six state-of-the-art post hoc explanation methods, and eight different predictive models, to measure the extent of disagreement between the explanations generated by various popular explanation methods. In addition, we carry out an online user study with data scientists to understand how they resolve the aforementioned disagreements. Our results indicate that state-of-the-art explanation methods often disagree in terms of the explanations they output. Our findings also underscore the importance of developing principled evaluation metrics that enable practitioners to effectively compare explanations.

Submitted 8 February, 2022; v1 submitted 3 February, 2022; originally announced February 2022.

arXiv:2111.00396 [pdf, other]

Efficiently Modeling Long Sequences with Structured State Spaces

Authors: Albert Gu, Karan Goel, Christopher Ré

Abstract: A central goal of sequence modeling is designing a single principled model that can address sequence data across a range of modalities and tasks, particularly on long-range dependencies. Although conventional models including RNNs, CNNs, and Transformers have specialized variants for capturing long dependencies, they still struggle to scale to very long sequences of $10000$ or more steps. A promising recent approach proposed modeling sequences by simulating the fundamental state space model (SSM) $x'(t) = Ax(t) + Bu(t), y(t) = Cx(t) + Du(t)$, and showed that for appropriate choices of the state matrix $A$, this system could handle long-range dependencies mathematically and empirically. However, this method has prohibitive computation and memory requirements, rendering it infeasible as a general sequence modeling solution. We propose the Structured State Space sequence model (S4) based on a new parameterization for the SSM, and show that it can be computed much more efficiently than prior approaches while preserving their theoretical strengths. Our technique involves conditioning $A$ with a low-rank correction, allowing it to be diagonalized stably and reducing the SSM to the well-studied computation of a Cauchy kernel. S4 achieves strong empirical results across a diverse range of established benchmarks, including (i) 91% accuracy on sequential CIFAR-10 with no data augmentation or auxiliary losses, on par with a larger 2-D ResNet, (ii) substantially closing the gap to Transformers on image and language modeling tasks, while performing generation $60\times$ faster, and (iii) SoTA on every task from the Long Range Arena benchmark, including solving the challenging Path-X task of length 16k that all prior work fails on, while being as efficient as all competitors.

Submitted 5 August, 2022; v1 submitted 30 October, 2021; originally announced November 2021.

Comments: ICLR 2022 (Outstanding Paper HM)

arXiv:2110.13985 [pdf, other]

Combining Recurrent, Convolutional, and Continuous-time Models with Linear State-Space Layers

Authors: Albert Gu, Isys Johnson, Karan Goel, Khaled Saab, Tri Dao, Atri Rudra, Christopher Ré

Abstract: Recurrent neural networks (RNNs), temporal convolutions, and neural differential equations (NDEs) are popular families of deep learning models for time-series data, each with unique strengths and tradeoffs in modeling power and computational efficiency. We introduce a simple sequence model inspired by control systems that generalizes these approaches while addressing their shortcomings. The Linear State-Space Layer (LSSL) maps a sequence $u \mapsto y$ by simply simulating a linear continuous-time state-space representation $\dot{x} = Ax + Bu, y = Cx + Du$. Theoretically, we show that LSSL models are closely related to the three aforementioned families of models and inherit their strengths. For example, they generalize convolutions to continuous-time, explain common RNN heuristics, and share features of NDEs such as time-scale adaptation. We then incorporate and generalize recent theory on continuous-time memorization to introduce a trainable subset of structured matrices $A$ that endow LSSLs with long-range memory. Empirically, stacking LSSL layers into a simple deep neural network obtains state-of-the-art results across time series benchmarks for long dependencies in sequential image classification, real-world healthcare regression tasks, and speech. On a difficult speech classification task with length-16000 sequences, LSSL outperforms prior approaches by 24 accuracy points, and even outperforms baselines that use hand-crafted features on 100x shorter sequences.

Submitted 26 October, 2021; originally announced October 2021.

Comments: NeurIPS 2021
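
The recurrent and convolutional views of a state-space layer mentioned in the abstract above can be illustrated with a minimal sketch. It assumes a scalar state and a forward-Euler discretization with step `dt`; both are simplifications chosen for illustration, and the function names are hypothetical rather than taken from the paper or its code.

```python
def simulate_recurrent(a, b, c, d, u, dt):
    """Step x' = a*x + b*u, y = c*x + d*u with forward Euler (scalar state)."""
    x = 0.0
    outputs = []
    for u_k in u:
        outputs.append(c * x + d * u_k)   # read out before updating the state
        x = x + dt * (a * x + b * u_k)    # Euler step: x_{k+1} = x_k + dt * x'
    return outputs


def simulate_convolutional(a, b, c, d, u, dt):
    """Same input-output map, written as a causal convolution with a fixed kernel."""
    a_bar = 1.0 + dt * a                  # discrete state transition
    b_bar = dt * b                        # discrete input gain
    n = len(u)
    # kernel[0] is the direct term; kernel[j] = c * a_bar**(j-1) * b_bar for j >= 1
    kernel = [d] + [c * a_bar ** (j - 1) * b_bar for j in range(1, n)]
    return [sum(kernel[j] * u[k - j] for j in range(k + 1)) for k in range(n)]


inputs = [1.0, 0.0, 0.5, -1.0]
recurrent = simulate_recurrent(-1.0, 1.0, 2.0, 0.0, inputs, 0.1)
convolved = simulate_convolutional(-1.0, 1.0, 2.0, 0.0, inputs, 0.1)
```

Unrolling the recurrence from a zero initial state gives exactly the kernel used in the second function, so the two output lists agree up to floating-point rounding.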

arXiv:2110.03274 [pdf, other]

Three Operator Splitting with Subgradients, Stochastic Gradients, and Adaptive Learning Rates

Authors: Alp Yurtsever, Alex Gu, Suvrit Sra

Abstract: Three Operator Splitting (TOS) (Davis & Yin, 2017) can minimize the sum of multiple convex functions effectively when an efficient gradient oracle or proximal operator is available for each term. This requirement often fails in machine learning applications: (i) instead of full gradients only stochastic gradients may be available; and (ii) instead of proximal operators, using subgradients to handle complex penalty functions may be more efficient and realistic. Motivated by these concerns, we analyze three potentially valuable extensions of TOS. The first two permit using subgradients and stochastic gradients, and are shown to ensure a $\mathcal{O}(1/\sqrt{t})$ convergence rate. The third extension AdapTOS endows TOS with adaptive step-sizes. For the important setting of optimizing a convex loss over the intersection of convex sets AdapTOS attains universal convergence rates, i.e., the rate adapts to the unknown smoothness degree of the objective. We compare our proposed methods with competing methods on various applications.

Submitted 18 February, 2022; v1 submitted 7 October, 2021; originally announced October 2021.

Comments: Appears in the 35th Annual Conference on Neural Information Processing Systems (NeurIPS 2021)
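
For orientation, the baseline iteration that the abstract above extends can be sketched as follows. This is the standard Davis-Yin three operator splitting with exact gradients and proximal operators, not the subgradient, stochastic, or adaptive variants the paper analyzes. The toy problem (a one-dimensional quadratic over the intersection of two intervals), the step size, and all function names are assumptions made for illustration.

```python
def clip(value, low, high):
    """Projection onto the interval [low, high]; the prox of its indicator function."""
    return max(low, min(high, value))


def three_operator_splitting(grad_f, prox_g, prox_h, z0, step, iterations):
    """Davis-Yin iteration for minimizing f(x) + g(x) + h(x) with f smooth."""
    z = z0
    for _ in range(iterations):
        x_g = prox_g(z)
        x_h = prox_h(2.0 * x_g - z - step * grad_f(x_g))
        z = z + x_h - x_g
    return prox_g(z)


# Toy problem: minimize 0.5 * (x - 3)^2 subject to x in [0, 1] and x in [0.5, 2].
solution = three_operator_splitting(
    grad_f=lambda x: x - 3.0,
    prox_g=lambda v: clip(v, 0.0, 1.0),
    prox_h=lambda v: clip(v, 0.5, 2.0),
    z0=0.0,
    step=0.5,
    iterations=50,
)
```

The feasible set here is [0.5, 1] and the unconstrained minimizer is 3, so the constrained minimizer is the right endpoint, 1.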

arXiv:2108.10434 [pdf, ps, other]

doi 10.48550/arXiv.2108.10434

Adaptive shot allocation for fast convergence in variational quantum algorithms

Authors: Andi Gu, Angus Lowe, Pavel A. Dub, Patrick J. Coles, Andrew Arrasmith

Abstract: Variational Quantum Algorithms (VQAs) are a promising approach for practical applications like chemistry and materials science on near-term quantum computers as they typically reduce quantum resource requirements. However, in order to implement VQAs, an efficient classical optimization strategy is required. Here we present a new stochastic gradient descent method using an adaptive number of shots at each step, called the global Coupled Adaptive Number of Shots (gCANS) method, which improves on prior art in both the number of iterations as well as the number of shots required. These improvements reduce both the time and money required to run VQAs on current cloud platforms. We analytically prove that in a convex setting gCANS achieves geometric convergence to the optimum. Further, we numerically investigate the performance of gCANS on some chemical configuration problems. We also consider finding the ground state for an Ising model with different numbers of spins to examine the scaling of the method. We find that for these problems, gCANS compares favorably to all of the other optimizers we consider.

Submitted 23 August, 2021; originally announced August 2021.

Comments: 13 pages, 6 figures, 1 table

Report number: LA-UR-21-28401

arXiv:2106.03306 [pdf, other]

HoroPCA: Hyperbolic Dimensionality Reduction via Horospherical Projections

Authors: Ines Chami, Albert Gu, Dat Nguyen, Christopher Ré

Abstract: This paper studies Principal Component Analysis (PCA) for data lying in hyperbolic spaces. Given directions, PCA relies on: (1) a parameterization of subspaces spanned by these directions, (2) a method of projection onto subspaces that preserves information in these directions, and (3) an objective to optimize, namely the variance explained by projections. We generalize each of these concepts to the hyperbolic space and propose HoroPCA, a method for hyperbolic dimensionality reduction. By focusing on the core problem of extracting principal directions, HoroPCA theoretically better preserves information in the original data such as distances, compared to previous generalizations of PCA. Empirically, we validate that HoroPCA outperforms existing dimensionality reduction methods, significantly reducing error in distance preservation. As a data whitening method, it improves downstream classification by up to 3.9% compared to methods that don't use whitening. Finally, we show that HoroPCA can be used to visualize hyperbolic data in two dimensions.

Submitted 6 June, 2021; originally announced June 2021.

Comments: ICML 2021

Journal ref: PMLR 139:1419-1429, 2021

arXiv:2106.02933 [pdf, other]

k-Mixup Regularization for Deep Learning via Optimal Transport

Authors: Kristjan Greenewald, Anming Gu, Mikhail Yurochkin, Justin Solomon, Edward Chien

Abstract: Mixup is a popular regularization technique for training deep neural networks that improves generalization and increases robustness to certain distribution shifts. It perturbs input training data in the direction of other randomly-chosen instances in the training set. To better leverage the structure of the data, we extend mixup in a simple, broadly applicable way to \emph{$k$-mixup}, which perturbs $k$-batches of training points in the direction of other $k$-batches. The perturbation is done with displacement interpolation, i.e. interpolation under the Wasserstein metric. We demonstrate theoretically and in simulations that $k$-mixup preserves cluster and manifold structures, and we extend theory studying the efficacy of standard mixup to the $k$-mixup case. Our empirical results show that training with $k$-mixup further improves generalization and robustness across several network architectures and benchmark datasets of differing modalities. For the wide variety of real datasets considered, the performance gains of $k$-mixup over standard mixup are similar to or larger than the gains of mixup itself over standard ERM after hyperparameter optimization. In several instances, in fact, $k$-mixup achieves gains in settings where standard mixup has negligible to zero improvement over ERM.

Submitted 7 October, 2023; v1 submitted 5 June, 2021; originally announced June 2021.
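
The displacement interpolation between two $k$-batches described in the abstract above can be sketched in a deliberately simplified setting. The sketch assumes one-dimensional inputs, where sorting both batches gives the optimal matching under a squared-distance cost; general data would need an assignment solver instead. It interpolates inputs only, the weighting convention `(1 - lam) * x + lam * y` is one of two common choices, and the function name is hypothetical.

```python
def k_mixup_1d(batch_a, batch_b, lam):
    """Interpolate two equal-size batches of scalars along their optimal 1-D matching."""
    if len(batch_a) != len(batch_b):
        raise ValueError("both batches must contain the same number of points")
    # In one dimension, pairing the i-th smallest of each batch is the optimal matching.
    matched = zip(sorted(batch_a), sorted(batch_b))
    return [(1.0 - lam) * x + lam * y for x, y in matched]


mixed = k_mixup_1d([3.0, 1.0], [10.0, 20.0], 0.5)
```

With `k = 1` each batch holds a single point and the function reduces to ordinary pairwise interpolation of two inputs.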

arXiv:2103.06710 [pdf, other]

Deep Transfer Learning for Infectious Disease Case Detection Using Electronic Medical Records

Authors: Ye Ye, Andrew Gu

Abstract: During an infectious disease pandemic, it is critical to share electronic medical records or models (learned from these records) across regions. Applying one region's data/model to another region often have distribution shift issues that violate the assumptions of traditional machine learning techniques. Transfer learning can be a solution. To explore the potential of deep transfer learning algorithms, we applied two data-based algorithms (domain adversarial neural networks and maximum classifier discrepancy) and model-based transfer learning algorithms to infectious disease detection tasks. We further studied well-defined synthetic scenarios where the data distribution differences between two regions are known. Our experiments show that, in the context of infectious disease classification, transfer learning may be useful when (1) the source and target are similar and the target training data is insufficient and (2) the target training data does not have labels. Model-based transfer learning works well in the first situation, in which case the performance closely matched that of the data-based transfer learning models. Still, further investigation of the domain shift in real world research data to account for the drop in performance is needed.

Submitted 7 March, 2021; originally announced March 2021.

arXiv:2102.05824 [pdf, other]

Reproducibility Report: La-MAML: Look-ahead Meta Learning for Continual Learning

Authors: Joel Joseph, Alex Gu

Abstract: The Continual Learning (CL) problem involves performing well on a sequence of tasks under limited compute. Current algorithms in the domain are either slow, offline or sensitive to hyper-parameters. La-MAML, an optimization-based meta-learning algorithm claims to be better than other replay-based, prior-based and meta-learning based approaches. According to the MER paper [1], metrics to measure performance in the continual learning arena are Retained Accuracy (RA) and Backward Transfer-Interference (BTI). La-MAML claims to perform better in these values when compared to the SOTA in the domain. This is the main claim of the paper, which we shall be verifying in this report.

Submitted 20 May, 2021; v1 submitted 10 February, 2021; originally announced February 2021.

arXiv:2102.01586 [pdf, other]

U-LanD: Uncertainty-Driven Video Landmark Detection

Authors: Mohammad H. Jafari, Christina Luong, Michael Tsang, Ang Nan Gu, Nathan Van Woudenberg, Robert Rohling, Teresa Tsang, Purang Abolmaesumi

Abstract: This paper presents U-LanD, a framework for joint detection of key frames and landmarks in videos. We tackle a specifically challenging problem, where training labels are noisy and highly sparse. U-LanD builds upon a pivotal observation: a deep Bayesian landmark detector solely trained on key video frames, has significantly lower predictive uncertainty on those frames vs. other frames in videos. We use this observation as an unsupervised signal to automatically recognize key frames on which we detect landmarks. As a test-bed for our framework, we use ultrasound imaging videos of the heart, where sparse and noisy clinical labels are only available for a single frame in each video. Using data from 4,493 patients, we demonstrate that U-LanD can exceedingly outperform the state-of-the-art non-Bayesian counterpart by a noticeable absolute margin of 42% in R2 score, with almost no overhead imposed on the model size. Our approach is generic and can be potentially applied to other challenging data with noisy and sparse training labels.

Submitted 2 February, 2021; originally announced February 2021.

arXiv:2012.14966 [pdf, other]

Kaleidoscope: An Efficient, Learnable Representation For All Structured Linear Maps

Authors: Tri Dao, Nimit S. Sohoni, Albert Gu, Matthew Eichhorn, Amit Blonder, Megan Leszczynski, Atri Rudra, Christopher Ré

Abstract: Modern neural network architectures use structured linear transformations, such as low-rank matrices, sparse matrices, permutations, and the Fourier transform, to improve inference speed and reduce memory usage compared to general linear maps. However, choosing which of the myriad structured transformations to use (and its associated parameterization) is a laborious task that requires trading off speed, space, and accuracy. We consider a different approach: we introduce a family of matrices called kaleidoscope matrices (K-matrices) that provably capture any structured matrix with near-optimal space (parameter) and time (arithmetic operation) complexity. We empirically validate that K-matrices can be automatically learned within end-to-end pipelines to replace hand-crafted procedures, in order to improve model quality. For example, replacing channel shuffles in ShuffleNet improves classification accuracy on ImageNet by up to 5%. K-matrices can also simplify hand-engineered pipelines -- we replace filter bank feature computation in speech data preprocessing with a learnable kaleidoscope layer, resulting in only 0.4% loss in accuracy on the TIMIT speech recognition task. In addition, K-matrices can capture latent structure in models: for a challenging permuted image classification task, a K-matrix based representation of permutations is able to learn the right latent structure and improves accuracy of a downstream convolutional model by over 9%. We provide a practically efficient implementation of our approach, and use K-matrices in a Transformer network to attain 36% faster end-to-end inference speed on a language translation task.

Submitted 5 January, 2021; v1 submitted 29 December, 2020; originally announced December 2020.

Comments: International Conference on Learning Representations (ICLR) 2020 spotlight

arXiv:2011.12945 [pdf, other]

No Subclass Left Behind: Fine-Grained Robustness in Coarse-Grained Classification Problems

Authors: Nimit S. Sohoni, Jared A. Dunnmon, Geoffrey Angus, Albert Gu, Christopher Ré

Abstract: In real-world classification tasks, each class often comprises multiple finer-grained "subclasses." As the subclass labels are frequently unavailable, models trained using only the coarser-grained class labels often exhibit highly variable performance across different subclasses. This phenomenon, known as hidden stratification, has important consequences for models deployed in safety-critical applications such as medicine. We propose GEORGE, a method to both measure and mitigate hidden stratification even when subclass labels are unknown. We first observe that unlabeled subclasses are often separable in the feature space of deep neural networks, and exploit this fact to estimate subclass labels for the training data via clustering techniques. We then use these approximate subclass labels as a form of noisy supervision in a distributionally robust optimization objective. We theoretically characterize the performance of GEORGE in terms of the worst-case generalization error across any subclass. We empirically validate GEORGE on a mix of real-world and benchmark image classification datasets, and show that our approach boosts worst-case subclass accuracy by up to 22 percentage points compared to standard training techniques, without requiring any prior information about the subclasses.

Submitted 10 April, 2022; v1 submitted 25 November, 2020; originally announced November 2020.

Comments: 40 pages. Published as a conference paper at NeurIPS 2020
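
The worst-case subclass accuracy reported in the abstract above is a simple quantity to compute once subclass assignments exist. The sketch below shows only that metric, not the GEORGE method itself; it assumes subclass identifiers are already available (from ground truth or from a clustering step), and the function name and toy data are hypothetical.

```python
from collections import defaultdict


def worst_subclass_accuracy(predictions, labels, subclasses):
    """Return (subclass, accuracy) for the subclass with the lowest accuracy."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for predicted, label, subclass in zip(predictions, labels, subclasses):
        total[subclass] += 1
        correct[subclass] += int(predicted == label)
    accuracy = {s: correct[s] / total[s] for s in total}
    worst = min(accuracy, key=accuracy.get)
    return worst, accuracy[worst]


# Toy data: overall accuracy is 5/6, but subclass "b" is only half right.
worst = worst_subclass_accuracy(
    predictions=[1, 1, 0, 0, 0, 0],
    labels=[1, 1, 1, 0, 0, 0],
    subclasses=["a", "a", "b", "b", "c", "c"],
)
```

The gap between the overall figure and the worst subclass is the kind of hidden stratification the abstract describes.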

arXiv:2010.00402 [pdf, other]

From Trees to Continuous Embeddings and Back: Hyperbolic Hierarchical Clustering

Authors: Ines Chami, Albert Gu, Vaggos Chatziafratis, Christopher Ré

Abstract: Similarity-based Hierarchical Clustering (HC) is a classical unsupervised machine learning algorithm that has traditionally been solved with heuristic algorithms like Average-Linkage. Recently, Dasgupta reframed HC as a discrete optimization problem by introducing a global cost function measuring the quality of a given tree. In this work, we provide the first continuous relaxation of Dasgupta's discrete optimization problem with provable quality guarantees. The key idea of our method, HypHC, is showing a direct correspondence from discrete trees to continuous representations (via the hyperbolic embeddings of their leaf nodes) and back (via a decoding algorithm that maps leaf embeddings to a dendrogram), allowing us to search the space of discrete binary trees with continuous optimization. Building on analogies between trees and hyperbolic space, we derive a continuous analogue for the notion of lowest common ancestor, which leads to a continuous relaxation of Dasgupta's discrete objective. We can show that after decoding, the global minimizer of our continuous relaxation yields a discrete tree with a (1 + epsilon)-factor approximation for Dasgupta's optimal tree, where epsilon can be made arbitrarily small and controls optimization challenges. We experimentally evaluate HypHC on a variety of HC benchmarks and find that even approximate solutions found with gradient descent have superior clustering quality than agglomerative heuristics or other gradient based algorithms. Finally, we highlight the flexibility of HypHC using end-to-end training in a downstream classification task.

Submitted 1 October, 2020; originally announced October 2020.

arXiv:2009.11242 [pdf, other]

Using Undersampling with Ensemble Learning to Identify Factors Contributing to Preterm Birth

Authors: Shi Dong, Zlatan Feric, Guangyu Li, Chieh Wu, April Z. Gu, Jennifer Dy, John Meeker, Ingrid Y. Padilla, Jose Cordero, Carmen Velez Vega, Zaira Rosario, Akram Alshawabkeh, David Kaeli

Abstract: In this paper, we propose Ensemble Learning models to identify factors contributing to preterm birth. Our work leverages a rich dataset collected by a NIEHS P42 Center that is trying to identify the dominant factors responsible for the high rate of premature births in northern Puerto Rico. We investigate analytical models addressing two major challenges present in the dataset: 1) the significant amount of incomplete data in the dataset, and 2) class imbalance in the dataset. First, we leverage and compare two types of missing data imputation methods: 1) mean-based and 2) similarity-based, increasing the completeness of this dataset. Second, we propose a feature selection and evaluation model based on using undersampling with Ensemble Learning to address class imbalance present in the dataset. We leverage and compare multiple Ensemble Feature selection methods, including Complete Linear Aggregation (CLA), Weighted Mean Aggregation (WMA), Feature Occurrence Frequency (OFA), and Classification Accuracy Based Aggregation (CAA). To further address missing data present in each feature, we propose two novel methods: 1) Missing Data Rate and Accuracy Based Aggregation (MAA), and 2) Entropy and Accuracy Based Aggregation (EAA). Both proposed models balance the degree of data variance introduced by the missing data handling during the feature selection process while maintaining model performance. Our results show a 42% improvement in sensitivity versus fallout over previous state-of-the-art methods.

Submitted 23 September, 2020; originally announced September 2020.

Journal ref: ICMLA 2020

arXiv:2008.07669 [pdf, other]

HiPPO: Recurrent Memory with Optimal Polynomial Projections

Authors: Albert Gu, Tri Dao, Stefano Ermon, Atri Rudra, Christopher Re

Abstract: A central problem in learning from sequential data is representing cumulative history in an incremental fashion as more data is processed. We introduce a general framework (HiPPO) for the online compression of continuous signals and discrete time series by projection onto polynomial bases. Given a measure that specifies the importance of each time step in the past, HiPPO produces an optimal solution to a natural online function approximation problem. As special cases, our framework yields a short derivation of the recent Legendre Memory Unit (LMU) from first principles, and generalizes the ubiquitous gating mechanism of recurrent neural networks such as GRUs. This formal framework yields a new memory update mechanism (HiPPO-LegS) that scales through time to remember all history, avoiding priors on the timescale. HiPPO-LegS enjoys the theoretical benefits of timescale robustness, fast updates, and bounded gradients. By incorporating the memory dynamics into recurrent neural networks, HiPPO RNNs can empirically capture complex temporal dependencies. On the benchmark permuted MNIST dataset, HiPPO-LegS sets a new state-of-the-art accuracy of 98.3%. Finally, on a novel trajectory classification task testing robustness to out-of-distribution timescales and missing data, HiPPO-LegS outperforms RNN and neural ODE baselines by 25-40% accuracy.

Submitted 22 October, 2020; v1 submitted 17 August, 2020; originally announced August 2020.

arXiv:2008.07393 [pdf, other]

Rotation-Invariant Gait Identification with Quaternion Convolutional Neural Networks

Authors: Bowen Jing, Vinay Prabhu, Angela Gu, John Whaley

Abstract: A desirable property of accelerometric gait-based identification systems is robustness to new device orientations presented by users during testing but unseen during the training phase. However, traditional convolutional neural networks (CNNs) used in these systems compensate poorly for such transformations. In this paper, we target this problem by introducing Quaternion CNN, a network architecture which is intrinsically layer-wise equivariant and globally invariant under 3D rotations of an array of input vectors. We show empirically that this network indeed significantly outperforms a traditional CNN in a multi-user rotation-invariant gait classification setting. Lastly, we demonstrate how the kernels learned by this QCNN can also be visualized as basis-independent but origin- and chirality-dependent trajectory fragments in the Euclidean space, thus yielding a novel mode of feature visualization and extraction.

Submitted 4 August, 2020; originally announced August 2020.

arXiv:2008.06775 [pdf, other]

Model Patching: Closing the Subgroup Performance Gap with Data Augmentation

Authors: Karan Goel, Albert Gu, Yixuan Li, Christopher Ré

Abstract: Classifiers in machine learning are often brittle when deployed. Particularly concerning are models with inconsistent performance on specific subgroups of a class, e.g., exhibiting disparities in skin cancer classification in the presence or absence of a spurious bandage. To mitigate these performance differences, we introduce model patching, a two-stage framework for improving robustness that encourages the model to be invariant to subgroup differences, and focus on class information shared by subgroups. Model patching first models subgroup features within a class and learns semantic transformations between them, and then trains a classifier with data augmentations that deliberately manipulate subgroup features. We instantiate model patching with CAMEL, which (1) uses a CycleGAN to learn the intra-class, inter-subgroup augmentations, and (2) balances subgroup performance using a theoretically-motivated subgroup consistency regularizer, accompanied by a new robust objective. We demonstrate CAMEL's effectiveness on 3 benchmark datasets, with reductions in robust error of up to 33% relative to the best baseline. Lastly, CAMEL successfully patches a model that fails due to spurious features on a real-world skin cancer dataset.

Submitted 15 August, 2020; originally announced August 2020.

arXiv:2007.11156 [pdf, ps, other]

Weak Pullback Mean Random Attractors for Stochastic Evolution Equations and Applications

Authors: Anhui Gu

Abstract: In this paper, we investigate the existence and uniqueness of weak pullback mean random attractors for abstract stochastic evolution equations with general diffusion terms in Bochner spaces. As applications, the existence and uniqueness of weak pullback mean random attractors for some stochastic models such as stochastic reaction-diffusion equations, the stochastic $p$-Laplace equation and stochastic porous media equations are established.

Submitted 15 September, 2020; v1 submitted 21 July, 2020; originally announced July 2020.

Comments: Few details were improved. Comments are welcome

MSC Class: 35B40; 35B41; 37L30

arXiv:2005.04730 [pdf, other]

doi 10.3847/1538-4357/abd62b

Discovering New Strong Gravitational Lenses in the DESI Legacy Imaging Surveys

Authors: X. Huang, C. Storfer, A. Gu, V. Ravi, A. Pilon, W. Sheu, R. Venguswamy, S. Banka, A. Dey, M. Landriau, D. Lang, A. Meisner, J. Moustakas, A. D. Myers, R. Sajith, E. F. Schlafly, D. J. Schlegel

Abstract: We have conducted a search for new strong gravitational lensing systems in the Dark Energy Spectroscopic Instrument Legacy Imaging Surveys' Data Release 8. We use deep residual neural networks, building on previous work presented in Huang et al. (2020). These surveys together cover approximately one third of the sky visible from the northern hemisphere, reaching a z band AB magnitude of ~22.5. We compile a training sample that consists of known lensing systems as well as non-lenses in the Legacy Surveys and the Dark Energy Survey. After applying our trained neural networks to the survey data, we visually inspect and rank images with probabilities above a threshold. Here we present 1210 new strong lens candidates.

Submitted 10 January, 2021; v1 submitted 7 May, 2020; originally announced May 2020.

Comments: 31 pages, 17 figures. Accepted for publication in The Astrophysical Journal. The strong lens candidates discovered in this article can be found at https://sites.google.com/usfca.edu/neuralens/publications/lens-candidates-huang-2020b

arXiv:2002.10425 [pdf, ps, other]

Rough Path Theory to approximate Random Dynamical Systems

Authors: Hongjun Gao, María J. Garrido-Atienza, Anhui Gu, Kening Lu, Björn Schmalfuss

Abstract: We consider the rough differential equation $dY=f(Y)d\boldsymbol{ω}$ where $\boldsymbol{ω}=(ω,\bbomega)$ is a rough path defined by a Brownian motion $ω$ on $\mathbb{R}^m$. Under the usual regularity assumption on $f$, namely $f\in C^3_b (\mathbb{R}^d, \mathbb{R}^{d\times m})$, the rough differential equation has a unique solution that defines a random dynamical system $φ_0$. On the other hand, we also consider an ordinary random differential equation $dY_δ=f(Y_δ)dω_δ$, where $ω_δ$ is a random process with stationary increments and continuously differentiable paths that approximates $ω$. The latter differential equation generates a random dynamical system $φ_δ$ as well. We show the convergence of the random dynamical system $φ_δ$ to $φ_0$ for $δ\to 0$ in Hölder norm.

Submitted 24 February, 2020; originally announced February 2020.

Comments: 23 pages

arXiv:1910.09890 [pdf, other]

Improving the Gating Mechanism of Recurrent Neural Networks

Authors: Albert Gu, Caglar Gulcehre, Tom Le Paine, Matt Hoffman, Razvan Pascanu

Abstract: Gating mechanisms are widely used in neural network models, where they allow gradients to backpropagate more easily through depth or time. However, their saturation property introduces problems of its own. For example, in recurrent models these gates need to have outputs near 1 to propagate information over long time-delays, which requires them to operate in their saturation regime and hinders gradient-based learning of the gate mechanism. We address this problem by deriving two synergistic modifications to the standard gating mechanism that are easy to implement, introduce no additional hyperparameters, and improve learnability of the gates when they are close to saturation. We show how these changes are related to and improve on alternative recently proposed gating mechanisms such as chrono initialization and Ordered Neurons. Empirically, our simple gating mechanisms robustly improve the performance of recurrent models on a range of applications, including synthetic memorization tasks, sequential image classification, language modeling, and reinforcement learning, particularly when long-term dependencies are involved.

Submitted 18 June, 2020; v1 submitted 22 October, 2019; originally announced October 2019.

Comments: International Conference on Machine Learning 2020

arXiv:1908.03914 [pdf, ps, other]

Arithmetic of weighted Catalan numbers

Authors: Yibo Gao, Andrew Gu

Abstract: In this paper, we study arithmetic properties of weighted Catalan numbers. Previously, Postnikov and Sagan found conditions under which the $2$-adic valuations of the weighted Catalan numbers are equal to the $2$-adic valuations of the Catalan numbers. We obtain the same result under weaker conditions by considering a map from a class of functions to $2$-adic integers. These methods are also extended to $q$-weighted Catalan numbers, strengthening a previous result by Konvalinka. Finally, we prove some results on the periodicity of weighted Catalan numbers modulo an integer and apply them to the specific case of the number of combinatorial types of Morse links. Many open questions are mentioned.

Submitted 11 August, 2019; originally announced August 2019.

Comments: 27 pages

arXiv:1907.08362 [pdf, ps, other]

Sparse Recovery for Orthogonal Polynomial Transforms

Authors: Anna Gilbert, Albert Gu, Christopher Re, Atri Rudra, Mary Wootters

Abstract: In this paper we consider the following sparse recovery problem. We have query access to a vector $\mathbf{x} \in \mathbb{R}^N$ such that $\hat{\mathbf{x}} = \mathbf{F} \mathbf{x}$ is $k$-sparse (or nearly $k$-sparse) for some orthogonal transform $\mathbf{F}$. The goal is to output an approximation (in an $\ell_2$ sense) to $\hat{\mathbf{x}}$ in sublinear time. This problem has been well-studied in the special case that $\mathbf{F}$ is the Discrete Fourier Transform (DFT), and a long line of work has resulted in sparse Fast Fourier Transforms that run in time $O(k \cdot \mathrm{polylog} N)$. However, for transforms $\mathbf{F}$ other than the DFT (or closely related transforms like the Discrete Cosine Transform), the question is much less settled. In this paper we give sublinear-time algorithms, running in time $\mathrm{poly}(k \log(N))$, for solving the sparse recovery problem for orthogonal transforms $\mathbf{F}$ that arise from orthogonal polynomials. More precisely, our algorithm works for any $\mathbf{F}$ that is an orthogonal polynomial transform derived from Jacobi polynomials. The Jacobi polynomials are a large class of classical orthogonal polynomials (and include Chebyshev and Legendre polynomials as special cases), and show up extensively in applications like numerical analysis and signal processing. One caveat of our work is that we require an assumption on the sparsity structure of the sparse vector, although we note that vectors with random support have this property with high probability. Our approach is to give a very general reduction from the $k$-sparse sparse recovery problem to the $1$-sparse sparse recovery problem that holds for any flat orthogonal polynomial transform; then we solve this one-sparse recovery problem for transforms derived from Jacobi polynomials.

Submitted 18 July, 2019; originally announced July 2019.

Comments: 64 pages

arXiv:1903.05895 [pdf, other]

Learning Fast Algorithms for Linear Transforms Using Butterfly Factorizations

Authors: Tri Dao, Albert Gu, Matthew Eichhorn, Atri Rudra, Christopher Ré

Abstract: Fast linear transforms are ubiquitous in machine learning, including the discrete Fourier transform, discrete cosine transform, and other structured transformations such as convolutions. All of these transforms can be represented by dense matrix-vector multiplication, yet each has a specialized and highly efficient (subquadratic) algorithm. We ask to what extent hand-crafting these algorithms and implementations is necessary, what structural priors they encode, and how much knowledge is required to automatically learn a fast algorithm for a provided structured transform. Motivated by a characterization of fast matrix-vector multiplication as products of sparse matrices, we introduce a parameterization of divide-and-conquer methods that is capable of representing a large class of transforms. This generic formulation can automatically learn an efficient algorithm for many important transforms; for example, it recovers the $O(N \log N)$ Cooley-Tukey FFT algorithm to machine precision, for dimensions $N$ up to $1024$. Furthermore, our method can be incorporated as a lightweight replacement of generic matrices in machine learning pipelines to learn efficient and compressible transformations. On a standard task of compressing a single hidden-layer network, our method exceeds the classification accuracy of unconstrained matrices on CIFAR-10 by 3.9 points -- the first time a structured approach has done so -- with 4X faster inference speed and 40X fewer parameters.

Submitted 28 December, 2020; v1 submitted 14 March, 2019; originally announced March 2019.

Comments: International Conference on Machine Learning (ICML) 2019

arXiv:1902.07152 [pdf, other]

doi 10.1103/PhysRevC.101.024908

Elliptical flow coalescence to identify the $f_{0}$(980) content

Authors: An Gu, Terrence Edmonds, Jie Zhao, Fuqiang Wang

Abstract: We use a simple coalescence model to generate $f_{0}$(980) particles for three configurations: a ${s\bar{s}}$ meson, a ${u\bar{u}s\bar{s}}$ tetraquark, and a ${K^{+}K^{-}}$ molecule. The phase-space information of the coalescing constituents is taken from a multi-phase transport (AMPT) simulation of heavy-ion collisions. It is shown that the number of constituent quarks scaling of the elliptic flow anisotropy can be used to discern ${s\bar{s}}$ from ${u\bar{u}s\bar{s}}$ and ${K^{+}K^{-}}$ configurations.

Submitted 19 February, 2019; originally announced February 2019.

Journal ref: Phys. Rev. C 101, 024908 (2020)

arXiv:1810.02309 [pdf, other]

Learning Compressed Transforms with Low Displacement Rank

Authors: Anna T. Thomas, Albert Gu, Tri Dao, Atri Rudra, Christopher Ré

Abstract: The low displacement rank (LDR) framework for structured matrices represents a matrix through two displacement operators and a low-rank residual. Existing use of LDR matrices in deep learning has applied fixed displacement operators encoding forms of shift invariance akin to convolutions. We introduce a class of LDR matrices with more general displacement operators, and explicitly learn over both the operators and the low-rank component. This class generalizes several previous constructions while preserving compression and efficient computation. We prove bounds on the VC dimension of multi-layer neural networks with structured weight matrices and show empirically that our compact parameterization can reduce the sample complexity of learning. When replacing weight layers in fully-connected, convolutional, and recurrent neural networks for image classification and language modeling tasks, our new classes exceed the accuracy of existing compression approaches, and on some tasks also outperform general unstructured layers while using more than 20x fewer parameters.

Submitted 1 January, 2019; v1 submitted 4 October, 2018; originally announced October 2018.

Comments: NeurIPS 2018. Code available at https://github.com/HazyResearch/structured-nets

arXiv:1804.03329 [pdf, other]

Representation Tradeoffs for Hyperbolic Embeddings

Authors: Christopher De Sa, Albert Gu, Christopher Ré, Frederic Sala

Abstract: Hyperbolic embeddings offer excellent quality with few dimensions when embedding hierarchical data structures like synonym or type hierarchies. Given a tree, we give a combinatorial construction that embeds the tree in hyperbolic space with arbitrarily low distortion without using optimization. On WordNet, our combinatorial embedding obtains a mean-average-precision of 0.989 with only two dimensions, while Nickel et al.'s recent construction obtains 0.87 using 200 dimensions. We provide upper and lower bounds that allow us to characterize the precision-dimensionality tradeoff inherent in any hyperbolic embedding. To embed general metric spaces, we propose a hyperbolic generalization of multidimensional scaling (h-MDS). We show how to perform exact recovery of hyperbolic points from distances, provide a perturbation analysis, and give a recovery result that allows us to reduce dimensionality. The h-MDS approach offers consistently low distortion even with few dimensions across several datasets. Finally, we extract lessons from the algorithms and theory above to design a PyTorch-based implementation that can handle incomplete information and is scalable.

Submitted 24 April, 2018; v1 submitted 9 April, 2018; originally announced April 2018.

arXiv:1803.06084 [pdf, other]

A Kernel Theory of Modern Data Augmentation

Authors: Tri Dao, Albert Gu, Alexander J. Ratner, Virginia Smith, Christopher De Sa, Christopher Ré

Abstract: Data augmentation, a technique in which a training set is expanded with class-preserving transformations, is ubiquitous in modern machine learning pipelines. In this paper, we seek to establish a theoretical framework for understanding data augmentation. We approach this from two directions: First, we provide a general model of augmentation as a Markov process, and show that kernels appear naturally with respect to this model, even when we do not employ kernel classification. Next, we analyze more directly the effect of augmentation on kernel classifiers, showing that data augmentation can be approximated by first-order feature averaging and second-order variance regularization components. These frameworks both serve to illustrate the ways in which data augmentation affects the downstream learning model, and the resulting analyses provide novel connections between prior work in invariant kernels, tangent propagation, and robust optimization. Finally, we provide several proof-of-concept applications showing that our theory can be useful for accelerating machine learning workflows, such as reducing the amount of computation needed to train using augmented data, and predicting the utility of a transformation prior to training.

Submitted 20 March, 2019; v1 submitted 16 March, 2018; originally announced March 2018.

arXiv:1802.08290 [pdf, other]

Locally Adaptive Learning Loss for Semantic Image Segmentation

Authors: Jinjiang Guo, Pengyuan Ren, Aiguo Gu, Jian Xu, Weixin Wu

Abstract: We propose a novel locally adaptive learning estimator for enhancing the inter- and intra-discriminative capabilities of Deep Neural Networks, which can be used as an improved loss layer for semantic image segmentation tasks. Most loss layers compute a pixel-wise cost between feature maps and ground truths, ignoring spatial layouts and interactions between neighboring pixels of the same object category, and thus networks cannot be effectively sensitive to intra-class connections. Stride by stride, our method first applies an adaptive pooling filter over predicted feature maps, aiming to merge predicted distributions over a small group of neighboring pixels of the same category, and then computes the cost between the merged distribution vector and their category label. Such a design involves groups of neighboring predictions from the same category in estimating prediction correctness with respect to their category, and hence trains networks to be more sensitive to regional connections between adjacent pixels based on their categories. In experiments on the Pascal VOC 2012 segmentation dataset, the consistently improved results show that our proposed approach achieves better segmentation masks than previous counterparts.

Submitted 15 April, 2018; v1 submitted 23 February, 2018; originally announced February 2018.

Comments: 8 pages, 4 figures

arXiv:1611.01569 [pdf, ps, other]

A Two Pronged Progress in Structured Dense Matrix Multiplication

Authors: Christopher De Sa, Albert Gu, Rohan Puttagunta, Christopher Ré, Atri Rudra

Abstract: Matrix-vector multiplication is one of the most fundamental computing primitives. Given a matrix $A\in\mathbb{F}^{N\times N}$ and a vector $b$, it is known that in the worst case $Θ(N^2)$ operations over $\mathbb{F}$ are needed to compute $Ab$. A broad question is to identify classes of structured dense matrices that can be represented with $O(N)$ parameters, and for which matrix-vector multiplication can be performed sub-quadratically. One such class of structured matrices is the orthogonal polynomial transforms, whose rows correspond to a family of orthogonal polynomials. Other well known classes include the Toeplitz, Hankel, Vandermonde, Cauchy matrices and their extensions that are all special cases of a low displacement rank property. In this paper, we make progress on two fronts: 1. We introduce the notion of recurrence width of matrices. For matrices with constant recurrence width, we design algorithms to compute $Ab$ and $A^Tb$ with a near-linear number of operations. This notion of width is finer than all the above classes of structured matrices and thus we can compute multiplication for all of them using the same core algorithm. 2. We additionally adapt this algorithm to an algorithm for a much more general class of matrices with displacement structure: those with low displacement rank with respect to quasiseparable matrices. This class includes Toeplitz-plus-Hankel-like matrices, Discrete Cosine/Sine Transforms, and more, and captures all previously known matrices with displacement structure that we are aware of under a unified parametrization and algorithm. Our work unifies, generalizes, and simplifies existing state-of-the-art results in structured matrix-vector multiplication. Finally, we show how applications in areas such as multipoint evaluations of multivariate polynomials can be reduced to problems involving low recurrence width matrices.

Submitted 17 November, 2017; v1 submitted 4 November, 2016; originally announced November 2016.

arXiv:1412.1160 [pdf, ps, other]

Regularity of pullback attractors and equilibrium for non-autonomous stochastic FitzHugh-Nagumo system on unbounded domains

Authors: Wenqiang Zhao, Anhui Gu

Abstract: A theory on bi-spatial random attractors developed recently by Li et al. is extended to study the stochastic FitzHugh-Nagumo system driven by a non-autonomous term as well as a general multiplicative noise. By using the so-called notions of uniform absorption and uniformly pullback asymptotic compactness, it is shown that every generated random cocycle has a pullback attractor in $L^l(\mathbb{R}^N)\times L^2(\mathbb{R}^N)$ with $l\in(2,p]$, and the family of obtained attractors is upper semi-continuous at any intensity of noise. Moreover, if some additional conditions are added, then the system possesses a unique equilibrium and is attracted by a single point.

Submitted 27 April, 2015; v1 submitted 2 December, 2014; originally announced December 2014.

MSC Class: 60H15; 35R60; 35B40; 35B41

arXiv:1409.5938 [pdf, ps, other]

A random attractor for stochastic porous media equations on infinite lattices

Authors: Anhui Gu, Yangrong Li, Jia Li

Abstract: The paper is devoted to studying the existence of a random attractor for stochastic porous media equations on infinite lattices under some conditions.

Submitted 22 September, 2014; v1 submitted 21 September, 2014; originally announced September 2014.

Comments: 15 pages. Some key details have been added

arXiv:1408.6128 [pdf, ps, other]

doi 10.1142/S0218127413500417

Random Attractors of Stochastic Lattice Dynamical Systems Driven by Fractional Brownian Motions and its Erratum

Authors: Anhui Gu

Abstract: This paper is devoted to considering the stochastic lattice dynamical systems (SLDS) driven by fractional Brownian motions with Hurst parameter bigger than $1/2$. Under usual dissipativity conditions these SLDS are shown to generate a random dynamical system for which the existence and uniqueness of a random attractor is established. Furthermore, the random attractor is in fact a singleton sets random attractor. Next, we give an erratum because of the misused theory.

Submitted 27 August, 2014; v1 submitted 26 August, 2014; originally announced August 2014.

Comments: arXiv admin note: substantial text overlap with arXiv:1310.7113

arXiv:1408.2794 [pdf, other]

Sector-Based Factor Models for Asset Returns

Authors: Angela Gu, Patrick Zeng

Abstract: Factor analysis is a statistical technique employed to evaluate how observed variables correlate through common factors and unique variables. While it is often used to analyze price movement in the unstable stock market, it does not always yield easily interpretable results. In this study, we develop improved factor models by explicitly incorporating sector information on our studied stocks. We add eleven sectors of stocks as defined by the IBES, represented by respective sector-specific factors, to non-specific market factors to revise the factor model. We then develop an expectation maximization (EM) algorithm to compute our revised model with 15 years' worth of S&P 500 stocks' daily close prices. Our results in most sectors show that nearly all of these factor components have the same sign, consistent with the intuitive idea that stocks in the same sector tend to rise and fall in coordination over time. Results obtained by the classic factor model, in contrast, had a homogeneous blend of positive and negative components. We conclude that results produced by our sector-based factor model are more interpretable than those produced by the classic non-sector-based model for at least some stock sectors.

Submitted 11 August, 2014; originally announced August 2014.

Comments: 10 pages, 6 figures

arXiv:1404.0488 [pdf, ps, other]

Sufficient Criteria for Existence of Pullback Attractors for Stochastic Lattice Dynamical Systems with Deterministic Non-autonomous Terms

Authors: Anhui Gu, Yangrong Li

Abstract: We consider the pullback attractors for non-autonomous dynamical systems generated by stochastic lattice differential equations with non-autonomous deterministic terms. We first establish a sufficient condition for existence of pullback attractors of lattice dynamical systems with both non-autonomous deterministic and random forcing terms. As an application of the abstract theory, we prove the existence of a unique pullback attractor for the first-order lattice dynamical systems with both deterministic non-autonomous forcing terms and multiplicative white noise. Our results recover many existing ones on the existence of pullback attractors for lattice dynamical systems with autonomous terms or white noises.

Submitted 2 April, 2014; originally announced April 2014.

arXiv:1312.2661 [pdf, ps, other]

doi 10.1016/j.cnsns.2013.08.036

Random Attractor For Stochastic Lattice FitzHugh-Nagumo System Driven By $α$-stable Lévy Noises

Authors: Anhui Gu, Yangrong Li, Jia Li

Abstract: The present paper is devoted to the existence of a random attractor for stochastic lattice FitzHugh-Nagumo system driven by $α$-stable Lévy noises under some dissipative conditions.

Submitted 23 February, 2014; v1 submitted 9 December, 2013; originally announced December 2013.

arXiv:1312.2659 [pdf, ps, other]

Synchronization of Coupled Stochastic Systems Driven by Non-Gaussian Lévy Noises

Authors: Anhui Gu, Yangrong Li

Abstract: We consider the synchronization of the solutions to coupled stochastic systems of $N$-stochastic ordinary differential equations (SODEs) driven by non-Gaussian Lévy noises ($N\in \mathbb{N}$). We discuss the synchronization between two solutions and among different components of solutions under certain dissipative and integrability conditions. Our results generalize the present work obtained in Liu et al. (2010) and Shen et al. (2010).

Submitted 9 December, 2013; originally announced December 2013.

Comments: arXiv admin note: substantial text overlap with arXiv:1402.1790 by other authors

arXiv:1310.7113 [pdf, ps, other]

doi 10.1016/j.cnsns.2014.04.005

Singleton sets random attractor for stochastic FitzHugh-Nagumo lattice equations driven by fractional Brownian motions

Authors: Anhui Gu, Yangrong Li

Abstract: The paper is devoted to the study of the dynamical behavior of the solutions of stochastic FitzHugh-Nagumo lattice equations, driven by fractional Brownian motions, with Hurst parameter greater than $1/2$. Under some usual dissipativity conditions, the system considered here features different dynamics from the same one perturbed by Brownian motion. In our case, the random dynamical system has a unique random equilibrium, which constitutes a singleton sets random attractor.

Submitted 9 April, 2014; v1 submitted 26 October, 2013; originally announced October 2013.

Comments: Some details (including the Abstract section) have been improved

MSC Class: 60H15; 37L60; 35B40; 35B41

arXiv:1307.3757 [pdf, ps, other]

The Power of Deferral: Maintaining a Constant-Competitive Steiner Tree Online

Authors: Albert Gu, Anupam Gupta, Amit Kumar

Abstract: In the online Steiner tree problem, a sequence of points is revealed one-by-one: when a point arrives, we only have time to add a single edge connecting this point to the previous ones, and we want to minimize the total length of edges added. For two decades, we have known that the greedy algorithm maintains a tree whose cost is O(log n) times the Steiner tree cost, and this is best possible. But suppose, in addition to the new edge we add, we can change a single edge from the previous set of edges: can we do much better? Can we maintain a tree that is constant-competitive? We answer this question in the affirmative. We give a primal-dual algorithm, and a novel dual-based analysis, that makes only a single swap per step (in addition to adding the edge connecting the new point to the previous ones), and such that the tree's cost is only a constant times the optimal cost. Previous results for this problem gave an algorithm that performed an amortized constant number of swaps: for each n, the number of swaps in the first n steps was O(n). We also give a simpler tight analysis for this amortized case.

Submitted 21 October, 2013; v1 submitted 14 July, 2013; originally announced July 2013.

Comments: An extended abstract appears in the 45th ACM Symposium on the Theory of Computing (STOC), 2013

arXiv:1005.2557 [pdf, ps, other]

An Optimal Differentiable Sphere Theorem for Complete Manifolds

Authors: Hong-Wei Xu, Juan-Ru Gu

Abstract: A new differentiable sphere theorem is obtained from the view of submanifold geometry. An important scalar is defined by the scalar curvature and the mean curvature of an oriented complete submanifold $M^n$ in a space form $F^{n+p}(c)$ with $c\ge0$. Making use of the Hamilton-Brendle-Schoen convergence result for Ricci flow and the Lawson-Simons-Xin formula for the nonexistence of stable currents, we prove that if the infimum of this scalar is positive, then $M$ is diffeomorphic to $S^n$. We then introduce an intrinsic invariant $I(M)$ for oriented complete Riemannian $n$-manifold $M$ via the scalar, and prove that if $I(M)>0$, then $M$ is diffeomorphic to $S^n$. It should be emphasized that our differentiable sphere theorem is optimal for arbitrary $n(\ge2)$.

Submitted 14 May, 2010; originally announced May 2010.

Comments: 13 pages

Showing 51–93 of 93 results for author: Gu, A