-
Effective Causal Discovery under Identifiable Heteroscedastic Noise Model
Authors:
Naiyu Yin,
Tian Gao,
Yue Yu,
Qiang Ji
Abstract:
Capturing the underlying structural causal relations represented by Directed Acyclic Graphs (DAGs) has been a fundamental task in various AI disciplines. Causal DAG learning via the continuous optimization framework has recently achieved promising performance in terms of both accuracy and efficiency. However, most methods make strong assumptions of homoscedastic noise, i.e., exogenous noises have…
▽ More
Capturing the underlying structural causal relations represented by Directed Acyclic Graphs (DAGs) has been a fundamental task in various AI disciplines. Causal DAG learning via the continuous optimization framework has recently achieved promising performance in terms of both accuracy and efficiency. However, most methods make strong assumptions of homoscedastic noise, i.e., exogenous noises have equal variances across variables, observations, or even both. The noises in real data usually violate both assumptions due to the biases introduced by different data collection processes. To address the issue of heteroscedastic noise, we introduce relaxed and implementable sufficient conditions, proving the identifiability of a general class of SEM subject to these conditions. Based on the identifiable general SEM, we propose a novel formulation for DAG learning that accounts for the variation in noise variance across variables and observations. We then propose an effective two-phase iterative DAG learning algorithm to address the increasing optimization difficulties and to learn a causal DAG from data with heteroscedastic variable noise under varying variance. We show significant empirical gains of the proposed approaches over state-of-the-art methods on both synthetic data and real data.
△ Less
Submitted 9 June, 2024; v1 submitted 20 December, 2023;
originally announced December 2023.
-
Wide Neural Networks as Gaussian Processes: Lessons from Deep Equilibrium Models
Authors:
Tianxiang Gao,
Xiaokai Huo,
Hailiang Liu,
Hongyang Gao
Abstract:
Neural networks with wide layers have attracted significant attention due to their equivalence to Gaussian processes, enabling perfect fitting of training data while maintaining generalization performance, known as benign overfitting. However, existing results mainly focus on shallow or finite-depth networks, necessitating a comprehensive analysis of wide neural networks with infinite-depth layers…
▽ More
Neural networks with wide layers have attracted significant attention due to their equivalence to Gaussian processes, enabling perfect fitting of training data while maintaining generalization performance, known as benign overfitting. However, existing results mainly focus on shallow or finite-depth networks, necessitating a comprehensive analysis of wide neural networks with infinite-depth layers, such as neural ordinary differential equations (ODEs) and deep equilibrium models (DEQs). In this paper, we specifically investigate the deep equilibrium model (DEQ), an infinite-depth neural network with shared weight matrices across layers. Our analysis reveals that as the width of DEQ layers approaches infinity, it converges to a Gaussian process, establishing what is known as the Neural Network and Gaussian Process (NNGP) correspondence. Remarkably, this convergence holds even when the limits of depth and width are interchanged, which is not observed in typical infinite-depth Multilayer Perceptron (MLP) networks. Furthermore, we demonstrate that the associated Gaussian vector remains non-degenerate for any pairwise distinct input data, ensuring a strictly positive smallest eigenvalue of the corresponding kernel matrix using the NNGP kernel. These findings serve as fundamental elements for studying the training and generalization of DEQs, laying the groundwork for future research in this area.
△ Less
Submitted 16 October, 2023;
originally announced October 2023.
-
Early warning indicators via latent stochastic dynamical systems
Authors:
Lingyu Feng,
Ting Gao,
Wang Xiao,
**qiao Duan
Abstract:
Detecting early warning indicators for abrupt dynamical transitions in complex systems or high-dimensional observation data is essential in many real-world applications, such as brain diseases, natural disasters, and engineering reliability. To this end, we develop a novel approach: the directed anisotropic diffusion map that captures the latent evolutionary dynamics in the low-dimensional manifol…
▽ More
Detecting early warning indicators for abrupt dynamical transitions in complex systems or high-dimensional observation data is essential in many real-world applications, such as brain diseases, natural disasters, and engineering reliability. To this end, we develop a novel approach: the directed anisotropic diffusion map that captures the latent evolutionary dynamics in the low-dimensional manifold. Then three effective warning signals (Onsager-Machlup Indicator, Sample Entropy Indicator, and Transition Probability Indicator) are derived through the latent coordinates and the latent stochastic dynamical systems. To validate our framework, we apply this methodology to authentic electroencephalogram (EEG) data. We find that our early warning indicators are capable of detecting the tip** point during state transition. This framework not only bridges the latent dynamics with real-world data but also shows the potential ability for automatic labeling on complex high-dimensional time series.
△ Less
Submitted 5 April, 2024; v1 submitted 7 September, 2023;
originally announced September 2023.
-
Large-scale Multi-layer Academic Networks Derived from Statistical Publications
Authors:
Tianchen Gao,
Yan Zhang,
Rui Pan,
Hansheng Wang
Abstract:
The utilization of multi-layer network structures now enables the explanation of complex systems in nature from multiple perspectives. Multi-layer academic networks capture diverse relationships among academic entities, facilitating the study of academic development and the prediction of future directions. However, there are currently few academic network datasets that simultaneously consider mult…
▽ More
The utilization of multi-layer network structures now enables the explanation of complex systems in nature from multiple perspectives. Multi-layer academic networks capture diverse relationships among academic entities, facilitating the study of academic development and the prediction of future directions. However, there are currently few academic network datasets that simultaneously consider multi-layer academic networks; often, they only include a single layer. In this study, we provide a large-scale multi-layer academic network dataset, namely, LMANStat, which includes collaboration, co-institution, citation, co-citation, journal citation, author citation, author-paper and keyword co-occurrence networks. Furthermore, each layer of the multi-layer academic network is dynamic. Additionally, we expand the attributes of nodes, such as authors' research interests, productivity, region and institution. Supported by this dataset, it is possible to study the development and evolution of statistical disciplines from multiple perspectives. This dataset also provides fertile ground for studying complex systems with multi-layer structures.
△ Less
Submitted 22 August, 2023;
originally announced August 2023.
-
Reservoir Computing with Error Correction: Long-term Behaviors of Stochastic Dynamical Systems
Authors:
Cheng Fang,
Yubin Lu,
Ting Gao,
**qiao Duan
Abstract:
The prediction of stochastic dynamical systems and the capture of dynamical behaviors are profound problems. In this article, we propose a data-driven framework combining Reservoir Computing and Normalizing Flow to study this issue, which mimics error modeling to improve traditional Reservoir Computing performance and integrates the virtues of both approaches. With few assumptions about the underl…
▽ More
The prediction of stochastic dynamical systems and the capture of dynamical behaviors are profound problems. In this article, we propose a data-driven framework combining Reservoir Computing and Normalizing Flow to study this issue, which mimics error modeling to improve traditional Reservoir Computing performance and integrates the virtues of both approaches. With few assumptions about the underlying stochastic dynamical systems, this model-free method successfully predicts the long-term evolution of stochastic dynamical systems and replicates dynamical behaviors. We verify the effectiveness of the proposed framework in several experiments, including the stochastic Van der Pal oscillator, El Niño-Southern Oscillation simplified model, and stochastic Lorenz system. These experiments consist of Markov/non-Markov and stationary/non-stationary stochastic processes which are defined by linear/nonlinear stochastic differential equations or stochastic delay differential equations. Additionally, we explore the noise-induced tip** phenomenon, relaxation oscillation, stochastic mixed-mode oscillation, and replication of the strange attractor.
△ Less
Submitted 30 July, 2023; v1 submitted 1 May, 2023;
originally announced May 2023.
-
On the optimization and generalization of overparameterized implicit neural networks
Authors:
Tianxiang Gao,
Hongyang Gao
Abstract:
Implicit neural networks have become increasingly attractive in the machine learning community since they can achieve competitive performance but use much less computational resources. Recently, a line of theoretical works established the global convergences for first-order methods such as gradient descent if the implicit networks are over-parameterized. However, as they train all layers together,…
▽ More
Implicit neural networks have become increasingly attractive in the machine learning community since they can achieve competitive performance but use much less computational resources. Recently, a line of theoretical works established the global convergences for first-order methods such as gradient descent if the implicit networks are over-parameterized. However, as they train all layers together, their analyses are equivalent to only studying the evolution of the output layer. It is unclear how the implicit layer contributes to the training. Thus, in this paper, we restrict ourselves to only training the implicit layer. We show that global convergence is guaranteed, even if only the implicit layer is trained. On the other hand, the theoretical understanding of when and how the training performance of an implicit neural network can be generalized to unseen data is still under-explored. Although this problem has been studied in standard feed-forward networks, the case of implicit neural networks is still intriguing since implicit networks theoretically have infinitely many layers. Therefore, this paper investigates the generalization error for implicit neural networks. Specifically, we study the generalization of an implicit network activated by the ReLU function over random initialization. We provide a generalization bound that is initialization sensitive. As a result, we show that gradient flow with proper random initialization can train a sufficient over-parameterized implicit network to achieve arbitrarily small generalization errors.
△ Less
Submitted 30 September, 2022;
originally announced September 2022.
-
Gradient Descent Optimizes Infinite-Depth ReLU Implicit Networks with Linear Widths
Authors:
Tianxiang Gao,
Hongyang Gao
Abstract:
Implicit deep learning has recently become popular in the machine learning community since these implicit models can achieve competitive performance with state-of-the-art deep networks while using significantly less memory and computational resources. However, our theoretical understanding of when and how first-order methods such as gradient descent (GD) converge on \textit{nonlinear} implicit net…
▽ More
Implicit deep learning has recently become popular in the machine learning community since these implicit models can achieve competitive performance with state-of-the-art deep networks while using significantly less memory and computational resources. However, our theoretical understanding of when and how first-order methods such as gradient descent (GD) converge on \textit{nonlinear} implicit networks is limited. Although this type of problem has been studied in standard feed-forward networks, the case of implicit models is still intriguing because implicit networks have \textit{infinitely} many layers. The corresponding equilibrium equation probably admits no or multiple solutions during training. This paper studies the convergence of both gradient flow (GF) and gradient descent for nonlinear ReLU activated implicit networks. To deal with the well-posedness problem, we introduce a fixed scalar to scale the weight matrix of the implicit layer and show that there exists a small enough scaling constant, kee** the equilibrium equation well-posed throughout training. As a result, we prove that both GF and GD converge to a global minimum at a linear rate if the width $m$ of the implicit network is \textit{linear} in the sample size $N$, i.e., $m=Ω(N)$.
△ Less
Submitted 16 May, 2022;
originally announced May 2022.
-
Learning effective dynamics from data-driven stochastic systems
Authors:
Lingyu Feng,
Ting Gao,
Min Dai,
**qiao Duan
Abstract:
Multiscale stochastic dynamical systems have been widely adopted to a variety of scientific and engineering problems due to their capability of depicting complex phenomena in many real world applications. This work is devoted to investigating the effective dynamics for slow-fast stochastic dynamical systems. Given observation data on a short-term period satisfying some unknown slow-fast stochastic…
▽ More
Multiscale stochastic dynamical systems have been widely adopted to a variety of scientific and engineering problems due to their capability of depicting complex phenomena in many real world applications. This work is devoted to investigating the effective dynamics for slow-fast stochastic dynamical systems. Given observation data on a short-term period satisfying some unknown slow-fast stochastic systems, we propose a novel algorithm including a neural network called Auto-SDE to learn invariant slow manifold. Our approach captures the evolutionary nature of a series of time-dependent autoencoder neural networks with the loss constructed from a discretized stochastic differential equation. Our algorithm is also validated to be accurate, stable and effective through numerical experiments under various evaluation metrics.
△ Less
Submitted 29 December, 2023; v1 submitted 9 May, 2022;
originally announced May 2022.
-
An end-to-end deep learning approach for extracting stochastic dynamical systems with $α$-stable Lévy noise
Authors:
Cheng Fang,
Yubin Lu,
Ting Gao,
**qiao Duan
Abstract:
Recently, extracting data-driven governing laws of dynamical systems through deep learning frameworks has gained a lot of attention in various fields. Moreover, a growing amount of research work tends to transfer deterministic dynamical systems to stochastic dynamical systems, especially those driven by non-Gaussian multiplicative noise. However, lots of log-likelihood based algorithms that work w…
▽ More
Recently, extracting data-driven governing laws of dynamical systems through deep learning frameworks has gained a lot of attention in various fields. Moreover, a growing amount of research work tends to transfer deterministic dynamical systems to stochastic dynamical systems, especially those driven by non-Gaussian multiplicative noise. However, lots of log-likelihood based algorithms that work well for Gaussian cases cannot be directly extended to non-Gaussian scenarios which could have high error and low convergence issues. In this work, we overcome some of these challenges and identify stochastic dynamical systems driven by $α$-stable Lévy noise from only random pairwise data. Our innovations include: (1) designing a deep learning approach to learn both drift and diffusion coefficients for Lévy induced noise with $α$ across all values, (2) learning complex multiplicative noise without restrictions on small noise intensity, (3) proposing an end-to-end complete framework for stochastic systems identification under a general input data assumption, that is, $α$-stable random variable. Finally, numerical experiments and comparisons with the non-local Kramers-Moyal formulas with moment generating function confirm the effectiveness of our method.
△ Less
Submitted 2 July, 2022; v1 submitted 31 January, 2022;
originally announced January 2022.
-
Nonlocal Kernel Network (NKN): a Stable and Resolution-Independent Deep Neural Network
Authors:
Huaiqian You,
Yue Yu,
Marta D'Elia,
Tian Gao,
Stewart Silling
Abstract:
Neural operators have recently become popular tools for designing solution maps between function spaces in the form of neural networks. Differently from classical scientific machine learning approaches that learn parameters of a known partial differential equation (PDE) for a single instance of the input parameters at a fixed resolution, neural operators approximate the solution map of a family of…
▽ More
Neural operators have recently become popular tools for designing solution maps between function spaces in the form of neural networks. Differently from classical scientific machine learning approaches that learn parameters of a known partial differential equation (PDE) for a single instance of the input parameters at a fixed resolution, neural operators approximate the solution map of a family of PDEs. Despite their success, the uses of neural operators are so far restricted to relatively shallow neural networks and confined to learning hidden governing laws. In this work, we propose a novel nonlocal neural operator, which we refer to as nonlocal kernel network (NKN), that is resolution independent, characterized by deep neural networks, and capable of handling a variety of tasks such as learning governing equations and classifying images. Our NKN stems from the interpretation of the neural network as a discrete nonlocal diffusion reaction equation that, in the limit of infinite layers, is equivalent to a parabolic nonlocal equation, whose stability is analyzed via nonlocal vector calculus. The resemblance with integral forms of neural operators allows NKNs to capture long-range dependencies in the feature space, while the continuous treatment of node-to-node interactions makes NKNs resolution independent. The resemblance with neural ODEs, reinterpreted in a nonlocal sense, and the stable network dynamics between layers allow for generalization of NKN's optimal parameters from shallow to deep networks. This fact enables the use of shallow-to-deep initialization techniques. Our tests show that NKNs outperform baseline methods in both learning governing equations and image classification tasks and generalize well to different resolutions and depths.
△ Less
Submitted 6 January, 2022;
originally announced January 2022.
-
Neural network stochastic differential equation models with applications to financial data forecasting
Authors:
Luxuan Yang,
Ting Gao,
Yubin Lu,
**qiao Duan,
Tao Liu
Abstract:
In this article, we employ a collection of stochastic differential equations with drift and diffusion coefficients approximated by neural networks to predict the trend of chaotic time series which has big jump properties. Our contributions are, first, we propose a model called Lévy induced stochastic differential equation network, which explores compounded stochastic differential equations with…
▽ More
In this article, we employ a collection of stochastic differential equations with drift and diffusion coefficients approximated by neural networks to predict the trend of chaotic time series which has big jump properties. Our contributions are, first, we propose a model called Lévy induced stochastic differential equation network, which explores compounded stochastic differential equations with $α$-stable Lévy motion to model complex time series data and solve the problem through neural network approximation. Second, we theoretically prove that the numerical solution through our algorithm converges in probability to the solution of corresponding stochastic differential equation, without curse of dimensionality. Finally, we illustrate our method by applying it to real financial time series data and find the accuracy increases through the use of non-Gaussian Lévy processes. We also present detailed comparisons in terms of data patterns, various models, different shapes of Lévy motion and the prediction lengths.
△ Less
Submitted 3 November, 2022; v1 submitted 25 November, 2021;
originally announced November 2021.
-
A global convergence theory for deep ReLU implicit networks via over-parameterization
Authors:
Tianxiang Gao,
Hailiang Liu,
Jia Liu,
Hridesh Rajan,
Hongyang Gao
Abstract:
Implicit deep learning has received increasing attention recently due to the fact that it generalizes the recursive prediction rules of many commonly used neural network architectures. Its prediction rule is provided implicitly based on the solution of an equilibrium equation. Although a line of recent empirical studies has demonstrated its superior performances, the theoretical understanding of i…
▽ More
Implicit deep learning has received increasing attention recently due to the fact that it generalizes the recursive prediction rules of many commonly used neural network architectures. Its prediction rule is provided implicitly based on the solution of an equilibrium equation. Although a line of recent empirical studies has demonstrated its superior performances, the theoretical understanding of implicit neural networks is limited. In general, the equilibrium equation may not be well-posed during the training. As a result, there is no guarantee that a vanilla (stochastic) gradient descent (SGD) training nonlinear implicit neural networks can converge. This paper fills the gap by analyzing the gradient flow of Rectified Linear Unit (ReLU) activated implicit neural networks. For an $m$-width implicit neural network with ReLU activation and $n$ training samples, we show that a randomly initialized gradient descent converges to a global minimum at a linear rate for the square loss function if the implicit neural network is \textit{over-parameterized}. It is worth noting that, unlike existing works on the convergence of (S)GD on finite-layer over-parameterized neural networks, our convergence results hold for implicit neural networks, where the number of layers is \textit{infinite}.
△ Less
Submitted 18 February, 2022; v1 submitted 11 October, 2021;
originally announced October 2021.
-
Learning the temporal evolution of multivariate densities via normalizing flows
Authors:
Yubin Lu,
Romit Maulik,
Ting Gao,
Felix Dietrich,
Ioannis G. Kevrekidis,
**qiao Duan
Abstract:
In this work, we propose a method to learn multivariate probability distributions using sample path data from stochastic differential equations. Specifically, we consider temporally evolving probability distributions (e.g., those produced by integrating local or nonlocal Fokker-Planck equations). We analyze this evolution through machine learning assisted construction of a time-dependent map** t…
▽ More
In this work, we propose a method to learn multivariate probability distributions using sample path data from stochastic differential equations. Specifically, we consider temporally evolving probability distributions (e.g., those produced by integrating local or nonlocal Fokker-Planck equations). We analyze this evolution through machine learning assisted construction of a time-dependent map** that takes a reference distribution (say, a Gaussian) to each and every instance of our evolving distribution. If the reference distribution is the initial condition of a Fokker-Planck equation, what we learn is the time-T map of the corresponding solution. Specifically, the learned map is a multivariate normalizing flow that deforms the support of the reference density to the support of each and every density snapshot in time. We demonstrate that this approach can approximate probability density function evolutions in time from observed sampled data for systems driven by both Brownian and Lévy noise. We present examples with two- and three-dimensional, uni- and multimodal distributions to validate the method.
△ Less
Submitted 3 May, 2022; v1 submitted 29 July, 2021;
originally announced July 2021.
-
DAGs with No Curl: An Efficient DAG Structure Learning Approach
Authors:
Yue Yu,
Tian Gao,
Naiyu Yin,
Qiang Ji
Abstract:
Recently directed acyclic graph (DAG) structure learning is formulated as a constrained continuous optimization problem with continuous acyclicity constraints and was solved iteratively through subproblem optimization. To further improve efficiency, we propose a novel learning framework to model and learn the weighted adjacency matrices in the DAG space directly. Specifically, we first show that t…
▽ More
Recently directed acyclic graph (DAG) structure learning is formulated as a constrained continuous optimization problem with continuous acyclicity constraints and was solved iteratively through subproblem optimization. To further improve efficiency, we propose a novel learning framework to model and learn the weighted adjacency matrices in the DAG space directly. Specifically, we first show that the set of weighted adjacency matrices of DAGs are equivalent to the set of weighted gradients of graph potential functions, and one may perform structure learning by searching in this equivalent set of DAGs. To instantiate this idea, we propose a new algorithm, DAG-NoCurl, which solves the optimization problem efficiently with a two-step procedure: 1) first we find an initial cyclic solution to the optimization problem, and 2) then we employ the Hodge decomposition of graphs and learn an acyclic graph by projecting the cyclic graph to the gradient of a potential function. Experimental studies on benchmark datasets demonstrate that our method provides comparable accuracy but better efficiency than baseline DAG structure learning methods on both linear and generalized structural equation models, often by more than one order of magnitude.
△ Less
Submitted 14 June, 2021;
originally announced June 2021.
-
DAGs with No Fears: A Closer Look at Continuous Optimization for Learning Bayesian Networks
Authors:
Dennis Wei,
Tian Gao,
Yue Yu
Abstract:
This paper re-examines a continuous optimization framework dubbed NOTEARS for learning Bayesian networks. We first generalize existing algebraic characterizations of acyclicity to a class of matrix polynomials. Next, focusing on a one-parameter-per-edge setting, it is shown that the Karush-Kuhn-Tucker (KKT) optimality conditions for the NOTEARS formulation cannot be satisfied except in a trivial c…
▽ More
This paper re-examines a continuous optimization framework dubbed NOTEARS for learning Bayesian networks. We first generalize existing algebraic characterizations of acyclicity to a class of matrix polynomials. Next, focusing on a one-parameter-per-edge setting, it is shown that the Karush-Kuhn-Tucker (KKT) optimality conditions for the NOTEARS formulation cannot be satisfied except in a trivial case, which explains a behavior of the associated algorithm. We then derive the KKT conditions for an equivalent reformulation, show that they are indeed necessary, and relate them to explicit constraints that certain edges be absent from the graph. If the score function is convex, these KKT conditions are also sufficient for local minimality despite the non-convexity of the constraint. Informed by the KKT conditions, a local search post-processing algorithm is proposed and shown to substantially and universally improve the structural Hamming distance of all tested algorithms, typically by a factor of 2 or more. Some combinations with local search are both more accurate and more efficient than the original NOTEARS.
△ Less
Submitted 18 October, 2020;
originally announced October 2020.
-
Type-augmented Relation Prediction in Knowledge Graphs
Authors:
Zijun Cui,
Pavan Kapanipathi,
Kartik Talamadupula,
Tian Gao,
Qiang Ji
Abstract:
Knowledge graphs (KGs) are of great importance to many real world applications, but they generally suffer from incomplete information in the form of missing relations between entities. Knowledge graph completion (also known as relation prediction) is the task of inferring missing facts given existing ones. Most of the existing work is proposed by maximizing the likelihood of observed instance-leve…
▽ More
Knowledge graphs (KGs) are of great importance to many real world applications, but they generally suffer from incomplete information in the form of missing relations between entities. Knowledge graph completion (also known as relation prediction) is the task of inferring missing facts given existing ones. Most of the existing work is proposed by maximizing the likelihood of observed instance-level triples. Not much attention, however, is paid to the ontological information, such as type information of entities and relations. In this work, we propose a type-augmented relation prediction (TaRP) method, where we apply both the type information and instance-level information for relation prediction. In particular, type information and instance-level information are encoded as prior probabilities and likelihoods of relations respectively, and are combined by following Bayes' rule. Our proposed TaRP method achieves significantly better performance than state-of-the-art methods on four benchmark datasets: FB15K, FB15K-237, YAGO26K-906, and DB111K-174. In addition, we show that TaRP achieves significantly improved data efficiency. More importantly, the type information extracted from a specific dataset can generalize well to other datasets through the proposed TaRP model.
△ Less
Submitted 26 February, 2021; v1 submitted 16 September, 2020;
originally announced September 2020.
-
Few-shot Relation Extraction via Bayesian Meta-learning on Relation Graphs
Authors:
Meng Qu,
Tianyu Gao,
Louis-Pascal A. C. Xhonneux,
Jian Tang
Abstract:
This paper studies few-shot relation extraction, which aims at predicting the relation for a pair of entities in a sentence by training with a few labeled examples in each relation. To more effectively generalize to new relations, in this paper we study the relationships between different relations and propose to leverage a global relation graph. We propose a novel Bayesian meta-learning approach…
▽ More
This paper studies few-shot relation extraction, which aims at predicting the relation for a pair of entities in a sentence by training with a few labeled examples in each relation. To more effectively generalize to new relations, in this paper we study the relationships between different relations and propose to leverage a global relation graph. We propose a novel Bayesian meta-learning approach to effectively learn the posterior distribution of the prototype vectors of relations, where the initial prior of the prototype vectors is parameterized with a graph neural network on the global relation graph. Moreover, to effectively optimize the posterior distribution of the prototype vectors, we propose to use the stochastic gradient Langevin dynamics, which is related to the MAML algorithm but is able to handle the uncertainty of the prototype vectors. The whole framework can be effectively and efficiently optimized in an end-to-end fashion. Experiments on two benchmark datasets prove the effectiveness of our proposed approach against competitive baselines in both the few-shot and zero-shot settings.
△ Less
Submitted 5 July, 2020;
originally announced July 2020.
-
A Multi-Channel Neural Graphical Event Model with Negative Evidence
Authors:
Tian Gao,
Dharmashankar Subramanian,
Karthikeyan Shanmugam,
Debarun Bhattacharjya,
Nicholas Mattei
Abstract:
Event datasets are sequences of events of various types occurring irregularly over the time-line, and they are increasingly prevalent in numerous domains. Existing work for modeling events using conditional intensities rely on either using some underlying parametric form to capture historical dependencies, or on non-parametric models that focus primarily on tasks such as prediction. We propose a n…
▽ More
Event datasets are sequences of events of various types occurring irregularly over the time-line, and they are increasingly prevalent in numerous domains. Existing work for modeling events using conditional intensities rely on either using some underlying parametric form to capture historical dependencies, or on non-parametric models that focus primarily on tasks such as prediction. We propose a non-parametric deep neural network approach in order to estimate the underlying intensity functions. We use a novel multi-channel RNN that optimally reinforces the negative evidence of no observable events with the introduction of fake event epochs within each consecutive inter-event interval. We evaluate our method against state-of-the-art baselines on model fitting tasks as gauged by log-likelihood. Through experiments on both synthetic and real-world datasets, we find that our proposed approach outperforms existing baselines on most of the datasets studied.
△ Less
Submitted 21 February, 2020;
originally announced February 2020.
-
Graph-augmented Convolutional Networks on Drug-Drug Interactions Prediction
Authors:
Yi Zhong,
Xueyu Chen,
Yu Zhao,
Xiaoming Chen,
Tingfang Gao,
Zuquan Weng
Abstract:
We propose an end-to-end model to predict drug-drug interactions (DDIs) by employing graph-augmented convolutional networks. And this is implemented by combining graph CNN with an attentive pooling network to extract structural relations between drug pairs and make DDI predictions. The experiment results suggest a desirable performance achieving ROC at 0.988, F1-score at 0.956, and AUPR at 0.986.…
▽ More
We propose an end-to-end model to predict drug-drug interactions (DDIs) by employing graph-augmented convolutional networks. And this is implemented by combining graph CNN with an attentive pooling network to extract structural relations between drug pairs and make DDI predictions. The experiment results suggest a desirable performance achieving ROC at 0.988, F1-score at 0.956, and AUPR at 0.986. Besides, the model can tell how the two DDI drugs interact structurally by varying colored atoms. And this may be helpful for drug design during drug discovery.
△ Less
Submitted 8 December, 2019;
originally announced December 2019.
-
Characterization of Overlap in Observational Studies
Authors:
Michael Oberst,
Fredrik D. Johansson,
Dennis Wei,
Tian Gao,
Gabriel Brat,
David Sontag,
Kush R. Varshney
Abstract:
Overlap between treatment groups is required for non-parametric estimation of causal effects. If a subgroup of subjects always receives the same intervention, we cannot estimate the effect of intervention changes on that subgroup without further assumptions. When overlap does not hold globally, characterizing local regions of overlap can inform the relevance of causal conclusions for new subjects,…
▽ More
Overlap between treatment groups is required for non-parametric estimation of causal effects. If a subgroup of subjects always receives the same intervention, we cannot estimate the effect of intervention changes on that subgroup without further assumptions. When overlap does not hold globally, characterizing local regions of overlap can inform the relevance of causal conclusions for new subjects, and can help guide additional data collection. To have impact, these descriptions must be interpretable for downstream users who are not machine learning experts, such as policy makers. We formalize overlap estimation as a problem of finding minimum volume sets subject to coverage constraints and reduce this problem to binary classification with Boolean rule classifiers. We then generalize this method to estimate overlap in off-policy policy evaluation. In several real-world applications, we demonstrate that these rules have comparable accuracy to black-box estimators and provide intuitive and informative explanations that can inform policy making.
△ Less
Submitted 3 June, 2020; v1 submitted 9 July, 2019;
originally announced July 2019.
-
Unsupervised Co-Learning on $\mathcal{G}$-Manifolds Across Irreducible Representations
Authors:
Yifeng Fan,
Tingran Gao,
Zhizhen Zhao
Abstract:
We introduce a novel co-learning paradigm for manifolds naturally equipped with a group action, motivated by recent developments on learning a manifold from attached fibre bundle structures. We utilize a representation theoretic mechanism that canonically associates multiple independent vector bundles over a common base manifold, which provides multiple views for the geometry of the underlying man…
▽ More
We introduce a novel co-learning paradigm for manifolds naturally equipped with a group action, motivated by recent developments on learning a manifold from attached fibre bundle structures. We utilize a representation theoretic mechanism that canonically associates multiple independent vector bundles over a common base manifold, which provides multiple views for the geometry of the underlying manifold. The consistency across these fibre bundles provide a common base for performing unsupervised manifold co-learning through the redundancy created artificially across irreducible representations of the transformation group. We demonstrate the efficacy of the proposed algorithmic paradigm through drastically improved robust nearest neighbor search and community detection on rotation-invariant cryo-electron microscopy image analysis.
△ Less
Submitted 8 December, 2019; v1 submitted 6 June, 2019;
originally announced June 2019.
-
Generalized Linear Rule Models
Authors:
Dennis Wei,
Sanjeeb Dash,
Tian Gao,
Oktay Günlük
Abstract:
This paper considers generalized linear models using rule-based features, also referred to as rule ensembles, for regression and probabilistic classification. Rules facilitate model interpretation while also capturing nonlinear dependences and interactions. Our problem formulation accordingly trades off rule set complexity and prediction accuracy. Column generation is used to optimize over an expo…
▽ More
This paper considers generalized linear models using rule-based features, also referred to as rule ensembles, for regression and probabilistic classification. Rules facilitate model interpretation while also capturing nonlinear dependences and interactions. Our problem formulation accordingly trades off rule set complexity and prediction accuracy. Column generation is used to optimize over an exponentially large space of rules without pre-generating a large subset of candidates or greedily boosting rules one by one. The column generation subproblem is solved using either integer programming or a heuristic optimizing the same objective. In experiments involving logistic and linear regression, the proposed methods obtain better accuracy-complexity trade-offs than existing rule ensemble algorithms. At one end of the trade-off, the methods are competitive with less interpretable benchmark models.
△ Less
Submitted 4 June, 2019;
originally announced June 2019.
-
SelectNet: Learning to Sample from the Wild for Imbalanced Data Training
Authors:
Yunru Liu,
Tingran Gao,
Haizhao Yang
Abstract:
Supervised learning from training data with imbalanced class sizes, a commonly encountered scenario in real applications such as anomaly/fraud detection, has long been considered a significant challenge in machine learning. Motivated by recent progress in curriculum and self-paced learning, we propose to adopt a semi-supervised learning paradigm by training a deep neural network, referred to as Se…
▽ More
Supervised learning from training data with imbalanced class sizes, a commonly encountered scenario in real applications such as anomaly/fraud detection, has long been considered a significant challenge in machine learning. Motivated by recent progress in curriculum and self-paced learning, we propose to adopt a semi-supervised learning paradigm by training a deep neural network, referred to as SelectNet, to selectively add unlabelled data together with their predicted labels to the training dataset. Unlike existing techniques designed to tackle the difficulty in dealing with class imbalanced training data such as resampling, cost-sensitive learning, and margin-based learning, SelectNet provides an end-to-end approach for learning from important unlabelled data "in the wild" that most likely belong to the under-sampled classes in the training data, thus gradually mitigates the imbalance in the data used for training the classifier. We demonstrate the efficacy of SelectNet through extensive numerical experiments on standard datasets in computer vision.
△ Less
Submitted 23 May, 2019;
originally announced May 2019.
-
DAG-GNN: DAG Structure Learning with Graph Neural Networks
Authors:
Yue Yu,
Jie Chen,
Tian Gao,
Mo Yu
Abstract:
Learning a faithful directed acyclic graph (DAG) from samples of a joint distribution is a challenging combinatorial problem, owing to the intractable search space superexponential in the number of graph nodes. A recent breakthrough formulates the problem as a continuous optimization with a structural constraint that ensures acyclicity (Zheng et al., 2018). The authors apply the approach to the li…
▽ More
Learning a faithful directed acyclic graph (DAG) from samples of a joint distribution is a challenging combinatorial problem, owing to the intractable search space superexponential in the number of graph nodes. A recent breakthrough formulates the problem as a continuous optimization with a structural constraint that ensures acyclicity (Zheng et al., 2018). The authors apply the approach to the linear structural equation model (SEM) and the least-squares loss function that are statistically well justified but nevertheless limited. Motivated by the widespread success of deep learning that is capable of capturing complex nonlinear map**s, in this work we propose a deep generative model and apply a variant of the structural constraint to learn the DAG. At the heart of the generative model is a variational autoencoder parameterized by a novel graph neural network architecture, which we coin DAG-GNN. In addition to the richer capacity, an advantage of the proposed model is that it naturally handles discrete variables as well as vector-valued ones. We demonstrate that on synthetic data sets, the proposed method learns more accurate graphs for nonlinearly generated samples; and on benchmark data sets with discrete variables, the learned graphs are reasonably close to the global optima. The code is available at \url{https://github.com/fishmoon1234/DAG-GNN}.
△ Less
Submitted 22 April, 2019;
originally announced April 2019.
-
A Sequential Set Generation Method for Predicting Set-Valued Outputs
Authors:
Tian Gao,
Jie Chen,
Vijil Chenthamarakshan,
Michael Witbrock
Abstract:
Consider a general machine learning setting where the output is a set of labels or sequences. This output set is unordered and its size varies with the input. Whereas multi-label classification methods seem a natural first resort, they are not readily applicable to set-valued outputs because of the growth rate of the output space; and because conventional sequence generation doesn't reflect sets'…
▽ More
Consider a general machine learning setting where the output is a set of labels or sequences. This output set is unordered and its size varies with the input. Whereas multi-label classification methods seem a natural first resort, they are not readily applicable to set-valued outputs because of the growth rate of the output space; and because conventional sequence generation doesn't reflect sets' order-free nature. In this paper, we propose a unified framework--sequential set generation (SSG)--that can handle output sets of labels and sequences. SSG is a meta-algorithm that leverages any probabilistic learning method for label or sequence prediction, but employs a proper regularization such that a new label or sequence is generated repeatedly until the full set is produced. Though SSG is sequential in nature, it does not penalize the ordering of the appearance of the set elements and can be applied to a variety of set output problems, such as a set of classification labels or sequences. We perform experiments with both benchmark and synthetic data sets and demonstrate SSG's strong performance over baseline methods.
△ Less
Submitted 12 March, 2019;
originally announced March 2019.
-
A Forest from the Trees: Generation through Neighborhoods
Authors:
Yang Li,
Tianxiang Gao,
Junier B. Oliva
Abstract:
In this work, we propose to learn a generative model using both learned features (through a latent space) and memories (through neighbors). Although human learning makes seamless use of both learned perceptual features and instance recall, current generative learning paradigms only make use of one of these two components. Take, for instance, flow models, which learn a latent space of invertible fe…
▽ More
In this work, we propose to learn a generative model using both learned features (through a latent space) and memories (through neighbors). Although human learning makes seamless use of both learned perceptual features and instance recall, current generative learning paradigms only make use of one of these two components. Take, for instance, flow models, which learn a latent space of invertible features that follow a simple distribution. Conversely, kernel density techniques use instances to shift a simple distribution into an aggregate mixture model. Here we propose multiple methods to enhance the latent space of a flow model with neighborhood information. Not only does our proposed framework represent a more human-like approach by leveraging both learned features and memories, but it may also be viewed as a step forward in non-parametric methods. The efficacy of our model is shown empirically with standard image datasets. We observe compelling results and a significant improvement over baselines.
△ Less
Submitted 19 November, 2019; v1 submitted 4 February, 2019;
originally announced February 2019.
-
Uniform-in-Time Weak Error Analysis for Stochastic Gradient Descent Algorithms via Diffusion Approximation
Authors:
Yuanyuan Feng,
Tingran Gao,
Lei Li,
Jian-Guo Liu,
Yulong Lu
Abstract:
Diffusion approximation provides weak approximation for stochastic gradient descent algorithms in a finite time horizon. In this paper, we introduce new tools motivated by the backward error analysis of numerical stochastic differential equations into the theoretical framework of diffusion approximation, extending the validity of the weak approximation from finite to infinite time horizon. The new…
▽ More
Diffusion approximation provides weak approximation for stochastic gradient descent algorithms in a finite time horizon. In this paper, we introduce new tools motivated by the backward error analysis of numerical stochastic differential equations into the theoretical framework of diffusion approximation, extending the validity of the weak approximation from finite to infinite time horizon. The new techniques developed in this paper enable us to characterize the asymptotic behavior of constant-step-size SGD algorithms for strongly convex objective functions, a goal previously unreachable within the diffusion approximation framework. Our analysis builds upon a truncated formal power expansion of the solution of a stochastic modified equation arising from diffusion approximation, where the main technical ingredient is a uniform-in-time weak error bound controlling the long-term behavior of the expansion coefficient functions near the global minimum. We expect these new techniques to greatly expand the range of applicability of diffusion approximation to cover wider and deeper aspects of stochastic optimization algorithms in data science.
△ Less
Submitted 4 September, 2019; v1 submitted 1 February, 2019;
originally announced February 2019.
-
Wasserstein Soft Label Propagation on Hypergraphs: Algorithm and Generalization Error Bounds
Authors:
Tingran Gao,
Shahab Asoodeh,
Yi Huang,
James Evans
Abstract:
Inspired by recent interests of develo** machine learning and data mining algorithms on hypergraphs, we investigate in this paper the semi-supervised learning algorithm of propagating "soft labels" (e.g. probability distributions, class membership scores) over hypergraphs, by means of optimal transportation. Borrowing insights from Wasserstein propagation on graphs [Solomon et al. 2014], we re-f…
▽ More
Inspired by recent interests of develo** machine learning and data mining algorithms on hypergraphs, we investigate in this paper the semi-supervised learning algorithm of propagating "soft labels" (e.g. probability distributions, class membership scores) over hypergraphs, by means of optimal transportation. Borrowing insights from Wasserstein propagation on graphs [Solomon et al. 2014], we re-formulate the label propagation procedure as a message-passing algorithm, which renders itself naturally to a generalization applicable to hypergraphs through Wasserstein barycenters. Furthermore, in a PAC learning framework, we provide generalization error bounds for propagating one-dimensional distributions on graphs and hypergraphs using 2-Wasserstein distance, by establishing the \textit{algorithmic stability} of the proposed semi-supervised learning algorithm. These theoretical results also shed new lights upon deeper understandings of the Wasserstein propagation on graphs.
△ Less
Submitted 16 November, 2018; v1 submitted 6 September, 2018;
originally announced September 2018.
-
Gaussian Process Landmarking for Three-Dimensional Geometric Morphometrics
Authors:
Tingran Gao,
Shahar Z. Kovalsky,
Doug M. Boyer,
Ingrid Daubechies
Abstract:
We demonstrate applications of the Gaussian process-based landmarking algorithm proposed in [T. Gao, S.Z. Kovalsky, and I. Daubechies, SIAM Journal on Mathematics of Data Science (2019)] to geometric morphometrics, a branch of evolutionary biology centered at the analysis and comparisons of anatomical shapes, and compares the automatically sampled landmarks with the "ground truth" landmarks manual…
▽ More
We demonstrate applications of the Gaussian process-based landmarking algorithm proposed in [T. Gao, S.Z. Kovalsky, and I. Daubechies, SIAM Journal on Mathematics of Data Science (2019)] to geometric morphometrics, a branch of evolutionary biology centered at the analysis and comparisons of anatomical shapes, and compares the automatically sampled landmarks with the "ground truth" landmarks manually placed by evolutionary anthropologists; the results suggest that Gaussian process landmarks perform equally well or better, in terms of both spatial coverage and downstream statistical analysis. We provide a detailed exposition of numerical procedures and feature filtering algorithms for computing high-quality and semantically meaningful diffeomorphisms between disk-type anatomical surfaces.
△ Less
Submitted 8 January, 2019; v1 submitted 31 July, 2018;
originally announced July 2018.
-
Curvature of Hypergraphs via Multi-Marginal Optimal Transport
Authors:
Shahab Asoodeh,
Tingran Gao,
James Evans
Abstract:
We introduce a novel definition of curvature for hypergraphs, a natural generalization of graphs, by introducing a multi-marginal optimal transport problem for a naturally defined random walk on the hypergraph. This curvature, termed \emph{coarse scalar curvature}, generalizes a recent definition of Ricci curvature for Markov chains on metric spaces by Ollivier [Journal of Functional Analysis 256…
▽ More
We introduce a novel definition of curvature for hypergraphs, a natural generalization of graphs, by introducing a multi-marginal optimal transport problem for a naturally defined random walk on the hypergraph. This curvature, termed \emph{coarse scalar curvature}, generalizes a recent definition of Ricci curvature for Markov chains on metric spaces by Ollivier [Journal of Functional Analysis 256 (2009) 810-864], and is related to the scalar curvature when the hypergraph arises naturally from a Riemannian manifold. We investigate basic properties of the coarse scalar curvature and obtain several bounds. Empirical experiments indicate that coarse scalar curvatures are capable of detecting "bridges" across connected components in hypergraphs, suggesting it is an appropriate generalization of curvature on simple graphs.
△ Less
Submitted 22 March, 2018;
originally announced March 2018.
-
DID: Distributed Incremental Block Coordinate Descent for Nonnegative Matrix Factorization
Authors:
Tianxiang Gao,
Chris Chu
Abstract:
Nonnegative matrix factorization (NMF) has attracted much attention in the last decade as a dimension reduction method in many applications. Due to the explosion in the size of data, naturally the samples are collected and stored distributively in local computational nodes. Thus, there is a growing need to develop algorithms in a distributed memory architecture. We propose a novel distributed algo…
▽ More
Nonnegative matrix factorization (NMF) has attracted much attention in the last decade as a dimension reduction method in many applications. Due to the explosion in the size of data, naturally the samples are collected and stored distributively in local computational nodes. Thus, there is a growing need to develop algorithms in a distributed memory architecture. We propose a novel distributed algorithm, called \textit{distributed incremental block coordinate descent} (DID), to solve the problem. By adapting the block coordinate descent framework, closed-form update rules are obtained in DID. Moreover, DID performs updates incrementally based on the most recently updated residual matrix. As a result, only one communication step per iteration is required. The correctness, efficiency, and scalability of the proposed algorithm are verified in a series of numerical experiments.
△ Less
Submitted 24 February, 2018;
originally announced February 2018.
-
Gaussian Process Landmarking on Manifolds
Authors:
Tingran Gao,
Shahar Z. Kovalsky,
Ingrid Daubechies
Abstract:
As a means of improving analysis of biological shapes, we propose an algorithm for sampling a Riemannian manifold by sequentially selecting points with maximum uncertainty under a Gaussian process model. This greedy strategy is known to be near-optimal in the experimental design literature, and appears to outperform the use of user-placed landmarks in representing the geometry of biological object…
▽ More
As a means of improving analysis of biological shapes, we propose an algorithm for sampling a Riemannian manifold by sequentially selecting points with maximum uncertainty under a Gaussian process model. This greedy strategy is known to be near-optimal in the experimental design literature, and appears to outperform the use of user-placed landmarks in representing the geometry of biological objects in our application. In the noiseless regime, we establish an upper bound for the mean squared prediction error (MSPE) in terms of the number of samples and geometric quantities of the manifold, demonstrating that the MSPE for our proposed sequential design decays at a rate comparable to the oracle rate achievable by any sequential or non-sequential optimal design; to our knowledge this is the first result of this type for sequential experimental design. The key is to link the greedy algorithm to reduced basis methods in the context of model reduction for partial differential equations. We expect this approach will find additional applications in other fields of research.
△ Less
Submitted 8 January, 2019; v1 submitted 9 February, 2018;
originally announced February 2018.
-
Degrees of Freedom in Deep Neural Networks
Authors:
Tianxiang Gao,
Vladimir Jojic
Abstract:
In this paper, we explore degrees of freedom in deep sigmoidal neural networks. We show that the degrees of freedom in these models is related to the expected optimism, which is the expected difference between test error and training error. We provide an efficient Monte-Carlo method to estimate the degrees of freedom for multi-class classification methods. We show degrees of freedom are lower than…
▽ More
In this paper, we explore degrees of freedom in deep sigmoidal neural networks. We show that the degrees of freedom in these models is related to the expected optimism, which is the expected difference between test error and training error. We provide an efficient Monte-Carlo method to estimate the degrees of freedom for multi-class classification methods. We show degrees of freedom are lower than the parameter count in a simple XOR network. We extend these results to neural nets trained on synthetic and real data, and investigate impact of network's architecture and different regularization choices. The degrees of freedom in deep networks are dramatically smaller than the number of parameters, in some real datasets several orders of magnitude. Further, we observe that for fixed number of parameters, deeper networks have less degrees of freedom exhibiting a regularization-by-depth.
△ Less
Submitted 3 June, 2016; v1 submitted 30 March, 2016;
originally announced March 2016.
-
A Derivative-Free Trust-Region Algorithm for Reliability-Based Optimization
Authors:
Tian Gao,
**glai Li
Abstract:
In this note, we present a derivative-free trust-region (TR) algorithm for reliability based optimization (RBO) problems. The proposed algorithm consists of solving a set of subproblems, in which simple surrogate models of the reliability constraints are constructed and used in solving the subproblems. Taking advantage of the special structure of the RBO problems, we employ a sample reweighting me…
▽ More
In this note, we present a derivative-free trust-region (TR) algorithm for reliability based optimization (RBO) problems. The proposed algorithm consists of solving a set of subproblems, in which simple surrogate models of the reliability constraints are constructed and used in solving the subproblems. Taking advantage of the special structure of the RBO problems, we employ a sample reweighting method to evaluate the failure probabilities, which constructs the surrogate for the reliability constraints by performing only a single full reliability evaluation in each iteration. With numerical experiments, we illustrate that the proposed algorithm is competitive against existing methods.
△ Less
Submitted 2 October, 2016; v1 submitted 6 November, 2015;
originally announced November 2015.
-
Hypoelliptic Diffusion Maps I: Tangent Bundles
Authors:
Tingran Gao
Abstract:
We introduce the concept of Hypoelliptic Diffusion Maps (HDM), a framework generalizing Diffusion Maps in the context of manifold learning and dimensionality reduction. Standard non-linear dimensionality reduction methods (e.g., LLE, ISOMAP, Laplacian Eigenmaps, Diffusion Maps) focus on mining massive data sets using weighted affinity graphs; Orientable Diffusion Maps and Vector Diffusion Maps enr…
▽ More
We introduce the concept of Hypoelliptic Diffusion Maps (HDM), a framework generalizing Diffusion Maps in the context of manifold learning and dimensionality reduction. Standard non-linear dimensionality reduction methods (e.g., LLE, ISOMAP, Laplacian Eigenmaps, Diffusion Maps) focus on mining massive data sets using weighted affinity graphs; Orientable Diffusion Maps and Vector Diffusion Maps enrich these graphs by attaching to each node also some local geometry. HDM likewise considers a scenario where each node possesses additional structure, which is now itself of interest to investigate. Virtually, HDM augments the original data set with attached structures, and provides tools for studying and organizing the augmented ensemble. The goal is to obtain information on individual structures attached to the nodes and on the relationship between structures attached to nearby nodes, so as to study the underlying manifold from which the nodes are sampled. In this paper, we analyze HDM on tangent bundles, revealing its intimate connection with sub-Riemannian geometry and a family of hypoelliptic differential operators. In a later paper, we shall consider more general fibre bundles.
△ Less
Submitted 16 March, 2015;
originally announced March 2015.