SmoothHess: ReLU Network Feature Interactions via Stein's Lemma

Authors: Max Torop, Aria Masoomi, Davin Hill, Kivanc Kose, Stratis Ioannidis, Jennifer Dy

Abstract: Several recent methods for interpretability model feature interactions by looking at the Hessian of a neural network. This poses a challenge for ReLU networks, which are piecewise-linear and thus have a zero Hessian almost everywhere. We propose SmoothHess, a method of estimating second-order interactions through Stein's Lemma. In particular, we estimate the Hessian of the network convolved with a Gaussian through an efficient sampling algorithm, requiring only network gradient calls. SmoothHess is applied post-hoc, requires no modifications to the ReLU network architecture, and the extent of smoothing can be controlled explicitly. We provide a non-asymptotic bound on the sample complexity of our estimation procedure. We validate the superior ability of SmoothHess to capture interactions on benchmark datasets and a real-world medical spirometry dataset.

Submitted 1 November, 2023; originally announced November 2023.

Comments: Accepted to NeurIPS 2023 as a conference paper

arXiv:2304.07670

Explanations of Black-Box Models based on Directional Feature Interactions

Authors: Aria Masoomi, Davin Hill, Zhonghui Xu, Craig P Hersh, Edwin K. Silverman, Peter J. Castaldi, Stratis Ioannidis, Jennifer Dy

Abstract: As machine learning algorithms are deployed ubiquitously to a variety of domains, it is imperative to make these often black-box models transparent. Several recent works explain black-box models by capturing the most influential features for prediction per instance; such explanation methods are univariate, as they characterize importance per feature. We extend univariate explanation to a higher order; this enhances explainability, as bivariate methods can capture feature interactions in black-box models, represented as a directed graph. Analyzing this graph enables us to discover groups of features that are equally important (i.e., interchangeable), while the notion of directionality allows us to identify the most influential features. We apply our bivariate method on Shapley value explanations, and experimentally demonstrate the ability of directional explanations to discover feature interactions. We show the superiority of our method against state-of-the-art on CIFAR10, IMDB, Census, Divorce, Drug, and gene data.

Submitted 15 April, 2023; originally announced April 2023.

Journal ref: International Conference on Learning Representations, 2022

arXiv:2302.04411

Geometry of Score Based Generative Models

Authors: Sandesh Ghimire, **yang Liu, Armand Comas, Davin Hill, Aria Masoomi, Octavia Camps, Jennifer Dy

Abstract: In this work, we look at Score-based generative models (also called diffusion generative models) from a geometric perspective. From a new viewpoint, we prove that both the forward and backward process of adding noise and generating from noise are Wasserstein gradient flow in the space of probability measures. We are the first to prove this connection. Our understanding of Score-based (and Diffusion) generative models has matured and become more complete by drawing ideas from different fields like Bayesian inference, control theory, stochastic differential equations and Schrodinger bridge. However, many open questions and challenges remain. One problem, for example, is how to decrease the sampling time. We demonstrate that looking from a geometric perspective enables us to answer many of these questions and provide new interpretations to some known results. Furthermore, the geometric perspective enables us to devise an intuitive geometric solution to the problem of faster sampling. By augmenting traditional score-based generative models with a projection step, we show that we can generate high quality images with significantly fewer sampling-steps.

Submitted 8 February, 2023; originally announced February 2023.

arXiv:2302.02272

Divide and Compose with Score Based Generative Models

Authors: Sandesh Ghimire, Armand Comas, Davin Hill, Aria Masoomi, Octavia Camps, Jennifer Dy

Abstract: While score based generative models, or diffusion models, have found success in image synthesis, they are often coupled with text data or image label to be able to manipulate and conditionally generate images. Even though manipulation of images by changing the text prompt is possible, our understanding of the text embedding and our ability to modify it to edit images is quite limited. Towards the direction of having more control over image manipulation and conditional generation, we propose to learn image components in an unsupervised manner so that we can compose those components to generate and manipulate images in informed manner. Taking inspiration from energy based models, we interpret different score components as the gradient of different energy functions. We show how score based learning allows us to learn interesting components and we can visualize them through generation. We also show how this novel decomposition allows us to compose, generate and modify images in interesting ways akin to dreaming. We make our code available at https://github.com/sandeshgh/Score-based-disentanglement

Submitted 4 February, 2023; originally announced February 2023.

arXiv:2211.06780

Inv-SENnet: Invariant Self Expression Network for clustering under biased data

Authors: Ashutosh Singh, Ashish Singh, Aria Masoomi, Tales Imbiriba, Erik Learned-Miller, Deniz Erdogmus

Abstract: Subspace clustering algorithms are used for understanding the cluster structure that explains the dataset well. These methods are extensively used for data-exploration tasks in various areas of Natural Sciences. However, most of these methods fail to handle unwanted biases in datasets. For datasets where a data sample represents multiple attributes, naively applying any clustering approach can result in undesired output. To this end, we propose a novel framework for jointly removing unwanted attributes (biases) while learning to cluster data points in individual subspaces. Assuming we have information about the bias, we regularize the clustering method by adversarially learning to minimize the mutual information between the data and the unwanted attributes. Our experimental results on synthetic and real-world datasets demonstrate the effectiveness of our approach.

Submitted 12 November, 2022; originally announced November 2022.

arXiv:2210.02419

Boundary-Aware Uncertainty for Feature Attribution Explainers

Authors: Davin Hill, Aria Masoomi, Max Torop, Sandesh Ghimire, Jennifer Dy

Abstract: Post-hoc explanation methods have become a critical tool for understanding black-box classifiers in high-stakes applications. However, high-performing classifiers are often highly nonlinear and can exhibit complex behavior around the decision boundary, leading to brittle or misleading local explanations. Therefore there is an impending need to quantify the uncertainty of such explanation methods in order to understand when explanations are trustworthy. In this work we propose the Gaussian Process Explanation UnCertainty (GPEC) framework, which generates a unified uncertainty estimate combining decision boundary-aware uncertainty with explanation function approximation uncertainty. We introduce a novel geodesic-based kernel, which captures the complexity of the target black-box decision boundary. We show theoretically that the proposed kernel similarity increases with decision boundary complexity. The proposed framework is highly flexible; it can be used with any black-box classifier and feature attribution method. Empirical results on multiple tabular and image datasets show that the GPEC uncertainty estimate improves understanding of explanations as compared to existing methods.

Submitted 4 March, 2024; v1 submitted 5 October, 2022; originally announced October 2022.

Comments: Proceedings of the 27th International Conference on Artificial Intelligence and Statistics (AISTATS) 2024

arXiv:2206.12481

Analyzing Explainer Robustness via Probabilistic Lipschitzness of Prediction Functions

Authors: Zulqarnain Khan, Davin Hill, Aria Masoomi, Joshua Bone, Jennifer Dy

Abstract: Machine learning methods have significantly improved in their predictive capabilities, but at the same time they are becoming more complex and less transparent. As a result, explainers are often relied on to provide interpretability to these black-box prediction models. As crucial diagnostics tools, it is important that these explainers themselves are robust. In this paper we focus on one particular aspect of robustness, namely that an explainer should give similar explanations for similar data inputs. We formalize this notion by introducing and defining explainer astuteness, analogous to astuteness of prediction functions. Our formalism allows us to connect explainer robustness to the predictor's probabilistic Lipschitzness, which captures the probability of local smoothness of a function. We provide lower bound guarantees on the astuteness of a variety of explainers (e.g., SHAP, RISE, CXPlain) given the Lipschitzness of the prediction function. These theoretical results imply that locally smooth prediction functions lend themselves to locally robust explanations. We evaluate these results empirically on simulated as well as real datasets.

Submitted 16 April, 2024; v1 submitted 24 June, 2022; originally announced June 2022.

arXiv:2202.01210

Deep Layer-wise Networks Have Closed-Form Weights

Authors: Chieh Wu, Aria Masoomi, Arthur Gretton, Jennifer Dy

Abstract: There is currently a debate within the neuroscience community over the likelihood of the brain performing backpropagation (BP). To better mimic the brain, training a network $\textit{one layer at a time}$ with only a "single forward pass" has been proposed as an alternative to bypass BP; we refer to these networks as "layer-wise" networks. We continue the work on layer-wise networks by answering two outstanding questions. First, $\textit{do they have a closed-form solution?}$ Second, $\textit{how do we know when to stop adding more layers?}$ This work proves that the Kernel Mean Embedding is the closed-form weight that achieves the network global optimum while driving these networks to converge towards a highly desirable kernel for classification; we call it the $\textit{Neural Indicator Kernel}$.

Submitted 7 February, 2022; v1 submitted 1 February, 2022; originally announced February 2022.

Comments: Since this version is similar to an older version, I should have updated the older version instead of creating a new version. I will now retract this version, and update a previous version to this. See arXiv:2006.08539

Journal ref: AIStats 2022

arXiv:2109.14688

Reliable Estimation of KL Divergence using a Discriminator in Reproducing Kernel Hilbert Space

Authors: Sandesh Ghimire, Aria Masoomi, Jennifer Dy

Abstract: Estimating Kullback Leibler (KL) divergence from samples of two distributions is essential in many machine learning problems. Variational methods using neural network discriminator have been proposed to achieve this task in a scalable manner. However, we noted that most of these methods using neural network discriminators suffer from high fluctuations (variance) in estimates and instability in training. In this paper, we look at this issue from statistical learning theory and function space complexity perspective to understand why this happens and how to solve it. We argue that the cause of these pathologies is lack of control over the complexity of the neural network discriminator function and could be mitigated by controlling it. To achieve this objective, we 1) present a novel construction of the discriminator in the Reproducing Kernel Hilbert Space (RKHS), 2) theoretically relate the error probability bound of the KL estimates to the complexity of the discriminator in the RKHS space, 3) present a scalable way to control the complexity (RKHS norm) of the discriminator for a reliable estimation of KL divergence, and 4) prove the consistency of the proposed estimator. In three different applications of KL divergence: estimation of KL, estimation of mutual information and Variational Bayes, we show that by controlling the complexity as developed in the theory, we are able to reduce the variance of KL estimates and stabilize the training.

Submitted 29 September, 2021; originally announced September 2021.

Comments: 27 pages, 3 figures. arXiv admin note: text overlap with arXiv:2002.11187

Journal ref: Advances in Neural Information Processing Systems 2021

arXiv:2106.07035

Deep Bayesian Unsupervised Lifelong Learning

Authors: Tingting Zhao, Zifeng Wang, Aria Masoomi, Jennifer Dy

Abstract: Lifelong Learning (LL) refers to the ability to continually learn and solve new problems with incremental available information over time while retaining previous knowledge. Much attention has been given lately to Supervised Lifelong Learning (SLL) with a stream of labelled data. In contrast, we focus on resolving challenges in Unsupervised Lifelong Learning (ULL) with streaming unlabelled data when the data distribution and the unknown class labels evolve over time. Bayesian framework is natural to incorporate past knowledge and sequentially update the belief with new data. We develop a fully Bayesian inference framework for ULL with a novel end-to-end Deep Bayesian Unsupervised Lifelong Learning (DBULL) algorithm, which can progressively discover new clusters without forgetting the past with unlabelled data while learning latent representations. To efficiently maintain past knowledge, we develop a novel knowledge preservation mechanism via sufficient statistics of the latent representation for raw data. To detect the potential new clusters on the fly, we develop an automatic cluster discovery and redundancy removal strategy in our inference inspired by Nonparametric Bayesian statistics techniques. We demonstrate the effectiveness of our approach using image and text corpora benchmark datasets in both LL and batch settings.

Submitted 13 June, 2021; originally announced June 2021.

arXiv:2106.02734

Revisiting Hilbert-Schmidt Information Bottleneck for Adversarial Robustness

Authors: Zifeng Wang, Tong Jian, Aria Masoomi, Stratis Ioannidis, Jennifer Dy

Abstract: We investigate the HSIC (Hilbert-Schmidt independence criterion) bottleneck as a regularizer for learning an adversarially robust deep neural network classifier. In addition to the usual cross-entropy loss, we add regularization terms for every intermediate layer to ensure that the latent representations retain useful information for output prediction while reducing redundant information. We show that the HSIC bottleneck enhances robustness to adversarial attacks both theoretically and experimentally. In particular, we prove that the HSIC bottleneck regularizer reduces the sensitivity of the classifier to adversarial examples. Our experiments on multiple benchmark datasets and architectures demonstrate that incorporating an HSIC bottleneck regularizer attains competitive natural accuracy and improves adversarial robustness, both with and without adversarial examples during training. Our code and adversarially robust models are publicly available.

Submitted 25 October, 2021; v1 submitted 4 June, 2021; originally announced June 2021.

Comments: Published as a conference paper at NeurIPS 2021

arXiv:2011.03320

Kernel Dependence Network

Authors: Chieh Wu, Aria Masoomi, Arthur Gretton, Jennifer Dy

Abstract: We propose a greedy strategy to spectrally train a deep network for multi-class classification. Each layer is defined as a composition of linear weights with the feature map of a Gaussian kernel acting as the activation function. At each layer, the linear weights are learned by maximizing the dependence between the layer output and the labels using the Hilbert Schmidt Independence Criterion (HSIC). By constraining the solution space on the Stiefel Manifold, we demonstrate how our network construct (Kernel Dependence Network or KNet) can be solved spectrally while leveraging the eigenvalues to automatically find the width and the depth of the network. We theoretically guarantee the existence of a solution for the global optimum while providing insight into our network's ability to generalize.

Submitted 9 November, 2020; v1 submitted 4 November, 2020; originally announced November 2020.

Comments: arXiv admin note: substantial text overlap with arXiv:2006.08539

Journal ref: NeurIPS2020 Workshop (Beyond Backprop)

arXiv:2006.08539

Deep Layer-wise Networks Have Closed-Form Weights

Authors: Chieh Wu, Aria Masoomi, Arthur Gretton, Jennifer Dy

Abstract: There is currently a debate within the neuroscience community over the likelihood of the brain performing backpropagation (BP). To better mimic the brain, training a network $\textit{one layer at a time}$ with only a "single forward pass" has been proposed as an alternative to bypass BP; we refer to these networks as "layer-wise" networks. We continue the work on layer-wise networks by answering two outstanding questions. First, $\textit{do they have a closed-form solution?}$ Second, $\textit{how do we know when to stop adding more layers?}$ This work proves that the kernel Mean Embedding is the closed-form weight that achieves the network global optimum while driving these networks to converge towards a highly desirable kernel for classification; we call it the $\textit{Neural Indicator Kernel}$.

Submitted 9 February, 2022; v1 submitted 15 June, 2020; originally announced June 2020.

Comments: This version will be published in AIStats 2022

arXiv:1906.03288

Streaming Adaptive Nonparametric Variational Autoencoder

Authors: Tingting Zhao, Zifeng Wang, Aria Masoomi, Jennifer G. Dy

Abstract: We develop a data driven approach to perform clustering and end-to-end feature learning simultaneously for streaming data that can adaptively detect novel clusters in emerging data. Our approach, Adaptive Nonparametric Variational Autoencoder (AdapVAE), learns the cluster membership through a Bayesian Nonparametric (BNP) modeling framework with Deep Neural Networks (DNNs) for feature learning. We develop a joint online variational inference algorithm to learn feature representations and clustering assignments simultaneously via iteratively optimizing the Evidence Lower Bound (ELBO). We resolve the catastrophic forgetting \citep{kirkpatrick2017overcoming} challenges with streaming data by adopting generative samples from the trained AdapVAE using previous data, which avoids the need of storing and reusing past data. We demonstrate the advantages of our model including adaptive novel cluster detection without discarding useful information learned from past data, high quality sample generation and comparable clustering performance as end-to-end batch mode clustering methods on both image and text corpora benchmark datasets.

Submitted 11 October, 2019; v1 submitted 7 June, 2019; originally announced June 2019.

Showing 1–14 of 14 results for author: Masoomi, A