Search | arXiv e-print repository

When can we Approximate Wide Contrastive Models with Neural Tangent Kernels and Principal Component Analysis?

Authors: Gautham Govind Anil, Pascal Esser, Debarghya Ghoshdastidar

Abstract: Contrastive learning is a paradigm for learning representations from unlabelled data that has been highly successful for image and text data. Several recent works have examined contrastive losses to claim that contrastive models effectively learn spectral embeddings, while few works show relations between (wide) contrastive models and kernel principal component analysis (PCA). However, it is not known if trained contrastive models indeed correspond to kernel methods or PCA. In this work, we analyze the training dynamics of two-layer contrastive models, with non-linear activation, and answer when these models are close to PCA or kernel methods. It is well known in the supervised setting that neural networks are equivalent to neural tangent kernel (NTK) machines, and that the NTK of infinitely wide networks remains constant during training. We provide the first convergence results of NTK for contrastive losses, and present a nuanced picture: NTK of wide networks remains almost constant for cosine similarity based contrastive losses, but not for losses based on dot product similarity. We further study the training dynamics of contrastive models with orthogonality constraints on output layer, which is implicitly assumed in works relating contrastive learning to spectral embedding. Our deviation bounds suggest that representations learned by contrastive models are close to the principal components of a certain matrix computed from random features. We empirically show that our theoretical results possibly hold beyond two-layer networks.

Submitted 13 March, 2024; originally announced March 2024.

arXiv:2402.13108 [pdf, ps, other]

On the Stability of Gradient Descent for Large Learning Rate

Authors: Alexandru Crăciun, Debarghya Ghoshdastidar

Abstract: There is currently significant interest in understanding the Edge of Stability (EoS) phenomenon, which has been observed in neural network training, characterized by a non-monotonic decrease of the loss function over epochs, while the sharpness of the loss (spectral norm of the Hessian) progressively approaches and stabilizes around 2/(learning rate). Reasons for the existence of EoS when training using gradient descent have recently been proposed -- a lack of flat minima near the gradient descent trajectory together with the presence of compact forward-invariant sets. In this paper, we show that linear neural networks optimized under a quadratic loss function satisfy the first assumption and also a necessary condition for the second assumption. More precisely, we prove that the gradient descent map is non-singular, the set of global minimizers of the loss function forms a smooth manifold, and the stable minima form a bounded subset in parameter space. Additionally, we prove that if the step-size is too big, then the set of initializations from which gradient descent converges to a critical point has measure zero.

Submitted 20 February, 2024; originally announced February 2024.
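
The 2/(learning rate) threshold mentioned in the abstract above can be illustrated on the simplest possible case. The sketch below is not taken from the paper: it assumes a one-dimensional quadratic loss f(x) = 0.5 * sharpness * x^2, whose Hessian is the constant `sharpness`, and the function name `gradient_descent_quadratic` is hypothetical. On this toy loss each gradient descent step multiplies x by (1 - lr * sharpness), so the iterates shrink when sharpness < 2/lr and blow up when sharpness > 2/lr.

```python
def gradient_descent_quadratic(sharpness, lr, x0=1.0, steps=50):
    """Run gradient descent on f(x) = 0.5 * sharpness * x**2.

    Returns |x| after `steps` updates. The toy loss and all parameter
    names are illustrative assumptions, not the setting of the paper.
    """
    x = x0
    for _ in range(steps):
        grad = sharpness * x  # derivative of 0.5 * sharpness * x**2
        x = x - lr * grad     # plain gradient descent step
    return abs(x)


lr = 0.1
threshold = 2.0 / lr  # stability threshold for this learning rate

# Sharpness slightly below the threshold: iterates contract towards 0.
below = gradient_descent_quadratic(sharpness=0.9 * threshold, lr=lr)

# Sharpness slightly above the threshold: iterates grow without bound.
above = gradient_descent_quadratic(sharpness=1.1 * threshold, lr=lr)

print(f"below threshold: |x| = {below:.3e}")
print(f"above threshold: |x| = {above:.3e}")
```

This only shows why 2/(learning rate) is the natural stability boundary for a fixed quadratic; the EoS phenomenon itself concerns non-quadratic losses whose sharpness changes during training.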

arXiv:2402.09881 [pdf, other]

Explaining Kernel Clustering via Decision Trees

Authors: Maximilian Fleissner, Leena Chennuru Vankadara, Debarghya Ghoshdastidar

Abstract: Despite the growing popularity of explainable and interpretable machine learning, there is still surprisingly limited work on inherently interpretable clustering methods. Recently, there has been a surge of interest in explaining the classic k-means algorithm, leading to efficient algorithms that approximate k-means clusters using axis-aligned decision trees. However, interpretable variants of k-means have limited applicability in practice, where more flexible clustering methods are often needed to obtain useful partitions of the data. In this work, we investigate interpretable kernel clustering, and propose algorithms that construct decision trees to approximate the partitions induced by kernel k-means, a nonlinear extension of k-means. We further build on previous work on explainable k-means and demonstrate how a suitable choice of features allows preserving interpretability without sacrificing approximation guarantees on the interpretable model.

Submitted 15 February, 2024; originally announced February 2024.

arXiv:2309.02028 [pdf, other]

Non-Parametric Representation Learning with Kernels

Authors: Pascal Esser, Maximilian Fleissner, Debarghya Ghoshdastidar

Abstract: Unsupervised and self-supervised representation learning has become popular in recent years for learning useful features from unlabelled data. Representation learning has been mostly developed in the neural network literature, and other models for representation learning are surprisingly unexplored. In this work, we introduce and analyze several kernel-based representation learning approaches: Firstly, we define two kernel Self-Supervised Learning (SSL) models using contrastive loss functions and secondly, a Kernel Autoencoder (AE) model based on the idea of embedding and reconstructing data. We argue that the classical representer theorems for supervised kernel machines are not always applicable for (self-supervised) representation learning, and present new representer theorems, which show that the representations learned by our kernel models can be expressed in terms of kernel matrices. We further derive generalisation error bounds for representation learning with kernel SSL and AE, and empirically evaluate the performance of these methods in both small data regimes as well as in comparison with neural network based models.

Submitted 5 September, 2023; originally announced September 2023.

arXiv:2309.02011 [pdf, other]

Representation Learning Dynamics of Self-Supervised Models

Authors: Pascal Esser, Satyaki Mukherjee, Debarghya Ghoshdastidar

Abstract: Self-Supervised Learning (SSL) is an important paradigm for learning representations from unlabelled data, and SSL with neural networks has been highly successful in practice. However, current theoretical analysis of SSL is mostly restricted to generalisation error bounds. In contrast, learning dynamics often provide a precise characterisation of the behaviour of neural network based models but, so far, are mainly known in supervised settings. In this paper, we study the learning dynamics of SSL models, specifically representations obtained by minimising contrastive and non-contrastive losses. We show that a naive extension of the dynamics of multivariate regression to SSL leads to learning trivial scalar representations, which demonstrates dimension collapse in SSL. Consequently, we formulate SSL objectives with orthogonality constraints on the weights, and derive the exact (network width independent) learning dynamics of the SSL models trained using gradient descent on the Grassmannian manifold. We also argue that the infinite width approximation of SSL models significantly deviates from the neural tangent kernel approximations of supervised models. We numerically illustrate the validity of our theoretical findings, and discuss how the presented results provide a framework for further theoretical analysis of contrastive and non-contrastive SSL.

Submitted 5 September, 2023; originally announced September 2023.

arXiv:2307.11672 [pdf, other]

Fast Adaptive Test-Time Defense with Robust Features

Authors: Anurag Singh, Mahalakshmi Sabanayagam, Krikamol Muandet, Debarghya Ghoshdastidar

Abstract: Adaptive test-time defenses are used to improve the robustness of deep neural networks to adversarial examples. However, existing methods significantly increase the inference time due to additional optimization on the model parameters or the input at test time. In this work, we propose a novel adaptive test-time defense strategy that is easy to integrate with any existing (robust) training procedure without additional test-time computation. Based on the notion of robustness of features that we present, the key idea is to project the trained models to the most robust feature space, thereby reducing the vulnerability to adversarial attacks in non-robust directions. We theoretically show that the top eigenspace of the feature matrix is more robust for a generalized additive model and support our argument for a large width neural network with the Neural Tangent Kernel (NTK) equivalence. We conduct extensive experiments on CIFAR-10 and CIFAR-100 datasets for several robustness benchmarks, including the state-of-the-art methods in RobustBench, and observe that the proposed method outperforms existing adaptive test-time defenses at much lower computation costs.

Submitted 21 July, 2023; originally announced July 2023.

arXiv:2302.12693 [pdf, ps, other]

Wasserstein Projection Pursuit of Non-Gaussian Signals

Authors: Satyaki Mukherjee, Soumendu Sundar Mukherjee, Debarghya Ghoshdastidar

Abstract: We consider the general dimensionality reduction problem of locating in a high-dimensional data cloud, a $k$-dimensional non-Gaussian subspace of interesting features. We use a projection pursuit approach -- we search for mutually orthogonal unit directions which maximise the 2-Wasserstein distance of the empirical distribution of data-projections along these directions from a standard Gaussian. Under a generative model, where there is an underlying (unknown) low-dimensional non-Gaussian subspace, we prove rigorous statistical guarantees on the accuracy of approximating this unknown subspace by the directions found by our projection pursuit approach. Our results operate in the regime where the data dimensionality is comparable to the sample size, and thus supplement the recent literature on the non-feasibility of locating interesting directions via projection pursuit in the complementary regime where the data dimensionality is much larger than the sample size.

Submitted 24 February, 2023; originally announced February 2023.

arXiv:2212.01046 [pdf, other]

Improved Representation Learning Through Tensorized Autoencoders

Authors: Pascal Mattia Esser, Satyaki Mukherjee, Mahalakshmi Sabanayagam, Debarghya Ghoshdastidar

Abstract: The central question in representation learning is what constitutes a good or meaningful representation. In this work we argue that if we consider data with inherent cluster structures, where clusters can be characterized through different means and covariances, those data structures should be represented in the embedding as well. While Autoencoders (AE) are widely used in practice for unsupervised representation learning, they do not fulfil the above condition on the embedding as they obtain a single representation of the data. To overcome this we propose a meta-algorithm that can be used to extend an arbitrary AE architecture to a tensorized version (TAE) that allows for learning cluster-specific embeddings while simultaneously learning the cluster assignment. For the linear setting we prove that TAE can recover the principal components of the different clusters, in contrast to the principal components of the entire data recovered by a standard AE. We validated this on planted models and for general, non-linear and convolutional AEs we empirically illustrate that tensorizing the AE is beneficial in clustering and de-noising tasks.

Submitted 2 December, 2022; originally announced December 2022.

arXiv:2211.16459 [pdf, other]

A Revenue Function for Comparison-Based Hierarchical Clustering

Authors: Aishik Mandal, Michaël Perrot, Debarghya Ghoshdastidar

Abstract: Comparison-based learning addresses the problem of learning when, instead of explicit features or pairwise similarities, one only has access to comparisons of the form: \emph{Object $A$ is more similar to $B$ than to $C$.} Recently, it has been shown that, in Hierarchical Clustering, single and complete linkage can be directly implemented using only such comparisons while several algorithms have been proposed to emulate the behaviour of average linkage. Hence, finding hierarchies (or dendrograms) using only comparisons is a well understood problem. However, evaluating their meaningfulness when neither ground truth nor explicit similarities are available remains an open question. In this paper, we bridge this gap by proposing a new revenue function that allows one to measure the goodness of dendrograms using only comparisons. We show that this function is closely related to Dasgupta's cost for hierarchical clustering that uses pairwise similarities. On the theoretical side, we use the proposed revenue function to resolve the open problem of whether one can approximately recover a latent hierarchy using few triplet comparisons. On the practical side, we present principled algorithms for comparison-based hierarchical clustering based on the maximisation of the revenue and we empirically compare them with existing methods.

Submitted 2 April, 2023; v1 submitted 29 November, 2022; originally announced November 2022.

Comments: 26 pages, 6 figures, 5 tables. Transactions on Machine Learning Research (2023)

arXiv:2211.01903 [pdf, ps, other]

A Consistent Estimator for Confounding Strength

Authors: Luca Rendsburg, Leena Chennuru Vankadara, Debarghya Ghoshdastidar, Ulrike von Luxburg

Abstract: Regression on observational data can fail to capture a causal relationship in the presence of unobserved confounding. Confounding strength measures this mismatch, but estimating it itself requires additional assumptions. A common assumption is the independence of causal mechanisms, which relies on concentration phenomena in high dimensions. While high dimensions enable the estimation of confounding strength, they also necessitate adapted estimators. In this paper, we derive the asymptotic behavior of the confounding strength estimator by Janzing and Schölkopf (2018) and show that it is generally not consistent. We then use tools from random matrix theory to derive an adapted, consistent estimator.

Submitted 3 November, 2022; originally announced November 2022.

Comments: 21 pages

arXiv:2210.09809 [pdf, other]

Analysis of Convolutions, Non-linearity and Depth in Graph Neural Networks using Neural Tangent Kernel

Authors: Mahalakshmi Sabanayagam, Pascal Esser, Debarghya Ghoshdastidar

Abstract: The fundamental principle of Graph Neural Networks (GNNs) is to exploit the structural information of the data by aggregating the neighboring nodes using a `graph convolution' in conjunction with a suitable choice for the network architecture, such as depth and activation functions. Therefore, understanding the influence of each design choice on the network performance is crucial. Convolutions based on graph Laplacian have emerged as the dominant choice with the symmetric normalization of the adjacency matrix as the most widely adopted one. However, some empirical studies show that row normalization of the adjacency matrix outperforms it in node classification. Despite the widespread use of GNNs, there is no rigorous theoretical study on the representation power of these convolutions that could explain this behavior. Similarly, the empirical observation that linear GNNs perform on par with non-linear ReLU GNNs lacks rigorous theory. In this work, we theoretically analyze the influence of different aspects of the GNN architecture using the Graph Neural Tangent Kernel in a semi-supervised node classification setting. Under the population Degree Corrected Stochastic Block Model, we prove that: (i) linear networks capture the class information as well as ReLU networks; (ii) row normalization preserves the underlying class structure better than other convolutions; (iii) performance degrades with network depth due to over-smoothing, but the loss in class information is the slowest in row normalization; (iv) skip connections retain the class information even at infinite depth, thereby eliminating over-smoothing. We finally validate our theoretical findings numerically and on real datasets such as Cora and Citeseer.

Submitted 31 October, 2023; v1 submitted 18 October, 2022; originally announced October 2022.

Comments: 39 pages, 24 figures. Code available at https://github.com/mahalakshmi-sabanayagam/NTK_GCN
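
The two graph convolutions contrasted in the abstract above are commonly written as D^{-1} A (row normalization) and D^{-1/2} A D^{-1/2} (symmetric normalization), where A is the adjacency matrix and D the diagonal degree matrix. The following is a minimal sketch of those two standard definitions, not the code from the linked repository; the helper names, the use of self-loops, and the three-node example graph are all assumptions made for illustration.

```python
def row_normalize(adj):
    """Row normalization D^{-1} A: divide each row by its degree.

    Assumes every node has non-zero degree (e.g. self-loops are included).
    """
    return [[a / sum(row) for a in row] for row in adj]


def sym_normalize(adj):
    """Symmetric normalization D^{-1/2} A D^{-1/2}."""
    deg = [sum(row) for row in adj]
    n = len(adj)
    return [
        [adj[i][j] / (deg[i] * deg[j]) ** 0.5 for j in range(n)]
        for i in range(n)
    ]


# Hypothetical example: a path graph on three nodes, with self-loops added.
A = [
    [1, 1, 0],
    [1, 1, 1],
    [0, 1, 1],
]

S_row = row_normalize(A)
S_sym = sym_normalize(A)

for name, S in (("row", S_row), ("sym", S_sym)):
    print(name)
    for row in S:
        print("  ", [round(v, 3) for v in row])
```

Row normalization makes each row sum to one, so aggregation becomes an average over neighbours, while symmetric normalization keeps the matrix symmetric but does not in general preserve row sums.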

arXiv:2202.09054 [pdf, other]

Interpolation and Regularization for Causal Learning

Authors: Leena Chennuru Vankadara, Luca Rendsburg, Ulrike von Luxburg, Debarghya Ghoshdastidar

Abstract: We study the problem of learning causal models from observational data through the lens of interpolation and its counterpart -- regularization. A large volume of recent theoretical, as well as empirical work, suggests that, in highly complex model classes, interpolating estimators can have good statistical generalization properties and can even be optimal for statistical learning. Motivated by an analogy between statistical and causal learning recently highlighted by Janzing (2019), we investigate whether interpolating estimators can also learn good causal models. To this end, we consider a simple linearly confounded model and derive precise asymptotics for the *causal risk* of the min-norm interpolator and ridge-regularized regressors in the high-dimensional regime. Under the principle of independent causal mechanisms, a standard assumption in causal learning, we find that interpolators cannot be optimal and causal learning requires stronger regularization than statistical learning. This resolves a recent conjecture in Janzing (2019). Beyond this assumption, we find a larger range of behavior that can be precisely characterized with a new measure of *confounding strength*. If the confounding strength is negative, causal learning requires weaker regularization than statistical learning, interpolators can be optimal, and the optimal regularization can even be negative. If the confounding strength is large, the optimal regularization is infinite, and learning from observational data is actively harmful.

Submitted 18 February, 2022; originally announced February 2022.

arXiv:2112.03968 [pdf, other]

Learning Theory Can (Sometimes) Explain Generalisation in Graph Neural Networks

Authors: Pascal Mattia Esser, Leena Chennuru Vankadara, Debarghya Ghoshdastidar

Abstract: In recent years, several results in the supervised learning setting suggested that classical statistical learning-theoretic measures, such as VC dimension, do not adequately explain the performance of deep learning models, which prompted a slew of work in the infinite-width and iteration regimes. However, there is little theoretical explanation for the success of neural networks beyond the supervised setting. In this paper we argue that, under some distributional assumptions, classical learning-theoretic measures can sufficiently explain generalization for graph neural networks in the transductive setting. In particular, we provide a rigorous analysis of the performance of neural networks in the context of transductive inference, specifically by analysing the generalisation properties of graph convolutional networks for the problem of node classification. While VC Dimension does result in trivial generalisation error bounds in this setting as well, we show that transductive Rademacher complexity can explain the generalisation properties of graph convolutional networks for stochastic block models. We further use the generalisation error bounds based on transductive Rademacher complexity to demonstrate the role of graph convolutions and network architectures in achieving smaller generalisation error and provide insights into when the graph structure can help in learning. The findings of this paper could renew the interest in studying generalisation in neural networks in terms of learning-theoretic measures, albeit in specific problems.

Submitted 7 December, 2021; originally announced December 2021.

Comments: 35th Conference on Neural Information Processing Systems (NeurIPS 2021)

arXiv:2111.09831 [pdf, other]

Causal Forecasting: Generalization Bounds for Autoregressive Models

Authors: Leena Chennuru Vankadara, Philipp Michael Faller, Michaela Hardt, Lenon Minorics, Debarghya Ghoshdastidar, Dominik Janzing

Abstract: Despite the increasing relevance of forecasting methods, causal implications of these algorithms remain largely unexplored. This is concerning considering that, even under simplifying assumptions such as causal sufficiency, the statistical risk of a model can differ significantly from its \textit{causal risk}. Here, we study the problem of \textit{causal generalization} -- generalizing from the observational to interventional distributions -- in forecasting. Our goal is to find answers to the question: How does the efficacy of an autoregressive (VAR) model in predicting statistical associations compare with its ability to predict under interventions? To this end, we introduce the framework of \textit{causal learning theory} for forecasting. Using this framework, we obtain a characterization of the difference between statistical and causal risks, which helps identify sources of divergence between them. Under causal sufficiency, the problem of causal generalization amounts to learning under covariate shifts, albeit with additional structure (restriction to interventional distributions under the VAR model). This structure allows us to obtain uniform convergence bounds on causal generalizability for the class of VAR models. To the best of our knowledge, this is the first work that provides theoretical guarantees for causal generalization in the time-series setting.

Submitted 8 September, 2022; v1 submitted 18 November, 2021; originally announced November 2021.

arXiv:2110.09476 [pdf, other]

Recovery Guarantees for Kernel-based Clustering under Non-parametric Mixture Models

Authors: Leena Chennuru Vankadara, Sebastian Bordt, Ulrike von Luxburg, Debarghya Ghoshdastidar

Abstract: Despite the ubiquity of kernel-based clustering, surprisingly few statistical guarantees exist beyond settings that consider strong structural assumptions on the data generation process. In this work, we take a step towards bridging this gap by studying the statistical performance of kernel-based clustering algorithms under non-parametric mixture models. We provide necessary and sufficient separability conditions under which these algorithms can consistently recover the underlying true clustering. Our analysis provides guarantees for kernel clustering approaches without structural assumptions on the form of the component distributions. Additionally, we establish a key equivalence between kernel-based data-clustering and kernel density-based clustering. This enables us to provide consistency guarantees for kernel-based estimators of non-parametric mixture models. Along with theoretical implications, this connection could have practical implications, including in the systematic choice of the bandwidth of the Gaussian kernel in the context of clustering.

Submitted 18 October, 2021; originally announced October 2021.

arXiv:2110.04060 [pdf, other]

New Insights into Graph Convolutional Networks using Neural Tangent Kernels

Authors: Mahalakshmi Sabanayagam, Pascal Esser, Debarghya Ghoshdastidar

Abstract: Graph Convolutional Networks (GCNs) have emerged as powerful tools for learning on network structured data. Although empirically successful, GCNs exhibit certain behaviour that has no rigorous explanation -- for instance, the performance of GCNs significantly degrades with increasing network depth, whereas it improves marginally with depth using skip connections. This paper focuses on semi-supervised learning on graphs, and explains the above observations through the lens of Neural Tangent Kernels (NTKs). We derive NTKs corresponding to infinitely wide GCNs (with and without skip connections). Subsequently, we use the derived NTKs to identify that, with suitable normalisation, network depth does not always drastically reduce the performance of GCNs -- a fact that we also validate through extensive simulation. Furthermore, we propose NTK as an efficient `surrogate model' for GCNs that does not suffer from performance fluctuations due to hyper-parameter tuning since it is a hyper-parameter free deterministic kernel. The efficacy of this idea is demonstrated through a comparison of different skip connections for GCNs using the surrogate NTKs.

Submitted 4 November, 2023; v1 submitted 8 October, 2021; originally announced October 2021.

arXiv:2110.02722 [pdf, other]

Graphon based Clustering and Testing of Networks: Algorithms and Theory

Authors: Mahalakshmi Sabanayagam, Leena Chennuru Vankadara, Debarghya Ghoshdastidar

Abstract: Network-valued data are encountered in a wide range of applications and pose challenges in learning due to their complex structure and absence of vertex correspondence. Typical examples of such problems include classification or grouping of protein structures and social networks. Various methods, ranging from graph kernels to graph neural networks, have been proposed that achieve some success in graph classification problems. However, most methods have limited theoretical justification, and their applicability beyond classification remains unexplored. In this work, we propose methods for clustering multiple graphs, without vertex correspondence, that are inspired by the recent literature on estimating graphons -- symmetric functions corresponding to infinite vertex limit of graphs. We propose a novel graph distance based on sorting-and-smoothing graphon estimators. Using the proposed graph distance, we present two clustering algorithms and show that they achieve state-of-the-art results. We prove the statistical consistency of both algorithms under Lipschitz assumptions on the graph degrees. We further study the applicability of the proposed distance for graph two-sample testing problems.

Submitted 7 November, 2021; v1 submitted 6 October, 2021; originally announced October 2021.

arXiv:2108.02789 [pdf, other]

doi 10.1051/0004-6361/202141956

HOLISMOKES -- VII. Time-delay measurement of strongly lensed Type Ia supernovae using machine learning

Authors: S. Huber, S. H. Suyu, D. Ghoshdastidar, S. Taubenberger, V. Bonvin, J. H. H. Chan, M. Kromer, U. M. Noebauer, S. A. Sim, L. Leal-Taixé

Abstract: The Hubble constant ($H_0$) is one of the fundamental parameters in cosmology, but there is a heated debate around the $>$4$σ$ tension between the local Cepheid distance ladder and the early Universe measurements. Strongly lensed Type Ia supernovae (LSNe Ia) are an independent and direct way to measure $H_0$, where a time-delay measurement between the multiple supernova (SN) images is required. In this work, we present two machine learning approaches for measuring time delays in LSNe Ia, namely, a fully connected neural network (FCNN) and a random forest (RF). For the training of the FCNN and the RF, we simulate mock LSNe Ia from theoretical SN Ia models that include observational noise and microlensing. We test the generalizability of the machine learning models by using a final test set based on empirical LSN Ia light curves not used in the training process, and we find that only the RF provides a low enough bias to achieve precision cosmology; as such, RF is therefore preferred over our FCNN approach for applications to real systems. For the RF with single-band photometry in the $i$ band, we obtain an accuracy better than 1\% in all investigated cases for time delays longer than 15 days, assuming follow-up observations with a 5$σ$ point-source depth of 24.7, a two day cadence with a few random gaps, and a detection of the LSNe Ia 8 to 10 days before peak in the observer frame. In terms of precision, we can achieve an approximately 1.5-day uncertainty for a typical source redshift of $\sim$0.8 on the $i$ band under the same assumptions. To improve the measurement, we find that using three bands, where we train a RF for each band separately and combine them afterward, helps to reduce the uncertainty to $\sim$1.0 day. We have publicly released the microlensed spectra and light curves used in this work.

Submitted 21 December, 2021; v1 submitted 5 August, 2021; originally announced August 2021.

Comments: 25 pages, 28 figures; accepted for publication in A&A

Journal ref: A&A 658, A157 (2022)

arXiv:2010.03918 [pdf, other]

Near-Optimal Comparison Based Clustering

Authors: Michaël Perrot, Pascal Mattia Esser, Debarghya Ghoshdastidar

Abstract: The goal of clustering is to group similar objects into meaningful partitions. This process is well understood when an explicit similarity measure between the objects is given. However, far less is known when this information is not readily available and, instead, one only observes ordinal comparisons such as "object i is more similar to j than to k." In this paper, we tackle this problem using a two-step procedure: we estimate a pairwise similarity matrix from the comparisons before using a clustering method based on semi-definite programming (SDP). We theoretically show that our approach can exactly recover a planted clustering using a near-optimal number of passive comparisons. We empirically validate our theoretical findings and demonstrate the good behaviour of our method on real data.

Submitted 9 October, 2020; v1 submitted 8 October, 2020; originally announced October 2020.

Comments: To appear in NeurIPS 2020

arXiv:1912.00458 [pdf, other]

On the optimality of kernels for high-dimensional clustering

Authors: Leena Chennuru Vankadara, Debarghya Ghoshdastidar

Abstract: This paper studies the optimality of kernel methods in high-dimensional data clustering. Recent works have studied the large sample performance of kernel clustering in the high-dimensional regime, where Euclidean distance becomes less informative. However, it is unknown whether popular methods, such as kernel k-means, are optimal in this regime. We consider the problem of high-dimensional Gaussian clustering and show that, with the exponential kernel function, the sufficient conditions for partial recovery of clusters using the NP-hard kernel k-means objective matches the known information-theoretic limit up to a factor of $\sqrt{2}$ for large $k$. It also exactly matches the known upper bounds for the non-kernel setting. We also show that a semi-definite relaxation of the kernel k-means procedure matches up to constant factors, the spectral threshold, below which no polynomial-time algorithm is known to succeed. This is the first work that provides such optimality guarantees for the kernel k-means as well as its convex relaxation. Our proofs demonstrate the utility of the less known polynomial concentration results for random variables with exponentially decaying tails in a higher-order analysis of kernel methods.

Submitted 1 December, 2019; originally announced December 2019.

arXiv:1811.12752 [pdf, ps, other]

Practical methods for graph two-sample testing

Authors: Debarghya Ghoshdastidar, Ulrike von Luxburg

Abstract: Hypothesis testing for graphs has been an important tool in applied research fields for more than two decades, and still remains a challenging problem as one often needs to draw inference from few replicates of large graphs. Recent studies in statistics and learning theory have provided some theoretical insights about such high-dimensional graph testing problems, but the practicality of the developed theoretical methods remains an open question. In this paper, we consider the problem of two-sample testing of large graphs. We demonstrate the practical merits and limitations of existing theoretical tests and their bootstrapped variants. We also propose two new tests based on asymptotic distributions. We show that these tests are computationally less expensive and, in some cases, more reliable than the existing methods.

Submitted 30 November, 2018; originally announced November 2018.

Comments: To appear in Neural Information Processing Systems 2018

arXiv:1811.00928 [pdf, other]

Foundations of Comparison-Based Hierarchical Clustering

Authors: Debarghya Ghoshdastidar, Michaël Perrot, Ulrike von Luxburg

Abstract: We address the classical problem of hierarchical clustering, but in a framework where one does not have access to a representation of the objects or their pairwise similarities. Instead, we assume that only a set of comparisons between objects is available, that is, statements of the form "objects $i$ and $j$ are more similar than objects $k$ and $l$." Such a scenario is commonly encountered in crowdsourcing applications. The focus of this work is to develop comparison-based hierarchical clustering algorithms that do not rely on the principles of ordinal embedding. We show that single and complete linkage are inherently comparison-based and we develop variants of average linkage. We provide statistical guarantees for the different methods under a planted hierarchical partition model. We also empirically demonstrate the performance of the proposed approaches on several datasets.

Submitted 12 June, 2019; v1 submitted 2 November, 2018; originally announced November 2018.

Comments: 26 pages

arXiv:1807.02821 [pdf, ps, other]

doi 10.1063/1.5026640

Ionic liquids make DNA rigid

Authors: Ashok Garai, Debostuti Ghoshdastidar, Sanjib Senapati, Prabal K. Maiti

Abstract: Persistence length of dsDNA is known to decrease with increase in ionic concentration of the solution. In contrast to this, here we show that persistence length of dsDNA increases dramatically as a function of ionic liquid (IL) concentration. Using all atomic explicit solvent molecular dynamics simulations and theoretical models we present, for the first time, a systematic study to determine the mechanical properties of dsDNA in various hydrated ionic liquids at different concentrations. We find that dsDNA in 50 wt% ILs have lower persistence length and stretch modulus in comparison to 80 wt% ILs. We further observe that both persistence length and stretch modulus of dsDNA increase as we increase the ILs concentration. Present trend of stretch modulus and persistence length of dsDNA with ILs concentration supports the predictions of the macroscopic elastic theory, in contrast to the behavior exhibited by dsDNA in monovalent salt. Our study further suggests the preferable ILs that can be used for maintaining DNA stability during long-term storage.

Submitted 8 July, 2018; originally announced July 2018.

Comments: 16 pages, 3 figures, Supplementary Information (Accepted for publication in the Journal of Chemical Physics, AIP (USA))

arXiv:1707.00833 [pdf, ps, other]

doi 10.1214/19-AOS1884

Two-sample Hypothesis Testing for Inhomogeneous Random Graphs

Authors: Debarghya Ghoshdastidar, Maurilio Gutzeit, Alexandra Carpentier, Ulrike von Luxburg

Abstract: The study of networks leads to a wide range of high dimensional inference problems. In many practical applications, one needs to draw inference from one or few large sparse networks. The present paper studies hypothesis testing of graphs in this high-dimensional regime, where the goal is to test between two populations of inhomogeneous random graphs defined on the same set of $n$ vertices. The size of each population $m$ is much smaller than $n$, and can even be a constant as small as 1. The critical question in this context is whether the problem is solvable for small $m$. We answer this question from a minimax testing perspective. Let $P,Q$ be the population adjacencies of two sparse inhomogeneous random graph models, and $d$ be a suitably defined distance function. Given a population of $m$ graphs from each model, we derive minimax separation rates for the problem of testing $P=Q$ against $d(P,Q)>ρ$. We observe that if $m$ is small, then the minimax separation is too large for some popular choices of $d$, including total variation distance between corresponding distributions. This implies that some models that are widely separated in $d$ cannot be distinguished for small $m$, and hence, the testing problem is generally not solvable in these cases. We also show that if $m>1$, then the minimax separation is relatively small if $d$ is the Frobenius norm or operator norm distance between $P$ and $Q$. For $m=1$, only the latter distance provides small minimax separation. Thus, for these distances, the problem is solvable for small $m$. We also present near-optimal two-sample tests in both cases, where tests are adaptive with respect to sparsity level of the graphs.

Submitted 17 July, 2019; v1 submitted 4 July, 2017; originally announced July 2017.

Comments: To appear in the Annals of Statistics. This 54-page version includes the supplementary material (appendix to the main paper)

MSC Class: 62H15; 62C20; 05C80; 60B20

Journal ref: Ann. Statist. Volume 48, Number 4 (2020), 2208-2229

arXiv:1705.06168 [pdf, ps, other]

Two-Sample Tests for Large Random Graphs Using Network Statistics

Authors: Debarghya Ghoshdastidar, Maurilio Gutzeit, Alexandra Carpentier, Ulrike von Luxburg

Abstract: We consider a two-sample hypothesis testing problem, where the distributions are defined on the space of undirected graphs, and one has access to only one observation from each model. A motivating example for this problem is comparing the friendship networks on Facebook and LinkedIn. The practical approach to such problems is to compare the networks based on certain network statistics. In this paper, we present a general principle for two-sample hypothesis testing in such scenarios without making any assumption about the network generation process. The main contribution of the paper is a general formulation of the problem based on concentration of network statistics, and consequently, a consistent two-sample test that arises as the natural solution for this problem. We also show that the proposed test is minimax optimal for certain network statistics.

Submitted 26 May, 2017; v1 submitted 17 May, 2017; originally announced May 2017.

Comments: To be presented in COLT 2017 (author sequence, funding details and minor typos updated in version 2)

arXiv:1704.01460 [pdf, other]

Comparison Based Nearest Neighbor Search

Authors: Siavash Haghiri, Debarghya Ghoshdastidar, Ulrike von Luxburg

Abstract: We consider machine learning in a comparison-based setting where we are given a set of points in a metric space, but we have no access to the actual distances between the points. Instead, we can only ask an oracle whether the distance between two points $i$ and $j$ is smaller than the distance between the points $i$ and $k$. We are concerned with data structures and algorithms to find nearest neighbors based on such comparisons. We focus on a simple yet effective algorithm that recursively splits the space by first selecting two random pivot points and then assigning all other points to the closer of the two (comparison tree). We prove that if the metric space satisfies certain expansion conditions, then with high probability the height of the comparison tree is logarithmic in the number of points, leading to efficient search performance. We also provide an upper bound for the failure probability to return the true nearest neighbor. Experiments show that the comparison tree is competitive with algorithms that have access to the actual distance values, and needs less triplet comparisons than other competitors.

Submitted 5 April, 2017; originally announced April 2017.

Comments: 16 Pages, 3 Figures

arXiv:1602.06516 [pdf, other]

Uniform Hypergraph Partitioning: Provable Tensor Methods and Sampling Techniques

Authors: Debarghya Ghoshdastidar, Ambedkar Dukkipati

Abstract: In a series of recent works, we have generalised the consistency results in the stochastic block model literature to the case of uniform and non-uniform hypergraphs. The present paper continues the same line of study, where we focus on partitioning weighted uniform hypergraphs---a problem often encountered in computer vision. This work is motivated by two issues that arise when a hypergraph partitioning approach is used to tackle computer vision problems: (i) The uniform hypergraphs constructed for higher-order learning contain all edges, but most have negligible weights. Thus, the adjacency tensor is nearly sparse, and yet, not binary. (ii) A more serious concern is that standard partitioning algorithms need to compute all edge weights, which is computationally expensive for hypergraphs. This is usually resolved in practice by merging the clustering algorithm with a tensor sampling strategy---an approach that is yet to be analysed rigorously. We build on our earlier work on partitioning dense unweighted uniform hypergraphs (Ghoshdastidar and Dukkipati, ICML, 2015), and address the aforementioned issues by proposing provable and efficient partitioning algorithms. Our analysis justifies the empirical success of practical sampling techniques. We also complement our theoretical findings by elaborate empirical comparison of various hypergraph partitioning schemes.

Submitted 17 May, 2017; v1 submitted 21 February, 2016; originally announced February 2016.

Comments: To appear in Journal of Machine Learning Research (vol 18)

arXiv:1507.00763 [pdf, ps, other]

Coloring Random Non-Uniform Bipartite Hypergraphs

Authors: Debarghya Ghoshdastidar, Ambedkar Dukkipati

Abstract: Let $H_{n,(p_m)_{m=2,\ldots,M}}$ be a random non-uniform hypergraph of dimension $M$ on $2n$ vertices, where the vertices are split into two disjoint sets of size $n$, and colored by two distinct colors. Each non-monochromatic edge of size $m=2,\ldots,M$ is independently added with probability $p_m$. We show that if $p_2,\ldots,p_M$ are such that the expected number of edges in the hypergraph is at least $dn\ln n$, for some $d>0$ sufficiently large, then with probability $(1-o(1))$, one can find a proper 2-coloring of $H_{n,(p_m)_{m=2,\ldots,M}}$ in polynomial time. We present a polynomial time algorithm for hypergraph 2-coloring, and provide discussions on extension of the approach for $k$-coloring of non-uniform hypergraphs.

Submitted 17 November, 2015; v1 submitted 2 July, 2015; originally announced July 2015.

Comments: 15 pages

arXiv:1505.01582 [pdf, other]

doi 10.1214/16-AOS1453

Consistency of Spectral Hypergraph Partitioning under Planted Partition Model

Authors: Debarghya Ghoshdastidar, Ambedkar Dukkipati

Abstract: Hypergraph partitioning lies at the heart of a number of problems in machine learning and network sciences. Many algorithms for hypergraph partitioning have been proposed that extend standard approaches for graph partitioning to the case of hypergraphs. However, theoretical aspects of such methods have seldom received attention in the literature as compared to the extensive studies on the guarantees of graph partitioning. For instance, consistency results of spectral graph partitioning under the stochastic block model are well known. In this paper, we present a planted partition model for sparse random non-uniform hypergraphs that generalizes the stochastic block model. We derive an error bound for a spectral hypergraph partitioning algorithm under this model using matrix concentration inequalities. To the best of our knowledge, this is the first consistency result related to partitioning non-uniform hypergraphs.

Submitted 3 February, 2016; v1 submitted 7 May, 2015; originally announced May 2015.

Comments: 35 pages, 2 figures, 1 table

Journal ref: Ann. Statist. Volume 45, Number 1 (2017), 289-315

arXiv:1403.4378 [pdf, other]

doi 10.1109/CVPR.2014.191

Spectral Clustering with Jensen-type kernels and their multi-point extensions

Authors: Debarghya Ghoshdastidar, Ambedkar Dukkipati, Ajay P. Adsul, Aparna S. Vijayan

Abstract: Motivated by multi-distribution divergences, which originate in information theory, we propose a notion of `multi-point' kernels, and study their applications. We study a class of kernels based on Jensen type divergences and show that these can be extended to measure similarity among multiple points. We study tensor flattening methods and develop a multi-point (kernel) spectral clustering (MSC) method. We further emphasize on a special case of the proposed kernels, which is a multi-point extension of the linear (dot-product) kernel and show the existence of cubic time tensor flattening algorithm in this case. Finally, we illustrate the usefulness of our contributions using standard data sets and image segmentation tasks.

Submitted 18 March, 2014; originally announced March 2014.

Comments: To appear in IEEE Computer Society Conference on Computer Vision and Pattern Recognition

arXiv:1311.2296 [pdf, other]

doi 10.1016/j.automatica.2014.08.021

Newton based Stochastic Optimization using q-Gaussian Smoothed Functional Algorithms

Authors: Debarghya Ghoshdastidar, Ambedkar Dukkipati, Shalabh Bhatnagar

Abstract: We present the first q-Gaussian smoothed functional (SF) estimator of the Hessian and the first Newton-based stochastic optimization algorithm that estimates both the Hessian and the gradient of the objective function using q-Gaussian perturbations. Our algorithm requires only two system simulations (regardless of the parameter dimension) and estimates both the gradient and the Hessian at each update epoch using these. We also present a proof of convergence of the proposed algorithm. In a related recent work (Ghoshdastidar et al., 2013), we presented gradient SF algorithms based on the q-Gaussian perturbations. Our work extends prior work on smoothed functional algorithms by generalizing the class of perturbation distributions as most distributions reported in the literature for which SF algorithms are known to work and turn out to be special cases of the q-Gaussian distribution. Besides studying the convergence properties of our algorithm analytically, we also show the results of several numerical simulations on a model of a queuing network, that illustrate the significance of the proposed method. In particular, we observe that our algorithm performs better in most cases, over a wide range of q-values, in comparison to Newton SF algorithms with the Gaussian (Bhatnagar, 2007) and Cauchy perturbations, as well as the gradient q-Gaussian SF algorithms (Ghoshdastidar et al., 2013).

Submitted 19 May, 2014; v1 submitted 10 November, 2013; originally announced November 2013.

Comments: This is a longer version of the paper with the same title accepted in Automatica

arXiv:1206.4832 [pdf, other]

doi 10.1145/2628434

Smoothed Functional Algorithms for Stochastic Optimization using q-Gaussian Distributions

Authors: Debarghya Ghoshdastidar, Ambedkar Dukkipati, Shalabh Bhatnagar

Abstract: Smoothed functional (SF) schemes for gradient estimation are known to be efficient in stochastic optimization algorithms, specially when the objective is to improve the performance of a stochastic system. However, the performance of these methods depends on several parameters, such as the choice of a suitable smoothing kernel. Different kernels have been studied in literature, which include Gaussian, Cauchy and uniform distributions among others. This paper studies a new class of kernels based on the q-Gaussian distribution, that has gained popularity in statistical physics over the last decade. Though the importance of this family of distributions is attributed to its ability to generalize the Gaussian distribution, we observe that this class encompasses almost all existing smoothing kernels. This motivates us to study SF schemes for gradient estimation using the q-Gaussian distribution. Using the derived gradient estimates, we propose two-timescale algorithms for optimization of a stochastic objective function in a constrained setting with projected gradient search approach. We prove the convergence of our algorithms to the set of stationary points of an associated ODE. We also demonstrate their performance numerically through simulations on a queuing model.

Submitted 3 July, 2014; v1 submitted 21 June, 2012; originally announced June 2012.

ACM Class: G.1.6; I.6.8

arXiv:1205.0651 [pdf, ps, other]

Generative Maximum Entropy Learning for Multiclass Classification

Authors: Ambedkar Dukkipati, Gaurav Pandey, Debarghya Ghoshdastidar, Paramita Koley, D. M. V. Satya Sriram

Abstract: Maximum entropy approach to classification is very well studied in applied statistics and machine learning and almost all the methods that exists in literature are discriminative in nature. In this paper, we introduce a maximum entropy classification method with feature selection for large dimensional data such as text datasets that is generative in nature. To tackle the curse of dimensionality of large data sets, we employ conditional independence assumption (Naive Bayes) and we perform feature selection simultaneously, by enforcing a `maximum discrimination' between estimated class conditional densities. For two class problems, in the proposed method, we use Jeffreys ($J$) divergence to discriminate the class conditional densities. To extend our method to the multi-class case, we propose a completely new approach by considering a multi-distribution divergence: we replace Jeffreys divergence by Jensen-Shannon ($JS$) divergence to discriminate conditional densities of multiple classes. In order to reduce computational complexity, we employ a modified Jensen-Shannon divergence ($JS_{GM}$), based on AM-GM inequality. We show that the resulting divergence is a natural generalization of Jeffreys divergence to a multiple distributions case. As far as the theoretical justifications are concerned we show that when one intends to select the best features in a generative maximum entropy approach, maximum discrimination using $J-$divergence emerges naturally in binary classification. Performance and comparative study of the proposed algorithms have been demonstrated on large dimensional text and gene expression datasets that show our methods scale up very well with large dimensional datasets.

Submitted 30 December, 2013; v1 submitted 3 May, 2012; originally announced May 2012.

arXiv:1204.1800 [pdf, other]

On Power-law Kernels, corresponding Reproducing Kernel Hilbert Space and Applications

Authors: Debarghya Ghoshdastidar, Ambedkar Dukkipati

Abstract: The role of kernels is central to machine learning. Motivated by the importance of power-law distributions in statistical modeling, in this paper, we propose the notion of power-law kernels to investigate power-laws in learning problem. We propose two power-law kernels by generalizing Gaussian and Laplacian kernels. This generalization is based on distributions, arising out of maximization of a generalized information measure known as nonextensive entropy that is very well studied in statistical mechanics. We prove that the proposed kernels are positive definite, and provide some insights regarding the corresponding Reproducing Kernel Hilbert Space (RKHS). We also study practical significance of both kernels in classification and regression, and present some simulation results.

Submitted 1 April, 2013; v1 submitted 9 April, 2012; originally announced April 2012.

Comments: 7 pages, 3 figures, 3 tables

arXiv:1202.5665 [pdf, other]

doi 10.1109/ISIT.2012.6283013

q-Gaussian based Smoothed Functional Algorithm for Stochastic Optimization

Authors: Debarghya Ghoshdastidar, Ambedkar Dukkipati, Shalabh Bhatnagar

Abstract: The q-Gaussian distribution results from maximizing certain generalizations of Shannon entropy under some constraints. The importance of q-Gaussian distributions stems from the fact that they exhibit power-law behavior, and also generalize Gaussian distributions. In this paper, we propose a Smoothed Functional (SF) scheme for gradient estimation using q-Gaussian distribution, and also propose an algorithm for optimization based on the above scheme. Convergence results of the algorithm are presented. Performance of the proposed algorithm is shown by simulation results on a queuing model.

Submitted 16 June, 2012; v1 submitted 25 February, 2012; originally announced February 2012.

Comments: 5 pages, 1 figure

Showing 1–35 of 35 results for author: Ghoshdastidar, D