-
Retraining with Predicted Hard Labels Provably Increases Model Accuracy
Authors:
Rudrajit Das,
Inderjit S. Dhillon,
Alessandro Epasto,
Adel Javanmard,
Jieming Mao,
Vahab Mirrokni,
Sujay Sanghavi,
Peilin Zhong
Abstract:
The performance of a model trained with \textit{noisy labels} is often improved by simply \textit{retraining} the model with its own predicted \textit{hard} labels (i.e., $1$/$0$ labels). Yet, a detailed theoretical characterization of this phenomenon is lacking. In this paper, we theoretically analyze retraining in a linearly separable setting with randomly corrupted labels given to us and prove that retraining can improve the population accuracy obtained by initially training with the given (noisy) labels. To the best of our knowledge, this is the first such theoretical result. Retraining finds application in improving training with label differential privacy (DP), which involves training with noisy labels. We empirically show that retraining selectively on the samples for which the predicted label matches the given label significantly improves label DP training at \textit{no extra privacy cost}; we call this \textit{consensus-based retraining}. For example, when training ResNet-18 on CIFAR-100 with $ε=3$ label DP, we obtain $6.4\%$ improvement in accuracy with consensus-based retraining.
Submitted 17 June, 2024;
originally announced June 2024.
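The consensus-based retraining step described in the abstract above can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes a toy linearly separable problem, a plain NumPy logistic regression, and a 30% random label-flip rate, and the names `fit_logreg`, `predict`, and `consensus_retrain` are hypothetical.

```python
import numpy as np

def fit_logreg(X, y, steps=500, lr=0.5):
    """Gradient descent on the average logistic loss (no intercept)."""
    w = np.zeros(X.shape[1])
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-X @ w))
        w -= lr * X.T @ (p - y) / len(y)
    return w

def predict(X, w):
    """Hard 0/1 labels from the sign of the linear score."""
    return (X @ w > 0).astype(int)

def consensus_retrain(X, noisy_y):
    # Step 1: train on the given (noisy) labels.
    w0 = fit_logreg(X, noisy_y)
    # Step 2: predict hard labels on the training set.
    pred = predict(X, w0)
    # Step 3: keep only samples where prediction and given label agree.
    keep = pred == noisy_y
    # Step 4: retrain on the consensus subset using the predicted labels.
    w1 = fit_logreg(X[keep], pred[keep])
    return w0, w1, keep

# Toy data (assumed): separable ground truth with randomly flipped labels.
rng = np.random.default_rng(0)
w_true = np.array([1.0, -1.0])
X = rng.normal(size=(400, 2))
clean_y = (X @ w_true > 0).astype(int)
flip = rng.random(400) < 0.3
noisy_y = np.where(flip, 1 - clean_y, clean_y)

w0, w1, keep = consensus_retrain(X, noisy_y)
```

Only the given labels and the model's own predictions are used in the second round, which is why the abstract can describe the step as adding no extra privacy cost.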
-
Bayesian reliability acceptance sampling plan with optional warranty under hybrid censoring
Authors:
Rathin Das,
Biswabrata Pradhan
Abstract:
This work considers the design of a Bayesian reliability acceptance sampling plan (RASP) under a hybrid censored life test for products sold under optional warranty. The consumer and manufacturer agree on a common lifetime distribution of the product. However, they differ in their assessment of the prior distributions because of the adversarial nature of their relationship. The consumer first makes a decision based on his/her utility and prior belief, without a warranty offer from the manufacturer. If the decision is rejection, the manufacturer provides a warranty offer to the consumer. If the consumer rejects the lot with a warranty, the manufacturer conducts a life test under a hybrid censoring scheme (HCS) and provides the lifetime information to the consumer. The consumer updates his/her belief based on the lifetime information provided by the manufacturer and then decides to accept or reject the lot based on the updated belief. The task of the manufacturer is to determine the optimal life testing plan.
Submitted 14 February, 2024;
originally announced February 2024.
-
Towards Quantifying the Preconditioning Effect of Adam
Authors:
Rudrajit Das,
Naman Agarwal,
Sujay Sanghavi,
Inderjit S. Dhillon
Abstract:
There is a notable dearth of results characterizing the preconditioning effect of Adam and showing how it may alleviate the curse of ill-conditioning -- an issue plaguing gradient descent (GD). In this work, we perform a detailed analysis of Adam's preconditioning effect for quadratic functions and quantify to what extent Adam can mitigate the dependence on the condition number of the Hessian. Our key finding is that Adam can suffer less from the condition number but at the expense of suffering a dimension-dependent quantity. Specifically, for a $d$-dimensional quadratic with a diagonal Hessian having condition number $κ$, we show that the effective condition number-like quantity controlling the iteration complexity of Adam without momentum is $\mathcal{O}(\min(d, κ))$. For a diagonally dominant Hessian, we obtain a bound of $\mathcal{O}(\min(d \sqrt{d κ}, κ))$ for the corresponding quantity. Thus, when $d < \mathcal{O}(κ^p)$ where $p = 1$ for a diagonal Hessian and $p = 1/3$ for a diagonally dominant Hessian, Adam can outperform GD (which has an $\mathcal{O}(κ)$ dependence). On the negative side, our results suggest that Adam can be worse than GD for a sufficiently non-diagonal Hessian even if $d \ll \mathcal{O}(κ^{1/3})$; we corroborate this with empirical evidence. Finally, we extend our analysis to functions satisfying per-coordinate Lipschitz smoothness and a modified version of the Polyak-Łojasiewicz condition.
Submitted 11 February, 2024;
originally announced February 2024.
-
Understanding the Training Speedup from Sampling with Approximate Losses
Authors:
Rudrajit Das,
Xi Chen,
Bertram Ieong,
Parikshit Bansal,
Sujay Sanghavi
Abstract:
It is well known that selecting samples with large losses/gradients can significantly reduce the number of training steps. However, the selection overhead is often too high to yield any meaningful gains in terms of overall training time. In this work, we focus on the greedy approach of selecting samples with large \textit{approximate losses} instead of exact losses in order to reduce the selection overhead. For smooth convex losses, we show that such a greedy strategy can converge to a constant factor of the minimum value of the average loss in fewer iterations than the standard approach of random selection. We also theoretically quantify the effect of the approximation level. We then develop SIFT, which uses early exiting to obtain approximate losses with an intermediate layer's representations for sample selection. We evaluate SIFT on the task of training a 110M parameter 12-layer BERT base model and show significant gains (in terms of training hours and number of backpropagation steps) without any optimized implementation over vanilla training. For example, to reach 64% validation accuracy, SIFT with exit at the first layer takes ~43 hours compared to ~57 hours of vanilla training.
Submitted 10 February, 2024;
originally announced February 2024.
-
Fairer and More Accurate Tabular Models Through NAS
Authors:
Richeek Das,
Samuel Dooley
Abstract:
Making models algorithmically fairer in tabular data has been long studied, with techniques typically oriented towards fixes which usually take a neural model with an undesirable outcome and make changes to how the data are ingested, what the model weights are, or how outputs are processed. We employ an emergent and different strategy where we consider updating the model's architecture and training hyperparameters to find an entirely new model with better outcomes from the beginning of the debiasing procedure. In this work, we propose using multi-objective Neural Architecture Search (NAS) and Hyperparameter Optimization (HPO) in the first application to the very challenging domain of tabular data. We conduct extensive exploration of architectural and hyperparameter spaces (MLP, ResNet, and FT-Transformer) across diverse datasets, demonstrating the dependence of accuracy and fairness metrics of model predictions on hyperparameter combinations. We show that models optimized solely for accuracy with NAS often fail to inherently address fairness concerns. We propose a novel approach that jointly optimizes architectural and training hyperparameters in a multi-objective constraint of both accuracy and fairness. We produce architectures that consistently Pareto dominate state-of-the-art bias mitigation methods either in fairness, accuracy or both, all of this while being Pareto-optimal over hyperparameters achieved through single-objective (accuracy) optimization runs. This research underscores the promise of automating fairness and accuracy optimization in deep learning models.
Submitted 18 October, 2023;
originally announced October 2023.
-
Understanding Self-Distillation in the Presence of Label Noise
Authors:
Rudrajit Das,
Sujay Sanghavi
Abstract:
Self-distillation (SD) is the process of first training a \enquote{teacher} model and then using its predictions to train a \enquote{student} model with the \textit{same} architecture. Specifically, the student's objective function is $\big(ξ*\ell(\text{teacher's predictions}, \text{ student's predictions}) + (1-ξ)*\ell(\text{given labels}, \text{ student's predictions})\big)$, where $\ell$ is some loss function and $ξ$ is some parameter $\in [0,1]$. Empirically, SD has been observed to provide performance gains in several settings. In this paper, we theoretically characterize the effect of SD in two supervised learning problems with \textit{noisy labels}. We first analyze SD for regularized linear regression and show that in the high label noise regime, the optimal value of $ξ$ that minimizes the expected error in estimating the ground truth parameter is surprisingly greater than 1. Empirically, we show that $ξ> 1$ works better than $ξ\leq 1$ even with the cross-entropy loss for several classification datasets when 50\% or 30\% of the labels are corrupted. Further, we quantify when optimal SD is better than optimal regularization. Next, we analyze SD in the case of logistic regression for binary classification with random label corruption and quantify the range of label corruption in which the student outperforms the teacher in terms of accuracy. To our knowledge, this is the first result of its kind for the cross-entropy loss.
Submitted 30 January, 2023;
originally announced January 2023.
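The self-distillation objective in the abstract above can be illustrated for regularized linear regression with the squared loss. The sketch below is an assumption-laden toy, not the paper's exact setup: it assumes teacher and student share the same ridge penalty `lam`, and the helper names `ridge` and `self_distill` are hypothetical. It relies on the identity that, for the squared loss, $ξ\|Xw - t\|^2 + (1-ξ)\|Xw - y\|^2$ equals $\|Xw - (ξt + (1-ξ)y)\|^2$ plus a constant independent of $w$, which holds for any real $ξ$, including $ξ > 1$.

```python
import numpy as np

def ridge(X, y, lam):
    """Closed-form minimizer of ||Xw - y||^2 + lam * ||w||^2."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

def self_distill(X, y, lam, xi):
    """Teacher: ridge on the given labels.
    Student: ridge on the xi-weighted mix of teacher predictions and given labels."""
    w_teacher = ridge(X, y, lam)
    mixed_target = xi * (X @ w_teacher) + (1.0 - xi) * y
    w_student = ridge(X, mixed_target, lam)
    return w_teacher, w_student

# Toy data (assumed): linear ground truth with heavy label noise.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
w_star = np.array([1.0, -2.0, 0.5])
y = X @ w_star + rng.normal(scale=2.0, size=50)

w_t, w_s = self_distill(X, y, lam=1.0, xi=1.5)  # xi > 1 is allowed by the algebra
```

Setting `xi=0` recovers the teacher exactly, and `xi=1` trains the student on the teacher's predictions alone.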
-
Modelling and classifying joint trajectories of self-reported mood and pain in a large cohort study
Authors:
Rajenki Das,
Mark Muldoon,
Mark Lunt,
John McBeth,
Belay Birlie Yimer,
Thomas House
Abstract:
It is well known that mood and pain interact with each other; however, individual-level variability in this relationship has been less well quantified than overall associations between low mood and pain. Here, we leverage the possibilities presented by mobile health data, in particular the "Cloudy with a Chance of Pain" study, which collected longitudinal data from UK residents with chronic pain conditions. Participants used an app to record self-reported measures of factors including mood, pain and sleep quality. The richness of these data allows us to perform model-based clustering of the data as a mixture of Markov processes. Through this analysis we discover four endotypes with distinct patterns of co-evolution of mood and pain over time. The differences between endotypes are sufficiently large to play a role in clinical hypothesis generation for personalised treatments of comorbid pain and low mood.
Submitted 1 January, 2024; v1 submitted 30 September, 2022;
originally announced September 2022.
-
Beyond Uniform Lipschitz Condition in Differentially Private Optimization
Authors:
Rudrajit Das,
Satyen Kale,
Zheng Xu,
Tong Zhang,
Sujay Sanghavi
Abstract:
Most prior results on differentially private stochastic gradient descent (DP-SGD) are derived under the simplistic assumption of uniform Lipschitzness, i.e., the per-sample gradients are uniformly bounded. We generalize uniform Lipschitzness by assuming that the per-sample gradients have sample-dependent upper bounds, i.e., per-sample Lipschitz constants, which themselves may be unbounded. We provide principled guidance on choosing the clip norm in DP-SGD for convex over-parameterized settings satisfying our general version of Lipschitzness when the per-sample Lipschitz constants are bounded; specifically, we recommend tuning the clip norm only up to the minimum per-sample Lipschitz constant. This finds application in the private training of a softmax layer on top of a deep network pre-trained on public data. We verify the efficacy of our recommendation via experiments on 8 datasets. Furthermore, we provide new convergence results for DP-SGD on convex and nonconvex functions when the Lipschitz constants are unbounded but have bounded moments, i.e., they are heavy-tailed.
Submitted 5 June, 2023; v1 submitted 21 June, 2022;
originally announced June 2022.
-
Diversity of symptom phenotypes in SARS-CoV-2 community infections observed in multiple large datasets
Authors:
Martyn Fyles,
Karina-Doris Vihta,
Carole H Sudre,
Harry Long,
Rajenki Das,
Caroline Jay,
Tom Wingfield,
Fergus Cumming,
William Green,
Pantelis Hadjipantelis,
Joni Kirk,
Claire J Steves,
Sebastien Ourselin,
Graham F Medley,
Elizabeth Fearon,
Thomas House
Abstract:
Through the use of cutting-edge unsupervised classification techniques from statistics and machine learning, we characterise symptom phenotypes among symptomatic SARS-CoV-2 PCR-positive community cases. We first analyse each dataset in isolation and across age bands, before using methods that allow us to compare multiple datasets. While we observe separation due to the total number of symptoms experienced by cases, we also see a separation of symptoms into gastrointestinal, respiratory and other types, and different symptom co-occurrence patterns at the extremes of age. In this way, we are able to demonstrate the deep structure of symptoms of COVID-19 without usual biases due to study design. This is expected to have implications for the identification and management of community SARS-CoV-2 cases and could be further applied to symptom-based management of other diseases and syndromes.
Submitted 20 November, 2023; v1 submitted 10 November, 2021;
originally announced November 2021.
-
Deep learning models for predicting RNA degradation via dual crowdsourcing
Authors:
Hannah K. Wayment-Steele,
Wipapat Kladwang,
Andrew M. Watkins,
Do Soon Kim,
Bojan Tunguz,
Walter Reade,
Maggie Demkin,
Jonathan Romano,
Roger Wellington-Oguri,
John J. Nicol,
Jiayang Gao,
Kazuki Onodera,
Kazuki Fujikawa,
Hanfei Mao,
Gilles Vandewiele,
Michele Tinti,
Bram Steenwinckel,
Takuya Ito,
Taiga Noumi,
Shujun He,
Keiichiro Ishi,
Youhan Lee,
Fatih Öztürk,
Anthony Chiu,
Emin Öztürk
, et al. (4 additional authors not shown)
Abstract:
Messenger RNA-based medicines hold immense potential, as evidenced by their rapid deployment as COVID-19 vaccines. However, worldwide distribution of mRNA molecules has been limited by their thermostability, which is fundamentally limited by the intrinsic instability of RNA molecules to a chemical degradation reaction called in-line hydrolysis. Predicting the degradation of an RNA molecule is a key task in designing more stable RNA-based therapeutics. Here, we describe a crowdsourced machine learning competition ("Stanford OpenVaccine") on Kaggle, involving single-nucleotide resolution measurements on 6043 102-130-nucleotide diverse RNA constructs that were themselves solicited through crowdsourcing on the RNA design platform Eterna. The entire experiment was completed in less than 6 months, and 41% of nucleotide-level predictions from the winning model were within experimental error of the ground truth measurement. Furthermore, these models generalized to blindly predicting orthogonal degradation data on much longer mRNA molecules (504-1588 nucleotides) with improved accuracy compared to previously published models. Top teams integrated natural language processing architectures and data augmentation techniques with predictions from previous dynamic programming models for RNA secondary structure. These results indicate that such models are capable of representing in-line hydrolysis with excellent accuracy, supporting their use for designing stabilized messenger RNAs. The integration of two crowdsourcing platforms, one for data set creation and another for machine learning, may be fruitful for other urgent problems that demand scientific discovery on rapid timescales.
Submitted 22 April, 2022; v1 submitted 14 October, 2021;
originally announced October 2021.
-
On the Convergence of Differentially Private Federated Learning on Non-Lipschitz Objectives, and with Normalized Client Updates
Authors:
Rudrajit Das,
Abolfazl Hashemi,
Sujay Sanghavi,
Inderjit S. Dhillon
Abstract:
There is a dearth of convergence results for differentially private federated learning (FL) with non-Lipschitz objective functions (i.e., when gradient norms are not bounded). The primary reason for this is that the clipping operation (i.e., projection onto an $\ell_2$ ball of a fixed radius called the clipping threshold) for bounding the sensitivity of the average update to each client's update introduces bias depending on the clipping threshold and the number of local steps in FL, and analyzing this is not easy. For Lipschitz functions, the Lipschitz constant serves as a trivial clipping threshold with zero bias. However, Lipschitzness does not hold in many practical settings; moreover, verifying it and computing the Lipschitz constant is hard. Thus, the choice of the clipping threshold is non-trivial and requires a lot of tuning in practice. In this paper, we provide the first convergence result for private FL on smooth \textit{convex} objectives \textit{for a general clipping threshold} -- \textit{without assuming Lipschitzness}. We also look at a simpler alternative to clipping (for bounding sensitivity) which is \textit{normalization} -- where we use only a scaled version of the unit vector along the client updates, completely discarding the magnitude information. The resulting normalization-based private FL algorithm is theoretically shown to have better convergence than its clipping-based counterpart on smooth convex functions. We corroborate our theory with synthetic experiments as well as experiments on benchmarking datasets.
Submitted 15 April, 2022; v1 submitted 13 June, 2021;
originally announced June 2021.
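The two ways of bounding sensitivity contrasted in the abstract above, clipping and normalization, can be sketched as follows. This is a minimal illustration under stated assumptions: the function names (`clip`, `normalize`, `private_mean`), the small `eps` guard against division by zero, and the noise scale `sigma * c / n` are hypothetical choices for the example and are not taken from the paper.

```python
import numpy as np

def clip(u, c):
    """Project u onto the l2 ball of radius c (the clipping threshold)."""
    norm = np.linalg.norm(u)
    return u if norm <= c else u * (c / norm)

def normalize(u, c, eps=1e-12):
    """Keep only the direction of u, rescaled to length c; magnitude is discarded."""
    return u * (c / (np.linalg.norm(u) + eps))

def private_mean(updates, c, sigma, bound, rng):
    """Average the bounded client updates and add Gaussian noise.
    The noise scale sigma * c / n is an assumed calibration for illustration."""
    bounded = np.stack([bound(u, c) for u in updates])
    noise = rng.normal(scale=sigma * c / len(updates), size=bounded.shape[1])
    return bounded.mean(axis=0) + noise

rng = np.random.default_rng(0)
updates = [rng.normal(size=4) * s for s in (0.1, 1.0, 10.0)]  # heterogeneous magnitudes
avg_clip = private_mean(updates, c=1.0, sigma=0.5, bound=clip, rng=rng)
avg_norm = private_mean(updates, c=1.0, sigma=0.5, bound=normalize, rng=rng)
```

With clipping, small updates pass through unchanged while large ones are shrunk, so the bias depends on the threshold. With normalization, every client contributes a vector of the same length regardless of its original magnitude.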
-
A Distance Covariance-based Kernel for Nonlinear Causal Clustering in Heterogeneous Populations
Authors:
Alex Markham,
Richeek Das,
Moritz Grosse-Wentrup
Abstract:
We consider the problem of causal structure learning in the setting of heterogeneous populations, i.e., populations in which a single causal structure does not adequately represent all population members, as is common in biological and social sciences. To this end, we introduce a distance covariance-based kernel designed specifically to measure the similarity between the underlying nonlinear causal structures of different samples. Indeed, we prove that the corresponding feature map is a statistically consistent estimator of nonlinear independence structure, rendering the kernel itself a statistical test for the hypothesis that sets of samples come from different generating causal structures. Even stronger, we prove that the kernel space is isometric to the space of causal ancestral graphs, so that distance between samples in the kernel space is guaranteed to correspond to distance between their generating causal structures. This kernel thus enables us to perform clustering to identify the homogeneous subpopulations, for which we can then learn causal structures using existing methods. Though we focus on the theoretical aspects of the kernel, we also evaluate its performance on synthetic data and demonstrate its use on a real gene expression data set.
Submitted 18 February, 2022; v1 submitted 7 June, 2021;
originally announced June 2021.
-
Faster Non-Convex Federated Learning via Global and Local Momentum
Authors:
Rudrajit Das,
Anish Acharya,
Abolfazl Hashemi,
Sujay Sanghavi,
Inderjit S. Dhillon,
Ufuk Topcu
Abstract:
We propose \texttt{FedGLOMO}, a novel federated learning (FL) algorithm with an iteration complexity of $\mathcal{O}(ε^{-1.5})$ to converge to an $ε$-stationary point (i.e., $\mathbb{E}[\|\nabla f(\bm{x})\|^2] \leq ε$) for smooth non-convex functions -- under arbitrary client heterogeneity and compressed communication -- compared to the $\mathcal{O}(ε^{-2})$ complexity of most prior works. Our key algorithmic idea that enables achieving this improved complexity is based on the observation that the convergence in FL is hampered by two sources of high variance: (i) the global server aggregation step with multiple local updates, exacerbated by client heterogeneity, and (ii) the noise of the local client-level stochastic gradients. By modeling the server aggregation step as a generalized gradient-type update, we propose a variance-reducing momentum-based global update at the server, which when applied in conjunction with variance-reduced local updates at the clients, enables \texttt{FedGLOMO} to enjoy an improved convergence rate. Moreover, we derive our results under a novel and more realistic client-heterogeneity assumption which we verify empirically -- unlike prior assumptions that are hard to verify. Our experiments illustrate the intrinsic variance reduction effect of \texttt{FedGLOMO}, which implicitly suppresses client-drift in heterogeneous data distribution settings and promotes communication efficiency.
Submitted 24 October, 2021; v1 submitted 7 December, 2020;
originally announced December 2020.
-
On the Benefits of Multiple Gossip Steps in Communication-Constrained Decentralized Optimization
Authors:
Abolfazl Hashemi,
Anish Acharya,
Rudrajit Das,
Haris Vikalo,
Sujay Sanghavi,
Inderjit Dhillon
Abstract:
In decentralized optimization, it is common algorithmic practice to have nodes interleave (local) gradient descent iterations with gossip (i.e. averaging over the network) steps. Motivated by the training of large-scale machine learning models, it is also increasingly common to require that messages be {\em lossy compressed} versions of the local parameters. In this paper, we show that, in such compressed decentralized optimization settings, there are benefits to having {\em multiple} gossip steps between subsequent gradient iterations, even when the cost of doing so is appropriately accounted for, e.g. by reducing the precision of the compressed information. In particular, we show that having $O(\log\frac{1}ε)$ gradient iterations with constant step size, and $O(\log\frac{1}ε)$ gossip steps between every pair of these iterations, enables convergence to within $ε$ of the optimal value for smooth non-convex objectives satisfying the Polyak-Łojasiewicz condition. This result also holds for smooth strongly convex objectives. To our knowledge, this is the first work that derives convergence results for nonconvex optimization under arbitrary communication compression.
Submitted 20 November, 2020;
originally announced November 2020.
-
On the Separability of Classes with the Cross-Entropy Loss Function
Authors:
Rudrajit Das,
Subhasis Chaudhuri
Abstract:
In this paper, we focus on the separability of classes with the cross-entropy loss function for classification problems by theoretically analyzing the intra-class distance and inter-class distance (i.e. the distance between any two points belonging to the same class and different classes, respectively) in the feature space, i.e. the space of representations learnt by neural networks. Specifically, we consider an arbitrary network architecture having a fully connected final layer with Softmax activation and trained using the cross-entropy loss. We derive expressions for the value and the distribution of the squared L2 norm of the product of a network dependent matrix and a random intra-class and inter-class distance vector (i.e. the vector between any two points belonging to the same class and different classes), respectively, in the learnt feature space (or the transformation of the original data) just before Softmax activation, as a function of the cross-entropy loss value. The main result of our analysis is the derivation of a lower bound for the probability with which the inter-class distance is more than the intra-class distance in this feature space, as a function of the loss value. We do so by leveraging some empirical statistical observations with mild assumptions and sound theoretical analysis. As per intuition, the probability with which the inter-class distance is more than the intra-class distance decreases as the loss value increases, i.e. the classes are better separated when the loss value is low. To the best of our knowledge, this is the first work of theoretical nature trying to explain the separability of classes in the feature space learnt by neural networks trained with the cross-entropy loss function.
Submitted 15 September, 2019;
originally announced September 2019.
-
Optimal Transport-based Alignment of Learned Character Representations for String Similarity
Authors:
Derek Tam,
Nicholas Monath,
Ari Kobren,
Aaron Traylor,
Rajarshi Das,
Andrew McCallum
Abstract:
String similarity models are vital for record linkage, entity resolution, and search. In this work, we present STANCE, a learned model for computing the similarity of two strings. Our approach encodes the characters of each string, aligns the encodings using Sinkhorn Iteration (alignment is posed as an instance of optimal transport) and scores the alignment with a convolutional neural network. We evaluate STANCE's ability to detect whether two strings can refer to the same entity, a task we term alias detection. We construct five new alias detection datasets (and make them publicly available). We show that STANCE or one of its variants outperforms both state-of-the-art and classic, parameter-free similarity models on four of the five datasets. We also demonstrate STANCE's ability to improve downstream tasks by applying it to an instance of cross-document coreference and show that it leads to a 2.8 point improvement in B^3 F1 over the previous state-of-the-art approach.
Submitted 23 July, 2019;
originally announced July 2019.
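The alignment step mentioned in the abstract above, Sinkhorn iteration for entropy-regularized optimal transport, can be sketched as follows. This is a generic illustration rather than the STANCE implementation: the random vectors stand in for learned character encodings, the uniform marginals, the normalized squared-distance cost, and the regularization strength `reg` are all assumptions, and the CNN scoring stage is omitted.

```python
import numpy as np

def sinkhorn(cost, reg=0.5, iters=200):
    """Entropy-regularized optimal transport with uniform marginals.
    Returns a transport plan whose rows sum to 1/n and columns to 1/m."""
    n, m = cost.shape
    a = np.full(n, 1.0 / n)   # mass on each character of string 1
    b = np.full(m, 1.0 / m)   # mass on each character of string 2
    K = np.exp(-cost / reg)   # Gibbs kernel
    u = np.ones(n)
    for _ in range(iters):
        v = b / (K.T @ u)     # rescale to match column marginals
        u = a / (K @ v)       # rescale to match row marginals
    return u[:, None] * K * v[None, :]

# Stand-in character encodings (assumed): 5 and 7 characters, 8-dim each.
rng = np.random.default_rng(0)
enc1 = rng.normal(size=(5, 8))
enc2 = rng.normal(size=(7, 8))

# Pairwise squared Euclidean distances, scaled to [0, 1] for numerical stability.
cost = ((enc1[:, None, :] - enc2[None, :, :]) ** 2).sum(axis=-1)
cost = cost / cost.max()

plan = sinkhorn(cost)
```

The resulting `plan` is a soft alignment matrix between the characters of the two strings; in the approach described above, such an alignment is what a downstream network would score.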
-
Generative x-vectors for text-independent speaker verification
Authors:
Longting Xu,
Rohan Kumar Das,
Emre Yılmaz,
Jichen Yang,
Haizhou Li
Abstract:
Speaker verification (SV) systems using deep neural network embeddings, the so-called x-vector systems, are becoming popular because their performance is superior to that of i-vector systems. The fusion of these systems provides improved performance, benefiting both from the discriminatively trained x-vectors and from the generative i-vectors, which capture distinct speaker characteristics. In this paper, we propose a novel method, called the generative x-vector, that combines the complementary information of the i-vector and the x-vector. The generative x-vector utilizes a transformation model learned from the i-vector and x-vector representations of the background data. Canonical correlation analysis is applied to derive this transformation model, which is later used to transform the standard x-vectors of the enrollment and test segments to the corresponding generative x-vectors. The SV experiments performed on the NIST SRE 2010 dataset demonstrate that the system using generative x-vectors provides considerably better performance than the baseline i-vector and x-vector systems. Furthermore, the generative x-vectors outperform the fusion of i-vector and x-vector systems for long-duration utterances, while yielding comparable results for short-duration utterances.
Submitted 17 September, 2018;
originally announced September 2018.
-
Sparse Kernel PCA for Outlier Detection
Authors:
Rudrajit Das,
Aditya Golatkar,
Suyash P. Awate
Abstract:
In this paper, we propose a new method to perform Sparse Kernel Principal Component Analysis (SKPCA) and also mathematically analyze the validity of SKPCA. We formulate SKPCA as a constrained optimization problem with elastic net regularization (Hastie et al.) in kernel feature space and solve it. We consider outlier detection (where KPCA is employed) as an application for SKPCA, using the RBF kernel. We test it on 5 real-world datasets and show that by using just 4% (or even less) of the principal components (PCs), where each PC has on average less than 12% non-zero elements in the worst case among all 5 datasets, we are able to nearly match and in 3 datasets even outperform KPCA. We also compare the performance of our method with a recently proposed method for SKPCA by Wang et al. and show that our method performs better in terms of both accuracy and sparsity. We also provide a novel probabilistic proof to justify the existence of sparse solutions for KPCA using the RBF kernel. To the best of our knowledge, this is the first attempt at theoretically analyzing the validity of SKPCA.
Submitted 13 September, 2018; v1 submitted 7 September, 2018;
originally announced September 2018.
-
SentRNA: Improving computational RNA design by incorporating a prior of human design strategies
Authors:
Jade Shi,
Rhiju Das,
Vijay S. Pande
Abstract:
Solving the RNA inverse folding problem is a critical prerequisite to RNA design, an emerging field in bioengineering with a broad range of applications from reaction catalysis to cancer therapy. Although significant progress has been made in developing machine-based inverse RNA folding algorithms, current approaches still have difficulty designing sequences for large or complex targets. On the other hand, human players of the online RNA design game EteRNA have consistently shown superior performance in this regard, being able to readily design sequences for targets that are challenging for machine algorithms. Here we present a novel approach to the RNA design problem, SentRNA, a design agent consisting of a fully-connected neural network trained end-to-end using human-designed RNA sequences. We show that through this approach, SentRNA can solve complex targets previously unsolvable by any machine-based approach and achieve state-of-the-art performance on two separate challenging test sets. Our results demonstrate that incorporating human design strategies into a design algorithm can significantly boost machine performance and suggest a new paradigm for machine-based RNA design.
Submitted 5 March, 2019; v1 submitted 8 March, 2018;
originally announced March 2018.
-
Perspective: Energy Landscapes for Machine Learning
Authors:
Andrew J. Ballard,
Ritankar Das,
Stefano Martiniani,
Dhagash Mehta,
Levent Sagun,
Jacob D. Stevenson,
David J. Wales
Abstract:
Machine learning techniques are being increasingly used as flexible non-linear fitting and prediction tools in the physical sciences. Fitting functions that exhibit multiple solutions as local minima can be analysed in terms of the corresponding machine learning landscape. Methods to explore and visualise molecular potential energy landscapes can be applied to these machine learning landscapes to gain new insight into the solution space involved in training and the nature of the corresponding predictions. In particular, we can define quantities analogous to molecular structure, thermodynamics, and kinetics, and relate these emergent properties to the structure of the underlying landscape. This Perspective aims to describe these analogies with examples from recent applications, and suggest avenues for new interdisciplinary research.
Submitted 22 March, 2017;
originally announced March 2017.