Mitigating LLM Hallucinations via Conformal Abstention

Authors: Yasin Abbasi Yadkori, Ilja Kuzborskij, David Stutz, András György, Adam Fisch, Arnaud Doucet, Iuliya Beloshapka, Wei-Hung Weng, Yao-Yuan Yang, Csaba Szepesvári, Ali Taylan Cemgil, Nenad Tomasev

Abstract: We develop a principled procedure for determining when a large language model (LLM) should abstain from responding (e.g., by saying "I don't know") in a general domain, instead of resorting to possibly "hallucinating" a nonsensical or incorrect answer. Building on earlier approaches that use self-consistency as a more reliable measure of model confidence, we propose using the LLM itself to self-evaluate the similarity between each of its sampled responses for a given query. We then further leverage conformal prediction techniques to develop an abstention procedure that benefits from rigorous theoretical guarantees on the hallucination rate (error rate). Experimentally, our resulting conformal abstention method reliably bounds the hallucination rate on various closed-book, open-domain generative question answering datasets, while also maintaining a significantly less conservative abstention rate on a dataset with long responses (Temporal Sequences) compared to baselines using log-probability scores to quantify uncertainty, and achieving comparable performance on a dataset with short answers (TriviaQA). To evaluate the experiments automatically, one needs to determine if two responses are equivalent given a question. Following standard practice, we use a thresholded similarity function to determine if two responses match, but also provide a method for calibrating the threshold based on conformal prediction, with theoretical guarantees on the accuracy of the match prediction, which might be of independent interest.

Submitted 4 April, 2024; originally announced May 2024.

arXiv:2307.09302

Conformal prediction under ambiguous ground truth

Authors: David Stutz, Abhijit Guha Roy, Tatiana Matejovicova, Patricia Strachan, Ali Taylan Cemgil, Arnaud Doucet

Abstract: Conformal Prediction (CP) enables rigorous uncertainty quantification by constructing a prediction set $C(X)$ satisfying $\mathbb{P}(Y \in C(X))\geq 1-\alpha$ for a user-chosen $\alpha \in [0,1]$ by relying on calibration data $(X_1,Y_1),...,(X_n,Y_n)$ from $\mathbb{P}=\mathbb{P}^{X} \otimes \mathbb{P}^{Y|X}$. It is typically implicitly assumed that $\mathbb{P}^{Y|X}$ is the "true" posterior label distribution. However, in many real-world scenarios, the labels $Y_1,...,Y_n$ are obtained by aggregating expert opinions using a voting procedure, resulting in a one-hot distribution $\mathbb{P}_{vote}^{Y|X}$. For such "voted" labels, CP guarantees are thus w.r.t. $\mathbb{P}_{vote}=\mathbb{P}^X \otimes \mathbb{P}_{vote}^{Y|X}$ rather than the true distribution $\mathbb{P}$. In cases with unambiguous ground truth labels, the distinction between $\mathbb{P}_{vote}$ and $\mathbb{P}$ is irrelevant. However, when experts do not agree because of ambiguous labels, approximating $\mathbb{P}^{Y|X}$ with a one-hot distribution $\mathbb{P}_{vote}^{Y|X}$ ignores this uncertainty. In this paper, we propose to leverage expert opinions to approximate $\mathbb{P}^{Y|X}$ using a non-degenerate distribution $\mathbb{P}_{agg}^{Y|X}$. We develop Monte Carlo CP procedures which provide guarantees w.r.t. $\mathbb{P}_{agg}=\mathbb{P}^X \otimes \mathbb{P}_{agg}^{Y|X}$ by sampling multiple synthetic pseudo-labels from $\mathbb{P}_{agg}^{Y|X}$ for each calibration example $X_1,...,X_n$.
In a case study of skin condition classification with significant disagreement among expert annotators, we show that applying CP w.r.t. $\mathbb{P}_{vote}$ under-covers expert annotations: calibrated for $72\%$ coverage, it falls short by $10\%$ on average; our Monte Carlo CP closes this gap both empirically and theoretically.

Submitted 24 October, 2023; v1 submitted 18 July, 2023; originally announced July 2023.
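
The coverage guarantee $\mathbb{P}(Y \in C(X))\geq 1-\alpha$ quoted in the abstract above is usually obtained by split conformal calibration. The following is a minimal sketch of that standard procedure only, not of the Monte Carlo CP method the paper proposes. The nonconformity score (one minus the probability of the true label), the function names, and the toy numbers are illustrative assumptions.

```python
import math


def conformal_threshold(scores, alpha):
    """Return the ceil((n + 1) * (1 - alpha))-th smallest calibration score."""
    n = len(scores)
    k = math.ceil((n + 1) * (1 - alpha))
    if k > n:
        # Too few calibration points for this alpha: every label must be kept.
        return math.inf
    return sorted(scores)[k - 1]


def prediction_set(probs, threshold):
    """Keep every label whose score 1 - p does not exceed the threshold."""
    return [label for label, p in enumerate(probs) if 1.0 - p <= threshold]


# Hypothetical calibration data: predicted class probabilities and true labels.
cal_probs = [
    [0.70, 0.20, 0.10],
    [0.10, 0.80, 0.10],
    [0.30, 0.30, 0.40],
    [0.60, 0.30, 0.10],
    [0.20, 0.50, 0.30],
    [0.10, 0.20, 0.70],
    [0.90, 0.05, 0.05],
]
cal_labels = [0, 1, 2, 1, 1, 2, 0]

# Nonconformity score of each calibration example: 1 - p(true label).
cal_scores = [1.0 - p[y] for p, y in zip(cal_probs, cal_labels)]

threshold = conformal_threshold(cal_scores, alpha=0.25)
test_set = prediction_set([0.50, 0.45, 0.05], threshold)
```

Under exchangeability of calibration and test examples, sets built this way contain the true label with probability at least $1-\alpha$; the paper's point is that this guarantee is only with respect to the distribution the calibration labels were drawn from.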

arXiv:2307.02191

Evaluating AI systems under uncertain ground truth: a case study in dermatology

Authors: David Stutz, Ali Taylan Cemgil, Abhijit Guha Roy, Tatiana Matejovicova, Melih Barsbey, Patricia Strachan, Mike Schaekermann, Jan Freyberg, Rajeev Rikhye, Beverly Freeman, Javier Perez Matos, Umesh Telang, Dale R. Webster, Yuan Liu, Greg S. Corrado, Yossi Matias, Pushmeet Kohli, Yun Liu, Arnaud Doucet, Alan Karthikesalingam

Abstract: For safety, AI systems in health undergo thorough evaluations before deployment, validating their predictions against a ground truth that is assumed certain. However, this is actually not the case and the ground truth may be uncertain. Unfortunately, this is largely ignored in standard evaluation of AI models but can have severe consequences such as overestimating the future performance. To avoid this, we measure the effects of ground truth uncertainty, which we assume decomposes into two main components: annotation uncertainty which stems from the lack of reliable annotations, and inherent uncertainty due to limited observational information. This ground truth uncertainty is ignored when estimating the ground truth by deterministically aggregating annotations, e.g., by majority voting or averaging. In contrast, we propose a framework where aggregation is done using a statistical model. Specifically, we frame aggregation of annotations as posterior inference of so-called plausibilities, representing distributions over classes in a classification setting, subject to a hyper-parameter encoding annotator reliability. Based on this model, we propose a metric for measuring annotation uncertainty and provide uncertainty-adjusted metrics for performance evaluation. We present a case study applying our framework to skin condition classification from images where annotations are provided in the form of differential diagnoses.
The deterministic adjudication process called inverse rank normalization (IRN) from previous work ignores ground truth uncertainty in evaluation. Instead, we present two alternative statistical models: a probabilistic version of IRN and a Plackett-Luce-based model. We find that a large portion of the dataset exhibits significant ground truth uncertainty and standard IRN-based evaluation severely over-estimates performance without providing uncertainty estimates.

Submitted 5 July, 2023; originally announced July 2023.

arXiv:2302.00049

Transformers Meet Directed Graphs

Authors: Simon Geisler, Yujia Li, Daniel Mankowitz, Ali Taylan Cemgil, Stephan Günnemann, Cosmin Paduraru

Abstract: Transformers were originally proposed as a sequence-to-sequence model for text but have become vital for a wide range of modalities, including images, audio, video, and undirected graphs. However, transformers for directed graphs are a surprisingly underexplored topic, despite their applicability to ubiquitous domains, including source code and logic circuits. In this work, we propose two direction- and structure-aware positional encodings for directed graphs: (1) the eigenvectors of the Magnetic Laplacian, a direction-aware generalization of the combinatorial Laplacian; (2) directional random walk encodings. Empirically, we show that the extra directionality information is useful in various downstream tasks, including correctness testing of sorting networks and source code understanding. Together with a data-flow-centric graph construction, our model outperforms the prior state of the art on the Open Graph Benchmark Code2 by a relative 14.7%.

Submitted 31 August, 2023; v1 submitted 31 January, 2023; originally announced February 2023.

Comments: 29 pages
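
The abstract above describes the Magnetic Laplacian as a direction-aware generalization of the combinatorial Laplacian. The sketch below builds it for a small directed graph under one commonly used definition, $L^{(q)} = D_s - A_s \odot \exp(i\,2\pi q\,(A - A^\top))$, where $A_s$ is the symmetrized adjacency matrix and $D_s$ its degree matrix. That definition, the potential parameter `q`, and the function name are assumptions made for illustration and may differ in detail from the paper (for example in normalization); the graph is assumed to have no self-loops.

```python
import cmath
import math


def magnetic_laplacian(adj, q=0.25):
    """Build the (unnormalized) Magnetic Laplacian of a directed graph.

    adj[u][v] is 1 if there is an edge u -> v, else 0 (no self-loops assumed).
    """
    n = len(adj)
    # Symmetrized adjacency: an undirected edge wherever either direction exists.
    sym = [[1.0 if (adj[u][v] or adj[v][u]) else 0.0 for v in range(n)]
           for u in range(n)]
    lap = [[0j] * n for _ in range(n)]
    for u in range(n):
        for v in range(n):
            if u == v:
                # Diagonal entry: degree in the symmetrized graph.
                lap[u][v] = complex(sum(sym[u]))
            else:
                # Edge direction is encoded in the sign of the complex phase.
                phase = 2.0 * math.pi * q * (adj[u][v] - adj[v][u])
                lap[u][v] = -sym[u][v] * cmath.exp(1j * phase)
    return lap


# Directed path 0 -> 1 -> 2.
path = [[0, 1, 0],
        [0, 0, 1],
        [0, 0, 0]]
lap_directed = magnetic_laplacian(path, q=0.25)
lap_undirected = magnetic_laplacian(path, q=0.0)
```

The matrix is Hermitian for any `q`, so its eigenvalues are real, and with `q = 0` it reduces to the combinatorial Laplacian of the underlying undirected graph.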

arXiv:2209.02270

doi 10.1016/j.pmcj.2022.101554

An Indoor Localization Dataset and Data Collection Framework with High Precision Position Annotation

Authors: F. Serhan Daniş, A. Teoman Naskali, A. Taylan Cemgil, Cem Ersoy

Abstract: We introduce a novel technique and an associated high resolution dataset that aims to precisely evaluate wireless signal based indoor positioning algorithms. The technique implements an augmented reality (AR) based positioning system that is used to annotate the wireless signal parameter data samples with high precision position data. We track the position of a practical and low cost navigable setup of cameras and a Bluetooth Low Energy (BLE) beacon in an area decorated with AR markers. We maximize the performance of the AR-based localization by using a redundant number of markers. Video streams captured by the cameras are subjected to a series of marker recognition, subset selection and filtering operations to yield highly precise pose estimations. Our results show that we can reduce the positional error of the AR localization system to a rate under 0.05 meters. The position data are then used to annotate the BLE data that are captured simultaneously by the sensors stationed in the environment, hence, constructing a wireless signal data set with the ground truth, which allows a wireless signal based localization system to be evaluated accurately.

Submitted 6 September, 2022; originally announced September 2022.

Comments: 30 pages

Journal ref: F. Serhan Daniş, A. Teoman Naskali, A. Taylan Cemgil, Cem Ersoy, "An indoor localization dataset and data collection framework with high precision position annotation", Pervasive and Mobile Computing, Volume 81, 101554, 2022

arXiv:2110.09192

Learning Optimal Conformal Classifiers

Authors: David Stutz, Krishnamurthy Dvijotham, Ali Taylan Cemgil, Arnaud Doucet

Abstract: Modern deep learning based classifiers show very high accuracy on test data but this does not provide sufficient guarantees for safe deployment, especially in high-stakes AI applications such as medical diagnosis. Usually, predictions are obtained without a reliable uncertainty estimate or a formal guarantee. Conformal prediction (CP) addresses these issues by using the classifier's predictions, e.g., its probability estimates, to predict confidence sets containing the true class with a user-specified probability. However, using CP as a separate processing step after training prevents the underlying model from adapting to the prediction of confidence sets. Thus, this paper explores strategies to differentiate through CP during training with the goal of training the model with the conformal wrapper end-to-end. In our approach, conformal training (ConfTr), we specifically "simulate" conformalization on mini-batches during training. Compared to standard training, ConfTr reduces the average confidence set size (inefficiency) of state-of-the-art CP methods applied after training. Moreover, it allows one to "shape" the confidence sets predicted at test time, which is difficult for standard CP. On experiments with several datasets, we show ConfTr can influence how inefficiency is distributed across classes, or guide the composition of confidence sets in terms of the included classes, while retaining the guarantees offered by CP.

Submitted 6 May, 2022; v1 submitted 18 October, 2021; originally announced October 2021.

Comments: ICLR 2022

arXiv:2012.12862

Towards Fair Personalization by Avoiding Feedback Loops

Authors: Gökhan Çapan, Özge Bozal, İlker Gündoğdu, Ali Taylan Cemgil

Abstract: Self-reinforcing feedback loops are both cause and effect of over- and/or under-presentation of some content in interactive recommender systems. This leads to erroneous user preference estimates, namely, overestimation of over-presented content, while violating each alternative's right to be presented; we define a fair system as one that respects this right. We consider two models that explicitly incorporate or ignore the systematic and limited exposure to alternatives. By simulations, we demonstrate that ignoring the systematic presentations overestimates promoted options and underestimates censored alternatives. Simply conditioning on the limited exposure is a remedy for these biases.

Submitted 20 December, 2020; originally announced December 2020.

Comments: NeurIPS 2019 Workshop on Human-Centric Machine Learning

arXiv:2012.03715

Autoencoding Variational Autoencoder

Authors: A. Taylan Cemgil, Sumedh Ghaisas, Krishnamurthy Dvijotham, Sven Gowal, Pushmeet Kohli

Abstract: Does a Variational AutoEncoder (VAE) consistently encode typical samples generated from its decoder? This paper shows that the perhaps surprising answer to this question is "No"; a (nominally trained) VAE does not necessarily amortize inference for typical samples that it is capable of generating. We study the implications of this behaviour on the learned representations and also the consequences of fixing it by introducing a notion of self-consistency. Our approach hinges on an alternative construction of the variational approximation distribution to the true posterior of an extended VAE model with a Markov chain alternating between the encoder and the decoder. The method can be used to train a VAE model from scratch or, given an already trained VAE, it can be run as a post-processing step in an entirely self-supervised way without access to the original training data. Our experimental analysis reveals that encoders trained with our self-consistency approach lead to representations that are robust (insensitive) to perturbations in the input introduced by adversarial attacks. We provide experimental results on the ColorMnist and CelebA benchmark datasets that quantify the properties of the learned representations and compare the approach with a baseline that is specifically trained for the desired property.

Submitted 7 December, 2020; originally announced December 2020.

Comments: NeurIPS 2020

arXiv:2010.01550

Intermittent Demand Forecasting with Renewal Processes

Authors: Ali Caner Turkmen, Tim Januschowski, Yuyang Wang, Ali Taylan Cemgil

Abstract: Intermittency is a common and challenging problem in demand forecasting. We introduce a new, unified framework for building intermittent demand forecasting models, which incorporates and generalizes existing methods in several directions. Our framework is based on extensions of well-established model-based methods to discrete-time renewal processes, which can parsimoniously account for patterns such as aging, clustering and quasi-periodicity in demand arrivals. The connection to discrete-time renewal processes allows not only for a principled extension of Croston-type models, but also for a natural inclusion of neural network based models, by replacing exponential smoothing with a recurrent neural network. We also demonstrate that modeling continuous-time demand arrivals, i.e., with a temporal point process, is possible via a trivial extension of our framework. This leads to more flexible modeling in scenarios where data of individual purchase orders are directly available with granular timestamps. Complementing this theoretical advancement, we demonstrate the efficacy of our framework for forecasting practice via an extensive empirical study on standard intermittent demand data sets, in which we report predictive accuracy in a variety of scenarios that compares favorably to the state of the art.

Submitted 4 October, 2020; originally announced October 2020.
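
For readers unfamiliar with the Croston-type models and exponential smoothing mentioned in the abstract above, here is a minimal sketch of the classical Croston forecast. It is the textbook baseline only, not the renewal-process framework of the paper; the function name, the smoothing weight `alpha`, and the initialization from the first nonzero demand are illustrative assumptions.

```python
def croston_forecast(demand, alpha=0.1):
    """Classical Croston forecast of the per-period demand rate.

    Nonzero demand sizes and the intervals between them are smoothed
    separately with exponential smoothing; the forecast is their ratio.
    """
    size = None        # smoothed nonzero demand size
    interval = None    # smoothed number of periods between nonzero demands
    periods_since = 1  # periods elapsed since the last nonzero demand
    for d in demand:
        if d > 0:
            if size is None:
                # Initialize both estimates from the first nonzero demand.
                size, interval = float(d), float(periods_since)
            else:
                size += alpha * (d - size)
                interval += alpha * (periods_since - interval)
            periods_since = 1
        else:
            periods_since += 1
    if size is None:
        return 0.0  # no demand observed at all
    return size / interval


# Hypothetical intermittent series: 4 units every third period.
rate = croston_forecast([0, 0, 4, 0, 0, 4, 0, 0, 4], alpha=0.1)
```

The smoothed interval is the quantity that a renewal-process view treats as a random inter-arrival time, which is the connection the abstract draws.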

arXiv:1909.03280

doi 10.1016/j.jqsrt.2019.106643

Gaussian Process and Design of Experiments for Surrogate Modeling of Optical Properties of Fractal Aggregates

Authors: Ozan Burak Ericok, Atay Kaan Ozbek, Ali Taylan Cemgil, Hakan Erturk

Abstract: A systematic approach based on the principles of supervised learning and design of experiments concepts is introduced to build a surrogate model for estimating the optical properties of fractal aggregates. The surrogate model is built on Gaussian process (GP) regression, and the input points for the GP regression are sampled with an adaptive sequential design algorithm. The covariance functions used are the squared exponential covariance function and the Matern covariance function, both with Automatic Relevance Determination (ARD). The optical property considered is extinction efficiency of soot aggregates. The strengths and weaknesses of the proposed methodology are first tested with RDG-FA. Then, surrogate models are developed for the sampled points, for which the extinction efficiency is calculated by DDA. Four different uniformly gridded databases are also constructed for comparison. It is observed that the estimations based on the surrogate model designed with Matern covariance functions are superior to the estimations based on databases in terms of the accuracy of the estimations and the total number of input points they require. Finally, a preliminary surrogate model for $S_{11}$ is built to correct RDG-FA predictions with the aim of combining the speed of RDG-FA with the accuracy of DDA.

Submitted 7 September, 2019; originally announced September 2019.

Comments: 19 pages, 8 figures

arXiv:1908.05640

A Bayesian Choice Model for Eliminating Feedback Loops

Authors: Gökhan Çapan, Ilker Gündoğdu, Ali Caner Türkmen, Çağrı Sofuoğlu, Ali Taylan Cemgil

Abstract: Self-reinforcing feedback loops in personalization systems are typically caused by users choosing from a limited set of alternatives presented systematically based on previous choices. We propose a Bayesian choice model built on Luce axioms that explicitly accounts for users' limited exposure to alternatives. Our model is fair (it does not impose negative bias towards unpresented alternatives) and practical (preference estimates are accurately inferred upon observing a small number of interactions). It also allows efficient sampling, leading to a straightforward online presentation mechanism based on Thompson sampling. Our approach achieves low regret in learning to present upon exploration of only a small fraction of possible presentations. The proposed structure can be reused as a building block in interactive systems, e.g., recommender systems, free of feedback loops.

Submitted 21 August, 2019; v1 submitted 15 August, 2019; originally announced August 2019.

arXiv:1903.04478

Bayesian Allocation Model: Inference by Sequential Monte Carlo for Nonnegative Tensor Factorizations and Topic Models using Polya Urns

Authors: Ali Taylan Cemgil, Mehmet Burak Kurutmaz, Sinan Yildirim, Melih Barsbey, Umut Simsekli

Abstract: We introduce a dynamic generative model, Bayesian allocation model (BAM), which establishes explicit connections between nonnegative tensor factorization (NTF), graphical models of discrete probability distributions and their Bayesian extensions, and topic models such as the latent Dirichlet allocation. BAM is based on a Poisson process, whose events are marked by using a Bayesian network, where the conditional probability tables of this network are then integrated out analytically. We show that the resulting marginal process turns out to be a Polya urn, an integer valued self-reinforcing process. This urn process, which we name a Polya-Bayes process, obeys certain conditional independence properties that provide further insight about the nature of NTF. These insights also let us develop space efficient simulation algorithms that respect the potential sparsity of data: we propose a class of sequential importance sampling algorithms for computing NTF and approximating their marginal likelihood, which would be useful for model selection. The resulting methods can also be viewed as a model scoring method for topic models and discrete Bayesian networks with hidden variables. The new algorithms have favourable properties in the sparse data regime when contrasted with variational algorithms that become more accurate when the total sum of the elements of the observed tensor goes to infinity. We illustrate the performance on several examples and numerically study the behaviour of the algorithms for various data regimes.

Submitted 11 March, 2019; originally announced March 2019.

Comments: 70 pages, 16 figures

arXiv:1812.01502

Parallelising Particle Filters with Butterfly Interactions

Authors: Kari Heine, Nick Whiteley, A. Taylan Cemgil

Abstract: The bootstrap particle filter (BPF) is the cornerstone of many popular algorithms used for solving inference problems involving time series that are observed through noisy measurements in a non-linear and non-Gaussian context. The long term stability of BPF arises from particle interactions which in the context of modern parallel computing systems typically means that particle information needs to be communicated between processing elements, which makes parallel implementation of BPF nontrivial. In this paper we show that it is possible to constrain the interactions in a way which, under some assumptions, enables the reduction of the cost of communicating the particle information while still preserving the consistency and the long term stability of the BPF. Numerical experiments demonstrate that although the imposed constraints introduce additional error, the proposed method shows potential to be the method of choice in certain settings.

Submitted 4 December, 2018; originally announced December 2018.

Comments: 35 pages, 4 figures
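
The particle interactions that the abstract above refers to arise in the resampling step of a bootstrap particle filter, where every particle may be replaced by a copy of any other. The following is a minimal sequential sketch of a standard BPF, not the butterfly-interaction scheme of the paper. The one-dimensional Gaussian random-walk model, the parameter names, and the use of multinomial resampling are illustrative assumptions.

```python
import math
import random


def bootstrap_particle_filter(observations, n_particles=500,
                              process_std=1.0, obs_std=1.0, seed=0):
    """Return filtered posterior means for a 1-D Gaussian random-walk model.

    Assumed model: x_t = x_{t-1} + N(0, process_std^2),
                   y_t = x_t + N(0, obs_std^2), with x_0 ~ N(0, 1).
    """
    rng = random.Random(seed)
    particles = [rng.gauss(0.0, 1.0) for _ in range(n_particles)]
    means = []
    for y in observations:
        # 1. Propagate each particle through the transition model.
        particles = [x + rng.gauss(0.0, process_std) for x in particles]
        # 2. Weight by the observation likelihood (in log space for stability).
        log_w = [-0.5 * ((y - x) / obs_std) ** 2 for x in particles]
        max_log_w = max(log_w)
        weights = [math.exp(lw - max_log_w) for lw in log_w]
        total = sum(weights)
        weights = [w / total for w in weights]
        means.append(sum(w * x for w, x in zip(weights, particles)))
        # 3. Multinomial resampling: the global interaction between particles.
        particles = rng.choices(particles, weights=weights, k=n_particles)
    return means


# Hypothetical data: twenty noisy-free readings of a state sitting at 5.0.
filtered = bootstrap_particle_filter([5.0] * 20)
```

Step 3 needs the weights of all particles, which is why a naive parallel implementation has to communicate between all processing elements.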

arXiv:1810.13104

doi 10.1109/LSP.2019.2929440

Audio Source Separation Using Variational Autoencoders and Weak Class Supervision

Authors: Ertuğ Karamatlı, Ali Taylan Cemgil, Serap Kırbız

Abstract: In this paper, we propose a source separation method that is trained by observing the mixtures and the class labels of the sources present in the mixture without any access to isolated sources. Since our method does not require source class labels for every time-frequency bin but only a single label for each source constituting the mixture signal, we call this scenario weak class supervision. We associate a variational autoencoder (VAE) with each source class within a non-negative (compositional) model. Each VAE provides a prior model to identify the signal from its associated class in a sound mixture. After training the model on mixtures, we obtain a generative model for each source class and demonstrate our method on one-second mixtures of utterances of digits from 0 to 9. We show that the separation performance obtained by source class supervision is as good as the performance obtained by source signal supervision.

Submitted 4 August, 2019; v1 submitted 31 October, 2018; originally announced October 2018.

Comments: Accepted version

Journal ref: IEEE Signal Processing Letters 26 (2019) 1349-1353

arXiv:1806.02617

Asynchronous Stochastic Quasi-Newton MCMC for Non-Convex Optimization

Authors: Umut Şimşekli, Çağatay Yıldız, Thanh Huy Nguyen, Gaël Richard, A. Taylan Cemgil

Abstract: Recent studies have illustrated that stochastic gradient Markov Chain Monte Carlo techniques have a strong potential in non-convex optimization, where local and global convergence guarantees can be shown under certain conditions. Building on this recent theory, we develop an asynchronous-parallel stochastic L-BFGS algorithm for non-convex optimization. The proposed algorithm is suitable for both distributed and shared-memory settings. We provide formal theoretical analysis and show that the proposed method achieves an ergodic convergence rate of ${\cal O}(1/\sqrt{N})$ ($N$ being the total number of iterations) and it can achieve a linear speedup under certain conditions. We perform several experiments on both synthetic and real datasets. The results support our theory and show that the proposed algorithm provides a significant speedup over the recently proposed synchronous distributed L-BFGS algorithm.

Submitted 7 June, 2018; originally announced June 2018.

Comments: Published in the International Conference on Machine Learning (ICML 2018)

arXiv:1712.02629

Differentially Private Variational Dropout

Authors: Beyza Ermis, Ali Taylan Cemgil

Abstract: Deep neural networks with their large number of parameters are highly flexible learning systems. The high flexibility in such networks brings with it some serious problems such as overfitting, and regularization is used to address this problem. A currently popular and effective regularization technique for controlling overfitting is dropout. Often, large data collections required for neural networks contain sensitive information such as the medical histories of patients, and the privacy of the training data should be protected. In this paper, we modify the recently proposed variational dropout technique which provided an elegant Bayesian interpretation to dropout, and show that the intrinsic noise in the variational dropout can be exploited to obtain a degree of differential privacy. The iterative nature of training neural networks presents a challenge for privacy-preserving estimation since multiple iterations increase the amount of noise added. We overcome this by using a relaxed notion of differential privacy, called concentrated differential privacy, which provides tighter estimates on the overall privacy loss. We demonstrate the accuracy of our privacy-preserving variational dropout algorithm on benchmark datasets.

Submitted 16 December, 2017; v1 submitted 30 November, 2017; originally announced December 2017.

Comments: arXiv admin note: substantial text overlap with arXiv:1712.01665

arXiv:1712.01665 [pdf, ps, other]

Differentially Private Dropout

Authors: Beyza Ermis, Ali Taylan Cemgil

Abstract: Large data collections required for the training of neural networks often contain sensitive information such as the medical histories of patients, and the privacy of the training data must be preserved. In this paper, we introduce a dropout technique that provides an elegant Bayesian interpretation to dropout, and show that the intrinsic noise added, with the primary goal of regularization, can be exploited to obtain a degree of differential privacy. The iterative nature of training neural networks presents a challenge for privacy-preserving estimation since multiple iterations increase the amount of noise added. We overcome this by using a relaxed notion of differential privacy, called concentrated differential privacy, which provides tighter estimates on the overall privacy loss. We demonstrate the accuracy of our privacy-preserving dropout algorithm on benchmark datasets.

Submitted 30 November, 2017; originally announced December 2017.

Comments: arXiv admin note: text overlap with arXiv:1611.00340 by other authors

arXiv:1602.03442 [pdf, other]

Stochastic Quasi-Newton Langevin Monte Carlo

Authors: Umut Şimşekli, Roland Badeau, A. Taylan Cemgil, Gaël Richard

Abstract: Recently, Stochastic Gradient Markov Chain Monte Carlo (SG-MCMC) methods have been proposed for scaling up Monte Carlo computations to large data problems. Whilst these approaches have proven useful in many applications, vanilla SG-MCMC might suffer from poor mixing rates when random variables exhibit strong couplings under the target densities or big scale differences. In this study, we propose a novel SG-MCMC method that takes the local geometry into account by using ideas from Quasi-Newton optimization methods. These second order methods directly approximate the inverse Hessian by using a limited history of samples and their gradients. Our method uses dense approximations of the inverse Hessian while keeping the time and memory complexities linear with the dimension of the problem. We provide a formal theoretical analysis where we show that the proposed method is asymptotically unbiased and consistent with the posterior expectations. We illustrate the effectiveness of the approach on both synthetic and real datasets. Our experiments on two challenging applications show that our method achieves fast convergence rates similar to Riemannian approaches while at the same time having low computational requirements similar to diagonal preconditioning approaches.

Submitted 12 December, 2016; v1 submitted 10 February, 2016; originally announced February 2016.

Comments: Published in ICML 2016, International Conference on Machine Learning 2016, New York, NY, USA

arXiv:1509.01698 [pdf, other]

HAMSI: A Parallel Incremental Optimization Algorithm Using Quadratic Approximations for Solving Partially Separable Problems

Authors: Kamer Kaya, Figen Öztoprak, Ş. İlker Birbil, A. Taylan Cemgil, Umut Şimşekli, Nurdan Kuru, Hazal Koptagel, M. Kaan Öztürk

Abstract: We propose HAMSI (Hessian Approximated Multiple Subsets Iteration), which is a provably convergent, second order incremental algorithm for solving large-scale partially separable optimization problems. The algorithm is based on a local quadratic approximation, and hence allows incorporating curvature information to speed up the convergence. HAMSI is inherently parallel and it scales nicely with the number of processors. Combined with techniques for effectively utilizing modern parallel computer architectures, we illustrate that the proposed method converges more rapidly than a parallel stochastic gradient descent when both methods are used to solve large-scale matrix factorization problems. This performance gain comes only at the expense of using memory that scales linearly with the total size of the optimization variables. We conclude that HAMSI may be considered as a viable alternative in many large scale problems, where first order methods based on variants of stochastic gradient descent are applicable.

Submitted 4 August, 2017; v1 submitted 5 September, 2015; originally announced September 2015.

Comments: The software is available at https://github.com/spartensor/hamsi-mf

arXiv:1506.01418 [pdf, other]

Parallel Stochastic Gradient Markov Chain Monte Carlo for Matrix Factorisation Models

Authors: Umut Şimşekli, Hazal Koptagel, Hakan Güldaş, A. Taylan Cemgil, Figen Öztoprak, Ş. İlker Birbil

Abstract: For large matrix factorisation problems, we develop a distributed Markov Chain Monte Carlo (MCMC) method based on stochastic gradient Langevin dynamics (SGLD) that we call Parallel SGLD (PSGLD). PSGLD has very favourable scaling properties with increasing data size and is comparable in terms of computational requirements to optimisation methods based on stochastic gradient descent. PSGLD achieves high performance by exploiting the conditional independence structure of the MF models to sub-sample data in a systematic manner so as to allow parallelisation and distributed computation. We provide a convergence proof of the algorithm and verify its superior performance on various architectures such as Graphics Processing Units, shared memory multi-core systems and multi-computer clusters.

Submitted 28 September, 2015; v1 submitted 3 June, 2015; originally announced June 2015.

Comments: 10 pages, 6 figures
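
Illustrative note (not from the paper): the abstract above builds on stochastic gradient Langevin dynamics. As a rough sketch of a single-chain SGLD update, and not of the parallel PSGLD scheme itself, the following toy example samples the mean of a unit-variance Gaussian. The function name `sgld_gaussian_mean`, the model, the prior, and all parameter values are assumptions chosen only for illustration.

```python
import math
import random


def sgld_gaussian_mean(data, step_size=1e-3, batch_size=10, n_steps=2000,
                       prior_std=10.0, seed=0):
    """Plain SGLD for the mean of a unit-variance Gaussian (toy example)."""
    rng = random.Random(seed)
    n_total = len(data)
    theta = 0.0
    samples = []
    for _ in range(n_steps):
        # Sub-sample a minibatch without replacement.
        batch = rng.sample(data, batch_size)
        # Gradient of the log prior N(0, prior_std^2).
        grad_prior = -theta / prior_std ** 2
        # Minibatch log-likelihood gradient, rescaled to the full data size.
        grad_lik = (n_total / batch_size) * sum(x - theta for x in batch)
        # Langevin step: half a step along the gradient plus N(0, step_size) noise.
        noise = rng.gauss(0.0, math.sqrt(step_size))
        theta += 0.5 * step_size * (grad_prior + grad_lik) + noise
        samples.append(theta)
    return samples


if __name__ == "__main__":
    data_rng = random.Random(1)
    data = [data_rng.gauss(2.0, 1.0) for _ in range(100)]
    draws = sgld_gaussian_mean(data)
    kept = draws[500:]  # discard an assumed burn-in period
    print("posterior mean estimate:", sum(kept) / len(kept))
```

The step size is held constant here for simplicity; a decreasing schedule is the more usual choice when asymptotic guarantees matter.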

arXiv:1411.5876 [pdf, ps, other]

Butterfly resampling: asymptotics for particle filters with constrained interactions

Authors: Kari Heine, Nick Whiteley, A. Taylan Cemgil, Hakan Guldas

Abstract: We generalize the elementary mechanism of sampling with replacement $N$ times from a weighted population of size $N$, by introducing auxiliary variables and constraints on conditional independence characterised by modular congruence relations. Motivated by considerations of parallelism, a convergence study reveals how sparsity of the mechanism's conditional independence graph is related to fluctuation properties of particle filters which use it for resampling, in some cases exhibiting exotic scaling behaviour. The proofs involve detailed combinatorial analysis of conditional independence graphs.

Submitted 21 November, 2014; originally announced November 2014.

Comments: 29 pages, supplementary material (46 pages)

MSC Class: 60F05; 60F99; 60G35
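
Illustrative note (not from the paper): the "elementary mechanism" the abstract starts from, sampling with replacement $N$ times from a weighted population of size $N$, can be sketched as below. This is only the baseline multinomial resampling step, not the butterfly scheme; the function name `multinomial_resample` is a hypothetical helper.

```python
import random


def multinomial_resample(particles, weights, rng=None):
    """Draw len(particles) samples with replacement, proportional to weights."""
    if len(particles) != len(weights):
        raise ValueError("particles and weights must have the same length")
    if rng is None:
        rng = random.Random()
    # random.Random.choices samples with replacement using relative weights.
    return rng.choices(particles, weights=weights, k=len(particles))


if __name__ == "__main__":
    particles = ["a", "b", "c", "d"]
    weights = [0.1, 0.6, 0.3, 0.0]  # unnormalised weights are also accepted
    print(multinomial_resample(particles, weights, rng=random.Random(0)))
```

Every draw here depends on the full weight vector, which is the global interaction that constrained schemes aim to reduce for parallel implementations.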

arXiv:1410.6830 [pdf, ps, other]

Clustering Words by Projection Entropy

Authors: Işık Barış Fidaner, Ali Taylan Cemgil

Abstract: We apply entropy agglomeration (EA), a recently introduced algorithm, to cluster the words of a literary text. EA is a greedy agglomerative procedure that minimizes projection entropy (PE), a function that can quantify the segmentedness of an element set. To apply it, the text is reduced to a feature allocation, a combinatorial object to represent the word occurrences in the text's paragraphs. The experiment results demonstrate that EA, despite its reduction and simplicity, is useful in capturing significant relationships among the words in the text. This procedure was implemented in Python and published as free software: REBUS.

Submitted 24 October, 2014; originally announced October 2014.

Comments: Accepted to NIPS 2014 Modern ML+NLP Workshop: http://www.cs.cmu.edu/~apparikh/nips2014ml-nlp/

arXiv:1409.8276 [pdf, other]

A Bayesian Tensor Factorization Model via Variational Inference for Link Prediction

Authors: Beyza Ermis, A. Taylan Cemgil

Abstract: Probabilistic approaches for tensor factorization aim to extract meaningful structure from incomplete data by postulating low rank constraints. Recently, variational Bayesian (VB) inference techniques have successfully been applied to large scale models. This paper presents full Bayesian inference via VB on both single and coupled tensor factorization models. Our method can be run even for very large models and is easily implemented. It exhibits better prediction performance than existing approaches based on maximum likelihood on several real-world datasets for the missing link prediction problem.

Submitted 29 September, 2014; originally announced September 2014.

Comments: arXiv admin note: substantial text overlap with arXiv:1409.8083

arXiv:1409.8083 [pdf, other]

Variational Inference For Probabilistic Latent Tensor Factorization with KL Divergence

Authors: Beyza Ermis, Y. Kenan Yılmaz, A. Taylan Cemgil, Evrim Acar

Abstract: Probabilistic Latent Tensor Factorization (PLTF) is a recently proposed probabilistic framework for modelling multi-way data. Not only the common tensor factorization models but also any arbitrary tensor factorization structure can be realized by the PLTF framework. This paper presents full Bayesian inference via variational Bayes that facilitates more powerful modelling and allows more sophisticated inference on the PLTF framework. We illustrate our approach on model order selection and link prediction.

Submitted 29 September, 2014; originally announced September 2014.

arXiv:1401.2490 [pdf, ps, other]

doi 10.3182/20120711-3-BE-2027.00312

An Online Expectation-Maximisation Algorithm for Nonnegative Matrix Factorisation Models

Authors: Sinan Yildirim, A. Taylan Cemgil, Sumeetpal S. Singh

Abstract: In this paper we formulate the nonnegative matrix factorisation (NMF) problem as a maximum likelihood estimation problem for hidden Markov models and propose online expectation-maximisation (EM) algorithms to estimate the NMF and the other unknown static parameters. We also propose a sequential Monte Carlo approximation of our online EM algorithm. We show the performance of the proposed method with two numerical examples.

Submitted 10 January, 2014; originally announced January 2014.

Comments: 6 pages, 3 figures

Journal ref: 16th IFAC Symposium on System Identification, 2012, Volume 16, Part 1,

arXiv:1310.0509 [pdf, ps, other]

Summary Statistics for Partitionings and Feature Allocations

Authors: Işık Barış Fidaner, Ali Taylan Cemgil

Abstract: Infinite mixture models are commonly used for clustering. One can sample from the posterior of mixture assignments by Monte Carlo methods or find its maximum a posteriori solution by optimization. However, in some problems the posterior is diffuse and it is hard to interpret the sampled partitionings. In this paper, we introduce novel statistics based on block sizes for representing sample sets of partitionings and feature allocations. We develop an element-based definition of entropy to quantify segmentation among their elements. Then we propose a simple algorithm called entropy agglomeration (EA) to summarize and visualize this information. Experiments on various infinite mixture posteriors as well as a feature allocation dataset demonstrate that the proposed statistics are useful in practice.

Submitted 25 November, 2013; v1 submitted 1 October, 2013; originally announced October 2013.

Comments: Accepted to NIPS 2013: https://nips.cc/Conferences/2013/Program/event.php?ID=3763

arXiv:1209.4280 [pdf, ps, other]

Alpha/Beta Divergences and Tweedie Models

Authors: Y. Kenan Yilmaz, A. Taylan Cemgil

Abstract: We describe the underlying probabilistic interpretation of alpha and beta divergences. We first show that beta divergences are inherently tied to Tweedie distributions, a particular type of exponential family, known as exponential dispersion models. Starting from the variance function of a Tweedie model, we outline how to get alpha and beta divergences as special cases of Csiszár's $f$ and Bregman divergences. This result directly generalizes the well-known relationship between the Gaussian distribution and least squares estimation to Tweedie models and beta divergence minimization.

Submitted 19 September, 2012; originally announced September 2012.
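
Illustrative note (not from the paper): for orientation, one common parametrization of the beta divergence between positive scalars $x$ and $y$ is given below. The paper may index the family differently, so the symbol $\beta$ and the correspondence $p = 2 - \beta$ to the Tweedie variance function $v(\mu) = \mu^p$ should be read as assumptions of this sketch.

$$
d_\beta(x \,\|\, y) = \frac{x^\beta + (\beta - 1)\, y^\beta - \beta\, x\, y^{\beta - 1}}{\beta(\beta - 1)}, \qquad \beta \notin \{0, 1\}
$$

$$
d_2(x \,\|\, y) = \tfrac{1}{2}(x - y)^2, \qquad
\lim_{\beta \to 1} d_\beta(x \,\|\, y) = x \log\frac{x}{y} - x + y, \qquad
\lim_{\beta \to 0} d_\beta(x \,\|\, y) = \frac{x}{y} - \log\frac{x}{y} - 1
$$

Under this convention the three special cases are the squared Euclidean distance, the generalized Kullback-Leibler divergence, and the Itakura-Saito divergence, which pair with Gaussian ($p = 0$), Poisson ($p = 1$), and gamma ($p = 2$) models respectively.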

arXiv:1208.6231 [pdf, other]

Link Prediction via Generalized Coupled Tensor Factorisation

Authors: Beyza Ermiş, Evrim Acar, A. Taylan Cemgil

Abstract: This study deals with the missing link prediction problem: the problem of predicting the existence of missing connections between entities of interest. We address link prediction using coupled analysis of relational datasets represented as heterogeneous data, i.e., datasets in the form of matrices and higher-order tensors. We propose to use an approach based on probabilistic interpretation of tensor factorisation models, i.e., Generalised Coupled Tensor Factorisation, which can simultaneously fit a large class of tensor models to higher-order tensors/matrices with common latent factors using different loss functions. Numerical experiments demonstrate that joint analysis of data from multiple sources via coupled factorisation improves the link prediction performance and the selection of the right loss function and tensor model is crucial for accurately predicting missing links.

Submitted 30 August, 2012; originally announced August 2012.

arXiv:1106.4863 [pdf, ps]

doi 10.1613/jair.1121

Monte Carlo Methods for Tempo Tracking and Rhythm Quantization

Authors: A. T. Cemgil, B. Kappen

Abstract: We present a probabilistic generative model for timing deviations in expressive music performance. The structure of the proposed model is equivalent to a switching state space model. The switch variables correspond to discrete note locations as in a musical score. The continuous hidden variables denote the tempo. We formulate two well known music recognition problems, namely tempo tracking and automatic transcription (rhythm quantization) as filtering and maximum a posteriori (MAP) state estimation tasks. Exact computation of posterior features such as the MAP state is intractable in this model class, so we introduce Monte Carlo methods for integration and optimization. We compare Markov Chain Monte Carlo (MCMC) methods (such as Gibbs sampling, simulated annealing and iterative improvement) and sequential Monte Carlo methods (particle filters). Our simulation results suggest better results with sequential methods. The methods can be applied in both online and batch scenarios such as tempo tracking and transcription and are thus potentially useful in a number of music applications such as adaptive automatic accompaniment, score typesetting and music information retrieval.

Submitted 23 June, 2011; originally announced June 2011.

Journal ref: Journal Of Artificial Intelligence Research, Volume 18, pages 45-81, 2003

Showing 1–29 of 29 results for author: Cemgil, A T