Search | arXiv e-print repository

Meta-experiments: Improving experimentation through experimentation

Abstract: A/B testing is widely used in industry to optimize customer-facing websites. Many companies employ experimentation specialists to facilitate and improve the process of A/B testing. Here, we present the application of A/B testing to this improvement effort itself, by running experiments on the experimentation process, which we call 'meta-experiments'. We discuss the challenges of this approach using the example of one of our meta-experiments, which helped experimenters to run more sufficiently powered A/B tests. We also point out the benefits of 'dog fooding' for the experimentation specialists when running their own experiments.

Submitted 24 June, 2024; originally announced June 2024.

Comments: 6 pages, 2 figures, 1 table

arXiv:2403.19448 [pdf, other]

Fisher-Rao Gradient Flows of Linear Programs and State-Action Natural Policy Gradients

Authors: Johannes Müller, Semih Çaycı, Guido Montúfar

Abstract: Kakade's natural policy gradient method has been studied extensively in recent years, showing linear convergence with and without regularization. We study another natural gradient method which is based on the Fisher information matrix of the state-action distributions and has received little attention from the theoretical side. Here, the state-action distributions follow the Fisher-Rao gradient flow inside the state-action polytope with respect to a linear potential. Therefore, we study Fisher-Rao gradient flows of linear programs more generally and show linear convergence with a rate that depends on the geometry of the linear program. Equivalently, this yields an estimate on the error induced by entropic regularization of the linear program which improves existing results. We extend these results and show sublinear convergence for perturbed Fisher-Rao gradient flows and natural gradient flows up to an approximation error. In particular, these general results cover the case of state-action natural policy gradients.

Submitted 28 March, 2024; originally announced March 2024.

Comments: 27 pages, 4 figures, under review

MSC Class: 65K05; 90C05; 90C08; 90C40; 90C53

arXiv:2312.03654 [pdf, other]

Efficient Inverse Design Optimization through Multi-fidelity Simulations, Machine Learning, and Search Space Reduction Strategies

Authors: Luka Grbcic, Juliane Müller, Wibe Albert de Jong

Abstract: This paper introduces a methodology designed to augment the inverse design optimization process in scenarios constrained by limited compute, through the strategic synergy of multi-fidelity evaluations, machine learning models, and optimization algorithms. The proposed methodology is analyzed on two distinct engineering inverse design problems: airfoil inverse design and the scalar field reconstruction problem. It leverages a machine learning model trained with low-fidelity simulation data, in each optimization cycle, thereby proficiently predicting a target variable and discerning whether a high-fidelity simulation is necessitated, which notably conserves computational resources. Additionally, the machine learning model is strategically deployed prior to optimization to compress the design space boundaries, thereby further accelerating convergence toward the optimal solution. The methodology has been employed to enhance two optimization algorithms, namely Differential Evolution and Particle Swarm Optimization. Comparative analyses illustrate performance improvements across both algorithms. Notably, this method is adaptable across any inverse design application, facilitating a synergy between a representative low-fidelity ML model, and high-fidelity simulation, and can be seamlessly applied across any variety of population-based optimization algorithms.

Submitted 3 June, 2024; v1 submitted 6 December, 2023; originally announced December 2023.

arXiv:2311.00553 [pdf, other]

Polynomial Chaos Surrogate Construction for Random Fields with Parametric Uncertainty

Authors: Joy N. Mueller, Khachik Sargsyan, Craig J. Daniels, Habib N. Najm

Abstract: Engineering and applied science rely on computational experiments to rigorously study physical systems. The mathematical models used to probe these systems are highly complex, and sampling-intensive studies often require prohibitively many simulations for acceptable accuracy. Surrogate models provide a means of circumventing the high computational expense of sampling such complex models. In particular, polynomial chaos expansions (PCEs) have been successfully used for uncertainty quantification studies of deterministic models where the dominant source of uncertainty is parametric. We discuss an extension to conventional PCE surrogate modeling to enable surrogate construction for stochastic computational models that have intrinsic noise in addition to parametric uncertainty. We develop a PCE surrogate on a joint space of intrinsic and parametric uncertainty, enabled by Rosenblatt transformations, and then extend the construction to random field data via the Karhunen-Loeve expansion. We then take advantage of closed-form solutions for computing PCE Sobol indices to perform a global sensitivity analysis of the model which quantifies the intrinsic noise contribution to the overall model output variance. Additionally, the resulting joint PCE is generative in the sense that it allows generating random realizations at any input parameter setting that are statistically approximately equivalent to realizations from the underlying stochastic model. The method is demonstrated on a chemical catalysis example model.

Submitted 17 June, 2024; v1 submitted 1 November, 2023; originally announced November 2023.

MSC Class: 60G99; 65C20; 33C45; 62G07; 62J02

arXiv:2307.08370 [pdf, other]

doi 10.1098/rsif.2023.0409

Parameter estimation for contact tracing in graph-based models

Authors: Augustine Okolie, Johannes Müller, Mirjam Kretzschmar

Abstract: We adopt a maximum-likelihood framework to estimate parameters of a stochastic susceptible-infected-recovered (SIR) model with contact tracing on a rooted random tree. Given the number of detectees per index case, our estimator allows us to determine the degree distribution of the random tree as well as the tracing probability. Since we do not discover all infectees via contact tracing, this estimation is non-trivial. To keep things simple and stable, we develop an approximation suited for realistic situations (contact tracing probability small, or the probability for the detection of index cases small). In this approximation, the only epidemiological parameter entering the estimator is $R_0$. The estimator is tested in a simulation study and is furthermore applied to COVID-19 contact tracing data from India. The simulation study underlines the efficiency of the method. For the empirical COVID-19 data, we compare different degree distributions and perform a sensitivity analysis. We find that particularly a power-law and a negative binomial degree distribution fit the data well and that the tracing probability is rather large. The sensitivity analysis shows no strong dependency of the estimates on the reproduction number. Finally, we discuss the relevance of our findings.

Submitted 22 November, 2023; v1 submitted 17 July, 2023; originally announced July 2023.

Comments: 24 pages, 8 figures, 3 tables

MSC Class: 92D30

Journal ref: Royal Society Interface 2023

arXiv:2306.13520 [pdf, other]

On the Convergence Rate of Gaussianization with Random Rotations

Authors: Felix Draxler, Lars Kühmichel, Armand Rousselot, Jens Müller, Christoph Schnörr, Ullrich Köthe

Abstract: Gaussianization is a simple generative model that can be trained without backpropagation. It has shown compelling performance on low dimensional data. As the dimension increases, however, it has been observed that the convergence speed slows down. We show analytically that the number of required layers scales linearly with the dimension for Gaussian input. We argue that this is because the model is unable to capture dependencies between dimensions. Empirically, we find the same linear increase in cost for arbitrary input $p(x)$, but observe favorable scaling for some distributions. We explore potential speed-ups and formulate challenges for further research.

Submitted 23 June, 2023; originally announced June 2023.

arXiv:2305.16583 [pdf, other]

Detecting Errors in a Numerical Response via any Regression Model

Authors: Hang Zhou, Jonas Mueller, Mayank Kumar, Jane-Ling Wang, Jing Lei

Abstract: Noise plagues many numerical datasets, where the recorded values in the data may fail to match the true underlying values due to reasons including: erroneous sensors, data entry/processing mistakes, or imperfect human estimates. We consider general regression settings with covariates and a potentially corrupted response whose observed values may contain errors. By accounting for various uncertainties, we introduce veracity scores that distinguish between genuine errors and natural data fluctuations, conditioned on the available covariate information in the dataset. We propose a simple yet efficient filtering procedure for eliminating potential errors, and establish theoretical guarantees for our method. We also contribute a new error detection benchmark involving 5 regression datasets with real-world numerical errors (for which the true values are also known). In this benchmark and additional simulation studies, our method identifies incorrect values with better precision/recall than other approaches.

Submitted 12 March, 2024; v1 submitted 25 May, 2023; originally announced May 2023.

arXiv:2303.18022 [pdf, other]

The Topology-Overlap Trade-Off in Retinal Arteriole-Venule Segmentation

Authors: Angel Victor Juanco Muller, Joao F. C. Mota, Keith A. Goatman, Corne Hoogendoorn

Abstract: Retinal fundus images can be an invaluable diagnostic tool for screening epidemic diseases like hypertension or diabetes. They become especially useful when the arterioles and venules they depict are clearly identified and annotated. However, manual annotation of these vessels is extremely time demanding and taxing, which calls for automatic segmentation. Although convolutional neural networks can achieve high overlap between predictions and expert annotations, they often fail to produce topologically correct predictions of tubular structures. This situation is exacerbated by the bifurcation versus crossing ambiguity which causes classification mistakes. This paper shows that including a topology preserving term in the loss function improves the continuity of the segmented vessels, although at the expense of artery-vein misclassification and overall lower overlap metrics. However, we show that by including an orientation score guided convolutional module, based on the anisotropic single sided cake wavelet, we reduce such misclassification and further increase the topology correctness of the results. We evaluate our model on public datasets with conveniently chosen metrics to assess both overlap and topology correctness, showing that our model is able to produce results on par with state-of-the-art from the point of view of overlap, while increasing topological accuracy.

Submitted 31 March, 2023; originally announced March 2023.

Comments: To be published in proceedings of SPIE Medical Imaging 2023 Image Processing

arXiv:2303.09989 [pdf, other]

Finding Competence Regions in Domain Generalization

Authors: Jens Müller, Stefan T. Radev, Robert Schmier, Felix Draxler, Carsten Rother, Ullrich Köthe

Abstract: We investigate a "learning to reject" framework to address the problem of silent failures in Domain Generalization (DG), where the test distribution differs from the training distribution. Assuming a mild distribution shift, we wish to accept out-of-distribution (OOD) data from a new domain whenever a model's estimated competence foresees trustworthy responses, instead of rejecting OOD data outright. Trustworthiness is then predicted via a proxy incompetence score that is tightly linked to the performance of a classifier. We present a comprehensive experimental evaluation of existing proxy scores as incompetence scores for classification and highlight the resulting trade-offs between rejection rate and accuracy gain. For comparability with prior work, we focus on standard DG benchmarks and consider the effect of measuring incompetence via different learned representations in a closed versus an open world setting. Our results suggest that increasing incompetence scores are indeed predictive of reduced accuracy, leading to significant improvements of the average accuracy below a suitable incompetence threshold. However, the scores are not yet good enough to allow for a favorable accuracy/rejection trade-off in all tested domains. Surprisingly, our results also indicate that classifiers optimized for DG robustness do not outperform a naive Empirical Risk Minimization (ERM) baseline in the competence region, that is, where test samples elicit low incompetence scores.

Submitted 21 June, 2023; v1 submitted 17 March, 2023; originally announced March 2023.

Comments: The paper has been published at TMLR (see https://openreview.net/forum?id=TSy0vuwQFN)

Journal ref: Transactions on Machine Learning Research (06/2023)

arXiv:2301.11856 [pdf, other]

ActiveLab: Active Learning with Re-Labeling by Multiple Annotators

Authors: Hui Wen Goh, Jonas Mueller

Abstract: In real-world data labeling applications, annotators often provide imperfect labels. It is thus common to employ multiple annotators to label data with some overlap between their examples. We study active learning in such settings, aiming to train an accurate classifier by collecting a dataset with the fewest total annotations. Here we propose ActiveLab, a practical method to decide what to label next that works with any classifier model and can be used in pool-based batch active learning with one or multiple annotators. ActiveLab automatically estimates when it is more informative to re-label examples vs. labeling entirely new ones. This is a key aspect of producing high quality labels and trained models within a limited annotation budget. In experiments on image and tabular data, ActiveLab reliably trains more accurate classifiers with far fewer annotations than a wide variety of popular active learning methods.

Submitted 27 January, 2023; originally announced January 2023.

arXiv:2210.06812 [pdf, other]

CROWDLAB: Supervised learning to infer consensus labels and quality scores for data with multiple annotators

Authors: Hui Wen Goh, Ulyana Tkachenko, Jonas Mueller

Abstract: Real-world data for classification is often labeled by multiple annotators. For analyzing such data, we introduce CROWDLAB, a straightforward approach to utilize any trained classifier to estimate: (1) A consensus label for each example that aggregates the available annotations; (2) A confidence score for how likely each consensus label is correct; (3) A rating for each annotator quantifying the overall correctness of their labels. Existing algorithms to estimate related quantities in crowdsourcing often rely on sophisticated generative models with iterative inference. CROWDLAB instead uses a straightforward weighted ensemble. Existing algorithms often rely solely on annotator statistics, ignoring the features of the examples from which the annotations derive. CROWDLAB utilizes any classifier model trained on these features, and can thus better generalize between examples with similar features. On real-world multi-annotator image data, our proposed method provides superior estimates for (1)-(3) than existing algorithms like Dawid-Skene/GLAD.

Submitted 27 January, 2023; v1 submitted 13 October, 2022; originally announced October 2022.

Journal ref: NeurIPS 2022 Human in the Loop Learning Workshop

arXiv:2207.03061 [pdf, other]

Back to the Basics: Revisiting Out-of-Distribution Detection Baselines

Authors: Johnson Kuan, Jonas Mueller

Abstract: We study simple methods for out-of-distribution (OOD) image detection that are compatible with any already trained classifier, relying on only its predictions or learned representations. Evaluating the OOD detection performance of various methods when utilized with ResNet-50 and Swin Transformer models, we find methods that solely consider the model's predictions can be easily outperformed by also considering the learned representations. Based on our analysis, we advocate for a dead-simple approach that has been neglected in other studies: simply flag as OOD images whose average distance to their K nearest neighbors is large (in the representation space of an image classifier trained on the in-distribution data).

Submitted 6 July, 2022; originally announced July 2022.

Comments: ICML Workshop on Principles of Distribution Shift 2022

arXiv:2207.01279 [pdf, other]

Joint lifetime modelling with matrix distributions

Authors: Albrecher Hansjörg, Bladt Martin, Alaric J. A Müller

Abstract: Acyclic phase-type (PH) distributions have been a popular tool in survival analysis, thanks to their natural interpretation in terms of ageing towards its inevitable absorption. In this paper, we consider an extension to the bivariate setting for the modelling of joint lifetimes. In contrast to previous models in the literature that were based on a separate estimation of the marginal behavior and the dependence structure through a copula, we propose a new time-inhomogeneous version of a multivariate PH class (mIPH) that leads to a model for joint lifetimes without that separation. We study properties of mIPH class members and provide an adapted estimation procedure that allows for right-censoring and covariate information. We show that initial distribution vectors in our construction can be tailored to reflect the dependence of the random variables, and use multinomial regression to determine the influence of covariates on starting probabilities. Moreover, we highlight the flexibility and parsimony, in terms of needed phases, introduced by the time-inhomogeneity. Numerical illustrations are given for the famous data set of joint lifetimes of Frees et al. [15], where 10 phases turn out to be sufficient for a reasonable fitting performance. As a by-product, the proposed approach enables a natural causal interpretation of the association in the ageing mechanism of joint lifetimes that goes beyond a statistical fit.

Submitted 3 October, 2022; v1 submitted 4 July, 2022; originally announced July 2022.

arXiv:2206.07449 [pdf, other]

Self-Assessment for Single-Object Tracking in Clutter Using Subjective Logic

Authors: Thomas Griebel, Johannes Müller, Paul Geisler, Charlotte Hermann, Martin Herrmann, Michael Buchholz, Klaus Dietmayer

Abstract: Reliable tracking algorithms are essential for automated driving. However, the existing consistency measures are not sufficient to meet the increasing safety demands in the automotive sector. Therefore, this work presents a novel method for self-assessment of single-object tracking in clutter based on Kalman filtering and subjective logic. A key feature of the approach is that it additionally provides a measure of the collected statistical evidence in its online reliability scores. In this way, various aspects of reliability, such as the correctness of the assumed measurement noise, detection probability, and clutter rate, can be monitored in addition to the overall assessment based on the available evidence. Here, we present a mathematical derivation of the reference distribution used in our self-assessment module for our studied problem. Moreover, we introduce a formula that describes how a threshold should be chosen for the degree of conflict, the subjective logic comparison measure used for the reliability decision making. Our approach is evaluated in a challenging simulation scenario designed to model adverse weather conditions. The simulations show that our method can significantly improve the reliability checking of single-object tracking in clutter in several aspects.

Submitted 15 June, 2022; originally announced June 2022.

Comments: Accepted for presentation at the 2022 IEEE 25th International Conference on Information Fusion (FUSION), July 4 - 7, 2022, Linköping, Sweden

arXiv:2203.09438 [pdf, other]

An Explainable Stacked Ensemble Model for Static Route-Free Estimation of Time of Arrival

Authors: Sören Schleibaum, Jörg P. Müller, Monika Sester

Abstract: To compare alternative taxi schedules and to compute them, as well as to provide insights into an upcoming taxi trip to drivers and passengers, the duration of a trip or its Estimated Time of Arrival (ETA) is predicted. To reach a high prediction precision, machine learning models for ETA are state of the art. One yet unexploited option to further increase prediction precision is to combine multiple ETA models into an ensemble. While an increase of prediction precision is likely, the main drawback is that the predictions made by such an ensemble become less transparent due to the sophisticated ensemble architecture. One option to remedy this drawback is to apply eXplainable Artificial Intelligence (XAI). The contribution of this paper is three-fold. First, we combine multiple machine learning models from our previous work for ETA into a two-level ensemble model - a stacked ensemble model - which on its own is novel; therefore, we can outperform previous state-of-the-art static route-free ETA approaches. Second, we apply existing XAI methods to explain the first- and second-level models of the ensemble. Third, we propose three joining methods for combining the first-level explanations with the second-level ones. Those joining methods enable us to explain stacked ensembles for regression tasks. An experimental evaluation shows that the ETA models correctly learned the importance of those input features driving the prediction.

Submitted 11 January, 2024; v1 submitted 17 March, 2022; originally announced March 2022.

arXiv:2202.12441 [pdf, other]

Long-Term Missing Value Imputation for Time Series Data Using Deep Neural Networks

Authors: Jangho Park, Juliane Muller, Bhavna Arora, Boris Faybishenko, Gilberto Pastorello, Charuleka Varadharajan, Reetik Sahu, Deborah Agarwal

Abstract: We present an approach that uses a deep learning model, in particular, a MultiLayer Perceptron (MLP), for estimating the missing values of a variable in multivariate time series data. We focus on filling a long continuous gap (e.g., multiple months of missing daily observations) rather than on individual randomly missing observations. Our proposed gap filling algorithm uses an automated method for determining the optimal MLP model architecture, thus allowing for optimal prediction performance for the given time series. We tested our approach by filling gaps of various lengths (three months to three years) in three environmental datasets with different time series characteristics, namely daily groundwater levels, daily soil moisture, and hourly Net Ecosystem Exchange. We compared the accuracy of the gap-filled values obtained with our approach to the widely-used R-based time series gap filling methods ImputeTS and mtsdi. The results indicate that using an MLP for filling a large gap leads to better results, especially when the data behave nonlinearly. Thus, our approach enables the use of datasets that have a large gap in one variable, which is common in many long-term environmental monitoring observations.

Submitted 24 February, 2022; originally announced February 2022.

arXiv:2111.02705 [pdf, other]

Benchmarking Multimodal AutoML for Tabular Data with Text Fields

Authors: Xingjian Shi, Jonas Mueller, Nick Erickson, Mu Li, Alexander J. Smola

Abstract: We consider the use of automated supervised learning systems for data tables that not only contain numeric/categorical columns, but one or more text fields as well. Here we assemble 18 multimodal data tables that each contain some text fields and stem from a real business application. Our publicly-available benchmark enables researchers to comprehensively evaluate their own methods for supervised learning with numeric, categorical, and text features. To ensure that any single modeling strategy which performs well over all 18 datasets will serve as a practical foundation for multimodal text/tabular AutoML, the diverse datasets in our benchmark vary greatly in: sample size, problem types (a mix of classification and regression tasks), number of features (with the number of text columns ranging from 1 to 28 between datasets), as well as how the predictive signal is decomposed between text vs. numeric/categorical features (and predictive interactions thereof). Over this benchmark, we evaluate various straightforward pipelines to model such data, including standard two-stage approaches where NLP is used to featurize the text such that AutoML for tabular data can then be applied. Compared with human data science teams, the fully automated methodology that performed best on our benchmark (stack ensembling a multimodal Transformer with various tree models) also manages to rank 1st place when fit to the raw text/tabular data in two MachineHack prediction competitions and 2nd place (out of 2380 teams) in Kaggle's Mercari Price Suggestion Challenge.

Submitted 4 November, 2021; originally announced November 2021.

Comments: Proceedings of the Neural Information Processing Systems (NeurIPS) Track on Datasets and Benchmarks 2021

arXiv:2106.10414 [pdf, other]

Deep Learning for Functional Data Analysis with Adaptive Basis Layers

Authors: Junwen Yao, Jonas Mueller, Jane-Ling Wang

Abstract: Despite their widespread success, the application of deep neural networks to functional data remains scarce today. The infinite dimensionality of functional data means standard learning algorithms can be applied only after appropriate dimension reduction, typically achieved via basis expansions. Currently, these bases are chosen a priori without the information for the task at hand and thus may not be effective for the designated task. We instead propose to adaptively learn these bases in an end-to-end fashion. We introduce neural networks that employ a new Basis Layer whose hidden units are each basis functions themselves implemented as a micro neural network. Our architecture learns to apply parsimonious dimension reduction to functional inputs that focuses only on information relevant to the target rather than irrelevant variation in the input function. Across numerous classification/regression tasks with functional data, our method empirically outperforms other types of neural networks, and we prove that our approach is statistically consistent with low generalization error. Code is available at: https://github.com/jwyyy/AdaFNN.

Submitted 19 June, 2021; originally announced June 2021.

Comments: ICML 2021

arXiv:2105.05334 [pdf, other]

Coupling from the Past for the Stochastic Simulation of Chemical Reaction Networks

Authors: J. N. Mueller, J. N. Corcoran

Abstract: Chemical reaction networks (CRNs) are fundamental computational models used to study the behavior of chemical reactions in well-mixed solutions. They have been used extensively to model a broad range of biological systems, and are primarily used when the more traditional model of deterministic continuous mass action kinetics is invalid due to small molecular counts. We present a perfect sampling algorithm to draw error-free samples from the stationary distributions of stochastic models for coupled, linear chemical reaction networks. The state spaces of such networks are given by all permissible combinations of molecular counts for each chemical species, and thereby grow exponentially with the numbers of species in the network. To avoid simulations involving large numbers of states, we propose a subset of chemical species such that coupling of paths started from these states guarantees coupling of paths started from all states in the state space, and we show for the well-known Reversible Michaelis-Menten model that the subset does in fact guarantee perfect draws from the stationary distribution of interest. We compare solutions computed in two ways with this algorithm to those found analytically using the chemical master equation, and we compare the distribution of coupling times for the two simulation approaches.

Submitted 11 May, 2021; originally announced May 2021.

Comments: 27 pages, 30 figures

MSC Class: 60J27; 60J28; 60K30 ACM Class: I.6.3; I.6.8

arXiv:2103.14749 [pdf, other]

Pervasive Label Errors in Test Sets Destabilize Machine Learning Benchmarks

Authors: Curtis G. Northcutt, Anish Athalye, Jonas Mueller

Abstract: We identify label errors in the test sets of 10 of the most commonly-used computer vision, natural language, and audio datasets, and subsequently study the potential for these label errors to affect benchmark results. Errors in test sets are numerous and widespread: we estimate an average of at least 3.3% errors across the 10 datasets, where for example label errors comprise at least 6% of the ImageNet validation set. Putative label errors are identified using confident learning algorithms and then human-validated via crowdsourcing (51% of the algorithmically-flagged candidates are indeed erroneously labeled, on average across the datasets). Traditionally, machine learning practitioners choose which model to deploy based on test accuracy - our findings advise caution here, proposing that judging models over correctly labeled test sets may be more useful, especially for noisy real-world datasets. Surprisingly, we find that lower capacity models may be practically more useful than higher capacity models in real-world datasets with high proportions of erroneously labeled data. For example, on ImageNet with corrected labels: ResNet-18 outperforms ResNet-50 if the prevalence of originally mislabeled test examples increases by just 6%. On CIFAR-10 with corrected labels: VGG-11 outperforms VGG-19 if the prevalence of originally mislabeled test examples increases by just 5%. Test set errors across the 10 datasets can be viewed at https://labelerrors.com and all label errors can be reproduced by https://github.com/cleanlab/label-errors.

Submitted 7 November, 2021; v1 submitted 26 March, 2021; originally announced March 2021.

Comments: Demo available at https://labelerrors.com/ and source code available at https://github.com/cleanlab/label-errors

Journal ref: 35th Conference on Neural Information Processing Systems (NeurIPS 2021) Track on Datasets and Benchmarks

arXiv:2103.00083 [pdf, other]

Flexible Model Aggregation for Quantile Regression

Authors: Rasool Fakoor, Taesup Kim, Jonas Mueller, Alexander J. Smola, Ryan J. Tibshirani

Abstract: Quantile regression is a fundamental problem in statistical learning motivated by a need to quantify uncertainty in predictions, or to model a diverse population without being overly reductive. For instance, epidemiological forecasts, cost estimates, and revenue predictions all benefit from being able to quantify the range of possible values accurately. As such, many models have been developed for this problem over many years of research in statistics, machine learning, and related fields. Rather than proposing yet another (new) algorithm for quantile regression we adopt a meta viewpoint: we investigate methods for aggregating any number of conditional quantile models, in order to improve accuracy and robustness. We consider weighted ensembles where weights may vary over not only individual models, but also over quantile levels, and feature values. All of the models we consider in this paper can be fit using modern deep learning toolkits, and hence are widely accessible (from an implementation point of view) and scalable. To improve the accuracy of the predicted quantiles (or equivalently, prediction intervals), we develop tools for ensuring that quantiles remain monotonically ordered, and apply conformal calibration methods. These can be used without any modification of the original library of base models. We also review some basic theory surrounding quantile aggregation and related scoring rules, and contribute a few new results to this literature (for example, the fact that post sorting or post isotonic regression can only improve the weighted interval score). Finally, we provide an extensive suite of empirical comparisons across 34 data sets from two different benchmark repositories.

Submitted 15 April, 2023; v1 submitted 26 February, 2021; originally announced March 2021.

Comments: Accepted at JMLR 2023
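
The abstract above states that post sorting can only improve the weighted interval score. A minimal sketch of that idea follows, under a loudly stated assumption: it uses the summed pinball (quantile) loss as a stand-in for the weighted interval score, and the quantile levels and crossed predictions are made-up illustration values, not taken from the paper.

```python
def pinball_loss(y, q, tau):
    """Pinball (quantile) loss of predicted quantile q at level tau for outcome y."""
    diff = y - q
    return tau * diff if diff >= 0 else (tau - 1.0) * diff


def total_pinball(y, quantiles, levels):
    """Sum of pinball losses over all quantile levels (stand-in for the interval score)."""
    return sum(pinball_loss(y, q, t) for q, t in zip(quantiles, levels))


# Hypothetical predictions: the 0.1 and 0.5 quantiles are crossed (2.0 > 1.0).
levels = [0.1, 0.5, 0.9]
crossed = [2.0, 1.0, 3.0]
fixed = sorted(crossed)  # post sorting restores monotone order

for y in (0.0, 1.5, 4.0):
    print(y, total_pinball(y, crossed, levels), total_pinball(y, fixed, levels))
```

For every outcome `y`, the sorted predictions should score no worse than the crossed ones; the loop prints both totals so the comparison can be inspected directly.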

arXiv:2102.09225 [pdf, other]

Continuous Doubly Constrained Batch Reinforcement Learning

Authors: Rasool Fakoor, Jonas Mueller, Kavosh Asadi, Pratik Chaudhari, Alexander J. Smola

Abstract: Reliant on too many experiments to learn good actions, current Reinforcement Learning (RL) algorithms have limited applicability in real-world settings, which can be too expensive to allow exploration. We propose an algorithm for batch RL, where effective policies are learned using only a fixed offline dataset instead of online interactions with the environment. The limited data in batch RL produces inherent uncertainty in value estimates of states/actions that were insufficiently represented in the training data. This leads to particularly severe extrapolation when our candidate policies diverge from one that generated the data. We propose to mitigate this issue via two straightforward penalties: a policy-constraint to reduce this divergence and a value-constraint that discourages overly optimistic estimates. Over a comprehensive set of 32 continuous-action batch RL benchmarks, our approach compares favorably to state-of-the-art methods, regardless of how the offline data were collected.

Submitted 6 December, 2021; v1 submitted 18 February, 2021; originally announced February 2021.

Comments: NeurIPS 2021 conference paper

arXiv:2010.07167 [pdf, other]

Learning Robust Models Using The Principle of Independent Causal Mechanisms

Authors: Jens Müller, Robert Schmier, Lynton Ardizzone, Carsten Rother, Ullrich Köthe

Abstract: Standard supervised learning breaks down under data distribution shift. However, the principle of independent causal mechanisms (ICM, Peters et al. (2017)) can turn this weakness into an opportunity: one can take advantage of distribution shift between different environments during training in order to obtain more robust models. We propose a new gradient-based learning framework whose objective function is derived from the ICM principle. We show theoretically and experimentally that neural networks trained in this framework focus on relations remaining invariant across environments and ignore unstable ones. Moreover, we prove that the recovered stable relations correspond to the true causal mechanisms under certain conditions. In both regression and classification, the resulting models generalize well to unseen scenarios where traditionally trained models fail.

Submitted 8 February, 2021; v1 submitted 14 October, 2020; originally announced October 2020.

arXiv:2006.14284 [pdf, other]

Fast, Accurate, and Simple Models for Tabular Data via Augmented Distillation

Authors: Rasool Fakoor, Jonas Mueller, Nick Erickson, Pratik Chaudhari, Alexander J. Smola

Abstract: Automated machine learning (AutoML) can produce complex model ensembles by stacking, bagging, and boosting many individual models like trees, deep networks, and nearest neighbor estimators. While highly accurate, the resulting predictors are large, slow, and opaque as compared to their constituents. To improve the deployment of AutoML on tabular data, we propose FAST-DAD to distill arbitrarily complex ensemble predictors into individual models like boosted trees, random forests, and deep networks. At the heart of our approach is a data augmentation strategy based on Gibbs sampling from a self-attention pseudolikelihood estimator. Across 30 datasets spanning regression and binary/multiclass classification tasks, FAST-DAD distillation produces significantly better individual models than one obtains through standard training on the original data. Our individual distilled models are over 10x faster and more accurate than ensemble predictors produced by AutoML tools like H2O/AutoSklearn.

Submitted 25 June, 2020; originally announced June 2020.

Journal ref: NeurIPS 2020

arXiv:2004.02441 [pdf, other]

TraDE: Transformers for Density Estimation

Authors: Rasool Fakoor, Pratik Chaudhari, Jonas Mueller, Alexander J. Smola

Abstract: We present TraDE, a self-attention-based architecture for auto-regressive density estimation with continuous and discrete valued data. Our model is trained using a penalized maximum likelihood objective, which ensures that samples from the density estimate resemble the training data distribution. The use of self-attention means that the model need not retain conditional sufficient statistics during the auto-regressive process beyond what is needed for each covariate. On standard tabular and image data benchmarks, TraDE produces significantly better density estimates than existing approaches such as normalizing flow estimators and recurrent auto-regressive models. However, log-likelihood on held-out data only partially reflects how useful these estimates are in real-world applications. In order to systematically evaluate density estimators, we present a suite of tasks such as regression using generated samples, out-of-distribution detection, and robustness to noise in the training data, and demonstrate that TraDE works well in these scenarios.

Submitted 14 October, 2020; v1 submitted 6 April, 2020; originally announced April 2020.

arXiv:2003.08907 [pdf, other]

Overinterpretation reveals image classification model pathologies

Authors: Brandon Carter, Siddhartha Jain, Jonas Mueller, David Gifford

Abstract: Image classifiers are typically scored on their test set accuracy, but high accuracy can mask a subtle type of model failure. We find that high scoring convolutional neural networks (CNNs) on popular benchmarks exhibit troubling pathologies that allow them to display high accuracy even in the absence of semantically salient features. When a model provides a high-confidence decision without salient supporting input features, we say the classifier has overinterpreted its input, finding too much class-evidence in patterns that appear nonsensical to humans. Here, we demonstrate that neural networks trained on CIFAR-10 and ImageNet suffer from overinterpretation, and we find models on CIFAR-10 make confident predictions even when 95% of input images are masked and humans cannot discern salient features in the remaining pixel-subsets. We introduce Batched Gradient SIS, a new method for discovering sufficient input subsets for complex datasets, and use this method to show the sufficiency of border pixels in ImageNet for training and testing. Although these patterns portend potential model fragility in real-world deployment, they are in fact valid statistical patterns of the benchmark that alone suffice to attain high test accuracy. Unlike adversarial examples, overinterpretation relies upon unmodified image pixels. We find ensembling and input dropout can each help mitigate overinterpretation.

Submitted 7 December, 2021; v1 submitted 19 March, 2020; originally announced March 2020.

Comments: NeurIPS 2021

arXiv:2003.06505 [pdf, other]

AutoGluon-Tabular: Robust and Accurate AutoML for Structured Data

Authors: Nick Erickson, Jonas Mueller, Alexander Shirkov, Hang Zhang, Pedro Larroy, Mu Li, Alexander Smola

Abstract: We introduce AutoGluon-Tabular, an open-source AutoML framework that requires only a single line of Python to train highly accurate machine learning models on an unprocessed tabular dataset such as a CSV file. Unlike existing AutoML frameworks that primarily focus on model/hyperparameter selection, AutoGluon-Tabular succeeds by ensembling multiple models and stacking them in multiple layers. Experiments reveal that our multi-layer combination of many models offers better use of allocated training time than seeking out the best. A second contribution is an extensive evaluation of public and commercial AutoML platforms including TPOT, H2O, AutoWEKA, auto-sklearn, AutoGluon, and Google AutoML Tables. Tests on a suite of 50 classification and regression tasks from Kaggle and the OpenML AutoML Benchmark reveal that AutoGluon is faster, more robust, and much more accurate. We find that AutoGluon often even outperforms the best-in-hindsight combination of all of its competitors. In two popular Kaggle competitions, AutoGluon beat 99% of the participating data scientists after merely 4h of training on the raw data.

Submitted 13 March, 2020; originally announced March 2020.

arXiv:1911.13060 [pdf, other]

Orthogonal Wasserstein GANs

Authors: Jan Müller, Reinhard Klein, Michael Weinmann

Abstract: Wasserstein-GANs have been introduced to address the deficiencies of generative adversarial networks (GANs) regarding the problems of vanishing gradients and mode collapse during the training, leading to improved convergence behaviour and improved image quality. However, Wasserstein-GANs require the discriminator to be Lipschitz continuous. In current state-of-the-art Wasserstein-GANs this constraint is enforced via gradient norm regularization. In this paper, we demonstrate that this regularization does not encourage a broad distribution of spectral-values in the discriminator weights, hence resulting in less fidelity in the learned distribution. We therefore investigate the possibility of substituting this Lipschitz constraint with an orthogonality constraint on the weight matrices. We compare three different weight orthogonalization techniques with regards to their convergence properties, their ability to ensure the Lipschitz condition and the achieved quality of the learned distribution. In addition, we provide a comparison to Wasserstein-GANs trained with current state-of-the-art methods, where we demonstrate the potential of solely using orthogonality-based regularization. In this context, we propose an improved training procedure for Wasserstein-GANs which utilizes orthogonalization to further increase its generalization capability. Finally, we provide a novel metric to evaluate the generalization capabilities of the discriminators of different Wasserstein-GANs.

Submitted 14 December, 2019; v1 submitted 29 November, 2019; originally announced November 2019.

Comments: Correction of the formatting of the appendix

MSC Class: I.2.6 ACM Class: I.2.6

arXiv:1910.09599 [pdf, ps, other]

On the space-time expressivity of ResNets

Authors: Johannes Müller

Abstract: Residual networks (ResNets) are a deep learning architecture that substantially improved the state of the art performance in certain supervised learning tasks. Since then, they have received continuously growing attention. ResNets have a recursive structure $x_{k+1} = x_k + R_k(x_k)$ where $R_k$ is a neural network called a residual block. This structure can be seen as the Euler discretisation of an associated ordinary differential equation (ODE) which is called a neural ODE. Recently, ResNets were proposed as the space-time approximation of ODEs which are not of this neural type. To elaborate this connection we show that by increasing the number of residual blocks as well as their expressivity the solution of an arbitrary ODE can be approximated in space and time simultaneously by deep ReLU ResNets. Further, we derive estimates on the complexity of the residual blocks required to obtain a prescribed accuracy under certain regularity assumptions.

Submitted 27 February, 2020; v1 submitted 21 October, 2019; originally announced October 2019.

Comments: Extended abstract of master's thesis; presented at the ICLR 2020 Workshop on Integration of Deep Neural Models and Differential Equations; full version of the thesis available under https://freidok.uni-freiburg.de/data/151788
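
The recursion $x_{k+1} = x_k + R_k(x_k)$ from the abstract above can be read as an explicit Euler scheme when each residual block equals a step size times the ODE's right-hand side. The sketch below illustrates only that reading. It is an assumption-laden toy: the residual blocks are hand-built as `h * f(t_k, x)` rather than trained ReLU networks, and the names `resnet_forward`, `make_euler_block`, and the test ODE dx/dt = -x are hypothetical choices for illustration.

```python
import math


def resnet_forward(x0, residual_block, num_blocks):
    """Apply the ResNet recursion x_{k+1} = x_k + R_k(x_k) for num_blocks steps."""
    x = x0
    for k in range(num_blocks):
        x = x + residual_block(k, x)
    return x


def make_euler_block(f, t_end, num_blocks):
    """Build residual blocks R_k(x) = h * f(t_k, x), i.e. one explicit Euler step each."""
    h = t_end / num_blocks

    def block(k, x):
        return h * f(k * h, x)

    return block


def f(t, x):
    """Right-hand side of the toy ODE dx/dt = -x, whose exact solution is x0 * exp(-t)."""
    return -x


for n in (10, 100, 1000):
    approx = resnet_forward(1.0, make_euler_block(f, 1.0, n), n)
    print(n, abs(approx - math.exp(-1.0)))
```

With more blocks the step size shrinks and the printed error against the exact solution should decrease, which mirrors the role of depth in the Euler interpretation.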

arXiv:1909.04844 [pdf, other]

Recognizing Variables from their Data via Deep Embeddings of Distributions

Authors: Jonas Mueller, Alex Smola

Abstract: A key obstacle in automated analytics and meta-learning is the inability to recognize when different datasets contain measurements of the same variable. Because provided attribute labels are often uninformative in practice, this task may be more robustly addressed by leveraging the data values themselves rather than just relying on their arbitrarily selected variable names. Here, we present a computationally efficient method to identify high-confidence variable matches between a given set of data values and a large repository of previously encountered datasets. Our approach enjoys numerous advantages over distributional similarity based techniques because we leverage learned vector embeddings of datasets which adaptively account for natural forms of data variation encountered in practice. Based on the neural architecture of deep sets, our embeddings can be computed for both numeric and string data. In dataset search and schema matching tasks, our methods outperform standard statistical techniques and we find that the learned embeddings generalize well to new data sources.

Submitted 11 September, 2019; originally announced September 2019.

Comments: IEEE International Conference on Data Mining (ICDM), 2019

arXiv:1908.10947 [pdf, other]

Surrogate Optimization of Deep Neural Networks for Groundwater Predictions

Authors: Juliane Mueller, Jangho Park, Reetik Sahu, Charuleka Varadharajan, Bhavna Arora, Boris Faybishenko, Deborah Agarwal

Abstract: Sustainable management of groundwater resources under changing climatic conditions requires reliable and accurate predictions of groundwater levels. Mechanistic multi-scale, multi-physics simulation models are often too hard to use for this purpose, especially for groundwater managers who do not have access to the complex compute resources and data. Therefore, we analyzed the applicability and performance of four modern deep learning computational models for predictions of groundwater levels. We compare three methods for optimizing the models' hyperparameters, including two surrogate model-based algorithms and a random sampling method. The models were tested using predictions of the groundwater level in Butte County, California, USA, taking into account the temporal variability of streamflow, precipitation, and ambient temperature. Our numerical study shows that the optimization of the hyperparameters can lead to reasonably accurate performance of all models (root mean squared errors of groundwater predictions of 2 meters or less), but the "simplest" network, namely a multilayer perceptron (MLP), performs overall better for learning and predicting groundwater data than the more advanced long short-term memory or convolutional neural networks in terms of prediction accuracy and time-to-solution, making the MLP a suitable candidate for groundwater prediction.

Submitted 3 February, 2020; v1 submitted 28 August, 2019; originally announced August 2019.

Comments: submitted to Journal of Global Optimization; main paper: 25 pages, 19 figures, 1 table; online supplement: 11 pages, 18 figures, 3 tables

Report number: LBNL-2001234

arXiv:1906.07380 [pdf, other]

Maximizing Overall Diversity for Improved Uncertainty Estimates in Deep Ensembles

Authors: Siddhartha Jain, Ge Liu, Jonas Mueller, David Gifford

Abstract: The inaccuracy of neural network models on inputs that do not stem from the training data distribution is both problematic and at times unrecognized. Model uncertainty estimation can address this issue, where uncertainty estimates are often based on the variation in predictions produced by a diverse ensemble of models applied to the same input. Here we describe Maximize Overall Diversity (MOD), a straightforward approach to improve ensemble-based uncertainty estimates by encouraging larger overall diversity in ensemble predictions across all possible inputs that might be encountered in the future. When applied to various neural network ensembles, MOD significantly improves predictive performance for out-of-distribution test examples without sacrificing in-distribution performance on 38 Protein-DNA binding regression datasets, 9 UCI datasets, and the IMDB-Wiki image dataset. Across many Bayesian optimization tasks, the performance of UCB acquisition is also greatly improved by leveraging MOD uncertainty estimates.

Submitted 12 February, 2020; v1 submitted 18 June, 2019; originally announced June 2019.

Comments: 10 pages, 3 figures

arXiv:1905.12777 [pdf, other]

Educating Text Autoencoders: Latent Representation Guidance via Denoising

Authors: Tianxiao Shen, Jonas Mueller, Regina Barzilay, Tommi Jaakkola

Abstract: Generative autoencoders offer a promising approach for controllable text generation by leveraging their latent sentence representations. However, current models struggle to maintain coherent latent spaces required to perform meaningful text manipulations via latent vector operations. Specifically, we demonstrate by example that neural encoders do not necessarily map similar sentences to nearby latent vectors. A theoretical explanation for this phenomenon establishes that high capacity autoencoders can learn an arbitrary mapping between sequences and associated latent representations. To remedy this issue, we augment adversarial autoencoders with a denoising objective where original sentences are reconstructed from perturbed versions (referred to as DAAE). We prove that this simple modification guides the latent space geometry of the resulting model by encouraging the encoder to map similar texts to similar latent representations. In empirical comparisons with various types of autoencoders, our model provides the best trade-off between generation quality and reconstruction capacity. Moreover, the improved geometry of the DAAE latent space enables zero-shot text style transfer via simple latent vector arithmetic.

Submitted 7 July, 2020; v1 submitted 29 May, 2019; originally announced May 2019.

Comments: ICML 2020 camera-ready

arXiv:1811.00915 [pdf, ps, other]

doi 10.1109/BIBM.2018.8621225

Convolutional Neural Networks for Epileptic Seizure Prediction

Authors: Matthias Eberlein, Raphael Hildebrand, Ronald Tetzlaff, Nico Hoffmann, Levin Kuhlmann, Benjamin Brinkmann, Jens Müller

Abstract: Epilepsy is the most common neurological disorder and an accurate forecast of seizures would help to overcome the patient's uncertainty and helplessness. In this contribution, we present and discuss a novel methodology for the classification of intracranial electroencephalography (iEEG) for seizure prediction. Contrary to previous approaches, we categorically refrain from an extraction of hand-crafted features and use a convolutional neural network (CNN) topology instead for both the determination of suitable signal characteristics and the binary classification of preictal and interictal segments. Three different models have been evaluated on public datasets with long-term recordings from four dogs and three patients. Overall, our findings demonstrate the general applicability. In this work we discuss the strengths and limitations of our methodology.

Submitted 11 April, 2023; v1 submitted 2 November, 2018; originally announced November 2018.

Comments: accepted for MLESP 2018

Journal ref: 2018 IEEE International Conference on Bioinformatics and Biomedicine (BIBM)

arXiv:1810.03805 [pdf, other]

What made you do this? Understanding black-box decisions with sufficient input subsets

Authors: Brandon Carter, Jonas Mueller, Siddhartha Jain, David Gifford

Abstract: Local explanation frameworks aim to rationalize particular decisions made by a black-box prediction model. Existing techniques are often restricted to a specific type of predictor or based on input saliency, which may be undesirably sensitive to factors unrelated to the model's decision making process. We instead propose sufficient input subsets that identify minimal subsets of features whose observed values alone suffice for the same decision to be reached, even if all other input feature values are missing. General principles that globally govern a model's decision-making can also be revealed by searching for clusters of such input patterns across many data points. Our approach is conceptually straightforward, entirely model-agnostic, simply implemented using instance-wise backward selection, and able to produce more concise rationales than existing techniques. We demonstrate the utility of our interpretation method on various neural network models trained on text, image, and genomic data.

Submitted 8 February, 2019; v1 submitted 9 October, 2018; originally announced October 2018.

Comments: Published in AISTATS 2019; Equal contribution by first two authors

arXiv:1809.10784 [pdf, other]

Adaptive Gaussian process surrogates for Bayesian inference

Authors: Timur Takhtaganov, Juliane Müller

Abstract: We present an adaptive approach to the construction of Gaussian process surrogates for Bayesian inference with expensive-to-evaluate forward models. Our method relies on the fully Bayesian approach to training Gaussian process models and utilizes the expected improvement idea from Bayesian global optimization. We adaptively construct training designs by maximizing the expected improvement in fit of the Gaussian process model to the noisy observational data. Numerical experiments on model problems with synthetic data demonstrate the effectiveness of the obtained adaptive designs compared to the fixed non-adaptive designs in terms of accurate posterior estimation at a fraction of the cost of inference with forward models.

Submitted 27 September, 2018; originally announced September 2018.

Comments: 38 pages, submitted to the SIAM/ASA Journal on Uncertainty Quantification

MSC Class: 62F15; 60G15; 62G08; 62K20; 62K86

arXiv:1806.00050 [pdf, other]

Interpretable Set Functions

Authors: Andrew Cotter, Maya Gupta, Heinrich Jiang, James Muller, Taman Narayan, Serena Wang, Tao Zhu

Abstract: We propose learning flexible but interpretable functions that aggregate a variable-length set of permutation-invariant feature vectors to predict a label. We use a deep lattice network model so we can architect the model structure to enhance interpretability, and add monotonicity constraints between inputs-and-outputs. We then use the proposed set function to automate the engineering of dense, interpretable features from sparse categorical features, which we call semantic feature engine. Experiments on real-world data show the achieved accuracy is similar to deep sets or deep neural networks, and is easier to debug and understand.

Submitted 31 May, 2018; originally announced June 2018.

arXiv:1801.10242 [pdf, other]

Low-Rank Bandit Methods for High-Dimensional Dynamic Pricing

Authors: Jonas Mueller, Vasilis Syrgkanis, Matt Taddy

Abstract: We consider dynamic pricing with many products under an evolving but low-dimensional demand model. Assuming the temporal variation in cross-elasticities exhibits low-rank structure based on fixed (latent) features of the products, we show that the revenue maximization problem reduces to an online bandit convex optimization with side information given by the observed demands. We design dynamic pricing algorithms whose revenue approaches that of the best fixed price vector in hindsight, at a rate that only depends on the intrinsic rank of the demand model and not the number of products. Our approach applies a bandit convex optimization algorithm in a projected low-dimensional space spanned by the latent product features, while simultaneously learning this span via online singular value decomposition of a carefully-crafted matrix containing the observed demands.

Submitted 10 September, 2019; v1 submitted 30 January, 2018; originally announced January 2018.

Comments: NeurIPS 2019

arXiv:1606.05027 [pdf, other]

Learning Optimal Interventions

Authors: Jonas Mueller, David N. Reshef, George Du, Tommi Jaakkola

Abstract: Our goal is to identify beneficial interventions from observational data. We consider interventions that are narrowly focused (impacting few covariates) and may be tailored to each individual or globally enacted over a population. For applications where harmful intervention is drastically worse than proposing no change, we propose a conservative definition of the optimal intervention. Assuming the underlying relationship remains invariant under intervention, we develop efficient algorithms to identify the optimal intervention policy from limited data and provide theoretical guarantees for our approach in a Gaussian Process setting. Although our methods assume covariates can be precisely adjusted, they remain capable of improving outcomes in misspecified settings where interventions incur unintentional downstream effects. Empirically, our approach identifies good interventions in two practical applications: gene perturbation and writing improvement.

Submitted 22 February, 2017; v1 submitted 15 June, 2016; originally announced June 2016.

Comments: AISTATS 2017

Journal ref: Proceedings of the 20th International Conference on Artificial Intelligence and Statistics, PMLR 54:1039-1047, 2017

arXiv:1511.04486 [pdf, other]

doi 10.1080/01621459.2017.1341412

Modeling Persistent Trends in Distributions

Authors: Jonas Mueller, Tommi Jaakkola, David Gifford

Abstract: We present a nonparametric framework to model a short sequence of probability distributions that vary both due to underlying effects of sequential progression and confounding noise. To distinguish between these two types of variation and estimate the sequential-progression effects, our approach leverages an assumption that these effects follow a persistent trend. This work is motivated by the recent rise of single-cell RNA-sequencing experiments over a brief time course, which aim to identify genes relevant to the progression of a particular biological process across diverse cell populations. While classical statistical tools focus on scalar-response regression or order-agnostic differences between distributions, it is desirable in this setting to consider both the full distributions as well as the structure imposed by their ordering. We introduce a new regression model for ordinal covariates where responses are univariate distributions and the underlying relationship reflects consistent changes in the distributions over increasing levels of the covariate. This concept is formalized as a "trend" in distributions, which we define as an evolution that is linear under the Wasserstein metric. Implemented via a fast alternating projections algorithm, our method exhibits numerous strengths in simulations and analyses of single-cell gene expression data.

Submitted 24 May, 2017; v1 submitted 13 November, 2015; originally announced November 2015.

Comments: To appear in: Journal of the American Statistical Association

Journal ref: Journal of the American Statistical Association, 113(523):1296-1310, 2018

arXiv:1510.08956 [pdf, other]

Principal Differences Analysis: Interpretable Characterization of Differences between Distributions

Authors: Jonas Mueller, Tommi Jaakkola

Abstract: We introduce principal differences analysis (PDA) for analyzing differences between high-dimensional distributions. The method operates by finding the projection that maximizes the Wasserstein divergence between the resulting univariate populations. Relying on the Cramer-Wold device, it requires no assumptions about the form of the underlying distributions, nor the nature of their inter-class differences. A sparse variant of the method is introduced to identify features responsible for the differences. We provide algorithms for both the original minimax formulation as well as its semidefinite relaxation. In addition to deriving some convergence results, we illustrate how the approach may be applied to identify differences between cell populations in the somatosensory cortex and hippocampus as manifested by single cell RNA-seq. Our broader framework extends beyond the specific choice of Wasserstein divergence.

Submitted 29 October, 2015; originally announced October 2015.

Comments: Advances in Neural Information Processing Systems 28 (NIPS 2015)

Journal ref: Advances in Neural Information Processing Systems 28: 1702-1710, 2015

arXiv:1110.4531 [pdf, other]

Regression for sets of polynomial equations

Authors: Franz Johannes Király, Paul von Bünau, Jan Saputra Müller, Duncan Blythe, Frank Meinecke, Klaus-Robert Müller

Abstract: We propose a method called ideal regression for approximating an arbitrary system of polynomial equations by a system of a particular type. Using techniques from approximate computational algebraic geometry, we show how we can solve ideal regression directly without resorting to numerical optimization. Ideal regression is useful whenever the solution to a learning problem can be described by a system of polynomial equations. As an example, we demonstrate how to formulate Stationary Subspace Analysis (SSA), a source separation problem, in terms of ideal regression, which also yields a consistent estimator for SSA. We then compare this estimator in simulations with previous optimization-based approaches for SSA.

Submitted 25 March, 2013; v1 submitted 20 October, 2011; originally announced October 2011.

Comments: arXiv admin note: substantial text overlap with arXiv:1108.1483

Journal ref: Journal of Machine Learning Research Workshop and Conference Proceedings Vol. 22: Proceedings on the Fifteenth International Conference on Artificial Intelligence and Statistics, 22:628-637, 2012

Showing 1–42 of 42 results for author: Mueller, J