Detecting and Identifying Selection Structure in Sequential Data

Authors: Yujia Zheng, Zeyu Tang, Yiwen Qiu, Bernhard Schölkopf, Kun Zhang

Abstract: We argue that the selective inclusion of data points based on latent objectives is common in practical situations, such as music sequences. Since this selection process often distorts statistical analysis, previous work primarily views it as a bias to be corrected and proposes various methods to mitigate its effect. However, while controlling this bias is crucial, selection also offers an opportunity to provide a deeper insight into the hidden generation process, as it is a fundamental mechanism underlying what we observe. In particular, overlooking selection in sequential data can lead to an incomplete or overcomplicated inductive bias in modeling, such as assuming a universal autoregressive structure for all dependencies. Therefore, rather than merely viewing it as a bias, we explore the causal structure of selection in sequential data to delve deeper into the complete causal process. Specifically, we show that selection structure is identifiable without any parametric assumptions or interventional experiments. Moreover, even in cases where selection variables coexist with latent confounders, we still establish the nonparametric identifiability under appropriate structural conditions. Meanwhile, we also propose a provably correct algorithm to detect and identify selection structures as well as other types of dependencies. The framework has been validated empirically on both synthetic data and real-world music.

Submitted 29 June, 2024; originally announced July 2024.

Comments: ICML 2024

arXiv:2406.00388 [pdf, ps, other]

Products, Abstractions and Inclusions of Causal Spaces

Authors: Simon Buchholz, Junhyung Park, Bernhard Schölkopf

Abstract: Causal spaces have recently been introduced as a measure-theoretic framework to encode the notion of causality. While it has some advantages over established frameworks, such as structural causal models, the theory is so far only developed for single causal spaces. In many mathematical theories, not least the theory of probability spaces of which causal spaces are a direct extension, combinations of objects and maps between objects form a central part. In this paper, taking inspiration from such objects in probability theory, we propose the definitions of products of causal spaces, as well as (stochastic) transformations between causal spaces. In the context of causality, these quantities can be given direct semantic interpretations as causally independent components, abstractions and extensions.

Submitted 6 June, 2024; v1 submitted 1 June, 2024; originally announced June 2024.

arXiv:2402.09236 [pdf, other]

Learning Interpretable Concepts: Unifying Causal Representation Learning and Foundation Models

Authors: Goutham Rajendran, Simon Buchholz, Bryon Aragam, Bernhard Schölkopf, Pradeep Ravikumar

Abstract: To build intelligent machine learning systems, there are two broad approaches. One approach is to build inherently interpretable models, as endeavored by the growing field of causal representation learning. The other approach is to build highly-performant foundation models and then invest efforts into understanding how they work. In this work, we relate these two approaches and study how to learn human-interpretable concepts from data. Weaving together ideas from both fields, we formally define a notion of concepts and show that they can be provably recovered from diverse data. Experiments on synthetic data and large language models show the utility of our unified approach.

Submitted 14 February, 2024; originally announced February 2024.

Comments: 36 pages

arXiv:2306.02235 [pdf, other]

Learning Linear Causal Representations from Interventions under General Nonlinear Mixing

Authors: Simon Buchholz, Goutham Rajendran, Elan Rosenfeld, Bryon Aragam, Bernhard Schölkopf, Pradeep Ravikumar

Abstract: We study the problem of learning causal representations from unknown, latent interventions in a general setting, where the latent distribution is Gaussian but the mixing function is completely general. We prove strong identifiability results given unknown single-node interventions, i.e., without having access to the intervention targets. This generalizes prior works which have focused on weaker classes, such as linear maps or paired counterfactual data. This is also the first instance of causal identifiability from non-paired interventions for deep neural network embeddings. Our proof relies on carefully uncovering the high-dimensional geometric structure present in the data distribution after a non-linear density transformation, which we capture by analyzing quadratic forms of precision matrices of the latent distributions. Finally, we propose a contrastive algorithm to identify the latent variables in practice and evaluate its performance on various tasks.

Submitted 18 December, 2023; v1 submitted 3 June, 2023; originally announced June 2023.

Comments: Accepted as Oral paper at NeurIPS 2023

arXiv:2305.17139 [pdf, other]

A Measure-Theoretic Axiomatisation of Causality

Authors: Junhyung Park, Simon Buchholz, Bernhard Schölkopf, Krikamol Muandet

Abstract: Causality is a central concept in a wide range of research areas, yet there is still no universally agreed axiomatisation of causality. We view causality both as an extension of probability theory and as a study of what happens when one intervenes on a system, and argue in favour of taking Kolmogorov's measure-theoretic axiomatisation of probability as the starting point towards an axiomatisation of causality. To that end, we propose the notion of a causal space, consisting of a probability space along with a collection of transition probability kernels, called causal kernels, that encode the causal information of the space. Our proposed framework is not only rigorously grounded in measure theory, but it also sheds light on long-standing limitations of existing frameworks including, for example, cycles, latent variables and stochastic processes.

Submitted 6 June, 2024; v1 submitted 19 May, 2023; originally announced May 2023.

arXiv:2212.08498 [pdf, other]

Evaluating vaccine allocation strategies using simulation-assisted causal modelling

Authors: Armin Kekić, Jonas Dehning, Luigi Gresele, Julius von Kügelgen, Viola Priesemann, Bernhard Schölkopf

Abstract: Early on during a pandemic, vaccine availability is limited, requiring prioritisation of different population groups. Evaluating vaccine allocation is therefore a crucial element of pandemic response. In the present work, we develop a model to retrospectively evaluate age-dependent counterfactual vaccine allocation strategies against the COVID-19 pandemic. To estimate the effect of allocation on the expected severe-case incidence, we employ a simulation-assisted causal modelling approach which combines a compartmental infection-dynamics simulation, a coarse-grained, data-driven causal model and literature estimates for immunity waning. We compare Israel's implemented vaccine allocation strategy in 2021 to counterfactual strategies such as no prioritisation, prioritisation of younger age groups or a strict risk-ranked approach; we find that Israel's implemented strategy was indeed highly effective. We also study the marginal impact of increasing vaccine uptake for a given age group and find that increasing vaccinations in the elderly is most effective at preventing severe cases, whereas additional vaccinations for middle-aged groups reduce infections most effectively. Due to its modular structure, our model can easily be adapted to study future pandemics. We demonstrate this flexibility by investigating vaccine allocation strategies for a pandemic with characteristics of the Spanish Flu. Our approach thus helps evaluate vaccination strategies under the complex interplay of core epidemic factors, including age-dependent risk profiles, immunity waning, vaccine availability and spreading rates.

Submitted 14 December, 2022; originally announced December 2022.

arXiv:2207.12067 [pdf, other]

Homomorphism Autoencoder -- Learning Group Structured Representations from Observed Transitions

Authors: Hamza Keurti, Hsiao-Ru Pan, Michel Besserve, Benjamin F. Grewe, Bernhard Schölkopf

Abstract: How agents can learn internal models that veridically represent interactions with the real world is a largely open question. As machine learning is moving towards representations containing not just observational but also interventional knowledge, we study this problem using tools from representation learning and group theory. We propose methods enabling an agent acting upon the world to learn internal representations of sensory information that are consistent with actions that modify it. We use an autoencoder equipped with a group representation acting on its latent space, trained using an equivariance-derived loss in order to enforce a suitable homomorphism property on the group representation. In contrast to existing work, our approach does not require prior knowledge of the group and does not restrict the set of actions the agent can perform. We motivate our method theoretically, and show empirically that it can learn a group representation of the actions, thereby capturing the structure of the set of transformations applied to the environment. We further show that this allows agents to predict the effect of sequences of future actions with improved accuracy.

Submitted 2 July, 2024; v1 submitted 25 July, 2022; originally announced July 2022.

Comments: Accepted at ICML2023, Presented at the Symmetry and Geometry in Neural Representations Workshop (NeurReps) @ NeurIPS2022, 26 pages, 17 figures

arXiv:2207.04771 [pdf, other]

Functional Generalized Empirical Likelihood Estimation for Conditional Moment Restrictions

Authors: Heiner Kremer, Jia-Jie Zhu, Krikamol Muandet, Bernhard Schölkopf

Abstract: Important problems in causal inference, economics, and, more generally, robust machine learning can be expressed as conditional moment restrictions, but estimation becomes challenging as it requires solving a continuum of unconditional moment restrictions. Previous works addressed this problem by extending the generalized method of moments (GMM) to continuum moment restrictions. In contrast, generalized empirical likelihood (GEL) provides a more general framework and has been shown to enjoy favorable small-sample properties compared to GMM-based estimators. To benefit from recent developments in machine learning, we provide a functional reformulation of GEL in which arbitrary models can be leveraged. Motivated by a dual formulation of the resulting infinite dimensional optimization problem, we devise a practical method and explore its asymptotic properties. Finally, we provide kernel- and neural network-based implementations of the estimator, which achieve state-of-the-art empirical performance on two conditional moment restriction problems.

Submitted 16 February, 2024; v1 submitted 11 July, 2022; originally announced July 2022.

arXiv:2206.02953 [pdf, other]

Sampling without Replacement Leads to Faster Rates in Finite-Sum Minimax Optimization

Authors: Aniket Das, Bernhard Schölkopf, Michael Muehlebach

Abstract: We analyze the convergence rates of stochastic gradient algorithms for smooth finite-sum minimax optimization and show that, for many such algorithms, sampling the data points without replacement leads to faster convergence compared to sampling with replacement. For the smooth and strongly convex-strongly concave setting, we consider gradient descent ascent and the proximal point method, and present a unified analysis of two popular without-replacement sampling strategies, namely Random Reshuffling (RR), which shuffles the data every epoch, and Single Shuffling or Shuffle Once (SO), which shuffles only at the beginning. We obtain tight convergence rates for RR and SO and demonstrate that these strategies lead to faster convergence than uniform sampling. Moving beyond convexity, we obtain similar results for smooth nonconvex-nonconcave objectives satisfying a two-sided Polyak-Łojasiewicz inequality. Finally, we demonstrate that our techniques are general enough to analyze the effect of data-ordering attacks, where an adversary manipulates the order in which data points are supplied to the optimizer. Our analysis also recovers tight rates for the incremental gradient method, where the data points are not shuffled at all.

Submitted 10 October, 2022; v1 submitted 6 June, 2022; originally announced June 2022.

Comments: 36th Conference on Neural Information Processing Systems (NeurIPS 2022)
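
The data orderings named in the abstract above (Random Reshuffling, Shuffle Once, incremental gradient, and uniform sampling with replacement) can be illustrated with a minimal Python sketch. It only generates the index order used in each epoch and is not the authors' implementation; the function name `epoch_orders` and the strategy labels `"RR"`, `"SO"`, `"IG"` and `"uniform"` are hypothetical names chosen for this illustration.

```python
import random


def epoch_orders(n, epochs, strategy, seed=0):
    """Return one list of data indices per epoch for a finite sum with n terms.

    strategy (hypothetical labels for this sketch):
      "RR"      - Random Reshuffling: a fresh permutation every epoch
      "SO"      - Shuffle Once: one permutation, reused in every epoch
      "IG"      - incremental gradient: fixed order 0..n-1, never shuffled
      "uniform" - n independent draws with replacement per epoch
    """
    rng = random.Random(seed)
    base = list(range(n))
    if strategy == "SO":
        rng.shuffle(base)  # shuffle only at the beginning
    orders = []
    for _ in range(epochs):
        if strategy == "RR":
            order = base[:]
            rng.shuffle(order)  # reshuffle at the start of every epoch
        elif strategy in ("SO", "IG"):
            order = base[:]  # reuse the same order in every epoch
        elif strategy == "uniform":
            order = [rng.randrange(n) for _ in range(n)]
        else:
            raise ValueError(f"unknown strategy: {strategy}")
        orders.append(order)
    return orders
```

Under RR and SO every epoch visits each index exactly once, whereas uniform sampling may repeat some indices and skip others within an epoch.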

arXiv:2204.11564 [pdf, other]

Maximum Mean Discrepancy Distributionally Robust Nonlinear Chance-Constrained Optimization with Finite-Sample Guarantee

Authors: Yassine Nemmour, Heiner Kremer, Bernhard Schölkopf, Jia-Jie Zhu

Abstract: This paper is motivated by addressing open questions in distributionally robust chance-constrained programs (DRCCP) using the popular Wasserstein ambiguity sets. Specifically, the computational techniques for those programs typically place restrictive assumptions on the constraint functions and the size of the Wasserstein ambiguity sets is often set using costly cross-validation (CV) procedures or conservative measure concentration bounds. In contrast, we propose a practical DRCCP algorithm using kernel maximum mean discrepancy (MMD) ambiguity sets, which we term MMD-DRCCP, to treat general nonlinear constraints without using ad-hoc reformulation techniques. MMD-DRCCP can handle general nonlinear and non-convex constraints with a proven finite-sample constraint satisfaction guarantee of a dimension-independent $\mathcal{O}(\frac{1}{\sqrt{N}})$ rate, achievable by a practical algorithm. We further propose an efficient bootstrap scheme for constructing sharp MMD ambiguity sets in practice without resorting to CV. Our algorithm is validated numerically on a portfolio optimization problem and a tube-based distributionally robust model predictive control problem with non-convex constraints.

Submitted 25 April, 2022; originally announced April 2022.

arXiv:2203.15756 [pdf, other]

Causal de Finetti: On the Identification of Invariant Causal Structure in Exchangeable Data

Authors: Siyuan Guo, Viktor Tóth, Bernhard Schölkopf, Ferenc Huszár

Abstract: Constraint-based causal discovery methods leverage conditional independence tests to infer causal relationships in a wide variety of applications. Like the majority of machine learning methods, existing work focuses on studying independent and identically distributed data. However, it is known that even with infinite i.i.d. data, constraint-based methods can only identify causal structures up to broad Markov equivalence classes, posing a fundamental limitation for causal discovery. In this work, we observe that exchangeable data contains richer conditional independence structure than i.i.d. data, and show how the richer structure can be leveraged for causal discovery. We first present causal de Finetti theorems, which state that exchangeable distributions with certain non-trivial conditional independences can always be represented as independent causal mechanism (ICM) generative processes. We then present our main identifiability theorem, which shows that given data from an ICM generative process, its unique causal structure can be identified through performing conditional independence tests. We finally develop a causal discovery algorithm and demonstrate its applicability to inferring causal relationships from multi-environment data. Our code and models are publicly available at: https://github.com/syguo96/Causal-de-Finetti

Submitted 24 May, 2024; v1 submitted 29 March, 2022; originally announced March 2022.

Comments: camera-ready NeurIPS 2023

arXiv:2201.05830 [pdf, other]

Physical Derivatives: Computing policy gradients by physical forward-propagation

Authors: Arash Mehrjou, Ashkan Soleymani, Stefan Bauer, Bernhard Schölkopf

Abstract: Model-free and model-based reinforcement learning are two ends of a spectrum. Learning a good policy without a dynamic model can be prohibitively expensive. Learning the dynamic model of a system can reduce the cost of learning the policy, but it can also introduce bias if it is not accurate. We propose a middle ground where instead of the transition model, the sensitivity of the trajectories with respect to the perturbation of the parameters is learned. This allows us to predict the local behavior of the physical system around a set of nominal policies without knowing the actual model. We assay our method on a custom-built physical robot in extensive experiments and show the feasibility of the approach in practice. We investigate potential challenges when applying our method to physical systems and propose solutions to each of them.

Submitted 15 January, 2022; originally announced January 2022.

arXiv:2110.13588 [pdf, other]

Distributional Robustness Regularized Scenario Optimization with Application to Model Predictive Control

Authors: Yassine Nemmour, Bernhard Schölkopf, Jia-Jie Zhu

Abstract: We provide a functional view of distributional robustness motivated by robust statistics and functional analysis. This results in two practical computational approaches for approximate distributionally robust nonlinear optimization based on gradient norms and reproducing kernel Hilbert spaces. Our method can be applied to the settings of statistical learning with small sample size and test distribution shift. As a case study, we robustify scenario-based stochastic model predictive control with general nonlinear constraints. In particular, we demonstrate constraint satisfaction with only a small number of scenarios under distribution shift.

Submitted 26 October, 2021; originally announced October 2021.

Journal ref: Proceedings of the 3rd Conference on Learning for Dynamics and Control, PMLR 144:1255-1269, 2021

arXiv:2102.11834 [pdf, other]

Finding Stable Matchings in PhD Markets with Consistent Preferences and Cooperative Partners

Authors: Maximilian Mordig, Riccardo Della Vecchia, Nicolò Cesa-Bianchi, Bernhard Schölkopf

Abstract: We introduce a new algorithm for finding stable matchings in multi-sided matching markets. Our setting is motivated by a PhD market of students, advisors, and co-advisors, and can be generalized to supply chain networks viewed as $n$-sided markets. In the three-sided PhD market, students primarily care about advisors and then about co-advisors (consistent preferences), while advisors and co-advisors have preferences over students only (hence they are cooperative). A student must be matched to one advisor and one co-advisor, or not at all. In contrast to previous work, advisor-student and student-co-advisor pairs may not be mutually acceptable (e.g., a student may not want to work with an advisor or co-advisor and vice versa). We show that three-sided stable matchings always exist, and present an algorithm that, in time quadratic in the market size (up to log factors), finds a three-sided stable matching using any two-sided stable matching algorithm as matching engine. We illustrate the challenges that arise when not all advisor-co-advisor pairs are compatible. We then generalize our algorithm to $n$-sided markets with quotas and show how they can model supply chain networks. Finally, we show how our algorithm outperforms the baseline given by [Danilov, 2003] in terms of both producing a stable matching and a larger number of matches on a synthetic dataset.

Submitted 6 July, 2021; v1 submitted 23 February, 2021; originally announced February 2021.

arXiv:2102.08474 [pdf, other]

Adversarially Robust Kernel Smoothing

Authors: Jia-Jie Zhu, Christina Kouridi, Yassine Nemmour, Bernhard Schölkopf

Abstract: We propose a scalable robust learning algorithm combining kernel smoothing and robust optimization. Our method is motivated by the convex analysis perspective of distributionally robust optimization based on probability metrics, such as the Wasserstein distance and the maximum mean discrepancy. We adapt the integral operator using supremal convolution in convex analysis to form a novel function majorant used for enforcing robustness. Our method is simple in form and applies to general loss functions and machine learning models. Exploiting a connection with optimal transport, we prove theoretical guarantees for certified robustness under distribution shift. Furthermore, we report experiments with general machine learning models, such as deep neural networks, to demonstrate competitive performance with the state-of-the-art certifiable robust learning algorithms based on the Wasserstein distance.

Submitted 19 February, 2022; v1 submitted 16 February, 2021; originally announced February 2021.

arXiv:2101.12080 [pdf, other]

Two-Sided Matching Markets in the ELLIS 2020 PhD Program

Authors: Maximilian Mordig, Riccardo Della Vecchia, Nicolò Cesa-Bianchi, Bernhard Schölkopf

Abstract: The ELLIS PhD program is a European initiative that supports excellent young researchers by connecting them to leading researchers in AI. In particular, PhD students are supervised by two advisors from different countries: an advisor and a co-advisor. In this work we summarize the procedure that, in its final step, matches students to advisors in the ELLIS 2020 PhD program. The steps of the procedure are based on the extensive literature of two-sided matching markets and the college admissions problem [Knuth and De Bruijn, 1997, Gale and Shapley, 1962, Roth and Sotomayor, 1992]. We introduce PolyGS, an algorithm for the case of two-sided markets with quotas on both sides (also known as many-to-many markets) which we use throughout the selection procedure of pre-screening, interview matching and final matching with advisors. The algorithm returns a stable matching in the sense that no unmatched persons prefer to be matched together rather than with their current partners (given their indicated preferences). Roth [1984] gives evidence that only stable matchings are likely to be adhered to over time. Additionally, the matching is student-optimal. Preferences are constructed based on the rankings each side gives to the other side and the overlaps of research fields. We present and discuss the matchings that the algorithm produces in the ELLIS 2020 PhD program.

Submitted 11 March, 2021; v1 submitted 28 January, 2021; originally announced January 2021.
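
To make the notion of a stable, student-optimal matching in the abstract above concrete, here is a minimal Python sketch of the classical one-to-one deferred-acceptance procedure of Gale and Shapley (1962), which the abstract cites. It is not PolyGS: it assumes no quotas (each side is matched to at most one partner), and the function name `gale_shapley` and the toy preference lists are hypothetical.

```python
def gale_shapley(proposer_prefs, receiver_prefs):
    """Classical one-to-one deferred acceptance (not PolyGS, no quotas).

    proposer_prefs: dict mapping each proposer to a list of receivers, best first
    receiver_prefs: dict mapping each receiver to a list of proposers, best first
    Returns a dict proposer -> receiver for the matched proposers.
    """
    # rank[r][p] = position of proposer p in receiver r's list (lower is better)
    rank = {r: {p: i for i, p in enumerate(prefs)}
            for r, prefs in receiver_prefs.items()}
    next_choice = {p: 0 for p in proposer_prefs}  # next list index to propose to
    engaged = {}                                  # receiver -> current proposer
    free = list(proposer_prefs)
    while free:
        p = free.pop()
        prefs = proposer_prefs[p]
        if next_choice[p] >= len(prefs):
            continue  # p has exhausted the list and stays unmatched
        r = prefs[next_choice[p]]
        next_choice[p] += 1
        if p not in rank[r]:
            free.append(p)            # r finds p unacceptable; p tries again
        elif r not in engaged:
            engaged[r] = p            # r was free and tentatively accepts p
        elif rank[r][p] < rank[r][engaged[r]]:
            free.append(engaged[r])   # r trades up; the old partner is free again
            engaged[r] = p
        else:
            free.append(p)            # r rejects p
    return {p: r for r, p in engaged.items()}


# Toy example with hypothetical names: both students prefer advisor a1,
# and a1 prefers student s2.
students = {"s1": ["a1", "a2"], "s2": ["a1", "a2"]}
advisors = {"a1": ["s2", "s1"], "a2": ["s1", "s2"]}
matching = gale_shapley(students, advisors)
```

With students as proposers, the resulting matching is optimal for the students among all stable matchings, which is the sense of "student-optimal" in the classical setting.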

arXiv:2007.02938 [pdf, other]

Causal Feature Selection via Orthogonal Search

Authors: Ashkan Soleymani, Anant Raj, Stefan Bauer, Bernhard Schölkopf, Michel Besserve

Abstract: The problem of inferring the direct causal parents of a response variable among a large set of explanatory variables is of high practical importance in many disciplines. However, established approaches often scale at least exponentially with the number of explanatory variables, are difficult to extend to nonlinear relationships, and are difficult to extend to cyclic data. Inspired by debiased machine learning methods, we study a one-vs.-the-rest feature selection approach to discover the direct causal parent of the response. We propose an algorithm that works for purely observational data while also offering theoretical guarantees, including the case of partially nonlinear relationships possibly under the presence of cycles. As it requires only one estimation for each variable, our approach is applicable even to large graphs. We demonstrate significant improvements compared to established approaches.

Submitted 16 September, 2022; v1 submitted 6 July, 2020; originally announced July 2020.

arXiv:2006.09268 [pdf, ps, other]

Metrizing Weak Convergence with Maximum Mean Discrepancies

Authors: Carl-Johann Simon-Gabriel, Alessandro Barp, Bernhard Schölkopf, Lester Mackey

Abstract: This paper characterizes the maximum mean discrepancies (MMD) that metrize the weak convergence of probability measures for a wide class of kernels. More precisely, we prove that, on a locally compact, non-compact, Hausdorff space, the MMD of a bounded continuous Borel measurable kernel k, whose reproducing kernel Hilbert space (RKHS) functions vanish at infinity, metrizes the weak convergence of probability measures if and only if k is continuous and integrally strictly positive definite (i.s.p.d.) over all signed, finite, regular Borel measures. We also correct a prior result of Simon-Gabriel & Schölkopf (JMLR, 2018, Thm.12) by showing that there exist both bounded continuous i.s.p.d. kernels that do not metrize weak convergence and bounded continuous non-i.s.p.d. kernels that do metrize it.

Submitted 3 September, 2021; v1 submitted 16 June, 2020; originally announced June 2020.

Comments: 14 pages. Corrects in particular Thm.12 of Simon-Gabriel and Schölkopf, JMLR, 19(44):1-29, 2018. See http://jmlr.org/papers/v19/16-291.html

MSC Class: 60B10 (Primary) 60F05; 60-08; 28-08 (Secondary)

ACM Class: G.3; I.2.6; I.5.0
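
As an illustration of the quantity studied in the entry above, here is a minimal sketch of an empirical squared MMD between two one-dimensional samples. It assumes a Gaussian kernel with bandwidth 1 and uses the biased (V-statistic) estimator; the function names, sample sizes and bandwidth are illustrative choices, not taken from the paper.

```python
import math
import random


def gaussian_kernel(x, y, bandwidth=1.0):
    """Gaussian (RBF) kernel on real numbers; bounded and continuous."""
    return math.exp(-((x - y) ** 2) / (2.0 * bandwidth ** 2))


def mmd_squared(xs, ys, kernel=gaussian_kernel):
    """Biased (V-statistic) estimate of the squared MMD between two samples."""
    def mean_kernel(a, b):
        # Average kernel value over all pairs (u, v) with u in a, v in b.
        return sum(kernel(u, v) for u in a for v in b) / (len(a) * len(b))

    return mean_kernel(xs, xs) + mean_kernel(ys, ys) - 2.0 * mean_kernel(xs, ys)


# Illustrative data (an assumption for this sketch): two samples from N(0, 1)
# and one sample from N(2, 1).
rng = random.Random(0)
same_a = [rng.gauss(0.0, 1.0) for _ in range(200)]
same_b = [rng.gauss(0.0, 1.0) for _ in range(200)]
shifted = [rng.gauss(2.0, 1.0) for _ in range(200)]

mmd_same = mmd_squared(same_a, same_b)
mmd_shifted = mmd_squared(same_a, shifted)
print(f"same distribution:    {mmd_same:.4f}")
print(f"shifted distribution: {mmd_shifted:.4f}")
```

The estimate should be close to zero for the two samples drawn from the same distribution and clearly larger for the shifted sample. The sketch only computes the discrepancy; it says nothing about which kernels metrize weak convergence, which is the subject of the paper.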

arXiv:2006.06981 [pdf, other]

Kernel Distributionally Robust Optimization

Authors: Jia-Jie Zhu, Wittawat Jitkrittum, Moritz Diehl, Bernhard Schölkopf

Abstract: We propose kernel distributionally robust optimization (Kernel DRO) using insights from robust optimization theory and functional analysis. Our method uses reproducing kernel Hilbert spaces (RKHS) to construct a wide range of convex ambiguity sets, which can be generalized to sets based on integral probability metrics and finite-order moment bounds. This perspective unifies multiple existing robust and stochastic optimization methods. We prove a theorem that generalizes the classical duality in the mathematical problem of moments. Enabled by this theorem, we reformulate the maximization with respect to measures in DRO into the dual program that searches for RKHS functions. Using universal RKHSs, the theorem applies to a broad class of loss functions, lifting common limitations such as polynomial losses and knowledge of the Lipschitz constant. We then establish a connection between DRO and stochastic optimization with expectation constraints. Finally, we propose practical algorithms based on both batch convex solvers and stochastic functional gradient, which apply to general optimization and machine learning tasks.

Submitted 27 February, 2021; v1 submitted 12 June, 2020; originally announced June 2020.

Journal ref: Proceedings of Machine Learning Research, PMLR 130:280-288, 2021

arXiv:2005.06413 [pdf, ps, other]

Crackovid: Optimizing Group Testing

Authors: Louis Abraham, Gary Bécigneul, Bernhard Schölkopf

Abstract: We study the problem usually referred to as group testing in the context of COVID-19. Given $n$ samples taken from patients, how should we select mixtures of samples to be tested, so as to maximize information and minimize the number of tests? We consider both adaptive and non-adaptive strategies, and take a Bayesian approach with a prior both for infection of patients and test errors. We start by proposing a mathematically principled objective, grounded in information theory. We then optimize non-adaptive strategies using genetic algorithms, and leverage the mathematical framework of adaptive sub-modularity to obtain theoretical guarantees for the greedy-adaptive method.

Submitted 13 May, 2020; originally announced May 2020.

arXiv:2004.00166 [pdf, other]

Worst-Case Risk Quantification under Distributional Ambiguity using Kernel Mean Embedding in Moment Problem

Authors: Jia-Jie Zhu, Wittawat Jitkrittum, Moritz Diehl, Bernhard Schölkopf

Abstract: In order to anticipate rare and impactful events, we propose to quantify the worst-case risk under distributional ambiguity using a recent development in kernel methods -- the kernel mean embedding. Specifically, we formulate the generalized moment problem whose ambiguity set (i.e., the moment constraint) is described by constraints in the associated reproducing kernel Hilbert space in a nonparametric manner. We then present the tractable approximation and its theoretical justification. As a concrete application, we numerically test the proposed method in characterizing the worst-case constraint violation probability in the context of a constrained stochastic control system.

Submitted 6 September, 2020; v1 submitted 31 March, 2020; originally announced April 2020.

arXiv:2003.02658 [pdf, other]

SLEIPNIR: Deterministic and Provably Accurate Feature Expansion for Gaussian Process Regression with Derivatives

Authors: Emmanouil Angelis, Philippe Wenk, Bernhard Schölkopf, Stefan Bauer, Andreas Krause

Abstract: Gaussian processes are an important regression tool with excellent analytic properties which allow for direct integration of derivative observations. However, vanilla GP methods scale cubically in the number of observations. In this work, we propose a novel approach for scaling GP regression with derivatives based on quadrature Fourier features. We then prove deterministic, non-asymptotic and exponentially fast decaying error bounds which apply for both the approximated kernel as well as the approximated posterior. To further illustrate the practical applicability of our method, we then apply it to ODIN, a recently developed algorithm for ODE parameter inference. In an extensive experiments section, all results are empirically validated, demonstrating the speed, accuracy, and practical applicability of this approach.

Submitted 5 March, 2020; originally announced March 2020.

arXiv:2002.10271 [pdf, other]

Testing Goodness of Fit of Conditional Density Models with Kernels

Authors: Wittawat Jitkrittum, Heishiro Kanagawa, Bernhard Schölkopf

Abstract: We propose two nonparametric statistical tests of goodness of fit for conditional distributions: given a conditional probability density function $p(y|x)$ and a joint sample, decide whether the sample is drawn from $p(y|x)r_x(x)$ for some density $r_x$. Our tests, formulated with a Stein operator, can be applied to any differentiable conditional density model, and require no knowledge of the normalizing constant. We show that 1) our tests are consistent against any fixed alternative conditional model; 2) the statistics can be estimated easily, requiring no density estimation as an intermediate step; and 3) our second test offers an interpretable test result providing insight on where the conditional model does not fit well in the domain of the covariate. We demonstrate the interpretability of our test on a task of modeling the distribution of New York City's taxi drop-off location given a pick-up point. To our knowledge, our work is the first to propose such conditional goodness-of-fit tests that simultaneously have all these desirable properties.

Submitted 30 June, 2020; v1 submitted 24 February, 2020; originally announced February 2020.

Comments: In UAI 2020. http://auai.org/uai2020/accepted.php

MSC Class: 46E22; 62G10

ACM Class: G.3; I.2.6

arXiv:2001.10398 [pdf, other]

A Kernel Mean Embedding Approach to Reducing Conservativeness in Stochastic Programming and Control

Authors: Jia-Jie Zhu, Moritz Diehl, Bernhard Schölkopf

Abstract: We apply kernel mean embedding methods to sample-based stochastic optimization and control. Specifically, we use the reduced-set expansion method as a way to discard sampled scenarios. The effect of such constraint removal is improved optimality and decreased conservativeness. This is achieved by solving a distributional-distance-regularized optimization problem. We demonstrated this optimization formulation is well-motivated in theory, computationally tractable and effective in numerical algorithms.

Submitted 22 April, 2020; v1 submitted 28 January, 2020; originally announced January 2020.

arXiv:1911.11082 [pdf, other]

A New Distribution-Free Concept for Representing, Comparing, and Propagating Uncertainty in Dynamical Systems with Kernel Probabilistic Programming

Authors: Jia-Jie Zhu, Krikamol Muandet, Moritz Diehl, Bernhard Schölkopf

Abstract: This work presents the concept of kernel mean embedding and kernel probabilistic programming in the context of stochastic systems. We propose formulations to represent, compare, and propagate uncertainties for fairly general stochastic dynamics in a distribution-free manner. The new tools enjoy sound theory rooted in functional analysis and wide applicability as demonstrated in distinct numerical examples. The implication of this new concept is a new mode of thinking about the statistical nature of uncertainty in dynamical systems.

Submitted 4 May, 2020; v1 submitted 25 November, 2019; originally announced November 2019.

arXiv:1910.14428

Kernel-Guided Training of Implicit Generative Models with Stability Guarantees

Authors: Arash Mehrjou, Wittawat Jitkrittum, Krikamol Muandet, Bernhard Schölkopf

Abstract: Modern implicit generative models such as generative adversarial networks (GANs) are generally known to suffer from issues such as instability, uninterpretability, and difficulty in assessing their performance. If we see these implicit models as dynamical systems, some of these issues are caused by being unable to control their behavior in a meaningful way during the course of training. In this work, we propose a theoretically grounded method to guide the training trajectories of GANs by augmenting the GAN loss function with a kernel-based regularization term that controls local and global discrepancies between the model and true distributions. This control signal allows us to inject prior knowledge into the model. We provide theoretical guarantees on the stability of the resulting dynamical system and demonstrate different aspects of it via a wide range of experiments.

Submitted 3 November, 2019; v1 submitted 29 October, 2019; originally announced October 2019.

Comments: There was a misunderstanding in how an article should be updated on arXiv. We have withdrawn this article from this link. The same article can be found at arXiv:1901.09206

arXiv:1902.08480 [pdf, other]

AReS and MaRS - Adversarial and MMD-Minimizing Regression for SDEs

Authors: Gabriele Abbati, Philippe Wenk, Michael A Osborne, Andreas Krause, Bernhard Schölkopf, Stefan Bauer

Abstract: Stochastic differential equations are an important modeling class in many disciplines. Consequently, there exist many methods relying on various discretization and numerical integration schemes. In this paper, we propose a novel, probabilistic model for estimating the drift and diffusion given noisy observations of the underlying stochastic system. Using state-of-the-art adversarial and moment matching inference techniques, we avoid the discretization schemes of classical approaches. This leads to significant improvements in parameter accuracy and robustness given random initial guesses. On four established benchmark systems, we compare the performance of our algorithms to state-of-the-art solutions based on extended Kalman filtering and Gaussian processes.

Submitted 28 May, 2019; v1 submitted 22 February, 2019; originally announced February 2019.

Comments: Published at the Thirty-sixth International Conference on Machine Learning (ICML 2019)

arXiv:1902.06278 [pdf, other]

ODIN: ODE-Informed Regression for Parameter and State Inference in Time-Continuous Dynamical Systems

Authors: Philippe Wenk, Gabriele Abbati, Michael A Osborne, Bernhard Schölkopf, Andreas Krause, Stefan Bauer

Abstract: Parameter inference in ordinary differential equations is an important problem in many applied sciences and in engineering, especially in a data-scarce setting. In this work, we introduce a novel generative modeling approach based on constrained Gaussian processes and leverage it to build a computationally and data efficient algorithm for state and parameter inference. In an extensive set of experiments, our approach outperforms the current state of the art for parameter inference both in terms of accuracy and computational cost. It also shows promising results for the much more challenging problem of model selection.

Submitted 5 December, 2019; v1 submitted 17 February, 2019; originally announced February 2019.

Comments: Published at the Thirty-fourth AAAI Conference on Artificial Intelligence

arXiv:1901.08403 [pdf, other]

Deep Lyapunov Function: Automatic Stability Analysis for Dynamical Systems

Authors: Arash Mehrjou, Bernhard Schölkopf

Abstract: Stability analysis plays a crucial role in studying the behavior of dynamical systems with theoretical and engineering applications. Among various kinds of stability, the stability of equilibrium points is of the greatest importance and is mainly studied through Lyapunov's stability theory. This theory requires finding a function with specified properties. Except for a few simple examples, there is no straightforward constructive algorithm to find a Lyapunov function for an arbitrary dynamical system. The goal of this work is to propose a simple yet effective way to approximate this function using deep learning tools.

Submitted 24 January, 2019; originally announced January 2019.

arXiv:1805.10615 [pdf, other]

A Local Information Criterion for Dynamical Systems

Authors: Arash Mehrjou, Friedrich Solowjow, Sebastian Trimpe, Bernhard Schölkopf

Abstract: Encoding a sequence of observations is an essential task with many applications. The encoding can become highly efficient when the observations are generated by a dynamical system. A dynamical system imposes regularities on the observations that can be leveraged to achieve a more efficient code. We propose a method to encode a given or learned dynamical system. Apart from its application for encoding a sequence of observations, we propose to use the compression achieved by this encoding as a criterion for model selection. Given a dataset, different learning algorithms result in different models. But not all learned models are equally good. We show that the proposed encoding approach can be used to choose the learned model which is closer to the true underlying dynamics. We provide experiments for both encoding and model selection, and theoretical results that shed light on why the approach works.

Submitted 27 May, 2018; originally announced May 2018.

arXiv:1804.03911 [pdf, ps, other]

Structural causal models for macro-variables in time-series

Authors: Dominik Janzing, Paul Rubenstein, Bernhard Schölkopf

Abstract: We consider a bivariate time series $(X_t,Y_t)$ that is given by a simple linear autoregressive model. Assuming that the equations describing each variable as a linear combination of past values are considered structural equations, there is a clear meaning of how intervening on one particular $X_t$ influences $Y_{t'}$ at later times $t'>t$. In the present work, we describe conditions under which one can define a causal model between variables that are coarse-grained in time, thus admitting statements like 'setting $X$ to $x$ changes $Y$ in a certain way' without referring to specific time instances. We show that particularly simple statements follow in the frequency domain, thus providing meaning to interventions on frequencies.

Submitted 11 April, 2018; originally announced April 2018.

Comments: 8 pages

arXiv:1803.09539 [pdf, other]

On Matching Pursuit and Coordinate Descent

Authors: Francesco Locatello, Anant Raj, Sai Praneeth Karimireddy, Gunnar Rätsch, Bernhard Schölkopf, Sebastian U. Stich, Martin Jaggi

Abstract: Two popular examples of first-order optimization methods over linear spaces are coordinate descent and matching pursuit algorithms, with their randomized variants. While the former targets the optimization by moving along coordinates, the latter considers a generalized notion of directions. Exploiting the connection between the two algorithms, we present a unified analysis of both, providing affine invariant sublinear $\mathcal{O}(1/t)$ rates on smooth objectives and linear convergence on strongly convex objectives. As a byproduct of our affine invariant analysis of matching pursuit, our rates for steepest coordinate descent are the tightest known. Furthermore, we show the first accelerated convergence rate $\mathcal{O}(1/t^2)$ for matching pursuit and steepest coordinate descent on convex objectives.

Submitted 31 May, 2019; v1 submitted 26 March, 2018; originally announced March 2018.

Journal ref: ICML 2018 - Proceedings of the 35th International Conference on Machine Learning
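
To make the term "steepest coordinate descent" in the entry above concrete, here is a minimal sketch on a strongly convex quadratic $f(x) = \tfrac{1}{2} x^\top A x - b^\top x$. It assumes the common steepest (Gauss-Southwell) selection rule, which picks the coordinate with the largest gradient magnitude, followed by exact minimization along that coordinate. The matrix, vector and step count are illustrative assumptions; the sketch does not reproduce the paper's analysis or its accelerated variant.

```python
def steepest_coordinate_descent(A, b, num_steps=50):
    """Minimize 0.5 * x^T A x - b^T x for a symmetric positive definite A."""
    n = len(b)
    x = [0.0] * n
    for _ in range(num_steps):
        # Gradient of the quadratic objective: A x - b.
        grad = [sum(A[i][j] * x[j] for j in range(n)) - b[i] for i in range(n)]
        # Steepest rule: choose the coordinate with the largest |gradient|.
        i = max(range(n), key=lambda k: abs(grad[k]))
        # Exact minimization along coordinate i (A[i][i] > 0 for SPD A).
        x[i] -= grad[i] / A[i][i]
    return x


# Illustrative 2x2 problem (an assumption); the minimizer solves A x = b.
A = [[4.0, 1.0], [1.0, 3.0]]
b = [1.0, 2.0]
x_star = steepest_coordinate_descent(A, b)
print(x_star)
```

For this problem the minimizer is $(1/11, 7/11)$, and the iterates approach it at a linear rate, in line with the strongly convex case mentioned in the abstract.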

arXiv:1705.02212 [pdf, other]

Group invariance principles for causal generative models

Authors: Michel Besserve, Naji Shajarisales, Bernhard Schölkopf, Dominik Janzing

Abstract: The postulate of independence of cause and mechanism (ICM) has recently led to several new causal discovery algorithms. The interpretation of independence and the way it is utilized, however, varies across these methods. Our aim in this paper is to propose a group theoretic framework for ICM to unify and generalize these approaches. In our setting, the cause-mechanism relationship is assessed by comparing it against a null hypothesis through the application of random generic group transformations. We show that the group theoretic view provides a very general tool to study the structure of data generating mechanisms with direct applications to machine learning.

Submitted 5 May, 2017; originally announced May 2017.

Comments: 16 pages, 6 figures

ACM Class: I.2.6; I.2.10; G.3; I.5.3

arXiv:1609.07478 [pdf, other]

Screening Rules for Convex Problems

Authors: Anant Raj, Jakob Olbrich, Bernd Gärtner, Bernhard Schölkopf, Martin Jaggi

Abstract: We propose a new framework for deriving screening rules for convex optimization problems. Our approach covers a large class of constrained and penalized optimization formulations, and works in two steps. First, given any approximate point, the structure of the objective function and the duality gap is used to gather information on the optimal solution. In the second step, this information is used to produce screening rules, i.e. safely identifying unimportant weight variables of the optimal solution. Our general framework leads to a large variety of useful existing as well as new screening rules for many applications. For example, we provide new screening rules for general simplex and $L_1$-constrained problems, Elastic Net, squared-loss Support Vector Machines, minimum enclosing ball, as well as structured norm regularized problems, such as group lasso.

Submitted 23 September, 2016; originally announced September 2016.

arXiv:1604.05251 [pdf, ps, other]

Kernel Distribution Embeddings: Universal Kernels, Characteristic Kernels and Kernel Metrics on Distributions

Authors: Carl-Johann Simon-Gabriel, Bernhard Schölkopf

Abstract: Kernel mean embeddings have recently attracted the attention of the machine learning community. They map measures $μ$ from some set $M$ to functions in a reproducing kernel Hilbert space (RKHS) with kernel $k$. The RKHS distance of two mapped measures is a semi-metric $d_k$ over $M$. We study three questions. (I) For a given kernel, what sets $M$ can be embedded? (II) When is the embedding injective over $M$ (in which case $d_k$ is a metric)? (III) How does the $d_k$-induced topology compare to other topologies on $M$? The existing machine learning literature has addressed these questions in cases where $M$ is (a subset of) the finite regular Borel measures. We unify, improve and generalise those results. Our approach naturally leads to continuous and possibly even injective embeddings of (Schwartz-) distributions, i.e., generalised measures, but the reader is free to focus on measures only. In particular, we systemise and extend various (partly known) equivalences between different notions of universal, characteristic and strictly positive definite kernels, and show that on an underlying locally compact Hausdorff space, $d_k$ metrises the weak convergence of probability measures if and only if $k$ is continuous and characteristic.

Submitted 17 December, 2019; v1 submitted 18 April, 2016; originally announced April 2016.

Comments: Old and longer version of the JMLR paper with same title (published 2018). Please start with the JMLR version. 55 pages (33 pages main text, 22 pages appendix), 2 tables, 1 figure (in appendix)

MSC Class: G.3

ACM Class: G.3

Journal ref: Journal of Machine Learning Research, 19(44):1-29, 2018

arXiv:1603.00784 [pdf, other]

The Arrow of Time in Multivariate Time Series

Authors: Stefan Bauer, Bernhard Schölkopf, Jonas Peters

Abstract: We prove that a time series satisfying a (linear) multivariate autoregressive moving average (VARMA) model satisfies the same model assumption in the reversed time direction, too, if all innovations are normally distributed. This reversibility breaks down if the innovations are non-Gaussian. This means that under the assumption of a VARMA process with non-Gaussian noise, the arrow of time becomes detectable. Our work thereby provides a theoretic justification of an algorithm that has been used for inferring the direction of video snippets. We present a slightly modified practical algorithm that estimates the time direction for a given sample and prove its consistency. We further investigate how the performance of the algorithm depends on sample size, number of dimensions of the time series and the order of the process. An application to real world data from economics shows that considering multivariate processes instead of univariate processes can be beneficial for estimating the time direction. Our result extends earlier work on univariate time series. It relates to the concept of causal inference, where recent methods exploit non-Gaussianity of the error terms for causal structure learning.

Submitted 2 March, 2016; originally announced March 2016.

arXiv:1603.00285 [pdf, ps, other]

Kernel-based Tests for Joint Independence

Authors: Niklas Pfister, Peter Bühlmann, Bernhard Schölkopf, Jonas Peters

Abstract: We investigate the problem of testing whether $d$ random variables, which may or may not be continuous, are jointly (or mutually) independent. Our method builds on ideas of the two-variable Hilbert-Schmidt independence criterion (HSIC) but allows for an arbitrary number of variables. We embed the $d$-dimensional joint distribution and the product of the marginals into a reproducing kernel Hilbert space and define the $d$-variable Hilbert-Schmidt independence criterion (dHSIC) as the squared distance between the embeddings. In the population case, the value of dHSIC is zero if and only if the $d$ variables are jointly independent, as long as the kernel is characteristic. Based on an empirical estimate of dHSIC, we define three different non-parametric hypothesis tests: a permutation test, a bootstrap test and a test based on a Gamma approximation. We prove that the permutation test achieves the significance level and that the bootstrap test achieves pointwise asymptotic significance level as well as pointwise asymptotic consistency (i.e., it is able to detect any type of fixed dependence in the large sample limit). The Gamma approximation does not come with these guarantees; however, it is computationally very fast and for small $d$, it performs well in practice. Finally, we apply the test to a problem in causal discovery.

Submitted 4 November, 2016; v1 submitted 1 March, 2016; originally announced March 2016.

Comments: 67 pages
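
The construction described in the entry above (squared RKHS distance between the embedded joint distribution and the embedded product of marginals, followed by a permutation test) can be sketched as follows. This is a minimal illustration under stated assumptions: Gaussian kernels with bandwidth 1, scalar variables, a V-statistic estimator obtained by expanding the squared distance for the empirical distributions, and a Monte Carlo permutation p-value. All function names, the sample size and the number of permutations are illustrative and are not taken from the paper or its software.

```python
import math
import random


def gram(xs, bandwidth=1.0):
    """Gaussian-kernel Gram matrix of a scalar sample."""
    return [[math.exp(-((a - b) ** 2) / (2.0 * bandwidth ** 2)) for b in xs]
            for a in xs]


def dhsic(grams):
    """V-statistic estimate of dHSIC from one n-by-n Gram matrix per variable."""
    n = len(grams[0])
    joint = 0.0  # inner product of the joint embedding with itself
    cross = 0.0  # inner product of the joint and product-of-marginals embeddings
    for i in range(n):
        row_prod = 1.0
        for K in grams:
            row_prod *= sum(K[i]) / n
        cross += row_prod / n
        for j in range(n):
            entry_prod = 1.0
            for K in grams:
                entry_prod *= K[i][j]
            joint += entry_prod
    joint /= n * n
    marginals = 1.0  # inner product of the product embedding with itself
    for K in grams:
        marginals *= sum(sum(row) for row in K) / (n * n)
    return joint + marginals - 2.0 * cross


def permutation_pvalue(columns, num_permutations=100, seed=0):
    """Permute every variable except the first and compare with the observed value."""
    rng = random.Random(seed)
    grams = [gram(col) for col in columns]
    observed = dhsic(grams)
    n = len(columns[0])
    exceed = 0
    for _ in range(num_permutations):
        permuted = [grams[0]]
        for K in grams[1:]:
            perm = list(range(n))
            rng.shuffle(perm)
            permuted.append([[K[p][q] for q in perm] for p in perm])
        if dhsic(permuted) >= observed:
            exceed += 1
    return observed, (exceed + 1) / (num_permutations + 1)


# Illustrative data (an assumption): z depends on x and y, so the three
# variables are not jointly independent.
rng = random.Random(1)
x = [rng.gauss(0.0, 1.0) for _ in range(60)]
y = [rng.gauss(0.0, 1.0) for _ in range(60)]
z = [a + b + 0.1 * rng.gauss(0.0, 1.0) for a, b in zip(x, y)]

stat_dep, p_dep = permutation_pvalue([x, y, z])
print(f"dHSIC = {stat_dep:.4f}, permutation p-value = {p_dep:.3f}")
```

With 100 permutations the smallest attainable p-value is 1/101, so a strongly dependent triple like the one above is expected to give a value near that floor. The bootstrap and Gamma-approximation tests mentioned in the abstract are not covered by this sketch.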

arXiv:1512.02057 [pdf, other]

doi 10.1088/1367-2630/18/9/093052

Algorithmic independence of initial condition and dynamical law in thermodynamics and causal inference

Authors: Dominik Janzing, Rafael Chaves, Bernhard Schoelkopf

Abstract: We postulate a principle stating that the initial condition of a physical system is typically algorithmically independent of the dynamical law. We argue that this links thermodynamics and causal inference. On the one hand, it entails behaviour that is similar to the usual arrow of time. On the other hand, it motivates a statistical asymmetry between cause and effect that has recently been postulated in the field of causal inference, namely, that the probability distribution P(cause) contains no information about the conditional distribution P(effect|cause) and vice versa, while P(effect) may contain information about P(cause|effect).

Submitted 7 December, 2015; originally announced December 2015.

Comments: 7 pages, latex, 2 figures

Journal ref: New J. Phys. 18, 093052 (2016)

arXiv:1502.02398 [pdf, other]

Towards a Learning Theory of Cause-Effect Inference

Authors: David Lopez-Paz, Krikamol Muandet, Bernhard Schölkopf, Ilya Tolstikhin

Abstract: We pose causal inference as the problem of learning to classify probability distributions. In particular, we assume access to a collection $\{(S_i,l_i)\}_{i=1}^n$, where each $S_i$ is a sample drawn from the probability distribution of $X_i \times Y_i$, and $l_i$ is a binary label indicating whether "$X_i \to Y_i$" or "$X_i \leftarrow Y_i$". Given these data, we build a causal inference rule in two steps. First, we featurize each $S_i$ using the kernel mean embedding associated with some characteristic kernel. Second, we train a binary classifier on such embeddings to distinguish between causal directions. We present generalization bounds showing the statistical consistency and learning rates of the proposed approach, and provide a simple implementation that achieves state-of-the-art cause-effect inference. Furthermore, we extend our ideas to infer causal relationships between more than two variables.

Submitted 18 May, 2015; v1 submitted 9 February, 2015; originally announced February 2015.

arXiv:1411.0900 [pdf, ps, other]

Kernel Mean Estimation via Spectral Filtering

Authors: Krikamol Muandet, Bharath Sriperumbudur, Bernhard Schölkopf

Abstract: The problem of estimating the kernel mean in a reproducing kernel Hilbert space (RKHS) is central to kernel methods in that it is used by classical approaches (e.g., when centering a kernel PCA matrix), and it also forms the core inference step of modern kernel methods (e.g., kernel-based non-parametric tests) that rely on embedding probability distributions in RKHSs. Muandet et al. (2014) have shown that shrinkage can help in constructing "better" estimators of the kernel mean than the empirical estimator. The present paper studies the consistency and admissibility of the estimators in Muandet et al. (2014), and proposes a wider class of shrinkage estimators that improve upon the empirical estimator by considering appropriate basis functions. Using the kernel PCA basis, we show that some of these estimators can be constructed using spectral filtering algorithms which are shown to be consistent under some technical assumptions. Our theoretical analysis also reveals a fundamental connection to the kernel-based supervised learning framework. The proposed estimators are simple to implement and perform well in practice.

Submitted 4 November, 2014; originally announced November 2014.

Comments: To appear at the 28th Annual Conference on Neural Information Processing Systems (NIPS 2014). 16 pages

arXiv:1306.0842 [pdf, ps, other]

Kernel Mean Estimation and Stein's Effect

Authors: Krikamol Muandet, Kenji Fukumizu, Bharath Sriperumbudur, Arthur Gretton, Bernhard Schölkopf

Abstract: A mean function in reproducing kernel Hilbert space, or a kernel mean, is an important part of many applications ranging from kernel principal component analysis to Hilbert-space embedding of distributions. Given finite samples, an empirical average is the standard estimate for the true kernel mean. We show that this estimator can be improved via a well-known phenomenon in statistics called Stein's phenomenon. After consideration, our theoretical analysis reveals the existence of a wide class of estimators that are better than the standard. Focusing on a subset of this class, we propose efficient shrinkage estimators for the kernel mean. Empirical evaluations on several benchmark applications clearly demonstrate that the proposed estimators outperform the standard kernel mean estimator.

Submitted 6 June, 2013; v1 submitted 4 June, 2013; originally announced June 2013.

Comments: first draft

arXiv:1205.1928 [pdf, ps, other]

The representer theorem for Hilbert spaces: a necessary and sufficient condition

Authors: Francesco Dinuzzo, Bernhard Schölkopf

Abstract: A family of regularization functionals is said to admit a linear representer theorem if every member of the family admits minimizers that lie in a fixed finite dimensional subspace. A recent characterization states that a general class of regularization functionals with differentiable regularizer admits a linear representer theorem if and only if the regularization term is a non-decreasing function of the norm. In this report, we improve on this result by replacing the differentiability assumption with lower semi-continuity and deriving a proof that is independent of the dimensionality of the space.

Submitted 17 July, 2012; v1 submitted 9 May, 2012; originally announced May 2012.

arXiv:1203.6502 [pdf, ps, other]

doi 10.1214/13-AOS1145

Quantifying causal influences

Authors: Dominik Janzing, David Balduzzi, Moritz Grosse-Wentrup, Bernhard Schölkopf

Abstract: Many methods for causal inference generate directed acyclic graphs (DAGs) that formalize causal relations between $n$ variables. Given the joint distribution on all these variables, the DAG contains all information about how intervening on one variable changes the distribution of the other $n-1$ variables. However, quantifying the causal influence of one variable on another remains a nontrivial question. Here we propose a set of natural, intuitive postulates that a measure of causal strength should satisfy. We then introduce a communication scenario, where edges in a DAG play the role of channels that can be locally corrupted by interventions. Causal strength is then the relative entropy distance between the old and the new distribution. Many other measures of causal strength have been proposed, including average causal effect, transfer entropy, directed information, and information flow. We explain how they fail to satisfy the postulates on simple DAGs of $\leq 3$ nodes. Finally, we investigate the behavior of our measure on time-series, supporting our claims with experiments on simulated data.

Submitted 28 January, 2014; v1 submitted 29 March, 2012; originally announced March 2012.

Comments: Published at http://dx.doi.org/10.1214/13-AOS1145 in the Annals of Statistics (http://www.imstat.org/aos/) by the Institute of Mathematical Statistics (http://www.imstat.org)

Report number: IMS-AOS-AOS1145

Journal ref: Annals of Statistics 2013, Vol. 41, No. 5, 2324-2358

arXiv:0907.5309 [pdf, ps, other]

Hilbert space embeddings and metrics on probability measures

Authors: Bharath K. Sriperumbudur, Arthur Gretton, Kenji Fukumizu, Bernhard Schölkopf, Gert R. G. Lanckriet

Abstract: A Hilbert space embedding for probability measures has recently been proposed, with applications including dimensionality reduction, homogeneity testing, and independence testing. This embedding represents any probability measure as a mean element in a reproducing kernel Hilbert space (RKHS). A pseudometric on the space of probability measures can be defined as the distance between distribution embeddings: we denote this as $\gamma_k$, indexed by the kernel function $k$ that defines the inner product in the RKHS. We present three theoretical properties of $\gamma_k$. First, we consider the question of determining the conditions on the kernel $k$ for which $\gamma_k$ is a metric: such $k$ are denoted characteristic kernels. Unlike pseudometrics, a metric is zero only when two distributions coincide, thus ensuring the RKHS embedding maps all distributions uniquely (i.e., the embedding is injective). While previously published conditions may apply only in restricted circumstances (e.g. on compact domains), and are difficult to check, our conditions are straightforward and intuitive: bounded continuous strictly positive definite kernels are characteristic. Alternatively, if a bounded continuous kernel is translation-invariant on $\mathbb{R}^d$, then it is characteristic if and only if the support of its Fourier transform is the entire $\mathbb{R}^d$. Second, we show that there exist distinct distributions that are arbitrarily close in $\gamma_k$. Third, to understand the nature of the topology induced by $\gamma_k$, we relate $\gamma_k$ to other popular metrics on probability measures, and present conditions on the kernel $k$ under which $\gamma_k$ metrizes the weak topology.

Submitted 29 January, 2010; v1 submitted 30 July, 2009; originally announced July 2009.

Comments: 48 pages
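
To make the distance $\gamma_k$ from the abstract above concrete, here is a minimal pure-Python sketch that estimates it between two empirical distributions by plugging sample averages of the kernel into the squared distance between mean embeddings. The Gaussian kernel (a bounded, continuous, translation-invariant kernel whose Fourier transform has full support, so it meets the abstract's condition for being characteristic), the bandwidth, the scalar samples, and the function names are all assumptions chosen for illustration, not code from the paper.

```python
import math
import random


def gaussian_kernel(a, b, bandwidth=1.0):
    """Gaussian kernel on scalars (assumed choice of characteristic kernel)."""
    return math.exp(-((a - b) ** 2) / (2.0 * bandwidth ** 2))


def mean_kernel(xs, ys, bandwidth=1.0):
    """Average of k(x, y) over all pairs, i.e. the inner product of the
    two empirical mean embeddings."""
    total = sum(gaussian_kernel(a, b, bandwidth) for a in xs for b in ys)
    return total / (len(xs) * len(ys))


def gamma_k(xs, ys, bandwidth=1.0):
    """Plug-in estimate of gamma_k between the empirical distributions."""
    squared = (mean_kernel(xs, xs, bandwidth)
               + mean_kernel(ys, ys, bandwidth)
               - 2.0 * mean_kernel(xs, ys, bandwidth))
    return math.sqrt(max(squared, 0.0))  # clamp tiny negative rounding error


rng = random.Random(0)
p_sample = [rng.gauss(0.0, 1.0) for _ in range(200)]
q_same = [rng.gauss(0.0, 1.0) for _ in range(200)]      # same distribution
q_shifted = [rng.gauss(2.0, 1.0) for _ in range(200)]   # mean shifted by 2

dist_self = gamma_k(p_sample, p_sample)
dist_same = gamma_k(p_sample, q_same)
dist_shifted = gamma_k(p_sample, q_shifted)
print(dist_self, dist_same, dist_shifted)
```

The estimate between a sample and itself is zero, while two independent samples from the same distribution give a small positive value because of sampling noise; the shifted sample should sit noticeably further away.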

arXiv:0810.4752 [pdf, other]

Statistical Learning Theory: Models, Concepts, and Results

Authors: Ulrike von Luxburg, Bernhard Schoelkopf

Abstract: Statistical learning theory provides the theoretical basis for many of today's machine learning algorithms. In this article we attempt to give a gentle, non-technical overview of the key ideas and insights of statistical learning theory. We target a broad audience, not necessarily machine learning researchers. This paper can serve as a starting point for people who want to get an overview of the field before diving into technical details.

Submitted 27 October, 2008; originally announced October 2008.

arXiv:0804.3678 [pdf, ps, other]

Causal inference using the algorithmic Markov condition

Authors: Dominik Janzing, Bernhard Schoelkopf

Abstract: Inferring the causal structure that links n observables is usually based upon detecting statistical dependences and choosing simple graphs that make the joint measure Markovian. Here we argue why causal inference is also possible when only single observations are present. We develop a theory of how to generate causal graphs explaining similarities between single objects. To this end, we replace the notion of conditional stochastic independence in the causal Markov condition with the vanishing of conditional algorithmic mutual information and describe the corresponding causal inference rules. We explain why a consistent reformulation of causal inference in terms of algorithmic complexity implies a new inference principle that also takes into account the complexity of conditional probability densities, making it possible to select among Markov equivalent causal graphs. This insight provides a theoretical foundation of a heuristic principle proposed in earlier work. We also discuss how to replace Kolmogorov complexity with decidable complexity criteria. This can be seen as an algorithmic analog of replacing the empirically undecidable question of statistical independence with practical independence tests that are based on implicit or explicit assumptions on the underlying distribution.

Submitted 23 April, 2008; originally announced April 2008.

Comments: 16 figures

MSC Class: 62A01

arXiv:math/0701907 [pdf, ps, other]

doi 10.1214/009053607000000677

Kernel methods in machine learning

Authors: Thomas Hofmann, Bernhard Schölkopf, Alexander J. Smola

Abstract: We review machine learning methods employing positive definite kernels. These methods formulate learning and estimation problems in a reproducing kernel Hilbert space (RKHS) of functions defined on the data domain, expanded in terms of a kernel. Working in linear spaces of functions has the benefit of facilitating the construction and analysis of learning algorithms while at the same time allowing large classes of functions. The latter include nonlinear functions as well as functions defined on nonvectorial data. We cover a wide range of methods, ranging from binary classifiers to sophisticated methods for estimation with structured data.

Submitted 1 July, 2008; v1 submitted 30 January, 2007; originally announced January 2007.

Comments: Published at http://dx.doi.org/10.1214/009053607000000677 in the Annals of Statistics (http://www.imstat.org/aos/) by the Institute of Mathematical Statistics (http://www.imstat.org)

Report number: IMS-AOS-AOS0290

MSC Class: 30C40 (Primary) 68T05 (Secondary)

Journal ref: Annals of Statistics 2008, Vol. 36, No. 3, 1171-1220

arXiv:math/0612820 [pdf, ps, other]

doi 10.1214/088342306000000484

Comment on "Support Vector Machines with Applications"

Authors: Olivier Bousquet, Bernhard Schölkopf

Abstract: Comment on "Support Vector Machines with Applications" [math.ST/0612817]

Submitted 28 December, 2006; originally announced December 2006.

Comments: Published at http://dx.doi.org/10.1214/088342306000000484 in Statistical Science (http://www.imstat.org/sts/) by the Institute of Mathematical Statistics (http://www.imstat.org)

Report number: IMS-STS-STS153D

Journal ref: Statistical Science 2006, Vol. 21, No. 3, 337-340

Showing 1–48 of 48 results for author: Schölkopf, B