Search | arXiv e-print repository

arXiv:2405.20879 [pdf, other]

Flow matching achieves minimax optimal convergence

Authors: Kenji Fukumizu, Taiji Suzuki, Noboru Isobe, Kazusato Oko, Masanori Koyama

Abstract: Flow matching (FM) has gained significant attention as a simulation-free generative model. Unlike diffusion models, which are based on stochastic differential equations, FM employs a simpler approach by solving an ordinary differential equation with an initial condition from a normal distribution, thus streamlining the sample generation process. This paper discusses the convergence properties of F… ▽ More Flow matching (FM) has gained significant attention as a simulation-free generative model. Unlike diffusion models, which are based on stochastic differential equations, FM employs a simpler approach by solving an ordinary differential equation with an initial condition from a normal distribution, thus streamlining the sample generation process. This paper discusses the convergence properties of FM in terms of the $p$-Wasserstein distance, a measure of distributional discrepancy. We establish that FM can achieve the minmax optimal convergence rate for $1 \leq p \leq 2$, presenting the first theoretical evidence that FM can reach convergence rates comparable to those of diffusion models. Our analysis extends existing frameworks by examining a broader class of mean and variance functions for the vector fields and identifies specific conditions necessary to attain these optimal rates. △ Less

Submitted 31 May, 2024; originally announced May 2024.

arXiv:2403.11520 [pdf, other]

State-Separated SARSA: A Practical Sequential Decision-Making Algorithm with Recovering Rewards

Authors: Yuto Tanimoto, Kenji Fukumizu

Abstract: While many multi-armed bandit algorithms assume that rewards for all arms are constant across rounds, this assumption does not hold in many real-world scenarios. This paper considers the setting of recovering bandits (Pike-Burke & Grunewalder, 2019), where the reward depends on the number of rounds elapsed since the last time an arm was pulled. We propose a new reinforcement learning (RL) algorith… ▽ More While many multi-armed bandit algorithms assume that rewards for all arms are constant across rounds, this assumption does not hold in many real-world scenarios. This paper considers the setting of recovering bandits (Pike-Burke & Grunewalder, 2019), where the reward depends on the number of rounds elapsed since the last time an arm was pulled. We propose a new reinforcement learning (RL) algorithm tailored to this setting, named the State-Separate SARSA (SS-SARSA) algorithm, which treats rounds as states. The SS-SARSA algorithm achieves efficient learning by reducing the number of state combinations required for Q-learning/SARSA, which often suffers from combinatorial issues for large-scale RL problems. Additionally, it makes minimal assumptions about the reward structure and offers lower computational complexity. Furthermore, we prove asymptotic convergence to an optimal policy under mild assumptions. Simulation studies demonstrate the superior performance of our algorithm across various settings. △ Less

Submitted 18 March, 2024; originally announced March 2024.

arXiv:2403.10859 [pdf, other]

Neural-Kernel Conditional Mean Embeddings

Authors: Eiki Shimizu, Kenji Fukumizu, Dino Sejdinovic

Abstract: Kernel conditional mean embeddings (CMEs) offer a powerful framework for representing conditional distribution, but they often face scalability and expressiveness challenges. In this work, we propose a new method that effectively combines the strengths of deep learning with CMEs in order to address these challenges. Specifically, our approach leverages the end-to-end neural network (NN) optimizati… ▽ More Kernel conditional mean embeddings (CMEs) offer a powerful framework for representing conditional distribution, but they often face scalability and expressiveness challenges. In this work, we propose a new method that effectively combines the strengths of deep learning with CMEs in order to address these challenges. Specifically, our approach leverages the end-to-end neural network (NN) optimization framework using a kernel-based objective. This design circumvents the computationally expensive Gram matrix inversion required by current CME methods. To further enhance performance, we provide efficient strategies to optimize the remaining kernel hyperparameters. In conditional density estimation tasks, our NN-CME hybrid achieves competitive performance and often surpasses existing deep learning-based methods. Lastly, we showcase its remarkable versatility by seamlessly integrating it into reinforcement learning (RL) contexts. Building on Q-learning, our approach naturally leads to a new variant of distributional RL methods, which demonstrates consistent effectiveness across different environments. △ Less

Submitted 16 March, 2024; originally announced March 2024.

arXiv:2402.18839 [pdf, other]

Extended Flow Matching: a Method of Conditional Generation with Generalized Continuity Equation

Authors: Noboru Isobe, Masanori Koyama, **zhe Zhang, Kohei Hayashi, Kenji Fukumizu

Abstract: The task of conditional generation is one of the most important applications of generative models, and numerous methods have been developed to date based on the celebrated flow-based models. However, many flow-based models in use today are not built to allow one to introduce an explicit inductive bias to how the conditional distribution to be generated changes with respect to conditions. This can… ▽ More The task of conditional generation is one of the most important applications of generative models, and numerous methods have been developed to date based on the celebrated flow-based models. However, many flow-based models in use today are not built to allow one to introduce an explicit inductive bias to how the conditional distribution to be generated changes with respect to conditions. This can result in unexpected behavior in the task of style transfer, for example. In this research, we introduce extended flow matching (EFM), a direct extension of flow matching that learns a "matrix field" corresponding to the continuous map from the space of conditions to the space of distributions. We show that we can introduce inductive bias to the conditional generation through the matrix field and demonstrate this fact with MMOT-EFM, a version of EFM that aims to minimize the Dirichlet energy or the sensitivity of the distribution with respect to conditions. We will present our theory along with experimental results that support the competitiveness of EFM in conditional generation. △ Less

Submitted 5 July, 2024; v1 submitted 28 February, 2024; originally announced February 2024.

Comments: 27 pages, 10 figures, We have corrected an error in our experiment on COT-FM

MSC Class: 68T07 (Primary); 49Q22 (Secondary)

arXiv:2402.04516 [pdf, other]

Generalized Sobolev Transport for Probability Measures on a Graph

Authors: Tam Le, Truyen Nguyen, Kenji Fukumizu

Abstract: We study the optimal transport (OT) problem for measures supported on a graph metric space. Recently, Le et al. (2022) leverage the graph structure and propose a variant of OT, namely Sobolev transport (ST), which yields a closed-form expression for a fast computation. However, ST is essentially coupled with the $L^p$ geometric structure within its definition which makes it nontrivial to utilize S… ▽ More We study the optimal transport (OT) problem for measures supported on a graph metric space. Recently, Le et al. (2022) leverage the graph structure and propose a variant of OT, namely Sobolev transport (ST), which yields a closed-form expression for a fast computation. However, ST is essentially coupled with the $L^p$ geometric structure within its definition which makes it nontrivial to utilize ST for other prior structures. In contrast, the classic OT has the flexibility to adapt to various geometric structures by modifying the underlying cost function. An important instance is the Orlicz-Wasserstein (OW) which moves beyond the $L^p$ structure by leveraging the \emph{Orlicz geometric structure}. Comparing to the usage of standard $p$-order Wasserstein, OW remarkably helps to advance certain machine learning approaches. Nevertheless, OW brings up a new challenge on its computation due to its two-level optimization formulation. In this work, we leverage a specific class of convex functions for Orlicz structure to propose the generalized Sobolev transport (GST). GST encompasses the ST as its special case, and can be utilized for prior structures beyond the $L^p$ geometry. In connection with the OW, we show that one only needs to simply solve a univariate optimization problem to compute the GST, unlike the complex two-level optimization problem in OW. We empirically illustrate that GST is several-order faster than the OW. Moreover, we provide preliminary evidences on the advantages of GST for document classification and for several tasks in topological data analysis. △ Less

Submitted 29 May, 2024; v1 submitted 6 February, 2024; originally announced February 2024.

Comments: To appear at ICML'2024

arXiv:2310.13653 [pdf, other]

Optimal Transport for Measures with Noisy Tree Metric

Authors: Tam Le, Truyen Nguyen, Kenji Fukumizu

Abstract: We study optimal transport (OT) problem for probability measures supported on a tree metric space. It is known that such OT problem (i.e., tree-Wasserstein (TW)) admits a closed-form expression, but depends fundamentally on the underlying tree structure over supports of input measures. In practice, the given tree structure may be, however, perturbed due to noisy or adversarial measurements. To mit… ▽ More We study optimal transport (OT) problem for probability measures supported on a tree metric space. It is known that such OT problem (i.e., tree-Wasserstein (TW)) admits a closed-form expression, but depends fundamentally on the underlying tree structure over supports of input measures. In practice, the given tree structure may be, however, perturbed due to noisy or adversarial measurements. To mitigate this issue, we follow the max-min robust OT approach which considers the maximal possible distances between two input measures over an uncertainty set of tree metrics. In general, this approach is hard to compute, even for measures supported in one-dimensional space, due to its non-convexity and non-smoothness which hinders its practical applications, especially for large-scale settings. In this work, we propose novel uncertainty sets of tree metrics from the lens of edge deletion/addition which covers a diversity of tree structures in an elegant framework. Consequently, by building upon the proposed uncertainty sets, and leveraging the tree structure over supports, we show that the robust OT also admits a closed-form expression for a fast computation as its counterpart standard OT (i.e., TW). Furthermore, we demonstrate that the robust OT satisfies the metric property and is negative definite. We then exploit its negative definiteness to propose positive definite kernels and test them in several simulations on various real-world datasets on document classification and topological data analysis. △ Less

Submitted 29 February, 2024; v1 submitted 20 October, 2023; originally announced October 2023.

Comments: To appear in AISTATS 2024

arXiv:2307.11972 [pdf, other]

Out-of-Distribution Optimality of Invariant Risk Minimization

Authors: Shoji Toyota, Kenji Fukumizu

Abstract: Deep Neural Networks often inherit spurious correlations embedded in training data and hence may fail to generalize to unseen domains, which have different distributions from the domain to provide training data. M. Arjovsky et al. (2019) introduced the concept out-of-distribution (o.o.d.) risk, which is the maximum risk among all domains, and formulated the issue caused by spurious correlations as… ▽ More Deep Neural Networks often inherit spurious correlations embedded in training data and hence may fail to generalize to unseen domains, which have different distributions from the domain to provide training data. M. Arjovsky et al. (2019) introduced the concept out-of-distribution (o.o.d.) risk, which is the maximum risk among all domains, and formulated the issue caused by spurious correlations as a minimization problem of the o.o.d. risk. Invariant Risk Minimization (IRM) is considered to be a promising approach to minimize the o.o.d. risk: IRM estimates a minimum of the o.o.d. risk by solving a bi-level optimization problem. While IRM has attracted considerable attention with empirical success, it comes with few theoretical guarantees. Especially, a solid theoretical guarantee that the bi-level optimization problem gives the minimum of the o.o.d. risk has not yet been established. Aiming at providing a theoretical justification for IRM, this paper rigorously proves that a solution to the bi-level optimization problem minimizes the o.o.d. risk under certain conditions. The result also provides sufficient conditions on distributions providing training data and on a dimension of feature space for the bi-leveled optimization problem to minimize the o.o.d. risk. △ Less

Submitted 21 July, 2023; originally announced July 2023.

Comments: 23 pages, submitted for a publication

arXiv:2305.18484 [pdf, other]

Neural Fourier Transform: A General Approach to Equivariant Representation Learning

Authors: Masanori Koyama, Kenji Fukumizu, Kohei Hayashi, Takeru Miyato

Abstract: Symmetry learning has proven to be an effective approach for extracting the hidden structure of data, with the concept of equivariance relation playing the central role. However, most of the current studies are built on architectural theory and corresponding assumptions on the form of data. We propose Neural Fourier Transform (NFT), a general framework of learning the latent linear action of the g… ▽ More Symmetry learning has proven to be an effective approach for extracting the hidden structure of data, with the concept of equivariance relation playing the central role. However, most of the current studies are built on architectural theory and corresponding assumptions on the form of data. We propose Neural Fourier Transform (NFT), a general framework of learning the latent linear action of the group without assuming explicit knowledge of how the group acts on data. We present the theoretical foundations of NFT and show that the existence of a linear equivariant feature, which has been assumed ubiquitously in equivariance learning, is equivalent to the existence of a group invariant kernel on the dataspace. We also provide experimental results to demonstrate the application of NFT in typical scenarios with varying levels of knowledge about the acting group. △ Less

Submitted 14 February, 2024; v1 submitted 29 May, 2023; originally announced May 2023.

arXiv:2304.12770 [pdf, other]

Controlling Posterior Collapse by an Inverse Lipschitz Constraint on the Decoder Network

Authors: Yuri Kinoshita, Kenta Oono, Kenji Fukumizu, Yuichi Yoshida, Shin-ichi Maeda

Abstract: Variational autoencoders (VAEs) are one of the deep generative models that have experienced enormous success over the past decades. However, in practice, they suffer from a problem called posterior collapse, which occurs when the encoder coincides, or collapses, with the prior taking no information from the latent structure of the input data into consideration. In this work, we introduce an invers… ▽ More Variational autoencoders (VAEs) are one of the deep generative models that have experienced enormous success over the past decades. However, in practice, they suffer from a problem called posterior collapse, which occurs when the encoder coincides, or collapses, with the prior taking no information from the latent structure of the input data into consideration. In this work, we introduce an inverse Lipschitz neural network into the decoder and, based on this architecture, provide a new method that can control in a simple and clear manner the degree of posterior collapse for a wide range of VAE models equipped with a concrete theoretical guarantee. We also illustrate the effectiveness of our method through several numerical experiments. △ Less

Submitted 2 February, 2024; v1 submitted 25 April, 2023; originally announced April 2023.

Comments: accepted to ICML 2023, some notations adjusted from the submitted version

arXiv:2302.12498 [pdf, other]

Scalable Unbalanced Sobolev Transport for Measures on a Graph

Authors: Tam Le, Truyen Nguyen, Kenji Fukumizu

Abstract: Optimal transport (OT) is a popular and powerful tool for comparing probability measures. However, OT suffers a few drawbacks: (i) input measures required to have the same mass, (ii) a high computational complexity, and (iii) indefiniteness which limits its applications on kernel-dependent algorithmic approaches. To tackle issues (ii)--(iii), Le et al. (2022) recently proposed Sobolev transport fo… ▽ More Optimal transport (OT) is a popular and powerful tool for comparing probability measures. However, OT suffers a few drawbacks: (i) input measures required to have the same mass, (ii) a high computational complexity, and (iii) indefiniteness which limits its applications on kernel-dependent algorithmic approaches. To tackle issues (ii)--(iii), Le et al. (2022) recently proposed Sobolev transport for measures on a graph having the same total mass by leveraging the graph structure over supports. In this work, we consider measures that may have different total mass and are supported on a graph metric space. To alleviate the disadvantages (i)--(iii) of OT, we propose a novel and scalable approach to extend Sobolev transport for this unbalanced setting where measures may have different total mass. We show that the proposed unbalanced Sobolev transport (UST) admits a closed-form formula for fast computation, and it is also negative definite. Additionally, we derive geometric structures for the UST and establish relations between our UST and other transport distances. We further exploit the negative definiteness to design positive definite kernels and evaluate them on various simulations to illustrate their fast computation and comparable performances against other transport baselines for unbalanced measures on a graph. △ Less

Submitted 24 February, 2023; originally announced February 2023.

Comments: to appear in AISTATS 2023. arXiv admin note: text overlap with arXiv:2101.09756

arXiv:2210.09745 [pdf, other]

Transfer learning with affine model transformation

Authors: Shunya Minami, Kenji Fukumizu, Yoshihiro Hayashi, Ryo Yoshida

Abstract: Supervised transfer learning has received considerable attention due to its potential to boost the predictive power of machine learning in scenarios where data are scarce. Generally, a given set of source models and a dataset from a target domain are used to adapt the pre-trained models to a target domain by statistically learning domain shift and domain-specific factors. While such procedurally a… ▽ More Supervised transfer learning has received considerable attention due to its potential to boost the predictive power of machine learning in scenarios where data are scarce. Generally, a given set of source models and a dataset from a target domain are used to adapt the pre-trained models to a target domain by statistically learning domain shift and domain-specific factors. While such procedurally and intuitively plausible methods have achieved great success in a wide range of real-world applications, the lack of a theoretical basis hinders further methodological development. This paper presents a general class of transfer learning regression called affine model transfer, following the principle of expected-square loss minimization. It is shown that the affine model transfer broadly encompasses various existing methods, including the most common procedure based on neural feature extractors. Furthermore, the current paper clarifies theoretical properties of the affine model transfer such as generalization error and excess risk. Through several case studies, we demonstrate the practical benefits of modeling and estimating inter-domain commonality and domain-specific factors separately with the affine-type transfer models. △ Less

Submitted 19 January, 2024; v1 submitted 18 October, 2022; originally announced October 2022.

Comments: 34 pages

Journal ref: NeurIPS 2023

arXiv:2210.07413 [pdf, other]

Invariance-adapted decomposition and Lasso-type contrastive learning

Authors: Masanori Koyama, Takeru Miyato, Kenji Fukumizu

Abstract: Recent years have witnessed the effectiveness of contrastive learning in obtaining the representation of dataset that is useful in interpretation and downstream tasks. However, the mechanism that describes this effectiveness have not been thoroughly analyzed, and many studies have been conducted to investigate the data structures captured by contrastive learning. In particular, the recent study of… ▽ More Recent years have witnessed the effectiveness of contrastive learning in obtaining the representation of dataset that is useful in interpretation and downstream tasks. However, the mechanism that describes this effectiveness have not been thoroughly analyzed, and many studies have been conducted to investigate the data structures captured by contrastive learning. In particular, the recent study of \citet{content_isolate} has shown that contrastive learning is capable of decomposing the data space into the space that is invariant to all augmentations and its complement. In this paper, we introduce the notion of invariance-adapted latent space that decomposes the data space into the intersections of the invariant spaces of each augmentation and their complements. This decomposition generalizes the one introduced in \citet{content_isolate}, and describes a structure that is analogous to the frequencies in the harmonic analysis of a group. We experimentally show that contrastive learning with lasso-type metric can be used to find an invariance-adapted latent space, thereby suggesting a new potential for the contrastive learning. We also investigate when such a latent space can be identified up to mixings within each component. △ Less

Submitted 13 October, 2022; originally announced October 2022.

Journal ref: 2022 ICML workshop of Topology, Algebra and Geometry in Machine Learning (spotlight)

arXiv:2210.05972 [pdf, other]

Unsupervised Learning of Equivariant Structure from Sequences

Authors: Takeru Miyato, Masanori Koyama, Kenji Fukumizu

Abstract: In this study, we present meta-sequential prediction (MSP), an unsupervised framework to learn the symmetry from the time sequence of length at least three. Our method leverages the stationary property (e.g. constant velocity, constant acceleration) of the time sequence to learn the underlying equivariant structure of the dataset by simply training the encoder-decoder model to be able to predict t… ▽ More In this study, we present meta-sequential prediction (MSP), an unsupervised framework to learn the symmetry from the time sequence of length at least three. Our method leverages the stationary property (e.g. constant velocity, constant acceleration) of the time sequence to learn the underlying equivariant structure of the dataset by simply training the encoder-decoder model to be able to predict the future observations. We will demonstrate that, with our framework, the hidden disentangled structure of the dataset naturally emerges as a by-product by applying simultaneous block-diagonalization to the transition operators in the latent space, the procedure which is commonly used in representation theory to decompose the feature-space based on the type of response to group actions. We will showcase our method from both empirical and theoretical perspectives. Our result suggests that finding a simple structured relation and learning a model with extrapolation capability are two sides of the same coin. The code is available at https://github.com/takerum/meta_sequential_prediction. △ Less

Submitted 12 October, 2022; originally announced October 2022.

Comments: Accepted to NeurIPS 2022

arXiv:2206.01795 [pdf, other]

Robust Topological Inference in the Presence of Outliers

Authors: Siddharth Vishwanath, Bharath K. Sriperumbudur, Kenji Fukumizu, Satoshi Kuriki

Abstract: The distance function to a compact set plays a crucial role in the paradigm of topological data analysis. In particular, the sublevel sets of the distance function are used in the computation of persistent homology -- a backbone of the topological data analysis pipeline. Despite its stability to perturbations in the Hausdorff distance, persistent homology is highly sensitive to outliers. In this w… ▽ More The distance function to a compact set plays a crucial role in the paradigm of topological data analysis. In particular, the sublevel sets of the distance function are used in the computation of persistent homology -- a backbone of the topological data analysis pipeline. Despite its stability to perturbations in the Hausdorff distance, persistent homology is highly sensitive to outliers. In this work, we develop a framework of statistical inference for persistent homology in the presence of outliers. Drawing inspiration from recent developments in robust statistics, we propose a $\textit{median-of-means}$ variant of the distance function ($\textsf{MoM Dist}$), and establish its statistical properties. In particular, we show that, even in the presence of outliers, the sublevel filtrations and weighted filtrations induced by $\textsf{MoM Dist}$ are both consistent estimators of the true underlying population counterpart, and their rates of convergence in the bottleneck metric are controlled by the fraction of outliers in the data. Finally, we demonstrate the advantages of the proposed methodology through simulations and applications. △ Less

Submitted 3 June, 2022; originally announced June 2022.

Comments: 50 pages, 10 figures

MSC Class: 62R40; 55N31; 68T09

arXiv:2204.12194 [pdf, other]

Procedure to Reveal the Mechanism of Pattern Formation Process by Topological Data Analysis

Authors: Yoh-ichi Mototake, Masaichiro Mizumaki, Kazue Kudo, Kenji Fukumizu

Abstract: Topological data analysis (TDA) is a versatile tool that can be used to extract scientific knowledge from complex pattern formation processes. However, the physics correspondence between the features obtained from TDA and pattern dynamics does not agree one-to-one, and the physical interpretation of the TDA features needs to be set appropriately according to the phenomenon to be analyzed. In this… ▽ More Topological data analysis (TDA) is a versatile tool that can be used to extract scientific knowledge from complex pattern formation processes. However, the physics correspondence between the features obtained from TDA and pattern dynamics does not agree one-to-one, and the physical interpretation of the TDA features needs to be set appropriately according to the phenomenon to be analyzed. In this study, we propose an analytical procedure to physically interpret pattern dynamics through TDA and machine learning techniques. The proposed procedure was applied to the process of magnetic domain pattern formation to quantify non-trivial domain pattern classifications and reveal the nature of the underlying dynamics. On the basis of these findings, we also propose a candidate reduction model to understand the nature of magnetic domain formation. △ Less

Submitted 8 July, 2024; v1 submitted 26 April, 2022; originally announced April 2022.

Comments: 54 pages, 19 figures

arXiv:2203.15549 [pdf, other]

Invariance Learning based on Label Hierarchy

Authors: Shoji Toyota, Kenji Fukumizu

Abstract: Deep Neural Networks inherit spurious correlations embedded in training data and hence may fail to predict desired labels on unseen domains (or environments), which have different distributions from the domain used in training. Invariance Learning (IL) has been developed recently to overcome this shortcoming; using training data in many domains, IL estimates such a predictor that is invariant to a… ▽ More Deep Neural Networks inherit spurious correlations embedded in training data and hence may fail to predict desired labels on unseen domains (or environments), which have different distributions from the domain used in training. Invariance Learning (IL) has been developed recently to overcome this shortcoming; using training data in many domains, IL estimates such a predictor that is invariant to a change of domain. However, the requirement of training data in multiple domains is a strong restriction of IL, since it often needs high annotation cost. We propose a novel IL framework to overcome this problem. Assuming the availability of data from multiple domains for a higher level of classification task, for which the labeling cost is low, we estimate an invariant predictor for the target classification task with training data in a single domain. Additionally, we propose two cross-validation methods for selecting hyperparameters of invariance regularization to solve the issue of hyperparameter selection, which has not been handled properly in existing IL methods. The effectiveness of the proposed framework, including the cross-validation, is demonstrated empirically, and the correctness of the hyperparameter selection is proved under some conditions. △ Less

Submitted 29 March, 2022; originally announced March 2022.

Comments: 30 pages, submitted for a publication

arXiv:2202.10281 [pdf, other]

doi 10.1109/ACCESS.2022.3169594

ALGAN: Anomaly Detection by Generating Pseudo Anomalous Data via Latent Variables

Authors: Hironori Murase, Kenji Fukumizu

Abstract: In many anomaly detection tasks, where anomalous data rarely appear and are difficult to collect, training using only normal data is important. Although it is possible to manually create anomalous data using prior knowledge, they may be subject to user bias. In this paper, we propose an Anomalous Latent variable Generative Adversarial Network (ALGAN) in which the GAN generator produces pseudo-anom… ▽ More In many anomaly detection tasks, where anomalous data rarely appear and are difficult to collect, training using only normal data is important. Although it is possible to manually create anomalous data using prior knowledge, they may be subject to user bias. In this paper, we propose an Anomalous Latent variable Generative Adversarial Network (ALGAN) in which the GAN generator produces pseudo-anomalous data as well as fake-normal data, whereas the discriminator is trained to distinguish between normal and pseudo-anomalous data. This differs from the standard GAN discriminator, which specializes in classifying two similar classes. The training dataset contains only normal data; the latent variables are introduced in anomalous states and are input into the generator to produce diverse pseudo-anomalous data. We compared the performance of ALGAN with other existing methods on the MVTec-AD, Magnetic Tile Defects, and COIL-100 datasets. The experimental results showed that ALGAN exhibited an AUROC comparable to those of state-of-the-art methods while achieving a much faster prediction time. △ Less

Submitted 9 May, 2022; v1 submitted 21 February, 2022; originally announced February 2022.

Comments: 13 pages, 8 figures

Journal ref: IEEE Access, vol. 10, pp. 44259-44270, 2022

arXiv:2110.05225 [pdf, other]

$β$-Intact-VAE: Identifying and Estimating Causal Effects under Limited Overlap

Authors: Pengzhou Wu, Kenji Fukumizu

Abstract: As an important problem in causal inference, we discuss the identification and estimation of treatment effects (TEs) under limited overlap; that is, when subjects with certain features belong to a single treatment group. We use a latent variable to model a prognostic score which is widely used in biostatistics and sufficient for TEs; i.e., we build a generative prognostic model. We prove that the… ▽ More As an important problem in causal inference, we discuss the identification and estimation of treatment effects (TEs) under limited overlap; that is, when subjects with certain features belong to a single treatment group. We use a latent variable to model a prognostic score which is widely used in biostatistics and sufficient for TEs; i.e., we build a generative prognostic model. We prove that the latent variable recovers a prognostic score, and the model identifies individualized treatment effects. The model is then learned as β-Intact-VAE--a new type of variational autoencoder (VAE). We derive the TE error bounds that enable representations balanced for treatment groups conditioned on individualized features. The proposed method is compared with recent methods using (semi-)synthetic datasets. △ Less

Submitted 11 October, 2021; originally announced October 2021.

Comments: Updated version of the NeurIPS 2021 submission (https://openreview.net/forum?id=Z3yd722b5X5). Largely improve readability and the presentation of experimental results. arXiv admin note: text overlap with arXiv:2109.15062, arXiv:2101.06662

arXiv:2109.15062 [pdf, other]

Towards Principled Causal Effect Estimation by Deep Identifiable Models

Authors: Pengzhou Wu, Kenji Fukumizu

Abstract: As an important problem in causal inference, we discuss the estimation of treatment effects (TEs). Representing the confounder as a latent variable, we propose Intact-VAE, a new variant of variational autoencoder (VAE), motivated by the prognostic score that is sufficient for identifying TEs. Our VAE also naturally gives representations balanced for treatment groups, using its prior. Experiments o… ▽ More As an important problem in causal inference, we discuss the estimation of treatment effects (TEs). Representing the confounder as a latent variable, we propose Intact-VAE, a new variant of variational autoencoder (VAE), motivated by the prognostic score that is sufficient for identifying TEs. Our VAE also naturally gives representations balanced for treatment groups, using its prior. Experiments on (semi-)synthetic datasets show state-of-the-art performance under diverse settings, including unobserved confounding. Based on the identifiability of our model, we prove identification of TEs under unconfoundedness, and also discuss (possible) extensions to harder settings. △ Less

Submitted 1 November, 2021; v1 submitted 30 September, 2021; originally announced September 2021.

Comments: Fully updated. Largely improve clarity, add identification under unconfoundedness (Sec. 4.2), and more. arXiv admin note: substantial text overlap with arXiv:2101.06662

arXiv:2108.11018 [pdf, other]

A Scaling Law for Synthetic-to-Real Transfer: How Much Is Your Pre-training Effective?

Authors: Hiroaki Mikami, Kenji Fukumizu, Shogo Murai, Shuji Suzuki, Yuta Kikuchi, Taiji Suzuki, Shin-ichi Maeda, Kohei Hayashi

Abstract: Synthetic-to-real transfer learning is a framework in which a synthetically generated dataset is used to pre-train a model to improve its performance on real vision tasks. The most significant advantage of using synthetic images is that the ground-truth labels are automatically available, enabling unlimited expansion of the data size without human cost. However, synthetic data may have a huge doma… ▽ More Synthetic-to-real transfer learning is a framework in which a synthetically generated dataset is used to pre-train a model to improve its performance on real vision tasks. The most significant advantage of using synthetic images is that the ground-truth labels are automatically available, enabling unlimited expansion of the data size without human cost. However, synthetic data may have a huge domain gap, in which case increasing the data size does not improve the performance. How can we know that? In this study, we derive a simple scaling law that predicts the performance from the amount of pre-training data. By estimating the parameters of the law, we can judge whether we should increase the data or change the setting of image synthesis. Further, we analyze the theory of transfer learning by considering learning dynamics and confirm that the derived generalization bound is consistent with our empirical findings. We empirically validated our scaling law on various experimental settings of benchmark tasks, model sizes, and complexities of synthetic images. △ Less

Submitted 8 October, 2021; v1 submitted 24 August, 2021; originally announced August 2021.

arXiv:2101.06662 [pdf, other]

Intact-VAE: Estimating Treatment Effects under Unobserved Confounding

Authors: Pengzhou Wu, Kenji Fukumizu

Abstract: NOTE: This preprint has a flawed theoretical formulation. Please avoid it and refer to the ICLR22 publication https://openreview.net/forum?id=q7n2RngwOM. Also, arXiv:2109.15062 contains some new ideas on unobserved Confounding. As an important problem of causal inference, we discuss the identification and estimation of treatment effects under unobserved confounding. Representing the confounder a… ▽ More NOTE: This preprint has a flawed theoretical formulation. Please avoid it and refer to the ICLR22 publication https://openreview.net/forum?id=q7n2RngwOM. Also, arXiv:2109.15062 contains some new ideas on unobserved Confounding. As an important problem of causal inference, we discuss the identification and estimation of treatment effects under unobserved confounding. Representing the confounder as a latent variable, we propose Intact-VAE, a new variant of variational autoencoder (VAE), motivated by the prognostic score that is sufficient for identifying treatment effects. We theoretically show that, under certain settings, treatment effects are identified by our model, and further, based on the identifiability of our model (i.e., determinacy of representation), our VAE is a consistent estimator with representation balanced for treatment groups. Experiments on (semi-)synthetic datasets show state-of-the-art performance under diverse settings. △ Less

Submitted 20 April, 2022; v1 submitted 17 January, 2021; originally announced January 2021.

Comments: This preprint has a flawed theoretical formulation. It was intended as a theoretical update of https://openreview.net/forum?id=D3TNqCspFpM

arXiv:2011.02315 [pdf, other]

Kernel Mean Embedding of Probability Measures and its Applications to Functional Data Analysis

Authors: Saeed Hayati, Kenji Fukumizu, Afshin Parvardeh

Abstract: This study intends to introduce kernel mean embedding of probability measures over infinite-dimensional separable Hilbert spaces induced by functional response statistical models. The embedded function represents the concentration of probability measures in small open neighborhoods, which identifies a pseudo-likelihood and fosters a rich framework for statistical inference. Utilizing Maximum Mean… ▽ More This study intends to introduce kernel mean embedding of probability measures over infinite-dimensional separable Hilbert spaces induced by functional response statistical models. The embedded function represents the concentration of probability measures in small open neighborhoods, which identifies a pseudo-likelihood and fosters a rich framework for statistical inference. Utilizing Maximum Mean Discrepancy, we devise new tests in functional response models. The performance of new derived tests is evaluated against competitors in three major problems in functional data analysis including function-on-scalar regression, functional one-way ANOVA, and equality of covariance operators. △ Less

Submitted 4 November, 2020; originally announced November 2020.

Comments: 37 Pages, 2 figures, Submitted to Electronic Journal of Statistic

MSC Class: 62R10 (Primary) 46N30 (Secondary)

arXiv:2011.02256 [pdf, other]

Advantage of Deep Neural Networks for Estimating Functions with Singularity on Hypersurfaces

Authors: Masaaki Imaizumi, Kenji Fukumizu

Abstract: We develop a minimax rate analysis to describe the reason that deep neural networks (DNNs) perform better than other standard methods. For nonparametric regression problems, it is well known that many standard methods attain the minimax optimal rate of estimation errors for smooth functions, and thus, it is not straightforward to identify the theoretical advantages of DNNs. This study tries to fil… ▽ More We develop a minimax rate analysis to describe the reason that deep neural networks (DNNs) perform better than other standard methods. For nonparametric regression problems, it is well known that many standard methods attain the minimax optimal rate of estimation errors for smooth functions, and thus, it is not straightforward to identify the theoretical advantages of DNNs. This study tries to fill this gap by considering the estimation for a class of non-smooth functions that have singularities on hypersurfaces. Our findings are as follows: (i) We derive the generalization error of a DNN estimator and prove that its convergence rate is almost optimal. (ii) We elucidate a phase diagram of estimation problems, which describes the situations where the DNNs outperform a general class of estimators, including kernel methods, Gaussian process methods, and others. We additionally show that DNNs outperform harmonic analysis based estimators. This advantage of DNNs comes from the fact that a shape of singularity can be successfully handled by their multi-layered structure. △ Less

Submitted 8 February, 2022; v1 submitted 4 November, 2020; originally announced November 2020.

Comments: Complete version of arXiv:1802.04474

arXiv:2007.02809 [pdf, other]

Meta Learning for Causal Direction

Authors: Jean-Francois Ton, Dino Sejdinovic, Kenji Fukumizu

Abstract: The inaccessibility of controlled randomized trials due to inherent constraints in many fields of science has been a fundamental issue in causal inference. In this paper, we focus on distinguishing the cause from effect in the bivariate setting under limited observational data. Based on recent developments in meta learning as well as in causal inference, we introduce a novel generative model that… ▽ More The inaccessibility of controlled randomized trials due to inherent constraints in many fields of science has been a fundamental issue in causal inference. In this paper, we focus on distinguishing the cause from effect in the bivariate setting under limited observational data. Based on recent developments in meta learning as well as in causal inference, we introduce a novel generative model that allows distinguishing cause and effect in the small data setting. Using a learnt task variable that contains distributional information of each dataset, we propose an end-to-end algorithm that makes use of similar training datasets at test time. We demonstrate our method on various synthetic as well as real-world data and show that it is able to maintain high accuracy in detecting directions across varying dataset sizes. △ Less

Submitted 21 February, 2021; v1 submitted 6 July, 2020; originally announced July 2020.

arXiv:2006.13228 [pdf, other]

A General Class of Transfer Learning Regression without Implementation Cost

Authors: Shunya Minami, Song Liu, Stephen Wu, Kenji Fukumizu, Ryo Yoshida

Abstract: We propose a novel framework that unifies and extends existing methods of transfer learning (TL) for regression. To bridge a pretrained source model to the model on a target task, we introduce a density-ratio reweighting function, which is estimated through the Bayesian framework with a specific prior distribution. By changing two intrinsic hyperparameters and the choice of the density-ratio model… ▽ More We propose a novel framework that unifies and extends existing methods of transfer learning (TL) for regression. To bridge a pretrained source model to the model on a target task, we introduce a density-ratio reweighting function, which is estimated through the Bayesian framework with a specific prior distribution. By changing two intrinsic hyperparameters and the choice of the density-ratio model, the proposed method can integrate three popular methods of TL: TL based on cross-domain similarity regularization, a probabilistic TL using the density-ratio estimation, and fine-tuning of pretrained neural networks. Moreover, the proposed method can benefit from its simple implementation without any additional cost; the regression model can be fully trained using off-the-shelf libraries for supervised learning in which the original output variable is simply transformed to a new output variable. We demonstrate its simplicity, generality, and applicability using various real data applications. △ Less

Submitted 16 December, 2020; v1 submitted 23 June, 2020; originally announced June 2020.

Comments: 31 pages, 6 figures

arXiv:2006.10012 [pdf, other]

Robust Persistence Diagrams using Reproducing Kernels

Authors: Siddharth Vishwanath, Kenji Fukumizu, Satoshi Kuriki, Bharath Sriperumbudur

Abstract: Persistent homology has become an important tool for extracting geometric and topological features from data, whose multi-scale features are summarized in a persistence diagram. From a statistical perspective, however, persistence diagrams are very sensitive to perturbations in the input space. In this work, we develop a framework for constructing robust persistence diagrams from superlevel filtra… ▽ More Persistent homology has become an important tool for extracting geometric and topological features from data, whose multi-scale features are summarized in a persistence diagram. From a statistical perspective, however, persistence diagrams are very sensitive to perturbations in the input space. In this work, we develop a framework for constructing robust persistence diagrams from superlevel filtrations of robust density estimators constructed using reproducing kernels. Using an analogue of the influence function on the space of persistence diagrams, we establish the proposed framework to be less sensitive to outliers. The robust persistence diagrams are shown to be consistent estimators in bottleneck distance, with the convergence rate controlled by the smoothness of the kernel. This, in turn, allows us to construct uniform confidence bands in the space of persistence diagrams. Finally, we demonstrate the superiority of the proposed approach on benchmark datasets. △ Less

Submitted 3 June, 2022; v1 submitted 17 June, 2020; originally announced June 2020.

MSC Class: 55N31; 62R40; 62G07; 46E22

arXiv:2004.01822 [pdf, other]

The equivalence between Stein variational gradient descent and black-box variational inference

Authors: Casey Chu, Kentaro Minami, Kenji Fukumizu

Abstract: We formalize an equivalence between two popular methods for Bayesian inference: Stein variational gradient descent (SVGD) and black-box variational inference (BBVI). In particular, we show that BBVI corresponds precisely to SVGD when the kernel is the neural tangent kernel. Furthermore, we interpret SVGD and BBVI as kernel gradient flows; we do this by leveraging the recent perspective that views… ▽ More We formalize an equivalence between two popular methods for Bayesian inference: Stein variational gradient descent (SVGD) and black-box variational inference (BBVI). In particular, we show that BBVI corresponds precisely to SVGD when the kernel is the neural tangent kernel. Furthermore, we interpret SVGD and BBVI as kernel gradient flows; we do this by leveraging the recent perspective that views SVGD as a gradient flow in the space of probability distributions and showing that BBVI naturally motivates a Riemannian structure on that space. We observe that kernel gradient flow also describes dynamics found in the training of generative adversarial networks (GANs). This work thereby unifies several existing techniques in variational inference and generative modeling and identifies the kernel as a fundamental object governing the behavior of these algorithms, motivating deeper analysis of its properties. △ Less

Submitted 3 April, 2020; originally announced April 2020.

Comments: ICLR 2020, Workshop on Integration of Deep Neural Models and Differential Equations

arXiv:2002.04185 [pdf, other]

Smoothness and Stability in GANs

Authors: Casey Chu, Kentaro Minami, Kenji Fukumizu

Abstract: Generative adversarial networks, or GANs, commonly display unstable behavior during training. In this work, we develop a principled theoretical framework for understanding the stability of various types of GANs. In particular, we derive conditions that guarantee eventual stationarity of the generator when it is trained with gradient descent, conditions that must be satisfied by the divergence that… ▽ More Generative adversarial networks, or GANs, commonly display unstable behavior during training. In this work, we develop a principled theoretical framework for understanding the stability of various types of GANs. In particular, we derive conditions that guarantee eventual stationarity of the generator when it is trained with gradient descent, conditions that must be satisfied by the divergence that is minimized by the GAN and the generator's architecture. We find that existing GAN variants satisfy some, but not all, of these conditions. Using tools from convex analysis, optimal transport, and reproducing kernels, we construct a GAN that fulfills these conditions simultaneously. In the process, we explain and clarify the need for various existing GAN stabilization techniques, including Lipschitz constraints, gradient penalties, and smooth activation functions. △ Less

Submitted 10 February, 2020; originally announced February 2020.

Comments: ICLR 2020

arXiv:2001.01894 [pdf]

Causal Mosaic: Cause-Effect Inference via Nonlinear ICA and Ensemble Method

Authors: Pengzhou Wu, Kenji Fukumizu

Abstract: We address the problem of distinguishing cause from effect in bivariate setting. Based on recent developments in nonlinear independent component analysis (ICA), we train nonparametrically general nonlinear causal models that allow non-additive noise. Further, we build an ensemble framework, namely Causal Mosaic, which models a causal pair by a mixture of nonlinear models. We compare this method wi… ▽ More We address the problem of distinguishing cause from effect in bivariate setting. Based on recent developments in nonlinear independent component analysis (ICA), we train nonparametrically general nonlinear causal models that allow non-additive noise. Further, we build an ensemble framework, namely Causal Mosaic, which models a causal pair by a mixture of nonlinear models. We compare this method with other recent methods on artificial and real world benchmark datasets, and our method shows state-of-the-art performance. △ Less

Submitted 7 January, 2020; originally announced January 2020.

Comments: Accepted to AISTATS 2020. Camera-ready version in preparation

Journal ref: An updated version at AISTATS 2020: http://proceedings.mlr.press/v108/wu20b/wu20b.pdf. Main changes: a correction in Theorem 3 and additional explanations in Sec. 4

arXiv:2001.00220 [pdf, other]

On the Limits of Topological Data Analysis for Statistical Inference

Authors: Siddharth Vishwanath, Kenji Fukumizu, Satoshi Kuriki, Bharath Sriperumbudur

Abstract: Topological data analysis has emerged as a powerful tool for extracting the metric, geometric and topological features underlying the data as a multi-resolution summary statistic, and has found applications in several areas where data arises from complex sources. In this paper, we examine the use of topological summary statistics through the lens of statistical inference. We investigate necessary… ▽ More Topological data analysis has emerged as a powerful tool for extracting the metric, geometric and topological features underlying the data as a multi-resolution summary statistic, and has found applications in several areas where data arises from complex sources. In this paper, we examine the use of topological summary statistics through the lens of statistical inference. We investigate necessary and sufficient conditions under which \textit{valid statistical inference} is possible using {topological summary statistics}. Additionally, we provide examples of models that demonstrate invariance with respect to topological summaries. △ Less

Submitted 15 February, 2024; v1 submitted 1 January, 2020; originally announced January 2020.

Comments: 36 pages, 9 figures

MSC Class: 62F30; 55N31; 62R40

arXiv:1910.09972 [pdf, other]

doi 10.1007/978-3-030-58520-4_37

Exchangeable deep neural networks for set-to-set matching and learning

Authors: Yuki Saito, Takuma Nakamura, Hirotaka Hachiya, Kenji Fukumizu

Abstract: Matching two different sets of items, called heterogeneous set-to-set matching problem, has recently received attention as a promising problem. The difficulties are to extract features to match a correct pair of different sets and also preserve two types of exchangeability required for set-to-set matching: the pair of sets, as well as the items in each set, should be exchangeable. In this study, w… ▽ More Matching two different sets of items, called heterogeneous set-to-set matching problem, has recently received attention as a promising problem. The difficulties are to extract features to match a correct pair of different sets and also preserve two types of exchangeability required for set-to-set matching: the pair of sets, as well as the items in each set, should be exchangeable. In this study, we propose a novel deep learning architecture to address the abovementioned difficulties and also an efficient training framework for set-to-set matching. We evaluate the methods through experiments based on two industrial applications: fashion set recommendation and group re-identification. In these experiments, we show that the proposed method provides significant improvements and results compared with the state-of-the-art methods, thereby validating our architecture for the heterogeneous set matching problem. △ Less

Submitted 28 January, 2021; v1 submitted 22 October, 2019; originally announced October 2019.

arXiv:1908.09112 [pdf, other]

Disjunct Support Spike and Slab Priors for Variable Selection in Regression under Quasi-sparseness

Authors: Daniel Andrade, Kenji Fukumizu

Abstract: Sparseness of the regression coefficient vector is often a desirable property, since, among other benefits, sparseness improves interpretability. In practice, many true regression coefficients might be negligibly small, but non-zero, which we refer to as quasi-sparseness. Spike-and-slab priors as introduced in (Chipman et al., 2001) can be tuned to ignore very small regression coefficients, and, a… ▽ More Sparseness of the regression coefficient vector is often a desirable property, since, among other benefits, sparseness improves interpretability. In practice, many true regression coefficients might be negligibly small, but non-zero, which we refer to as quasi-sparseness. Spike-and-slab priors as introduced in (Chipman et al., 2001) can be tuned to ignore very small regression coefficients, and, as a consequence provide a trade-off between prediction accuracy and interpretability. However, spike-and-slab priors with full support lead to inconsistent Bayes factors, in the sense that the Bayes factors of any two models are bounded in probability. This is clearly an undesirable property for Bayesian hypotheses testing, where we wish that increasing sample sizes lead to increasing Bayes factors favoring the true model. The moment matching priors as in (Johnson and Rossell, 2012) can resolve this issue, but are unsuitable for the quasi-sparse setting due to their full support outside the exact value 0. As a remedy, we suggest disjunct support spike and slab priors, for which we prove consistent Bayes factors in the quasi-sparse setting, and show experimentally fast growing Bayes factors favoring the true model. Several experiments on simulated and real data confirm the usefulness of our proposed method to identify models with high effect size, while leading to better control over false positives than hard-thresholding. △ Less

Submitted 29 September, 2019; v1 submitted 24 August, 2019; originally announced August 2019.

arXiv:1907.00586 [pdf, other]

doi 10.1093/jrsssb/qkad050

A Kernel Stein Test for Comparing Latent Variable Models

Authors: Heishiro Kanagawa, Wittawat Jitkrittum, Lester Mackey, Kenji Fukumizu, Arthur Gretton

Abstract: We propose a kernel-based nonparametric test of relative goodness of fit, where the goal is to compare two models, both of which may have unobserved latent variables, such that the marginal distribution of the observed variables is intractable. The proposed test generalizes the recently proposed kernel Stein discrepancy (KSD) tests (Liu et al., 2016, Chwialkowski et al., 2016, Yang et al., 2018) t… ▽ More We propose a kernel-based nonparametric test of relative goodness of fit, where the goal is to compare two models, both of which may have unobserved latent variables, such that the marginal distribution of the observed variables is intractable. The proposed test generalizes the recently proposed kernel Stein discrepancy (KSD) tests (Liu et al., 2016, Chwialkowski et al., 2016, Yang et al., 2018) to the case of latent variable models, a much more general class than the fully observed models treated previously. The new test, with a properly calibrated threshold, has a well-controlled type-I error. In the case of certain models with low-dimensional latent structure and high-dimensional observations, our test significantly outperforms the relative Maximum Mean Discrepancy test, which is based on samples from the models and does not exploit the latent structure. △ Less

Submitted 9 May, 2023; v1 submitted 1 July, 2019; originally announced July 2019.

Comments: This is a pre-copyedited, author-produced version of an article accepted for publication in The Journal of the Royal Statistical Society Series: B following peer review

arXiv:1906.04868 [pdf, other]

Semi-flat minima and saddle points by embedding neural networks to overparameterization

Authors: Kenji Fukumizu, Shoichiro Yamaguchi, Yoh-ichi Mototake, Mirai Tanaka

Abstract: We theoretically study the landscape of the training error for neural networks in overparameterized cases. We consider three basic methods for embedding a network into a wider one with more hidden units, and discuss whether a minimum point of the narrower network gives a minimum or saddle point of the wider one. Our results show that the networks with smooth and ReLU activation have different part… ▽ More We theoretically study the landscape of the training error for neural networks in overparameterized cases. We consider three basic methods for embedding a network into a wider one with more hidden units, and discuss whether a minimum point of the narrower network gives a minimum or saddle point of the wider one. Our results show that the networks with smooth and ReLU activation have different partially flat landscapes around the embedded point. We also relate these results to a difference of their generalization abilities in overparameterized realization. △ Less

Submitted 14 June, 2019; v1 submitted 11 June, 2019; originally announced June 2019.

Comments: 38 pages, 4 figures

arXiv:1903.01680 [pdf, other]

Convex Covariate Clustering for Classification

Authors: Daniel Andrade, Kenji Fukumizu, Yuzuru Okajima

Abstract: Clustering, like covariate selection for classification, is an important step to compress and interpret the data. However, clustering of covariates is often performed independently of the classification step, which can lead to undesirable clustering results that harm interpretability and compression rate. Therefore, we propose a method that can cluster covariates while taking into account class la… ▽ More Clustering, like covariate selection for classification, is an important step to compress and interpret the data. However, clustering of covariates is often performed independently of the classification step, which can lead to undesirable clustering results that harm interpretability and compression rate. Therefore, we propose a method that can cluster covariates while taking into account class label information of samples. We formulate the problem as a convex optimization problem which uses both, a-priori similarity information between covariates, and information from class-labeled samples. Like ordinary convex clustering [Chi and Lange, 2015], the proposed method offers a unique global minima making it insensitive to initialization. In order to solve the convex problem, we propose a specialized alternating direction method of multipliers (ADMM), which scales up to several thousands of variables. Furthermore, in order to circumvent computationally expensive cross-validation, we propose a model selection criterion based on approximating the marginal likelihood. Experiments on synthetic and real data confirm the usefulness of the proposed clustering method and the selection criterion. △ Less

Submitted 6 April, 2020; v1 submitted 5 March, 2019; originally announced March 2019.

Comments: Under consideration at Pattern Recognition Letters

arXiv:1902.00342 [pdf, other]

Tree-Sliced Variants of Wasserstein Distances

Authors: Tam Le, Makoto Yamada, Kenji Fukumizu, Marco Cuturi

Abstract: Optimal transport (\OT) theory defines a powerful set of tools to compare probability distributions. \OT~suffers however from a few drawbacks, computational and statistical, which have encouraged the proposal of several regularized variants of OT in the recent literature, one of the most notable being the \textit{sliced} formulation, which exploits the closed-form formula between univariate distri… ▽ More Optimal transport (\OT) theory defines a powerful set of tools to compare probability distributions. \OT~suffers however from a few drawbacks, computational and statistical, which have encouraged the proposal of several regularized variants of OT in the recent literature, one of the most notable being the \textit{sliced} formulation, which exploits the closed-form formula between univariate distributions by projecting high-dimensional measures onto random lines. We consider in this work a more general family of ground metrics, namely \textit{tree metrics}, which also yield fast closed-form computations and negative definite, and of which the sliced-Wasserstein distance is a particular case (the tree is a chain). We propose the tree-sliced Wasserstein distance, computed by averaging the Wasserstein distance between these measures using random tree metrics, built adaptively in either low or high-dimensional spaces. Exploiting the negative definiteness of that distance, we also propose a positive definite kernel, and test it against other baselines on a few benchmark tasks. △ Less

Submitted 28 October, 2019; v1 submitted 1 February, 2019; originally announced February 2019.

Comments: Camera-ready for NeurIPS 2019

arXiv:1809.00800 [pdf, other]

doi 10.18653/v1/D18-1203

Pointwise HSIC: A Linear-Time Kernelized Co-occurrence Norm for Sparse Linguistic Expressions

Authors: Sho Yokoi, Sosuke Kobayashi, Kenji Fukumizu, Jun Suzuki, Kentaro Inui

Abstract: In this paper, we propose a new kernel-based co-occurrence measure that can be applied to sparse linguistic expressions (e.g., sentences) with a very short learning time, as an alternative to pointwise mutual information (PMI). As well as deriving PMI from mutual information, we derive this new measure from the Hilbert--Schmidt independence criterion (HSIC); thus, we call the new measure the point… ▽ More In this paper, we propose a new kernel-based co-occurrence measure that can be applied to sparse linguistic expressions (e.g., sentences) with a very short learning time, as an alternative to pointwise mutual information (PMI). As well as deriving PMI from mutual information, we derive this new measure from the Hilbert--Schmidt independence criterion (HSIC); thus, we call the new measure the pointwise HSIC (PHSIC). PHSIC can be interpreted as a smoothed variant of PMI that allows various similarity metrics (e.g., sentence embeddings) to be plugged in as kernels. Moreover, PHSIC can be estimated by simple and fast (linear in the size of the data) matrix calculations regardless of whether we use linear or nonlinear kernels. Empirically, in a dialogue response selection task, PHSIC is learned thousands of times faster than an RNN-based PMI while outperforming PMI in accuracy. In addition, we also demonstrate that PHSIC is beneficial as a criterion of a data selection task for machine translation owing to its ability to give high (low) scores to a consistent (inconsistent) pair with other pairs. △ Less

Submitted 4 September, 2018; originally announced September 2018.

Comments: Accepted by EMNLP 2018

Journal ref: EMNLP 2018

arXiv:1806.05924 [pdf, other]

Robust Bayesian Model Selection for Variable Clustering with the Gaussian Graphical Model

Authors: Daniel Andrade, Akiko Takeda, Kenji Fukumizu

Abstract: Variable clustering is important for explanatory analysis. However, only few dedicated methods for variable clustering with the Gaussian graphical model have been proposed. Even more severe, small insignificant partial correlations due to noise can dramatically change the clustering result when evaluating for example with the Bayesian Information Criteria (BIC). In this work, we try to address thi… ▽ More Variable clustering is important for explanatory analysis. However, only few dedicated methods for variable clustering with the Gaussian graphical model have been proposed. Even more severe, small insignificant partial correlations due to noise can dramatically change the clustering result when evaluating for example with the Bayesian Information Criteria (BIC). In this work, we try to address this issue by proposing a Bayesian model that accounts for negligible small, but not necessarily zero, partial correlations. Based on our model, we propose to evaluate a variable clustering result using the marginal likelihood. To address the intractable calculation of the marginal likelihood, we propose two solutions: one based on a variational approximation, and another based on MCMC. Experiments on simulated data shows that the proposed method is similarly accurate as BIC in the no noise setting, but considerably more accurate when there are noisy partial correlations. Furthermore, on real data the proposed method provides clustering results that are intuitively sensible, which is not always the case when using BIC or its extensions. △ Less

Submitted 15 June, 2018; originally announced June 2018.

arXiv:1805.08463 [pdf, other]

Variational Learning on Aggregate Outputs with Gaussian Processes

Authors: Ho Chung Leon Law, Dino Sejdinovic, Ewan Cameron, Tim CD Lucas, Seth Flaxman, Katherine Battle, Kenji Fukumizu

Abstract: While a typical supervised learning framework assumes that the inputs and the outputs are measured at the same levels of granularity, many applications, including global map** of disease, only have access to outputs at a much coarser level than that of the inputs. Aggregation of outputs makes generalization to new inputs much more difficult. We consider an approach to this problem based on varia… ▽ More While a typical supervised learning framework assumes that the inputs and the outputs are measured at the same levels of granularity, many applications, including global map** of disease, only have access to outputs at a much coarser level than that of the inputs. Aggregation of outputs makes generalization to new inputs much more difficult. We consider an approach to this problem based on variational learning with a model of output aggregation and Gaussian processes, where aggregation leads to intractability of the standard evidence lower bounds. We propose new bounds and tractable approximations, leading to improved prediction accuracy and scalability to large datasets, while explicitly taking uncertainty into account. We develop a framework which extends to several types of likelihoods, including the Poisson model for aggregated count data. We apply our framework to a challenging and important problem, the fine-scale spatial modelling of malaria incidence, with over 1 million observations. △ Less

Submitted 22 May, 2018; originally announced May 2018.

arXiv:1802.08404 [pdf, other]

Kernel Recursive ABC: Point Estimation with Intractable Likelihood

Authors: Takafumi Kajihara, Motonobu Kanagawa, Keisuke Yamazaki, Kenji Fukumizu

Abstract: We propose a novel approach to parameter estimation for simulator-based statistical models with intractable likelihood. Our proposed method involves recursive application of kernel ABC and kernel herding to the same observed data. We provide a theoretical explanation regarding why the approach works, showing (for the population setting) that, under a certain assumption, point estimates obtained wi… ▽ More We propose a novel approach to parameter estimation for simulator-based statistical models with intractable likelihood. Our proposed method involves recursive application of kernel ABC and kernel herding to the same observed data. We provide a theoretical explanation regarding why the approach works, showing (for the population setting) that, under a certain assumption, point estimates obtained with this method converge to the true parameter, as recursion proceeds. We have conducted a variety of numerical experiments, including parameter estimation for a real-world pedestrian flow simulator, and show that in most cases our method outperforms existing approaches. △ Less

Submitted 12 June, 2018; v1 submitted 23 February, 2018; originally announced February 2018.

Comments: to appear in ICML 2018. 18 pages

arXiv:1802.06226 [pdf, other]

Post Selection Inference with Incomplete Maximum Mean Discrepancy Estimator

Authors: Makoto Yamada, Denny Wu, Yao-Hung Hubert Tsai, Ichiro Takeuchi, Ruslan Salakhutdinov, Kenji Fukumizu

Abstract: Measuring divergence between two distributions is essential in machine learning and statistics and has various applications including binary classification, change point detection, and two-sample test. Furthermore, in the era of big data, designing divergence measure that is interpretable and can handle high-dimensional and complex data becomes extremely important. In the paper, we propose a post… ▽ More Measuring divergence between two distributions is essential in machine learning and statistics and has various applications including binary classification, change point detection, and two-sample test. Furthermore, in the era of big data, designing divergence measure that is interpretable and can handle high-dimensional and complex data becomes extremely important. In the paper, we propose a post selection inference (PSI) framework for divergence measure, which can select a set of statistically significant features that discriminate two distributions. Specifically, we employ an additive variant of maximum mean discrepancy (MMD) for features and introduce a general hypothesis test for PSI. A novel MMD estimator using the incomplete U-statistics, which has an asymptotically Normal distribution (under mild assumptions) and gives high detection power in PSI, is also proposed and analyzed theoretically. Through synthetic and real-world feature selection experiments, we show that the proposed framework can successfully detect statistically significant features. Last, we propose a sample selection framework for analyzing different members in the Generative Adversarial Networks (GANs) family. △ Less

Submitted 17 February, 2018; originally announced February 2018.

arXiv:1802.05411 [pdf, ps, other]

Selecting the Best in GANs Family: a Post Selection Inference Framework

Authors: Yao-Hung Hubert Tsai, Makoto Yamada, Denny Wu, Ruslan Salakhutdinov, Ichiro Takeuchi, Kenji Fukumizu

Abstract: "Which Generative Adversarial Networks (GANs) generates the most plausible images?" has been a frequently asked question among researchers. To address this problem, we first propose an \emph{incomplete} U-statistics estimate of maximum mean discrepancy $\mathrm{MMD}_{inc}$ to measure the distribution discrepancy between generated and real images. $\mathrm{MMD}_{inc}$ enjoys the advantages of asymp… ▽ More "Which Generative Adversarial Networks (GANs) generates the most plausible images?" has been a frequently asked question among researchers. To address this problem, we first propose an \emph{incomplete} U-statistics estimate of maximum mean discrepancy $\mathrm{MMD}_{inc}$ to measure the distribution discrepancy between generated and real images. $\mathrm{MMD}_{inc}$ enjoys the advantages of asymptotic normality, computation efficiency, and model agnosticity. We then propose a GANs analysis framework to select and test the "best" member in GANs family using the Post Selection Inference (PSI) with $\mathrm{MMD}_{inc}$. In the experiments, we adopt the proposed framework on 7 GANs variants and compare their $\mathrm{MMD}_{inc}$ scores. △ Less

Submitted 23 June, 2018; v1 submitted 15 February, 2018; originally announced February 2018.

arXiv:1802.04474 [pdf, other]

Deep Neural Networks Learn Non-Smooth Functions Effectively

Authors: Masaaki Imaizumi, Kenji Fukumizu

Abstract: We theoretically discuss why deep neural networks (DNNs) performs better than other models in some cases by investigating statistical properties of DNNs for non-smooth functions. While DNNs have empirically shown higher performance than other standard methods, understanding its mechanism is still a challenging problem. From an aspect of the statistical theory, it is known many standard methods att… ▽ More We theoretically discuss why deep neural networks (DNNs) performs better than other models in some cases by investigating statistical properties of DNNs for non-smooth functions. While DNNs have empirically shown higher performance than other standard methods, understanding its mechanism is still a challenging problem. From an aspect of the statistical theory, it is known many standard methods attain the optimal rate of generalization errors for smooth functions in large sample asymptotics, and thus it has not been straightforward to find theoretical advantages of DNNs. This paper fills this gap by considering learning of a certain class of non-smooth functions, which was not covered by the previous theory. We derive the generalization error of estimators by DNNs with a ReLU activation, and show that convergence rates of the generalization by DNNs are almost optimal to estimate the non-smooth functions, while some of the popular models do not attain the optimal rate. In addition, our theoretical result provides guidelines for selecting an appropriate number of layers and edges of DNNs. We provide numerical experiments to support the theoretical results. △ Less

Submitted 7 July, 2018; v1 submitted 13 February, 2018; originally announced February 2018.

Comments: 31 pages

arXiv:1709.00147 [pdf, other]

Convergence Analysis of Deterministic Kernel-Based Quadrature Rules in Misspecified Settings

Authors: Motonobu Kanagawa, Bharath K. Sriperumbudur, Kenji Fukumizu

Abstract: This paper presents a convergence analysis of kernel-based quadrature rules in misspecified settings, focusing on deterministic quadrature in Sobolev spaces. In particular, we deal with misspecified settings where a test integrand is less smooth than a Sobolev RKHS based on which a quadrature rule is constructed. We provide convergence guarantees based on two different assumptions on a quadrature… ▽ More This paper presents a convergence analysis of kernel-based quadrature rules in misspecified settings, focusing on deterministic quadrature in Sobolev spaces. In particular, we deal with misspecified settings where a test integrand is less smooth than a Sobolev RKHS based on which a quadrature rule is constructed. We provide convergence guarantees based on two different assumptions on a quadrature rule: one on quadrature weights, and the other on design points. More precisely, we show that convergence rates can be derived (i) if the sum of absolute weights remains constant (or does not increase quickly), or (ii) if the minimum distance between design points does not decrease very quickly. As a consequence of the latter result, we derive a rate of convergence for Bayesian quadrature in misspecified settings. We reveal a condition on design points to make Bayesian quadrature robust to misspecification, and show that, under this condition, it may adaptively achieve the optimal rate of convergence in the Sobolev space of a lesser order (i.e., of the unknown smoothness of a test integrand), under a slightly stronger regularity condition on the integrand. △ Less

Submitted 30 October, 2018; v1 submitted 1 September, 2017; originally announced September 2017.

Comments: 36 pages

MSC Class: 65D30 (Primary); 65D32; 65D05; 46E35; 46E22 (Secondary)

arXiv:1706.03472 [pdf, other]

Kernel method for persistence diagrams via kernel embedding and weight factor

Authors: Genki Kusano, Kenji Fukumizu, Yasuaki Hiraoka

Abstract: Topological data analysis is an emerging mathematical concept for characterizing shapes in multi-scale data. In this field, persistence diagrams are widely used as a descriptor of the input data, and can distinguish robust and noisy topological properties. Nowadays, it is highly desired to develop a statistical framework on persistence diagrams to deal with practical data. This paper proposes a ke… ▽ More Topological data analysis is an emerging mathematical concept for characterizing shapes in multi-scale data. In this field, persistence diagrams are widely used as a descriptor of the input data, and can distinguish robust and noisy topological properties. Nowadays, it is highly desired to develop a statistical framework on persistence diagrams to deal with practical data. This paper proposes a kernel method on persistence diagrams. A theoretical contribution of our method is that the proposed kernel allows one to control the effect of persistence, and, if necessary, noisy topological properties can be discounted in data analysis. Furthermore, the method provides a fast approximation technique. The method is applied into several problems including practical data in physics, and the results show the advantage compared to the existing kernel method on persistence diagrams. △ Less

Submitted 12 June, 2017; originally announced June 2017.

Comments: 12 figures, 30 pages

arXiv:1705.07673 [pdf, other]

A Linear-Time Kernel Goodness-of-Fit Test

Authors: Wittawat Jitkrittum, Wenkai Xu, Zoltan Szabo, Kenji Fukumizu, Arthur Gretton

Abstract: We propose a novel adaptive test of goodness-of-fit, with computational cost linear in the number of samples. We learn the test features that best indicate the differences between observed samples and a reference model, by minimizing the false negative rate. These features are constructed via Stein's method, meaning that it is not necessary to compute the normalising constant of the model. We anal… ▽ More We propose a novel adaptive test of goodness-of-fit, with computational cost linear in the number of samples. We learn the test features that best indicate the differences between observed samples and a reference model, by minimizing the false negative rate. These features are constructed via Stein's method, meaning that it is not necessary to compute the normalising constant of the model. We analyse the asymptotic Bahadur efficiency of the new test, and prove that under a mean-shift alternative, our test always has greater relative efficiency than a previous linear-time kernel test, regardless of the choice of parameters for that test. In experiments, the performance of our method exceeds that of the earlier linear-time test, and matches or exceeds the power of a quadratic-time kernel test. In high dimensions and where model structure may be exploited, our goodness of fit test performs far better than a quadratic-time two-sample test based on the Maximum Mean Discrepancy, with samples drawn from the model. △ Less

Submitted 24 October, 2017; v1 submitted 22 May, 2017; originally announced May 2017.

Comments: Accepted to NIPS 2017

MSC Class: 46E22; 62G10 ACM Class: G.3; I.2.6

arXiv:1705.04194 [pdf, ps, other]

Influence Function and Robust Variant of Kernel Canonical Correlation Analysis

Authors: Md. Ashad Alam, Kenji Fukumizu, Yu-** Wang

Abstract: Many unsupervised kernel methods rely on the estimation of the kernel covariance operator (kernel CO) or kernel cross-covariance operator (kernel CCO). Both kernel CO and kernel CCO are sensitive to contaminated data, even when bounded positive definite kernels are used. To the best of our knowledge, there are few well-founded robust kernel methods for statistical unsupervised learning. In additio… ▽ More Many unsupervised kernel methods rely on the estimation of the kernel covariance operator (kernel CO) or kernel cross-covariance operator (kernel CCO). Both kernel CO and kernel CCO are sensitive to contaminated data, even when bounded positive definite kernels are used. To the best of our knowledge, there are few well-founded robust kernel methods for statistical unsupervised learning. In addition, while the influence function (IF) of an estimator can characterize its robustness, asymptotic properties and standard error, the IF of a standard kernel canonical correlation analysis (standard kernel CCA) has not been derived yet. To fill this gap, we first propose a robust kernel covariance operator (robust kernel CO) and a robust kernel cross-covariance operator (robust kernel CCO) based on a generalized loss function instead of the quadratic loss function. Second, we derive the IF for robust kernel CCO and standard kernel CCA. Using the IF of the standard kernel CCA, we can detect influential observations from two sets of data. Finally, we propose a method based on the robust kernel CO and the robust kernel CCO, called {\bf robust kernel CCA}, which is less sensitive to noise than the standard kernel CCA. The introduced principles can also be applied to many other kernel methods involving kernel CO or kernel CCO. Our experiments on synthesized data and imaging genetics analysis demonstrate that the proposed IF of standard kernel CCA can identify outliers. It is also seen that the proposed robust kernel CCA method performs better for ideal and contaminated data than the standard kernel CCA. △ Less

Submitted 9 May, 2017; originally announced May 2017.

Comments: arXiv admin note: text overlap with arXiv:1602.05563

arXiv:1703.03216 [pdf, other]

Trimmed Density Ratio Estimation

Authors: Song Liu, Akiko Takeda, Taiji Suzuki, Kenji Fukumizu

Abstract: Density ratio estimation is a vital tool in both machine learning and statistical community. However, due to the unbounded nature of density ratio, the estimation procedure can be vulnerable to corrupted data points, which often pushes the estimated ratio toward infinity. In this paper, we present a robust estimator which automatically identifies and trims outliers. The proposed estimator has a co… ▽ More Density ratio estimation is a vital tool in both machine learning and statistical community. However, due to the unbounded nature of density ratio, the estimation procedure can be vulnerable to corrupted data points, which often pushes the estimated ratio toward infinity. In this paper, we present a robust estimator which automatically identifies and trims outliers. The proposed estimator has a convex formulation, and the global optimum can be obtained via subgradient descent. We analyze the parameter estimation error of this estimator under high-dimensional settings. Experiments are conducted to verify the effectiveness of the estimator. △ Less

Submitted 6 November, 2017; v1 submitted 9 March, 2017; originally announced March 2017.

Comments: Made minor revisions. Restructured the introductory sections

arXiv:1701.01582 [pdf, other]

Learning Sparse Structural Changes in High-dimensional Markov Networks: A Review on Methodologies and Theories

Authors: Song Liu, Kenji Fukumizu, Taiji Suzuki

Abstract: Recent years have seen an increasing popularity of learning the sparse \emph{changes} in Markov Networks. Changes in the structure of Markov Networks reflect alternations of interactions between random variables under different regimes and provide insights into the underlying system. While each individual network structure can be complicated and difficult to learn, the overall change from one netw… ▽ More Recent years have seen an increasing popularity of learning the sparse \emph{changes} in Markov Networks. Changes in the structure of Markov Networks reflect alternations of interactions between random variables under different regimes and provide insights into the underlying system. While each individual network structure can be complicated and difficult to learn, the overall change from one network to another can be simple. This intuition gave birth to an approach that \emph{directly} learns the sparse changes without modelling and learning the individual (possibly dense) networks. In this paper, we review such a direct learning method with some latest developments along this line of research. △ Less

Submitted 9 January, 2017; v1 submitted 6 January, 2017; originally announced January 2017.

Comments: Fixed a few typos in Section 4.4: θshould be δ

arXiv:1610.03725 [pdf, ps, other]

Post Selection Inference with Kernels

Authors: Makoto Yamada, Yuta Umezu, Kenji Fukumizu, Ichiro Takeuchi

Abstract: We propose a novel kernel based post selection inference (PSI) algorithm, which can not only handle non-linearity in data but also structured output such as multi-dimensional and multi-label outputs. Specifically, we develop a PSI algorithm for independence measures, and propose the Hilbert-Schmidt Independence Criterion (HSIC) based PSI algorithm (hsicInf). The novelty of the proposed algorithm i… ▽ More We propose a novel kernel based post selection inference (PSI) algorithm, which can not only handle non-linearity in data but also structured output such as multi-dimensional and multi-label outputs. Specifically, we develop a PSI algorithm for independence measures, and propose the Hilbert-Schmidt Independence Criterion (HSIC) based PSI algorithm (hsicInf). The novelty of the proposed algorithm is that it can handle non-linearity and/or structured data through kernels. Namely, the proposed algorithm can be used for wider range of applications including nonlinear multi-class classification and multi-variate regressions, while existing PSI algorithms cannot handle them. Through synthetic experiments, we show that the proposed approach can find a set of statistically significant features for both regression and classification problems. Moreover, we apply the hsicInf algorithm to a real-world data, and show that hsicInf can successfully identify important features. △ Less

Submitted 13 October, 2016; v1 submitted 12 October, 2016; originally announced October 2016.

Showing 1–50 of 88 results for author: Fukumizu, K