Search | arXiv e-print repository

Understanding Negative Samples in Instance Discriminative Self-supervised Representation Learning

Abstract: Instance discriminative self-supervised representation learning has attracted attention thanks to its unsupervised nature and informative feature representation for downstream tasks. In practice, it commonly uses a larger number of negative samples than the number of supervised classes. However, there is an inconsistency in the existing analysis; theoretically, a large number of negative samples degrade classification performance on a downstream supervised task, while empirically, they improve the performance. We provide a novel framework to analyze this empirical result regarding negative samples using the coupon collector's problem. Our bound can implicitly incorporate the supervised loss of the downstream task in the self-supervised loss by increasing the number of negative samples. We confirm that our proposed analysis holds on real-world benchmark datasets.

Submitted 14 January, 2022; v1 submitted 13 February, 2021; originally announced February 2021.

Comments: NeurIPS 2021. 26 pages, 6 figures, and 6 tables
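
Note on the coupon collector's problem mentioned in the abstract above: the sketch below is not the paper's bound. It only computes the classic textbook expectation, assuming classes are drawn uniformly at random, for how many draws are needed before every one of `num_classes` classes has appeared at least once. The function name `expected_draws` is hypothetical.

```python
from fractions import Fraction


def expected_draws(num_classes: int) -> float:
    """Expected number of uniform draws needed to see every class at least once.

    Classic coupon collector result: C * (1 + 1/2 + ... + 1/C).
    """
    if num_classes < 1:
        raise ValueError("num_classes must be positive")
    # Exact harmonic number H_C, converted to float only at the end.
    harmonic = sum(Fraction(1, i) for i in range(1, num_classes + 1))
    return float(num_classes * harmonic)


if __name__ == "__main__":
    for c in (10, 100, 1000):
        print(c, round(expected_draws(c), 1))
```

Under the uniform assumption, the expectation grows like C log C, which gives some intuition for why the number of negative samples may need to exceed the number of classes by a wide margin before all classes are likely to be represented.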

arXiv:2102.00678 [pdf, other]

Binary Classification from Multiple Unlabeled Datasets via Surrogate Set Classification

Authors: Nan Lu, Shida Lei, Gang Niu, Issei Sato, Masashi Sugiyama

Abstract: To cope with high annotation costs, training a classifier only from weakly supervised data has attracted a great deal of attention these days. Among various approaches, strengthening supervision from completely unsupervised classification is a promising direction, which typically employs class priors as the only supervision and trains a binary classifier from unlabeled (U) datasets. While existing risk-consistent methods are theoretically grounded with high flexibility, they can learn only from two U sets. In this paper, we propose a new approach for binary classification from $m$ U sets for $m\ge2$. Our key idea is to consider an auxiliary classification task called surrogate set classification (SSC), which is aimed at predicting from which U set each observed data point is drawn. SSC can be solved by a standard (multi-class) classification method, and we use the SSC solution to obtain the final binary classifier through a certain linear-fractional transformation. We build our method in a flexible and efficient end-to-end deep learning framework and prove it to be classifier-consistent. Through experiments, we demonstrate the superiority of our proposed method over state-of-the-art methods.

Submitted 11 June, 2021; v1 submitted 1 February, 2021; originally announced February 2021.

Comments: ICML2021 camera-ready version

arXiv:2012.09619 [pdf, ps, other]

A Characteristic Polynomial for The Transition Probability Matrix of A Correlated Random Walk on A Graph

Authors: Takashi Komatsu, Norio Konno, Iwao Sato

Abstract: We define a correlated random walk (CRW) induced from the time evolution matrix (the Grover matrix) of the Grover walk on a graph $G$, and present a formula for the characteristic polynomial of the transition probability matrix of this CRW by using a determinant expression for the generalized weighted zeta function of $G$. As applications, we give the spectrum of the transition probability matrices for the CRWs induced from the Grover matrices of regular graphs and semiregular bipartite graphs. Furthermore, we consider another type of the CRW on a graph.

Submitted 18 December, 2020; v1 submitted 17 December, 2020; originally announced December 2020.

Comments: 16 pages. arXiv admin note: text overlap with arXiv:2011.14162

MSC Class: 05C50; 15A15

arXiv:2011.14162 [pdf, ps, other]

A note on the Grover walk and the generalized Ihara zeta function of the one-dimensional integer lattice

Authors: Takashi Komatsu, Norio Konno, Iwao Sato

Abstract: Chinta, Jorgenson and Karlsson introduced a generalized version of the determinant formula for the Ihara zeta function associated to finite or infinite regular graphs. On the other hand, Konno and Sato obtained a formula for the characteristic polynomial of the Grover matrix by using the determinant expression for the second weighted zeta function of a finite graph. In this paper, we focus on a relationship between the Grover walk and the generalized Ihara zeta function. That is to say, we treat the generalized Ihara zeta function of the one-dimensional integer lattice as a limit of the Ihara zeta function of the cycle graph.

Submitted 15 December, 2021; v1 submitted 28 November, 2020; originally announced November 2020.

Comments: 8 pages, Yokohama Mathematical Journal (in press)

arXiv:2011.11152 [pdf, other]

On the Overlooked Pitfalls of Weight Decay and How to Mitigate Them: A Gradient-Norm Perspective

Authors: Zeke Xie, Zhiqiang Xu, Jingzhao Zhang, Issei Sato, Masashi Sugiyama

Abstract: Weight decay is a simple yet powerful regularization technique that has been very widely used in training of deep neural networks (DNNs). While weight decay has attracted much attention, previous studies fail to discover some overlooked pitfalls on large gradient norms resulting from weight decay. In this paper, we discover that weight decay can unfortunately lead to large gradient norms at the final phase (or the terminated solution) of training, which often indicates bad convergence and poor generalization. To mitigate the gradient-norm-centered pitfalls, we present the first practical scheduler for weight decay, called the Scheduled Weight Decay (SWD) method, that can dynamically adjust the weight decay strength according to the gradient norm and significantly penalize large gradient norms during training. Our experiments also support that SWD indeed mitigates large gradient norms and often significantly outperforms the conventional constant weight decay strategy for Adaptive Moment Estimation (Adam).

Submitted 19 October, 2023; v1 submitted 22 November, 2020; originally announced November 2020.

Comments: NeurIPS 2023, 21 pages, 20 figures. Keywords: Weight Decay, Regularization, Optimization, Deep Learning

arXiv:2011.06220 [pdf, other]

Artificial Neural Variability for Deep Learning: On Overfitting, Noise Memorization, and Catastrophic Forgetting

Authors: Zeke Xie, Fengxiang He, Shaopeng Fu, Issei Sato, Dacheng Tao, Masashi Sugiyama

Abstract: Deep learning is often criticized for two serious issues which rarely exist in natural nervous systems: overfitting and catastrophic forgetting. It can even memorize randomly labelled data, which has little knowledge behind the instance-label pairs. When a deep network continually learns over time by accommodating new tasks, it usually quickly overwrites the knowledge learned from previous tasks. It is well known in neuroscience that human brain reactions exhibit substantial variability even in response to the same stimulus, a phenomenon referred to as neural variability. This mechanism balances accuracy and plasticity/flexibility in the motor learning of natural nervous systems. Thus it motivates us to design a similar mechanism named artificial neural variability (ANV), which helps artificial neural networks learn some advantages from "natural" neural networks. We rigorously prove that ANV acts as an implicit regularizer of the mutual information between the training data and the learned model. This result theoretically guarantees ANV a strictly improved generalizability, robustness to label noise, and robustness to catastrophic forgetting. We then devise a neural variable risk minimization (NVRM) framework and neural variable optimizers to achieve ANV for conventional network architectures in practice. The empirical studies demonstrate that NVRM can effectively relieve overfitting, label noise memorization, and catastrophic forgetting at negligible costs. Code: https://github.com/zeke-xie/artificial-neural-variability-for-deep-learning.

Submitted 10 May, 2021; v1 submitted 12 November, 2020; originally announced November 2020.

Comments: Accepted by Neural Computation, MIT Press; 20 pages; 13 figures; Key Words: Neural Variability, Neuroscience, Deep Learning, Label Noise, Catastrophic Forgetting

arXiv:2008.00645 [pdf, other]

Active Classification with Uncertainty Comparison Queries

Authors: Zhenghang Cui, Issei Sato

Abstract: Noisy pairwise comparison feedback has been incorporated to improve the overall query complexity of interactively learning binary classifiers. The positivity comparison oracle is used to provide feedback on which is more likely to be positive given a pair of data points. Because it is impossible to infer accurate labels using this oracle alone without knowing the classification threshold, existing methods still rely on the traditional explicit labeling oracle, which directly answers the label given a data point. Existing methods conduct sorting on all data points and use the explicit labeling oracle to find the classification threshold. The current methods, however, have two drawbacks: (1) they need unnecessary sorting for label inference; (2) quick sort is naively adapted to noisy feedback and negatively affects practical performance. In order to avoid this inefficiency and acquire information of the classification threshold, we propose a new pairwise comparison oracle concerning uncertainties. This oracle receives two data points as input and answers which one has higher uncertainty. We then propose an efficient adaptive labeling algorithm using the proposed oracle and the positivity comparison oracle. In addition, we also address the situation where the labeling budget is insufficient compared to the dataset size, which can be dealt with by plugging the proposed algorithm into an active learning algorithm. Furthermore, we confirm the feasibility of the proposed oracle and the performance of the proposed algorithm theoretically and empirically.

Submitted 28 October, 2020; v1 submitted 3 August, 2020; originally announced August 2020.

Comments: Code and Dataset: https://github.com/zchenry/uncertainty-comparison

arXiv:2007.01659 [pdf, other]

Diagnostic Uncertainty Calibration: Towards Reliable Machine Predictions in Medical Domain

Authors: Takahiro Mimori, Keiko Sasada, Hirotaka Matsui, Issei Sato

Abstract: We propose an evaluation framework for class probability estimates (CPEs) in the presence of label uncertainty, which is commonly observed as diagnosis disagreement between experts in the medical domain. We also formalize evaluation metrics for higher-order statistics, including inter-rater disagreement, to assess predictions on label uncertainty. Moreover, we propose a novel post-hoc method called $\alpha$-calibration that equips neural network classifiers with calibrated distributions over CPEs. Using synthetic experiments and a large-scale medical imaging application, we show that our approach significantly enhances the reliability of uncertainty estimates: disagreement probabilities and posterior CPEs.

Submitted 22 March, 2021; v1 submitted 3 July, 2020; originally announced July 2020.

Comments: 31 pages, 6 figures

arXiv:2006.15815 [pdf, other]

Adaptive Inertia: Disentangling the Effects of Adaptive Learning Rate and Momentum

Authors: Zeke Xie, Xinrui Wang, Huishuai Zhang, Issei Sato, Masashi Sugiyama

Abstract: Adaptive Moment Estimation (Adam), which combines Adaptive Learning Rate and Momentum, is arguably the most popular stochastic optimizer for accelerating the training of deep neural networks. However, it is empirically known that Adam often generalizes worse than Stochastic Gradient Descent (SGD). The purpose of this paper is to unveil the mystery of this behavior in the diffusion theoretical framework. Specifically, we disentangle the effects of Adaptive Learning Rate and Momentum of the Adam dynamics on saddle-point escaping and flat minima selection. We prove that Adaptive Learning Rate can escape saddle points efficiently, but cannot select flat minima as SGD does. In contrast, Momentum provides a drift effect to help the training process pass through saddle points, and almost does not affect flat minima selection. This partly explains why SGD (with Momentum) generalizes better, while Adam generalizes worse but converges faster. Furthermore, motivated by the analysis, we design a novel adaptive optimization framework named Adaptive Inertia, which uses parameter-wise adaptive inertia to accelerate the training and provably favors flat minima as well as SGD. Our extensive experiments demonstrate that the proposed adaptive inertia method can generalize significantly better than SGD and conventional adaptive gradient methods.

Submitted 14 June, 2022; v1 submitted 29 June, 2020; originally announced June 2020.

Comments: ICML2022, Long Oral Presentation, 30 pages, 14 figures, Key Words: Deep Learning Theory, Optimization, Adam, Adaptive Inertia, Flat Minima

arXiv:2006.08306 [pdf, other]

LFD-ProtoNet: Prototypical Network Based on Local Fisher Discriminant Analysis for Few-shot Learning

Authors: Kei Mukaiyama, Issei Sato, Masashi Sugiyama

Abstract: The prototypical network (ProtoNet) is a few-shot learning framework that performs metric learning and classification using the distance to prototype representations of each class. It has attracted a great deal of attention recently since it is simple to implement, highly extensible, and performs well in experiments. However, it only takes into account the mean of the support vectors as prototypes and thus it performs poorly when the support set has high variance. In this paper, we propose to combine ProtoNet with local Fisher discriminant analysis to reduce the local within-class covariance and increase the local between-class covariance of the support set. We show the usefulness of the proposed method by theoretically providing an expected risk bound and empirically demonstrating its superior classification accuracy on miniImageNet and tieredImageNet.

Submitted 25 September, 2020; v1 submitted 15 June, 2020; originally announced June 2020.

Comments: 20 pages

MSC Class: 68T01(Primary); 68T05(Secondary)
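
Note on the baseline described in the abstract above: the following is a minimal sketch of the plain prototype step only (class prototype = mean of support vectors, prediction = nearest prototype), assuming the embeddings are already computed and squared Euclidean distance is used. It does not include the local Fisher discriminant analysis that the paper proposes, and the names `prototypes` and `classify` are hypothetical.

```python
from collections import defaultdict


def prototypes(support):
    """Average the embeddings of each class.

    `support` is a list of (vector, label) pairs; vectors are equal-length lists.
    """
    sums = {}
    counts = defaultdict(int)
    for vec, label in support:
        if label not in sums:
            sums[label] = [0.0] * len(vec)
        sums[label] = [s + v for s, v in zip(sums[label], vec)]
        counts[label] += 1
    return {label: [s / counts[label] for s in total] for label, total in sums.items()}


def classify(query, protos):
    """Return the label whose prototype is closest in squared Euclidean distance."""
    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(protos, key=lambda label: sq_dist(query, protos[label]))


if __name__ == "__main__":
    support = [([0, 0], "a"), ([0, 2], "a"), ([4, 4], "b"), ([6, 4], "b")]
    protos = prototypes(support)
    print(protos, classify([1, 1], protos))
```

Because each prototype is a plain mean, a few widely scattered support vectors can shift it substantially, which is the high-variance weakness the abstract points to.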

arXiv:2006.07571 [pdf, other]

$γ$-ABC: Outlier-Robust Approximate Bayesian Computation Based on a Robust Divergence Estimator

Authors: Masahiro Fujisawa, Takeshi Teshima, Issei Sato, Masashi Sugiyama

Abstract: Approximate Bayesian computation (ABC) is a likelihood-free inference method that has been employed in various applications. However, ABC can be sensitive to outliers if a data discrepancy measure is chosen inappropriately. In this paper, we propose to use a nearest-neighbor-based $γ$-divergence estimator as a data discrepancy measure. We show that our estimator possesses a suitable theoretical robustness property called the redescending property. In addition, our estimator enjoys various desirable properties such as high flexibility, asymptotic unbiasedness, almost sure convergence, and linear-time computational complexity. Through experiments, we demonstrate that our method achieves significantly higher robustness than existing discrepancy measures.

Submitted 5 March, 2021; v1 submitted 13 June, 2020; originally announced June 2020.

Comments: The 24th International Conference on Artificial Intelligence and Statistics (AISTATS 2021); 48 pages, 22 figures
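
Note on ABC as described in the abstract above: the sketch below shows generic rejection ABC, not the paper's $γ$-divergence estimator. Everything specific in it is an illustrative assumption: a uniform prior on [-5, 5], a Gaussian simulator with unit variance, and the absolute difference of sample means as the discrepancy. The function name `abc_rejection` is hypothetical.

```python
import random
import statistics


def abc_rejection(observed, n_proposals, tolerance, seed=0):
    """Keep prior draws whose simulated data lie within `tolerance` of the observed data.

    Assumed setup (illustrative only): theta ~ Uniform(-5, 5),
    data ~ Normal(theta, 1), discrepancy = |mean(simulated) - mean(observed)|.
    """
    rng = random.Random(seed)
    obs_mean = statistics.fmean(observed)
    accepted = []
    for _ in range(n_proposals):
        theta = rng.uniform(-5.0, 5.0)  # draw a candidate from the prior
        simulated = [rng.gauss(theta, 1.0) for _ in range(len(observed))]
        if abs(statistics.fmean(simulated) - obs_mean) <= tolerance:
            accepted.append(theta)  # no likelihood evaluation is needed
    return accepted


if __name__ == "__main__":
    data = [2.0] * 20
    samples = abc_rejection(data, n_proposals=2000, tolerance=0.3)
    print(len(samples), statistics.fmean(samples))
```

A mean-based discrepancy like this one shifts when a single extreme observation is added, which illustrates the outlier sensitivity that motivates a more robust discrepancy measure.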

arXiv:2006.06207 [pdf, other]

Pairwise Supervision Can Provably Elicit a Decision Boundary

Authors: Han Bao, Takuya Shimada, Liyuan Xu, Issei Sato, Masashi Sugiyama

Abstract: Similarity learning is a general problem to elicit useful representations by predicting the relationship between a pair of patterns. This problem is related to various important preprocessing tasks such as metric learning, kernel learning, and contrastive learning. A classifier built upon the representations is expected to perform well in downstream classification; however, little theory has been given in the literature so far and thereby the relationship between similarity and classification has remained elusive. Therefore, we tackle a fundamental question: can similarity information provably lead a model to perform well in downstream classification? In this paper, we reveal that a product-type formulation of similarity learning is strongly related to an objective of binary classification. We further show that these two different problems are explicitly connected by an excess risk bound. Consequently, our results elucidate that similarity learning is capable of solving binary classification by directly eliciting a decision boundary.

Submitted 28 February, 2022; v1 submitted 11 June, 2020; originally announced June 2020.

Comments: In Proceedings of AISTATS2021

arXiv:2005.09341 [pdf, ps, other]

The limit theorem with respect to the matrices on non-backtracking paths of a graph

Authors: Takehiro Hasegawa, Takashi Komatsu, Norio Konno, Hayato Saigo, Seiken Saito, Iwao Sato, Shingo Sugiyama

Abstract: We give a limit theorem with respect to the matrices related to non-backtracking paths of a regular graph. The limit obtained closely resembles the $k$th moments of the arcsine law. Furthermore, we obtain the asymptotics of the averages of the $p^m$th Fourier coefficients of the cusp forms related to the Ramanujan graphs defined by A. Lubotzky, R. Phillips and P. Sarnak.

Submitted 16 October, 2022; v1 submitted 19 May, 2020; originally announced May 2020.

Comments: The draft is improved: Typos are fixed, Corollaries 4.3 and 4.5 are improved, and Proposition 4.6 and Remark 4.7 are added

MSC Class: 05C38 (Primary); 05C50; 11F30 (Secondary)

arXiv:2005.04107 [pdf, other]

doi 10.1145/3386569.3392444

Sequential Gallery for Interactive Visual Design Optimization

Authors: Yuki Koyama, Issei Sato, Masataka Goto

Abstract: Visual design tasks often involve tuning many design parameters. For example, color grading of a photograph involves many parameters, some of which non-expert users might be unfamiliar with. We propose a novel user-in-the-loop optimization method that allows users to efficiently find an appropriate parameter set by exploring such a high-dimensional design space through much easier two-dimensional search subtasks. This method, called sequential plane search, is based on Bayesian optimization to keep necessary queries to users as few as possible. To help users respond to plane-search queries, we also propose using a gallery-based interface that provides options in the two-dimensional subspace arranged in an adaptive grid view. We call this interactive framework Sequential Gallery since users sequentially select the best option from the options provided by the interface. Our experiment with synthetic functions shows that our sequential plane search can find satisfactory solutions in fewer iterations than baselines. We also conducted a preliminary user study, results of which suggest that novices can effectively complete search tasks with Sequential Gallery in a photo-enhancement scenario.

Submitted 8 May, 2020; originally announced May 2020.

Comments: To be published at ACM Trans. Graph. (Proc. SIGGRAPH 2020); Project page available at https://koyama.xyz/project/sequential_gallery/

Journal ref: ACM Trans. Graph. 39, 4 (July 2020), pp.88:1-88:12

arXiv:2003.04691 [pdf, other]

Time-varying Gaussian Process Bandit Optimization with Non-constant Evaluation Time

Authors: Hideaki Imamura, Nontawat Charoenphakdee, Futoshi Futami, Issei Sato, Junya Honda, Masashi Sugiyama

Abstract: The Gaussian process bandit is a problem in which we want to find a maximizer of a black-box function with the minimum number of function evaluations. If the black-box function varies with time, then time-varying Bayesian optimization is a promising framework. However, a drawback with current methods is in the assumption that the evaluation time for every observation is constant, which can be unrealistic for many practical applications, e.g., recommender systems and environmental monitoring. As a result, the performance of current methods can be degraded when this assumption is violated. To cope with this problem, we propose a novel time-varying Bayesian optimization algorithm that can effectively handle the non-constant evaluation time. Furthermore, we theoretically establish a regret bound of our algorithm. Our bound elucidates that a pattern of the evaluation time sequence can hugely affect the difficulty of the problem. We also provide experimental results to validate the practical effectiveness of the proposed method.

Submitted 10 March, 2020; v1 submitted 10 March, 2020; originally announced March 2020.

arXiv:2002.03497 [pdf, other]

Few-shot Domain Adaptation by Causal Mechanism Transfer

Authors: Takeshi Teshima, Issei Sato, Masashi Sugiyama

Abstract: We study few-shot supervised domain adaptation (DA) for regression problems, where only a few labeled target domain data and many labeled source domain data are available. Many of the current DA methods base their transfer assumptions on either parametrized distribution shift or apparent distribution similarities, e.g., identical conditionals or small distributional discrepancies. However, these assumptions may preclude the possibility of adaptation from intricately shifted and apparently very different distributions. To overcome this problem, we propose mechanism transfer, a meta-distributional scenario in which a data generating mechanism is invariant among domains. This transfer assumption can accommodate nonparametric shifts resulting in apparently different distributions while providing a solid statistical basis for DA. We take the structural equations in causal modeling as an example and propose a novel DA method, which is shown to be useful both theoretically and experimentally. Our method can be seen as the first attempt to fully leverage the structural causal models for DA.

Submitted 18 August, 2020; v1 submitted 9 February, 2020; originally announced February 2020.

Comments: 33 pages, 3 figures. Camera-ready version for Thirty-seventh International Conference on Machine Learning (ICML 2020)

arXiv:2002.03495 [pdf, other]

A Diffusion Theory For Deep Learning Dynamics: Stochastic Gradient Descent Exponentially Favors Flat Minima

Authors: Zeke Xie, Issei Sato, Masashi Sugiyama

Abstract: Stochastic Gradient Descent (SGD) and its variants are mainstream methods for training deep networks in practice. SGD is known to find a flat minimum that often generalizes well. However, it is mathematically unclear how deep learning can select a flat minimum among so many minima. To answer the question quantitatively, we develop a density diffusion theory (DDT) to reveal how minima selection quantitatively depends on the minima sharpness and the hyperparameters. To the best of our knowledge, we are the first to theoretically and empirically prove that, benefiting from the Hessian-dependent covariance of stochastic gradient noise, SGD favors flat minima exponentially more than sharp minima, while Gradient Descent (GD) with injected white noise favors flat minima only polynomially more than sharp minima. We also reveal that either a small learning rate or large-batch training requires exponentially many iterations to escape from minima in terms of the ratio of the batch size and learning rate. Thus, large-batch training cannot search flat minima efficiently in a realistic computational time.

Submitted 15 January, 2021; v1 submitted 9 February, 2020; originally announced February 2020.

Comments: ICLR 2021; 28 pages; 19 figures

arXiv:2001.07847 [pdf, other]

A versatile anomaly detection method for medical images with a flow-based generative model in semi-supervision setting

Authors: H. Shibata, S. Hanaoka, Y. Nomura, T. Nakao, I. Sato, D. Sato, N. Hayashi, O. Abe

Abstract: Oversight in medical images is a crucial problem, and timely reporting of medical images is desired. Therefore, an all-purpose anomaly detection method that can detect virtually all types of lesions/diseases in a given image is strongly desired. However, few commercially available and versatile anomaly detection methods for medical images have been provided so far. Recently, anomaly detection methods built upon deep learning methods have been rapidly growing in popularity, and these methods seem to provide reasonable solutions to the problem. However, the workload to label the images necessary for training in deep learning remains heavy. In this study, we present an anomaly detection method based on two trained flow-based generative models. With this method, the posterior probability can be computed as a normality metric for any given image. The training of the generative models requires two sets of images: a set containing only normal images and another set containing both normal and abnormal images without any labels. In the latter set, each sample does not have to be labeled as normal or abnormal; therefore, any mixture of images (e.g., all cases in a hospital) can be used as the dataset without cumbersome manual labeling. The method was validated with two types of medical images: chest X-ray radiographs (CXRs) and brain computed tomographies (BCTs). The areas under the receiver operating characteristic curves for logarithm posterior probabilities of CXRs (0.868 for pneumonia-like opacities) and BCTs (0.904 for infarction) were comparable to those in previous studies with other anomaly detection methods. This result showed the versatility of our method.

Submitted 20 October, 2020; v1 submitted 21 January, 2020; originally announced January 2020.

arXiv:1911.09011 [pdf, other]

Bayesian interpretation of SGD as Ito process

Authors: Soma Yokoi, Issei Sato

Abstract: The current interpretation of stochastic gradient descent (SGD) as a stochastic process lacks generality in that its numerical scheme restricts continuous-time dynamics as well as the loss function and the distribution of gradient noise. We introduce a simplified scheme with milder conditions that flexibly interprets SGD as a discrete-time approximation of an Ito process. The scheme also works as a common foundation of SGD and stochastic gradient Langevin dynamics (SGLD), providing insights into their asymptotic properties. We investigate the convergence of SGD with biased gradient in terms of the equilibrium mode and the overestimation problem of the second moment of SGLD.

Submitted 20 November, 2019; originally announced November 2019.

arXiv:1911.06181 [pdf, other]

Adversarial Transformations for Semi-Supervised Learning

Authors: Teppei Suzuki, Ikuro Sato

Abstract: We propose a Regularization framework based on Adversarial Transformations (RAT) for semi-supervised learning. RAT is designed to enhance robustness of the output distribution of class prediction for a given data against input perturbation. RAT is an extension of Virtual Adversarial Training (VAT) in such a way that RAT adversarially transforms data along the underlying data distribution by a rich set of data transformation functions that leave class label invariant, whereas VAT simply produces adversarial additive noises. In addition, we verified that a technique of gradually increasing the perturbation region further improves the robustness. In experiments, we show that RAT significantly improves classification performance on CIFAR-10 and SVHN compared to existing regularization methods under standard semi-supervised image classification settings.

Submitted 18 November, 2019; v1 submitted 13 November, 2019; originally announced November 2019.

Comments: Accepted by AAAI 2020

arXiv:1911.06060 [pdf, ps, other]

A zeta function related to the transition matrix of the discrete-time quantum walk on a graph

Authors: Norio Konno, Iwao Sato, Etsuo Segawa

Abstract: We present the structure theorem for the positive support of the cube of the Grover transition matrix of the discrete-time quantum walk (the Grover walk) on a general graph $G$ under same condition. Thus, we introduce a zeta function on the positive support of the cube of the Grover transition matrix of $G$, and present its Euler product and its determinant expression. As a corollary, we give the characteristic polynomial for the positive support of the cube of the Grover transition matrix of a regular graph, and so obtain its spectra. Finally, we present the poles and the radius of the convergence of this zeta function.

Submitted 14 November, 2019; originally announced November 2019.

Comments: arXiv admin note: text overlap with arXiv:1103.0079

MSC Class: 60F50; 05C50; 15A15; 05C60

arXiv:1910.12782 [pdf, ps, other]

Zeta functions with respect to general coined quantum walk of periodic graphs

Authors: Takashi Komatsu, Norio Konno, Iwao Sato

Abstract: We define a zeta function of a graph by using the time evolution matrix of a general coined quantum walk on it, and give a determinant expression for the zeta function of a finite graph. Furthermore, we present a determinant expression for the zeta function of an (infinite) periodic graph.

Submitted 28 October, 2019; originally announced October 2019.

Comments: 14 pages

MSC Class: 60F50; 05C50; 15A15; 05C60

arXiv:1908.09051 [pdf, ps, other]

A walk on max-plus algebra

Authors: Sennosuke Watanabe, Akiko Fukuda, Etsuo Segawa, Iwao Sato

Abstract: Max-plus algebra is a kind of idempotent semiring over $\mathbb{R}_{\max}:=\mathbb{R}\cup\{-\infty\}$ with two operations $\oplus := \max$ and $\otimes := +$. In this paper, we introduce a new model of a walk on one dimensional lattice on $\mathbb{Z}$, as an analogue of the quantum walk, over the max-plus algebra and we call it max-plus walk. In the conventional quantum walk, the summation of the $\ell^2$-norm of the states over all the positions is a conserved quantity. In contrast, the summation of eigenvalues of state decision matrices is a conserved quantity in the max-plus walk. Moreover, spectral analysis on the total time evolution operator is also given.

Submitted 29 August, 2019; v1 submitted 23 August, 2019; originally announced August 2019.

Comments: 17 pages, 1 figures
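
As a reading aid for the max-plus entry above, the following is a minimal sketch of the two semiring operations named in the abstract ($\oplus := \max$, $\otimes := +$) and the matrix-vector product they induce. The function names `oplus`, `otimes`, and `maxplus_matvec` are hypothetical and chosen for illustration only; this is not the authors' walk model, just the underlying arithmetic.

```python
import math

# -infinity is the additive identity ("zero") of the max-plus semiring.
NEG_INF = -math.inf


def oplus(a, b):
    """Max-plus addition: a (+) b := max(a, b)."""
    return max(a, b)


def otimes(a, b):
    """Max-plus multiplication: a (x) b := a + b."""
    return a + b


def maxplus_matvec(A, x):
    """Matrix-vector product over max-plus: y_i = max_j (A[i][j] + x[j])."""
    result = []
    for row in A:
        acc = NEG_INF  # start from the additive identity
        for a_ij, x_j in zip(row, x):
            acc = oplus(acc, otimes(a_ij, x_j))
        result.append(acc)
    return result


if __name__ == "__main__":
    A = [[0.0, NEG_INF],
         [1.0, 2.0]]
    x = [3.0, 4.0]
    print(maxplus_matvec(A, x))
```

Note that `oplus(a, a) == a` for every `a`, which is the idempotency the abstract refers to.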

arXiv:1907.10225 [pdf, ps, other]

Classification from Triplet Comparison Data

Authors: Zhenghang Cui, Nontawat Charoenphakdee, Issei Sato, Masashi Sugiyama

Abstract: Learning from triplet comparison data has been extensively studied in the context of metric learning, where we want to learn a distance metric between two instances, and ordinal embedding, where we want to learn an embedding in a Euclidean space of the given instances that preserves the comparison order as well as possible. Unlike fully-labeled data, triplet comparison data can be collected in a more accurate and human-friendly way. Although learning from triplet comparison data has been considered in many applications, an important fundamental question of whether we can learn a classifier only from triplet comparison data has remained unanswered. In this paper, we give a positive answer to this important question by proposing an unbiased estimator for the classification risk under the empirical risk minimization framework. Since the proposed method is based on the empirical risk minimization framework, it inherently has the advantage that any surrogate loss function and any model, including neural networks, can be easily applied. Furthermore, we theoretically establish an estimation error bound for the proposed empirical risk minimizer. Finally, we provide experimental results to show that our method empirically works well and outperforms various baseline methods.

Submitted 18 April, 2020; v1 submitted 23 July, 2019; originally announced July 2019.

Comments: Code: https://github.com/zchenry/triplet_classification

arXiv:1906.09840 [pdf, other]

Interactive Optimization of Generative Image Modeling using Sequential Subspace Search and Content-based Guidance

Authors: Toby Chong Long Hin, I-Chao Shen, Issei Sato, Takeo Igarashi

Abstract: Generative image modeling techniques such as GAN demonstrate highly convincing image generation results. However, user interaction is often necessary to obtain the desired results. Existing attempts add interactivity but require either tailored architectures or extra data. We present a human-in-the-optimization method that allows users to directly explore and search the latent vector space of generative image modeling. Our system provides multiple candidates by sampling the latent vector space, and the user selects the best blending weights within the subspace using multiple sliders. In addition, the user can express their intention through image editing tools. The system samples latent vectors based on inputs and presents new candidates to the user iteratively. An advantage of our formulation is that one can apply our method to an arbitrary pre-trained model without developing specialized architecture or data. We demonstrate our method with various generative image modeling applications, and show superior performance in a comparative user study with prior art iGAN.

Submitted 29 August, 2020; v1 submitted 24 June, 2019; originally announced June 2019.

Comments: 13 pages, Toby Chong Long Hin and I-Chao Shen contributed equally to the paper

arXiv:1906.01150 [pdf, other]

Breaking Inter-Layer Co-Adaptation by Classifier Anonymization

Authors: Ikuro Sato, Kohta Ishikawa, Guoqing Liu, Masayuki Tanaka

Abstract: This study addresses an issue of co-adaptation between a feature extractor and a classifier in a neural network. A naive joint optimization of a feature extractor and a classifier often brings situations in which an excessively complex feature distribution adapted to a very specific classifier degrades the test performance. We introduce a method called Feature-extractor Optimization through Classifier Anonymization (FOCA), which is designed to avoid an explicit co-adaptation between a feature extractor and a particular classifier by using many randomly-generated, weak classifiers during optimization. We put forth a mathematical proposition that states the FOCA features form a point-like distribution within the same class in a class-separable fashion under special conditions. Real-data experiments under more general conditions provide supporting evidence.

Submitted 3 June, 2019; originally announced June 2019.

Comments: 9 pages. Accepted to ICML 2019

arXiv:1905.11623 [pdf, other]

Solving NP-Hard Problems on Graphs with Extended AlphaGo Zero

Authors: Kenshin Abe, Zijian Xu, Issei Sato, Masashi Sugiyama

Abstract: There have been increasing challenges to solve combinatorial optimization problems by machine learning. Khalil et al. proposed an end-to-end reinforcement learning framework, S2V-DQN, which automatically learns graph embeddings to construct solutions to a wide range of problems. To improve the generalization ability of their Q-learning method, we propose a novel learning strategy based on AlphaGo Zero which is a Go engine that achieved a superhuman level without the domain knowledge of the game. Our framework is redesigned for combinatorial problems, where the final reward might take any real number instead of a binary response, win/lose. In experiments conducted for five kinds of NP-hard problems including MinimumVertexCover and MaxCut, our method is shown to generalize better to various graphs than S2V-DQN. Furthermore, our method can be combined with recently-developed graph neural network (GNN) models such as the Graph Isomorphism Network, resulting in even better performance. This experiment also gives an interesting insight into a suitable choice of GNN models for each task.

Submitted 7 March, 2020; v1 submitted 28 May, 2019; originally announced May 2019.

arXiv:1905.00593 [pdf, other]

Directing DNNs Attention for Facial Attribution Classification using Gradient-weighted Class Activation Mapping

Authors: Xi Yang, Bojian Wu, Issei Sato, Takeo Igarashi

Abstract: Deep neural networks (DNNs) have a high accuracy on image classification tasks. However, DNNs trained on a dataset with co-occurrence bias may rely on wrong features while making decisions for classification. This greatly affects the transferability of pre-trained DNNs. In this paper, we propose an interactive method to direct classifiers to pay attention to the regions that are manually specified by the users, in order to mitigate the influence of co-occurrence bias. We test on the CelebA dataset, where the pre-trained AlexNet is fine-tuned to focus on the specific facial attributes based on the results of Grad-CAM.

Submitted 2 May, 2019; originally announced May 2019.

Comments: CVPR-19 Workshop on Explainable AI

arXiv:1904.11717 [pdf, other]

Classification from Pairwise Similarities/Dissimilarities and Unlabeled Data via Empirical Risk Minimization

Authors: Takuya Shimada, Han Bao, Issei Sato, Masashi Sugiyama

Abstract: Pairwise similarities and dissimilarities between data points might be easier to obtain than fully labeled data in real-world classification problems, e.g., in privacy-aware situations. To handle such pairwise information, an empirical risk minimization approach has been proposed, giving an unbiased estimator of the classification risk that can be computed only from pairwise similarities and unlabeled data. However, this direction cannot handle pairwise dissimilarities so far. On the other hand, semi-supervised clustering is one of the methods which can use both similarities and dissimilarities. Nevertheless, they typically require strong geometrical assumptions on the data distribution such as the manifold assumption, which may deteriorate the performance. In this paper, we derive an unbiased risk estimator which can handle all of similarities/dissimilarities and unlabeled data. We theoretically establish estimation error bounds and experimentally demonstrate the practical usefulness of our empirical risk minimization method.

Submitted 26 April, 2019; originally announced April 2019.

arXiv:1903.12053 [pdf]

Imaging cytometry without image reconstruction (ghost cytometry)

Authors: Sadao Ota, Ryoichi Horisaki, Yoko Kawamura, Issei Sato, Hiroyuki Noji

Abstract: Imaging and analysis of many single cells hold great potential in our understanding of heterogeneous and complex life systems and in enabling biomedical applications. We here introduce a recently realized image-free "imaging" cytometry technology, which we call ghost cytometry. While a compressive ghost imaging technique utilizing object's motion relative to a projected static light pattern allows recovery of their images, a key of this ghost cytometry is to achieve ultrafast cell classification by directly applying machine learning methods to the compressive imaging signals in a temporal domain. We show the applicability of our method in the analysis of flowing objects based on the reconstructed images as well as in that based on the imaging waveform without image production.

Submitted 27 March, 2019; originally announced March 2019.

arXiv:1903.09538 [pdf]

Use of Ghost Cytometry to Differentiate Cells with Similar Gross Morphologic Characteristics

Authors: Hiroaki Adachi, Yoko Kawamura, Keiji Nakagawa, Ryoichi Horisaki, Issei Sato, Satoko Yamaguchi, Katsuhito Fujiu, Kayo Waki, Hiroyuki Noji, Sadao Ota

Abstract: Imaging flow cytometry shows significant potential for increasing our understanding of heterogeneous and complex life systems and is useful for biomedical applications. Ghost cytometry is a recently proposed approach for directly analyzing compressively measured signals, thereby relieving the computational bottleneck observed in high-throughput cytometry based on morphological information. While this image-free approach could distinguish different cell types using the same fluorescence staining method, further strict controls are sometimes required to clearly demonstrate that the classification is based on detailed morphologic analysis. In this study, we show that ghost cytometry can be used to classify cell populations of the same type but with different fluorescence distributions in space, supporting the strength of our image-free approach for morphologic cell analysis.

Submitted 22 March, 2019; originally announced March 2019.

arXiv:1903.06009 [pdf, other]

On Learning from Ghost Imaging without Imaging

Authors: Issei Sato

Abstract: Computational ghost imaging is an imaging technique in which an object is imaged from light collected using a single-pixel detector with no spatial resolution. Recently, ghost cytometry has been proposed for a high-speed cell-classification method that involves ghost imaging and machine learning in flow cytometry. Ghost cytometry skips the reconstruction of cell images from signals and directly uses signals for cell-classification because this reconstruction is what creates the bottleneck in the high-speed analysis. In this paper, we provide theoretical analysis for learning from ghost imaging without imaging.

Submitted 29 May, 2019; v1 submitted 14 March, 2019; originally announced March 2019.

arXiv:1903.02750 [pdf, other]

On Transformations in Stochastic Gradient MCMC

Authors: Soma Yokoi, Takuma Otsuka, Issei Sato

Abstract: Stochastic gradient Langevin dynamics (SGLD) is a computationally efficient sampler for Bayesian posterior inference given a large scale dataset. Although SGLD is designed for unbounded random variables, many practical models incorporate variables with boundaries such as non-negative ones or those in a finite interval. To bridge this gap, we consider mapping unbounded samples into the target interval. This paper reveals that several mapping approaches commonly used in the literature produce erroneous samples from theoretical and empirical perspectives. We show that the change of random variable using an invertible Lipschitz mapping function overcomes the pitfall as well as attains the weak convergence. Experiments demonstrate its efficacy for widely-used models with bounded latent variables including Bayesian non-negative matrix factorization and binary neural networks.

Submitted 20 June, 2019; v1 submitted 7 March, 2019; originally announced March 2019.
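
To illustrate the change-of-variable idea mentioned in the abstract above, here is a minimal sketch of Langevin sampling for a positive variable. It rests on several assumptions not taken from the listing: the softplus map $\theta = \log(1 + e^{\varphi})$ is picked as one example of an invertible map with bounded derivative, the gradient is exact rather than a minibatch estimate, and all function names are hypothetical. The key step is that the density of $\varphi$ gains a log-Jacobian term, so its gradient is $\nabla_\theta \log p(\theta)\,\sigma(\varphi) + (1 - \sigma(\varphi))$.

```python
import math
import random


def softplus(phi):
    """Numerically stable log(1 + exp(phi)); maps the real line to (0, inf)."""
    return max(phi, 0.0) + math.log1p(math.exp(-abs(phi)))


def sigmoid(phi):
    """Derivative of softplus, computed without overflow."""
    if phi >= 0:
        return 1.0 / (1.0 + math.exp(-phi))
    e = math.exp(phi)
    return e / (1.0 + e)


def grad_log_density_phi(phi, grad_log_p_theta):
    """Gradient of log p(softplus(phi)) + log sigmoid(phi) with respect to phi.

    The second term is the log-Jacobian of the change of variable;
    its derivative is 1 - sigmoid(phi).
    """
    theta = softplus(phi)
    s = sigmoid(phi)
    return grad_log_p_theta(theta) * s + (1.0 - s)


def langevin_positive(grad_log_p_theta, n_steps=1000, step=0.01, phi0=0.0, seed=0):
    """Run unadjusted Langevin dynamics on phi and return theta = softplus(phi)."""
    rng = random.Random(seed)
    phi = phi0
    samples = []
    for _ in range(n_steps):
        g = grad_log_density_phi(phi, grad_log_p_theta)
        phi += 0.5 * step * g + math.sqrt(step) * rng.gauss(0.0, 1.0)
        samples.append(softplus(phi))
    return samples


if __name__ == "__main__":
    # Illustrative target: Exponential(1), whose log-density gradient is -1.
    draws = langevin_positive(lambda theta: -1.0, n_steps=2000)
    print(min(draws) > 0.0)
```

The sampler moves freely in the unconstrained variable, and every returned sample respects the positivity constraint by construction.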

arXiv:1902.04247 [pdf, ps, other]

PAC-Bayes Analysis of Sentence Representation

Authors: Kento Nozawa, Issei Sato

Abstract: Learning sentence vectors from an unlabeled corpus has attracted attention because such vectors can represent sentences in a lower dimensional and continuous space. Simple heuristics using pre-trained word vectors are widely applied to machine learning tasks. However, they are not well understood from a theoretical perspective. We analyze learning sentence vectors from a transfer learning perspective by using a PAC-Bayes bound that enables us to understand existing heuristics. We show that simple heuristics such as averaging and inverse document frequency weighted averaging are derived by our formulation. Moreover, we propose novel sentence vector learning algorithms on the basis of our PAC-Bayes analysis.

Submitted 13 February, 2019; v1 submitted 12 February, 2019; originally announced February 2019.

Comments: fix styles
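
The two heuristics named in the abstract above (averaging and inverse document frequency weighted averaging of word vectors) can be sketched as follows. This is a generic illustration under stated assumptions: the IDF formula $\log(N / \mathrm{df})$ is one common variant, the helper names `compute_idf` and `sentence_vector` are hypothetical, and nothing here reproduces the paper's PAC-Bayes derivation.

```python
import math


def compute_idf(corpus):
    """Return idf[w] = log(N / df[w]) for a corpus given as a list of token lists."""
    n_docs = len(corpus)
    doc_freq = {}
    for tokens in corpus:
        for word in set(tokens):  # count each word once per document
            doc_freq[word] = doc_freq.get(word, 0) + 1
    return {word: math.log(n_docs / df) for word, df in doc_freq.items()}


def sentence_vector(tokens, word_vectors, weights=None):
    """Weighted average of word vectors; uniform weights give plain averaging.

    Tokens without a vector are skipped. If the total weight is zero,
    a zero vector is returned.
    """
    dim = len(next(iter(word_vectors.values())))
    total = [0.0] * dim
    weight_sum = 0.0
    for token in tokens:
        if token not in word_vectors:
            continue
        w = 1.0 if weights is None else weights.get(token, 0.0)
        for k, value in enumerate(word_vectors[token]):
            total[k] += w * value
        weight_sum += w
    if weight_sum == 0.0:
        return total
    return [value / weight_sum for value in total]


if __name__ == "__main__":
    vectors = {"a": [1.0, 0.0], "b": [0.0, 1.0]}
    print(sentence_vector(["a", "b"], vectors))
    print(sentence_vector(["a", "b"], vectors, weights={"a": 1.0, "b": 3.0}))
```

Passing the output of `compute_idf` as `weights` down-weights words that occur in most documents, which is the intuition behind the IDF-weighted variant.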

arXiv:1902.01056 [pdf, other]

Online Multiclass Classification Based on Prediction Margin for Partial Feedback

Authors: Takuo Kaneko, Issei Sato, Masashi Sugiyama

Abstract: We consider the problem of online multiclass classification with partial feedback, where an algorithm predicts a class for a new instance in each round and only receives its correctness. Although several methods have been developed for this problem, recent challenging real-world applications require further performance improvement. In this paper, we propose a novel online learning algorithm inspired by recent work on learning from complementary labels, where a complementary label indicates a class to which an instance does not belong. This allows us to handle partial feedback deterministically in a margin-based way, where the prediction margin has been recognized as a key to superior empirical performance. We provide a theoretical guarantee based on a cumulative loss bound and experimentally demonstrate that our method outperforms existing methods which are non-margin-based and stochastic.

Submitted 4 February, 2019; originally announced February 2019.

arXiv:1902.00468 [pdf, other]

Multilevel Monte Carlo Variational Inference

Authors: Masahiro Fujisawa, Issei Sato

Abstract: We propose a variance reduction framework for variational inference using the Multilevel Monte Carlo (MLMC) method. Our framework is built on reparameterized gradient estimators and "recycles" parameters obtained from past update history in optimization. In addition, our framework provides a new optimization algorithm based on stochastic gradient descent (SGD) that adaptively estimates the sample size used for gradient estimation according to the ratio of the gradient variance. We theoretically show that, with our method, the variance of the gradient estimator decreases as optimization proceeds and that a learning rate scheduler function helps improve the convergence. We also show that, in terms of the signal-to-noise ratio, our method can improve the quality of gradient estimation by the learning rate scheduler function without increasing the initial sample size. Finally, we confirm that our method achieves faster convergence and reduces the variance of the gradient estimator compared with other methods through experimental comparisons with baseline methods using several benchmark datasets.

Submitted 2 December, 2021; v1 submitted 1 February, 2019; originally announced February 2019.

Comments: 44 pages, 10 figures; Journal of Machine Learning Research (JMLR)

arXiv:1901.11351 [pdf, other]

Semi-Supervised Ordinal Regression Based on Empirical Risk Minimization

Authors: Taira Tsuchiya, Nontawat Charoenphakdee, Issei Sato, Masashi Sugiyama

Abstract: Ordinal regression is aimed at predicting an ordinal class label. In this paper, we consider its semi-supervised formulation, in which we have unlabeled data along with ordinal-labeled data to train an ordinal regressor. There are several metrics to evaluate the performance of ordinal regression, such as the mean absolute error, mean zero-one error, and mean squared error. However, the existing studies do not take the evaluation metric into account, have a restriction on the model choice, and have no theoretical guarantee. To overcome these problems, we propose a novel generic framework for semi-supervised ordinal regression based on the empirical risk minimization principle that is applicable to optimizing all of the metrics mentioned above. Besides, our framework has flexible choices of models, surrogate losses, and optimization algorithms without the common geometric assumption on unlabeled data such as the cluster assumption or manifold assumption. We further provide an estimation error bound to show that our risk estimator is consistent. Finally, we conduct experiments to show the usefulness of our framework.

Submitted 10 June, 2021; v1 submitted 31 January, 2019; originally announced January 2019.

Comments: 38 pages, 9 figures

arXiv:1901.04653 [pdf, other]

Normalized Flat Minima: Exploring Scale Invariant Definition of Flat Minima for Neural Networks using PAC-Bayesian Analysis

Authors: Yusuke Tsuzuku, Issei Sato, Masashi Sugiyama

Abstract: The notion of flat minima has played a key role in the generalization studies of deep learning models. However, existing definitions of the flatness are known to be sensitive to the rescaling of parameters. The issue suggests that the previous definitions of the flatness might not be a good measure of generalization, because generalization is invariant to such rescalings. In this paper, from the PAC-Bayesian perspective, we scrutinize the discussion concerning the flat minima and introduce the notion of normalized flat minima, which is free from the known scale dependence issues. Additionally, we highlight the scale dependence of existing matrix-norm based generalization error bounds similar to the existing flat minima definitions. Our modified notion of the flatness does not suffer from the insufficiency, either, suggesting it might provide better hierarchy in the hypothesis class.

Submitted 28 January, 2019; v1 submitted 14 January, 2019; originally announced January 2019.

arXiv:1812.02640 [pdf, other]

doi 10.1609/aaai.v33i01.33011093

Pathological Evidence Exploration in Deep Retinal Image Diagnosis

Authors: Yuhao Niu, Lin Gu, Feng Lu, Feifan Lv, Zongji Wang, Imari Sato, Zijian Zhang, Yangyan Xiao, Xunzhang Dai, Tingting Cheng

Abstract: Though deep learning has shown successful performance in classifying the label and severity stage of certain diseases, most methods give little evidence on how they make predictions. Here, we propose to exploit the interpretability of deep learning application in medical diagnosis. Inspired by Koch's Postulates, a well-known strategy in medical research to identify the property of pathogen, we define a pathological descriptor that can be extracted from the activated neurons of a diabetic retinopathy detector. To visualize the symptom and feature encoded in this descriptor, we propose a GAN based method to synthesize pathological retinal image given the descriptor and a binary vessel segmentation. Besides, with this descriptor, we can arbitrarily manipulate the position and quantity of lesions. As verified by a panel of 5 licensed ophthalmologists, our synthesized images carry the symptoms that are directly related to diabetic retinopathy diagnosis. The panel survey also shows that our generated images are both qualitatively and quantitatively superior to existing methods.

Submitted 6 December, 2018; originally announced December 2018.

Comments: to appear in AAAI (2019). The first two authors contributed equally to the paper. Corresponding Author: Feng Lu

Journal ref: AAAI 2019: 1093-1101

arXiv:1811.12104 [pdf, other]

Generating Easy-to-Understand Referring Expressions for Target Identifications

Authors: Mikihiro Tanaka, Takayuki Itamochi, Kenichi Narioka, Ikuro Sato, Yoshitaka Ushiku, Tatsuya Harada

Abstract: This paper addresses the generation of referring expressions that not only refer to objects correctly but also let humans find them quickly. As a target becomes relatively less salient, identifying referred objects itself becomes more difficult. However, the existing studies regarded all sentences that refer to objects correctly as equally good, ignoring whether they are easily understood by humans. If the target is not salient, humans utilize relationships with the salient contexts around it to help listeners to comprehend it better. To derive this information from human annotations, our model is designed to extract information from the target and from the environment. Moreover, we regard that sentences that are easily understood are those that are comprehended correctly and quickly by humans. We optimized this by using the time required to locate the referred objects by humans and their accuracies. To evaluate our system, we created a new referring expression dataset whose images were acquired from Grand Theft Auto V (GTA V), limiting targets to persons. Experimental results show the effectiveness of our approach. Our code and dataset are available at https://github.com/mikittt/easy-to-understand-REG.

Submitted 29 August, 2019; v1 submitted 29 November, 2018; originally announced November 2018.

arXiv:1811.02116 [pdf, ps, other]

Eigenbasis of the Evolution Operator of 2-Tessellable Quantum Walks

Authors: Yusuke Higuchi, Renato Portugal, Iwao Sato, Etsuo Segawa

Abstract: Staggered quantum walks on graphs are based on the concept of graph tessellation and generalize some well-known discrete-time quantum walk models. In this work, we address the class of 2-tessellable quantum walks with the goal of obtaining an eigenbasis of the evolution operator. By interpreting the evolution operator as a quantum Markov chain on an underlying multigraph, we define the concept of quantum detailed balance, which helps to obtain the eigenbasis. A subset of the eigenvectors is obtained from the eigenvectors of the double discriminant matrix of the quantum Markov chain. To obtain the remaining eigenvectors, we have to use the quantum detailed balance conditions. If the quantum Markov chain has a quantum detailed balance, there is an eigenvector for each fundamental cycle of the underlying multigraph. If the quantum Markov chain does not have a quantum detailed balance, we have to use two fundamental cycles linked by a path in order to find the remaining eigenvectors. We exemplify the process of obtaining the eigenbasis of the evolution operator using the kagome lattice (the line graph of the hexagonal lattice), which has symmetry properties that help in the calculation process.

Submitted 5 November, 2018; originally announced November 2018.

Comments: 21 pages, 3 figures

arXiv:1809.04997 [pdf, other]

Clipped Matrix Completion: A Remedy for Ceiling Effects

Authors: Takeshi Teshima, Miao Xu, Issei Sato, Masashi Sugiyama

Abstract: We consider the problem of recovering a low-rank matrix from its clipped observations. Clipping is conceivable in many scientific areas that obstructs statistical analyses. On the other hand, matrix completion (MC) methods can recover a low-rank matrix from various information deficits by using the principle of low-rank completion. However, the current theoretical guarantees for low-rank MC do not apply to clipped matrices, as the deficit depends on the underlying values. Therefore, the feasibility of clipped matrix completion (CMC) is not trivial. In this paper, we first provide a theoretical guarantee for the exact recovery of CMC by using a trace-norm minimization algorithm. Furthermore, we propose practical CMC algorithms by extending ordinary MC methods. Our extension is to use the squared hinge loss in place of the squared loss for reducing the penalty of over-estimation on clipped entries. We also propose a novel regularization term tailored for CMC. It is a combination of two trace-norm terms, and we theoretically bound the recovery error under the regularization. We demonstrate the effectiveness of the proposed methods through experiments using both synthetic and benchmark data for recommendation systems.

Submitted 4 March, 2019; v1 submitted 13 September, 2018; originally announced September 2018.

Comments: 36 pages, 3 figures, The Thirty-Third AAAI Conference on Artificial Intelligence (AAAI-19)

arXiv:1809.04820 [pdf, other]

Canonical and Compact Point Cloud Representation for Shape Classification

Authors: Kent Fujiwara, Ikuro Sato, Mitsuru Ambai, Yuichi Yoshida, Yoshiaki Sakakura

Abstract: We present a novel compact point cloud representation that is inherently invariant to scale, coordinate change and point permutation. The key idea is to parametrize a distance field around an individual shape into a unique, canonical, and compact vector in an unsupervised manner. We firstly project a distance field to a $4$D canonical space using singular value decomposition. We then train a neural network for each instance to non-linearly embed its distance field into network parameters. We employ a bias-free Extreme Learning Machine (ELM) with ReLU activation units, which has scale-factor commutative property between layers. We demonstrate the descriptiveness of the instance-wise, shape-embedded network parameters by using them to classify shapes in $3$D datasets. Our learning-based representation requires minimal augmentation and simple neural networks, where previous approaches demand numerous representations to handle coordinate change and point permutation.

Submitted 13 September, 2018; originally announced September 2018.

Comments: 16 pages, 5 figures

arXiv:1809.04098 [pdf, other]

On the Structural Sensitivity of Deep Convolutional Networks to the Directions of Fourier Basis Functions

Authors: Yusuke Tsuzuku, Issei Sato

Abstract: Data-agnostic quasi-imperceptible perturbations on inputs are known to degrade recognition accuracy of deep convolutional networks severely. This phenomenon is considered to be a potential security issue. Moreover, some results on statistical generalization guarantees indicate that the phenomenon can be a key to improve the networks' generalization. However, the characteristics of the shared directions of such harmful perturbations remain unknown. Our primal finding is that convolutional networks are sensitive to the directions of Fourier basis functions. We derived the property by specializing a hypothesis of the cause of the sensitivity, known as the linearity of neural networks, to convolutional networks and empirically validated it. As a by-product of the analysis, we propose an algorithm to create shift-invariant universal adversarial perturbations available in black-box settings.

Submitted 17 April, 2019; v1 submitted 11 September, 2018; originally announced September 2018.

Comments: CVPR 2019

arXiv:1809.03839 [pdf, other]

Unsupervised Domain Adaptation Based on Source-guided Discrepancy

Authors: Seiichi Kuroki, Nontawat Charoenphakdee, Han Bao, Junya Honda, Issei Sato, Masashi Sugiyama

Abstract: Unsupervised domain adaptation is the problem setting where data generating distributions in the source and target domains are different, and labels in the target domain are unavailable. One important question in unsupervised domain adaptation is how to measure the difference between the source and target domains. A previously proposed discrepancy that does not use the source domain labels requires high computational cost to estimate and may lead to a loose generalization error bound in the target domain. To mitigate these problems, we propose a novel discrepancy called source-guided discrepancy (S-disc), which exploits labels in the source domain. As a consequence, S-disc can be computed efficiently with a finite sample convergence guarantee. In addition, we show that S-disc can provide a tighter generalization error bound than the one based on an existing discrepancy. Finally, we report experimental results that demonstrate the advantages of S-disc over the existing discrepancies.

Submitted 19 November, 2018; v1 submitted 11 September, 2018; originally announced September 2018.

Comments: To appear in AAAI-19

arXiv:1805.07912 [pdf, ps, other]

Bayesian posterior approximation via greedy particle optimization

Authors: Futoshi Futami, Zhenghang Cui, Issei Sato, Masashi Sugiyama

Abstract: In Bayesian inference, the posterior distributions are difficult to obtain analytically for complex models such as neural networks. Variational inference usually uses a parametric distribution for approximation, from which we can easily draw samples. Recently discrete approximation by particles has attracted attention because of its high expression ability. An example is Stein variational gradient descent (SVGD), which iteratively optimizes particles. Although SVGD has been shown to be computationally efficient empirically, its theoretical properties have not been clarified yet and no finite sample bound of the convergence rate is known. Another example is the Stein points (SP) method, which minimizes kernelized Stein discrepancy directly. Although a finite sample bound is assured theoretically, SP is computationally inefficient empirically, especially in high-dimensional problems. In this paper, we propose a novel method named maximum mean discrepancy minimization by the Frank-Wolfe algorithm (MMD-FW), which minimizes MMD in a greedy way by the FW algorithm. Our method is computationally efficient empirically and we show that its finite sample convergence bound is in a linear order in finite dimensions.

Submitted 31 January, 2019; v1 submitted 21 May, 2018; originally announced May 2018.

arXiv:1803.04232 [pdf, other]

Variational Inference for Gaussian Process with Panel Count Data

Authors: Hongyi Ding, Young Lee, Issei Sato, Masashi Sugiyama

Abstract: We present the first framework for Gaussian-process-modulated Poisson processes when the temporal data appear in the form of panel counts. Panel count data frequently arise when experimental subjects are observed only at discrete time points and only the numbers of occurrences of the events between subsequent observation times are available. The exact occurrence timestamps of the events are unknown. The method of conducting the efficient variational inference is presented, based on the assumption of a Gaussian-process-modulated intensity function. We derive a tractable lower bound to alleviate the problems of the intractable evidence lower bound inherent in the variational inference framework. Our algorithm outperforms classical methods on both synthetic and three real panel count sets.

Submitted 12 March, 2018; originally announced March 2018.

arXiv:1802.04551 [pdf, other]

Analysis of Minimax Error Rate for Crowdsourcing and Its Application to Worker Clustering Model

Authors: Hideaki Imamura, Issei Sato, Masashi Sugiyama

Abstract: While crowdsourcing has become an important means to label data, there is great interest in estimating the ground truth from unreliable labels produced by crowdworkers. The Dawid and Skene (DS) model is one of the most well-known models in the study of crowdsourcing. Despite its practical popularity, theoretical error analysis for the DS model has been conducted only under restrictive assumptions on class priors, confusion matrices, or the number of labels each worker provides. In this paper, we derive a minimax error rate under more practical setting for a broader class of crowdsourcing models including the DS model as a special case. We further propose the worker clustering model, which is more practical than the DS model under real crowdsourcing settings. The wide applicability of our theoretical analysis allows us to immediately investigate the behavior of this proposed model, which can not be analyzed by existing studies. Experimental results showed that there is a strong similarity between the lower bound of the minimax error rate derived by our theoretical analysis and the empirical error of the estimated value.

Submitted 9 June, 2018; v1 submitted 13 February, 2018; originally announced February 2018.

Comments: Accepted to ICML2018 (International Conference on Machine Learning)

arXiv:1802.04034 [pdf, other]

Lipschitz-Margin Training: Scalable Certification of Perturbation Invariance for Deep Neural Networks

Authors: Yusuke Tsuzuku, Issei Sato, Masashi Sugiyama

Abstract: High sensitivity of neural networks against malicious perturbations on inputs causes security concerns. To take a steady step towards robust classifiers, we aim to create neural network models provably defended from perturbations. Prior certification work requires strong assumptions on network structures and massive computational costs, and thus the range of their applications was limited. From the relationship between the Lipschitz constants and prediction margins, we present a computationally efficient calculation technique to lower-bound the size of adversarial perturbations that can deceive networks, and that is widely applicable to various complicated networks. Moreover, we propose an efficient training procedure that robustifies networks and significantly improves the provably guarded areas around data points. In experimental evaluations, our method showed its ability to provide a non-trivial guarantee and enhance robustness for even large networks.

Submitted 31 October, 2018; v1 submitted 12 February, 2018; originally announced February 2018.

Comments: To appear in NIPS2018

arXiv:1802.03877 [pdf, other]

Gaussian Process Classification with Privileged Information by Soft-to-Hard Labeling Transfer

Authors: Ryosuke Kamesawa, Issei Sato, Masashi Sugiyama

Abstract: Learning using privileged information is an attractive problem setting that helps many learning scenarios in the real world. A state-of-the-art method of Gaussian process classification (GPC) with privileged information is GPC+, which incorporates privileged information into a noise term of the likelihood. A drawback of GPC+ is that it requires numerical quadrature to calculate the posterior distribution of the latent function, which is extremely time-consuming. To overcome this limitation, we propose a novel classification method with privileged information based on Gaussian processes, called "soft-label-transferred Gaussian process (SLT-GP)." Our basic idea is that we construct another learning task of predicting soft labels (continuous values) obtained from privileged information and we perform transfer learning from this task to the target task of predicting hard labels. We derive a PAC-Bayesian bound of our proposed method, which justifies optimizing hyperparameters by the empirical Bayes method. We also experimentally show the usefulness of our proposed method compared with GPC and GPC+.

Submitted 11 February, 2018; originally announced February 2018.

Showing 51–100 of 154 results for author: Sato, I