Search | arXiv e-print repository

Partial Gromov-Wasserstein Metric

Authors: Yikun Bai, Rocio Diaz Martin, Abihith Kothapalli, Hengrong Du, Xinran Liu, Soheil Kolouri

Abstract: The Gromov-Wasserstein (GW) distance has gained increasing interest in the machine learning community in recent years, as it allows for the comparison of measures in different metric spaces. To overcome the limitations imposed by the equal mass requirements of the classical GW problem, researchers have begun exploring its application in unbalanced settings. However, Unbalanced GW (UGW) can only be… ▽ More The Gromov-Wasserstein (GW) distance has gained increasing interest in the machine learning community in recent years, as it allows for the comparison of measures in different metric spaces. To overcome the limitations imposed by the equal mass requirements of the classical GW problem, researchers have begun exploring its application in unbalanced settings. However, Unbalanced GW (UGW) can only be regarded as a discrepancy rather than a rigorous metric/distance between two metric measure spaces (mm-spaces). In this paper, we propose a particular case of the UGW problem, termed Partial Gromov-Wasserstein (PGW). We establish that PGW is a well-defined metric between mm-spaces and discuss its theoretical properties, including the existence of a minimizer for the PGW problem and the relationship between PGW and GW, among others. We then propose two variants of the Frank-Wolfe algorithm for solving the PGW problem and show that they are mathematically and computationally equivalent. Moreover, based on our PGW metric, we introduce the analogous concept of barycenters for mm-spaces. Finally, we validate the effectiveness of our PGW metric and related solvers in applications such as shape matching, shape retrieval, and shape interpolation, comparing them against existing baselines. △ Less

Submitted 28 May, 2024; v1 submitted 5 February, 2024; originally announced February 2024.

arXiv:2402.02345 [pdf, other]

Stereographic Spherical Sliced Wasserstein Distances

Authors: Huy Tran, Yikun Bai, Abihith Kothapalli, Ashkan Shahbazi, Xinran Liu, Rocio Diaz Martin, Soheil Kolouri

Abstract: Comparing spherical probability distributions is of great interest in various fields, including geology, medical domains, computer vision, and deep representation learning. The utility of optimal transport-based distances, such as the Wasserstein distance, for comparing probability measures has spurred active research in develo** computationally efficient variations of these distances for spheri… ▽ More Comparing spherical probability distributions is of great interest in various fields, including geology, medical domains, computer vision, and deep representation learning. The utility of optimal transport-based distances, such as the Wasserstein distance, for comparing probability measures has spurred active research in develo** computationally efficient variations of these distances for spherical probability measures. This paper introduces a high-speed and highly parallelizable distance for comparing spherical measures using the stereographic projection and the generalized Radon transform, which we refer to as the Stereographic Spherical Sliced Wasserstein (S3W) distance. We carefully address the distance distortion caused by the stereographic projection and provide an extensive theoretical analysis of our proposed metric and its rotationally invariant variation. Finally, we evaluate the performance of the proposed metrics and compare them with recent baselines in terms of both speed and accuracy through a wide range of numerical studies, including gradient flows and self-supervised learning. Our code is available at https://github.com/mint-vu/s3wd. △ Less

Submitted 9 June, 2024; v1 submitted 4 February, 2024; originally announced February 2024.

Comments: Published at ICML 2024 (Spotlight). Project page: https://abi-kothapalli.github.io/s3w/

arXiv:2212.11110 [pdf, other]

Lifelong Reinforcement Learning with Modulating Masks

Authors: Eseoghene Ben-Iwhiwhu, Saptarshi Nath, Praveen K. Pilly, Soheil Kolouri, Andrea Soltoggio

Abstract: Lifelong learning aims to create AI systems that continuously and incrementally learn during a lifetime, similar to biological learning. Attempts so far have met problems, including catastrophic forgetting, interference among tasks, and the inability to exploit previous knowledge. While considerable research has focused on learning multiple supervised classification tasks that involve changes in t… ▽ More Lifelong learning aims to create AI systems that continuously and incrementally learn during a lifetime, similar to biological learning. Attempts so far have met problems, including catastrophic forgetting, interference among tasks, and the inability to exploit previous knowledge. While considerable research has focused on learning multiple supervised classification tasks that involve changes in the input distribution, lifelong reinforcement learning (LRL) must deal with variations in the state and transition distributions, and in the reward functions. Modulating masks with a fixed backbone network, recently developed for classification, are particularly suitable to deal with such a large spectrum of task variations. In this paper, we adapted modulating masks to work with deep LRL, specifically PPO and IMPALA agents. The comparison with LRL baselines in both discrete and continuous RL tasks shows superior performance. We further investigated the use of a linear combination of previously learned masks to exploit previous knowledge when learning new tasks: not only is learning faster, the algorithm solves tasks that we could not otherwise solve from scratch due to extremely sparse rewards. The results suggest that RL with modulating masks is a promising approach to lifelong learning, to the composition of knowledge to learn increasingly complex tasks, and to knowledge reuse for efficient and faster learning. △ Less

Submitted 1 August, 2023; v1 submitted 21 December, 2022; originally announced December 2022.

Comments: Code available at https://github.com/dlpbc/mask-lrl

Journal ref: Transactions on Machine Learning Research (2023)

arXiv:2212.08049 [pdf, other]

Sliced Optimal Partial Transport

Authors: Yikun Bai, Berhnard Schmitzer, Mathew Thorpe, Soheil Kolouri

Abstract: Optimal transport (OT) has become exceedingly popular in machine learning, data science, and computer vision. The core assumption in the OT problem is the equal total amount of mass in source and target measures, which limits its application. Optimal Partial Transport (OPT) is a recently proposed solution to this limitation. Similar to the OT problem, the computation of OPT relies on solving a lin… ▽ More Optimal transport (OT) has become exceedingly popular in machine learning, data science, and computer vision. The core assumption in the OT problem is the equal total amount of mass in source and target measures, which limits its application. Optimal Partial Transport (OPT) is a recently proposed solution to this limitation. Similar to the OT problem, the computation of OPT relies on solving a linear programming problem (often in high dimensions), which can become computationally prohibitive. In this paper, we propose an efficient algorithm for calculating the OPT problem between two non-negative measures in one dimension. Next, following the idea of sliced OT distances, we utilize slicing to define the sliced OPT distance. Finally, we demonstrate the computational and accuracy benefits of the sliced OPT-based method in various numerical experiments. In particular, we show an application of our proposed Sliced-OPT in noisy point cloud registration. △ Less

Submitted 7 August, 2023; v1 submitted 15 December, 2022; originally announced December 2022.

Comments: modify the link of Github page

arXiv:2006.09430 [pdf, other]

Wasserstein Embedding for Graph Learning

Authors: Soheil Kolouri, Navid Naderializadeh, Gustavo K. Rohde, Heiko Hoffmann

Abstract: We present Wasserstein Embedding for Graph Learning (WEGL), a novel and fast framework for embedding entire graphs in a vector space, in which various machine learning models are applicable for graph-level prediction tasks. We leverage new insights on defining similarity between graphs as a function of the similarity between their node embedding distributions. Specifically, we use the Wasserstein… ▽ More We present Wasserstein Embedding for Graph Learning (WEGL), a novel and fast framework for embedding entire graphs in a vector space, in which various machine learning models are applicable for graph-level prediction tasks. We leverage new insights on defining similarity between graphs as a function of the similarity between their node embedding distributions. Specifically, we use the Wasserstein distance to measure the dissimilarity between node embeddings of different graphs. Unlike prior work, we avoid pairwise calculation of distances between graphs and reduce the computational complexity from quadratic to linear in the number of graphs. WEGL calculates Monge maps from a reference distribution to each node embedding and, based on these maps, creates a fixed-sized vector representation of the graph. We evaluate our new graph embedding approach on various benchmark graph-property prediction tasks, showing state-of-the-art classification performance while having superior computational efficiency. The code is available at https://github.com/navid-naderi/WEGL. △ Less

Submitted 1 March, 2021; v1 submitted 16 June, 2020; originally announced June 2020.

Comments: Final version to be presented at the Ninth International Conference on Learning Representations (ICLR 2021)

arXiv:2003.05783 [pdf, other]

Statistical and Topological Properties of Sliced Probability Divergences

Authors: Kimia Nadjahi, Alain Durmus, Lénaïc Chizat, Soheil Kolouri, Shahin Shahrampour, Umut Şimşekli

Abstract: The idea of slicing divergences has been proven to be successful when comparing two probability measures in various machine learning applications including generative modeling, and consists in computing the expected value of a `base divergence' between one-dimensional random projections of the two measures. However, the topological, statistical, and computational consequences of this technique hav… ▽ More The idea of slicing divergences has been proven to be successful when comparing two probability measures in various machine learning applications including generative modeling, and consists in computing the expected value of a `base divergence' between one-dimensional random projections of the two measures. However, the topological, statistical, and computational consequences of this technique have not yet been well-established. In this paper, we aim at bridging this gap and derive various theoretical properties of sliced probability divergences. First, we show that slicing preserves the metric axioms and the weak continuity of the divergence, implying that the sliced divergence will share similar topological properties. We then precise the results in the case where the base divergence belongs to the class of integral probability metrics. On the other hand, we establish that, under mild conditions, the sample complexity of a sliced divergence does not depend on the problem dimension. We finally apply our general results to several base divergences, and illustrate our theory on both synthetic and real data experiments. △ Less

Submitted 4 January, 2022; v1 submitted 12 March, 2020; originally announced March 2020.

Comments: Published at NeurIPS 2020 (Spotlight)

arXiv:2002.12537 [pdf, other]

Generalized Sliced Distances for Probability Distributions

Authors: Soheil Kolouri, Kimia Nadjahi, Umut Simsekli, Shahin Shahrampour

Abstract: Probability metrics have become an indispensable part of modern statistics and machine learning, and they play a quintessential role in various applications, including statistical hypothesis testing and generative modeling. However, in a practical setting, the convergence behavior of the algorithms built upon these distances have not been well established, except for a few specific cases. In this… ▽ More Probability metrics have become an indispensable part of modern statistics and machine learning, and they play a quintessential role in various applications, including statistical hypothesis testing and generative modeling. However, in a practical setting, the convergence behavior of the algorithms built upon these distances have not been well established, except for a few specific cases. In this paper, we introduce a broad family of probability metrics, coined as Generalized Sliced Probability Metrics (GSPMs), that are deeply rooted in the generalized Radon transform. We first verify that GSPMs are metrics. Then, we identify a subset of GSPMs that are equivalent to maximum mean discrepancy (MMD) with novel positive definite kernels, which come with a unique geometric interpretation. Finally, by exploiting this connection, we consider GSPM-based gradient flows for generative modeling applications and show that under mild assumptions, the gradient flow converges to the global optimum. We illustrate the utility of our approach on both real and synthetic problems. △ Less

Submitted 27 February, 2020; originally announced February 2020.

arXiv:1909.09902 [pdf, other]

Deep Reinforcement Learning with Modulated Hebbian plus Q Network Architecture

Authors: Pawel Ladosz, Eseoghene Ben-Iwhiwhu, Jeffery Dick, Yang Hu, Nicholas Ketz, Soheil Kolouri, Jeffrey L. Krichmar, Praveen Pilly, Andrea Soltoggio

Abstract: This paper presents a new neural architecture that combines a modulated Hebbian network (MOHN) with DQN, which we call modulated Hebbian plus Q network architecture (MOHQA). The hypothesis is that such a combination allows MOHQA to solve difficult partially observable Markov decision process (POMDP) problems which impair temporal difference (TD)-based RL algorithms such as DQN, as the TD error can… ▽ More This paper presents a new neural architecture that combines a modulated Hebbian network (MOHN) with DQN, which we call modulated Hebbian plus Q network architecture (MOHQA). The hypothesis is that such a combination allows MOHQA to solve difficult partially observable Markov decision process (POMDP) problems which impair temporal difference (TD)-based RL algorithms such as DQN, as the TD error cannot be easily derived from observations. The key idea is to use a Hebbian network with bio-inspired neural traces in order to bridge temporal delays between actions and rewards when confounding observations and sparse rewards result in inaccurate TD errors. In MOHQA, DQN learns low level features and control, while the MOHN contributes to the high-level decisions by associating rewards with past states and actions. Thus the proposed architecture combines two modules with significantly different learning algorithms, a Hebbian associative network and a classical DQN pipeline, exploiting the advantages of both. Simulations on a set of POMDPs and on the MALMO environment show that the proposed algorithm improved DQN's results and even outperformed control tests with A2C, QRDQN+LSTM and REINFORCE algorithms on some POMDPs with confounding stimuli and sparse rewards. △ Less

Submitted 14 October, 2021; v1 submitted 21 September, 2019; originally announced September 2019.

arXiv:1907.02271 [pdf, other]

Learning a Domain-Invariant Embedding for Unsupervised Domain Adaptation Using Class-Conditioned Distribution Alignment

Authors: Alex Gabourie, Mohammad Rostami, Philip Pope, Soheil Kolouri, Kyungnam Kim

Abstract: We address the problem of unsupervised domain adaptation (UDA) by learning a cross-domain agnostic embedding space, where the distance between the probability distributions of the two source and target visual domains is minimized. We use the output space of a shared cross-domain deep encoder to model the embedding space anduse the Sliced-Wasserstein Distance (SWD) to measure and minimize the dista… ▽ More We address the problem of unsupervised domain adaptation (UDA) by learning a cross-domain agnostic embedding space, where the distance between the probability distributions of the two source and target visual domains is minimized. We use the output space of a shared cross-domain deep encoder to model the embedding space anduse the Sliced-Wasserstein Distance (SWD) to measure and minimize the distance between the embedded distributions of two source and target domains to enforce the embedding to be domain-agnostic.Additionally, we use the source domain labeled data to train a deep classifier from the embedding space to the label space to enforce the embedding space to be discriminative.As a result of this training scheme, we provide an effective solution to train the deep classification network on the source domain such that it will generalize well on the target domain, where only unlabeled training data is accessible. To mitigate the challenge of class matching, we also align corresponding classes in the embedding space by using high confidence pseudo-labels for the target domain, i.e. assigning the class for which the source classifier has a high prediction probability. We provide experimental results on UDA benchmark tasks to demonstrate that our method is effective and leads to state-of-the-art performance. △ Less

Submitted 24 September, 2019; v1 submitted 4 July, 2019; originally announced July 2019.

arXiv:1907.02220 [pdf, other]

Neural Networks, Hypersurfaces, and Radon Transforms

Authors: Soheil Kolouri, Xuwang Yin, Gustavo K. Rohde

Abstract: Connections between integration along hypersufaces, Radon transforms, and neural networks are exploited to highlight an integral geometric mathematical interpretation of neural networks. By analyzing the properties of neural networks as operators on probability distributions for observed data, we show that the distribution of outputs for any node in a neural network can be interpreted as a nonline… ▽ More Connections between integration along hypersufaces, Radon transforms, and neural networks are exploited to highlight an integral geometric mathematical interpretation of neural networks. By analyzing the properties of neural networks as operators on probability distributions for observed data, we show that the distribution of outputs for any node in a neural network can be interpreted as a nonlinear projection along hypersurfaces defined by level surfaces over the input data space. We utilize these descriptions to provide new interpretation for phenomena such as nonlinearity, pooling, activation functions, and adversarial examples in neural network-based learning problems. △ Less

Submitted 4 July, 2019; originally announced July 2019.

arXiv:1906.10509 [pdf, other]

Zero-Shot Image Classification Using Coupled Dictionary Embedding

Authors: Mohammad Rostami, Soheil Kolouri, Zak Murez, Yuri Owekcho, Eric Eaton, Kuyngnam Kim

Abstract: Zero-shot learning (ZSL) is a framework to classify images belonging to unseen classes based on solely semantic information about these unseen classes. In this paper, we propose a new ZSL algorithm using coupled dictionary learning. The core idea is that the visual features and the semantic attributes of an image can share the same sparse representation in an intermediate space. We use images from… ▽ More Zero-shot learning (ZSL) is a framework to classify images belonging to unseen classes based on solely semantic information about these unseen classes. In this paper, we propose a new ZSL algorithm using coupled dictionary learning. The core idea is that the visual features and the semantic attributes of an image can share the same sparse representation in an intermediate space. We use images from seen classes and semantic attributes from seen and unseen classes to learn two dictionaries that can represent sparsely the visual and semantic feature vectors of an image. In the ZSL testing stage and in the absence of labeled data, images from unseen classes can be mapped into the attribute space by finding the joint sparse representation using solely the visual data. The image is then classified in the attribute space given semantic descriptions of unseen classes. We also provide an attribute-aware formulation to tackle domain shift and hubness problems in ZSL. Extensive experiments are provided to demonstrate the superior performance of our approach against the state of the art ZSL algorithms on benchmark ZSL datasets. △ Less

Submitted 23 October, 2021; v1 submitted 9 June, 2019; originally announced June 2019.

Comments: arXiv admin note: substantial text overlap with arXiv:1709.03688

arXiv:1906.03744 [pdf, other]

Generative Continual Concept Learning

Authors: Mohammad Rostami, Soheil Kolouri, James McClelland, Praveen Pilly

Abstract: After learning a concept, humans are also able to continually generalize their learned concepts to new domains by observing only a few labeled instances without any interference with the past learned knowledge. In contrast, learning concepts efficiently in a continual learning setting remains an open challenge for current Artificial Intelligence algorithms as persistent model retraining is necessa… ▽ More After learning a concept, humans are also able to continually generalize their learned concepts to new domains by observing only a few labeled instances without any interference with the past learned knowledge. In contrast, learning concepts efficiently in a continual learning setting remains an open challenge for current Artificial Intelligence algorithms as persistent model retraining is necessary. Inspired by the Parallel Distributed Processing learning and the Complementary Learning Systems theories, we develop a computational model that is able to expand its previously learned concepts efficiently to new domains using a few labeled samples. We couple the new form of a concept to its past learned forms in an embedding space for effective continual learning. Doing so, a generative distribution is learned such that it is shared across the tasks in the embedding space and models the abstract concepts. This procedure enables the model to generate pseudo-data points to replay the past experience to tackle catastrophic forgetting. △ Less

Submitted 7 September, 2019; v1 submitted 9 June, 2019; originally announced June 2019.

arXiv:1905.11475 [pdf, other]

GAT: Generative Adversarial Training for Adversarial Example Detection and Robust Classification

Authors: Xuwang Yin, Soheil Kolouri, Gustavo K. Rohde

Abstract: The vulnerabilities of deep neural networks against adversarial examples have become a significant concern for deploying these models in sensitive domains. Devising a definitive defense against such attacks is proven to be challenging, and the methods relying on detecting adversarial samples are only valid when the attacker is oblivious to the detection mechanism. In this paper we propose a princi… ▽ More The vulnerabilities of deep neural networks against adversarial examples have become a significant concern for deploying these models in sensitive domains. Devising a definitive defense against such attacks is proven to be challenging, and the methods relying on detecting adversarial samples are only valid when the attacker is oblivious to the detection mechanism. In this paper we propose a principled adversarial example detection method that can withstand norm-constrained white-box attacks. Inspired by one-versus-the-rest classification, in a K class classification problem, we train K binary classifiers where the i-th binary classifier is used to distinguish between clean data of class i and adversarially perturbed samples of other classes. At test time, we first use a trained classifier to get the predicted label (say k) of the input, and then use the k-th binary classifier to determine whether the input is a clean sample (of class k) or an adversarially perturbed example (of other classes). We further devise a generative approach to detecting/classifying adversarial examples by interpreting each binary classifier as an unnormalized density model of the class-conditional data. We provide comprehensive evaluation of the above adversarial example detection/classification methods, and demonstrate their competitive performances and compelling properties. △ Less

Submitted 1 October, 2022; v1 submitted 27 May, 2019; originally announced May 2019.

Comments: ICLR 2020, code is available at https://github.com/xuwangyin/GAT-Generative-Adversarial-Training; v4 fixed error in Figure 2

arXiv:1903.08329 [pdf, other]

On Sampling Random Features From Empirical Leverage Scores: Implementation and Theoretical Guarantees

Authors: Shahin Shahrampour, Soheil Kolouri

Abstract: Random features provide a practical framework for large-scale kernel approximation and supervised learning. It has been shown that data-dependent sampling of random features using leverage scores can significantly reduce the number of features required to achieve optimal learning bounds. Leverage scores introduce an optimized distribution for features based on an infinite-dimensional integral oper… ▽ More Random features provide a practical framework for large-scale kernel approximation and supervised learning. It has been shown that data-dependent sampling of random features using leverage scores can significantly reduce the number of features required to achieve optimal learning bounds. Leverage scores introduce an optimized distribution for features based on an infinite-dimensional integral operator (depending on input distribution), which is impractical to sample from. Focusing on empirical leverage scores in this paper, we establish an out-of-sample performance bound, revealing an interesting trade-off between the approximated kernel and the eigenvalue decay of another kernel in the domain of random features defined based on data distribution. Our experiments verify that the empirical algorithm consistently outperforms vanilla Monte Carlo sampling, and with a minor modification the method is even competitive to supervised data-dependent kernel learning, without using the output (label) information. △ Less

Submitted 19 March, 2019; originally announced March 2019.

Comments: 23 pages

arXiv:1903.06070 [pdf, other]

Attention-Based Structural-Plasticity

Authors: Soheil Kolouri, Nicholas Ketz, Xinyun Zou, Jeffrey Krichmar, Praveen Pilly

Abstract: Catastrophic forgetting/interference is a critical problem for lifelong learning machines, which impedes the agents from maintaining their previously learned knowledge while learning new tasks. Neural networks, in particular, suffer plenty from the catastrophic forgetting phenomenon. Recently there has been several efforts towards overcoming catastrophic forgetting in neural networks. Here, we pro… ▽ More Catastrophic forgetting/interference is a critical problem for lifelong learning machines, which impedes the agents from maintaining their previously learned knowledge while learning new tasks. Neural networks, in particular, suffer plenty from the catastrophic forgetting phenomenon. Recently there has been several efforts towards overcoming catastrophic forgetting in neural networks. Here, we propose a biologically inspired method toward overcoming catastrophic forgetting. Specifically, we define an attention-based selective plasticity of synapses based on the cholinergic neuromodulatory system in the brain. We define synaptic importance parameters in addition to synaptic weights and then use Hebbian learning in parallel with backpropagation algorithm to learn synaptic importances in an online and seamless manner. We test our proposed method on benchmark tasks including the Permuted MNIST and the Split MNIST problems and show competitive performance compared to the state-of-the-art methods. △ Less

Submitted 2 March, 2019; originally announced March 2019.

arXiv:1903.04566 [pdf, other]

Complementary Learning for Overcoming Catastrophic Forgetting Using Experience Replay

Authors: Mohammad Rostami, Soheil Kolouri, Praveen K. Pilly

Abstract: Despite huge success, deep networks are unable to learn effectively in sequential multitask learning settings as they forget the past learned tasks after learning new tasks. Inspired from complementary learning systems theory, we address this challenge by learning a generative model that couples the current task to the past learned tasks through a discriminative embedding space. We learn an abstra… ▽ More Despite huge success, deep networks are unable to learn effectively in sequential multitask learning settings as they forget the past learned tasks after learning new tasks. Inspired from complementary learning systems theory, we address this challenge by learning a generative model that couples the current task to the past learned tasks through a discriminative embedding space. We learn an abstract level generative distribution in the embedding that allows the generation of data points to represent the experience. We sample from this distribution and utilize experience replay to avoid forgetting and simultaneously accumulate new knowledge to the abstract distribution in order to couple the current task with past experience. We demonstrate theoretically and empirically that our framework learns a distribution in the embedding that is shared across all task and as a result tackles catastrophic forgetting. △ Less

Submitted 31 May, 2019; v1 submitted 11 March, 2019; originally announced March 2019.

arXiv:1903.02647 [pdf, other]

Continual Learning Using World Models for Pseudo-Rehearsal

Authors: Nicholas Ketz, Soheil Kolouri, Praveen Pilly

Abstract: The utility of learning a dynamics/world model of the environment in reinforcement learning has been shown in a many ways. When using neural networks, however, these models suffer catastrophic forgetting when learned in a lifelong or continual fashion. Current solutions to the continual learning problem require experience to be segmented and labeled as discrete tasks, however, in continuous experi… ▽ More The utility of learning a dynamics/world model of the environment in reinforcement learning has been shown in a many ways. When using neural networks, however, these models suffer catastrophic forgetting when learned in a lifelong or continual fashion. Current solutions to the continual learning problem require experience to be segmented and labeled as discrete tasks, however, in continuous experience it is generally unclear what a sufficient segmentation of tasks would be. Here we propose a method to continually learn these internal world models through the interleaving of internally generated episodes of past experiences (i.e., pseudo-rehearsal). We show this method can sequentially learn unsupervised temporal prediction, without task labels, in a disparate set of Atari games. Empirically, this interleaving of the internally generated rollouts with the external environment's observations leads to a consistent reduction in temporal prediction loss compared to non-interleaved learning and is preserved over repeated random exposures to various tasks. Similarly, using a network distillation approach, we show that modern policy gradient based reinforcement learning algorithms can use this internal model to continually learn to optimize reward based on the world model's representation of the environment. △ Less

Submitted 11 June, 2019; v1 submitted 6 March, 2019; originally announced March 2019.

MSC Class: 68T05 91E40

arXiv:1902.00434 [pdf, other]

Generalized Sliced Wasserstein Distances

Authors: Soheil Kolouri, Kimia Nadjahi, Umut Simsekli, Roland Badeau, Gustavo K. Rohde

Abstract: The Wasserstein distance and its variations, e.g., the sliced-Wasserstein (SW) distance, have recently drawn attention from the machine learning community. The SW distance, specifically, was shown to have similar properties to the Wasserstein distance, while being much simpler to compute, and is therefore used in various applications including generative modeling and general supervised/unsupervise… ▽ More The Wasserstein distance and its variations, e.g., the sliced-Wasserstein (SW) distance, have recently drawn attention from the machine learning community. The SW distance, specifically, was shown to have similar properties to the Wasserstein distance, while being much simpler to compute, and is therefore used in various applications including generative modeling and general supervised/unsupervised learning. In this paper, we first clarify the mathematical connection between the SW distance and the Radon transform. We then utilize the generalized Radon transform to define a new family of distances for probability measures, which we call generalized sliced-Wasserstein (GSW) distances. We also show that, similar to the SW distance, the GSW distance can be extended to a maximum GSW (max-GSW) distance. We then provide the conditions under which GSW and max-GSW distances are indeed distances. Finally, we compare the numerical performance of the proposed distances on several generative modeling tasks, including SW flows and SW auto-encoders. △ Less

Submitted 1 February, 2019; originally announced February 2019.

arXiv:1812.00265 [pdf, other]

Discovering Molecular Functional Groups Using Graph Convolutional Neural Networks

Authors: Phillip Pope, Soheil Kolouri, Mohammad Rostrami, Charles Martin, Heiko Hoffmann

Abstract: Functional groups (FGs) are molecular substructures that are served as a foundation for analyzing and predicting chemical properties of molecules. Automatic discovery of FGs will impact various fields of research, including medicinal chemistry and material sciences, by reducing the amount of lab experiments required for discovery or synthesis of new molecules. In this paper, we investigate methods… ▽ More Functional groups (FGs) are molecular substructures that are served as a foundation for analyzing and predicting chemical properties of molecules. Automatic discovery of FGs will impact various fields of research, including medicinal chemistry and material sciences, by reducing the amount of lab experiments required for discovery or synthesis of new molecules. In this paper, we investigate methods based on graph convolutional neural networks (GCNNs) for localizing FGs that contribute to specific chemical properties of interest. In our framework, molecules are modeled as undirected relational graphs with atoms as nodes and bonds as edges. Using this relational graph structure, we trained GCNNs in a supervised way on experimentally-validated molecular training sets to predict specific chemical properties, e.g., toxicity. Upon learning a GCNN, we analyzed its activation patterns to automatically identify FGs using four different explainability methods that we have developed: gradient-based saliency maps, Class Activation Map** (CAM), gradient-weighted CAM (Grad-CAM), and Excitation Back-Propagation. Although these methods are originally derived for convolutional neural networks (CNNs), we adapt them to develop the corresponding suitable versions for GCNNs. We evaluated the contrastive power of these methods with respect to the specificity of the identified molecular substructures and their relevance for chemical functions. Grad-CAM had the highest contrastive power and generated qualitatively the best FGs. This work paves the way for automatic analysis and design of new molecules. △ Less

Submitted 9 October, 2019; v1 submitted 1 December, 2018; originally announced December 2018.

arXiv:1804.01947 [pdf, other]

Sliced-Wasserstein Autoencoder: An Embarrassingly Simple Generative Model

Authors: Soheil Kolouri, Phillip E. Pope, Charles E. Martin, Gustavo K. Rohde

Abstract: In this paper we study generative modeling via autoencoders while using the elegant geometric properties of the optimal transport (OT) problem and the Wasserstein distances. We introduce Sliced-Wasserstein Autoencoders (SWAE), which are generative models that enable one to shape the distribution of the latent space into any samplable probability distribution without the need for training an advers… ▽ More In this paper we study generative modeling via autoencoders while using the elegant geometric properties of the optimal transport (OT) problem and the Wasserstein distances. We introduce Sliced-Wasserstein Autoencoders (SWAE), which are generative models that enable one to shape the distribution of the latent space into any samplable probability distribution without the need for training an adversarial network or defining a closed-form for the distribution. In short, we regularize the autoencoder loss with the sliced-Wasserstein distance between the distribution of the encoded training samples and a predefined samplable distribution. We show that the proposed formulation has an efficient numerical solution that provides similar capabilities to Wasserstein Autoencoders (WAE) and Variational Autoencoders (VAE), while benefiting from an embarrassingly simple implementation. △ Less

Submitted 26 June, 2018; v1 submitted 5 April, 2018; originally announced April 2018.

arXiv:1711.05376 [pdf, other]

Sliced Wasserstein Distance for Learning Gaussian Mixture Models

Authors: Soheil Kolouri, Gustavo K. Rohde, Heiko Hoffmann

Abstract: Gaussian mixture models (GMM) are powerful parametric tools with many applications in machine learning and computer vision. Expectation maximization (EM) is the most popular algorithm for estimating the GMM parameters. However, EM guarantees only convergence to a stationary point of the log-likelihood function, which could be arbitrarily worse than the optimal solution. Inspired by the relationshi… ▽ More Gaussian mixture models (GMM) are powerful parametric tools with many applications in machine learning and computer vision. Expectation maximization (EM) is the most popular algorithm for estimating the GMM parameters. However, EM guarantees only convergence to a stationary point of the log-likelihood function, which could be arbitrarily worse than the optimal solution. Inspired by the relationship between the negative log-likelihood function and the Kullback-Leibler (KL) divergence, we propose an alternative formulation for estimating the GMM parameters using the sliced Wasserstein distance, which gives rise to a new algorithm. Specifically, we propose minimizing the sliced-Wasserstein distance between the mixture model and the data distribution with respect to the GMM parameters. In contrast to the KL-divergence, the energy landscape for the sliced-Wasserstein distance is more well-behaved and therefore more suitable for a stochastic gradient descent scheme to obtain the optimal GMM parameters. We show that our formulation results in parameter estimates that are more robust to random initializations and demonstrate that it can estimate high-dimensional data distributions more faithfully than the EM algorithm. △ Less

Submitted 15 November, 2017; v1 submitted 14 November, 2017; originally announced November 2017.

arXiv:1511.03198 [pdf, other]

Sliced Wasserstein Kernels for Probability Distributions

Authors: Soheil Kolouri, Yang Zou, Gustavo K. Rohde

Abstract: Optimal transport distances, otherwise known as Wasserstein distances, have recently drawn ample attention in computer vision and machine learning as a powerful discrepancy measure for probability distributions. The recent developments on alternative formulations of the optimal transport have allowed for faster solutions to the problem and has revamped its practical applications in machine learnin… ▽ More Optimal transport distances, otherwise known as Wasserstein distances, have recently drawn ample attention in computer vision and machine learning as a powerful discrepancy measure for probability distributions. The recent developments on alternative formulations of the optimal transport have allowed for faster solutions to the problem and has revamped its practical applications in machine learning. In this paper, we exploit the widely used kernel methods and provide a family of provably positive definite kernels based on the Sliced Wasserstein distance and demonstrate the benefits of these kernels in a variety of learning tasks. Our work provides a new perspective on the application of optimal transport flavored distances through kernel methods in machine learning tasks. △ Less

Submitted 10 November, 2015; originally announced November 2015.

Showing 1–22 of 22 results for author: Kolouri, S