-
Kan Extensions in Data Science and Machine Learning
Authors:
Dan Shiebler
Abstract:
A common problem in data science is "use this function defined over this small set to generate predictions over that larger set." Extrapolation, interpolation, statistical inference and forecasting all reduce to this problem. The Kan extension is a powerful tool in category theory that generalizes this notion. In this work we explore several applications of Kan extensions to data science. We begin…
▽ More
A common problem in data science is "use this function defined over this small set to generate predictions over that larger set." Extrapolation, interpolation, statistical inference and forecasting all reduce to this problem. The Kan extension is a powerful tool in category theory that generalizes this notion. In this work we explore several applications of Kan extensions to data science. We begin by deriving a simple classification algorithm as a Kan extension and experimenting with this algorithm on real data. Next, we use the Kan extension to derive a procedure for learning clustering algorithms from labels and explore the performance of this procedure on real data. We then investigate how Kan extensions can be used to learn a general map** from datasets of labeled examples to functions and to approximate a complex function with a simpler one.
△ Less
Submitted 26 July, 2022; v1 submitted 16 March, 2022;
originally announced March 2022.
-
Generalized Optimization: A First Step Towards Category Theoretic Learning Theory
Authors:
Dan Shiebler
Abstract:
The Cartesian reverse derivative is a categorical generalization of reverse-mode automatic differentiation. We use this operator to generalize several optimization algorithms, including a straightforward generalization of gradient descent and a novel generalization of Newton's method. We then explore which properties of these algorithms are preserved in this generalized setting. First, we show tha…
▽ More
The Cartesian reverse derivative is a categorical generalization of reverse-mode automatic differentiation. We use this operator to generalize several optimization algorithms, including a straightforward generalization of gradient descent and a novel generalization of Newton's method. We then explore which properties of these algorithms are preserved in this generalized setting. First, we show that the transformation invariances of these algorithms are preserved: while generalized Newton's method is invariant to all invertible linear transformations, generalized gradient descent is invariant only to orthogonal linear transformations. Next, we show that we can express the change in loss of generalized gradient descent with an inner product-like expression, thereby generalizing the non-increasing and convergence properties of the gradient descent optimization flow. Finally, we include several numerical experiments to illustrate the ideas in the paper and demonstrate how we can use them to optimize polynomial functions over an ordered ring.
△ Less
Submitted 20 September, 2021;
originally announced September 2021.
-
Category Theory in Machine Learning
Authors:
Dan Shiebler,
Bruno Gavranović,
Paul Wilson
Abstract:
Over the past two decades machine learning has permeated almost every realm of technology. At the same time, many researchers have begun using category theory as a unifying language, facilitating communication between different scientific disciplines. It is therefore unsurprising that there is a burgeoning interest in applying category theory to machine learning. We aim to document the motivations…
▽ More
Over the past two decades machine learning has permeated almost every realm of technology. At the same time, many researchers have begun using category theory as a unifying language, facilitating communication between different scientific disciplines. It is therefore unsurprising that there is a burgeoning interest in applying category theory to machine learning. We aim to document the motivations, goals and common themes across these applications. We touch on gradient-based learning, probability, and equivariant learning.
△ Less
Submitted 13 June, 2021;
originally announced June 2021.
-
Lessons Learned Addressing Dataset Bias in Model-Based Candidate Generation at Twitter
Authors:
Alim Virani,
Jay Baxter,
Dan Shiebler,
Philip Gautier,
Shivam Verma,
Yan Xia,
Apoorv Sharma,
Sumit Binnani,
Linlin Chen,
Chenguang Yu
Abstract:
Traditionally, heuristic methods are used to generate candidates for large scale recommender systems. Model-based candidate generation promises multiple potential advantages, primarily that we can explicitly optimize the same objective as the downstream ranking model. However, large scale model-based candidate generation approaches suffer from dataset bias problems caused by the infeasibility of o…
▽ More
Traditionally, heuristic methods are used to generate candidates for large scale recommender systems. Model-based candidate generation promises multiple potential advantages, primarily that we can explicitly optimize the same objective as the downstream ranking model. However, large scale model-based candidate generation approaches suffer from dataset bias problems caused by the infeasibility of obtaining representative data on very irrelevant candidates. Popular techniques to correct dataset bias, such as inverse propensity scoring, do not work well in the context of candidate generation. We first explore the dynamics of the dataset bias problem and then demonstrate how to use random sampling techniques to mitigate it. Finally, in a novel application of fine-tuning, we show performance gains when applying our candidate generation system to Twitter's home timeline.
△ Less
Submitted 13 May, 2021;
originally announced May 2021.
-
Flattening Multiparameter Hierarchical Clustering Functors
Authors:
Dan Shiebler
Abstract:
We bring together topological data analysis, applied category theory, and machine learning to study multiparameter hierarchical clustering. We begin by introducing a procedure for flattening multiparameter hierarchical clusterings. We demonstrate that this procedure is a functor from a category of multiparameter hierarchical partitions to a category of binary integer programs. We also include empi…
▽ More
We bring together topological data analysis, applied category theory, and machine learning to study multiparameter hierarchical clustering. We begin by introducing a procedure for flattening multiparameter hierarchical clusterings. We demonstrate that this procedure is a functor from a category of multiparameter hierarchical partitions to a category of binary integer programs. We also include empirical results demonstrating its effectiveness. Next, we introduce a Bayesian update algorithm for learning clustering parameters from data. We demonstrate that the composition of this algorithm with our flattening procedure satisfies a consistency property.
△ Less
Submitted 29 April, 2021;
originally announced April 2021.
-
Functorial Manifold Learning
Authors:
Dan Shiebler
Abstract:
We adapt previous research on category theory and topological unsupervised learning to develop a functorial perspective on manifold learning, also known as nonlinear dimensionality reduction. We first characterize manifold learning algorithms as functors that map pseudometric spaces to optimization objectives and that factor through hierarchical clustering functors. We then use this characteriza…
▽ More
We adapt previous research on category theory and topological unsupervised learning to develop a functorial perspective on manifold learning, also known as nonlinear dimensionality reduction. We first characterize manifold learning algorithms as functors that map pseudometric spaces to optimization objectives and that factor through hierarchical clustering functors. We then use this characterization to prove refinement bounds on manifold learning loss functions and construct a hierarchy of manifold learning algorithms based on their equivariants. We express several popular manifold learning algorithms as functors at different levels of this hierarchy, including Metric Multidimensional Scaling, IsoMap, and UMAP. Next, we use interleaving distance to study the stability of a broad class of manifold learning algorithms. We present bounds on how closely the embeddings these algorithms produce from noisy data approximate the embeddings they would learn from noiseless data. Finally, we use our framework to derive a set of novel manifold learning algorithms, which we experimentally demonstrate are competitive with the state of the art.
△ Less
Submitted 3 November, 2022; v1 submitted 14 November, 2020;
originally announced November 2020.
-
Tuning Word2vec for Large Scale Recommendation Systems
Authors:
Benjamin P. Chamberlain,
Emanuele Rossi,
Dan Shiebler,
Suvash Sedhain,
Michael M. Bronstein
Abstract:
Word2vec is a powerful machine learning tool that emerged from Natural Lan-guage Processing (NLP) and is now applied in multiple domains, including recom-mender systems, forecasting, and network analysis. As Word2vec is often used offthe shelf, we address the question of whether the default hyperparameters are suit-able for recommender systems. The answer is emphatically no. In this paper, wefirst…
▽ More
Word2vec is a powerful machine learning tool that emerged from Natural Lan-guage Processing (NLP) and is now applied in multiple domains, including recom-mender systems, forecasting, and network analysis. As Word2vec is often used offthe shelf, we address the question of whether the default hyperparameters are suit-able for recommender systems. The answer is emphatically no. In this paper, wefirst elucidate the importance of hyperparameter optimization and show that un-constrained optimization yields an average 221% improvement in hit rate over thedefault parameters. However, unconstrained optimization leads to hyperparametersettings that are very expensive and not feasible for large scale recommendationtasks. To this end, we demonstrate 138% average improvement in hit rate with aruntime budget-constrained hyperparameter optimization. Furthermore, to makehyperparameter optimization applicable for large scale recommendation problemswhere the target dataset is too large to search over, we investigate generalizinghyperparameters settings from samples. We show that applying constrained hy-perparameter optimization using only a 10% sample of the data still yields a 91%average improvement in hit rate over the default parameters when applied to thefull datasets. Finally, we apply hyperparameters learned using our method of con-strained optimization on a sample to the Who To Follow recommendation serviceat Twitter and are able to increase follow rates by 15%.
△ Less
Submitted 24 September, 2020;
originally announced September 2020.
-
Categorical Stochastic Processes and Likelihood
Authors:
Dan Shiebler
Abstract:
In this work we take a Category Theoretic perspective on the relationship between probabilistic modeling and function approximation. We begin by defining two extensions of function composition to stochastic process subordination: one based on the co-Kleisli category under the comonad (Omega x -) and one based on the parameterization of a category with a Lawvere theory. We show how these extensions…
▽ More
In this work we take a Category Theoretic perspective on the relationship between probabilistic modeling and function approximation. We begin by defining two extensions of function composition to stochastic process subordination: one based on the co-Kleisli category under the comonad (Omega x -) and one based on the parameterization of a category with a Lawvere theory. We show how these extensions relate to the category Stoch and other Markov Categories. Next, we apply the Para construction to extend stochastic processes to parameterized statistical models and we define a way to compose the likelihood functions of these models. We conclude with a demonstration of how the Maximum Likelihood Estimation procedure defines an identity-on-objects functor from the category of statistical models to the category of Learners. Code to accompany this paper can be found at https://github.com/dshieble/Categorical_Stochastic_Processes_and_Likelihood
△ Less
Submitted 9 January, 2022; v1 submitted 10 May, 2020;
originally announced May 2020.
-
Incremental Monoidal Grammars
Authors:
Dan Shiebler,
Alexis Toumi,
Mehrnoosh Sadrzadeh
Abstract:
In this work we define formal grammars in terms of free monoidal categories, along with a functor from the category of formal grammars to the category of automata. Generalising from the Booleans to arbitrary semirings, we extend our construction to weighted formal grammars and weighted automata. This allows us to link the categorical viewpoint on natural language to the standard machine learning n…
▽ More
In this work we define formal grammars in terms of free monoidal categories, along with a functor from the category of formal grammars to the category of automata. Generalising from the Booleans to arbitrary semirings, we extend our construction to weighted formal grammars and weighted automata. This allows us to link the categorical viewpoint on natural language to the standard machine learning notion of probabilistic language model.
△ Less
Submitted 10 January, 2020; v1 submitted 2 January, 2020;
originally announced January 2020.
-
Fighting Redundancy and Model Decay with Embeddings
Authors:
Dan Shiebler,
Luca Belli,
Jay Baxter,
Hanchen Xiong,
Abhishek Tayal
Abstract:
Every day, hundreds of millions of new Tweets containing over 40 languages of ever-shifting vernacular flow through Twitter. Models that attempt to extract insight from this firehose of information must face the torrential covariate shift that is endemic to the Twitter platform. While regularly-retrained algorithms can maintain performance in the face of this shift, fixed model features that fail…
▽ More
Every day, hundreds of millions of new Tweets containing over 40 languages of ever-shifting vernacular flow through Twitter. Models that attempt to extract insight from this firehose of information must face the torrential covariate shift that is endemic to the Twitter platform. While regularly-retrained algorithms can maintain performance in the face of this shift, fixed model features that fail to represent new trends and tokens can quickly become stale, resulting in performance degradation. To mitigate this problem we employ learned features, or embedding models, that can efficiently represent the most relevant aspects of a data distribution. Sharing these embedding models across teams can also reduce redundancy and multiplicatively increase cross-team modeling productivity. In this paper, we detail the commoditized tools, algorithms and pipelines that we have developed and are develo** at Twitter to regularly generate high quality, up-to-date embeddings and share them broadly across the company.
△ Less
Submitted 18 September, 2018;
originally announced September 2018.
-
A Correlation Maximization Approach for Cross Domain Co-Embeddings
Authors:
Dan Shiebler
Abstract:
Although modern recommendation systems can exploit the structure in users' item feedback, most are powerless in the face of new users who provide no structure for them to exploit. In this paper we introduce ImplicitCE, an algorithm for recommending items to new users during their sign-up flow. ImplicitCE works by transforming users' implicit feedback towards auxiliary domain items into an embeddin…
▽ More
Although modern recommendation systems can exploit the structure in users' item feedback, most are powerless in the face of new users who provide no structure for them to exploit. In this paper we introduce ImplicitCE, an algorithm for recommending items to new users during their sign-up flow. ImplicitCE works by transforming users' implicit feedback towards auxiliary domain items into an embedding in the target domain item embedding space. ImplicitCE learns these embedding spaces and transformation function in an end-to-end fashion and can co-embed users and items with any differentiable similarity function. To train ImplicitCE we explore methods for maximizing the correlations between model predictions and users' affinities and introduce Sample Correlation Update, a novel and extremely simple training strategy. Finally, we show that ImplicitCE trained with Sample Correlation Update outperforms a variety of state of the art algorithms and loss functions on both a large scale Twitter dataset and the DBLP dataset.
△ Less
Submitted 10 September, 2018;
originally announced September 2018.
-
Learning what and where to attend
Authors:
Drew Linsley,
Dan Shiebler,
Sven Eberhardt,
Thomas Serre
Abstract:
Most recent gains in visual recognition have originated from the inclusion of attention mechanisms in deep convolutional networks (DCNs). Because these networks are optimized for object recognition, they learn where to attend using only a weak form of supervision derived from image class labels. Here, we demonstrate the benefit of using stronger supervisory signals by teaching DCNs to attend to im…
▽ More
Most recent gains in visual recognition have originated from the inclusion of attention mechanisms in deep convolutional networks (DCNs). Because these networks are optimized for object recognition, they learn where to attend using only a weak form of supervision derived from image class labels. Here, we demonstrate the benefit of using stronger supervisory signals by teaching DCNs to attend to image regions that humans deem important for object recognition. We first describe a large-scale online experiment (ClickMe) used to supplement ImageNet with nearly half a million human-derived "top-down" attention maps. Using human psychophysics, we confirm that the identified top-down features from ClickMe are more diagnostic than "bottom-up" saliency features for rapid image categorization. As a proof of concept, we extend a state-of-the-art attention network and demonstrate that adding ClickMe supervision significantly improves its accuracy and yields visual features that are more interpretable and more similar to those used by human observers.
△ Less
Submitted 11 June, 2019; v1 submitted 22 May, 2018;
originally announced May 2018.