Wasserstein Wormhole: Scalable Optimal Transport Distance with Transformers

Authors: Doron Haviv, Russell Zhang Kunes, Thomas Dougherty, Cassandra Burdziak, Tal Nawy, Anna Gilbert, Dana Pe'er

Abstract: Optimal transport (OT) and the related Wasserstein metric (W) are powerful and ubiquitous tools for comparing distributions. However, computing pairwise Wasserstein distances rapidly becomes intractable as cohort size grows. An attractive alternative would be to find an embedding space in which pairwise Euclidean distances map to OT distances, akin to standard multidimensional scaling (MDS). We present Wasserstein Wormhole, a transformer-based autoencoder that embeds empirical distributions into a latent space wherein Euclidean distances approximate OT distances. Extending MDS theory, we show that our objective function implies a bound on the error incurred when embedding non-Euclidean distances. Empirically, distances between Wormhole embeddings closely match Wasserstein distances, enabling linear time computation of OT distances. Along with an encoder that maps distributions to embeddings, Wasserstein Wormhole includes a decoder that maps embeddings back to distributions, allowing for operations in the embedding space to generalize to OT spaces, such as Wasserstein barycenter estimation and OT interpolation. By lending scalability and interpretability to OT approaches, Wasserstein Wormhole unlocks new avenues for data analysis in the fields of computational geometry and single-cell biology.

Submitted 3 June, 2024; v1 submitted 14 April, 2024; originally announced April 2024.

Comments: Published at the Forty-first International Conference on Machine Learning (ICML2024)
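
The embedding idea described in the abstract above has one special case that can be checked by hand. For one-dimensional samples of equal size, the 2-Wasserstein distance equals the Euclidean distance between the sorted samples scaled by 1/sqrt(n). The Python sketch below illustrates only that special case and is not the Wormhole model, which learns the embedding with a transformer; the names `w2_1d`, `embed_1d` and `euclidean` are hypothetical.

```python
import math


def w2_1d(xs, ys):
    """Exact 2-Wasserstein distance between two equal-size 1D samples.

    In one dimension the optimal coupling matches sorted points in order.
    """
    if len(xs) != len(ys):
        raise ValueError("this sketch assumes equal sample sizes")
    pairs = zip(sorted(xs), sorted(ys))
    return math.sqrt(sum((x - y) ** 2 for x, y in pairs) / len(xs))


def embed_1d(xs):
    """Hypothetical embedding: the sorted sample scaled by 1/sqrt(n)."""
    scale = 1.0 / math.sqrt(len(xs))
    return [scale * x for x in sorted(xs)]


def euclidean(u, v):
    """Plain Euclidean distance between two vectors of equal length."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


a = [0.3, 1.2, -0.5, 2.0]
b = [1.0, 0.1, 0.7, 3.5]

# The two numbers agree up to floating-point rounding.
print(w2_1d(a, b), euclidean(embed_1d(a), embed_1d(b)))
```

Once each sample is embedded, comparing two of them costs a single vector distance rather than a transport problem. In higher dimensions no such exact embedding exists in general, which is why the abstract refers to an approximation with an error bound.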

arXiv:2208.06124

Gradient Estimation for Binary Latent Variables via Gradient Variance Clipping

Authors: Russell Z. Kunes, Mingzhang Yin, Max Land, Doron Haviv, Dana Pe'er, Simon Tavaré

Abstract: Gradient estimation is often necessary for fitting generative models with discrete latent variables, in contexts such as reinforcement learning and variational autoencoder (VAE) training. The DisARM estimator (Yin et al. 2020; Dong, Mnih, and Tucker 2020) achieves state-of-the-art gradient variance for Bernoulli latent variable models in many contexts. However, DisARM and other estimators have potentially exploding variance near the boundary of the parameter space, where solutions tend to lie. To ameliorate this issue, we propose a new gradient estimator, bitflip-1, that has lower variance at the boundaries of the parameter space. As bitflip-1 has complementary properties to existing estimators, we introduce an aggregated estimator, unbiased gradient variance clipping (UGC), that uses either a bitflip-1 or a DisARM gradient update for each coordinate. We theoretically prove that UGC has uniformly lower variance than DisARM. Empirically, we observe that UGC achieves the optimal value of the optimization objectives in toy experiments, discrete VAE training, and in a best subset selection problem.

Submitted 12 August, 2022; originally announced August 2022.
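
The boundary problem mentioned in the abstract above can be seen in the simplest textbook case. The Python sketch below computes the exact mean and variance of the plain score-function (REINFORCE) estimator for a single Bernoulli variable parameterized by its probability p. This is an assumed stand-in chosen for illustration. It is not DisARM, bitflip-1, or UGC, and the abstract does not say which parameterization the paper uses. The function name and the values `f0` and `f1` are hypothetical.

```python
def score_function_stats(p, f0, f1):
    """Exact mean and variance of g(b) = f(b) * (b - p) / (p * (1 - p)).

    b ~ Bernoulli(p), f0 = f(0), f1 = f(1). The mean equals f1 - f0,
    which is the true derivative of E[f(b)] with respect to p.
    """
    if not 0.0 < p < 1.0:
        raise ValueError("p must lie strictly between 0 and 1")
    g1 = f1 / p            # estimator value when b = 1
    g0 = -f0 / (1.0 - p)   # estimator value when b = 0
    mean = p * g1 + (1.0 - p) * g0
    second_moment = p * g1 ** 2 + (1.0 - p) * g0 ** 2
    return mean, second_moment - mean ** 2


# Illustrative objective values (assumptions, not taken from the paper).
f0, f1 = 0.16, 0.36

for p in (0.5, 0.1, 0.01, 0.001):
    mean, var = score_function_stats(p, f0, f1)
    print(f"p={p}: mean={mean:.4f}, variance={var:.4f}")
```

The mean stays at f1 - f0 for every p, so the estimator is unbiased, while the variance grows roughly like f1**2 / p as p approaches 0.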

arXiv:1907.00205

doi 10.1038/s41586-021-03229-4

The Ramanujan Machine: Automatically Generated Conjectures on Fundamental Constants

Authors: Gal Raayoni, Shahar Gottlieb, George Pisha, Yoav Harris, Yahel Manor, Uri Mendlovic, Doron Haviv, Yaron Hadad, Ido Kaminer

Abstract: Fundamental mathematical constants like $e$ and $π$ are ubiquitous in diverse fields of science, from abstract mathematics to physics, biology and chemistry. For centuries, new formulas relating fundamental constants have been scarce and usually discovered sporadically. Here we propose a novel and systematic approach that leverages algorithms for deriving mathematical formulas for fundamental constants and help reveal their underlying structure. Our algorithms find dozens of well-known as well as previously unknown continued fraction representations of $π$, $e$, Catalan's constant, and values of the Riemann zeta function. Two example conjectures found by our algorithm and so far unproven are: \begin{equation*} \frac{24}{π^2} = 2 + 7\cdot 0\cdot 1+ \frac{8\cdot1^4}{2 + 7\cdot 1\cdot 2 + \frac{8\cdot2^4}{2 + 7\cdot 2\cdot 3 + \frac{8\cdot3^4}{2 + 7\cdot 3\cdot 4 + \frac{8\cdot4^4}{..}}}} \quad\quad,\quad\quad \frac{8}{7 ζ(3)} = 1\cdot 1 - \frac{1^6}{3\cdot 7 - \frac{2^6}{5\cdot 19 - \frac{3^6}{7\cdot 37 - \frac{4^6}{..}}}} \end{equation*} We present two algorithms that proved useful in finding conjectures: a Meet-In-The-Middle (MITM) algorithm and a Gradient Descent (GD) tailored to the recurrent structure of continued fractions. Both algorithms are based on matching numerical values and thus they conjecture formulas without providing proofs and without requiring prior knowledge on any underlying mathematical structure.
This approach is especially attractive for constants for which no mathematical structure is known, as it reverses the conventional approach of sequential logic in formal proofs. Instead, our work supports a different approach for research: algorithms utilizing numerical data to unveil mathematical structures, thus trying to play the role of intuition of great mathematicians of the past, providing leads to new mathematical research.

Submitted 30 April, 2020; v1 submitted 29 June, 2019; originally announced July 2019.

Comments: 5 figures, 6 tables, 28 pages including the supplementary information

Journal ref: Nature 590, 67-73 (2021)
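
The two example conjectures quoted in the abstract above can be checked numerically in a few lines. The Python sketch below evaluates each continued fraction from the innermost term outward and compares it with the closed form. The general terms are read off from the displayed pattern and are an assumption: a(n) = 2 + 7n(n+1) and b(n) = 8n^4 for the first formula, a(n) = (2n+1)(3n^2+3n+1) and b(n) = -n^6 for the second. The helper name `continued_fraction` is hypothetical, and this is a numerical check only, not the MITM or GD search described in the abstract.

```python
import math


def continued_fraction(a, b, depth):
    """Evaluate a(0) + b(1) / (a(1) + b(2) / (a(2) + ...)), cut off at `depth`."""
    value = a(depth)
    for n in range(depth, 0, -1):
        value = a(n - 1) + b(n) / value
    return value


ZETA3 = 1.2020569031595942  # numerical value of the Riemann zeta function at 3

pi_cf = continued_fraction(
    lambda n: 2 + 7 * n * (n + 1),   # 2 + 7*0*1, 2 + 7*1*2, ...
    lambda n: 8 * n ** 4,            # 8*1^4, 8*2^4, ...
    depth=40,
)
zeta_cf = continued_fraction(
    lambda n: (2 * n + 1) * (3 * n * n + 3 * n + 1),  # 1*1, 3*7, 5*19, 7*37, ...
    lambda n: -(n ** 6),                              # -1^6, -2^6, ...
    depth=40,
)

print(pi_cf, 24 / math.pi ** 2)
print(zeta_cf, 8 / (7 * ZETA3))
```

Agreement to many digits is evidence in the sense the abstract describes, matching numerical values, and does not constitute a proof.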

arXiv:1902.07275

Understanding and Controlling Memory in Recurrent Neural Networks

Authors: Doron Haviv, Alexander Rivkind, Omri Barak

Abstract: To be effective in sequential data processing, Recurrent Neural Networks (RNNs) are required to keep track of past events by creating memories. While the relation between memories and the network's hidden state dynamics was established over the last decade, previous works in this direction were of a predominantly descriptive nature focusing mainly on locating the dynamical objects of interest. In particular, it remained unclear how dynamical observables affect the performance, how they form and whether they can be manipulated. Here, we utilize different training protocols, datasets and architectures to obtain a range of networks solving a delayed classification task with similar performance, alongside substantial differences in their ability to extrapolate for longer delays. We analyze the dynamics of the network's hidden state, and uncover the reasons for this difference. Each memory is found to be associated with a nearly steady state of the dynamics which we refer to as a 'slow point'. Slow point speeds predict extrapolation performance across all datasets, protocols and architectures tested. Furthermore, by tracking the formation of the slow points we are able to understand the origin of differences between training protocols. Finally, we propose a novel regularization technique that is based on the relation between hidden state speeds and memory longevity. Our technique manipulates these speeds, thereby leading to a dramatic improvement in memory robustness over time, and could pave the way for a new class of regularization methods.

Submitted 16 September, 2019; v1 submitted 19 February, 2019; originally announced February 2019.

Comments: The link to the code was changed due to technical issues with the original repository. We no longer refer to the process shown in Figure 5C as a bifurcation diagram, but describe it in a more precise manner. We thank an anonymous reviewer for pointing this out
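
The notion of hidden-state speed in the abstract above can be made concrete with a toy model. The Python sketch below assumes a vanilla tanh RNN run with zero input and takes speed to be the Euclidean norm of the one-step change in the hidden state. Both choices are assumptions made for illustration, since the abstract does not give the paper's architectures or its exact definition. The weights and the names `rnn_step` and `speed` are hypothetical.

```python
import math


def rnn_step(h, W, b):
    """One zero-input update of a vanilla tanh RNN: h' = tanh(W h + b)."""
    return [
        math.tanh(sum(w * x for w, x in zip(row, h)) + bias)
        for row, bias in zip(W, b)
    ]


def speed(h, W, b):
    """Assumed speed measure: norm of the one-step change in hidden state."""
    h_next = rnn_step(h, W, b)
    return math.sqrt(sum((n - c) ** 2 for n, c in zip(h_next, h)))


# Toy two-unit network (illustrative weights, not from the paper).
W = [[1.2, 0.0], [0.0, 0.5]]
b = [0.0, 0.0]

h = [0.5, 0.5]
for t in range(200):
    if t % 50 == 0:
        print(t, speed(h, W, b))
    h = rnn_step(h, W, b)

final_speed = speed(h, W, b)
print("final", final_speed)
```

In this toy network the state settles toward a fixed point, so the speed falls toward zero along the trajectory. A state where the speed is small but nonzero would correspond to what the abstract calls a slow point.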

Showing 1–4 of 4 results for author: Haviv, D