Contextual Counting: A Mechanistic Study of Transformers on a Quantitative Task

Authors: Siavash Golkar, Alberto Bietti, Mariel Pettee, Michael Eickenberg, Miles Cranmer, Keiya Hirashima, Geraud Krawezik, Nicholas Lourie, Michael McCabe, Rudy Morel, Ruben Ohana, Liam Holden Parker, Bruno Régaldo-Saint Blancard, Kyunghyun Cho, Shirley Ho

Abstract: Transformers have revolutionized machine learning across diverse domains, yet understanding their behavior remains crucial, particularly in high-stakes applications. This paper introduces the contextual counting task, a novel toy problem aimed at enhancing our understanding of Transformers in quantitative and scientific contexts. This task requires precise localization and computation within datasets, akin to object detection or region-based scientific analysis. We present theoretical and empirical analysis using both causal and non-causal Transformer architectures, investigating the influence of various positional encodings on performance and interpretability. In particular, we find that causal attention is much better suited for the task, and that using no positional embeddings leads to the best accuracy, though rotary embeddings are competitive and easier to train. We also show that out-of-distribution performance is tightly linked to which tokens the model uses as a bias term.

Submitted 30 May, 2024; originally announced June 2024.

arXiv:2402.19455

Listening to the Noise: Blind Denoising with Gibbs Diffusion

Authors: David Heurtel-Depeiges, Charles C. Margossian, Ruben Ohana, Bruno Régaldo-Saint Blancard

Abstract: In recent years, denoising problems have become intertwined with the development of deep generative models. In particular, diffusion models are trained like denoisers, and the distribution they model coincides with denoising priors in the Bayesian picture. However, denoising through diffusion-based posterior sampling requires the noise level and covariance to be known, preventing blind denoising. We overcome this limitation by introducing Gibbs Diffusion (GDiff), a general methodology addressing posterior sampling of both the signal and the noise parameters. Assuming arbitrary parametric Gaussian noise, we develop a Gibbs algorithm that alternates sampling steps from a conditional diffusion model trained to map the signal prior to the family of noise distributions, and a Monte Carlo sampler to infer the noise parameters. Our theoretical analysis highlights potential pitfalls, guides diagnostic usage, and quantifies errors in the Gibbs stationary distribution caused by the diffusion model. We showcase our method for 1) blind denoising of natural images involving colored noises with unknown amplitude and spectral index, and 2) a cosmology problem, namely the analysis of cosmic microwave background data, where Bayesian inference of "noise" parameters means constraining models of the evolution of the Universe.

Submitted 25 June, 2024; v1 submitted 29 February, 2024; originally announced February 2024.

Comments: 12+9 pages, 7+5 figures, 1+1 tables; accepted to 2024 International Conference on Machine Learning; code: https://github.com/rubenohana/Gibbs-Diffusion

arXiv:2310.16285

Removing Dust from CMB Observations with Diffusion Models

Authors: David Heurtel-Depeiges, Blakesley Burkhart, Ruben Ohana, Bruno Régaldo-Saint Blancard

Abstract: In cosmology, the quest for primordial $B$-modes in cosmic microwave background (CMB) observations has highlighted the critical need for a refined model of the Galactic dust foreground. We investigate diffusion-based modeling of the dust foreground and its interest for component separation. Under the assumption of a Gaussian CMB with known cosmology (or covariance matrix), we show that diffusion models can be trained on examples of dust emission maps such that their sampling process directly coincides with posterior sampling in the context of component separation. We illustrate this on simulated mixtures of dust emission and CMB. We show that common summary statistics (power spectrum, Minkowski functionals) of the components are well recovered by this process. We also introduce a model conditioned by the CMB cosmology that outperforms models trained using a single cosmology on component separation. Such a model will be used in future work for diffusion-based cosmological inference.

Submitted 11 December, 2023; v1 submitted 24 October, 2023; originally announced October 2023.

Comments: 5+6 pages, 2+3 figures, accepted at the NeurIPS 2023 workshop on "Machine Learning and the Physical Sciences" and selected for a spotlight talk

arXiv:2310.03024

doi 10.1093/mnras/stae1450

AstroCLIP: A Cross-Modal Foundation Model for Galaxies

Authors: Liam Parker, Francois Lanusse, Siavash Golkar, Leopoldo Sarra, Miles Cranmer, Alberto Bietti, Michael Eickenberg, Geraud Krawezik, Michael McCabe, Ruben Ohana, Mariel Pettee, Bruno Regaldo-Saint Blancard, Tiberiu Tesileanu, Kyunghyun Cho, Shirley Ho

Abstract: We present AstroCLIP, a single, versatile model that can embed both galaxy images and spectra into a shared, physically meaningful latent space. These embeddings can then be used - without any model fine-tuning - for a variety of downstream tasks including (1) accurate in-modality and cross-modality semantic similarity search, (2) photometric redshift estimation, (3) galaxy property estimation from both images and spectra, and (4) morphology classification. Our approach to implementing AstroCLIP consists of two parts. First, we embed galaxy images and spectra separately by pretraining separate transformer-based image and spectrum encoders in self-supervised settings. We then align the encoders using a contrastive loss. We apply our method to spectra from the Dark Energy Spectroscopic Instrument and images from its corresponding Legacy Imaging Survey. Overall, we find remarkable performance on all downstream tasks, even relative to supervised baselines. For example, for a task like photometric redshift prediction, we find similar performance to a specifically-trained ResNet18, and for additional tasks like physical property estimation (stellar mass, age, metallicity, and sSFR), we beat this supervised baseline by 19% in terms of $R^2$. We also compare our results to a state-of-the-art self-supervised single-modal model for galaxy images, and find that our approach outperforms this benchmark by roughly a factor of two on photometric redshift estimation and physical property prediction in terms of $R^2$, while remaining roughly in line in terms of morphology classification. Ultimately, our approach represents the first cross-modal self-supervised model for galaxies, and the first self-supervised transformer-based architectures for galaxy images and spectra.

Submitted 14 June, 2024; v1 submitted 4 October, 2023; originally announced October 2023.

Comments: 18 pages, accepted in Monthly Notices of the Royal Astronomical Society, Presented at the NeurIPS 2023 AI4Science Workshop

arXiv:2310.02994

Multiple Physics Pretraining for Physical Surrogate Models

Authors: Michael McCabe, Bruno Régaldo-Saint Blancard, Liam Holden Parker, Ruben Ohana, Miles Cranmer, Alberto Bietti, Michael Eickenberg, Siavash Golkar, Geraud Krawezik, Francois Lanusse, Mariel Pettee, Tiberiu Tesileanu, Kyunghyun Cho, Shirley Ho

Abstract: We introduce multiple physics pretraining (MPP), an autoregressive task-agnostic pretraining approach for physical surrogate modeling. MPP involves training large surrogate models to predict the dynamics of multiple heterogeneous physical systems simultaneously by learning features that are broadly useful across diverse physical tasks. In order to learn effectively in this setting, we introduce a shared embedding and normalization strategy that projects the fields of multiple systems into a single shared embedding space. We validate the efficacy of our approach on both pretraining and downstream tasks over a broad fluid mechanics-oriented benchmark. We show that a single MPP-pretrained transformer is able to match or outperform task-specific baselines on all pretraining sub-tasks without the need for finetuning. For downstream tasks, we demonstrate that finetuning MPP-trained models results in more accurate predictions across multiple time-steps on new physics compared to training from scratch or finetuning pretrained video foundation models. We open-source our code and model weights trained at multiple scales for reproducibility and community experimentation.

Submitted 4 October, 2023; originally announced October 2023.

arXiv:2310.02989

xVal: A Continuous Number Encoding for Large Language Models

Authors: Siavash Golkar, Mariel Pettee, Michael Eickenberg, Alberto Bietti, Miles Cranmer, Geraud Krawezik, Francois Lanusse, Michael McCabe, Ruben Ohana, Liam Parker, Bruno Régaldo-Saint Blancard, Tiberiu Tesileanu, Kyunghyun Cho, Shirley Ho

Abstract: Large Language Models have not yet been broadly adapted for the analysis of scientific datasets due in part to the unique difficulties of tokenizing numbers. We propose xVal, a numerical encoding scheme that represents any real number using just a single token. xVal represents a given real number by scaling a dedicated embedding vector by the number value. Combined with a modified number-inference approach, this strategy renders the model end-to-end continuous when considered as a map from the numbers of the input string to those of the output string. This leads to an inductive bias that is generally more suitable for applications in scientific domains. We empirically evaluate our proposal on a number of synthetic and real-world datasets. Compared with existing number encoding schemes, we find that xVal is more token-efficient and demonstrates improved generalization.

Submitted 4 October, 2023; originally announced October 2023.

Comments: 10 pages 7 figures. Supplementary: 5 pages 2 figures
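
Illustration (not from the paper): the abstract's description of scaling a dedicated embedding vector by the number value can be sketched as below. This is a minimal sketch under stated assumptions; the toy vocabulary `VOCAB`, the random table `EMB`, the `[NUM]` token name, and the function `xval_embed` are all hypothetical, and the modified number-inference step mentioned in the abstract is not shown.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy vocabulary and embedding table (d = 8); not from the paper.
VOCAB = {"mass": 0, "=": 1, "[NUM]": 2}
EMB = rng.normal(size=(len(VOCAB), 8))


def xval_embed(token_ids, values):
    """Embed a sequence, scaling the shared [NUM] embedding by each number's value.

    token_ids: sequence of vocabulary indices.
    values: per-position numeric values; ignored wherever the token is not [NUM].
    """
    ids = np.asarray(token_ids)
    scale = np.ones(len(ids))
    is_num = ids == VOCAB["[NUM]"]
    # Every number shares one token; only the scale of its embedding changes.
    scale[is_num] = np.asarray(values, dtype=float)[is_num]
    return EMB[ids] * scale[:, None]


# "mass = 3.5" becomes three tokens, the last one carrying the value 3.5.
x = xval_embed([0, 1, 2], [0.0, 0.0, 3.5])
```

Because the embedding is linear in the value, nearby numbers map to nearby vectors, which is one way to read the continuity claim in the abstract.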

arXiv:2305.12988

doi 10.1364/OE.496224

Linear Optical Random Projections Without Holography

Authors: Ruben Ohana, Daniel Hesslow, Daniel Brunner, Sylvain Gigan, Kilian Müller

Abstract: We introduce a novel method to perform linear optical random projections without the need for holography. Our method consists of a computationally trivial combination of multiple intensity measurements to mitigate the information loss usually associated with the absolute-square non-linearity imposed by optical intensity measurements. Both experimental and numerical findings demonstrate that the resulting matrix consists of real-valued, independent, and identically distributed (i.i.d.) Gaussian random entries. Our optical setup is simple and robust, as it does not require interference between two beams. We demonstrate the practical applicability of our method by performing dimensionality reduction on high-dimensional data, a common task in randomized numerical linear algebra with relevant applications in machine learning.

Submitted 22 May, 2023; originally announced May 2023.

Comments: 7 pages, 4 figures

Journal ref: Opt. Express 31, 25881-25888 (2023)

arXiv:2305.07583

MoMo: Momentum Models for Adaptive Learning Rates

Authors: Fabian Schaipp, Ruben Ohana, Michael Eickenberg, Aaron Defazio, Robert M. Gower

Abstract: Training a modern machine learning architecture on a new task requires extensive learning-rate tuning, which comes at a high computational cost. Here we develop new Polyak-type adaptive learning rates that can be used on top of any momentum method, and require less tuning to perform well. We first develop MoMo, a Momentum Model based adaptive learning rate for SGD-M (stochastic gradient descent with momentum). MoMo uses momentum estimates of the losses and gradients sampled at each iteration to build a model of the loss function. Our model makes use of any known lower bound of the loss function by using truncation, e.g. most losses are lower-bounded by zero. The model is then approximately minimized at each iteration to compute the next step. We show how MoMo can be used in combination with any momentum-based method, and showcase this by developing MoMo-Adam, which is Adam with our new model-based adaptive learning rate. We show that MoMo attains an $\mathcal{O}(1/\sqrt{K})$ convergence rate for convex problems with interpolation, needing knowledge of no problem-specific quantities other than the optimal value. Additionally, for losses with unknown lower bounds, we develop on-the-fly estimates of a lower bound, that are incorporated in our model. We show that MoMo and MoMo-Adam improve over SGD-M and Adam in terms of robustness to hyperparameter tuning for training image classifiers on MNIST, CIFAR, and Imagenet, for recommender systems on Criteo, for a transformer model on the translation task IWSLT14, and for a diffusion model.

Submitted 5 June, 2024; v1 submitted 12 May, 2023; originally announced May 2023.

MSC Class: 90C53; 74S60; 90C06; 62L20; 68W20; 15B52; 65Y20; 68W40 ACM Class: G.1.6

arXiv:2206.03230

Shedding a PAC-Bayesian Light on Adaptive Sliced-Wasserstein Distances

Authors: Ruben Ohana, Kimia Nadjahi, Alain Rakotomamonjy, Liva Ralaivola

Abstract: The Sliced-Wasserstein distance (SW) is a computationally efficient and theoretically grounded alternative to the Wasserstein distance. Yet, the literature on its statistical properties -- or, more accurately, its generalization properties -- with respect to the distribution of slices, beyond the uniform measure, is scarce. To bring new contributions to this line of research, we leverage the PAC-Bayesian theory and a central observation that SW may be interpreted as an average risk, the quantity PAC-Bayesian bounds have been designed to characterize. We provide three types of results: i) PAC-Bayesian generalization bounds that hold on what we refer to as adaptive Sliced-Wasserstein distances, i.e. SW defined with respect to arbitrary distributions of slices (among which data-dependent distributions), ii) a principled procedure to learn the distribution of slices that yields maximally discriminative SW, by optimizing our theoretical bounds, and iii) empirical illustrations of our theoretical findings.

Submitted 31 May, 2023; v1 submitted 7 June, 2022; originally announced June 2022.
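
Illustration (not from the paper): for readers unfamiliar with SW, the standard Monte Carlo estimator with slices drawn from the uniform measure on the sphere can be sketched as below. This is the uniform baseline the abstract contrasts against, not the adaptive, learned slice distribution the paper studies. The function name `sliced_wasserstein` and its parameters are hypothetical, and the sketch assumes both samples contain the same number of points.

```python
import numpy as np


def sliced_wasserstein(x, y, n_slices=200, p=2, seed=0):
    """Monte Carlo estimate of SW_p between two equal-size samples x, y of shape (n, d)."""
    rng = np.random.default_rng(seed)
    d = x.shape[1]
    # Draw slice directions uniformly on the unit sphere.
    theta = rng.normal(size=(n_slices, d))
    theta /= np.linalg.norm(theta, axis=1, keepdims=True)
    # Project both samples onto every slice; each column is a 1D sample.
    xp = np.sort(x @ theta.T, axis=0)
    yp = np.sort(y @ theta.T, axis=0)
    # For equal-size 1D samples, W_p^p is the mean p-th power gap between sorted points.
    # Averaging over both axes also averages over slices.
    return np.mean(np.abs(xp - yp) ** p) ** (1.0 / p)


rng = np.random.default_rng(1)
a = rng.normal(size=(500, 3))
same = sliced_wasserstein(a, a)           # identical samples give zero
shifted = sliced_wasserstein(a, a + 1.0)  # a translated copy gives a positive distance
```

Replacing the uniform draw of `theta` with draws from another distribution over directions is where an adaptive variant would differ.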

arXiv:2202.02031

Complex-to-Real Sketches for Tensor Products with Applications to the Polynomial Kernel

Authors: Jonas Wacker, Ruben Ohana, Maurizio Filippone

Abstract: Randomized sketches of a tensor product of $p$ vectors follow a tradeoff between statistical efficiency and computational acceleration. Commonly used approaches avoid computing the high-dimensional tensor product explicitly, resulting in a suboptimal dependence of $\mathcal{O}(3^p)$ in the embedding dimension. We propose a simple Complex-to-Real (CtR) modification of well-known sketches that replaces real random projections by complex ones, incurring a lower $\mathcal{O}(2^p)$ factor in the embedding dimension. The output of our sketches is real-valued, which renders their downstream use straightforward. In particular, we apply our sketches to $p$-fold self-tensored inputs corresponding to the feature maps of the polynomial kernel. We show that our method achieves state-of-the-art performance in terms of accuracy and speed compared to other randomized approximations from the literature.

Submitted 30 April, 2023; v1 submitted 4 February, 2022; originally announced February 2022.

Comments: 32 pages

arXiv:2108.04217

ROPUST: Improving Robustness through Fine-tuning with Photonic Processors and Synthetic Gradients

Authors: Alessandro Cappelli, Julien Launay, Laurent Meunier, Ruben Ohana, Iacopo Poli

Abstract: Robustness to adversarial attacks is typically obtained through expensive adversarial training with Projected Gradient Descent. Here we introduce ROPUST, a remarkably simple and efficient method to leverage robust pre-trained models and further increase their robustness, at no cost in natural accuracy. Our technique relies on the use of an Optical Processing Unit (OPU), a photonic co-processor, and a fine-tuning step performed with Direct Feedback Alignment, a synthetic gradient training scheme. We test our method on nine different models against four attacks in RobustBench, consistently improving over state-of-the-art performance. We perform an ablation study on the single components of our defense, showing that robustness arises from parameter obfuscation and the alternative training method. We also introduce phase retrieval attacks, specifically designed to increase the threat level of attackers against our own defense. We show that even with state-of-the-art phase retrieval techniques, ROPUST remains an effective defense.

Submitted 6 July, 2021; originally announced August 2021.

Comments: 12 pages, 7 figures

arXiv:2107.11814

LightOn Optical Processing Unit: Scaling-up AI and HPC with a Non von Neumann co-processor

Authors: Charles Brossollet, Alessandro Cappelli, Igor Carron, Charidimos Chaintoutis, Amélie Chatelain, Laurent Daudet, Sylvain Gigan, Daniel Hesslow, Florent Krzakala, Julien Launay, Safa Mokaadi, Fabien Moreau, Kilian Müller, Ruben Ohana, Gustave Pariente, Iacopo Poli, Elena Tommasone

Abstract: We introduce LightOn's Optical Processing Unit (OPU), the first photonic AI accelerator chip available on the market for at-scale Non von Neumann computations, reaching 1500 TeraOPS. It relies on a combination of free-space optics with off-the-shelf components, together with a software API allowing a seamless integration within Python-based processing pipelines. We discuss a variety of use cases and hybrid network architectures, with the OPU used in combination with CPU/GPU, and draw a pathway towards "optical advantage".

Submitted 25 July, 2021; originally announced July 2021.

Comments: Proceedings IEEE Hot Chips 33, 2021

arXiv:2106.03645

Photonic Differential Privacy with Direct Feedback Alignment

Authors: Ruben Ohana, Hamlet J. Medina Ruiz, Julien Launay, Alessandro Cappelli, Iacopo Poli, Liva Ralaivola, Alain Rakotomamonjy

Abstract: Optical Processing Units (OPUs) -- low-power photonic chips dedicated to large scale random projections -- have been used in previous work to train deep neural networks using Direct Feedback Alignment (DFA), an effective alternative to backpropagation. Here, we demonstrate how to leverage the intrinsic noise of optical random projections to build a differentially private DFA mechanism, making OPUs a solution of choice to provide a private-by-design training. We provide a theoretical analysis of our adaptive privacy mechanism, carefully measuring how the noise of optical random projections propagates in the process and gives rise to provable Differential Privacy. Finally, we conduct experiments demonstrating the ability of our learning procedure to achieve solid end-task performance.

Submitted 25 March, 2022; v1 submitted 7 June, 2021; originally announced June 2021.

Journal ref: NeurIPS 2021

arXiv:2104.14429

Photonic co-processors in HPC: using LightOn OPUs for Randomized Numerical Linear Algebra

Authors: Daniel Hesslow, Alessandro Cappelli, Igor Carron, Laurent Daudet, Raphaël Lafargue, Kilian Müller, Ruben Ohana, Gustave Pariente, Iacopo Poli

Abstract: Randomized Numerical Linear Algebra (RandNLA) is a powerful class of methods, widely used in High Performance Computing (HPC). RandNLA provides approximate solutions to linear algebra functions applied to large signals, at reduced computational costs. However, the randomization step for dimensionality reduction may itself become the computational bottleneck on traditional hardware. Leveraging near constant-time linear random projections delivered by LightOn Optical Processing Units, we show that randomization can be significantly accelerated, at negligible precision loss, in a wide range of important RandNLA algorithms, such as RandSVD or trace estimators.

Submitted 7 May, 2021; v1 submitted 29 April, 2021; originally announced April 2021.

Comments: Add "This project has received funding from the European Union's Horizon 2020 research and innovation programme under the Marie Sklodowska-Curie grant agreement No 860830"

arXiv:2101.02115

doi 10.1109/ICASSP43922.2022.9746671

Adversarial Robustness by Design through Analog Computing and Synthetic Gradients

Authors: Alessandro Cappelli, Ruben Ohana, Julien Launay, Laurent Meunier, Iacopo Poli, Florent Krzakala

Abstract: We propose a new defense mechanism against adversarial attacks inspired by an optical co-processor, providing robustness without compromising natural accuracy in both white-box and black-box settings. This hardware co-processor performs a nonlinear fixed random transformation, where the parameters are unknown and impossible to retrieve with sufficient precision for large enough dimensions. In the white-box setting, our defense works by obfuscating the parameters of the random projection. Unlike other defenses relying on obfuscated gradients, we find we are unable to build a reliable backward differentiable approximation for obfuscated parameters. Moreover, while our model reaches a good natural accuracy with a hybrid backpropagation - synthetic gradient method, the same approach is suboptimal if employed to generate adversarial examples. We find the combination of a random projection and binarization in the optical system also improves robustness against various types of black-box attacks. Finally, our hybrid training method builds robust features against transfer attacks. We demonstrate our approach on a VGG-like architecture, placing the defense on top of the convolutional features, on CIFAR-10 and CIFAR-100. Code is available at https://github.com/lightonai/adversarial-robustness-by-design.

Submitted 6 January, 2021; originally announced January 2021.

Journal ref: ICASSP 2022 - IEEE International Conference on Acoustics, Speech and Signal Processing

arXiv:2011.12428

Align, then memorise: the dynamics of learning with feedback alignment

Authors: Maria Refinetti, Stéphane d'Ascoli, Ruben Ohana, Sebastian Goldt

Abstract: Direct Feedback Alignment (DFA) is emerging as an efficient and biologically plausible alternative to the ubiquitous backpropagation algorithm for training deep neural networks. Despite relying on random feedback weights for the backward pass, DFA successfully trains state-of-the-art models such as Transformers. On the other hand, it notoriously fails to train convolutional networks. An understanding of the inner workings of DFA to explain these diverging results remains elusive. Here, we propose a theory for the success of DFA. We first show that learning in shallow networks proceeds in two steps: an alignment phase, where the model adapts its weights to align the approximate gradient with the true gradient of the loss function, is followed by a memorisation phase, where the model focuses on fitting the data. This two-step process has a degeneracy breaking effect: out of all the low-loss solutions in the landscape, a network trained with DFA naturally converges to the solution which maximises gradient alignment. We also identify a key quantity underlying alignment in deep linear networks: the conditioning of the alignment matrices. The latter enables a detailed understanding of the impact of data structure on alignment, and suggests a simple explanation for the well-known failure of DFA to train convolutional neural networks. Numerical experiments on MNIST and CIFAR10 clearly demonstrate degeneracy breaking in deep non-linear networks and show that the align-then-memorise process occurs sequentially from the bottom layers of the network to the top.

Submitted 10 June, 2021; v1 submitted 24 November, 2020; originally announced November 2020.

Comments: The accompanying code for this paper is available at https://github.com/sdascoli/dfa-dynamics

Journal ref: Proceedings of the 38th International Conference on Machine Learning (ICML), PMLR 139, 2021

arXiv:2010.13278

doi 10.1103/PhysRevA.103.062220

Experimental Approach to Demonstrating Contextuality for Qudits

Authors: Adel Sohbi, Ruben Ohana, Isabelle Zaquine, Eleni Diamanti, Damian Markham

Abstract: We propose a method to experimentally demonstrate contextuality with a family of tests for qudits. The experiment we propose uses a qudit encoded in the path of a single photon and its temporal degrees of freedom. We consider the impact of noise on the effectiveness of these tests, taking the approach of ontologically faithful non-contextuality. In this approach, imperfections in the experimental setup must be taken into account in any faithful ontological (classical) model, which limits how much the statistics can deviate within different contexts. In this way we bound the precision of the experimental setup under which ontologically faithful non-contextual models can be refuted. We further consider the noise tolerance through different types of decoherence models on different types of encodings of qudits. We quantify the effect of the decoherence on the required precision for the experimental setup in order to demonstrate contextuality in this broader sense.

Submitted 25 October, 2020; originally announced October 2020.

Journal ref: Phys. Rev. A 103, 062220 (2021)

arXiv:2006.07310 [pdf, other]

Reservoir Computing meets Recurrent Kernels and Structured Transforms

Authors: Jonathan Dong, Ruben Ohana, Mushegh Rafayelyan, Florent Krzakala

Abstract: Reservoir Computing is a class of simple yet efficient Recurrent Neural Networks where internal weights are fixed at random and only a linear output layer is trained. In the large size limit, such random neural networks have a deep connection with kernel methods. Our contributions are threefold: a) We rigorously establish the recurrent kernel limit of Reservoir Computing and prove its convergence. b) We test our models on chaotic time series prediction, a classic but challenging benchmark in Reservoir Computing, and show how the Recurrent Kernel is competitive and computationally efficient when the number of data points remains moderate. c) When the number of samples is too large, we leverage the success of structured Random Features for kernel approximation by introducing Structured Reservoir Computing. The two proposed methods, Recurrent Kernel and Structured Reservoir Computing, turn out to be much faster and more memory-efficient than conventional Reservoir Computing.

Submitted 21 October, 2020; v1 submitted 12 June, 2020; originally announced June 2020.

Journal ref: Advances in Neural Information Processing Systems, v33, pages 16785--16796, 2020

arXiv:2002.12503 [pdf, ps, other]

doi 10.1103/PhysRevB.101.075433

Impact of epitaxial strain on the topological-nontopological phase diagram and semimetallic behavior of InAs/GaSb composite quantum wells

Authors: H. Irie, T. Akiho, F. Couëdo, R. Ohana, K. Suzuki, K. Onomitsu, K. Muraki

Abstract: We study the influence of epitaxial strain on the electronic properties of InAs/GaSb composite quantum wells (CQWs), host structures for quantum spin Hall insulators, by transport measurements and eight-band $\mathbf{k\cdot p}$ calculations. Using different substrates and buffer layer structures for crystal growth, we prepare two types of samples with vastly different strain conditions. CQWs with a nearly strain-free GaSb layer exhibit a resistance peak at the charge neutrality point that reflects the opening of a topological gap in the band-inverted regime. In contrast, for CQWs with 0.50\% biaxial tensile strain in the GaSb layer, semimetallic behavior indicating a gap closure is found for the same degree of band inversion. Additionally, with the tensile strain, the boundary between the topological and nontopological regimes is located at a larger InAs thickness. Eight-band $\mathbf{k\cdot p}$ calculations reveal that tensile strain in GaSb not only shifts the phase boundary but also significantly modifies the band structure, which can result in the closure of an indirect gap and make the system semimetallic even in the topological regime. Our results thus provide a global picture of the topological-nontopological phase diagram as a function of layer thicknesses and strain.

Submitted 27 February, 2020; originally announced February 2020.

Comments: 13 pages, 9 figures

Journal ref: Phys. Rev. B 101, 075433 (2020)

arXiv:1910.09880 [pdf, other]

doi 10.1109/ICASSP40776.2020.9053272

Kernel computations from large-scale random features obtained by Optical Processing Units

Authors: Ruben Ohana, Jonas Wacker, Jonathan Dong, Sébastien Marmin, Florent Krzakala, Maurizio Filippone, Laurent Daudet

Abstract: Approximating kernel functions with random features (RFs) has been a successful application of random projections for nonparametric estimation. However, performing random projections presents computational challenges for large-scale problems. Recently, a new optical hardware called Optical Processing Unit (OPU) has been developed for fast and energy-efficient computation of large-scale RFs in the analog domain. More specifically, the OPU performs the multiplication of input vectors by a large random matrix with complex-valued i.i.d. Gaussian entries, followed by the application of an element-wise squared absolute value operation - this last nonlinearity being intrinsic to the sensing process. In this paper, we show that this operation results in a dot-product kernel that has connections to the polynomial kernel, and we extend this computation to arbitrary powers of the feature map. Experiments demonstrate that the OPU kernel and its RF approximation achieve competitive performance in applications using kernel ridge regression and transfer learning for image classification. Crucially, thanks to the use of the OPU, these results are obtained with time and energy savings.

Submitted 2 December, 2019; v1 submitted 22 October, 2019; originally announced October 2019.

Comments: 5 pages, 3 figures, submitted to ICASSP 2020

Journal ref: ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)

arXiv:1202.6612 [pdf, ps, other]

A Database of Elliptic Curves over Q(sqrt(5)) - First Report

Authors: Jonathan Bober, Alyson Deines, Ariah Klages-Mundt, Benjamin LeVeque, R. Andrew Ohana, Ashwath Rabindranath, Paul Sharaba, William Stein

Abstract: We describe a tabulation of (conjecturally) modular elliptic curves over the field Q(sqrt(5)) up to the first curve of rank 2. Using an efficient implementation of an algorithm of Lassina Dembele, we computed tables of Hilbert modular forms of weight (2,2) over Q(sqrt(5)), and via a variety of methods we constructed corresponding elliptic curves, including (again, conjecturally) all elliptic curves over Q(sqrt(5)) that have conductor with norm less than or equal to 1831.

Submitted 9 July, 2012; v1 submitted 29 February, 2012; originally announced February 2012.

Comments: 17 pages

arXiv:1007.2667 [pdf, ps, other]

On well-rounded sublattices of the hexagonal lattice

Authors: Lenny Fukshansky, Daniel Moore, R. Andrew Ohana, Whitney Zeldow

Abstract: We produce an explicit parameterization of well-rounded sublattices of the hexagonal lattice in the plane, splitting them into similarity classes. We use this parameterization to study the number, the greatest minimal norm, and the highest signal-to-noise ratio of well-rounded sublattices of the hexagonal lattice of a fixed index. This investigation parallels earlier work by Bernstein, Sloane, and Wright where similar questions were addressed on the space of all sublattices of the hexagonal lattice. Our restriction is motivated by the importance of well-rounded lattices for discrete optimization problems. Finally, we also discuss the existence of a natural combinatorial structure on the set of similarity classes of well-rounded sublattices of the hexagonal lattice, induced by the action of a certain matrix monoid.

Submitted 29 July, 2010; v1 submitted 15 July, 2010; originally announced July 2010.

Comments: 21 pages (minor correction to the proof of Lemma 2.1); to appear in Discrete Mathematics

MSC Class: Primary: 11H31; 52C15; Secondary: 05B40; 11E45

Journal ref: Discrete Mathematics, vol. 310 no. 23 (2010), pg. 3287--3302

Showing 1–22 of 22 results for author: Ohana, R