Search | arXiv e-print repository

arXiv:2406.19320 [pdf, other]

Efficient World Models with Context-Aware Tokenization

Authors: Vincent Micheli, Eloi Alonso, François Fleuret

Abstract: Scaling up deep Reinforcement Learning (RL) methods presents a significant challenge. Following developments in generative modelling, model-based RL positions itself as a strong contender. Recent advances in sequence modelling have led to effective transformer-based world models, albeit at the price of heavy computations due to the long sequences of tokens required to accurately simulate environme… ▽ More Scaling up deep Reinforcement Learning (RL) methods presents a significant challenge. Following developments in generative modelling, model-based RL positions itself as a strong contender. Recent advances in sequence modelling have led to effective transformer-based world models, albeit at the price of heavy computations due to the long sequences of tokens required to accurately simulate environments. In this work, we propose $Δ$-IRIS, a new agent with a world model architecture composed of a discrete autoencoder that encodes stochastic deltas between time steps and an autoregressive transformer that predicts future deltas by summarizing the current state of the world with continuous tokens. In the Crafter benchmark, $Δ$-IRIS sets a new state of the art at multiple frame budgets, while being an order of magnitude faster to train than previous attention-based approaches. We release our code and models at https://github.com/vmicheli/delta-iris. △ Less

Submitted 27 June, 2024; originally announced June 2024.

Comments: ICML 2024

arXiv:2406.02313 [pdf, other]

Neural Thermodynamic Integration: Free Energies from Energy-based Diffusion Models

Authors: Bálint Máté, François Fleuret, Tristan Bereau

Abstract: Thermodynamic integration (TI) offers a rigorous method for estimating free-energy differences by integrating over a sequence of interpolating conformational ensembles. However, TI calculations are computationally expensive and typically limited to coupling a small number of degrees of freedom due to the need to sample numerous intermediate ensembles with sufficient conformational-space overlap. I… ▽ More Thermodynamic integration (TI) offers a rigorous method for estimating free-energy differences by integrating over a sequence of interpolating conformational ensembles. However, TI calculations are computationally expensive and typically limited to coupling a small number of degrees of freedom due to the need to sample numerous intermediate ensembles with sufficient conformational-space overlap. In this work, we propose to perform TI along an alchemical pathway represented by a trainable neural network, which we term Neural TI. Critically, we parametrize a time-dependent Hamiltonian interpolating between the interacting and non-interacting systems, and optimize its gradient using a denoising-diffusion objective. The ability of the resulting energy-based diffusion model to sample all intermediate ensembles allows us to perform TI from a single reference calculation. We apply our method to Lennard-Jones fluids, where we report accurate calculations of the excess chemical potential, demonstrating that Neural TI is capable of coupling hundreds of degrees of freedom at once. △ Less

Submitted 12 June, 2024; v1 submitted 4 June, 2024; originally announced June 2024.

arXiv:2405.12399 [pdf, other]

Diffusion for World Modeling: Visual Details Matter in Atari

Authors: Eloi Alonso, Adam Jelley, Vincent Micheli, Anssi Kanervisto, Amos Storkey, Tim Pearce, François Fleuret

Abstract: World models constitute a promising approach for training reinforcement learning agents in a safe and sample-efficient manner. Recent world models predominantly operate on sequences of discrete latent variables to model environment dynamics. However, this compression into a compact discrete representation may ignore visual details that are important for reinforcement learning. Concurrently, diffus… ▽ More World models constitute a promising approach for training reinforcement learning agents in a safe and sample-efficient manner. Recent world models predominantly operate on sequences of discrete latent variables to model environment dynamics. However, this compression into a compact discrete representation may ignore visual details that are important for reinforcement learning. Concurrently, diffusion models have become a dominant approach for image generation, challenging well-established methods modeling discrete latents. Motivated by this paradigm shift, we introduce DIAMOND (DIffusion As a Model Of eNvironment Dreams), a reinforcement learning agent trained in a diffusion world model. We analyze the key design choices that are required to make diffusion suitable for world modeling, and demonstrate how improved visual details can lead to improved agent performance. DIAMOND achieves a mean human normalized score of 1.46 on the competitive Atari 100k benchmark; a new best for agents trained entirely within a world model. To foster future research on diffusion for world modeling, we release our code, agents and playable world models at https://github.com/eloialonso/diamond. △ Less

Submitted 20 May, 2024; originally announced May 2024.

Comments: 25 pages, 11 figures, 10 tables

arXiv:2405.07813 [pdf, other]

Localizing Task Information for Improved Model Merging and Compression

Authors: Ke Wang, Nikolaos Dimitriadis, Guillermo Ortiz-Jimenez, François Fleuret, Pascal Frossard

Abstract: Model merging and task arithmetic have emerged as promising scalable approaches to merge multiple single-task checkpoints to one multi-task model, but their applicability is reduced by significant performance loss. Previous works have linked these drops to interference in the weight space and erasure of important task-specific features. Instead, in this work we show that the information required t… ▽ More Model merging and task arithmetic have emerged as promising scalable approaches to merge multiple single-task checkpoints to one multi-task model, but their applicability is reduced by significant performance loss. Previous works have linked these drops to interference in the weight space and erasure of important task-specific features. Instead, in this work we show that the information required to solve each task is still preserved after merging as different tasks mostly use non-overlap** sets of weights. We propose TALL-masks, a method to identify these task supports given a collection of task vectors and show that one can retrieve >99% of the single task accuracy by applying our masks to the multi-task vector, effectively compressing the individual checkpoints. We study the statistics of intersections among constructed masks and reveal the existence of selfish and catastrophic weights, i.e., parameters that are important exclusively to one task and irrelevant to all tasks but detrimental to multi-task fusion. For this reason, we propose Consensus Merging, an algorithm that eliminates such weights and improves the general performance of existing model merging approaches. Our experiments in vision and NLP benchmarks with up to 20 tasks, show that Consensus Merging consistently improves existing approaches. Furthermore, our proposed compression scheme reduces storage from 57Gb to 8.2Gb while retaining 99.7% of original performance. △ Less

Submitted 13 May, 2024; originally announced May 2024.

Comments: Accepted ICML 2024; The first two authors contributed equally to this work; Project website: https://tall-masks.github.io

arXiv:2404.09562 [pdf, other]

σ-GPTs: A New Approach to Autoregressive Models

Authors: Arnaud Pannatier, Evann Courdier, François Fleuret

Abstract: Autoregressive models, such as the GPT family, use a fixed order, usually left-to-right, to generate sequences. However, this is not a necessity. In this paper, we challenge this assumption and show that by simply adding a positional encoding for the output, this order can be modulated on-the-fly per-sample which offers key advantageous properties. It allows for the sampling of and conditioning on… ▽ More Autoregressive models, such as the GPT family, use a fixed order, usually left-to-right, to generate sequences. However, this is not a necessity. In this paper, we challenge this assumption and show that by simply adding a positional encoding for the output, this order can be modulated on-the-fly per-sample which offers key advantageous properties. It allows for the sampling of and conditioning on arbitrary subsets of tokens, and it also allows sampling in one shot multiple tokens dynamically according to a rejection strategy, leading to a sub-linear number of model evaluations. We evaluate our method across various domains, including language modeling, path-solving, and aircraft vertical rate prediction, decreasing the number of steps required for generation by an order of magnitude. △ Less

Submitted 1 July, 2024; v1 submitted 15 April, 2024; originally announced April 2024.

Comments: 23 pages, 7 figures, accepted at ECML/PKDD 2024

arXiv:2402.02622 [pdf, other]

DenseFormer: Enhancing Information Flow in Transformers via Depth Weighted Averaging

Authors: Matteo Pagliardini, Amirkeivan Mohtashami, Francois Fleuret, Martin Jaggi

Abstract: The transformer architecture by Vaswani et al. (2017) is now ubiquitous across application domains, from natural language processing to speech processing and image understanding. We propose DenseFormer, a simple modification to the standard architecture that improves the perplexity of the model without increasing its size -- adding a few thousand parameters for large-scale models in the 100B param… ▽ More The transformer architecture by Vaswani et al. (2017) is now ubiquitous across application domains, from natural language processing to speech processing and image understanding. We propose DenseFormer, a simple modification to the standard architecture that improves the perplexity of the model without increasing its size -- adding a few thousand parameters for large-scale models in the 100B parameters range. Our approach relies on an additional averaging step after each transformer block, which computes a weighted average of current and past representations -- we refer to this operation as Depth-Weighted-Average (DWA). The learned DWA weights exhibit coherent patterns of information flow, revealing the strong and structured reuse of activations from distant layers. Experiments demonstrate that DenseFormer is more data efficient, reaching the same perplexity of much deeper transformer models, and that for the same perplexity, these new models outperform transformer baselines in terms of memory efficiency and inference time. △ Less

Submitted 21 March, 2024; v1 submitted 4 February, 2024; originally announced February 2024.

arXiv:2401.00828 [pdf, other]

Multi-Lattice Sampling of Quantum Field Theories via Neural Operator-based Flows

Authors: Bálint Máté, François Fleuret

Abstract: We consider the problem of sampling discrete field configurations $φ$ from the Boltzmann distribution $[dφ] Z^{-1} e^{-S[φ]}$, where $S$ is the lattice-discretization of the continuous Euclidean action $\mathcal S$ of some quantum field theory. Since such densities arise as the approximation of the underlying functional density $[\mathcal Dφ(x)] \mathcal Z^{-1} e^{-\mathcal S[φ(x)]}$, we frame the… ▽ More We consider the problem of sampling discrete field configurations $φ$ from the Boltzmann distribution $[dφ] Z^{-1} e^{-S[φ]}$, where $S$ is the lattice-discretization of the continuous Euclidean action $\mathcal S$ of some quantum field theory. Since such densities arise as the approximation of the underlying functional density $[\mathcal Dφ(x)] \mathcal Z^{-1} e^{-\mathcal S[φ(x)]}$, we frame the task as an instance of operator learning. In particular, we propose to approximate a time-dependent operator $\mathcal V_t$ whose time integral provides a map** between the functional distributions of the free theory $[\mathcal Dφ(x)] \mathcal Z_0^{-1} e^{-\mathcal S_{0}[φ(x)]}$ and of the target theory $[\mathcal Dφ(x)]\mathcal Z^{-1}e^{-\mathcal S[φ(x)]}$. Whenever a particular lattice is chosen, the operator $\mathcal V_t$ can be discretized to a finite dimensional, time-dependent vector field $V_t$ which in turn induces a continuous normalizing flow between finite dimensional distributions over the chosen lattice. This flow can then be trained to be a diffeormorphism between the discretized free and target theories $[dφ] Z_0^{-1} e^{-S_{0}[φ]}$, $[dφ] Z^{-1}e^{-S[φ]}$. We run experiments on the $φ^4$-theory to explore to what extent such operator-based flow architectures generalize to lattice sizes they were not trained on and show that pretraining on smaller lattices can lead to speedup over training only a target lattice size. △ Less

Submitted 17 January, 2024; v1 submitted 1 January, 2024; originally announced January 2024.

arXiv:2311.09998 [pdf, other]

DeepEMD: A Transformer-based Fast Estimation of the Earth Mover's Distance

Authors: Atul Kumar Sinha, Francois Fleuret

Abstract: The Earth Mover's Distance (EMD) is the measure of choice between point clouds. However the computational cost to compute it makes it prohibitive as a training loss, and the standard approach is to use a surrogate such as the Chamfer distance. We propose an attention-based model to compute an accurate approximation of the EMD that can be used as a training loss for generative models. To get the ne… ▽ More The Earth Mover's Distance (EMD) is the measure of choice between point clouds. However the computational cost to compute it makes it prohibitive as a training loss, and the standard approach is to use a surrogate such as the Chamfer distance. We propose an attention-based model to compute an accurate approximation of the EMD that can be used as a training loss for generative models. To get the necessary accurate estimation of the gradients we train our model to explicitly compute the matching between point clouds instead of EMD itself. We cast this new objective as the estimation of an attention matrix that approximates the ground truth matching matrix. Experiments show that this model provides an accurate estimate of the EMD and its gradient with a wall clock speed-up of more than two orders of magnitude with respect to the exact Hungarian matching algorithm and one order of magnitude with respect to the standard approximate Sinkhorn algorithm, allowing in particular to train a point cloud VAE with the EMD itself. Extensive evaluation show the remarkable behaviour of this model when operating out-of-distribution, a key requirement for a distance surrogate. Finally, the model generalizes very well to point clouds during inference several times larger than during training. △ Less

Submitted 16 November, 2023; originally announced November 2023.

arXiv:2311.00586 [pdf, other]

PAUMER: Patch Pausing Transformer for Semantic Segmentation

Authors: Evann Courdier, Prabhu Teja Sivaprasad, François Fleuret

Abstract: We study the problem of improving the efficiency of segmentation transformers by using disparate amounts of computation for different parts of the image. Our method, PAUMER, accomplishes this by pausing computation for patches that are deemed to not need any more computation before the final decoder. We use the entropy of predictions computed from intermediate activations as the pausing criterion,… ▽ More We study the problem of improving the efficiency of segmentation transformers by using disparate amounts of computation for different parts of the image. Our method, PAUMER, accomplishes this by pausing computation for patches that are deemed to not need any more computation before the final decoder. We use the entropy of predictions computed from intermediate activations as the pausing criterion, and find this aligns well with semantics of the image. Our method has a unique advantage that a single network trained with the proposed strategy can be effortlessly adapted at inference to various run-time requirements by modulating its pausing parameters. On two standard segmentation datasets, Cityscapes and ADE20K, we show that our method operates with about a $50\%$ higher throughput with an mIoU drop of about $0.65\%$ and $4.6\%$ respectively. △ Less

Submitted 1 November, 2023; originally announced November 2023.

arXiv:2306.01160 [pdf, other]

Faster Causal Attention Over Large Sequences Through Sparse Flash Attention

Authors: Matteo Pagliardini, Daniele Paliotta, Martin Jaggi, François Fleuret

Abstract: Transformer-based language models have found many diverse applications requiring them to process sequences of increasing length. For these applications, the causal self-attention -- which is the only component scaling quadratically w.r.t. the sequence length -- becomes a central concern. While many works have proposed schemes to sparsify the attention patterns and reduce the computational overhead… ▽ More Transformer-based language models have found many diverse applications requiring them to process sequences of increasing length. For these applications, the causal self-attention -- which is the only component scaling quadratically w.r.t. the sequence length -- becomes a central concern. While many works have proposed schemes to sparsify the attention patterns and reduce the computational overhead of self-attention, those are often limited by implementations concerns and end up imposing a simple and static structure over the attention matrix. Conversely, implementing more dynamic sparse attentions often results in runtimes significantly slower than computing the full attention using the Flash implementation from Dao et al. (2022). We extend FlashAttention to accommodate a large class of attention sparsity patterns that, in particular, encompass key/query drop** and hashing-based attention. This leads to implementations with no computational complexity overhead and a multi-fold runtime speedup on top of FlashAttention. Even with relatively low degrees of sparsity, our method improves visibly upon FlashAttention as the sequence length increases. Without sacrificing perplexity, we increase the training speed of a transformer language model by $2.0\times$ and $3.3\times$ for sequences of respectively $8k$ and $16k$ tokens. △ Less

Submitted 1 June, 2023; originally announced June 2023.

arXiv:2304.10857 [pdf, other]

SequeL: A Continual Learning Library in PyTorch and JAX

Authors: Nikolaos Dimitriadis, Francois Fleuret, Pascal Frossard

Abstract: Continual Learning is an important and challenging problem in machine learning, where models must adapt to a continuous stream of new data without forgetting previously acquired knowledge. While existing frameworks are built on PyTorch, the rising popularity of JAX might lead to divergent codebases, ultimately hindering reproducibility and progress. To address this problem, we introduce SequeL, a… ▽ More Continual Learning is an important and challenging problem in machine learning, where models must adapt to a continuous stream of new data without forgetting previously acquired knowledge. While existing frameworks are built on PyTorch, the rising popularity of JAX might lead to divergent codebases, ultimately hindering reproducibility and progress. To address this problem, we introduce SequeL, a flexible and extensible library for Continual Learning that supports both PyTorch and JAX frameworks. SequeL provides a unified interface for a wide range of Continual Learning algorithms, including regularization-based approaches, replay-based approaches, and hybrid approaches. The library is designed towards modularity and simplicity, making the API suitable for both researchers and practitioners. We release SequeL\footnote{\url{https://github.com/nik-dim/sequel}} as an open-source library, enabling researchers and developers to easily experiment and extend the library for their own purposes. △ Less

Submitted 21 April, 2023; originally announced April 2023.

Comments: 7 pages, 1 figure, 4 code listings

arXiv:2302.05282 [pdf, other]

Graph Neural Networks Go Forward-Forward

Authors: Daniele Paliotta, Mathieu Alain, Bálint Máté, François Fleuret

Abstract: We present the Graph Forward-Forward (GFF) algorithm, an extension of the Forward-Forward procedure to graphs, able to handle features distributed over a graph's nodes. This allows training graph neural networks with forward passes only, without backpropagation. Our method is agnostic to the message-passing scheme, and provides a more biologically plausible learning scheme than backpropagation, wh… ▽ More We present the Graph Forward-Forward (GFF) algorithm, an extension of the Forward-Forward procedure to graphs, able to handle features distributed over a graph's nodes. This allows training graph neural networks with forward passes only, without backpropagation. Our method is agnostic to the message-passing scheme, and provides a more biologically plausible learning scheme than backpropagation, while also carrying computational advantages. With GFF, graph neural networks are trained greedily layer by layer, using both positive and negative samples. We run experiments on 11 standard graph property prediction tasks, showing how GFF provides an effective alternative to backpropagation for training graph neural networks. This shows in particular that this procedure is remarkably efficient in spite of combining the per-layer training with the locality of the processing in a GNN. △ Less

Submitted 10 February, 2023; originally announced February 2023.

arXiv:2301.07388 [pdf, other]

Learning Interpolations between Boltzmann Densities

Authors: Bálint Máté, François Fleuret

Abstract: We introduce a training objective for continuous normalizing flows that can be used in the absence of samples but in the presence of an energy function. Our method relies on either a prescribed or a learnt interpolation $f_t$ of energy functions between the target energy $f_1$ and the energy function of a generalized Gaussian $f_0(x) = ||x/σ||_p^p$. The interpolation of energy functions induces an… ▽ More We introduce a training objective for continuous normalizing flows that can be used in the absence of samples but in the presence of an energy function. Our method relies on either a prescribed or a learnt interpolation $f_t$ of energy functions between the target energy $f_1$ and the energy function of a generalized Gaussian $f_0(x) = ||x/σ||_p^p$. The interpolation of energy functions induces an interpolation of Boltzmann densities $p_t \propto e^{-f_t}$ and we aim to find a time-dependent vector field $V_t$ that transports samples along the family $p_t$ of densities. The condition of transporting samples along the family $p_t$ is equivalent to satisfying the continuity equation with $V_t$ and $p_t = Z_t^{-1}e^{-f_t}$. Consequently, we optimize $V_t$ and $f_t$ to satisfy this partial differential equation. We experimentally compare the proposed training objective to the reverse KL-divergence on Gaussian mixtures and on the Boltzmann density of a quantum mechanical particle in a double-well potential. △ Less

Submitted 30 May, 2023; v1 submitted 18 January, 2023; originally announced January 2023.

Comments: TMLR

arXiv:2211.11704 [pdf, other]

ESLAM: Efficient Dense SLAM System Based on Hybrid Representation of Signed Distance Fields

Authors: Mohammad Mahdi Johari, Camilla Carta, François Fleuret

Abstract: We present ESLAM, an efficient implicit neural representation method for Simultaneous Localization and Map** (SLAM). ESLAM reads RGB-D frames with unknown camera poses in a sequential manner and incrementally reconstructs the scene representation while estimating the current camera position in the scene. We incorporate the latest advances in Neural Radiance Fields (NeRF) into a SLAM system, resu… ▽ More We present ESLAM, an efficient implicit neural representation method for Simultaneous Localization and Map** (SLAM). ESLAM reads RGB-D frames with unknown camera poses in a sequential manner and incrementally reconstructs the scene representation while estimating the current camera position in the scene. We incorporate the latest advances in Neural Radiance Fields (NeRF) into a SLAM system, resulting in an efficient and accurate dense visual SLAM method. Our scene representation consists of multi-scale axis-aligned perpendicular feature planes and shallow decoders that, for each point in the continuous space, decode the interpolated features into Truncated Signed Distance Field (TSDF) and RGB values. Our extensive experiments on three standard datasets, Replica, ScanNet, and TUM RGB-D show that ESLAM improves the accuracy of 3D reconstruction and camera localization of state-of-the-art dense visual SLAM methods by more than 50%, while it runs up to 10 times faster and does not require any pre-training. △ Less

Submitted 3 April, 2023; v1 submitted 21 November, 2022; originally announced November 2022.

Comments: CVPR 2023 Highlight. Project page: https://www.idiap.ch/paper/eslam/

arXiv:2210.13772 [pdf, other]

Deformations of Boltzmann Distributions

Authors: Bálint Máté, François Fleuret

Abstract: Consider a one-parameter family of Boltzmann distributions $p_t(x) = \tfrac{1}{Z_t}e^{-S_t(x)}$. This work studies the problem of sampling from $p_{t_0}$ by first sampling from $p_{t_1}$ and then applying a transformation $Ψ_{t_1}^{t_0}$ so that the transformed samples follow $p_{t_0}$. We derive an equation relating $Ψ$ and the corresponding family of unnormalized log-likelihoods $S_t$. The utili… ▽ More Consider a one-parameter family of Boltzmann distributions $p_t(x) = \tfrac{1}{Z_t}e^{-S_t(x)}$. This work studies the problem of sampling from $p_{t_0}$ by first sampling from $p_{t_1}$ and then applying a transformation $Ψ_{t_1}^{t_0}$ so that the transformed samples follow $p_{t_0}$. We derive an equation relating $Ψ$ and the corresponding family of unnormalized log-likelihoods $S_t$. The utility of this idea is demonstrated on the $φ^4$ lattice field theory by extending its defining action $S_0$ to a family of actions $S_t$ and finding a $τ$ such that normalizing flows perform better at learning the Boltzmann distribution $p_τ$ than at learning $p_0$. △ Less

Submitted 14 November, 2022; v1 submitted 25 October, 2022; originally announced October 2022.

Comments: Machine Learning for the Physical Sciences Workshop at NeurIPS '22

arXiv:2210.11269 [pdf, other]

Inference from Real-World Sparse Measurements

Authors: Arnaud Pannatier, Kyle Matoba, François Fleuret

Abstract: Real-world problems often involve complex and unstructured sets of measurements, which occur when sensors are sparsely placed in either space or time. Being able to model this irregular spatiotemporal data and extract meaningful forecasts is crucial. Deep learning architectures capable of processing sets of measurements with positions varying from set to set, and extracting readouts anywhere are m… ▽ More Real-world problems often involve complex and unstructured sets of measurements, which occur when sensors are sparsely placed in either space or time. Being able to model this irregular spatiotemporal data and extract meaningful forecasts is crucial. Deep learning architectures capable of processing sets of measurements with positions varying from set to set, and extracting readouts anywhere are methodologically difficult. Current state-of-the-art models are graph neural networks and require domain-specific knowledge for proper setup. We propose an attention-based model focused on robustness and practical applicability, with two key design contributions. First, we adopt a ViT-like transformer that takes both context points and read-out positions as inputs, eliminating the need for an encoder-decoder structure. Second, we use a unified method for encoding both context and read-out positions. This approach is intentionally straightforward and integrates well with other systems. Compared to existing approaches, our model is simpler, requires less specialized knowledge, and does not suffer from a problematic bottleneck effect, all of which contribute to superior performance. We conduct in-depth ablation studies that characterize this problematic bottleneck in the latent representations of alternative models that inhibit information utilization and impede training efficiency. We also perform experiments across various problem domains, including high-altitude wind nowcasting, two-day weather forecasting, fluid dynamics, and heat diffusion. Our attention-based model consistently outperforms state-of-the-art models in handling irregularly sampled data. Notably, our model reduces the root mean square error (RMSE) for wind nowcasting from 9.24 to 7.98 and for heat diffusion tasks from 0.126 to 0.084. △ Less

Submitted 15 April, 2024; v1 submitted 20 October, 2022; originally announced October 2022.

Comments: 27 pages, 12 figures, Published at TMLR https://openreview.net/forum?id=y9IDfODRns

arXiv:2210.09759 [pdf, other]

Pareto Manifold Learning: Tackling multiple tasks via ensembles of single-task models

Authors: Nikolaos Dimitriadis, Pascal Frossard, François Fleuret

Abstract: In Multi-Task Learning (MTL), tasks may compete and limit the performance achieved on each other, rather than guiding the optimization to a solution, superior to all its single-task trained counterparts. Since there is often not a unique solution optimal for all tasks, practitioners have to balance tradeoffs between tasks' performance, and resort to optimality in the Pareto sense. Most MTL methodo… ▽ More In Multi-Task Learning (MTL), tasks may compete and limit the performance achieved on each other, rather than guiding the optimization to a solution, superior to all its single-task trained counterparts. Since there is often not a unique solution optimal for all tasks, practitioners have to balance tradeoffs between tasks' performance, and resort to optimality in the Pareto sense. Most MTL methodologies either completely neglect this aspect, and instead of aiming at learning a Pareto Front, produce one solution predefined by their optimization schemes, or produce diverse but discrete solutions. Recent approaches parameterize the Pareto Front via neural networks, leading to complex map**s from tradeoff to objective space. In this paper, we conjecture that the Pareto Front admits a linear parameterization in parameter space, which leads us to propose \textit{Pareto Manifold Learning}, an ensembling method in weight space. Our approach produces a continuous Pareto Front in a single training run, that allows to modulate the performance on each task during inference. Experiments on multi-task learning benchmarks, ranging from image classification to tabular datasets and scene understanding, show that \textit{Pareto Manifold Learning} outperforms state-of-the-art single-point algorithms, while learning a better Pareto parameterization than multi-point baselines. △ Less

Submitted 14 June, 2023; v1 submitted 18 October, 2022; originally announced October 2022.

Comments: Accepted ICML 2023

arXiv:2209.00588 [pdf, other]

Transformers are Sample-Efficient World Models

Authors: Vincent Micheli, Eloi Alonso, François Fleuret

Abstract: Deep reinforcement learning agents are notoriously sample inefficient, which considerably limits their application to real-world problems. Recently, many model-based methods have been designed to address this issue, with learning in the imagination of a world model being one of the most prominent approaches. However, while virtually unlimited interaction with a simulated environment sounds appeali… ▽ More Deep reinforcement learning agents are notoriously sample inefficient, which considerably limits their application to real-world problems. Recently, many model-based methods have been designed to address this issue, with learning in the imagination of a world model being one of the most prominent approaches. However, while virtually unlimited interaction with a simulated environment sounds appealing, the world model has to be accurate over extended periods of time. Motivated by the success of Transformers in sequence modeling tasks, we introduce IRIS, a data-efficient agent that learns in a world model composed of a discrete autoencoder and an autoregressive Transformer. With the equivalent of only two hours of gameplay in the Atari 100k benchmark, IRIS achieves a mean human normalized score of 1.046, and outperforms humans on 10 out of 26 games, setting a new state of the art for methods without lookahead search. To foster future research on Transformers and world models for sample-efficient reinforcement learning, we release our code and models at https://github.com/eloialonso/iris. △ Less

Submitted 1 March, 2023; v1 submitted 1 September, 2022; originally announced September 2022.

Comments: ICLR 2023 (notable top 5%)

arXiv:2206.07144 [pdf, other]

Efficiently Training Low-Curvature Neural Networks

Authors: Suraj Srinivas, Kyle Matoba, Himabindu Lakkaraju, Francois Fleuret

Abstract: The highly non-linear nature of deep neural networks causes them to be susceptible to adversarial examples and have unstable gradients which hinders interpretability. However, existing methods to solve these issues, such as adversarial training, are expensive and often sacrifice predictive accuracy. In this work, we consider curvature, which is a mathematical quantity which encodes the degree of… ▽ More The highly non-linear nature of deep neural networks causes them to be susceptible to adversarial examples and have unstable gradients which hinders interpretability. However, existing methods to solve these issues, such as adversarial training, are expensive and often sacrifice predictive accuracy. In this work, we consider curvature, which is a mathematical quantity which encodes the degree of non-linearity. Using this, we demonstrate low-curvature neural networks (LCNNs) that obtain drastically lower curvature than standard models while exhibiting similar predictive performance, which leads to improved robustness and stable gradients, with only a marginally increased training time. To achieve this, we minimize a data-independent upper bound on the curvature of a neural network, which decomposes overall curvature in terms of curvatures and slopes of its constituent layers. To efficiently minimize this bound, we introduce two novel architectural components: first, a non-linearity called centered-softplus that is a stable variant of the softplus non-linearity, and second, a Lipschitz-constrained batch normalization layer. Our experiments show that LCNNs have lower curvature, more stable gradients and increased off-the-shelf adversarial robustness when compared to their standard high-curvature counterparts, all without affecting predictive performance. Our approach is easy to use and can be readily incorporated into existing neural network models. △ Less

Submitted 10 January, 2023; v1 submitted 14 June, 2022; originally announced June 2022.

Comments: NeurIPS 2022

arXiv:2205.15209 [pdf, other]

Flowification: Everything is a Normalizing Flow

Authors: Bálint Máté, Samuel Klein, Tobias Golling, François Fleuret

Abstract: The two key characteristics of a normalizing flow is that it is invertible (in particular, dimension preserving) and that it monitors the amount by which it changes the likelihood of data points as samples are propagated along the network. Recently, multiple generalizations of normalizing flows have been introduced that relax these two conditions. On the other hand, neural networks only perform a… ▽ More The two key characteristics of a normalizing flow is that it is invertible (in particular, dimension preserving) and that it monitors the amount by which it changes the likelihood of data points as samples are propagated along the network. Recently, multiple generalizations of normalizing flows have been introduced that relax these two conditions. On the other hand, neural networks only perform a forward pass on the input, there is neither a notion of an inverse of a neural network nor is there one of its likelihood contribution. In this paper we argue that certain neural network architectures can be enriched with a stochastic inverse pass and that their likelihood contribution can be monitored in a way that they fall under the generalized notion of a normalizing flow mentioned above. We term this enrichment flowification. We prove that neural networks only containing linear layers, convolutional layers and invertible activations such as LeakyReLU can be flowified and evaluate them in the generative setting on image datasets. △ Less

Submitted 26 January, 2023; v1 submitted 30 May, 2022; originally announced May 2022.

Comments: NeurIPS 2022

arXiv:2203.03691 [pdf, other]

HyperMixer: An MLP-based Low Cost Alternative to Transformers

Authors: Florian Mai, Arnaud Pannatier, Fabio Fehr, Haolin Chen, Francois Marelli, Francois Fleuret, James Henderson

Abstract: Transformer-based architectures are the model of choice for natural language understanding, but they come at a significant cost, as they have quadratic complexity in the input length, require a lot of training data, and can be difficult to tune. In the pursuit of lower costs, we investigate simple MLP-based architectures. We find that existing architectures such as MLPMixer, which achieves token m… ▽ More Transformer-based architectures are the model of choice for natural language understanding, but they come at a significant cost, as they have quadratic complexity in the input length, require a lot of training data, and can be difficult to tune. In the pursuit of lower costs, we investigate simple MLP-based architectures. We find that existing architectures such as MLPMixer, which achieves token mixing through a static MLP applied to each feature independently, are too detached from the inductive biases required for natural language understanding. In this paper, we propose a simple variant, HyperMixer, which forms the token mixing MLP dynamically using hypernetworks. Empirically, we demonstrate that our model performs better than alternative MLP-based models, and on par with Transformers. In contrast to Transformers, HyperMixer achieves these results at substantially lower costs in terms of processing time, training data, and hyperparameter tuning. △ Less

Submitted 13 November, 2023; v1 submitted 7 March, 2022; originally announced March 2022.

Comments: Published at ACL 2023

Journal ref: Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2023

arXiv:2203.01016 [pdf, other]

The Theoretical Expressiveness of Maxpooling

Authors: Kyle Matoba, Nikolaos Dimitriadis, François Fleuret

Abstract: Over the decade since deep neural networks became state of the art image classifiers there has been a tendency towards less use of max pooling: the function that takes the largest of nearby pixels in an image. Since max pooling featured prominently in earlier generations of image classifiers, we wish to understand this trend, and whether it is justified. We develop a theoretical framework analyzin… ▽ More Over the decade since deep neural networks became state of the art image classifiers there has been a tendency towards less use of max pooling: the function that takes the largest of nearby pixels in an image. Since max pooling featured prominently in earlier generations of image classifiers, we wish to understand this trend, and whether it is justified. We develop a theoretical framework analyzing ReLU based approximations to max pooling, and prove a sense in which max pooling cannot be efficiently replicated using ReLU activations. We analyze the error of a class of optimal approximations, and find that whilst the error can be made exponentially small in the kernel size, doing so requires an exponentially complex approximation. Our work gives a theoretical basis for understanding the trend away from max pooling in newer architectures. We conclude that the main cause of a difference between max pooling and an optimal approximation, a prevalent large difference between the max and other values within pools, can be overcome with other architectural decisions, or is not prevalent in natural images. △ Less

Submitted 2 March, 2022; originally announced March 2022.

Comments: 31 pages, 6 figures

arXiv:2202.10583 [pdf, other]

MineRL Diamond 2021 Competition: Overview, Results, and Lessons Learned

Authors: Anssi Kanervisto, Stephanie Milani, Karolis Ramanauskas, Nicholay Topin, Zichuan Lin, Junyou Li, Jianing Shi, Deheng Ye, Qiang Fu, Wei Yang, Weijun Hong, Zhongyue Huang, Haicheng Chen, Guangjun Zeng, Yue Lin, Vincent Micheli, Eloi Alonso, François Fleuret, Alexander Nikulin, Yury Belousov, Oleg Svidchenko, Aleksei Shpilman

Abstract: Reinforcement learning competitions advance the field by providing appropriate scope and support to develop solutions toward a specific problem. To promote the development of more broadly applicable methods, organizers need to enforce the use of general techniques, the use of sample-efficient methods, and the reproducibility of the results. While beneficial for the research community, these restri… ▽ More Reinforcement learning competitions advance the field by providing appropriate scope and support to develop solutions toward a specific problem. To promote the development of more broadly applicable methods, organizers need to enforce the use of general techniques, the use of sample-efficient methods, and the reproducibility of the results. While beneficial for the research community, these restrictions come at a cost -- increased difficulty. If the barrier for entry is too high, many potential participants are demoralized. With this in mind, we hosted the third edition of the MineRL ObtainDiamond competition, MineRL Diamond 2021, with a separate track in which we permitted any solution to promote the participation of newcomers. With this track and more extensive tutorials and support, we saw an increased number of submissions. The participants of this easier track were able to obtain a diamond, and the participants of the harder track progressed the generalizable solutions in the same task. △ Less

Submitted 17 February, 2022; originally announced February 2022.

Comments: Under review for PMLR volume on NeurIPS 2021 competitions

arXiv:2202.05748 [pdf, other]

Borrowing from yourself: Faster future video segmentation with partial channel update

Authors: Evann Courdier, François Fleuret

Abstract: Semantic segmentation is a well-addressed topic in the computer vision literature, but the design of fast and accurate video processing networks remains challenging. In addition, to run on embedded hardware, computer vision models often have to make compromises on accuracy to run at the required speed, so that a latency/accuracy trade-off is usually at the heart of these real-time systems' design.… ▽ More Semantic segmentation is a well-addressed topic in the computer vision literature, but the design of fast and accurate video processing networks remains challenging. In addition, to run on embedded hardware, computer vision models often have to make compromises on accuracy to run at the required speed, so that a latency/accuracy trade-off is usually at the heart of these real-time systems' design. For the specific case of videos, models have the additional possibility to make use of computations made for previous frames to mitigate the accuracy loss while being real-time. In this work, we propose to tackle the task of fast future video segmentation prediction through the use of convolutional layers with time-dependent channel masking. This technique only updates a chosen subset of the feature maps at each time-step, bringing simultaneously less computation and latency, and allowing the network to leverage previously computed features. We apply this technique to several fast architectures and experimentally confirm its benefits for the future prediction subtask. △ Less

Submitted 17 June, 2022; v1 submitted 11 February, 2022; originally announced February 2022.

arXiv:2202.05012 [pdf, other]

SUPA: A Lightweight Diagnostic Simulator for Machine Learning in Particle Physics

Authors: Atul Kumar Sinha, Daniele Paliotta, Bálint Máté, Sebastian Pina-Otey, John A. Raine, Tobias Golling, François Fleuret

Abstract: Deep learning methods have gained popularity in high energy physics for fast modeling of particle showers in detectors. Detailed simulation frameworks such as the gold standard Geant4 are computationally intensive, and current deep generative architectures work on discretized, lower resolution versions of the detailed simulation. The development of models that work at higher spatial resolutions is… ▽ More Deep learning methods have gained popularity in high energy physics for fast modeling of particle showers in detectors. Detailed simulation frameworks such as the gold standard Geant4 are computationally intensive, and current deep generative architectures work on discretized, lower resolution versions of the detailed simulation. The development of models that work at higher spatial resolutions is currently hindered by the complexity of the full simulation data, and by the lack of simpler, more interpretable benchmarks. Our contribution is SUPA, the SUrrogate PArticle propagation simulator, an algorithm and software package for generating data by simulating simplified particle propagation, scattering and shower development in matter. The generation is extremely fast and easy to use compared to Geant4, but still exhibits the key characteristics and challenges of the detailed simulation. We support this claim experimentally by showing that performance of generative models on data from our simulator reflects the performance on a dataset generated with Geant4. The proposed simulator generates thousands of particle showers per second on a desktop machine, a speed up of up to 6 orders of magnitudes over Geant4, and stores detailed geometric information about the shower propagation. SUPA provides much greater flexibility for setting initial conditions and defining multiple benchmarks for the development of models. Moreover, interpreting particle showers as point clouds creates a connection to geometric machine learning and provides challenging and fundamentally new datasets for the field. The code for SUPA is available at https://github.com/itsdaniele/SUPA. △ Less

Submitted 21 October, 2022; v1 submitted 10 February, 2022; originally announced February 2022.

arXiv:2202.04414 [pdf, other]

Agree to Disagree: Diversity through Disagreement for Better Transferability

Authors: Matteo Pagliardini, Martin Jaggi, François Fleuret, Sai Praneeth Karimireddy

Abstract: Gradient-based learning algorithms have an implicit simplicity bias which in effect can limit the diversity of predictors being sampled by the learning procedure. This behavior can hinder the transferability of trained models by (i) favoring the learning of simpler but spurious features -- present in the training data but absent from the test data -- and (ii) by only leveraging a small subset of p… ▽ More Gradient-based learning algorithms have an implicit simplicity bias which in effect can limit the diversity of predictors being sampled by the learning procedure. This behavior can hinder the transferability of trained models by (i) favoring the learning of simpler but spurious features -- present in the training data but absent from the test data -- and (ii) by only leveraging a small subset of predictive features. Such an effect is especially magnified when the test distribution does not exactly match the train distribution -- referred to as the Out of Distribution (OOD) generalization problem. However, given only the training data, it is not always possible to apriori assess if a given feature is spurious or transferable. Instead, we advocate for learning an ensemble of models which capture a diverse set of predictive features. Towards this, we propose a new algorithm D-BAT (Diversity-By-disAgreement Training), which enforces agreement among the models on the training data, but disagreement on the OOD data. We show how D-BAT naturally emerges from the notion of generalized discrepancy, as well as demonstrate in multiple experiments how the proposed method can mitigate shortcut-learning, enhance uncertainty and OOD detection, as well as improve transferability. △ Less

Submitted 23 November, 2022; v1 submitted 9 February, 2022; originally announced February 2022.

Comments: 23 pages, 17 figures

arXiv:2112.10408 [pdf, other]

Efficient Wind Speed Nowcasting with GPU-Accelerated Nearest Neighbors Algorithm

Authors: Arnaud Pannatier, Ricardo Picatoste, François Fleuret

Abstract: This paper proposes a simple yet efficient high-altitude wind nowcasting pipeline. It processes efficiently a vast amount of live data recorded by airplanes over the whole airspace and reconstructs the wind field with good accuracy. It creates a unique context for each point in the dataset and then extrapolates from it. As creating such context is computationally intensive, this paper proposes a n… ▽ More This paper proposes a simple yet efficient high-altitude wind nowcasting pipeline. It processes efficiently a vast amount of live data recorded by airplanes over the whole airspace and reconstructs the wind field with good accuracy. It creates a unique context for each point in the dataset and then extrapolates from it. As creating such context is computationally intensive, this paper proposes a novel algorithm that reduces the time and memory cost by efficiently fetching nearest neighbors in a data set whose elements are organized along smooth trajectories that can be approximated with piece-wise linear structures. We introduce an efficient and exact strategy implemented through algebraic tensorial operations, which is well-suited to modern GPU-based computing infrastructure. This method employs a scalable Euclidean metric and allows masking data points along one dimension. When applied, this method is more efficient than plain Euclidean k-NN and other well-known data selection methods such as KDTrees and provides a several-fold speedup. We provide an implementation in PyTorch and a novel data set to allow the replication of empirical results. △ Less

Submitted 22 February, 2022; v1 submitted 20 December, 2021; originally announced December 2021.

Comments: 9 pages, 5 figures, accepted at Siam Data Mining 2022 (SDM 2022)

arXiv:2111.13539 [pdf, other]

GeoNeRF: Generalizing NeRF with Geometry Priors

Authors: Mohammad Mahdi Johari, Yann Lepoittevin, François Fleuret

Abstract: We present GeoNeRF, a generalizable photorealistic novel view synthesis method based on neural radiance fields. Our approach consists of two main stages: a geometry reasoner and a renderer. To render a novel view, the geometry reasoner first constructs cascaded cost volumes for each nearby source view. Then, using a Transformer-based attention mechanism and the cascaded cost volumes, the renderer… ▽ More We present GeoNeRF, a generalizable photorealistic novel view synthesis method based on neural radiance fields. Our approach consists of two main stages: a geometry reasoner and a renderer. To render a novel view, the geometry reasoner first constructs cascaded cost volumes for each nearby source view. Then, using a Transformer-based attention mechanism and the cascaded cost volumes, the renderer infers geometry and appearance, and renders detailed images via classical volume rendering techniques. This architecture, in particular, allows sophisticated occlusion reasoning, gathering information from consistent source views. Moreover, our method can easily be fine-tuned on a single scene, and renders competitive results with per-scene optimized neural rendering methods with a fraction of computational cost. Experiments show that GeoNeRF outperforms state-of-the-art generalizable neural rendering models on various synthetic and real datasets. Lastly, with a slight modification to the geometry reasoner, we also propose an alternative model that adapts to RGBD images. This model directly exploits the depth information often available thanks to depth sensors. The implementation code is available at https://www.idiap.ch/paper/geonerf. △ Less

Submitted 21 March, 2022; v1 submitted 26 November, 2021; originally announced November 2021.

Comments: CVPR2022

arXiv:2110.10232 [pdf, other]

Test time Adaptation through Perturbation Robustness

Authors: Prabhu Teja Sivaprasad, François Fleuret

Abstract: Data samples generated by several real world processes are dynamic in nature \textit{i.e.}, their characteristics vary with time. Thus it is not possible to train and tackle all possible distributional shifts between training and inference, using the host of transfer learning methods in literature. In this paper, we tackle this problem of adapting to domain shift at inference time \textit{i.e.}, w… ▽ More Data samples generated by several real world processes are dynamic in nature \textit{i.e.}, their characteristics vary with time. Thus it is not possible to train and tackle all possible distributional shifts between training and inference, using the host of transfer learning methods in literature. In this paper, we tackle this problem of adapting to domain shift at inference time \textit{i.e.}, we do not change the training process, but quickly adapt the model at test-time to handle any domain shift. For this, we propose to enforce consistency of predictions of data sampled in the vicinity of test sample on the image manifold. On a host of test scenarios like dealing with corruptions (CIFAR-10-C and CIFAR-100-C), and domain adaptation (VisDA-C), our method is at par or significantly outperforms previous methods. △ Less

Submitted 19 October, 2021; originally announced October 2021.

Comments: Under review

arXiv:2109.03709 [pdf, other]

Speeding up PCA with priming

Authors: Bálint Máté, François Fleuret

Abstract: We introduce primed-PCA (pPCA), a two-step algorithm for speeding up the approximation of principal components. This algorithm first runs any approximate-PCA method to get an initial estimate of the principal components (priming), and then applies an exact PCA in the subspace they span. Since this subspace is of small dimension in any practical use, the second step is extremely cheap computational… ▽ More We introduce primed-PCA (pPCA), a two-step algorithm for speeding up the approximation of principal components. This algorithm first runs any approximate-PCA method to get an initial estimate of the principal components (priming), and then applies an exact PCA in the subspace they span. Since this subspace is of small dimension in any practical use, the second step is extremely cheap computationally. Nonetheless, it improves accuracy significantly for a given computational budget across datasets. In this setup, the purpose of the priming is to narrow down the search space, and prepare the data for the second step, an exact calculation. We show formally that pPCA improves upon the priming algorithm under very mild conditions, and we provide experimental validation on both synthetic and real large-scale datasets showing that it systematically translates to improved performance. In our experiments we prime pPCA by several approximate algorithms and report an average speedup by a factor of 7.2 over Oja's rule, and a factor of 10.5 over EigenGame. △ Less

Submitted 20 May, 2022; v1 submitted 8 September, 2021; originally announced September 2021.

arXiv:2104.07972 [pdf, other]

Language Models are Few-Shot Butlers

Authors: Vincent Micheli, François Fleuret

Abstract: Pretrained language models demonstrate strong performance in most NLP tasks when fine-tuned on small task-specific datasets. Hence, these autoregressive models constitute ideal agents to operate in text-based environments where language understanding and generative capabilities are essential. Nonetheless, collecting expert demonstrations in such environments is a time-consuming endeavour. We intro… ▽ More Pretrained language models demonstrate strong performance in most NLP tasks when fine-tuned on small task-specific datasets. Hence, these autoregressive models constitute ideal agents to operate in text-based environments where language understanding and generative capabilities are essential. Nonetheless, collecting expert demonstrations in such environments is a time-consuming endeavour. We introduce a two-stage procedure to learn from a small set of demonstrations and further improve by interacting with an environment. We show that language models fine-tuned with only 1.2% of the expert demonstrations and a simple reinforcement learning algorithm achieve a 51% absolute improvement in success rate over existing methods in the ALFWorld environment. △ Less

Submitted 20 September, 2021; v1 submitted 16 April, 2021; originally announced April 2021.

Comments: EMNLP 2021

arXiv:2104.06045 [pdf, other]

Structural analysis of an all-purpose question answering model

Authors: Vincent Micheli, Quentin Heinrich, François Fleuret, Wacim Belblidia

Abstract: Attention is a key component of the now ubiquitous pre-trained language models. By learning to focus on relevant pieces of information, these Transformer-based architectures have proven capable of tackling several tasks at once and sometimes even surpass their single-task counterparts. To better understand this phenomenon, we conduct a structural analysis of a new all-purpose question answering mo… ▽ More Attention is a key component of the now ubiquitous pre-trained language models. By learning to focus on relevant pieces of information, these Transformer-based architectures have proven capable of tackling several tasks at once and sometimes even surpass their single-task counterparts. To better understand this phenomenon, we conduct a structural analysis of a new all-purpose question answering model that we introduce. Surprisingly, this model retains single-task performance even in the absence of a strong transfer effect between tasks. Through attention head importance scoring, we observe that attention heads specialize in a particular task and that some heads are more conducive to learning than others in both the multi-task and single-task settings. △ Less

Submitted 13 April, 2021; originally announced April 2021.

arXiv:2101.10983 [pdf, other]

Unsupervised clustering of series using dynamic programming and neural processes

Authors: Karthigan Sinnathamby, Chang-Yu Hou, Lalitha Venkataramanan, Vasileios-Marios Gkortsas, François Fleuret

Abstract: Following the work of arXiv:2101.09512, we are interested in clustering a given multi-variate series in an unsupervised manner. We would like to segment and cluster the series such that the resulting blocks present in each cluster are coherent with respect to a predefined model structure (e.g. a physics model with a functional form defined by a number of parameters). However, such approach might h… ▽ More Following the work of arXiv:2101.09512, we are interested in clustering a given multi-variate series in an unsupervised manner. We would like to segment and cluster the series such that the resulting blocks present in each cluster are coherent with respect to a predefined model structure (e.g. a physics model with a functional form defined by a number of parameters). However, such approach might have its limitation, partly because there may exist multiple models that describe the same data, and partly because the exact model behind the data may not immediately known. Hence, it is useful to establish a general framework that enables the integration of plausible models and also accommodates data-driven approach into one approximated model to assist the clustering task. Hence, in this work, we investigate the use of neural processes to build the approximated model while yielding the same assumptions required by the algorithm presented in arXiv:2101.09512. △ Less

Submitted 26 January, 2021; originally announced January 2021.

arXiv:2101.09512 [pdf, other]

Unsupervised clustering of series using dynamic programming

Authors: Karthigan Sinnathamby, Chang-Yu Hou, Lalitha Venkataramanan, Vasileios-Marios Gkortsas, François Fleuret

Abstract: We are interested in clustering parts of a given single multi-variate series in an unsupervised manner. We would like to segment and cluster the series such that the resulting blocks present in each cluster are coherent with respect to a known model (e.g. physics model). Data points are said to be coherent if they can be described using this model with the same parameters. We have designed an algo… ▽ More We are interested in clustering parts of a given single multi-variate series in an unsupervised manner. We would like to segment and cluster the series such that the resulting blocks present in each cluster are coherent with respect to a known model (e.g. physics model). Data points are said to be coherent if they can be described using this model with the same parameters. We have designed an algorithm based on dynamic programming with constraints on the number of clusters, the number of transitions as well as the minimal size of a block such that the clusters are coherent with this process. We present an use-case: clustering of petrophysical series using the Waxman-Smits equation. △ Less

Submitted 23 January, 2021; originally announced January 2021.

arXiv:2010.03813 [pdf, other]

On the importance of pre-training data volume for compact language models

Authors: Vincent Micheli, Martin d'Hoffschmidt, François Fleuret

Abstract: Recent advances in language modeling have led to computationally intensive and resource-demanding state-of-the-art models. In an effort towards sustainable practices, we study the impact of pre-training data volume on compact language models. Multiple BERT-based models are trained on gradually increasing amounts of French text. Through fine-tuning on the French Question Answering Dataset (FQuAD),… ▽ More Recent advances in language modeling have led to computationally intensive and resource-demanding state-of-the-art models. In an effort towards sustainable practices, we study the impact of pre-training data volume on compact language models. Multiple BERT-based models are trained on gradually increasing amounts of French text. Through fine-tuning on the French Question Answering Dataset (FQuAD), we observe that well-performing models are obtained with as little as 100 MB of text. In addition, we show that past critically low amounts of pre-training data, an intermediate pre-training step on the task-specific corpus does not yield substantial improvements. △ Less

Submitted 9 October, 2020; v1 submitted 8 October, 2020; originally announced October 2020.

Comments: EMNLP 2020; typo corrected

arXiv:2007.04825 [pdf, other]

Fast Transformers with Clustered Attention

Authors: Apoorv Vyas, Angelos Katharopoulos, François Fleuret

Abstract: Transformers have been proven a successful model for a variety of tasks in sequence modeling. However, computing the attention matrix, which is their key component, has quadratic complexity with respect to the sequence length, thus making them prohibitively expensive for large sequences. To address this, we propose clustered attention, which instead of computing the attention for every query, grou… ▽ More Transformers have been proven a successful model for a variety of tasks in sequence modeling. However, computing the attention matrix, which is their key component, has quadratic complexity with respect to the sequence length, thus making them prohibitively expensive for large sequences. To address this, we propose clustered attention, which instead of computing the attention for every query, groups queries into clusters and computes attention just for the centroids. To further improve this approximation, we use the computed clusters to identify the keys with the highest attention per query and compute the exact key/query dot products. This results in a model with linear complexity with respect to the sequence length for a fixed number of clusters. We evaluate our approach on two automatic speech recognition datasets and show that our model consistently outperforms vanilla transformers for a given computational budget. Finally, we demonstrate that our model can approximate arbitrarily complex attention distributions with a minimal number of clusters by approximating a pretrained BERT model on GLUE and SQuAD benchmarks with only 25 clusters and no loss in performance. △ Less

Submitted 29 September, 2020; v1 submitted 9 July, 2020; originally announced July 2020.

arXiv:2006.16236 [pdf, other]

Transformers are RNNs: Fast Autoregressive Transformers with Linear Attention

Authors: Angelos Katharopoulos, Apoorv Vyas, Nikolaos Pappas, François Fleuret

Abstract: Transformers achieve remarkable performance in several tasks but due to their quadratic complexity, with respect to the input's length, they are prohibitively slow for very long sequences. To address this limitation, we express the self-attention as a linear dot-product of kernel feature maps and make use of the associativity property of matrix products to reduce the complexity from… ▽ More Transformers achieve remarkable performance in several tasks but due to their quadratic complexity, with respect to the input's length, they are prohibitively slow for very long sequences. To address this limitation, we express the self-attention as a linear dot-product of kernel feature maps and make use of the associativity property of matrix products to reduce the complexity from $\mathcal{O}\left(N^2\right)$ to $\mathcal{O}\left(N\right)$, where $N$ is the sequence length. We show that this formulation permits an iterative implementation that dramatically accelerates autoregressive transformers and reveals their relationship to recurrent neural networks. Our linear transformers achieve similar performance to vanilla transformers and they are up to 4000x faster on autoregressive prediction of very long sequences. △ Less

Submitted 31 August, 2020; v1 submitted 29 June, 2020; originally announced June 2020.

Comments: ICML 2020, project at https://linear-transformers.com/

arXiv:2006.14567 [pdf, other]

Taming GANs with Lookahead-Minmax

Authors: Tatjana Chavdarova, Matteo Pagliardini, Sebastian U. Stich, Francois Fleuret, Martin Jaggi

Abstract: Generative Adversarial Networks are notoriously challenging to train. The underlying minmax optimization is highly susceptible to the variance of the stochastic gradient and the rotational component of the associated game vector field. To tackle these challenges, we propose the Lookahead algorithm for minmax optimization, originally developed for single objective minimization only. The backtrackin… ▽ More Generative Adversarial Networks are notoriously challenging to train. The underlying minmax optimization is highly susceptible to the variance of the stochastic gradient and the rotational component of the associated game vector field. To tackle these challenges, we propose the Lookahead algorithm for minmax optimization, originally developed for single objective minimization only. The backtracking step of our Lookahead-minmax naturally handles the rotational game dynamics, a property which was identified to be key for enabling gradient ascent descent methods to converge on challenging examples often analyzed in the literature. Moreover, it implicitly handles high variance without using large mini-batches, known to be essential for reaching state of the art performance. Experimental results on MNIST, SVHN, CIFAR-10, and ImageNet demonstrate a clear advantage of combining Lookahead-minmax with Adam or extragradient, in terms of performance and improved stability, for negligible memory and computational cost. Using 30-fold fewer parameters and 16-fold smaller minibatches we outperform the reported performance of the class-dependent BigGAN on CIFAR-10 by obtaining FID of 12.19 without using the class labels, bringing state-of-the-art GAN training within reach of common computational resources. △ Less

Submitted 23 June, 2021; v1 submitted 25 June, 2020; originally announced June 2020.

Journal ref: ICLR 2021

arXiv:2006.09128 [pdf, other]

Rethinking the Role of Gradient-Based Attribution Methods for Model Interpretability

Authors: Suraj Srinivas, Francois Fleuret

Abstract: Current methods for the interpretability of discriminative deep neural networks commonly rely on the model's input-gradients, i.e., the gradients of the output logits w.r.t. the inputs. The common assumption is that these input-gradients contain information regarding $p_θ ( y \mid x)$, the model's discriminative capabilities, thus justifying their use for interpretability. However, in this work we… ▽ More Current methods for the interpretability of discriminative deep neural networks commonly rely on the model's input-gradients, i.e., the gradients of the output logits w.r.t. the inputs. The common assumption is that these input-gradients contain information regarding $p_θ ( y \mid x)$, the model's discriminative capabilities, thus justifying their use for interpretability. However, in this work we show that these input-gradients can be arbitrarily manipulated as a consequence of the shift-invariance of softmax without changing the discriminative function. This leaves an open question: if input-gradients can be arbitrary, why are they highly structured and explanatory in standard models? We investigate this by re-interpreting the logits of standard softmax-based classifiers as unnormalized log-densities of the data distribution and show that input-gradients can be viewed as gradients of a class-conditional density model $p_θ(x \mid y)$ implicit within the discriminative model. This leads us to hypothesize that the highly structured and explanatory nature of input-gradients may be due to the alignment of this class-conditional model $p_θ(x \mid y)$ with that of the ground truth data distribution $p_{\text{data}} (x \mid y)$. We test this hypothesis by studying the effect of density alignment on gradient explanations. To achieve this alignment we use score-matching, and propose novel approximations to this algorithm to enable training large-scale models. Our experiments show that improving the alignment of the implicit density model with the data distribution enhances gradient structure and explanatory power while reducing this alignment has the opposite effect. Overall, our finding that input-gradients capture information regarding an implicit generative model implies that we need to re-think their use for interpreting discriminative models. △ Less

Submitted 3 March, 2021; v1 submitted 16 June, 2020; originally announced June 2020.

Comments: Oral Presentation at ICLR 2021

arXiv:2004.02574 [pdf, other]

Real-Time Segmentation Networks should be Latency Aware

Authors: Evann Courdier, Francois Fleuret

Abstract: As scene segmentation systems reach visually accurate results, many recent papers focus on making these network architectures faster, smaller and more efficient. In particular, studies often aim at designingreal-time'systems. Achieving this goal is particularly relevant in the context of real-time video understanding for autonomous vehicles, and robots. In this paper, we argue that the commonly us… ▽ More As scene segmentation systems reach visually accurate results, many recent papers focus on making these network architectures faster, smaller and more efficient. In particular, studies often aim at designingreal-time'systems. Achieving this goal is particularly relevant in the context of real-time video understanding for autonomous vehicles, and robots. In this paper, we argue that the commonly used performance metric of mean Intersection over Union (mIoU) does not fully capture the information required to estimate the true performance of these networks when they operate inreal-time'. We propose a change of objective in the segmentation task, and its associated metric that encapsulates this missing information in the following way: We propose to predict the future output segmentation map that will match the future input frame at the time when the network finishes the processing. We introduce the associated latency-aware metric, from which we can determine a ranking. We perform latency timing experiments of some recent networks on different hardware and assess the performances of these networks on our proposed task. We propose improvements to scene segmentation networks to better perform on our task by using multi-frames input and increasing capacity in the initial convolutional layers. △ Less

Submitted 20 April, 2022; v1 submitted 6 April, 2020; originally announced April 2020.

arXiv:2002.03240 [pdf, other]

Multi-task Reinforcement Learning with a Planning Quasi-Metric

Authors: Vincent Micheli, Karthigan Sinnathamby, François Fleuret

Abstract: We introduce a new reinforcement learning approach combining a planning quasi-metric (PQM) that estimates the number of steps required to go from any state to another, with task-specific "aimers" that compute a target state to reach a given goal. This decomposition allows the sharing across tasks of a task-agnostic model of the quasi-metric that captures the environment's dynamics and can be learn… ▽ More We introduce a new reinforcement learning approach combining a planning quasi-metric (PQM) that estimates the number of steps required to go from any state to another, with task-specific "aimers" that compute a target state to reach a given goal. This decomposition allows the sharing across tasks of a task-agnostic model of the quasi-metric that captures the environment's dynamics and can be learned in a dense and unsupervised manner. We achieve multiple-fold training speed-up compared to recently published methods on the standard bit-flip problem and in the MuJoCo robotic arm simulator. △ Less

Submitted 5 December, 2020; v1 submitted 8 February, 2020; originally announced February 2020.

Comments: Deep RL Workshop, NeurIPS 2020

arXiv:1910.11758 [pdf, other]

Optimizer Benchmarking Needs to Account for Hyperparameter Tuning

Authors: Prabhu Teja Sivaprasad, Florian Mai, Thijs Vogels, Martin Jaggi, François Fleuret

Abstract: The performance of optimizers, particularly in deep learning, depends considerably on their chosen hyperparameter configuration. The efficacy of optimizers is often studied under near-optimal problem-specific hyperparameters, and finding these settings may be prohibitively costly for practitioners. In this work, we argue that a fair assessment of optimizers' performance must take the computational… ▽ More The performance of optimizers, particularly in deep learning, depends considerably on their chosen hyperparameter configuration. The efficacy of optimizers is often studied under near-optimal problem-specific hyperparameters, and finding these settings may be prohibitively costly for practitioners. In this work, we argue that a fair assessment of optimizers' performance must take the computational cost of hyperparameter tuning into account, i.e., how easy it is to find good hyperparameter configurations using an automatic hyperparameter search. Evaluating a variety of optimizers on an extensive set of standard datasets and architectures, our results indicate that Adam is the most practical solution, particularly in low-budget scenarios. △ Less

Submitted 15 August, 2020; v1 submitted 25 October, 2019; originally announced October 2019.

Comments: published at International Conference on Machine Learning (ICML 2020)

arXiv:1905.03711 [pdf, other]

Processing Megapixel Images with Deep Attention-Sampling Models

Authors: Angelos Katharopoulos, François Fleuret

Abstract: Existing deep architectures cannot operate on very large signals such as megapixel images due to computational and memory constraints. To tackle this limitation, we propose a fully differentiable end-to-end trainable model that samples and processes only a fraction of the full resolution input image. The locations to process are sampled from an attention distribution computed from a low resolution… ▽ More Existing deep architectures cannot operate on very large signals such as megapixel images due to computational and memory constraints. To tackle this limitation, we propose a fully differentiable end-to-end trainable model that samples and processes only a fraction of the full resolution input image. The locations to process are sampled from an attention distribution computed from a low resolution view of the input. We refer to our method as attention sampling and it can process images of several megapixels with a standard single GPU setup. We show that sampling from the attention distribution results in an unbiased estimator of the full model with minimal variance, and we derive an unbiased estimator of the gradient that we use to train our model end-to-end with a normal SGD procedure. This new method is evaluated on three classification tasks, where we show that it allows to reduce computation and memory footprint by an order of magnitude for the same accuracy as classical architectures. We also show the consistency of the sampling that indeed focuses on informative parts of the input images. △ Less

Submitted 17 July, 2019; v1 submitted 3 May, 2019; originally announced May 2019.

Comments: Presented in ICML 2019. Code is available at https://github.com/idiap/attention-sampling

Journal ref: Proceedings of the 36th International Conference on Machine Learning, PMLR 97:3282-3291, 2019

arXiv:1905.00780 [pdf, other]

Full-Gradient Representation for Neural Network Visualization

Authors: Suraj Srinivas, Francois Fleuret

Abstract: We introduce a new tool for interpreting neural net responses, namely full-gradients, which decomposes the neural net response into input sensitivity and per-neuron sensitivity components. This is the first proposed representation which satisfies two key properties: completeness and weak dependence, which provably cannot be satisfied by any saliency map-based interpretability method. For convoluti… ▽ More We introduce a new tool for interpreting neural net responses, namely full-gradients, which decomposes the neural net response into input sensitivity and per-neuron sensitivity components. This is the first proposed representation which satisfies two key properties: completeness and weak dependence, which provably cannot be satisfied by any saliency map-based interpretability method. For convolutional nets, we also propose an approximate saliency map representation, called FullGrad, obtained by aggregating the full-gradient components. We experimentally evaluate the usefulness of FullGrad in explaining model behaviour with two quantitative tests: pixel perturbation and remove-and-retrain. Our experiments reveal that our method explains model behaviour correctly, and more comprehensively than other methods in the literature. Visual inspection also reveals that our saliency maps are sharper and more tightly confined to object regions than other methods. △ Less

Submitted 3 December, 2019; v1 submitted 2 May, 2019; originally announced May 2019.

Comments: NeurIPS 2019

arXiv:1904.08598 [pdf, other]

Reducing Noise in GAN Training with Variance Reduced Extragradient

Authors: Tatjana Chavdarova, Gauthier Gidel, François Fleuret, Simon Lacoste-Julien

Abstract: We study the effect of the stochastic gradient noise on the training of generative adversarial networks (GANs) and show that it can prevent the convergence of standard game optimization methods, while the batch version converges. We address this issue with a novel stochastic variance-reduced extragradient (SVRE) optimization algorithm, which for a large class of games improves upon the previous co… ▽ More We study the effect of the stochastic gradient noise on the training of generative adversarial networks (GANs) and show that it can prevent the convergence of standard game optimization methods, while the batch version converges. We address this issue with a novel stochastic variance-reduced extragradient (SVRE) optimization algorithm, which for a large class of games improves upon the previous convergence rates proposed in the literature. We observe empirically that SVRE performs similarly to a batch method on MNIST while being computationally cheaper, and that SVRE yields more stable GAN training on standard datasets. △ Less

Submitted 25 June, 2020; v1 submitted 18 April, 2019; originally announced April 2019.

Comments: latest NeurIPS'19 version

arXiv:1806.01677 [pdf, other]

Practical Deep Stereo (PDS): Toward applications-friendly deep stereo matching

Authors: Stepan Tulyakov, Anton Ivanov, Francois Fleuret

Abstract: End-to-end deep-learning networks recently demonstrated extremely good perfor- mance for stereo matching. However, existing networks are difficult to use for practical applications since (1) they are memory-hungry and unable to process even modest-size images, (2) they have to be trained for a given disparity range. The Practical Deep Stereo (PDS) network that we propose addresses both issues: Fir… ▽ More End-to-end deep-learning networks recently demonstrated extremely good perfor- mance for stereo matching. However, existing networks are difficult to use for practical applications since (1) they are memory-hungry and unable to process even modest-size images, (2) they have to be trained for a given disparity range. The Practical Deep Stereo (PDS) network that we propose addresses both issues: First, its architecture relies on novel bottleneck modules that drastically reduce the memory footprint in inference, and additional design choices allow to handle greater image size during training. This results in a model that leverages large image context to resolve matching ambiguities. Second, a novel sub-pixel cross- entropy loss combined with a MAP estimator make this network less sensitive to ambiguous matches, and applicable to any disparity range without re-training. We compare PDS to state-of-the-art methods published over the recent months, and demonstrate its superior performance on FlyingThings3D and KITTI sets. △ Less

Submitted 5 June, 2018; originally announced June 2018.

arXiv:1803.00942 [pdf, other]

Not All Samples Are Created Equal: Deep Learning with Importance Sampling

Authors: Angelos Katharopoulos, François Fleuret

Abstract: Deep neural network training spends most of the computation on examples that are properly handled, and could be ignored. We propose to mitigate this phenomenon with a principled importance sampling scheme that focuses computation on "informative" examples, and reduces the variance of the stochastic gradients during training. Our contribution is twofold: first, we derive a tractable upper bound to… ▽ More Deep neural network training spends most of the computation on examples that are properly handled, and could be ignored. We propose to mitigate this phenomenon with a principled importance sampling scheme that focuses computation on "informative" examples, and reduces the variance of the stochastic gradients during training. Our contribution is twofold: first, we derive a tractable upper bound to the per-sample gradient norm, and second we derive an estimator of the variance reduction achieved with importance sampling, which enables us to switch it on when it will result in an actual speedup. The resulting scheme can be used by changing a few lines of code in a standard SGD procedure, and we demonstrate experimentally, on image classification, CNN fine-tuning, and RNN training, that for a fixed wall-clock time budget, it provides a reduction of the train losses of up to an order of magnitude and a relative improvement of test errors between 5% and 17%. △ Less

Submitted 28 October, 2019; v1 submitted 2 March, 2018; originally announced March 2018.

Comments: Accepted at ICML 2018 (short oral)

arXiv:1803.00443 [pdf, other]

Knowledge Transfer with Jacobian Matching

Authors: Suraj Srinivas, Francois Fleuret

Abstract: Classical distillation methods transfer representations from a "teacher" neural network to a "student" network by matching their output activations. Recent methods also match the Jacobians, or the gradient of output activations with the input. However, this involves making some ad hoc decisions, in particular, the choice of the loss function. In this paper, we first establish an equivalence betw… ▽ More Classical distillation methods transfer representations from a "teacher" neural network to a "student" network by matching their output activations. Recent methods also match the Jacobians, or the gradient of output activations with the input. However, this involves making some ad hoc decisions, in particular, the choice of the loss function. In this paper, we first establish an equivalence between Jacobian matching and distillation with input noise, from which we derive appropriate loss functions for Jacobian matching. We then rely on this analysis to apply Jacobian matching to transfer learning by establishing equivalence of a recent transfer learning procedure to distillation. We then show experimentally on standard image datasets that Jacobian-based penalties improve distillation, robustness to noisy inputs, and transfer learning. △ Less

Submitted 1 March, 2018; originally announced March 2018.

arXiv:1802.04016 [pdf, other]

Geodesic Convolutional Shape Optimization

Authors: Pierre Baqué, Edoardo Remelli, François Fleuret, Pascal Fua

Abstract: Aerodynamic shape optimization has many industrial applications. Existing methods, however, are so computationally demanding that typical engineering practices are to either simply try a limited number of hand-designed shapes or restrict oneself to shapes that can be parameterized using only few degrees of freedom. In this work, we introduce a new way to optimize complex shapes fast and accurately… ▽ More Aerodynamic shape optimization has many industrial applications. Existing methods, however, are so computationally demanding that typical engineering practices are to either simply try a limited number of hand-designed shapes or restrict oneself to shapes that can be parameterized using only few degrees of freedom. In this work, we introduce a new way to optimize complex shapes fast and accurately. To this end, we train Geodesic Convolutional Neural Networks to emulate a fluidynamics simulator. The key to making this approach practical is remeshing the original shape using a polycube map, which makes it possible to perform the computations on GPUs instead of CPUs. The neural net is then used to formulate an objective function that is differentiable with respect to the shape parameters, which can then be optimized using a gradient-based technique. This outperforms state- of-the-art methods by 5 to 20% for standard problems and, even more importantly, our approach applies to cases that previous methods cannot handle. △ Less

Submitted 12 February, 2018; originally announced February 2018.

arXiv:1712.02330 [pdf, other]

SGAN: An Alternative Training of Generative Adversarial Networks

Authors: Tatjana Chavdarova, François Fleuret

Abstract: The Generative Adversarial Networks (GANs) have demonstrated impressive performance for data synthesis, and are now used in a wide range of computer vision tasks. In spite of this success, they gained a reputation for being difficult to train, what results in a time-consuming and human-involved development process to use them. We consider an alternative training process, named SGAN, in which sev… ▽ More The Generative Adversarial Networks (GANs) have demonstrated impressive performance for data synthesis, and are now used in a wide range of computer vision tasks. In spite of this success, they gained a reputation for being difficult to train, what results in a time-consuming and human-involved development process to use them. We consider an alternative training process, named SGAN, in which several adversarial "local" pairs of networks are trained independently so that a "global" supervising pair of networks can be trained against them. The goal is to train the global pair with the corresponding ensemble opponent for improved performances in terms of mode coverage. This approach aims at increasing the chances that learning will not stop for the global pair, preventing both to be trapped in an unsatisfactory local minimum, or to face oscillations often observed in practice. To guarantee the latter, the global pair never affects the local ones. The rules of SGAN training are thus as follows: the global generator and discriminator are trained using the local discriminators and generators, respectively, whereas the local networks are trained with their fixed local opponent. Experimental results on both toy and real-world problems demonstrate that this approach outperforms standard training in terms of better mitigating mode collapse, stability while converging and that it surprisingly, increases the convergence speed as well. △ Less

Submitted 6 December, 2017; originally announced December 2017.

Showing 1–50 of 69 results for author: Fleuret, F