Showing 1–50 of 113 results for author: Cord, M

  1. arXiv:2406.08113  [pdf, other]

    cs.CV cs.RO

    Valeo4Cast: A Modular Approach to End-to-End Forecasting

    Authors: Yihong Xu, Éloi Zablocki, Alexandre Boulch, Gilles Puy, Mickael Chen, Florent Bartoccioni, Nermin Samet, Oriane Siméoni, Spyros Gidaris, Tuan-Hung Vu, Andrei Bursuc, Eduardo Valle, Renaud Marlet, Matthieu Cord

    Abstract: Motion forecasting is crucial in autonomous driving systems to anticipate the future trajectories of surrounding agents such as pedestrians, vehicles, and traffic signals. In end-to-end forecasting, the model must jointly detect from sensor data (cameras or LiDARs) the position and past trajectories of the different elements of the scene and predict their future location. We depart from the curren…

    Submitted 12 June, 2024; originally announced June 2024.

    Comments: Winning solution of the Argoverse 2 "Unified Detection, Tracking, and Forecasting" challenge, held at CVPR 2024 WAD

  2. arXiv:2406.08074  [pdf, other]

    cs.LG cs.AI cs.CL cs.CV

    A Concept-Based Explainability Framework for Large Multimodal Models

    Authors: Jayneel Parekh, Pegah Khayatan, Mustafa Shukor, Alasdair Newson, Matthieu Cord

    Abstract: Large multimodal models (LMMs) combine unimodal encoders and large language models (LLMs) to perform multimodal tasks. Despite recent advancements towards the interpretability of these models, understanding internal representations of LMMs remains largely a mystery. In this paper, we present a novel framework for the interpretation of LMMs. We propose a dictionary learning based approach, applied…

    Submitted 12 June, 2024; originally announced June 2024.

  3. arXiv:2406.02842  [pdf, other]

    cs.CV

    Zero-Shot Image Segmentation via Recursive Normalized Cut on Diffusion Features

    Authors: Paul Couairon, Mustafa Shukor, Jean-Emmanuel Haugeard, Matthieu Cord, Nicolas Thome

    Abstract: Foundation models have emerged as powerful tools across various domains including language, vision, and multimodal tasks. While prior works have addressed unsupervised image segmentation, they significantly lag behind supervised models. In this paper, we use a diffusion UNet encoder as a foundation vision encoder and introduce DiffCut, an unsupervised zero-shot segmentation method that solely harn…

    Submitted 4 June, 2024; originally announced June 2024.

  4. arXiv:2405.02246  [pdf, other]

    cs.CV cs.AI

    What matters when building vision-language models?

    Authors: Hugo Laurençon, Léo Tronchon, Matthieu Cord, Victor Sanh

    Abstract: The growing interest in vision-language models (VLMs) has been driven by improvements in large language models and vision transformers. Despite the abundance of literature on this subject, we observe that critical decisions regarding the design of VLMs are often not justified. We argue that these unsupported decisions impede progress in the field by making it difficult to identify which choices im…

    Submitted 3 May, 2024; originally announced May 2024.

  5. arXiv:2404.15736  [pdf, other]

    cs.CV cs.AI

    What Makes Multimodal In-Context Learning Work?

    Authors: Folco Bertini Baldassini, Mustafa Shukor, Matthieu Cord, Laure Soulier, Benjamin Piwowarski

    Abstract: Large Language Models have demonstrated remarkable performance across various tasks, exhibiting the capacity to swiftly acquire new skills, such as through In-Context Learning (ICL) with minimal demonstration examples. In this work, we present a comprehensive framework for investigating Multimodal ICL (M-ICL) in the context of Large Multimodal Models. We consider the best open-source multimodal mo…

    Submitted 25 April, 2024; v1 submitted 24 April, 2024; originally announced April 2024.

    Comments: 20 pages, 16 figures. Accepted to CVPR 2024 Workshop on Prompting in Vision. Project page: https://folbaeni.gitlab.io/multimodal-icl

  6. arXiv:2404.05468  [pdf, other]

    q-bio.NC cs.CV cs.LG

    Mind-to-Image: Projecting Visual Mental Imagination of the Brain from fMRI

    Authors: Hugo Caselles-Dupré, Charles Mellerio, Paul Hérent, Alizée Lopez-Persem, Benoit Béranger, Mathieu Soularue, Pierre Fautrel, Gauthier Vernier, Matthieu Cord

    Abstract: The reconstruction of images observed by subjects from fMRI data collected during visual stimuli has made strong progress in the past decade, thanks to the availability of extensive fMRI datasets and advancements in generative models for image generation. However, the application of visual reconstruction has remained limited. Reconstructing visual imagination presents a greater challenge, with pot…

    Submitted 28 May, 2024; v1 submitted 8 April, 2024; originally announced April 2024.

    Comments: Pre-print to be updated. Work in progress

  7. arXiv:2403.20105  [pdf, other]

    cs.CV

    FreeSeg-Diff: Training-Free Open-Vocabulary Segmentation with Diffusion Models

    Authors: Barbara Toniella Corradini, Mustafa Shukor, Paul Couairon, Guillaume Couairon, Franco Scarselli, Matthieu Cord

    Abstract: Foundation models have exhibited unprecedented capabilities in tackling many domains and tasks. Models such as CLIP are currently widely used to bridge cross-modal representations, and text-to-image diffusion models are arguably the leading models in terms of realistic image generation. Image generative models are trained on massive datasets that provide them with powerful internal spatial represe…

    Submitted 29 March, 2024; originally announced March 2024.

  8. arXiv:2403.15098  [pdf, other]

    cs.CV

    UniTraj: A Unified Framework for Scalable Vehicle Trajectory Prediction

    Authors: Lan Feng, Mohammadhossein Bahari, Kaouther Messaoud Ben Amor, Éloi Zablocki, Matthieu Cord, Alexandre Alahi

    Abstract: Vehicle trajectory prediction has increasingly relied on data-driven solutions, but their ability to scale to different data domains and the impact of larger dataset sizes on their generalization remain under-explored. While these questions can be studied by employing multiple datasets, it is challenging due to several discrepancies, e.g., in data formats, map resolution, and semantic annotation t…

    Submitted 27 March, 2024; v1 submitted 22 March, 2024; originally announced March 2024.

  9. arXiv:2403.13499  [pdf, other]

    cs.CV

    Improved Baselines for Data-efficient Perceptual Augmentation of LLMs

    Authors: Théophane Vallaeys, Mustafa Shukor, Matthieu Cord, Jakob Verbeek

    Abstract: The abilities of large language models (LLMs) have recently progressed to unprecedented levels, paving the way to novel applications in a wide variety of areas. In computer vision, LLMs can be used to prime vision-language tasks such as image captioning and visual question answering when coupled with pre-trained vision backbones. While different approaches have been explored to interface LLMs with ``…

    Submitted 20 March, 2024; originally announced March 2024.

  10. arXiv:2312.13863  [pdf, other]

    cs.LG cs.CR cs.RO

    Manipulating Trajectory Prediction with Backdoors

    Authors: Kaouther Messaoud, Kathrin Grosse, Mickael Chen, Matthieu Cord, Patrick Pérez, Alexandre Alahi

    Abstract: Autonomous vehicles ought to predict the surrounding agents' trajectories to allow safe maneuvers in uncertain and complex traffic situations. As companies increasingly apply trajectory prediction in the real world, security becomes a relevant concern. In this paper, we focus on backdoors - a security threat acknowledged in other fields but so far overlooked for trajectory prediction. To this end,…

    Submitted 3 January, 2024; v1 submitted 21 December, 2023; originally announced December 2023.

    Comments: 9 pages, 7 figures

  11. arXiv:2312.09231  [pdf, other]

    cs.CV cs.LG

    Reliability in Semantic Segmentation: Can We Use Synthetic Data?

    Authors: Thibaut Loiseau, Tuan-Hung Vu, Mickael Chen, Patrick Pérez, Matthieu Cord

    Abstract: Assessing the reliability of perception models to covariate shifts and out-of-distribution (OOD) detection is crucial for safety-critical applications such as autonomous vehicles. By nature of the task, however, the relevant data is difficult to collect and annotate. In this paper, we challenge cutting-edge generative models to automatically synthesize data for assessing reliability in semantic se…

    Submitted 14 December, 2023; originally announced December 2023.

    Comments: Project Page: https://valeoai.github.io/blog/publications/GenVal

  12. arXiv:2312.06386  [pdf, other]

    cs.CV cs.AI cs.LG

    ManiPose: Manifold-Constrained Multi-Hypothesis 3D Human Pose Estimation

    Authors: Cédric Rommel, Victor Letzelter, Nermin Samet, Renaud Marlet, Matthieu Cord, Patrick Pérez, Eduardo Valle

    Abstract: Monocular 3D human pose estimation (3D-HPE) is an inherently ambiguous task, as a 2D pose in an image might originate from different possible 3D poses. Yet, most 3D-HPE methods rely on regression models, which assume a one-to-one mapping between inputs and outputs. In this work, we provide theoretical and empirical evidence that, because of this ambiguity, common regression models are bound to pre…

    Submitted 11 December, 2023; originally announced December 2023.

  13. arXiv:2312.00703  [pdf, other]

    cs.CV

    PointBeV: A Sparse Approach to BeV Predictions

    Authors: Loick Chambon, Eloi Zablocki, Mickael Chen, Florent Bartoccioni, Patrick Perez, Matthieu Cord

    Abstract: Bird's-eye View (BeV) representations have emerged as the de-facto shared space in driving applications, offering a unified space for sensor data fusion and supporting various downstream tasks. However, conventional models use grids with fixed resolution and range and face computational inefficiencies due to the uniform allocation of resources across all cells. To address this, we propose PointBeV…

    Submitted 23 May, 2024; v1 submitted 1 December, 2023; originally announced December 2023.

    Comments: https://github.com/valeoai/PointBeV

  14. arXiv:2311.14542  [pdf, other]

    cs.CV

    ToddlerDiffusion: Flash Interpretable Controllable Diffusion Model

    Authors: Eslam Mohamed Bakr, Liangbing Zhao, Vincent Tao Hu, Matthieu Cord, Patrick Perez, Mohamed Elhoseiny

    Abstract: Diffusion-based generative models excel in perceptually impressive synthesis but face challenges in interpretability. This paper introduces ToddlerDiffusion, an interpretable 2D diffusion image-synthesis framework inspired by the human generation system. Unlike traditional diffusion models with opaque denoising steps, our approach decomposes the generation process into simpler, interpretable stage…

    Submitted 24 November, 2023; originally announced November 2023.

  15. arXiv:2310.00647  [pdf, other]

    cs.CV cs.MM

    Beyond Task Performance: Evaluating and Reducing the Flaws of Large Multimodal Models with In-Context Learning

    Authors: Mustafa Shukor, Alexandre Rame, Corentin Dancette, Matthieu Cord

    Abstract: Following the success of Large Language Models (LLMs), Large Multimodal Models (LMMs), such as the Flamingo model and its subsequent competitors, have started to emerge as natural steps towards generalist agents. However, interacting with recent LMMs reveals major limitations that are hardly captured by the current evaluation benchmarks. Indeed, task performances (e.g., VQA accuracy) alone do not…

    Submitted 22 January, 2024; v1 submitted 1 October, 2023; originally announced October 2023.

    Comments: ICLR 2024. Project Page: https://evalign-icl.github.io/

  16. arXiv:2309.09614  [pdf, other]

    cs.CV cs.AI cs.LG

    Gradpaint: Gradient-Guided Inpainting with Diffusion Models

    Authors: Asya Grechka, Guillaume Couairon, Matthieu Cord

    Abstract: Denoising Diffusion Probabilistic Models (DDPMs) have recently achieved remarkable results in conditional and unconditional image generation. The pre-trained models can be adapted without further training to different downstream tasks, by guiding their iterative denoising process at inference time to satisfy additional constraints. For the specific task of image inpainting, the current guiding mec…

    Submitted 18 September, 2023; originally announced September 2023.

  17. arXiv:2309.01575  [pdf, other]

    cs.CV cs.LG

    DiffHPE: Robust, Coherent 3D Human Pose Lifting with Diffusion

    Authors: Cédric Rommel, Eduardo Valle, Mickaël Chen, Souhaiel Khalfaoui, Renaud Marlet, Matthieu Cord, Patrick Pérez

    Abstract: We present an innovative approach to 3D Human Pose Estimation (3D-HPE) by integrating cutting-edge diffusion models, which have revolutionized diverse fields, but are relatively unexplored in 3D-HPE. We show that diffusion models enhance the accuracy, robustness, and coherence of human pose estimations. We introduce DiffHPE, a novel strategy for harnessing diffusion models in 3D-HPE, and demonstra…

    Submitted 4 September, 2023; originally announced September 2023.

    Comments: Accepted to 2023 International Conference on Computer Vision Workshop (Analysis and Modeling of Faces and Gestures)

  18. arXiv:2307.16184  [pdf, other]

    cs.CV cs.LG cs.MM cs.SD eess.AS

    UnIVAL: Unified Model for Image, Video, Audio and Language Tasks

    Authors: Mustafa Shukor, Corentin Dancette, Alexandre Rame, Matthieu Cord

    Abstract: Large Language Models (LLMs) have made the ambitious quest for generalist agents significantly far from being a fantasy. A key hurdle for building such general models is the diversity and heterogeneity of tasks and modalities. A promising solution is unification, allowing the support of a myriad of tasks and modalities within one unified framework. While few large models (e.g., Flamingo (Alayrac e…

    Submitted 22 December, 2023; v1 submitted 30 July, 2023; originally announced July 2023.

    Comments: Accepted at TMLR 2023. 40 pages. Project page: https://unival-model.github.io/

  19. arXiv:2307.09361  [pdf, other]

    cs.CV cs.AI cs.LG

    MOCA: Self-supervised Representation Learning by Predicting Masked Online Codebook Assignments

    Authors: Spyros Gidaris, Andrei Bursuc, Oriane Simeoni, Antonin Vobecky, Nikos Komodakis, Matthieu Cord, Patrick Pérez

    Abstract: Self-supervised learning can be used for mitigating the greedy needs of Vision Transformer networks for very large fully-annotated datasets. Different classes of self-supervised learning offer representations with either good contextual reasoning properties, e.g., using masked image modeling strategies, or invariance to image perturbations, e.g., with contrastive methods. In this work, we propose…

    Submitted 18 July, 2023; originally announced July 2023.

  20. arXiv:2306.16527  [pdf, other]

    cs.IR cs.CV

    OBELICS: An Open Web-Scale Filtered Dataset of Interleaved Image-Text Documents

    Authors: Hugo Laurençon, Lucile Saulnier, Léo Tronchon, Stas Bekman, Amanpreet Singh, Anton Lozhkov, Thomas Wang, Siddharth Karamcheti, Alexander M. Rush, Douwe Kiela, Matthieu Cord, Victor Sanh

    Abstract: Large multimodal models trained on natural documents, which interleave images and text, outperform models trained on image-text pairs on various multimodal benchmarks. However, the datasets used to train these models have not been released, and the collection process has not been fully specified. We introduce the OBELICS dataset, an open web-scale filtered dataset of interleaved image-text documen…

    Submitted 21 August, 2023; v1 submitted 21 June, 2023; originally announced June 2023.

  21. arXiv:2306.13754  [pdf, other]

    cs.CV

    Zero-shot spatial layout conditioning for text-to-image diffusion models

    Authors: Guillaume Couairon, Marlène Careil, Matthieu Cord, Stéphane Lathuilière, Jakob Verbeek

    Abstract: Large-scale text-to-image diffusion models have significantly improved the state of the art in generative image modelling and allow for an intuitive and powerful user interface to drive the image generation process. Expressing spatial constraints, e.g. to position specific objects in particular locations, is cumbersome using text; and current text-based image generation models are not able to accu…

    Submitted 23 June, 2023; originally announced June 2023.

  22. arXiv:2306.09281  [pdf, other]

    cs.RO cs.CV

    Towards Motion Forecasting with Real-World Perception Inputs: Are End-to-End Approaches Competitive?

    Authors: Yihong Xu, Loïck Chambon, Éloi Zablocki, Mickaël Chen, Alexandre Alahi, Matthieu Cord, Patrick Pérez

    Abstract: Motion forecasting is crucial in enabling autonomous vehicles to anticipate the future trajectories of surrounding agents. To do so, it requires solving mapping, detection, tracking, and then forecasting problems, in a multi-step pipeline. In this complex system, advances in conventional forecasting methods have been made using curated data, i.e., with the assumption of perfect maps, detection, an…

    Submitted 5 March, 2024; v1 submitted 15 June, 2023; originally announced June 2023.

    Comments: Accepted to ICRA 2024

  23. arXiv:2306.08751  [pdf, other]

    cs.CV

    Improving Selective Visual Question Answering by Learning from Your Peers

    Authors: Corentin Dancette, Spencer Whitehead, Rishabh Maheshwary, Ramakrishna Vedantam, Stefan Scherer, Xinlei Chen, Matthieu Cord, Marcus Rohrbach

    Abstract: Despite advances in Visual Question Answering (VQA), the ability of models to assess their own correctness remains underexplored. Recent work has shown that VQA models, out-of-the-box, can have difficulties abstaining from answering when they are wrong. The option to abstain, also called Selective Prediction, is highly relevant when deploying systems to users who must trust the system's output (e.…

    Submitted 14 June, 2023; originally announced June 2023.

    Comments: CVPR 2023. Code available here: https://github.com/facebookresearch/selective-vqa_ood

  24. arXiv:2306.04488  [pdf, other]

    cs.LG cs.AI cs.CV

    Rewarded soups: towards Pareto-optimal alignment by interpolating weights fine-tuned on diverse rewards

    Authors: Alexandre Ramé, Guillaume Couairon, Mustafa Shukor, Corentin Dancette, Jean-Baptiste Gaya, Laure Soulier, Matthieu Cord

    Abstract: Foundation models are first pre-trained on vast unsupervised datasets and then fine-tuned on labeled data. Reinforcement learning, notably from human feedback (RLHF), can further align the network with the intended usage. Yet the imperfections in the proxy reward may hinder the training and lead to suboptimal results; the diversity of objectives in real-world tasks and human opinions exacerbate th…

    Submitted 16 October, 2023; v1 submitted 7 June, 2023; originally announced June 2023.

  25. arXiv:2303.11403  [pdf, other]

    cs.CV cs.CL cs.LG

    eP-ALM: Efficient Perceptual Augmentation of Language Models

    Authors: Mustafa Shukor, Corentin Dancette, Matthieu Cord

    Abstract: Large Language Models (LLMs) have so far impressed the world, with unprecedented capabilities that emerge in models at large scales. On the vision side, transformer models (i.e., ViT) are following the same trend, achieving the best performance on challenging benchmarks. With the abundance of such unimodal models, a natural question arises; do we need also to follow this trend to tackle multimodal…

    Submitted 27 October, 2023; v1 submitted 20 March, 2023; originally announced March 2023.

    Comments: Accepted at ICCV 2023. Project page: https://mshukor.github.io/eP-ALM.github.io/

  26. arXiv:2301.09858  [pdf, other]

    cs.CV

    PowerQuant: Automorphism Search for Non-Uniform Quantization

    Authors: Edouard Yvinec, Arnaud Dapogny, Matthieu Cord, Kevin Bailly

    Abstract: Deep neural networks (DNNs) are nowadays ubiquitous in many domains such as computer vision. However, due to their high latency, the deployment of DNNs hinges on the development of compression techniques such as quantization which consists in lowering the number of bits used to encode the weights and activations. Growing concerns for privacy and security have motivated the development of data-free…

    Submitted 24 January, 2023; originally announced January 2023.

  27. arXiv:2212.10445  [pdf, other]

    cs.LG cs.AI cs.CV

    Model Ratatouille: Recycling Diverse Models for Out-of-Distribution Generalization

    Authors: Alexandre Ramé, Kartik Ahuja, Jianyu Zhang, Matthieu Cord, Léon Bottou, David Lopez-Paz

    Abstract: Foundation models are redefining how AI systems are built. Practitioners now follow a standard procedure to build their machine learning solutions: from a pre-trained foundation model, they fine-tune the weights on the target task of interest. So, the Internet is swarmed by a handful of foundation models fine-tuned on many diverse tasks: these individual fine-tunings exist in isolation without ben…

    Submitted 9 August, 2023; v1 submitted 20 December, 2022; originally announced December 2022.

    Comments: 24 pages, 10 tables, 21 figures

  28. arXiv:2212.04884  [pdf, other]

    cs.CV

    Co-training $2^L$ Submodels for Visual Recognition

    Authors: Hugo Touvron, Matthieu Cord, Maxime Oquab, Piotr Bojanowski, Jakob Verbeek, Hervé Jégou

    Abstract: We introduce submodel co-training, a regularization method related to co-training, self-distillation and stochastic depth. Given a neural network to be trained, for each sample we implicitly instantiate two altered networks, "submodels", with stochastic depth: we activate only a subset of the layers. Each network serves as a soft teacher to the other, by providing a loss that complements the reg…

    Submitted 9 December, 2022; originally announced December 2022.

  29. arXiv:2212.04267  [pdf, other]

    cs.CV cs.LG

    Vision and Structured-Language Pretraining for Cross-Modal Food Retrieval

    Authors: Mustafa Shukor, Nicolas Thome, Matthieu Cord

    Abstract: Vision-Language Pretraining (VLP) and Foundation models have been the go-to recipe for achieving SoTA performance on general benchmarks. However, leveraging these powerful techniques for more complex vision-language tasks, such as cooking applications, with more structured input data, is still little investigated. In this work, we propose to leverage these techniques for structured-text based comp…

    Submitted 15 March, 2023; v1 submitted 8 December, 2022; originally announced December 2022.

    Comments: Code: https://github.com/mshukor/VLPCook

  30. arXiv:2211.13999  [pdf, other]

    cs.CV

    CoMFormer: Continual Learning in Semantic and Panoptic Segmentation

    Authors: Fabio Cermelli, Matthieu Cord, Arthur Douillard

    Abstract: Continual learning for segmentation has recently seen increasing interest. However, all previous works focus on narrow semantic segmentation and disregard panoptic segmentation, an important task with real-world impacts. In this paper, we present the first continual learning model capable of operating on both semantic and panoptic segmentation. Inspired by recent transformer approaches that con…

    Submitted 25 November, 2022; originally announced November 2022.

    Comments: Under submission

  31. arXiv:2211.12380  [pdf, other]

    cs.CV cs.AI

    OCTET: Object-aware Counterfactual Explanations

    Authors: Mehdi Zemni, Mickaël Chen, Éloi Zablocki, Hédi Ben-Younes, Patrick Pérez, Matthieu Cord

    Abstract: Nowadays, deep vision models are being widely deployed in safety-critical applications, e.g., autonomous driving, and explainability of such models is becoming a pressing concern. Among explanation methods, counterfactual explanations aim to find minimal and interpretable changes to the input image that would also change the output of the model to be explained. Such explanations point end-users at…

    Submitted 24 March, 2023; v1 submitted 22 November, 2022; originally announced November 2022.

    Comments: CVPR 2023

  32. arXiv:2210.11427  [pdf, other]

    cs.CV

    DiffEdit: Diffusion-based semantic image editing with mask guidance

    Authors: Guillaume Couairon, Jakob Verbeek, Holger Schwenk, Matthieu Cord

    Abstract: Image generation has recently seen tremendous advances, with diffusion models allowing to synthesize convincing images for a large variety of text prompts. In this article, we propose DiffEdit, a method to take advantage of text-conditioned diffusion models for the task of semantic image editing, where the goal is to edit an image based on a text query. Semantic image editing is an extension of im…

    Submitted 20 October, 2022; originally announced October 2022.

    Comments: Preprint

  33. arXiv:2208.13628  [pdf, other]

    cs.CV cs.LG

    Efficient Vision-Language Pretraining with Visual Concepts and Hierarchical Alignment

    Authors: Mustafa Shukor, Guillaume Couairon, Matthieu Cord

    Abstract: Vision and Language Pretraining has become the prevalent approach for tackling multimodal downstream tasks. The current trend is to move towards ever larger models and pretraining datasets. This computational headlong rush does not seem reasonable in the long term to move toward sustainable solutions, and de facto excludes academic laboratories with limited resources. In this work, we propose a ne…

    Submitted 5 October, 2022; v1 submitted 29 August, 2022; originally announced August 2022.

    Comments: BMVC 2022

  34. arXiv:2207.04089  [pdf, other]

    cs.CV

    SInGE: Sparsity via Integrated Gradients Estimation of Neuron Relevance

    Authors: Edouard Yvinec, Arnaud Dapogny, Matthieu Cord, Kevin Bailly

    Abstract: The leap in performance in state-of-the-art computer vision methods is attributed to the development of deep neural networks. However it often comes at a computational price which may hinder their deployment. To alleviate this limitation, structured pruning is a well known technique which consists in removing channels, neurons or filters, and is commonly applied in order to produce more compact mo…

    Submitted 8 July, 2022; originally announced July 2022.

  35. arXiv:2206.13294  [pdf, other]

    cs.CV cs.AI cs.RO

    LaRa: Latents and Rays for Multi-Camera Bird's-Eye-View Semantic Segmentation

    Authors: Florent Bartoccioni, Éloi Zablocki, Andrei Bursuc, Patrick Pérez, Matthieu Cord, Karteek Alahari

    Abstract: Recent works in autonomous driving have widely adopted the bird's-eye-view (BEV) semantic map as an intermediate representation of the world. Online prediction of these BEV maps involves non-trivial operations such as multi-camera data extraction as well as fusion and projection into a common topview grid. This is usually done with error-prone geometric operations (e.g., homography or back-project…

    Submitted 26 November, 2022; v1 submitted 27 June, 2022; originally announced June 2022.

    MSC Class: 68T45

    Journal ref: CoRL 2022 https://openreview.net/forum?id=abd_D-iVjk0

  36. arXiv:2205.10873  [pdf, other]

    cs.CV

    Dynamic Query Selection for Fast Visual Perceiver

    Authors: Corentin Dancette, Matthieu Cord

    Abstract: Transformers have been matching deep convolutional networks for vision architectures in recent works. Most work is focused on getting the best results on large-scale benchmarks, and scaling laws seem to be the most successful strategy: bigger models, more data, and longer training result in higher performance. However, the reduction of network complexity and inference time remains under-explored.…

    Submitted 21 March, 2023; v1 submitted 22 May, 2022; originally announced May 2022.

    Comments: Accepted at the Transformer for Vision workshop, CVPR 2022

  37. arXiv:2205.10158  [pdf, other]

    cs.CV cs.LG

    Swapping Semantic Contents for Mixing Images

    Authors: Rémy Sun, Clément Masson, Gilles Hénaff, Nicolas Thome, Matthieu Cord

    Abstract: Deep architectures have proven capable of solving many tasks provided a sufficient amount of labeled data. In fact, the amount of available labeled data has become the principal bottleneck in low label settings such as Semi-Supervised Learning. Mixing Data Augmentations do not typically yield new labeled samples, as indiscriminately mixing contents creates between-class samples. In this work, we in…

    Submitted 20 May, 2022; originally announced May 2022.

    Comments: Accepted at ICPR 2022, 7 pages, 4 figures, 6 tables

  38. arXiv:2205.10139  [pdf, other]

    cs.LG

    Towards efficient feature sharing in MIMO architectures

    Authors: Rémy Sun, Alexandre Ramé, Clément Masson, Nicolas Thome, Matthieu Cord

    Abstract: Multi-input multi-output architectures propose to train multiple subnetworks within one base network and then average the subnetwork predictions to benefit from ensembling for free. Despite some relative success, these architectures are wasteful in their use of parameters. Indeed, we highlight in this paper that the learned subnetworks fail to share even generic features which limits their applicab…

    Submitted 20 May, 2022; originally announced May 2022.

    Comments: 7 pages, 6 figures, 1 table

  39. arXiv:2205.09739  [pdf, other]

    cs.CV cs.AI cs.LG

    Diverse Weight Averaging for Out-of-Distribution Generalization

    Authors: Alexandre Ramé, Matthieu Kirchmeyer, Thibaud Rahier, Alain Rakotomamonjy, Patrick Gallinari, Matthieu Cord

    Abstract: Standard neural networks struggle to generalize under distribution shifts in computer vision. Fortunately, combining multiple networks can consistently improve out-of-distribution generalization. In particular, weight averaging (WA) strategies were shown to perform best on the competitive DomainBed benchmark; they directly average the weights of multiple networks despite their nonlinearities. In t…

    Submitted 27 January, 2023; v1 submitted 19 May, 2022; originally announced May 2022.

    Comments: 36 pages, 16 figures, 15 tables

  40. arXiv:2204.11667  [pdf, other]

    cs.CV

    Multi-Head Distillation for Continual Unsupervised Domain Adaptation in Semantic Segmentation

    Authors: Antoine Saporta, Arthur Douillard, Tuan-Hung Vu, Patrick Pérez, Matthieu Cord

    Abstract: Unsupervised Domain Adaptation (UDA) is a transfer learning task which aims at training on an unlabeled target domain by leveraging a labeled source domain. Beyond the traditional scope of UDA with a single source domain and a single target domain, real-world perception systems face a variety of scenarios to handle, from varying lighting conditions to many cities around the world. In this context,…

    Submitted 25 April, 2022; originally announced April 2022.

    Comments: Published at the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) 2022 Workshop on Continual Learning

  41. arXiv:2204.09730  [pdf, other]

    cs.CV

    Transformer Decoders with MultiModal Regularization for Cross-Modal Food Retrieval

    Authors: Mustafa Shukor, Guillaume Couairon, Asya Grechka, Matthieu Cord

    Abstract: Cross-modal image-recipe retrieval has gained significant attention in recent years. Most work focuses on improving cross-modal embeddings using unimodal encoders, that allow for efficient retrieval in large-scale databases, leaving aside cross-attention between modalities which is more computationally expensive. We propose a new retrieval framework, T-Food (Transformer Decoders with MultiModal Re…

    Submitted 20 April, 2022; originally announced April 2022.

    Comments: Accepted at CVPR 2022, MULA Workshop. Code is available at https://github.com/mshukor/TFood

  42. arXiv:2204.07118  [pdf, other]

    cs.CV

    DeiT III: Revenge of the ViT

    Authors: Hugo Touvron, Matthieu Cord, Hervé Jégou

    Abstract: A Vision Transformer (ViT) is a simple neural architecture amenable to serve several computer vision tasks. It has limited built-in architectural priors, in contrast to more recent architectures that incorporate priors either about the input data or of specific tasks. Recent works show that ViTs benefit from self-supervised pre-training, in particular BERT-like pre-training like BeiT. In this pape…

    Submitted 14 April, 2022; originally announced April 2022.

  43. arXiv:2203.14645  [pdf, other

    cs.CV

    REx: Data-Free Residual Quantization Error Expansion

    Authors: Edouard Yvinec, Arnaud Dapogny, Matthieu Cord, Kevin Bailly

    Abstract: Deep neural networks (DNNs) are ubiquitous in computer vision and natural language processing, but suffer from high inference cost. This problem can be addressed by quantization, which consists in converting floating point operations into a lower bit-width format. With the growing concerns on privacy rights, we focus our efforts on data-free methods. However, such techniques suffer from their lack…

    Submitted 29 May, 2023; v1 submitted 28 March, 2022; originally announced March 2022.

  44. arXiv:2203.14642  [pdf, other

    cs.CV

    SPIQ: Data-Free Per-Channel Static Input Quantization

    Authors: Edouard Yvinec, Arnaud Dapogny, Matthieu Cord, Kevin Bailly

    Abstract: Computationally expensive neural networks are ubiquitous in computer vision, and solutions for efficient inference have drawn growing attention in the machine learning community. Examples of such solutions include quantization, i.e., converting the processing values (weights and inputs) from floating point into integers, e.g., int8 or int4. Concurrently, the rise of privacy concerns motivated the s…

    Submitted 28 March, 2022; originally announced March 2022.

  45. arXiv:2203.09795  [pdf, other

    cs.CV

    Three things everyone should know about Vision Transformers

    Authors: Hugo Touvron, Matthieu Cord, Alaaeldin El-Nouby, Jakob Verbeek, Hervé Jégou

    Abstract: After their initial success in natural language processing, transformer architectures have rapidly gained traction in computer vision, providing state-of-the-art results for tasks such as image classification, detection, segmentation, and video analysis. We offer three insights based on simple and easy to implement variants of vision transformers. (1) The residual layers of vision transformers, wh…

    Submitted 18 March, 2022; originally announced March 2022.

  46. arXiv:2203.04705  [pdf, other

    cs.CV

    FlexIT: Towards Flexible Semantic Image Translation

    Authors: Guillaume Couairon, Asya Grechka, Jakob Verbeek, Holger Schwenk, Matthieu Cord

    Abstract: Deep generative models, like GANs, have considerably improved the state of the art in image synthesis, and are able to generate near photo-realistic images in structured domains such as human faces. Based on this success, recent work on image editing proceeds by projecting images to the GAN latent space and manipulating the latent vector. However, these approaches are limited in that only images f…

    Submitted 9 March, 2022; originally announced March 2022.

    Comments: Accepted at CVPR 2022

  47. arXiv:2112.13692  [pdf, other

    cs.CV

    Augmenting Convolutional networks with attention-based aggregation

    Authors: Hugo Touvron, Matthieu Cord, Alaaeldin El-Nouby, Piotr Bojanowski, Armand Joulin, Gabriel Synnaeve, Hervé Jégou

    Abstract: We show how to augment any convolutional network with an attention-based global map to achieve non-local reasoning. We replace the final average pooling by an attention-based aggregation layer akin to a single transformer block, that weights how the patches are involved in the classification decision. We plug this learned aggregation layer with a simplistic patch-based convolutional network parame…

    Submitted 27 December, 2021; originally announced December 2021.

  48. arXiv:2112.03252  [pdf, other

    cs.CV

    CSG0: Continual Urban Scene Generation with Zero Forgetting

    Authors: Himalaya Jain, Tuan-Hung Vu, Patrick Pérez, Matthieu Cord

    Abstract: With the rapid advances in generative adversarial networks (GANs), the visual quality of synthesised scenes keeps improving, including for complex urban scenes with applications to automated driving. We address in this work a continual scene generation setup in which GANs are trained on a stream of distinct domains; ideally, the learned models should eventually be able to generate new scenes in al…

    Submitted 2 May, 2022; v1 submitted 6 December, 2021; originally announced December 2021.

    Comments: Published at the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) 2022 Workshop on Continual Learning

  49. arXiv:2112.03162  [pdf, other

    cs.CV cs.CL

    Embedding Arithmetic of Multimodal Queries for Image Retrieval

    Authors: Guillaume Couairon, Matthieu Cord, Matthijs Douze, Holger Schwenk

    Abstract: Latent text representations exhibit geometric regularities, such as the famous analogy: queen is to king what woman is to man. Such structured semantic relations were not demonstrated on image representations. Recent works aiming at bridging this semantic gap embed images and text into a multimodal space, enabling the transfer of text-defined transformations to the image modality. We introduce the…

    Submitted 20 October, 2022; v1 submitted 6 December, 2021; originally announced December 2021.

    Comments: Accepted at O-DRUM (CVPR workshop 2022)

  50. arXiv:2111.11326  [pdf, other

    cs.CV cs.LG

    DyTox: Transformers for Continual Learning with DYnamic TOken eXpansion

    Authors: Arthur Douillard, Alexandre Ramé, Guillaume Couairon, Matthieu Cord

    Abstract: Deep network architectures struggle to continually learn new tasks without forgetting the previous tasks. A recent trend indicates that dynamic architectures based on an expansion of the parameters can reduce catastrophic forgetting efficiently in continual learning. However, existing approaches often require a task identifier at test-time, need complex tuning to balance the growing number of para…

    Submitted 7 August, 2022; v1 submitted 22 November, 2021; originally announced November 2021.

    Comments: CVPR 2022, Code at https://github.com/arthurdouillard/dytox