Search | arXiv e-print repository

GalLoP: Learning Global and Local Prompts for Vision-Language Models

Authors: Marc Lafon, Elias Ramzi, Clément Rambour, Nicolas Audebert, Nicolas Thome

Abstract: Prompt learning has been widely adopted to efficiently adapt vision-language models (VLMs), e.g. CLIP, for few-shot image classification. Despite their success, most prompt learning methods trade off between classification accuracy and robustness, e.g. in domain generalization or out-of-distribution (OOD) detection. In this work, we introduce Global-Local Prompts (GalLoP), a new prompt learning method that learns multiple diverse prompts leveraging both global and local visual features. The training of the local prompts relies on local features with an enhanced vision-text alignment. To focus only on pertinent features, this local alignment is coupled with a sparsity strategy in the selection of the local features. We enforce diversity on the set of prompts using a new "prompt dropout" technique and a multiscale strategy on the local prompts. GalLoP outperforms previous prompt learning methods in accuracy on eleven datasets in different few-shot settings and with various backbones. Furthermore, GalLoP shows strong robustness in both domain generalization and OOD detection, even outperforming dedicated OOD detection methods. Code and instructions to reproduce our results will be open-sourced.

Submitted 1 July, 2024; originally announced July 2024.

Comments: To be published at ECCV 2024

arXiv:2403.10403

Energy Correction Model in the Feature Space for Out-of-Distribution Detection

Authors: Marc Lafon, Clément Rambour, Nicolas Thome

Abstract: In this work, we study the out-of-distribution (OOD) detection problem through the use of the feature space of a pre-trained deep classifier. We show that learning the density of in-distribution (ID) features with an energy-based model (EBM) leads to competitive detection results. However, we found that the non-mixing of MCMC sampling during the EBM's training undermines its detection performance. To overcome this, we propose an energy-based correction of a mixture of class-conditional Gaussian distributions. We obtain favorable results when compared to a strong baseline like the KNN detector on the CIFAR-10/CIFAR-100 OOD detection benchmarks.

Submitted 15 March, 2024; originally announced March 2024.

Comments: NeurIPS ML Safety Workshop (2022)

arXiv:2309.08250

Optimization of Rank Losses for Image Retrieval

Authors: Elias Ramzi, Nicolas Audebert, Clément Rambour, André Araujo, Xavier Bitot, Nicolas Thome

Abstract: In image retrieval, standard evaluation metrics rely on score ranking, e.g. average precision (AP), recall at k (R@k), and normalized discounted cumulative gain (NDCG). In this work, we introduce a general framework for robust and decomposable rank loss optimization. It addresses two major challenges for end-to-end training of deep neural networks with rank losses: non-differentiability and non-decomposability. Firstly, we propose a general surrogate for the ranking operator, SupRank, that is amenable to stochastic gradient descent. It provides an upper bound for rank losses and ensures robust training. Secondly, we use a simple yet effective loss function to reduce the decomposability gap between the averaged batch approximation of ranking losses and their values on the whole training set. We apply our framework to two standard metrics for image retrieval: AP and R@k. Additionally, we apply our framework to hierarchical image retrieval. We introduce an extension of AP, the hierarchical average precision $\mathcal{H}$-AP, and optimize it as well as the NDCG. Finally, we create the first hierarchical landmarks retrieval dataset. We use a semi-automatic pipeline to create hierarchical labels, extending the large-scale Google Landmarks v2 dataset. The hierarchical dataset is publicly available at https://github.com/cvdfoundation/google-landmark. Code will be released at https://github.com/elias-ramzi/SupRank.

Submitted 15 September, 2023; originally announced September 2023.

Comments: arXiv admin note: text overlap with arXiv:2207.04873

arXiv:2307.06795

Leveraging Vision-Language Foundation Models for Fine-Grained Downstream Tasks

Authors: Denis Coquenet, Clément Rambour, Emanuele Dalsasso, Nicolas Thome

Abstract: Vision-language foundation models such as CLIP have shown impressive zero-shot performance on many tasks and datasets, especially thanks to their free-text inputs. However, they struggle to handle some downstream tasks, such as fine-grained attribute detection and localization. In this paper, we propose a multitask fine-tuning strategy based on a positive/negative prompt formulation to further leverage the capacities of vision-language foundation models. Using the CLIP architecture as a baseline, we show strong improvements on bird fine-grained attribute detection and localization tasks, while also increasing the classification performance on the CUB200-2011 dataset. We provide source code for reproducibility purposes: it is available at https://github.com/FactoDeepLearning/MultitaskVLFM.

Submitted 13 July, 2023; originally announced July 2023.

arXiv:2306.08707

VidEdit: Zero-Shot and Spatially Aware Text-Driven Video Editing

Authors: Paul Couairon, Clément Rambour, Jean-Emmanuel Haugeard, Nicolas Thome

Abstract: Recently, diffusion-based generative models have achieved remarkable success for image generation and editing. However, existing diffusion-based video editing approaches lack the ability to offer precise control over generated content that maintains temporal consistency in long-term videos. On the other hand, atlas-based methods provide strong temporal consistency but are costly to edit a video and lack spatial control. In this work, we introduce VidEdit, a novel method for zero-shot text-based video editing that guarantees robust temporal and spatial consistency. In particular, we combine an atlas-based video representation with a pre-trained text-to-image diffusion model to provide a training-free and efficient video editing method, which by design fulfills temporal smoothness. To grant precise user control over generated content, we utilize conditional information extracted from off-the-shelf panoptic segmenters and edge detectors, which guides the diffusion sampling process. This method ensures fine spatial control on targeted regions while strictly preserving the structure of the original video. Our quantitative and qualitative experiments show that VidEdit outperforms state-of-the-art methods on the DAVIS dataset regarding semantic faithfulness, image preservation, and temporal consistency metrics. With this framework, processing a single video takes only approximately one minute, and it can generate multiple compatible edits based on a single text prompt. Project web-page at https://videdit.github.io

Submitted 2 April, 2024; v1 submitted 14 June, 2023; originally announced June 2023.

Comments: TMLR 2024. Project web-page at https://videdit.github.io

arXiv:2305.16966

Hybrid Energy Based Model in the Feature Space for Out-of-Distribution Detection

Authors: Marc Lafon, Elias Ramzi, Clément Rambour, Nicolas Thome

Abstract: Out-of-distribution (OOD) detection is a critical requirement for the deployment of deep neural networks. This paper introduces the HEAT model, a new post-hoc OOD detection method estimating the density of in-distribution (ID) samples using hybrid energy-based models (EBM) in the feature space of a pre-trained backbone. HEAT complements prior density estimators of the ID density, e.g. parametric models like the Gaussian Mixture Model (GMM), to provide an accurate yet robust density estimation. A second contribution is to leverage the EBM framework to provide a unified density estimation and to compose several energy terms. Extensive experiments demonstrate the significance of the two contributions. HEAT sets new state-of-the-art OOD detection results on the CIFAR-10 / CIFAR-100 benchmark as well as on the large-scale ImageNet benchmark. The code is available at: https://github.com/MarcLafon/heatood.

Submitted 1 June, 2023; v1 submitted 26 May, 2023; originally announced May 2023.

Journal ref: International Conference on Machine Learning, ICML 2023, 23-29 July 2023, Honolulu, Hawaii, USA

arXiv:2212.07890

Full Contextual Attention for Multi-resolution Transformers in Semantic Segmentation

Authors: Loic Themyr, Clément Rambour, Nicolas Thome, Toby Collins, Alexandre Hostettler

Abstract: Transformers have proved to be very effective for visual recognition tasks. In particular, vision transformers construct compressed global representations through self-attention and learnable class tokens. Multi-resolution transformers have shown recent successes in semantic segmentation but can only capture local interactions in high-resolution feature maps. This paper extends the notion of global tokens to build GLobal Attention Multi-resolution (GLAM) transformers. GLAM is a generic module that can be integrated into most existing transformer backbones. GLAM includes learnable global tokens, which, unlike previous methods, can model interactions between all image regions, and extracts powerful representations during training. Extensive experiments show that GLAM-Swin and GLAM-Swin-UNet exhibit substantially better performance than their vanilla counterparts on ADE20K and Cityscapes. Moreover, GLAM can be used to segment large 3D medical images, and GLAM-nnFormer achieves new state-of-the-art performance on the BCV dataset.

Submitted 15 December, 2022; originally announced December 2022.

Comments: Winter Conference on Applications of Computer Vision (WACV 2023)

MSC Class: 68T45

arXiv:2210.05313

Memory transformers for full context and high-resolution 3D Medical Segmentation

Authors: Loic Themyr, Clément Rambour, Nicolas Thome, Toby Collins, Alexandre Hostettler

Abstract: Transformer models achieve state-of-the-art results for image segmentation. However, achieving long-range attention, necessary to capture global context, with high-resolution 3D images is a fundamental challenge. This paper introduces the Full resolutIoN mEmory (FINE) transformer to overcome this issue. The core idea behind FINE is to learn memory tokens to indirectly model full-range interactions while scaling well in both memory and computational costs. FINE introduces memory tokens at two levels: the first one allows full interaction between voxels within local image regions (patches), and the second one allows full interaction between all regions of the 3D volume. Combined, they allow full attention over high-resolution images, e.g. 512 x 512 x 256 voxels and above. Experiments on the BCV image segmentation dataset show better performance than state-of-the-art CNN and transformer baselines, highlighting the superiority of our full attention mechanism compared to recent transformer baselines, e.g. CoTr and nnFormer.

Submitted 11 October, 2022; originally announced October 2022.

MSC Class: 68T45

arXiv:2207.04873

Hierarchical Average Precision Training for Pertinent Image Retrieval

Authors: Elias Ramzi, Nicolas Audebert, Nicolas Thome, Clément Rambour, Xavier Bitot

Abstract: Image retrieval is commonly evaluated with Average Precision (AP) or Recall@k. Yet those metrics are limited to binary labels and do not take into account errors' severity. This paper introduces a new hierarchical AP training method for pertinent image retrieval (HAPPIER). HAPPIER is based on a new H-AP metric, which leverages a concept hierarchy to refine AP by integrating errors' importance and better evaluate rankings. To train deep models with H-AP, we carefully study the problem's structure and design a smooth lower bound surrogate combined with a clustering loss that ensures consistent ordering. Extensive experiments on 6 datasets show that HAPPIER significantly outperforms state-of-the-art methods for hierarchical retrieval, while being on par with the latest approaches when evaluating fine-grained ranking performance. Finally, we show that HAPPIER leads to better organization of the embedding space and prevents the most severe failure cases of non-hierarchical methods. Our code is publicly available at: https://github.com/elias-ramzi/HAPPIER.

Submitted 22 July, 2022; v1 submitted 5 July, 2022; originally announced July 2022.

Journal ref: ECCV 2022, Oct 2022, Tel-Aviv, Israel

arXiv:2207.03790

Complementing Brightness Constancy with Deep Networks for Optical Flow Prediction

Authors: Vincent Le Guen, Clément Rambour, Nicolas Thome

Abstract: State-of-the-art methods for optical flow estimation rely on deep learning, which requires complex sequential training schemes to reach optimal performance on real-world data. In this work, we introduce the COMBO deep network that explicitly exploits the brightness constancy (BC) model used in traditional methods. Since BC is an approximate physical model violated in several situations, we propose to train a physically-constrained network complemented with a data-driven network. We introduce a unique and meaningful flow decomposition between the physical prior and the data-driven complement, including an uncertainty quantification of the BC model. We derive a joint training scheme for learning the different components of the decomposition, ensuring an optimal cooperation in both supervised and semi-supervised contexts. Experiments show that COMBO can improve performance over state-of-the-art supervised networks, e.g. RAFT, reaching state-of-the-art results on several benchmarks. We highlight how COMBO can leverage the BC model and adapt to its limitations. Finally, we show that our semi-supervised method can significantly simplify the training procedure.

Submitted 12 July, 2022; v1 submitted 8 July, 2022; originally announced July 2022.

arXiv:2110.01445

Robust and Decomposable Average Precision for Image Retrieval

Authors: Elias Ramzi, Nicolas Thome, Clément Rambour, Nicolas Audebert, Xavier Bitot

Abstract: In image retrieval, standard evaluation metrics rely on score ranking, e.g. average precision (AP). In this paper, we introduce a method for robust and decomposable average precision (ROADMAP) addressing two major challenges for end-to-end training of deep neural networks with AP: non-differentiability and non-decomposability. Firstly, we propose a new differentiable approximation of the rank function, which provides an upper bound of the AP loss and ensures robust training. Secondly, we design a simple yet effective loss function to reduce the decomposability gap between the AP in the whole training set and its averaged batch approximation, for which we provide theoretical guarantees. Extensive experiments conducted on three image retrieval datasets show that ROADMAP outperforms several recent AP approximation methods and highlight the importance of our two contributions. Finally, using ROADMAP for training deep models yields very good performance, outperforming state-of-the-art results on the three datasets.

Submitted 8 December, 2021; v1 submitted 1 October, 2021; originally announced October 2021.

Journal ref: Thirty-fifth Conference on Neural Information Processing Systems (NeurIPS 2021), Dec 2021, Sydney, Australia

arXiv:2103.07202

doi 10.1016/j.cviu.2019.07.011

Urban Surface Reconstruction in SAR Tomography by Graph-Cuts

Authors: Clément Rambour, Loïc Denis, Florence Tupin, Hélène Oriot, Yue Huang, Laurent Ferro-Famil

Abstract: SAR (Synthetic Aperture Radar) tomography reconstructs 3-D volumes from stacks of SAR images. High-resolution satellites such as TerraSAR-X provide images that can be combined to produce 3-D models. In urban areas, sparsity priors are generally enforced during the tomographic inversion process in order to retrieve the location of scatterers seen within a given radar resolution cell. However, such priors often miss parts of the urban surfaces. Those missing parts are typically regions of flat areas such as ground or rooftops. This paper introduces a surface segmentation algorithm based on the computation of the optimal cut in a flow network. This segmentation process can be included within the 3-D reconstruction framework in order to improve the recovery of urban surfaces. Illustrations on a TerraSAR-X tomographic dataset demonstrate the potential of the approach to produce a 3-D model of urban surfaces such as ground, façades and rooftops.

Submitted 12 March, 2021; originally announced March 2021.

Journal ref: Computer Vision and Image Understanding 188 (2019) 102791

arXiv:2103.06104

U-Net Transformer: Self and Cross Attention for Medical Image Segmentation

Authors: Olivier Petit, Nicolas Thome, Clément Rambour, Luc Soler

Abstract: Medical image segmentation remains particularly challenging for complex and low-contrast anatomical structures. In this paper, we introduce the U-Transformer network, which combines a U-shaped architecture for image segmentation with self- and cross-attention from Transformers. U-Transformer overcomes the inability of U-Nets to model long-range contextual interactions and spatial dependencies, which are arguably crucial for accurate segmentation in challenging contexts. To this end, attention mechanisms are incorporated at two main levels: a self-attention module leverages global interactions between encoder features, while cross-attention in the skip connections allows a fine spatial recovery in the U-Net decoder by filtering out non-semantic features. Experiments on two abdominal CT-image datasets show the large performance gain brought out by U-Transformer compared to U-Net and local Attention U-Nets. We also highlight the importance of using both self- and cross-attention, and the nice interpretability features brought out by U-Transformer.

Submitted 12 March, 2021; v1 submitted 10 March, 2021; originally announced March 2021.

Showing 1–13 of 13 results for author: Rambour, C