Search | arXiv e-print repository

doi 10.1007/s40314-024-02836-x

Properties of core-EP matrices and binary relationships

Authors: Ehsan Kheirandish, Abbas Salemi, Néstor Thome

Abstract: In this paper, various properties of core-EP matrices are investigated. We introduce the MPDMP matrix associated with $A$ and by means of it, some properties and equivalent conditions of core-EP matrices can be obtained. Also, properties of MPD, DMP, and CMP inverses are studied and we prove that in the class of core-EP matrices, DMP, MPD, and Drazin inverses are the same. Moreover, DMP and MPD binary relation orders are introduced and the relationships between these orders and other binary relation orders are considered.

Submitted 6 July, 2024; v1 submitted 2 July, 2024; originally announced July 2024.

Comments: 20 pages

MSC Class: 15A09; 15A45

arXiv:2407.02217 [pdf, other]

Physics-Informed Model and Hybrid Planning for Efficient Dyna-Style Reinforcement Learning

Authors: Zakariae El Asri, Olivier Sigaud, Nicolas Thome

Abstract: Applying reinforcement learning (RL) to real-world applications requires addressing a trade-off between asymptotic performance, sample efficiency, and inference time. In this work, we demonstrate how to address this triple challenge by leveraging partial physical knowledge about the system dynamics. Our approach involves learning a physics-informed model to boost sample efficiency and generating imaginary trajectories from this model to learn a model-free policy and Q-function. Furthermore, we propose a hybrid planning strategy, combining the learned policy and Q-function with the learned model to enhance time efficiency in planning. Through practical demonstrations, we illustrate that our method improves the compromise between sample efficiency, time efficiency, and performance over state-of-the-art methods.

Submitted 2 July, 2024; originally announced July 2024.

arXiv:2407.01400 [pdf, other]

GalLoP: Learning Global and Local Prompts for Vision-Language Models

Authors: Marc Lafon, Elias Ramzi, Clément Rambour, Nicolas Audebert, Nicolas Thome

Abstract: Prompt learning has been widely adopted to efficiently adapt vision-language models (VLMs), e.g. CLIP, for few-shot image classification. Despite their success, most prompt learning methods trade off between classification accuracy and robustness, e.g. in domain generalization or out-of-distribution (OOD) detection. In this work, we introduce Global-Local Prompts (GalLoP), a new prompt learning method that learns multiple diverse prompts leveraging both global and local visual features. The training of the local prompts relies on local features with an enhanced vision-text alignment. To focus only on pertinent features, this local alignment is coupled with a sparsity strategy in the selection of the local features. We enforce diversity on the set of prompts using a new "prompt dropout" technique and a multiscale strategy on the local prompts. GalLoP outperforms previous prompt learning methods on accuracy on eleven datasets in different few-shot settings and with various backbones. Furthermore, GalLoP shows strong robustness in both domain generalization and OOD detection, even outperforming dedicated OOD detection methods. Code and instructions to reproduce our results will be open-sourced.

Submitted 1 July, 2024; originally announced July 2024.

Comments: To be published at ECCV 2024

arXiv:2406.02842 [pdf, other]

Zero-Shot Image Segmentation via Recursive Normalized Cut on Diffusion Features

Authors: Paul Couairon, Mustafa Shukor, Jean-Emmanuel Haugeard, Matthieu Cord, Nicolas Thome

Abstract: Foundation models have emerged as powerful tools across various domains including language, vision, and multimodal tasks. While prior works have addressed unsupervised image segmentation, they significantly lag behind supervised models. In this paper, we use a diffusion UNet encoder as a foundation vision encoder and introduce DiffCut, an unsupervised zero-shot segmentation method that solely harnesses the output features from the final self-attention block. Through extensive experimentation, we demonstrate that using these diffusion features in a graph-based segmentation algorithm significantly outperforms previous state-of-the-art methods on zero-shot segmentation. Specifically, we leverage a recursive Normalized Cut algorithm that softly regulates the granularity of detected objects and produces well-defined segmentation maps that precisely capture intricate image details. Our work highlights the remarkably accurate semantic knowledge embedded within diffusion UNet encoders that could then serve as foundation vision encoders for downstream tasks. Project page at https://diffcut-segmentation.github.io

Submitted 4 June, 2024; originally announced June 2024.

arXiv:2403.14201 [pdf, ps, other]

Parametrizing $W$-weighted BT inverse to obtain the $W$-weighted $q$-BT inverse

Authors: D. E. Ferreyra, N. Thome, C. Torigino

Abstract: The core-EP and BT inverses for rectangular matrices were studied recently in the literature. The main aim of this paper is to unify both concepts by means of a new kind of generalized inverse called $W$-weighted $q$-BT inverse. We analyze its existence and uniqueness by considering an adequate matrix system. Basic properties and some interesting characterizations are proved for this new weighted generalized inverse. Also, we give a canonical form of the $W$-weighted $q$-BT inverse by means of the weighted core-EP decomposition.

Submitted 21 March, 2024; originally announced March 2024.

MSC Class: 15A09; 15A24

arXiv:2403.10403 [pdf, other]

Energy Correction Model in the Feature Space for Out-of-Distribution Detection

Authors: Marc Lafon, Clément Rambour, Nicolas Thome

Abstract: In this work, we study the out-of-distribution (OOD) detection problem through the use of the feature space of a pre-trained deep classifier. We show that learning the density of in-distribution (ID) features with an energy-based model (EBM) leads to competitive detection results. However, we found that the non-mixing of MCMC sampling during the EBM's training undermines its detection performance. To overcome this, we propose an energy-based correction of a mixture of class-conditional Gaussian distributions. We obtain favorable results when compared to a strong baseline like the KNN detector on the CIFAR-10/CIFAR-100 OOD detection benchmarks.

Submitted 15 March, 2024; originally announced March 2024.

Comments: NeurIPS ML Safety Workshop (2022)

arXiv:2402.09699 [pdf, ps, other]

doi 10.1080/03081087.2024.2316786

G-Drazin inverse combined with inner inverse

Authors: G. Maharana, J. K. Sahoo, Nestor Thome

Abstract: This paper introduces new classes of generalized inverses for square matrices named GD1, and the dual, called 1GD inverse. In addition, we discuss a few characterizations and representations of these inverses. The explicit expressions of these inverses have been established via core-nilpotent decomposition. Further, we introduce a binary relation for GD1 inverse and 1GD inverse, along with a few derived properties.

Submitted 14 February, 2024; originally announced February 2024.

Comments: 16 pages, Linear and Multilinear Algebra (2024)

arXiv:2401.13121 [pdf, other]

Procrustes problem for the inverse eigenvalue problem of normal (skew) $J$-Hamiltonian matrices and normal $J$-symplectic matrices

Authors: S. Gigola, L. Lebtahi, N. Thome

Abstract: A square complex matrix $A$ is called (skew) $J$-Hamiltonian if $AJ$ is (skew) hermitian where $J$ is a real normal matrix such that $J^2=-I$, where $I$ is the identity matrix. In this paper, we solve the Procrustes problem to find normal (skew) $J$-Hamiltonian solutions for the inverse eigenvalue problem. In addition, a similar problem is investigated for normal $J$-symplectic matrices.

Submitted 23 January, 2024; originally announced January 2024.

Comments: 25 pages

arXiv:2401.09106 [pdf, ps, other]

Extending EP matrices by means of recent generalized inverses

Authors: D. E. Ferreyra, F. E. Levis, A. N. Priori, N. Thome

Abstract: It is well known that a square complex matrix is called EP if it commutes with its Moore-Penrose inverse. In this paper, new classes of matrices which extend this concept are characterized. For that, we consider commutative equalities given by matrices of arbitrary index and generalized inverses recently investigated in the literature. More specifically, these classes are characterized by expressions of type $A^mX=XA^m$, where $X$ is an outer inverse of a given complex square matrix $A$ and $m$ is an arbitrary positive integer. The relationships between the different classes of matrices are also analyzed. Finally, a picture presents an overview of the overall studied classes.

Submitted 17 January, 2024; originally announced January 2024.

MSC Class: 15A09; 15A27

arXiv:2310.12646 [pdf, other]

TRUSTED: The Paired 3D Transabdominal Ultrasound and CT Human Data for Kidney Segmentation and Registration Research

Authors: William Ndzimbong, Cyril Fourniol, Loic Themyr, Nicolas Thome, Yvonne Keeza, Beniot Sauer, Pierre-Thierry Piechaud, Arnaud Mejean, Jacques Marescaux, Daniel George, Didier Mutter, Alexandre Hostettler, Toby Collins

Abstract: Inter-modal image registration (IMIR) and image segmentation with abdominal Ultrasound (US) data have many important clinical applications, including image-guided surgery, automatic organ measurement and robotic navigation. However, research is severely limited by the lack of public datasets. We propose TRUSTED (the Tridimensional Renal Ultra Sound TomodEnsitometrie Dataset), comprising paired transabdominal 3DUS and CT kidney images from 48 human patients (96 kidneys), including segmentation and anatomical landmark annotations by two experienced radiographers. Inter-rater segmentation agreement was over 94 (Dice score), and gold-standard segmentations were generated using the STAPLE algorithm. Seven anatomical landmarks were annotated, important for IMIR systems development and evaluation. To validate the dataset's utility, 5 competitive Deep Learning models for automatic kidney segmentation were benchmarked, yielding average DICE scores from 83.2% to 89.1% for CT, and 61.9% to 79.4% for US images. Three IMIR methods were benchmarked, and Coherent Point Drift performed best with an average Target Registration Error of 4.53mm. The TRUSTED dataset may be used freely by researchers to develop and validate new segmentation and IMIR methods.

Submitted 19 October, 2023; originally announced October 2023.

Comments: Alexandre Hostettler, and Toby Collins share last authorship

arXiv:2309.08250 [pdf, other]

Optimization of Rank Losses for Image Retrieval

Authors: Elias Ramzi, Nicolas Audebert, Clément Rambour, André Araujo, Xavier Bitot, Nicolas Thome

Abstract: In image retrieval, standard evaluation metrics rely on score ranking, e.g. average precision (AP), recall at k (R@k), normalized discounted cumulative gain (NDCG). In this work we introduce a general framework for robust and decomposable rank loss optimization. It addresses two major challenges for end-to-end training of deep neural networks with rank losses: non-differentiability and non-decomposability. First, we propose a general surrogate for the ranking operator, SupRank, that is amenable to stochastic gradient descent. It provides an upper bound for rank losses and ensures robust training. Second, we use a simple yet effective loss function to reduce the decomposability gap between the averaged batch approximation of ranking losses and their values on the whole training set. We apply our framework to two standard metrics for image retrieval: AP and R@k. Additionally, we apply our framework to hierarchical image retrieval. We introduce an extension of AP, the hierarchical average precision $\mathcal{H}$-AP, and optimize it as well as the NDCG. Finally, we create the first hierarchical landmarks retrieval dataset. We use a semi-automatic pipeline to create hierarchical labels, extending the large scale Google Landmarks v2 dataset. The hierarchical dataset is publicly available at https://github.com/cvdfoundation/google-landmark. Code will be released at https://github.com/elias-ramzi/SupRank.

Submitted 15 September, 2023; originally announced September 2023.

Comments: arXiv admin note: text overlap with arXiv:2207.04873

arXiv:2307.06795 [pdf, other]

Leveraging Vision-Language Foundation Models for Fine-Grained Downstream Tasks

Authors: Denis Coquenet, Clément Rambour, Emanuele Dalsasso, Nicolas Thome

Abstract: Vision-language foundation models such as CLIP have shown impressive zero-shot performance on many tasks and datasets, especially thanks to their free-text inputs. However, they struggle to handle some downstream tasks, such as fine-grained attribute detection and localization. In this paper, we propose a multitask fine-tuning strategy based on a positive/negative prompt formulation to further leverage the capacities of the vision-language foundation models. Using the CLIP architecture as baseline, we show strong improvements on bird fine-grained attribute detection and localization tasks, while also increasing the classification performance on the CUB200-2011 dataset. We provide source code for reproducibility purposes: it is available at https://github.com/FactoDeepLearning/MultitaskVLFM.

Submitted 13 July, 2023; originally announced July 2023.

arXiv:2306.08707 [pdf, other]

VidEdit: Zero-Shot and Spatially Aware Text-Driven Video Editing

Authors: Paul Couairon, Clément Rambour, Jean-Emmanuel Haugeard, Nicolas Thome

Abstract: Recently, diffusion-based generative models have achieved remarkable success for image generation and editing. However, existing diffusion-based video editing approaches lack the ability to offer precise control over generated content that maintains temporal consistency in long-term videos. On the other hand, atlas-based methods provide strong temporal consistency but are costly to edit a video and lack spatial control. In this work, we introduce VidEdit, a novel method for zero-shot text-based video editing that guarantees robust temporal and spatial consistency. In particular, we combine an atlas-based video representation with a pre-trained text-to-image diffusion model to provide a training-free and efficient video editing method, which by design fulfills temporal smoothness. To grant precise user control over generated content, we utilize conditional information extracted from off-the-shelf panoptic segmenters and edge detectors which guides the diffusion sampling process. This method ensures a fine spatial control on targeted regions while strictly preserving the structure of the original video. Our quantitative and qualitative experiments show that VidEdit outperforms state-of-the-art methods on the DAVIS dataset, regarding semantic faithfulness, image preservation, and temporal consistency metrics. With this framework, processing a single video only takes approximately one minute, and it can generate multiple compatible edits based on a unique text prompt. Project web-page at https://videdit.github.io

Submitted 2 April, 2024; v1 submitted 14 June, 2023; originally announced June 2023.

Comments: TMLR 2024. Project web-page at https://videdit.github.io

arXiv:2305.16966 [pdf, other]

Hybrid Energy Based Model in the Feature Space for Out-of-Distribution Detection

Authors: Marc Lafon, Elias Ramzi, Clément Rambour, Nicolas Thome

Abstract: Out-of-distribution (OOD) detection is a critical requirement for the deployment of deep neural networks. This paper introduces the HEAT model, a new post-hoc OOD detection method estimating the density of in-distribution (ID) samples using hybrid energy-based models (EBM) in the feature space of a pre-trained backbone. HEAT complements prior density estimators of the ID density, e.g. parametric models like the Gaussian Mixture Model (GMM), to provide an accurate yet robust density estimation. A second contribution is to leverage the EBM framework to provide a unified density estimation and to compose several energy terms. Extensive experiments demonstrate the significance of the two contributions. HEAT sets new state-of-the-art OOD detection results on the CIFAR-10 / CIFAR-100 benchmark as well as on the large-scale Imagenet benchmark. The code is available at: https://github.com/MarcLafon/heatood.

Submitted 1 June, 2023; v1 submitted 26 May, 2023; originally announced May 2023.

Journal ref: International Conference on Machine Learning, ICML 2023, 23-29 July 2023, Honolulu, Hawaii, USA

arXiv:2302.10803 [pdf, other]

Eagle: Large-Scale Learning of Turbulent Fluid Dynamics with Mesh Transformers

Authors: Steeven Janny, Aurélien Béneteau, Madiha Nadri, Julie Digne, Nicolas Thome, Christian Wolf

Abstract: Estimating fluid dynamics is classically done through the simulation and integration of numerical models solving the Navier-Stokes equations, which is computationally complex and time-consuming even on high-end hardware. This is a notoriously hard problem to solve, which has recently been addressed with machine learning, in particular graph neural networks (GNN) and variants trained and evaluated on datasets of static objects in static scenes with fixed geometry. We attempt to go beyond existing work in complexity and introduce a new model, method and benchmark. We propose EAGLE, a large-scale dataset of 1.1 million 2D meshes resulting from simulations of unsteady fluid dynamics caused by a moving flow source interacting with nonlinear scene structure, comprised of 600 different scenes of three different types. To perform future forecasting of pressure and velocity on the challenging EAGLE dataset, we introduce a new mesh transformer. It leverages node clustering, graph pooling and global attention to learn long-range dependencies between spatially distant data points without needing a large number of iterations, as existing GNN methods do. We show that our transformer outperforms the state of the art on both existing synthetic and real datasets and on EAGLE. Finally, we highlight that our approach learns to attend to airflow, integrating complex information in a single iteration.

Submitted 17 March, 2023; v1 submitted 16 February, 2023; originally announced February 2023.

Comments: Published as a conference paper at ICLR 2023

Journal ref: International Conference on Learning Representations (ICLR) 2023

arXiv:2302.03462 [pdf, other]

Diverse Probabilistic Trajectory Forecasting with Admissibility Constraints

Authors: Laura Calem, Hedi Ben-Younes, Patrick Pérez, Nicolas Thome

Abstract: Predicting multiple trajectories for road users is important for automated driving systems: ego-vehicle motion planning indeed requires a clear view of the possible motions of the surrounding agents. However, the generative models used for multiple-trajectory forecasting suffer from a lack of diversity in their proposals. To avoid this form of collapse, we propose a novel method for structured prediction of diverse trajectories. To this end, we complement an underlying pretrained generative model with a diversity component, based on a determinantal point process (DPP). We balance and structure this diversity with the inclusion of knowledge-based quality constraints, independent from the underlying generative model. We combine these two novel components with a gating operation, ensuring that the predictions are both diverse and within the drivable area. We demonstrate on the nuScenes driving dataset the relevance of our compound approach, which yields significant improvements in the diversity and the quality of the generated trajectories.

Submitted 7 February, 2023; originally announced February 2023.

Journal ref: International Conference on Pattern Recognition (ICPR) 2022

arXiv:2212.07890 [pdf, other]

Full Contextual Attention for Multi-resolution Transformers in Semantic Segmentation

Authors: Loic Themyr, Clement Rambour, Nicolas Thome, Toby Collins, Alexandre Hostettler

Abstract: Transformers have proved to be very effective for visual recognition tasks. In particular, vision transformers construct compressed global representations through self-attention and learnable class tokens. Multi-resolution transformers have shown recent successes in semantic segmentation but can only capture local interactions in high-resolution feature maps. This paper extends the notion of global tokens to build GLobal Attention Multi-resolution (GLAM) transformers. GLAM is a generic module that can be integrated into most existing transformer backbones. GLAM includes learnable global tokens, which unlike previous methods can model interactions between all image regions, and extracts powerful representations during training. Extensive experiments show that GLAM-Swin or GLAM-Swin-UNet exhibit substantially better performances than their vanilla counterparts on ADE20K and Cityscapes. Moreover, GLAM can be used to segment large 3D medical images, and GLAM-nnFormer achieves new state-of-the-art performance on the BCV dataset.

Submitted 15 December, 2022; originally announced December 2022.

Comments: Winter Conference on Applications of Computer Vision (WACV 2023)

MSC Class: 68T45

arXiv:2212.04267 [pdf, other]

Vision and Structured-Language Pretraining for Cross-Modal Food Retrieval

Authors: Mustafa Shukor, Nicolas Thome, Matthieu Cord

Abstract: Vision-Language Pretraining (VLP) and Foundation models have been the go-to recipe for achieving SoTA performance on general benchmarks. However, leveraging these powerful techniques for more complex vision-language tasks, such as cooking applications, with more structured input data, is still little investigated. In this work, we propose to leverage these techniques for structured-text based computational cuisine tasks. Our strategy, dubbed VLPCook, first transforms existing image-text pairs to image and structured-text pairs. This allows us to pretrain our VLPCook model using VLP objectives adapted to the structured data of the resulting datasets, then finetune it on downstream computational cooking tasks. During finetuning, we also enrich the visual encoder, leveraging pretrained foundation models (e.g. CLIP) to provide local and global textual context. VLPCook outperforms current SoTA by a significant margin (+3.3 Recall@1 absolute improvement) on the task of Cross-Modal Food Retrieval on the large Recipe1M dataset. We conduct further experiments on VLP to validate their importance, especially on the Recipe1M+ dataset. Finally, we validate the generalization of the approach to other tasks (i.e., Food Recognition) and domains with structured text such as the Medical domain on the ROCO dataset. The code is available here: https://github.com/mshukor/VLPCook

Submitted 15 March, 2023; v1 submitted 8 December, 2022; originally announced December 2022.

Comments: Code: https://github.com/mshukor/VLPCook

arXiv:2210.05313 [pdf, other]

Memory transformers for full context and high-resolution 3D Medical Segmentation

Authors: Loic Themyr, Clément Rambour, Nicolas Thome, Toby Collins, Alexandre Hostettler

Abstract: Transformer models achieve state-of-the-art results for image segmentation. However, achieving long-range attention, necessary to capture global context, with high-resolution 3D images is a fundamental challenge. This paper introduces the Full resolutIoN mEmory (FINE) transformer to overcome this issue. The core idea behind FINE is to learn memory tokens to indirectly model full range interactions while scaling well in both memory and computational costs. FINE introduces memory tokens at two levels: the first one allows full interaction between voxels within local image regions (patches), the second one allows full interactions between all regions of the 3D volume. Combined, they allow full attention over high resolution images, e.g. 512 x 512 x 256 voxels and above. Experiments on the BCV image segmentation dataset show better performance than state-of-the-art CNN and transformer baselines, highlighting the superiority of our full attention mechanism compared to recent transformer baselines, e.g. CoTr and nnFormer.

Submitted 11 October, 2022; originally announced October 2022.

MSC Class: 68T45

arXiv:2208.12625 [pdf, other]

Take One Gram of Neural Features, Get Enhanced Group Robustness

Authors: Simon Roburin, Charles Corbière, Gilles Puy, Nicolas Thome, Matthieu Aubry, Renaud Marlet, Patrick Pérez

Abstract: Predictive performance of machine learning models trained with empirical risk minimization (ERM) can degrade considerably under distribution shifts. The presence of spurious correlations in training datasets leads ERM-trained models to display high loss when evaluated on minority groups not presenting such correlations. Extensive attempts have been made to develop methods improving worst-group robustness. However, they require group information for each training input or, at least, a validation set with group labels to tune their hyperparameters, which may be expensive to get or unknown a priori. In this paper, we address the challenge of improving group robustness without group annotation during training or validation. To this end, we propose to partition the training dataset into groups based on Gram matrices of features extracted by an "identification" model and to apply robust optimization based on these pseudo-groups. In the realistic context where no group labels are available, our experiments show that our approach not only improves group robustness over ERM but also outperforms all recent baselines.

Submitted 7 February, 2023; v1 submitted 26 August, 2022; originally announced August 2022.

Comments: Long version (Previous version: OOD-CV Workshop @ ECCV 2022)

arXiv:2207.04873 [pdf, other]

Hierarchical Average Precision Training for Pertinent Image Retrieval

Authors: Elias Ramzi, Nicolas Audebert, Nicolas Thome, Clément Rambour, Xavier Bitot

Abstract: Image Retrieval is commonly evaluated with Average Precision (AP) or Recall@k. Yet those metrics are limited to binary labels and do not take into account errors' severity. This paper introduces a new hierarchical AP training method for pertinent image retrieval (HAPPIER). HAPPIER is based on a new H-AP metric, which leverages a concept hierarchy to refine AP by integrating errors' importance and better evaluate rankings. To train deep models with H-AP, we carefully study the problem's structure and design a smooth lower bound surrogate combined with a clustering loss that ensures consistent ordering. Extensive experiments on 6 datasets show that HAPPIER significantly outperforms state-of-the-art methods for hierarchical retrieval, while being on par with the latest approaches when evaluating fine-grained ranking performances. Finally, we show that HAPPIER leads to better organization of the embedding space, and prevents most severe failure cases of non-hierarchical methods. Our code is publicly available at: https://github.com/elias-ramzi/HAPPIER.

Submitted 22 July, 2022; v1 submitted 5 July, 2022; originally announced July 2022.

Journal ref: ECCV 2022, Oct 2022, Tel-Aviv, Israel

arXiv:2207.03790 [pdf, other]

Complementing Brightness Constancy with Deep Networks for Optical Flow Prediction

Authors: Vincent Le Guen, Clément Rambour, Nicolas Thome

Abstract: State-of-the-art methods for optical flow estimation rely on deep learning and require complex sequential training schemes to reach optimal performances on real-world data. In this work, we introduce the COMBO deep network that explicitly exploits the brightness constancy (BC) model used in traditional methods. Since BC is an approximate physical model violated in several situations, we propose to train a physically-constrained network complemented with a data-driven network. We introduce a unique and meaningful flow decomposition between the physical prior and the data-driven complement, including an uncertainty quantification of the BC model. We derive a joint training scheme for learning the different components of the decomposition ensuring an optimal cooperation, in a supervised but also in a semi-supervised context. Experiments show that COMBO can improve performances over state-of-the-art supervised networks, e.g. RAFT, reaching state-of-the-art results on several benchmarks. We highlight how COMBO can leverage the BC model and adapt to its limitations. Finally, we show that our semi-supervised method can significantly simplify the training procedure.

Submitted 12 July, 2022; v1 submitted 8 July, 2022; originally announced July 2022.

arXiv:2205.10158 [pdf, other]

Swapping Semantic Contents for Mixing Images

Authors: Rémy Sun, Clément Masson, Gilles Hénaff, Nicolas Thome, Matthieu Cord

Abstract: Deep architectures have proven capable of solving many tasks provided a sufficient amount of labeled data. In fact, the amount of available labeled data has become the principal bottleneck in low label settings such as Semi-Supervised Learning. Mixing Data Augmentations do not typically yield new labeled samples, as indiscriminately mixing contents creates between-class samples. In this work, we introduce the SciMix framework, which learns a generator that embeds a semantic style code into image backgrounds, yielding a new mixing scheme for data augmentation. We then demonstrate that SciMix yields novel mixed samples that inherit many characteristics from their non-semantic parents. Afterwards, we verify those samples can be used to improve the performance of semi-supervised frameworks like Mean Teacher or Fixmatch, and even fully supervised learning on a small labeled dataset.

Submitted 20 May, 2022; originally announced May 2022.

Comments: Accepted at ICPR 2022, 7 pages, 4 figures, 6 tables

arXiv:2205.10139 [pdf, other]

Towards efficient feature sharing in MIMO architectures

Authors: Rémy Sun, Alexandre Ramé, Clément Masson, Nicolas Thome, Matthieu Cord

Abstract: Multi-input multi-output architectures propose to train multiple subnetworks within one base network and then average the subnetwork predictions to benefit from ensembling for free. Despite some relative success, these architectures are wasteful in their use of parameters. Indeed, we highlight in this paper that the learned subnetworks fail to share even generic features, which limits their applicability on smaller mobile and AR/VR devices. We posit this behavior stems from an ill-posed part of the multi-input multi-output framework. To solve this issue, we propose a novel unmixing step in MIMO architectures that allows subnetworks to properly share features. Preliminary experiments on CIFAR-100 show our adjustments allow feature sharing and improve model performance for small architectures.

Submitted 20 May, 2022; originally announced May 2022.

Comments: 7 pages, 6 figures, 1 table

arXiv:2110.01445 [pdf, other]

Robust and Decomposable Average Precision for Image Retrieval

Authors: Elias Ramzi, Nicolas Thome, Clément Rambour, Nicolas Audebert, Xavier Bitot

Abstract: In image retrieval, standard evaluation metrics rely on score ranking, e.g. average precision (AP). In this paper, we introduce a method for robust and decomposable average precision (ROADMAP) addressing two major challenges for end-to-end training of deep neural networks with AP: non-differentiability and non-decomposability. Firstly, we propose a new differentiable approximation of the rank function, which provides an upper bound of the AP loss and ensures robust training. Secondly, we design a simple yet effective loss function to reduce the decomposability gap between the AP in the whole training set and its averaged batch approximation, for which we provide theoretical guarantees. Extensive experiments conducted on three image retrieval datasets show that ROADMAP outperforms several recent AP approximation methods and highlight the importance of our two contributions. Finally, using ROADMAP for training deep models yields very good performances, outperforming state-of-the-art results on the three datasets.

Submitted 8 December, 2021; v1 submitted 1 October, 2021; originally announced October 2021.

Journal ref: Thirty-fifth Conference on Neural Information Processing Systems (NeurIPS 2021), Dec 2021, Sydney, Australia

arXiv:2104.04610 [pdf, other]

Deep Time Series Forecasting with Shape and Temporal Criteria

Authors: Vincent Le Guen, Nicolas Thome

Abstract: This paper addresses the problem of multi-step time series forecasting for non-stationary signals that can present sudden changes. Current state-of-the-art deep learning forecasting methods, often trained with variants of the MSE, lack the ability to provide sharp predictions in deterministic and probabilistic contexts. To handle these challenges, we propose to incorporate shape and temporal criteria in the training objective of deep models. We define shape and temporal similarities and dissimilarities, based on a smooth relaxation of Dynamic Time Warping (DTW) and Temporal Distortion Index (TDI), that enable building differentiable loss functions and positive semi-definite (PSD) kernels. With these tools, we introduce DILATE (DIstortion Loss including shApe and TimE), a new objective for deterministic forecasting, that explicitly incorporates two terms supporting precise shape and temporal change detection. For probabilistic forecasting, we introduce STRIPE++ (Shape and Time diveRsIty in Probabilistic forEcasting), a framework for providing a set of sharp and diverse forecasts, where the structured shape and time diversity is enforced with a determinantal point process (DPP) diversity loss. Extensive experiments and ablation studies on synthetic and real-world datasets confirm the benefits of leveraging shape and time features in time series forecasting.

Submitted 17 February, 2022; v1 submitted 9 April, 2021; originally announced April 2021.

Comments: arXiv admin note: text overlap with arXiv:2010.07349

arXiv:2103.06104 [pdf, other]

U-Net Transformer: Self and Cross Attention for Medical Image Segmentation

Authors: Olivier Petit, Nicolas Thome, Clément Rambour, Luc Soler

Abstract: Medical image segmentation remains particularly challenging for complex and low-contrast anatomical structures. In this paper, we introduce the U-Transformer network, which combines a U-shaped architecture for image segmentation with self- and cross-attention from Transformers. U-Transformer overcomes the inability of U-Nets to model long-range contextual interactions and spatial dependencies, which are arguably crucial for accurate segmentation in challenging contexts. To this end, attention mechanisms are incorporated at two main levels: a self-attention module leverages global interactions between encoder features, while cross-attention in the skip connections allows a fine spatial recovery in the U-Net decoder by filtering out non-semantic features. Experiments on two abdominal CT-image datasets show the large performance gain brought out by U-Transformer compared to U-Net and local Attention U-Nets. We also highlight the importance of using both self- and cross-attention, and the nice interpretability features brought out by U-Transformer.

Submitted 12 March, 2021; v1 submitted 10 March, 2021; originally announced March 2021.

arXiv:2012.06508 [pdf, other]

Confidence Estimation via Auxiliary Models

Authors: Charles Corbière, Nicolas Thome, Antoine Saporta, Tuan-Hung Vu, Matthieu Cord, Patrick Pérez

Abstract: Reliably quantifying the confidence of deep neural classifiers is a challenging yet fundamental requirement for deploying such models in safety-critical applications. In this paper, we introduce a novel target criterion for model confidence, namely the true class probability (TCP). We show that TCP offers better properties for confidence estimation than standard maximum class probability (MCP). Since the true class is by essence unknown at test time, we propose to learn the TCP criterion from data with an auxiliary model, introducing a specific learning scheme adapted to this context. We evaluate our approach on the task of failure prediction and of self-training with pseudo-labels for domain adaptation, which both necessitate effective confidence estimates. Extensive experiments are conducted for validating the relevance of the proposed approach in each task. We study various network architectures and experiment with small and large datasets for image classification and semantic segmentation. In every tested benchmark, our approach outperforms strong baselines.

Submitted 31 May, 2021; v1 submitted 11 December, 2020; originally announced December 2020.

Comments: Accepted to TPAMI 2021

arXiv:2010.07349 [pdf, other]

Probabilistic Time Series Forecasting with Structured Shape and Temporal Diversity

Authors: Vincent Le Guen, Nicolas Thome

Abstract: Probabilistic forecasting consists in predicting a distribution of possible future outcomes. In this paper, we address this problem for non-stationary time series, which is very challenging yet crucially important. We introduce the STRIPE model for representing structured diversity based on shape and time features, ensuring both probable predictions while being sharp and accurate. STRIPE is agnostic to the forecasting model, and we equip it with a diversification mechanism relying on determinantal point processes (DPP). We introduce two DPP kernels for modeling diverse trajectories in terms of shape and time, which are both differentiable and proved to be positive semi-definite. To have an explicit control on the diversity structure, we also design an iterative sampling mechanism to disentangle shape and time representations in the latent space. Experiments carried out on synthetic datasets show that STRIPE significantly outperforms baseline methods for representing diversity, while maintaining accuracy of the forecasting model. We also highlight the relevance of the iterative sampling scheme and the importance of using different criteria for measuring quality and diversity. Finally, experiments on real datasets illustrate that STRIPE is able to outperform state-of-the-art probabilistic forecasting approaches in the best sample prediction.

Submitted 10 April, 2021; v1 submitted 14 October, 2020; originally announced October 2020.

arXiv:2010.04456 [pdf, other]

doi 10.1088/1742-5468/ac3ae5

Augmenting Physical Models with Deep Networks for Complex Dynamics Forecasting

Authors: Yuan Yin, Vincent Le Guen, Jérémie Dona, Emmanuel de Bézenac, Ibrahim Ayed, Nicolas Thome, Patrick Gallinari

Abstract: Forecasting complex dynamical phenomena in settings where only partial knowledge of their dynamics is available is a prevalent problem across various scientific fields. While purely data-driven approaches are arguably insufficient in this context, standard physical modeling based approaches tend to be over-simplistic, inducing non-negligible errors. In this work, we introduce the APHYNITY framework, a principled approach for augmenting incomplete physical dynamics described by differential equations with deep data-driven models. It consists in decomposing the dynamics into two components: a physical component accounting for the dynamics for which we have some prior knowledge, and a data-driven component accounting for errors of the physical model. The learning problem is carefully formulated such that the physical model explains as much of the data as possible, while the data-driven component only describes information that cannot be captured by the physical model, no more, no less. This not only provides the existence and uniqueness for this decomposition, but also ensures interpretability and benefits generalization. Experiments made on three important use cases, each representative of a different family of phenomena, i.e. reaction-diffusion equations, wave equations and the non-linear damped pendulum, show that APHYNITY can efficiently leverage approximate physical models to accurately forecast the evolution of the system and correctly identify relevant physical parameters. Code is available at https://github.com/yuan-yin/APHYNITY.

Submitted 10 May, 2022; v1 submitted 9 October, 2020; originally announced October 2020.

Comments: Accepted at ICLR 2021 (Oral)

Journal ref: J. Stat. Mech. (2021) 124012

arXiv:2003.01460 [pdf, other]

Disentangling Physical Dynamics from Unknown Factors for Unsupervised Video Prediction

Authors: Vincent Le Guen, Nicolas Thome

Abstract: Leveraging physical knowledge described by partial differential equations (PDEs) is an appealing way to improve unsupervised video prediction methods. Since physics is too restrictive for describing the full visual content of generic videos, we introduce PhyDNet, a two-branch deep architecture, which explicitly disentangles PDE dynamics from unknown complementary information. A second contribution is to propose a new recurrent physical cell (PhyCell), inspired by data assimilation techniques, for performing PDE-constrained prediction in latent space. Extensive experiments conducted on four different datasets show the ability of PhyDNet to outperform state-of-the-art methods. Ablation studies also highlight the important gain brought out by both disentanglement and PDE-constrained prediction. Finally, we show that PhyDNet presents interesting features for dealing with missing data and long-term forecasting.

Submitted 16 March, 2020; v1 submitted 3 March, 2020; originally announced March 2020.

Report number: CVPR 2020

arXiv:1910.04851 [pdf, other]

Addressing Failure Prediction by Learning Model Confidence

Authors: Charles Corbière, Nicolas Thome, Avner Bar-Hen, Matthieu Cord, Patrick Pérez

Abstract: Reliably assessing the confidence of a deep neural network and predicting its failures is of primary importance for the practical deployment of these models. In this paper, we propose a new target criterion for model confidence, corresponding to the True Class Probability (TCP). We show how using the TCP is more suited than relying on the classic Maximum Class Probability (MCP). We provide in addition theoretical guarantees for TCP in the context of failure prediction. Since the true class is by essence unknown at test time, we propose to learn the TCP criterion on the training set, introducing a specific learning scheme adapted to this context. Extensive experiments are conducted for validating the relevance of the proposed approach. We study various network architectures, small and large scale datasets for image classification and semantic segmentation. We show that our approach consistently outperforms several strong methods, from MCP to Bayesian uncertainty, as well as recent approaches specifically designed for failure prediction.

Submitted 26 October, 2019; v1 submitted 1 October, 2019; originally announced October 2019.

Comments: NeurIPS 2019 (accepted)

arXiv:1909.09020 [pdf, other]

Shape and Time Distortion Loss for Training Deep Time Series Forecasting Models

Authors: Vincent Le Guen, Nicolas Thome

Abstract: This paper addresses the problem of time series forecasting for non-stationary signals and multiple future steps prediction. To handle this challenging task, we introduce DILATE (DIstortion Loss including shApe and TimE), a new objective function for training deep neural networks. DILATE aims at accurately predicting sudden changes, and explicitly incorporates two terms supporting precise shape and temporal change detection. We introduce a differentiable loss function suitable for training deep neural nets, and provide a custom back-prop implementation for speeding up optimization. We also introduce a variant of DILATE, which provides a smooth generalization of temporally-constrained Dynamic Time Warping (DTW). Experiments carried out on various non-stationary datasets reveal the very good behaviour of DILATE compared to models trained with the standard Mean Squared Error (MSE) loss function, and also to DTW and variants. DILATE is also agnostic to the choice of the model, and we highlight its benefit for training fully connected networks as well as specialized recurrent architectures, showing its capacity to improve over state-of-the-art trajectory forecasting approaches.

Submitted 10 November, 2019; v1 submitted 19 September, 2019; originally announced September 2019.

arXiv:1909.05397 [pdf, other]

Multitask Classification and Segmentation for Cancer Diagnosis in Mammography

Authors: Thi-Lam-Thuy Le, Nicolas Thome, Sylvain Bernard, Vincent Bismuth, Fanny Patoureaux

Abstract: Annotation cost is a bottleneck for collecting massive data in mammography, especially for training deep neural networks. In this paper, we study the use of heterogeneous levels of annotation granularity to improve predictive performances. More precisely, we introduce a multi-task learning scheme for training convolutional neural networks (ConvNets), which combines segmentation and classification, using image-level and pixel-level annotations. In this way, different objectives can be used to regularize training by sharing intermediate deep representations. Successful experiments are carried out on the Digital Database of Screening Mammography (DDSM) to validate the relevance of the proposed approach.

Submitted 11 September, 2019; originally announced September 2019.

Comments: International Conference on Medical Imaging with Deep Learning 2019. MIDL 2019 [arXiv:1907.08612]

Report number: MIDL/2019/ExtendedAbstract/r1xDM5DGcV

arXiv:1906.00804 [pdf, other]

DualDis: Dual-Branch Disentangling with Adversarial Learning

Authors: Thomas Robert, Nicolas Thome, Matthieu Cord

Abstract: In computer vision, disentangling techniques aim at improving latent representations of images by modeling factors of variation. In this paper, we propose DualDis, a new auto-encoder-based framework that disentangles and linearizes class and attribute information. This is achieved thanks to a two-branch architecture forcing the separation of the two kinds of information, accompanied by a decoder for image reconstruction and generation. To effectively separate the information, we propose to use a combination of regular and adversarial classifiers to guide the two branches in specializing for class and attribute information respectively. We also investigate the possibility of using semi-supervised learning for an effective disentangling even using few labels. We leverage the linearization property of the latent spaces for semantic image editing and generation of new images. We validate our approach on CelebA, Yale-B and NORB by measuring the efficiency of information separation via classification metrics, visual image manipulation and data augmentation.

Submitted 3 June, 2019; originally announced June 2019.

arXiv:1902.09487 [pdf, other]

MUREL: Multimodal Relational Reasoning for Visual Question Answering

Authors: Remi Cadene, Hedi Ben-younes, Matthieu Cord, Nicolas Thome

Abstract: Multimodal attentional networks are currently state-of-the-art models for Visual Question Answering (VQA) tasks involving real images. Although attention allows the model to focus on the visual content relevant to the question, this simple mechanism is arguably insufficient to model complex reasoning features required for VQA or other high-level tasks. In this paper, we propose MuRel, a multimodal relational network which is learned end-to-end to reason over real images. Our first contribution is the introduction of the MuRel cell, an atomic reasoning primitive representing interactions between question and image regions by a rich vectorial representation, and modeling region relations with pairwise combinations. Secondly, we incorporate the cell into a full MuRel network, which progressively refines visual and question interactions, and can be leveraged to define visualization schemes finer than mere attention maps. We validate the relevance of our approach with various ablation studies, and show its superiority to attention-based methods on three datasets: VQA 2.0, VQA-CP v2 and TDIUC. Our final MuRel network is competitive with or outperforms state-of-the-art results in this challenging context. Our code is available: https://github.com/Cadene/murel.bootstrap.pytorch

Submitted 25 February, 2019; originally announced February 2019.

Comments: CVPR2019 accepted paper

arXiv:1902.00038 [pdf, other]

BLOCK: Bilinear Superdiagonal Fusion for Visual Question Answering and Visual Relationship Detection

Authors: Hedi Ben-younes, Rémi Cadene, Nicolas Thome, Matthieu Cord

Abstract: Multimodal representation learning is gaining more and more interest within the deep learning community. While bilinear models provide an interesting framework to find subtle combinations of modalities, their number of parameters grows quadratically with the input dimensions, making their practical implementation within classical deep learning pipelines challenging. In this paper, we introduce BLOCK, a new multimodal fusion based on the block-superdiagonal tensor decomposition. It leverages the notion of block-term ranks, which generalizes both concepts of rank and mode ranks for tensors, already used for multimodal fusion. It allows defining new ways of optimizing the tradeoff between the expressiveness and complexity of the fusion model, and is able to represent very fine interactions between modalities while maintaining powerful mono-modal representations. We demonstrate the practical interest of our fusion model by using BLOCK for two challenging tasks: Visual Question Answering (VQA) and Visual Relationship Detection (VRD), where we design end-to-end learnable architectures for representing relevant interactions between modalities. Through extensive experiments, we show that BLOCK compares favorably with respect to state-of-the-art multimodal fusion models for both VQA and VRD tasks. Our code is available at https://github.com/Cadene/block.bootstrap.pytorch.

Submitted 12 February, 2019; v1 submitted 31 January, 2019; originally announced February 2019.

arXiv:1807.11407 [pdf, other]

HybridNet: Classification and Reconstruction Cooperation for Semi-Supervised Learning

Authors: Thomas Robert, Nicolas Thome, Matthieu Cord

Abstract: In this paper, we introduce a new model for leveraging unlabeled data to improve generalization performances of image classifiers: a two-branch encoder-decoder architecture called HybridNet. The first branch receives supervision signal and is dedicated to the extraction of invariant class-related representations. The second branch is fully unsupervised and dedicated to model information discarded by the first branch to reconstruct input data. To further support the expected behavior of our model, we propose an original training objective. It favors stability in the discriminative branch and complementarity between the learned representations in the two branches. HybridNet is able to outperform state-of-the-art results on CIFAR-10, SVHN and STL-10 in various semi-supervised settings. In addition, visualizations and ablation studies validate our contributions and the behavior of the model on both CIFAR-10 and STL-10 datasets.

Submitted 30 July, 2018; originally announced July 2018.

Comments: Accepted at ECCV 2018

arXiv:1805.05814 [pdf, other]

SHADE: Information-Based Regularization for Deep Learning

Authors: Michael Blot, Thomas Robert, Nicolas Thome, Matthieu Cord

Abstract: Regularization is a big issue for training deep neural networks. In this paper, we propose a new information-theory-based regularization scheme named SHADE for SHAnnon DEcay. The originality of the approach is to define a prior based on conditional entropy, which explicitly decouples the learning of invariant representations in the regularizer and the learning of correlations between inputs and labels in the data fitting term. Our second contribution is to derive a stochastic version of the regularizer compatible with deep learning, resulting in a tractable training scheme. We empirically validate the efficiency of our approach to improve classification performances compared to standard regularization schemes on several standard architectures.

Submitted 14 May, 2018; originally announced May 2018.

Comments: IEEE International Conference on Image Processing (ICIP) 2018. arXiv admin note: substantial text overlap with arXiv:1804.10988

arXiv:1804.11146 [pdf, other]

Cross-Modal Retrieval in the Cooking Context: Learning Semantic Text-Image Embeddings

Authors: Micael Carvalho, Rémi Cadène, David Picard, Laure Soulier, Nicolas Thome, Matthieu Cord

Abstract: Designing powerful tools that support cooking activities has rapidly gained popularity due to the massive amounts of available data, as well as recent advances in machine learning that are capable of analyzing them. In this paper, we propose a cross-modal retrieval model aligning visual and textual data (like pictures of dishes and their recipes) in a shared representation space. We describe an effective learning scheme, capable of tackling large-scale problems, and validate it on the Recipe1M dataset containing nearly 1 million picture-recipe pairs. We show the effectiveness of our approach regarding previous state-of-the-art models and present qualitative results over computational cooking use cases.

Submitted 30 April, 2018; originally announced April 2018.

Comments: accepted at the 41st International ACM SIGIR Conference on Research and Development in Information Retrieval, 2018

arXiv:1804.10988 [pdf, other]

SHADE: Information Based Regularization for Deep Learning

Authors: Michael Blot, Thomas Robert, Nicolas Thome, Matthieu Cord

Abstract: Regularization is a big issue for training deep neural networks. In this paper, we propose a new information-theory-based regularization scheme named SHADE for SHAnnon DEcay. The originality of the approach is to define a prior based on conditional entropy, which explicitly decouples the learning of invariant representations in the regularizer and the learning of correlations between inputs and labels in the data fitting term. Our second contribution is to derive a stochastic version of the regularizer compatible with deep learning, resulting in a tractable training scheme. We empirically validate the efficiency of our approach to improve classification performances compared to common regularization schemes on several standard architectures.

Submitted 22 May, 2018; v1 submitted 29 April, 2018; originally announced April 2018.

arXiv:1707.06175 [pdf, other]

Deformable Part-based Fully Convolutional Network for Object Detection

Authors: Taylor Mordan, Nicolas Thome, Matthieu Cord, Gilles Henaff

Abstract: Existing region-based object detectors are limited to regions with fixed box geometry to represent objects, even if those are highly non-rectangular. In this paper we introduce DP-FCN, a deep model for object detection which explicitly adapts to shapes of objects with deformable parts. Without additional annotations, it learns to focus on discriminative elements and to align them, and simultaneously brings more invariance for classification and geometric information to refine localization. DP-FCN is composed of three main modules: a Fully Convolutional Network to efficiently maintain spatial resolution, a deformable part-based RoI pooling layer to optimize positions of parts and build invariance, and a deformation-aware localization module explicitly exploiting displacements of parts to improve accuracy of bounding box regression. We experimentally validate our model and show significant gains. DP-FCN achieves state-of-the-art performances of 83.1% and 80.9% on PASCAL VOC 2007 and 2012 with VOC data only.

Submitted 19 July, 2017; originally announced July 2017.

Comments: Accepted to BMVC 2017 (oral)

arXiv:1705.06676 [pdf, other]

MUTAN: Multimodal Tucker Fusion for Visual Question Answering

Authors: Hedi Ben-younes, Rémi Cadene, Matthieu Cord, Nicolas Thome

Abstract: Bilinear models provide an appealing framework for mixing and merging information in Visual Question Answering (VQA) tasks. They help to learn high level associations between question meaning and visual concepts in the image, but they suffer from huge dimensionality issues. We introduce MUTAN, a multimodal tensor-based Tucker decomposition to efficiently parametrize bilinear interactions between visual and textual representations. Additionally to the Tucker framework, we design a low-rank matrix-based decomposition to explicitly constrain the interaction rank. With MUTAN, we control the complexity of the merging scheme while keeping nice interpretable fusion relations. We show how our MUTAN model generalizes some of the latest VQA architectures, providing state-of-the-art results.

Submitted 18 May, 2017; originally announced May 2017.

arXiv:1611.09726 [pdf, other]

Gossip training for deep learning

Authors: Michael Blot, David Picard, Matthieu Cord, Nicolas Thome

Abstract: We address the issue of speeding up the training of convolutional networks. Here we study a distributed method adapted to stochastic gradient descent (SGD). The parallel optimization setup uses several threads, each applying individual gradient descents on a local variable. We propose a new way to share information between different threads, inspired by gossip algorithms and showing good consensus convergence properties. Our method, called GoSGD, has the advantage of being fully asynchronous and decentralized. Experiments comparing our method with the recent EASGD \cite{elastic} on CIFAR-10 show encouraging results.

Submitted 29 November, 2016; originally announced November 2016.

arXiv:1610.07882 [pdf, other]

Maxmin convolutional neural networks for image classification

Authors: Michael Blot, Matthieu Cord, Nicolas Thome

Abstract: Convolutional neural networks (CNN) are widely used in computer vision, especially in image classification. However, the way in which information and invariance properties are encoded in deep CNN architectures is still an open question. In this paper, we propose to modify the standard convolutional block of CNN in order to transfer more information layer after layer while keeping some invariance within the network. Our main idea is to exploit both positive and negative high scores obtained in the convolution maps. This behavior is obtained by modifying the traditional activation function step before pooling. We double the maps with specific activation functions, called the MaxMin strategy, in order to achieve our pipeline. Extensive experiments on two classical datasets, MNIST and CIFAR-10, show that our deep MaxMin convolutional net outperforms standard CNN.

Submitted 25 October, 2016; originally announced October 2016.

arXiv:1610.05567 [pdf, other]

Master's Thesis : Deep Learning for Visual Recognition

Authors: Rémi Cadène, Nicolas Thome, Matthieu Cord

Abstract: The goal of our research is to develop methods advancing automatic visual recognition. In order to predict the unique or multiple labels associated with an image, we study different kinds of Deep Neural Network architectures and methods for supervised feature learning. We first draw up a state-of-the-art review of Convolutional Neural Networks, aiming to understand the history behind this family of statistical models, the limits of modern architectures and the novel techniques currently used to train deep CNNs. The originality of our work lies in our approach focusing on tasks with a low amount of data. We introduce different models and techniques to achieve the best accuracy on several kinds of datasets, such as a medium dataset of food recipes (100k images) for building a web API, or a small dataset of satellite images (6,000) for the DSG online challenge that we've won. We also draw up the state-of-the-art in Weakly Supervised Learning, introducing different kinds of CNNs able to localize regions of interest. Our last contribution is a framework, built on top of Torch7, for training and testing deep models on any visual recognition task and on datasets of any scale.

Submitted 18 October, 2016; originally announced October 2016.

arXiv:1610.05541 [pdf, other]

M2CAI Workflow Challenge: Convolutional Neural Networks with Time Smoothing and Hidden Markov Model for Video Frames Classification

Authors: Rémi Cadène, Thomas Robert, Nicolas Thome, Matthieu Cord

Abstract: Our approach is among the three best to tackle the M2CAI Workflow challenge. The latter consists in recognizing the operation phase for each frame of endoscopic videos. In this technical report, we compare several classification models and temporal smoothing methods. Our submitted solution is a fine-tuned Residual Network-200 on 80% of the training set with temporal smoothing using simple temporal averaging of the predictions and a Hidden Markov Model modeling the sequence.

Submitted 2 December, 2016; v1 submitted 18 October, 2016; originally announced October 2016.

arXiv:1605.03498 [pdf, other]

doi 10.1109/ICIP.2016.7533200

Deep Neural Networks Under Stress

Authors: Micael Carvalho, Matthieu Cord, Sandra Avila, Nicolas Thome, Eduardo Valle

Abstract: In recent years, deep architectures have been used for transfer learning with state-of-the-art performance in many datasets. The properties of their features remain, however, largely unstudied under the transfer perspective. In this work, we present an extensive analysis of the resiliency of feature vectors extracted from deep models, with special focus on the trade-off between performance and compression rate. By introducing perturbations to image descriptions extracted from a deep convolutional neural network, we change their precision and number of dimensions, measuring how it affects the final score. We show that deep features are more robust to these disturbances when compared to classical approaches, achieving a compression rate of 98.4%, while losing only 0.88% of their original score for Pascal VOC 2007.

Submitted 23 May, 2016; v1 submitted 11 May, 2016; originally announced May 2016.

Comments: This article corresponds to the accepted version at IEEE ICIP 2016. We will link the DOI as soon as it is available

arXiv:1312.6594 [pdf, other]

Sequentially Generated Instance-Dependent Image Representations for Classification

Authors: Gabriel Dulac-Arnold, Ludovic Denoyer, Nicolas Thome, Matthieu Cord, Patrick Gallinari

Abstract: In this paper, we investigate a new framework for image classification that adaptively generates spatial representations. Our strategy is based on a sequential process that learns to explore the different regions of any image in order to infer its category. In particular, the choice of regions is specific to each image, directed by the actual content of previously selected regions. The capacity of the system to handle incomplete image information as well as its adaptive region selection allow the system to perform well in budgeted classification tasks by exploiting a dynamically generated representation of each image. We demonstrate the system's abilities in a series of image-based exploration and classification tasks that highlight its learned exploration and inference abilities.

Submitted 11 February, 2014; v1 submitted 20 December, 2013; originally announced December 2013.

Showing 1–49 of 49 results for author: Thome, N