Search | arXiv e-print repository

doi 10.1109/ICIP49359.2023.10222005

LATENTPATCH: A Non-Parametric Approach for Face Generation and Editing

Authors: Benjamin Samuth, Julien Rabin, David Tschumperlé, Frédéric Jurie

Abstract: This paper presents LatentPatch, a new method for generating realistic images from a small dataset of only a few images. We use a lightweight model with only a few thousand parameters. Unlike traditional few-shot generation methods that finetune pre-trained large-scale generative models, our approach is computed directly on the latent distribution by sequential feature matching, and is explainable… ▽ More This paper presents LatentPatch, a new method for generating realistic images from a small dataset of only a few images. We use a lightweight model with only a few thousand parameters. Unlike traditional few-shot generation methods that finetune pre-trained large-scale generative models, our approach is computed directly on the latent distribution by sequential feature matching, and is explainable by design. Avoiding large models based on transformers, recursive networks, or self-attention, which are not suitable for small datasets, our method is inspired by non-parametric texture synthesis and style transfer models, and ensures that generated image features are sampled from the source distribution. We extend previous single-image models to work with a few images and demonstrate that our method can generate realistic images, as well as enable conditional sampling and image editing. We conduct experiments on face datasets and show that our simplistic model is effective and versatile. △ Less

Submitted 30 January, 2024; originally announced January 2024.

Journal ref: 2023 IEEE International Conference on Image Processing (ICIP), Oct 2023, Kuala Lumpur, Malaysia. pp.1790-1794

arXiv:2309.16729 [pdf, other]

SimPINNs: Simulation-Driven Physics-Informed Neural Networks for Enhanced Performance in Nonlinear Inverse Problems

Authors: Sidney Besnard, Frédéric Jurie, Jalal M. Fadili

Abstract: This paper introduces a novel approach to solve inverse problems by leveraging deep learning techniques. The objective is to infer unknown parameters that govern a physical system based on observed data. We focus on scenarios where the underlying forward model demonstrates pronounced nonlinear behaviour, and where the dimensionality of the unknown parameter space is substantially smaller than that… ▽ More This paper introduces a novel approach to solve inverse problems by leveraging deep learning techniques. The objective is to infer unknown parameters that govern a physical system based on observed data. We focus on scenarios where the underlying forward model demonstrates pronounced nonlinear behaviour, and where the dimensionality of the unknown parameter space is substantially smaller than that of the observations. Our proposed method builds upon physics-informed neural networks (PINNs) trained with a hybrid loss function that combines observed data with simulated data generated by a known (approximate) physical model. Experimental results on an orbit restitution problem demonstrate that our approach surpasses the performance of standard PINNs, providing improved accuracy and robustness. △ Less

Submitted 27 September, 2023; originally announced September 2023.

arXiv:2309.07944 [pdf, other]

Text-to-Image Models for Counterfactual Explanations: a Black-Box Approach

Authors: Guillaume Jeanneret, Loïc Simon, Frédéric Jurie

Abstract: This paper addresses the challenge of generating Counterfactual Explanations (CEs), involving the identification and modification of the fewest necessary features to alter a classifier's prediction for a given image. Our proposed method, Text-to-Image Models for Counterfactual Explanations (TIME), is a black-box counterfactual technique based on distillation. Unlike previous methods, this approach… ▽ More This paper addresses the challenge of generating Counterfactual Explanations (CEs), involving the identification and modification of the fewest necessary features to alter a classifier's prediction for a given image. Our proposed method, Text-to-Image Models for Counterfactual Explanations (TIME), is a black-box counterfactual technique based on distillation. Unlike previous methods, this approach requires solely the image and its prediction, omitting the need for the classifier's structure, parameters, or gradients. Before generating the counterfactuals, TIME introduces two distinct biases into Stable Diffusion in the form of textual embeddings: the context bias, associated with the image's structure, and the class bias, linked to class-specific features learned by the target classifier. After learning these biases, we find the optimal latent code applying the classifier's predicted class token and regenerate the image using the target embedding as conditioning, producing the counterfactual explanation. Extensive empirical studies validate that TIME can generate explanations of comparable effectiveness even when operating within a black-box setting. △ Less

Submitted 15 November, 2023; v1 submitted 14 September, 2023; originally announced September 2023.

Comments: WACV 2024 Camera ready + supplementary material

arXiv:2303.12733 [pdf, other]

On the De-duplication of LAION-2B

Authors: Ryan Webster, Julien Rabin, Loic Simon, Frederic Jurie

Abstract: Generative models, such as DALL-E, Midjourney, and Stable Diffusion, have societal implications that extend beyond the field of computer science. These models require large image databases like LAION-2B, which contain two billion images. At this scale, manual inspection is difficult and automated analysis is challenging. In addition, recent studies show that duplicated images pose copyright proble… ▽ More Generative models, such as DALL-E, Midjourney, and Stable Diffusion, have societal implications that extend beyond the field of computer science. These models require large image databases like LAION-2B, which contain two billion images. At this scale, manual inspection is difficult and automated analysis is challenging. In addition, recent studies show that duplicated images pose copyright problems for models trained on LAION2B, which hinders its usability. This paper proposes an algorithmic chain that runs with modest compute, that compresses CLIP features to enable efficient duplicate detection, even for vast image volumes. Our approach demonstrates that roughly 700 million images, or about 30\%, of LAION-2B's images are likely duplicated. Our method also provides the histograms of duplication on this dataset, which we use to reveal more examples of verbatim copies by Stable Diffusion and further justify the approach. The current version of the de-duplicated set will be distributed online. △ Less

Submitted 17 March, 2023; originally announced March 2023.

arXiv:2303.09962 [pdf, other]

Adversarial Counterfactual Visual Explanations

Authors: Guillaume Jeanneret, Loïc Simon, Frédéric Jurie

Abstract: Counterfactual explanations and adversarial attacks have a related goal: flip** output labels with minimal perturbations regardless of their characteristics. Yet, adversarial attacks cannot be used directly in a counterfactual explanation perspective, as such perturbations are perceived as noise and not as actionable and understandable image modifications. Building on the robust learning literat… ▽ More Counterfactual explanations and adversarial attacks have a related goal: flip** output labels with minimal perturbations regardless of their characteristics. Yet, adversarial attacks cannot be used directly in a counterfactual explanation perspective, as such perturbations are perceived as noise and not as actionable and understandable image modifications. Building on the robust learning literature, this paper proposes an elegant method to turn adversarial attacks into semantically meaningful perturbations, without modifying the classifiers to explain. The proposed approach hypothesizes that Denoising Diffusion Probabilistic Models are excellent regularizers for avoiding high-frequency and out-of-distribution perturbations when generating adversarial attacks. The paper's key idea is to build attacks through a diffusion model to polish them. This allows studying the target model regardless of its robustification level. Extensive experimentation shows the advantages of our counterfactual explanation approach over current State-of-the-Art in multiple testbeds. △ Less

Submitted 17 March, 2023; originally announced March 2023.

Comments: CVPR 2023 camera-ready; Main manuscript + supplementary material

arXiv:2209.15438 [pdf, other]

Empowering the trustworthiness of ML-based critical systems through engineering activities

Authors: Juliette Mattioli, Agnes Delaborde, Souhaiel Khalfaoui, Freddy Lecue, Henri Sohier, Frederic Jurie

Abstract: This paper reviews the entire engineering process of trustworthy Machine Learning (ML) algorithms designed to equip critical systems with advanced analytics and decision functions. We start from the fundamental principles of ML and describe the core elements conditioning its trust, particularly through its design: namely domain specification, data engineering, design of the ML algorithms, their im… ▽ More This paper reviews the entire engineering process of trustworthy Machine Learning (ML) algorithms designed to equip critical systems with advanced analytics and decision functions. We start from the fundamental principles of ML and describe the core elements conditioning its trust, particularly through its design: namely domain specification, data engineering, design of the ML algorithms, their implementation, evaluation and deployment. The latter components are organized in an unique framework for the design of trusted ML systems. △ Less

Submitted 30 September, 2022; originally announced September 2022.

Comments: This work has been supported by the French government under the "France 2030" program, as part of the SystemX Technological Research Institute Research Institute

arXiv:2203.15636 [pdf, other]

Diffusion Models for Counterfactual Explanations

Authors: Guillaume Jeanneret, Loïc Simon, Frédéric Jurie

Abstract: Counterfactual explanations have shown promising results as a post-hoc framework to make image classifiers more explainable. In this paper, we propose DiME, a method allowing the generation of counterfactual images using the recent diffusion models. By leveraging the guided generative diffusion process, our proposed methodology shows how to use the gradients of the target classifier to generate co… ▽ More Counterfactual explanations have shown promising results as a post-hoc framework to make image classifiers more explainable. In this paper, we propose DiME, a method allowing the generation of counterfactual images using the recent diffusion models. By leveraging the guided generative diffusion process, our proposed methodology shows how to use the gradients of the target classifier to generate counterfactual explanations of input instances. Further, we analyze current approaches to evaluate spurious correlations and extend the evaluation measurements by proposing a new metric: Correlation Difference. Our experimental validations show that the proposed algorithm surpasses previous State-of-the-Art results on 5 out of 6 metrics on CelebA. △ Less

Submitted 29 March, 2022; originally announced March 2022.

arXiv:2109.07920 [pdf, other]

On the inductive biases of deep domain adaptation

Authors: Rodrigue Siry, Louis Hémadou, Loïc Simon, Frédéric Jurie

Abstract: Domain alignment is currently the most prevalent solution to unsupervised domain-adaptation tasks and are often being presented as minimizers of some theoretical upper-bounds on risk in the target domain. However, further works revealed severe inadequacies between theory and practice: we consolidate this analysis and confirm that imposing domain invariance on features is neither necessary nor suff… ▽ More Domain alignment is currently the most prevalent solution to unsupervised domain-adaptation tasks and are often being presented as minimizers of some theoretical upper-bounds on risk in the target domain. However, further works revealed severe inadequacies between theory and practice: we consolidate this analysis and confirm that imposing domain invariance on features is neither necessary nor sufficient to obtain low target risk. We instead argue that successful deep domain adaptation rely largely on hidden inductive biases found in the common practice, such as model pre-training or design of encoder architecture. We perform various ablation experiments on popular benchmarks and our own synthetic transfers to illustrate their role in prototypical situations. To conclude our analysis, we propose to meta-learn parametric inductive biases to solve specific transfers and show their superior performance over handcrafted heuristics. △ Less

Submitted 16 September, 2021; originally announced September 2021.

Comments: 10 pages, 8 Figures

arXiv:2108.00800 [pdf, other]

Training face verification models from generated face identity data

Authors: Dennis Conway, Loic Simon, Alexis Lechervy, Frederic Jurie

Abstract: Machine learning tools are becoming increasingly powerful and widely used. Unfortunately membership attacks, which seek to uncover information from data sets used in machine learning, have the potential to limit data sharing. In this paper we consider an approach to increase the privacy protection of data sets, as applied to face recognition. Using an auxiliary face recognition model, we build on… ▽ More Machine learning tools are becoming increasingly powerful and widely used. Unfortunately membership attacks, which seek to uncover information from data sets used in machine learning, have the potential to limit data sharing. In this paper we consider an approach to increase the privacy protection of data sets, as applied to face recognition. Using an auxiliary face recognition model, we build on the StyleGAN generative adversarial network and feed it with latent codes combining two distinct sub-codes, one encoding visual identity factors, and, the other, non-identity factors. By independently varying these vectors during image generation, we create a synthetic data set of fictitious face identities. We use this data set to train a face recognition model. The model performance degrades in comparison to the state-of-the-art of face verification. When tested with a simple membership attack our model provides good privacy protection, however the model performance degrades in comparison to the state-of-the-art of face verification. We find that the addition of a small amount of private data greatly improves the performance of our model, which highlights the limitations of using synthetic data to train machine learning models. △ Less

Submitted 2 August, 2021; originally announced August 2021.

arXiv:2107.06018 [pdf, other]

This Person (Probably) Exists. Identity Membership Attacks Against GAN Generated Faces

Authors: Ryan Webster, Julien Rabin, Loic Simon, Frederic Jurie

Abstract: Recently, generative adversarial networks (GANs) have achieved stunning realism, fooling even human observers. Indeed, the popular tongue-in-cheek website {\small \url{ http://thispersondoesnotexist.com}}, taunts users with GAN generated images that seem too real to believe. On the other hand, GANs do leak information about their training data, as evidenced by membership attacks recently demonstra… ▽ More Recently, generative adversarial networks (GANs) have achieved stunning realism, fooling even human observers. Indeed, the popular tongue-in-cheek website {\small \url{ http://thispersondoesnotexist.com}}, taunts users with GAN generated images that seem too real to believe. On the other hand, GANs do leak information about their training data, as evidenced by membership attacks recently demonstrated in the literature. In this work, we challenge the assumption that GAN faces really are novel creations, by constructing a successful membership attack of a new kind. Unlike previous works, our attack can accurately discern samples sharing the same identity as training samples without being the same samples. We demonstrate the interest of our attack across several popular face datasets and GAN training procedures. Notably, we show that even in the presence of significant dataset diversity, an over represented person can pose a privacy concern. △ Less

Submitted 13 July, 2021; originally announced July 2021.

arXiv:1911.03222 [pdf, other]

Towards a General Model of Knowledge for Facial Analysis by Multi-Source Transfer Learning

Authors: Valentin Vielzeuf, Alexis Lechervy, Stéphane Pateux, Frédéric Jurie

Abstract: This paper proposes a step toward obtaining general models of knowledge for facial analysis, by addressing the question of multi-source transfer learning. More precisely, the proposed approach consists in two successive training steps: the first one consists in applying a combination operator to define a common embedding for the multiple sources materialized by different existing trained models. T… ▽ More This paper proposes a step toward obtaining general models of knowledge for facial analysis, by addressing the question of multi-source transfer learning. More precisely, the proposed approach consists in two successive training steps: the first one consists in applying a combination operator to define a common embedding for the multiple sources materialized by different existing trained models. The proposed operator relies on an auto-encoder, trained on a large dataset, efficient both in terms of compression ratio and transfer learning performance. In a second step we exploit a distillation approach to obtain a lightweight student model mimicking the collection of the fused existing models. This model outperforms its teacher on novel tasks, achieving results on par with state-of-the-art methods on 15 facial analysis tasks (and domains), at an affordable training cost. Moreover, this student has 75 times less parameters than the original teacher and can be applied to a variety of novel face-related tasks. △ Less

Submitted 8 November, 2019; originally announced November 2019.

arXiv:1908.07253 [pdf, other]

n-MeRCI: A new Metric to Evaluate the Correlation Between Predictive Uncertainty and True Error

Authors: Michel Moukari, Loïc Simon, Sylvaine Picard, Frédéric Jurie

Abstract: As deep learning applications are becoming more and more pervasive in robotics, the question of evaluating the reliability of inferences becomes a central question in the robotics community. This domain, known as predictive uncertainty, has come under the scrutiny of research groups develo** Bayesian approaches adapted to deep learning such as Monte Carlo Dropout. Unfortunately, for the time bei… ▽ More As deep learning applications are becoming more and more pervasive in robotics, the question of evaluating the reliability of inferences becomes a central question in the robotics community. This domain, known as predictive uncertainty, has come under the scrutiny of research groups develo** Bayesian approaches adapted to deep learning such as Monte Carlo Dropout. Unfortunately, for the time being, the real goal of predictive uncertainty has been swept under the rug. Indeed, these approaches are solely evaluated in terms of raw performance of the network prediction, while the quality of their estimated uncertainty is not assessed. Evaluating such uncertainty prediction quality is especially important in robotics, as actions shall depend on the confidence in perceived information. In this context, the main contribution of this article is to propose a novel metric that is adapted to the evaluation of relative uncertainty assessment and directly applicable to regression with deep neural networks. To experimentally validate this metric, we evaluate it on a toy dataset and then apply it to the task of monocular depth estimation. △ Less

Submitted 20 August, 2019; originally announced August 2019.

Journal ref: IEEE/RJS International Conference on Intelligent Robots and Systems (IROS), In press

arXiv:1904.08494 [pdf, other]

Learning 2D to 3D Lifting for Object Detection in 3D for Autonomous Vehicles

Authors: Siddharth Srivastava, Frederic Jurie, Gaurav Sharma

Abstract: We address the problem of 3D object detection from 2D monocular images in autonomous driving scenarios. We propose to lift the 2D images to 3D representations using learned neural networks and leverage existing networks working directly on 3D data to perform 3D object detection and localization. We show that, with carefully designed training mechanism and automatically selected minimally noisy dat… ▽ More We address the problem of 3D object detection from 2D monocular images in autonomous driving scenarios. We propose to lift the 2D images to 3D representations using learned neural networks and leverage existing networks working directly on 3D data to perform 3D object detection and localization. We show that, with carefully designed training mechanism and automatically selected minimally noisy data, such a method is not only feasible, but gives higher results than many methods working on actual 3D inputs acquired from physical sensors. On the challenging KITTI benchmark, we show that our 2D to 3D lifted method outperforms many recent competitive 3D networks while significantly outperforming previous state-of-the-art for 3D detection from monocular images. We also show that a late fusion of the output of the network trained on generated 3D images, with that trained on real 3D images, improves performance. We find the results very interesting and argue that such a method could serve as a highly reliable backup in case of malfunction of expensive 3D sensors, if not potentially making them redundant, at least in the case of low human injury risk autonomous navigation scenarios like warehouse automation. △ Less

Submitted 11 October, 2019; v1 submitted 27 March, 2019; originally announced April 2019.

arXiv:1903.06496 [pdf, other]

MFAS: Multimodal Fusion Architecture Search

Authors: Juan-Manuel Pérez-Rúa, Valentin Vielzeuf, Stéphane Pateux, Moez Baccouche, Frédéric Jurie

Abstract: We tackle the problem of finding good architectures for multimodal classification problems. We propose a novel and generic search space that spans a large number of possible fusion architectures. In order to find an optimal architecture for a given dataset in the proposed search space, we leverage an efficient sequential model-based exploration approach that is tailored for the problem. We demonst… ▽ More We tackle the problem of finding good architectures for multimodal classification problems. We propose a novel and generic search space that spans a large number of possible fusion architectures. In order to find an optimal architecture for a given dataset in the proposed search space, we leverage an efficient sequential model-based exploration approach that is tailored for the problem. We demonstrate the value of posing multimodal fusion as a neural architecture search problem by extensive experimentation on a toy dataset and two other real multimodal datasets. We discover fusion architectures that exhibit state-of-the-art performance for problems with different domain and dataset size, including the NTU RGB+D dataset, the largest multi-modal action recognition dataset available. △ Less

Submitted 15 March, 2019; originally announced March 2019.

Comments: CVPR 2019, Jun 2019, Long Beach, United States http://cvpr2019.thecvf.com/

arXiv:1901.03396 [pdf, other]

Detecting Overfitting of Deep Generative Networks via Latent Recovery

Authors: Ryan Webster, Julien Rabin, Loic Simon, Frederic Jurie

Abstract: State of the art deep generative networks are capable of producing images with such incredible realism that they can be suspected of memorizing training images. It is why it is not uncommon to include visualizations of training set nearest neighbors, to suggest generated images are not simply memorized. We demonstrate this is not sufficient and motivates the need to study memorization/overfitting… ▽ More State of the art deep generative networks are capable of producing images with such incredible realism that they can be suspected of memorizing training images. It is why it is not uncommon to include visualizations of training set nearest neighbors, to suggest generated images are not simply memorized. We demonstrate this is not sufficient and motivates the need to study memorization/overfitting of deep generators with more scrutiny. This paper addresses this question by i) showing how simple losses are highly effective at reconstructing images for deep generators ii) analyzing the statistics of reconstruction errors when reconstructing training and validation images, which is the standard way to analyze overfitting in machine learning. Using this methodology, this paper shows that overfitting is not detectable in the pure GAN models proposed in the literature, in contrast with those using hybrid adversarial losses, which are amongst the most widely applied generative methods. The paper also shows that standard GAN evaluation metrics fail to capture memorization for some deep generators. Finally, the paper also shows how off-the-shelf GAN generators can be successfully applied to face inpainting and face super-resolution using the proposed reconstruction method, without hybrid adversarial losses. △ Less

Submitted 9 January, 2019; originally announced January 2019.

arXiv:1811.02447 [pdf, other]

Multi-Level Sensor Fusion with Deep Learning

Authors: Valentin Vielzeuf, Alexis Lechervy, Stéphane Pateux, Frédéric Jurie

Abstract: In the context of deep learning, this article presents an original deep network, namely CentralNet, for the fusion of information coming from different sensors. This approach is designed to efficiently and automatically balance the trade-off between early and late fusion (i.e. between the fusion of low-level vs high-level information). More specifically, at each level of abstraction-the different… ▽ More In the context of deep learning, this article presents an original deep network, namely CentralNet, for the fusion of information coming from different sensors. This approach is designed to efficiently and automatically balance the trade-off between early and late fusion (i.e. between the fusion of low-level vs high-level information). More specifically, at each level of abstraction-the different levels of deep networks-uni-modal representations of the data are fed to a central neural network which combines them into a common embedding. In addition, a multi-objective regularization is also introduced, hel** to both optimize the central network and the unimodal networks. Experiments on four multimodal datasets not only show state-of-the-art performance, but also demonstrate that CentralNet can actually choose the best possible fusion strategy for a given problem. △ Less

Submitted 5 November, 2018; originally announced November 2018.

Comments: arXiv admin note: text overlap with arXiv:1808.07275

arXiv:1811.02234 [pdf, other]

Semantic bottleneck for computer vision tasks

Authors: Maxime Bucher, Stéphane Herbin, Frédéric Jurie

Abstract: This paper introduces a novel method for the representation of images that is semantic by nature, addressing the question of computation intelligibility in computer vision tasks. More specifically, our proposition is to introduce what we call a semantic bottleneck in the processing pipeline, which is a crossing point in which the representation of the image is entirely expressed with natural langu… ▽ More This paper introduces a novel method for the representation of images that is semantic by nature, addressing the question of computation intelligibility in computer vision tasks. More specifically, our proposition is to introduce what we call a semantic bottleneck in the processing pipeline, which is a crossing point in which the representation of the image is entirely expressed with natural language , while retaining the efficiency of numerical representations. We show that our approach is able to generate semantic representations that give state-of-the-art results on semantic content-based image retrieval and also perform very well on image classification tasks. Intelligibility is evaluated through user centered experiments for failure detection. △ Less

Submitted 6 November, 2018; originally announced November 2018.

Journal ref: Asian Conference on Computer Vision (ACCV), Dec 2018, Perth, Australia

arXiv:1810.13197 [pdf, other]

The Many Moods of Emotion

Authors: Valentin Vielzeuf, Corentin Kervadec, Stéphane Pateux, Frédéric Jurie

Abstract: This paper presents a novel approach to the facial expression generation problem. Building upon the assumption of the psychological community that emotion is intrinsically continuous, we first design our own continuous emotion representation with a 3-dimensional latent space issued from a neural network trained on discrete emotion classification. The so-obtained representation can be used to annot… ▽ More This paper presents a novel approach to the facial expression generation problem. Building upon the assumption of the psychological community that emotion is intrinsically continuous, we first design our own continuous emotion representation with a 3-dimensional latent space issued from a neural network trained on discrete emotion classification. The so-obtained representation can be used to annotate large in the wild datasets and later used to trained a Generative Adversarial Network. We first show that our model is able to map back to discrete emotion classes with a objectively and subjectively better quality of the images than usual discrete approaches. But also that we are able to pave the larger space of possible facial expressions, generating the many moods of emotion. Moreover, two axis in this space may be found to generate similar expression changes as in traditional continuous representations such as arousal-valence. Finally we show from visual interpretation, that the third remaining dimension is highly related to the well-known dominance dimension from psychology. △ Less

Submitted 31 October, 2018; originally announced October 2018.

arXiv:1809.08402 [pdf, other]

RPNet: an End-to-End Network for Relative Camera Pose Estimation

Authors: Sovann En, Alexis Lechervy, Frédéric Jurie

Abstract: This paper addresses the task of relative camera pose estimation from raw image pixels, by means of deep neural networks. The proposed RPNet network takes pairs of images as input and directly infers the relative poses, without the need of camera intrinsic/extrinsic. While state-of-the-art systems based on SIFT + RANSAC, are able to recover the translation vector only up to scale, RPNet is trained… ▽ More This paper addresses the task of relative camera pose estimation from raw image pixels, by means of deep neural networks. The proposed RPNet network takes pairs of images as input and directly infers the relative poses, without the need of camera intrinsic/extrinsic. While state-of-the-art systems based on SIFT + RANSAC, are able to recover the translation vector only up to scale, RPNet is trained to produce the full translation vector, in an end-to-end way. Experimental results on the Cambridge Landmark dataset show very promising results regarding the recovery of the full translation vector. They also show that RPNet produces more accurate and more stable results than traditional approaches, especially for hard images (repetitive textures, textureless images, etc). To the best of our knowledge, RPNet is the first attempt to recover full translation vectors in relative pose estimation. △ Less

Submitted 22 September, 2018; originally announced September 2018.

arXiv:1809.07628 [pdf, other]

Faster RER-CNN: application to the detection of vehicles in aerial images

Authors: Jean Ogier du Terrail, Frédéric Jurie

Abstract: Detecting small vehicles in aerial images is a difficult job that can be challenging even for humans. Rotating objects, low resolution, small inter-class variability and very large images comprising complicated backgrounds render the work of photo-interpreters tedious and wearisome. Unfortunately even the best classical detection pipelines like Faster R-CNN cannot be used off-the-shelf with good r… ▽ More Detecting small vehicles in aerial images is a difficult job that can be challenging even for humans. Rotating objects, low resolution, small inter-class variability and very large images comprising complicated backgrounds render the work of photo-interpreters tedious and wearisome. Unfortunately even the best classical detection pipelines like Faster R-CNN cannot be used off-the-shelf with good results because they were built to process object centric images from day-to-day life with multi-scale vertical objects. In this work we build on the Faster R-CNN approach to turn it into a detection framework that deals appropriately with the rotation equivariance inherent to any aerial image task. This new pipeline (Faster Rotation Equivariant Regions CNN) gives, without any bells and whistles, state-of-the-art results on one of the most challenging aerial imagery datasets: VeDAI and give good results w.r.t. the baseline Faster R-CNN on two others: Munich and GoogleEarth . △ Less

Submitted 16 September, 2020; v1 submitted 20 September, 2018; originally announced September 2018.

Comments: technical report v2 fixes formatting issues and add acknowledgments

arXiv:1809.03193 [pdf, other]

Recent Advances in Object Detection in the Age of Deep Convolutional Neural Networks

Authors: Shivang Agarwal, Jean Ogier Du Terrail, Frédéric Jurie

Abstract: Object detection-the computer vision task dealing with detecting instances of objects of a certain class (e.g., 'car', 'plane', etc.) in images-attracted a lot of attention from the community during the last 5 years. This strong interest can be explained not only by the importance this task has for many applications but also by the phenomenal advances in this area since the arrival of deep convolu… ▽ More Object detection-the computer vision task dealing with detecting instances of objects of a certain class (e.g., 'car', 'plane', etc.) in images-attracted a lot of attention from the community during the last 5 years. This strong interest can be explained not only by the importance this task has for many applications but also by the phenomenal advances in this area since the arrival of deep convolutional neural networks (DCNN). This article reviews the recent literature on object detection with deep CNN, in a comprehensive way, and provides an in-depth view of these recent advances. The survey covers not only the typical architectures (SSD, YOLO, Faster-RCNN) but also discusses the challenges currently met by the community and goes on to show how the problem of object detection can be extended. This survey also reviews the public datasets and associated state-of-the-art algorithms. △ Less

Submitted 20 August, 2019; v1 submitted 10 September, 2018; originally announced September 2018.

arXiv:1808.07275 [pdf, other]

CentralNet: a Multilayer Approach for Multimodal Fusion

Authors: Valentin Vielzeuf, Alexis Lechervy, Stéphane Pateux, Frédéric Jurie

Abstract: This paper proposes a novel multimodal fusion approach, aiming to produce best possible decisions by integrating information coming from multiple media. While most of the past multimodal approaches either work by projecting the features of different modalities into the same space, or by coordinating the representations of each modality through the use of constraints, our approach borrows from bo… ▽ More This paper proposes a novel multimodal fusion approach, aiming to produce best possible decisions by integrating information coming from multiple media. While most of the past multimodal approaches either work by projecting the features of different modalities into the same space, or by coordinating the representations of each modality through the use of constraints, our approach borrows from both visions. More specifically, assuming each modality can be processed by a separated deep convolutional network, allowing to take decisions independently from each modality, we introduce a central network linking the modality specific networks. This central network not only provides a common feature embedding but also regularizes the modality specific networks through the use of multi-task learning. The proposed approach is validated on 4 different computer vision tasks on which it consistently improves the accuracy of existing multimodal fusion approaches. △ Less

Submitted 22 August, 2018; originally announced August 2018.

Journal ref: European Conference on Computer Vision Workshops: Multimodal Learning and Applications, Sep 2018, Munich, Germany. https://mula2018.github.io/

arXiv:1808.02668 [pdf, other]

An Occam's Razor View on Learning Audiovisual Emotion Recognition with Small Training Sets

Authors: Valentin Vielzeuf, Corentin Kervadec, Stéphane Pateux, Alexis Lechervy, Frédéric Jurie

Abstract: This paper presents a light-weight and accurate deep neural model for audiovisual emotion recognition. To design this model, the authors followed a philosophy of simplicity, drastically limiting the number of parameters to learn from the target datasets, always choosing the simplest earning methods: i) transfer learning and low-dimensional space embedding allows to reduce the dimensionality of t… ▽ More This paper presents a light-weight and accurate deep neural model for audiovisual emotion recognition. To design this model, the authors followed a philosophy of simplicity, drastically limiting the number of parameters to learn from the target datasets, always choosing the simplest earning methods: i) transfer learning and low-dimensional space embedding allows to reduce the dimensionality of the representations. ii) The isual temporal information is handled by a simple score-per-frame selection process, averaged across time. iii) A simple frame selection echanism is also proposed to weight the images of a sequence. iv) The fusion of the different modalities is performed at prediction level (late usion). We also highlight the inherent challenges of the AFEW dataset and the difficulty of model selection with as few as 383 validation equences. The proposed real-time emotion classifier achieved a state-of-the-art accuracy of 60.64 % on the test set of AFEW, and ranked 4th at he Emotion in the Wild 2018 challenge. △ Less

Submitted 8 August, 2018; originally announced August 2018.

Journal ref: ICMI (EmotiW) 2018, Oct 2018, Boulder, Colorado, United States

arXiv:1807.11215 [pdf, other]

CAKE: Compact and Accurate K-dimensional representation of Emotion

Authors: Corentin Kervadec, Valentin Vielzeuf, Stéphane Pateux, Alexis Lechervy, Frédéric Jurie

Abstract: Numerous models describing the human emotional states have been built by the psychology community. Alongside, Deep Neural Networks (DNN) are reaching excellent performances and are becoming interesting features extraction tools in many computer vision tasks.Inspired by works from the psychology community, we first study the link between the compact two-dimensional representation of the emotion kno… ▽ More Numerous models describing the human emotional states have been built by the psychology community. Alongside, Deep Neural Networks (DNN) are reaching excellent performances and are becoming interesting features extraction tools in many computer vision tasks.Inspired by works from the psychology community, we first study the link between the compact two-dimensional representation of the emotion known as arousal-valence, and discrete emotion classes (e.g. anger, happiness, sadness, etc.) used in the computer vision community. It enables to assess the benefits -- in terms of discrete emotion inference -- of adding an extra dimension to arousal-valence (usually named dominance). Building on these observations, we propose CAKE, a 3-dimensional representation of emotion learned in a multi-domain fashion, achieving accurate emotion recognition on several public datasets. Moreover, we visualize how emotions boundaries are organized inside DNN representations and show that DNNs are implicitly learning arousal-valence-like descriptions of emotions. Finally, we use the CAKE representation to compare the quality of the annotations of different public datasets. △ Less

Submitted 3 August, 2018; v1 submitted 30 July, 2018; originally announced July 2018.

Journal ref: Image Analysis for Human Facial and Activity Recognition (BMVC Workshop), Sep 2018, Newcastle, United Kingdom. http://juz-dev.myweb.port.ac.uk/BMVCWorkshop/index.html

arXiv:1806.03051 [pdf, other]

Deep multi-scale architectures for monocular depth estimation

Authors: Michel Moukari, Sylvaine Picard, Loic Simon, Frédéric Jurie

Abstract: This paper aims at understanding the role of multi-scale information in the estimation of depth from monocular images. More precisely, the paper investigates four different deep CNN architectures, designed to explicitly make use of multi-scale features along the network, and compare them to a state-of-the-art single-scale approach. The paper also shows that involving multi-scale features in depth… ▽ More This paper aims at understanding the role of multi-scale information in the estimation of depth from monocular images. More precisely, the paper investigates four different deep CNN architectures, designed to explicitly make use of multi-scale features along the network, and compare them to a state-of-the-art single-scale approach. The paper also shows that involving multi-scale features in depth estimation not only improves the performance in terms of accuracy, but also gives qualitatively better depth maps. Experiments are done on the widely used NYU Depth dataset, on which the proposed method achieves state-of-the-art performance. △ Less

Submitted 8 June, 2018; originally announced June 2018.

arXiv:1806.01550 [pdf, other]

TS-Net: Combining modality specific and common features for multimodal patch matching

Authors: Sovann En, Alexis Lechervy, Frédéric Jurie

Abstract: Multimodal patch matching addresses the problem of finding the correspondences between image patches from two different modalities, e.g. RGB vs sketch or RGB vs near-infrared. The comparison of patches of different modalities can be done by discovering the information common to both modalities (Siamese like approaches) or the modality-specific information (Pseudo-Siamese like approaches). We obser… ▽ More Multimodal patch matching addresses the problem of finding the correspondences between image patches from two different modalities, e.g. RGB vs sketch or RGB vs near-infrared. The comparison of patches of different modalities can be done by discovering the information common to both modalities (Siamese like approaches) or the modality-specific information (Pseudo-Siamese like approaches). We observed that none of these two scenarios is optimal. This motivates us to propose a three-stream architecture, dubbed as TS-Net, combining the benefits of the two. In addition, we show that adding extra constraints in the intermediate layers of such networks further boosts the performance. Experimentations on three multimodal datasets show significant performance gains in comparison with Siamese and Pseudo-Siamese networks. △ Less

Submitted 5 June, 2018; originally announced June 2018.

arXiv:1709.07200 [pdf, other]

Temporal Multimodal Fusion for Video Emotion Classification in the Wild

Authors: Valentin Vielzeuf, Stéphane Pateux, Frédéric Jurie

Abstract: This paper addresses the question of emotion classification. The task consists in predicting emotion labels (taken among a set of possible labels) best describing the emotions contained in short video clips. Building on a standard framework -- lying in describing videos by audio and visual features used by a supervised classifier to infer the labels -- this paper investigates several novel directi… ▽ More This paper addresses the question of emotion classification. The task consists in predicting emotion labels (taken among a set of possible labels) best describing the emotions contained in short video clips. Building on a standard framework -- lying in describing videos by audio and visual features used by a supervised classifier to infer the labels -- this paper investigates several novel directions. First of all, improved face descriptors based on 2D and 3D Convo-lutional Neural Networks are proposed. Second, the paper explores several fusion methods, temporal and multimodal, including a novel hierarchical method combining features and scores. In addition, we carefully reviewed the different stages of the pipeline and designed a CNN architecture adapted to the task; this is important as the size of the training set is small compared to the difficulty of the problem, making generalization difficult. The so-obtained model ranked 4th at the 2017 Emotion in the Wild challenge with the accuracy of 58.8 %. △ Less

Submitted 21 September, 2017; originally announced September 2017.

Journal ref: ACM - ICMI 2017, Nov 2017, Glasgow, United Kingdom

arXiv:1708.06975 [pdf, other]

Generating Visual Representations for Zero-Shot Classification

Authors: Maxime Bucher, Stéphane Herbin, Frédéric Jurie

Abstract: This paper addresses the task of learning an image clas-sifier when some categories are defined by semantic descriptions only (e.g. visual attributes) while the others are defined by exemplar images as well. This task is often referred to as the Zero-Shot classification task (ZSC). Most of the previous methods rely on learning a common embedding space allowing to compare visual features of unknown… ▽ More This paper addresses the task of learning an image clas-sifier when some categories are defined by semantic descriptions only (e.g. visual attributes) while the others are defined by exemplar images as well. This task is often referred to as the Zero-Shot classification task (ZSC). Most of the previous methods rely on learning a common embedding space allowing to compare visual features of unknown categories with semantic descriptions. This paper argues that these approaches are limited as i) efficient discrimi-native classifiers can't be used ii) classification tasks with seen and unseen categories (Generalized Zero-Shot Classification or GZSC) can't be addressed efficiently. In contrast , this paper suggests to address ZSC and GZSC by i) learning a conditional generator using seen classes ii) generate artificial training examples for the categories without exemplars. ZSC is then turned into a standard supervised learning problem. Experiments with 4 generative models and 5 datasets experimentally validate the approach, giving state-of-the-art results on both ZSC and GZSC. △ Less

Submitted 11 December, 2017; v1 submitted 23 August, 2017; originally announced August 2017.

Journal ref: International Conference on Computer Vision (ICCV) Workshops : TASK-CV: Transferring and Adapting Source Knowledge in Computer Vision, Oct 2017, venise, Italy. International Conference on Computer Vision (ICCV) Workshops, 2017

arXiv:1704.03755 [pdf, other]

Unsupervised part learning for visual recognition

Authors: Ronan Sicre, Yannis Avrithis, Ewa Kijak, Frederic Jurie

Abstract: Part-based image classification aims at representing categories by small sets of learned discriminative parts, upon which an image representation is built. Considered as a promising avenue a decade ago, this direction has been neglected since the advent of deep neural networks. In this context, this paper brings two contributions: first, it shows that despite the recent success of end-to-end holis… ▽ More Part-based image classification aims at representing categories by small sets of learned discriminative parts, upon which an image representation is built. Considered as a promising avenue a decade ago, this direction has been neglected since the advent of deep neural networks. In this context, this paper brings two contributions: first, it shows that despite the recent success of end-to-end holistic models, explicit part learning can boosts classification performance. Second, this work proceeds one step further than recent part-based models (PBM), focusing on how to learn parts without using any labeled data. Instead of learning a set of parts per class, as generally done in the PBM literature, the proposed approach both constructs a partition of a given set of images into visually similar groups, and subsequently learn a set of discriminative parts per group in a fully unsupervised fashion. This strategy opens the door to the use of PBM in new applications for which the notion of image categories is irrelevant, such as instance-based image retrieval, for example. We experimentally show that our learned parts can help building efficient image representations, for classification as well as for indexing tasks, resulting in performance superior to holistic state-of-the art Deep Convolutional Neural Networks (DCNN) encoding. △ Less

Submitted 12 April, 2017; originally announced April 2017.

arXiv:1702.02382 [pdf, ps, other]

An Adversarial Regularisation for Semi-Supervised Training of Structured Output Neural Networks

Authors: Mateusz Koziński, Loïc Simon, Frédéric Jurie

Abstract: We propose a method for semi-supervised training of structured-output neural networks. Inspired by the framework of Generative Adversarial Networks (GAN), we train a discriminator network to capture the notion of a quality of network output. To this end, we leverage the qualitative difference between outputs obtained on the labelled training data and unannotated data. We then use the discriminator… ▽ More We propose a method for semi-supervised training of structured-output neural networks. Inspired by the framework of Generative Adversarial Networks (GAN), we train a discriminator network to capture the notion of a quality of network output. To this end, we leverage the qualitative difference between outputs obtained on the labelled training data and unannotated data. We then use the discriminator as a source of error signal for unlabelled data. This effectively boosts the performance of a network on a held out test set. Initial experiments in image segmentation demonstrate that the proposed framework enables achieving the same network performance as in a fully supervised scenario, while using two times less annotations. △ Less

Submitted 8 February, 2017; originally announced February 2017.

arXiv:1611.04413 [pdf, other]

Automatic discovery of discriminative parts as a quadratic assignment problem

Authors: Ronan Sicre, Julien Rabin, Yannis Avrithis, Teddy Furon, Frederic Jurie

Abstract: Part-based image classification consists in representing categories by small sets of discriminative parts upon which a representation of the images is built. This paper addresses the question of how to automatically learn such parts from a set of labeled training images. The training of parts is cast as a quadratic assignment problem in which optimal correspondences between image regions and parts… ▽ More Part-based image classification consists in representing categories by small sets of discriminative parts upon which a representation of the images is built. This paper addresses the question of how to automatically learn such parts from a set of labeled training images. The training of parts is cast as a quadratic assignment problem in which optimal correspondences between image regions and parts are automatically learned. The paper analyses different assignment strategies and thoroughly evaluates them on two public datasets: Willow actions and MIT 67 scenes. State-of-the art results are obtained on these datasets. △ Less

Submitted 14 November, 2016; originally announced November 2016.

arXiv:1611.00142 [pdf, other]

Deep fusion of visual signatures for client-server facial analysis

Authors: Binod Bhattarai, Gaurav Sharma, Frederic Jurie

Abstract: Facial analysis is a key technology for enabling human-machine interaction. In this context, we present a client-server framework, where a client transmits the signature of a face to be analyzed to the server, and, in return, the server sends back various information describing the face e.g. is the person male or female, is she/he bald, does he have a mustache, etc. We assume that a client can com… ▽ More Facial analysis is a key technology for enabling human-machine interaction. In this context, we present a client-server framework, where a client transmits the signature of a face to be analyzed to the server, and, in return, the server sends back various information describing the face e.g. is the person male or female, is she/he bald, does he have a mustache, etc. We assume that a client can compute one (or a combination) of visual features; from very simple and efficient features, like Local Binary Patterns, to more complex and computationally heavy, like Fisher Vectors and CNN based, depending on the computing resources available. The challenge addressed in this paper is to design a common universal representation such that a single merged signature is transmitted to the server, whatever be the type and number of features computed by the client, ensuring nonetheless an optimal performance. Our solution is based on learning of a common optimal subspace for aligning the different face features and merging them into a universal signature. We have validated the proposed method on the challenging CelebA dataset, on which our method outperforms existing state-of-the-art methods when rich representation is available at test time, while giving competitive performance when only simple signatures (like LBP) are available at test time due to resource constraints on the client. △ Less

Submitted 9 November, 2016; v1 submitted 1 November, 2016; originally announced November 2016.

arXiv:1608.07441 [pdf, other]

Hard Negative Mining for Metric Learning Based Zero-Shot Classification

Authors: Maxime Bucher, Stéphane Herbin, Frédéric Jurie

Abstract: Zero-Shot learning has been shown to be an efficient strategy for domain adaptation. In this context, this paper builds on the recent work of Bucher et al. [1], which proposed an approach to solve Zero-Shot classification problems (ZSC) by introducing a novel metric learning based objective function. This objective function allows to learn an optimal embedding of the attributes jointly with a meas… ▽ More Zero-Shot learning has been shown to be an efficient strategy for domain adaptation. In this context, this paper builds on the recent work of Bucher et al. [1], which proposed an approach to solve Zero-Shot classification problems (ZSC) by introducing a novel metric learning based objective function. This objective function allows to learn an optimal embedding of the attributes jointly with a measure of similarity between images and attributes. This paper extends their approach by proposing several schemes to control the generation of the negative pairs, resulting in a significant improvement of the performance and giving above state-of-the-art results on three challenging ZSC datasets. △ Less

Submitted 26 August, 2016; originally announced August 2016.

Journal ref: ECCV 16 WS TASK-CV: Transferring and Adapting Source Knowledge in Computer Vision, Oct 2016, Amsterdam, Netherlands. ECCV 16 WS TASK-CV: Transferring and Adapting Source Knowledge in Computer Vision

arXiv:1607.08085 [pdf, other]

Improving Semantic Embedding Consistency by Metric Learning for Zero-Shot Classification

Authors: Maxime Bucher, Stéphane Herbin, Frédéric Jurie

Abstract: This paper addresses the task of zero-shot image classification. The key contribution of the proposed approach is to control the semantic embedding of images -- one of the main ingredients of zero-shot learning -- by formulating it as a metric learning problem. The optimized empirical criterion associates two types of sub-task constraints: metric discriminating capacity and accurate attribute pred… ▽ More This paper addresses the task of zero-shot image classification. The key contribution of the proposed approach is to control the semantic embedding of images -- one of the main ingredients of zero-shot learning -- by formulating it as a metric learning problem. The optimized empirical criterion associates two types of sub-task constraints: metric discriminating capacity and accurate attribute prediction. This results in a novel expression of zero-shot learning not requiring the notion of class in the training phase: only pairs of image/attributes, augmented with a consistency indicator, are given as ground truth. At test time, the learned model can predict the consistency of a test image with a given set of attributes , allowing flexible ways to produce recognition inferences. Despite its simplicity, the proposed approach gives state-of-the-art results on four challenging datasets used for zero-shot recognition evaluation. △ Less

Submitted 27 July, 2016; originally announced July 2016.

Comments: in ECCV 2016, Oct 2016, amsterdam, Netherlands. 2016

arXiv:1604.02975 [pdf, other]

CP-mtML: Coupled Projection multi-task Metric Learning for Large Scale Face Retrieval

Authors: Binod Bhattarai, Gaurav Sharma, Frederic Jurie

Abstract: We propose a novel Coupled Projection multi-task Metric Learning (CP-mtML) method for large scale face retrieval. In contrast to previous works which were limited to low dimensional features and small datasets, the proposed method scales to large datasets with high dimensional face descriptors. It utilises pairwise (dis-)similarity constraints as supervision and hence does not require exhaustive c… ▽ More We propose a novel Coupled Projection multi-task Metric Learning (CP-mtML) method for large scale face retrieval. In contrast to previous works which were limited to low dimensional features and small datasets, the proposed method scales to large datasets with high dimensional face descriptors. It utilises pairwise (dis-)similarity constraints as supervision and hence does not require exhaustive class annotation for every training image. While, traditionally, multi-task learning methods have been validated on same dataset but different tasks, we work on the more challenging setting with heterogeneous datasets and different tasks. We show empirical validation on multiple face image datasets of different facial traits, e.g. identity, age and expression. We use classic Local Binary Pattern (LBP) descriptors along with the recent Deep Convolutional Neural Network (CNN) features. The experiments clearly demonstrate the scalability and improved performance of the proposed method on the tasks of identity and age based face image retrieval compared to competitive existing methods, on the standard datasets and with the presence of a million distractor face images. △ Less

Submitted 11 April, 2016; originally announced April 2016.

arXiv:1510.00542 [pdf, other]

Local Higher-Order Statistics (LHS) describing images with statistics of local non-binarized pixel patterns

Authors: Gaurav Sharma, Frederic Jurie

Abstract: We propose a new image representation for texture categorization and facial analysis, relying on the use of higher-order local differential statistics as features. It has been recently shown that small local pixel pattern distributions can be highly discriminative while being extremely efficient to compute, which is in contrast to the models based on the global structure of images. Motivated by su… ▽ More We propose a new image representation for texture categorization and facial analysis, relying on the use of higher-order local differential statistics as features. It has been recently shown that small local pixel pattern distributions can be highly discriminative while being extremely efficient to compute, which is in contrast to the models based on the global structure of images. Motivated by such works, we propose to use higher-order statistics of local non-binarized pixel patterns for the image description. The proposed model does not require either (i) user specified quantization of the space (of pixel patterns) or (ii) any heuristics for discarding low occupancy volumes of the space. We propose to use a data driven soft quantization of the space, with parametric mixture models, combined with higher-order statistics, based on Fisher scores. We demonstrate that this leads to a more expressive representation which, when combined with discriminatively learned classifiers and metrics, achieves state-of-the-art performance on challenging texture and facial analysis datasets, in low complexity setup. Further, it is complementary to higher complexity features and when combined with them improves performance. △ Less

Submitted 2 October, 2015; originally announced October 2015.

Comments: CVIU preprint

arXiv:1509.04186 [pdf, other]

doi 10.1109/TPAMI.2016.2537325

Expanded Parts Model for Semantic Description of Humans in Still Images

Authors: Gaurav Sharma, Frederic Jurie, Cordelia Schmid

Abstract: We introduce an Expanded Parts Model (EPM) for recognizing human attributes (e.g. young, short hair, wearing suit) and actions (e.g. running, jum**) in still images. An EPM is a collection of part templates which are learnt discriminatively to explain specific scale-space regions in the images (in human centric coordinates). This is in contrast to current models which consist of a relatively few… ▽ More We introduce an Expanded Parts Model (EPM) for recognizing human attributes (e.g. young, short hair, wearing suit) and actions (e.g. running, jum**) in still images. An EPM is a collection of part templates which are learnt discriminatively to explain specific scale-space regions in the images (in human centric coordinates). This is in contrast to current models which consist of a relatively few (i.e. a mixture of) 'average' templates. EPM uses only a subset of the parts to score an image and scores the image sparsely in space, i.e. it ignores redundant and random background in an image. To learn our model, we propose an algorithm which automatically mines parts and learns corresponding discriminative templates together with their respective locations from a large number of candidate parts. We validate our method on three recent challenging datasets of human attributes and actions. We obtain convincing qualitative and state-of-the-art quantitative results on the three datasets. △ Less

Submitted 25 February, 2016; v1 submitted 14 September, 2015; originally announced September 2015.

Comments: Accepted for publication in IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI)

arXiv:1503.04065 [pdf, other]

Hybrid multi-layer Deep CNN/Aggregator feature for image classification

Authors: Praveen Kulkarni, Joaquin Zepeda, Frederic Jurie, Patrick Perez, Louis Chevallier

Abstract: Deep Convolutional Neural Networks (DCNN) have established a remarkable performance benchmark in the field of image classification, displacing classical approaches based on hand-tailored aggregations of local descriptors. Yet DCNNs impose high computational burdens both at training and at testing time, and training them requires collecting and annotating large amounts of training data. Supervised… ▽ More Deep Convolutional Neural Networks (DCNN) have established a remarkable performance benchmark in the field of image classification, displacing classical approaches based on hand-tailored aggregations of local descriptors. Yet DCNNs impose high computational burdens both at training and at testing time, and training them requires collecting and annotating large amounts of training data. Supervised adaptation methods have been proposed in the literature that partially re-learn a transferred DCNN structure from a new target dataset. Yet these require expensive bounding-box annotations and are still computationally expensive to learn. In this paper, we address these shortcomings of DCNN adaptation schemes by proposing a hybrid approach that combines conventional, unsupervised aggregators such as Bag-of-Words (BoW), with the DCNN pipeline by treating the output of intermediate layers as densely extracted local descriptors. We test a variant of our approach that uses only intermediate DCNN layers on the standard PASCAL VOC 2007 dataset and show performance significantly higher than the standard BoW model and comparable to Fisher vector aggregation but with a feature that is 150 times smaller. A second variant of our approach that includes the fully connected DCNN layers significantly outperforms Fisher vector schemes and performs comparably to DCNN approaches adapted to Pascal VOC 2007, yet at only a small fraction of the training and testing cost. △ Less

Submitted 13 March, 2015; originally announced March 2015.

Comments: Accepted in ICASSP 2015 conference, 5 pages including reference, 4 figures and 2 tables

Showing 1–38 of 38 results for author: Jurie, F