Search | arXiv e-print repository

GalLoP: Learning Global and Local Prompts for Vision-Language Models

Authors: Marc Lafon, Elias Ramzi, Clément Rambour, Nicolas Audebert, Nicolas Thome

Abstract: Prompt learning has been widely adopted to efficiently adapt vision-language models (VLMs), e.g. CLIP, for few-shot image classification. Despite their success, most prompt learning methods trade-off between classification accuracy and robustness, e.g. in domain generalization or out-of-distribution (OOD) detection. In this work, we introduce Global-Local Prompts (GalLoP), a new prompt learning me… ▽ More Prompt learning has been widely adopted to efficiently adapt vision-language models (VLMs), e.g. CLIP, for few-shot image classification. Despite their success, most prompt learning methods trade-off between classification accuracy and robustness, e.g. in domain generalization or out-of-distribution (OOD) detection. In this work, we introduce Global-Local Prompts (GalLoP), a new prompt learning method that learns multiple diverse prompts leveraging both global and local visual features. The training of the local prompts relies on local features with an enhanced vision-text alignment. To focus only on pertinent features, this local alignment is coupled with a sparsity strategy in the selection of the local features. We enforce diversity on the set of prompts using a new ``prompt dropout'' technique and a multiscale strategy on the local prompts. GalLoP outperforms previous prompt learning methods on accuracy on eleven datasets in different few shots settings and with various backbones. Furthermore, GalLoP shows strong robustness performances in both domain generalization and OOD detection, even outperforming dedicated OOD detection methods. Code and instructions to reproduce our results will be open-sourced. △ Less

Submitted 1 July, 2024; originally announced July 2024.

Comments: To be published at ECCV 2024

arXiv:2405.09922 [pdf, other]

Cross-sensor self-supervised training and alignment for remote sensing

Authors: Valerio Marsocci, Nicolas Audebert

Abstract: Large-scale "foundation models" have gained traction as a way to leverage the vast amounts of unlabeled remote sensing data collected every day. However, due to the multiplicity of Earth Observation satellites, these models should learn "sensor agnostic" representations, that generalize across sensor characteristics with minimal fine-tuning. This is complicated by data availability, as low-resolut… ▽ More Large-scale "foundation models" have gained traction as a way to leverage the vast amounts of unlabeled remote sensing data collected every day. However, due to the multiplicity of Earth Observation satellites, these models should learn "sensor agnostic" representations, that generalize across sensor characteristics with minimal fine-tuning. This is complicated by data availability, as low-resolution imagery, such as Sentinel-2 and Landsat-8 data, are available in large amounts, while very high-resolution aerial or satellite data is less common. To tackle these challenges, we introduce cross-sensor self-supervised training and alignment for remote sensing (X-STARS). We design a self-supervised training loss, the Multi-Sensor Alignment Dense loss (MSAD), to align representations across sensors, even with vastly different resolutions. Our X-STARS can be applied to train models from scratch, or to adapt large models pretrained on e.g low-resolution EO data to new high-resolution sensors, in a continual pretraining framework. We collect and release MSC-France, a new multi-sensor dataset, on which we train our X-STARS models, then evaluated on seven downstream classification and segmentation tasks. We demonstrate that X-STARS outperforms the state-of-the-art by a significant margin with less data across various conditions of data availability and resolutions. △ Less

Submitted 16 May, 2024; originally announced May 2024.

arXiv:2404.16409 [pdf, other]

Cross-sensor super-resolution of irregularly sampled Sentinel-2 time series

Authors: Aimi Okabayashi, Nicolas Audebert, Simon Donike, Charlotte Pelletier

Abstract: Satellite imaging generally presents a trade-off between the frequency of acquisitions and the spatial resolution of the images. Super-resolution is often advanced as a way to get the best of both worlds. In this work, we investigate multi-image super-resolution of satellite image time series, i.e. how multiple images of the same area acquired at different dates can help reconstruct a higher resol… ▽ More Satellite imaging generally presents a trade-off between the frequency of acquisitions and the spatial resolution of the images. Super-resolution is often advanced as a way to get the best of both worlds. In this work, we investigate multi-image super-resolution of satellite image time series, i.e. how multiple images of the same area acquired at different dates can help reconstruct a higher resolution observation. In particular, we extend state-of-the-art deep single and multi-image super-resolution algorithms, such as SRDiff and HighRes-net, to deal with irregularly sampled Sentinel-2 time series. We introduce BreizhSR, a new dataset for 4x super-resolution of Sentinel-2 time series using very high-resolution SPOT-6 imagery of Brittany, a French region. We show that using multiple images significantly improves super-resolution performance, and that a well-designed temporal positional encoding allows us to perform super-resolution for different times of the series. In addition, we observe a trade-off between spectral fidelity and perceptual quality of the reconstructed HR images, questioning future directions for super-resolution of Earth Observation data. △ Less

Submitted 25 April, 2024; originally announced April 2024.

Journal ref: EARTHVISION 2024 IEEE/CVF CVPR Workshop. Large Scale Computer Vision for Remote Sensing Imagery, Jun 2024, Seattle, United States

arXiv:2404.12667 [pdf, other]

Detecting Out-Of-Distribution Earth Observation Images with Diffusion Models

Authors: Georges Le Bellier, Nicolas Audebert

Abstract: Earth Observation imagery can capture rare and unusual events, such as disasters and major landscape changes, whose visual appearance contrasts with the usual observations. Deep models trained on common remote sensing data will output drastically different features for these out-of-distribution samples, compared to those closer to their training dataset. Detecting them could therefore help anticip… ▽ More Earth Observation imagery can capture rare and unusual events, such as disasters and major landscape changes, whose visual appearance contrasts with the usual observations. Deep models trained on common remote sensing data will output drastically different features for these out-of-distribution samples, compared to those closer to their training dataset. Detecting them could therefore help anticipate changes in the observations, either geographical or environmental. In this work, we show that the reconstruction error of diffusion models can effectively serve as unsupervised out-of-distribution detectors for remote sensing images, using them as a plausibility score. Moreover, we introduce ODEED, a novel reconstruction-based scorer using the probability-flow ODE of diffusion models. We validate it experimentally on SpaceNet 8 with various scenarios, such as classical OOD detection with geographical shift and near-OOD setups: pre/post-flood and non-flooded/flooded image recognition. We show that our ODEED scorer significantly outperforms other diffusion-based and discriminative baselines on the more challenging near-OOD scenarios of flood image detection, where OOD images are close to the distribution tail. We aim to pave the way towards better use of generative models for anomaly detection in remote sensing. △ Less

Submitted 19 April, 2024; originally announced April 2024.

Comments: EARTHVISION 2024 IEEE/CVF CVPR Workshop. Large Scale Computer Vision for Remote Sensing Imagery, Jun 2024, Seattle, United States

arXiv:2311.16122 [pdf, other]

Semantic Generative Augmentations for Few-Shot Counting

Authors: Perla Doubinsky, Nicolas Audebert, Michel Crucianu, Hervé Le Borgne

Abstract: With the availability of powerful text-to-image diffusion models, recent works have explored the use of synthetic data to improve image classification performances. These works show that it can effectively augment or even replace real data. In this work, we investigate how synthetic data can benefit few-shot class-agnostic counting. This requires to generate images that correspond to a given input… ▽ More With the availability of powerful text-to-image diffusion models, recent works have explored the use of synthetic data to improve image classification performances. These works show that it can effectively augment or even replace real data. In this work, we investigate how synthetic data can benefit few-shot class-agnostic counting. This requires to generate images that correspond to a given input number of objects. However, text-to-image models struggle to grasp the notion of count. We propose to rely on a double conditioning of Stable Diffusion with both a prompt and a density map in order to augment a training dataset for few-shot counting. Due to the small dataset size, the fine-tuned model tends to generate images close to the training images. We propose to enhance the diversity of synthesized images by exchanging captions between images thus creating unseen configurations of object types and spatial layout. Our experiments show that our diversified generation strategy significantly improves the counting accuracy of two recent and performing few-shot counting models on FSC147 and CARPK. △ Less

Submitted 26 October, 2023; originally announced November 2023.

arXiv:2309.08250 [pdf, other]

Optimization of Rank Losses for Image Retrieval

Authors: Elias Ramzi, Nicolas Audebert, Clément Rambour, André Araujo, Xavier Bitot, Nicolas Thome

Abstract: In image retrieval, standard evaluation metrics rely on score ranking, \eg average precision (AP), recall at k (R@k), normalized discounted cumulative gain (NDCG). In this work we introduce a general framework for robust and decomposable rank losses optimization. It addresses two major challenges for end-to-end training of deep neural networks with rank losses: non-differentiability and non-decomp… ▽ More In image retrieval, standard evaluation metrics rely on score ranking, \eg average precision (AP), recall at k (R@k), normalized discounted cumulative gain (NDCG). In this work we introduce a general framework for robust and decomposable rank losses optimization. It addresses two major challenges for end-to-end training of deep neural networks with rank losses: non-differentiability and non-decomposability. Firstly we propose a general surrogate for ranking operator, SupRank, that is amenable to stochastic gradient descent. It provides an upperbound for rank losses and ensures robust training. Secondly, we use a simple yet effective loss function to reduce the decomposability gap between the averaged batch approximation of ranking losses and their values on the whole training set. We apply our framework to two standard metrics for image retrieval: AP and R@k. Additionally we apply our framework to hierarchical image retrieval. We introduce an extension of AP, the hierarchical average precision $\mathcal{H}$-AP, and optimize it as well as the NDCG. Finally we create the first hierarchical landmarks retrieval dataset. We use a semi-automatic pipeline to create hierarchical labels, extending the large scale Google Landmarks v2 dataset. The hierarchical dataset is publicly available at https://github.com/cvdfoundation/google-landmark. Code will be released at https://github.com/elias-ramzi/SupRank. △ Less

Submitted 15 September, 2023; originally announced September 2023.

Comments: arXiv admin note: text overlap with arXiv:2207.04873

arXiv:2304.10508 [pdf, other]

Wasserstein Loss for Semantic Editing in the Latent Space of GANs

Authors: Perla Doubinsky, Nicolas Audebert, Michel Crucianu, Hervé Le Borgne

Abstract: The latent space of GANs contains rich semantics reflecting the training data. Different methods propose to learn edits in latent space corresponding to semantic attributes, thus allowing to modify generated images. Most supervised methods rely on the guidance of classifiers to produce such edits. However, classifiers can lead to out-of-distribution regions and be fooled by adversarial samples. We… ▽ More The latent space of GANs contains rich semantics reflecting the training data. Different methods propose to learn edits in latent space corresponding to semantic attributes, thus allowing to modify generated images. Most supervised methods rely on the guidance of classifiers to produce such edits. However, classifiers can lead to out-of-distribution regions and be fooled by adversarial samples. We propose an alternative formulation based on the Wasserstein loss that avoids such problems, while maintaining performance on-par with classifier-based approaches. We demonstrate the effectiveness of our method on two datasets (digits and faces) using StyleGAN2. △ Less

Submitted 22 March, 2023; originally announced April 2023.

arXiv:2207.04873 [pdf, other]

Hierarchical Average Precision Training for Pertinent Image Retrieval

Authors: Elias Ramzi, Nicolas Audebert, Nicolas Thome, Clément Rambour, Xavier Bitot

Abstract: Image Retrieval is commonly evaluated with Average Precision (AP) or Recall@k. Yet, those metrics, are limited to binary labels and do not take into account errors' severity. This paper introduces a new hierarchical AP training method for pertinent image retrieval (HAP-PIER). HAPPIER is based on a new H-AP metric, which leverages a concept hierarchy to refine AP by integrating errors' importance a… ▽ More Image Retrieval is commonly evaluated with Average Precision (AP) or Recall@k. Yet, those metrics, are limited to binary labels and do not take into account errors' severity. This paper introduces a new hierarchical AP training method for pertinent image retrieval (HAP-PIER). HAPPIER is based on a new H-AP metric, which leverages a concept hierarchy to refine AP by integrating errors' importance and better evaluate rankings. To train deep models with H-AP, we carefully study the problem's structure and design a smooth lower bound surrogate combined with a clustering loss that ensures consistent ordering. Extensive experiments on 6 datasets show that HAPPIER significantly outperforms state-of-the-art methods for hierarchical retrieval, while being on par with the latest approaches when evaluating fine-grained ranking performances. Finally, we show that HAPPIER leads to better organization of the embedding space, and prevents most severe failure cases of non-hierarchical methods. Our code is publicly available at: https://github.com/elias-ramzi/HAPPIER. △ Less

Submitted 22 July, 2022; v1 submitted 5 July, 2022; originally announced July 2022.

Journal ref: ECCV 2022, Oct 2022, Tel-Aviv, Israel

arXiv:2202.03190 [pdf, other]

Efficient Autoprecoder-based deep learning for massive MU-MIMO Downlink under PA Non-Linearities

Authors: Xinying Cheng, Rafik Zayani, Marin Ferecatu, Nicolas Audebert

Abstract: This paper introduces a new efficient autoprecoder (AP) based deep learning approach for massive multiple-input multiple-output (mMIMO) downlink systems in which the base station is equipped with a large number of antennas with energy-efficient power amplifiers (PAs) and serves multiple user terminals. We present AP-mMIMO, a new method that jointly eliminates the multiuser interference and compens… ▽ More This paper introduces a new efficient autoprecoder (AP) based deep learning approach for massive multiple-input multiple-output (mMIMO) downlink systems in which the base station is equipped with a large number of antennas with energy-efficient power amplifiers (PAs) and serves multiple user terminals. We present AP-mMIMO, a new method that jointly eliminates the multiuser interference and compensates the severe nonlinear (NL) PA distortions. Unlike previous works, AP-mMIMO has a low computational complexity, making it suitable for a global energy-efficient system. Specifically, we aim to design the PA-aware precoder and the receive decoder by leveraging the concept of autoprecoder, whereas the end-to-end massive multiuser (MU)-MIMO downlink is designed using a deep neural network (NN). Most importantly, the proposed AP-mMIMO is suited for the varying block fading channel scenario. To deal with such scenarios, we consider a two-stage precoding scheme: 1) a NN-precoder is used to address the PA non-linearities and 2) a linear precoder is used to suppress the multiuser interference. The NN-precoder and the receive decoder are trained off-line and when the channel varies, only the linear precoder changes on-line. This latter is designed by using the widely used zero-forcing precoding scheme or its lowcomplexity version based on matrix polynomials. Numerical simulations show that the proposed AP-mMIMO approach achieves competitive performance with a significantly lower complexity compared to existing literature. Index Terms-multiuser (MU) precoding, massive multipleinput multiple-output (MIMO), energy-efficiency, hardware impairment, power amplifier (PA) nonlinearities, autoprecoder, deep learning, neural network (NN) △ Less

Submitted 3 February, 2022; originally announced February 2022.

Journal ref: IEEE Wireless Communications and Networking Conference, Apr 2022, Austin, United States

arXiv:2111.00909 [pdf, other]

Multi-Attribute Balanced Sampling for Disentangled GAN Controls

Authors: Perla Doubinsky, Nicolas Audebert, Michel Crucianu, Hervé Le Borgne

Abstract: Various controls over the generated data can be extracted from the latent space of a pre-trained GAN, as it implicitly encodes the semantics of the training data. The discovered controls allow to vary semantic attributes in the generated images but usually lead to entangled edits that affect multiple attributes at the same time. Supervised approaches typically sample and annotate a collection of l… ▽ More Various controls over the generated data can be extracted from the latent space of a pre-trained GAN, as it implicitly encodes the semantics of the training data. The discovered controls allow to vary semantic attributes in the generated images but usually lead to entangled edits that affect multiple attributes at the same time. Supervised approaches typically sample and annotate a collection of latent codes, then train classifiers in the latent space to identify the controls. Since the data generated by GANs reflects the biases of the original dataset, so do the resulting semantic controls. We propose to address disentanglement by subsampling the generated data to remove over-represented co-occuring attributes thus balancing the semantics of the dataset before training the classifiers. We demonstrate the effectiveness of this approach by extracting disentangled linear directions for face manipulation on two popular GAN architectures, PGGAN and StyleGAN, and two datasets, CelebAHQ and FFHQ. We show that this approach outperforms state-of-the-art classifier-based methods while avoiding the need for disentanglement-enforcing post-processing. △ Less

Submitted 27 January, 2022; v1 submitted 28 October, 2021; originally announced November 2021.

arXiv:2110.01445 [pdf, other]

Robust and Decomposable Average Precision for Image Retrieval

Authors: Elias Ramzi, Nicolas Thome, Clément Rambour, Nicolas Audebert, Xavier Bitot

Abstract: In image retrieval, standard evaluation metrics rely on score ranking, e.g. average precision (AP). In this paper, we introduce a method for robust and decomposable average precision (ROADMAP) addressing two major challenges for end-to-end training of deep neural networks with AP: non-differentiability and non-decomposability. Firstly, we propose a new differentiable approximation of the rank func… ▽ More In image retrieval, standard evaluation metrics rely on score ranking, e.g. average precision (AP). In this paper, we introduce a method for robust and decomposable average precision (ROADMAP) addressing two major challenges for end-to-end training of deep neural networks with AP: non-differentiability and non-decomposability. Firstly, we propose a new differentiable approximation of the rank function, which provides an upper bound of the AP loss and ensures robust training. Secondly, we design a simple yet effective loss function to reduce the decomposability gap between the AP in the whole training set and its averaged batch approximation, for which we provide theoretical guarantees. Extensive experiments conducted on three image retrieval datasets show that ROADMAP outperforms several recent AP approximation methods and highlight the importance of our two contributions. Finally, using ROADMAP for training deep models yields very good performances, outperforming state-of-the-art results on the three datasets. △ Less

Submitted 8 December, 2021; v1 submitted 1 October, 2021; originally announced October 2021.

Journal ref: Thirty-fifth Conference on Neural Information Processing Systems (NeurIPS 2021), Dec 2021, Sydney, Australia

arXiv:2108.11629 [pdf, other]

Web Image Context Extraction with Graph Neural Networks and Sentence Embeddings on the DOM tree

Authors: Chen Dang, Hicham Randrianarivo, Raphaël Fournier-S'Niehotta, Nicolas Audebert

Abstract: Web Image Context Extraction (WICE) consists in obtaining the textual information describing an image using the content of the surrounding webpage. A common preprocessing step before performing WICE is to render the content of the webpage. When done at a large scale (e.g., for search engine indexation), it may become very computationally costly (up to several seconds per page). To avoid this cost,… ▽ More Web Image Context Extraction (WICE) consists in obtaining the textual information describing an image using the content of the surrounding webpage. A common preprocessing step before performing WICE is to render the content of the webpage. When done at a large scale (e.g., for search engine indexation), it may become very computationally costly (up to several seconds per page). To avoid this cost, we introduce a novel WICE approach that combines Graph Neural Networks (GNNs) and Natural Language Processing models. Our method relies on a graph model containing both node types and text as features. The model is fed through several blocks of GNNs to extract the textual context. Since no labeled WICE dataset with ground truth exists, we train and evaluate the GNNs on a proxy task that consists in finding the semantically closest text to the image caption. We then interpret importance weights to find the most relevant text nodes and define them as the image context. Thanks to GNNs, our model is able to encode both structural and semantic information from the webpage. We show that our approach gives promising results to help address the large-scale WICE problem using only HTML data. △ Less

Submitted 26 August, 2021; originally announced August 2021.

Journal ref: GEM: Graph Embedding and Mining - ECML/PKDD Workshops, Sep 2021, Bilbao, Spain

arXiv:2107.14009 [pdf, other]

PKSpell: Data-Driven Pitch Spelling and Key Signature Estimation

Authors: Francesco Foscarin, Nicolas Audebert, Raphaël Fournier-S'Niehotta

Abstract: We present PKSpell: a data-driven approach for the joint estimation of pitch spelling and key signatures from MIDI files. Both elements are fundamental for the production of a full-fledged musical score and facilitate many MIR tasks such as harmonic analysis, section identification, melodic similarity, and search in a digital music library. We design a deep recurrent neural network model that only… ▽ More We present PKSpell: a data-driven approach for the joint estimation of pitch spelling and key signatures from MIDI files. Both elements are fundamental for the production of a full-fledged musical score and facilitate many MIR tasks such as harmonic analysis, section identification, melodic similarity, and search in a digital music library. We design a deep recurrent neural network model that only requires information readily available in all kinds of MIDI files, including performances, or other symbolic encodings. We release a model trained on the ASAP dataset. Our system can be used with these pre-trained parameters and is easy to integrate into a MIR pipeline. We also propose a data augmentation procedure that helps retraining on small datasets. PKSpell achieves strong key signature estimation performance on a challenging dataset. Most importantly, this model establishes a new state-of-the-art performance on the MuseData pitch spelling dataset without retraining. △ Less

Submitted 27 July, 2021; originally announced July 2021.

Comments: International Society for Music Information Retrieval Conference (ISMIR), Nov 2021, Online, India

arXiv:2010.07830 [pdf, other]

Semi-Supervised Semantic Segmentation in Earth Observation: The MiniFrance Suite, Dataset Analysis and Multi-task Network Study

Authors: Javiera Castillo-Navarro, Bertrand Le Saux, Alexandre Boulch, Nicolas Audebert, Sébastien Lefèvre

Abstract: The development of semi-supervised learning techniques is essential to enhance the generalization capacities of machine learning algorithms. Indeed, raw image data are abundant while labels are scarce, therefore it is crucial to leverage unlabeled inputs to build better models. The availability of large databases have been key for the development of learning algorithms with high level performance.… ▽ More The development of semi-supervised learning techniques is essential to enhance the generalization capacities of machine learning algorithms. Indeed, raw image data are abundant while labels are scarce, therefore it is crucial to leverage unlabeled inputs to build better models. The availability of large databases have been key for the development of learning algorithms with high level performance. Despite the major role of machine learning in Earth Observation to derive products such as land cover maps, datasets in the field are still limited, either because of modest surface coverage, lack of variety of scenes or restricted classes to identify. We introduce a novel large-scale dataset for semi-supervised semantic segmentation in Earth Observation, the MiniFrance suite. MiniFrance has several unprecedented properties: it is large-scale, containing over 2000 very high resolution aerial images, accounting for more than 200 billions samples (pixels); it is varied, covering 16 conurbations in France, with various climates, different landscapes, and urban as well as countryside scenes; and it is challenging, considering land use classes with high-level semantics. Nevertheless, the most distinctive quality of MiniFrance is being the only dataset in the field especially designed for semi-supervised learning: it contains labeled and unlabeled images in its training partition, which reproduces a life-like scenario. Along with this dataset, we present tools for data representativeness analysis in terms of appearance similarity and a thorough study of MiniFrance data, demonstrating that it is suitable for learning and generalizes well in a semi-supervised setting. Finally, we present semi-supervised deep architectures based on multi-task learning and the first experiments on MiniFrance. △ Less

Submitted 15 October, 2020; originally announced October 2020.

arXiv:1909.01671 [pdf, other]

Distance transform regression for spatially-aware deep semantic segmentation

Authors: Nicolas Audebert, Alexandre Boulch, Bertrand Le Saux, Sébastien Lefèvre

Abstract: Understanding visual scenes relies more and more on dense pixel-wise classification obtained via deep fully convolutional neural networks. However, due to the nature of the networks, predictions often suffer from blurry boundaries and ill-segmented shapes, fueling the need for post-processing. This work introduces a new semantic segmentation regularization based on the regression of a distance tra… ▽ More Understanding visual scenes relies more and more on dense pixel-wise classification obtained via deep fully convolutional neural networks. However, due to the nature of the networks, predictions often suffer from blurry boundaries and ill-segmented shapes, fueling the need for post-processing. This work introduces a new semantic segmentation regularization based on the regression of a distance transform. After computing the distance transform on the label masks, we train a FCN in a multi-task setting in both discrete and continuous spaces by learning jointly classification and distance regression. This requires almost no modification of the network structure and adds a very low overhead to the training process. Learning to approximate the distance transform back-propagates spatial cues that implicitly regularizes the segmentation. We validate this technique with several architectures on various datasets, and we show significant improvements compared to competitive baselines. △ Less

Submitted 4 September, 2019; originally announced September 2019.

arXiv:1907.06370 [pdf, other]

Multimodal deep networks for text and image-based document classification

Authors: Nicolas Audebert, Catherine Herold, Kuider Slimani, Cédric Vidal

Abstract: Classification of document images is a critical step for archival of old manuscripts, online subscription and administrative procedures. Computer vision and deep learning have been suggested as a first solution to classify documents based on their visual appearance. However, achieving the fine-grained classification that is required in real-world setting cannot be achieved by visual analysis alone… ▽ More Classification of document images is a critical step for archival of old manuscripts, online subscription and administrative procedures. Computer vision and deep learning have been suggested as a first solution to classify documents based on their visual appearance. However, achieving the fine-grained classification that is required in real-world setting cannot be achieved by visual analysis alone. Often, the relevant information is in the actual text content of the document. We design a multimodal neural network that is able to learn from word embeddings, computed on text extracted by OCR, and from the image. We show that this approach boosts pure image accuracy by 3% on Tobacco3482 and RVL-CDIP augmented by our new QS-OCR text dataset (https://github.com/Quicksign/ocrized-text-dataset), even without clean text information. △ Less

Submitted 15 July, 2019; originally announced July 2019.

arXiv:1904.10674 [pdf, other]

doi 10.1109/MGRS.2019.2912563

Deep Learning for Classification of Hyperspectral Data: A Comparative Review

Authors: Nicolas Audebert, Bertrand Saux, Sébastien Lefèvre

Abstract: In recent years, deep learning techniques revolutionized the way remote sensing data are processed. Classification of hyperspectral data is no exception to the rule, but has intrinsic specificities which make application of deep learning less straightforward than with other optical data. This article presents a state of the art of previous machine learning approaches, reviews the various deep lear… ▽ More In recent years, deep learning techniques revolutionized the way remote sensing data are processed. Classification of hyperspectral data is no exception to the rule, but has intrinsic specificities which make application of deep learning less straightforward than with other optical data. This article presents a state of the art of previous machine learning approaches, reviews the various deep learning approaches currently proposed for hyperspectral classification, and identifies the problems and difficulties which arise to implement deep neural networks for this task. In particular, the issues of spatial and spectral resolution, data volume, and transfer of models from multimedia images to hyperspectral data are addressed. Additionally, a comparative study of various families of network architectures is provided and a software toolbox is publicly released to allow experimenting with these methods. 1 This article is intended for both data scientists with interest in hyperspectral data and remote sensing experts eager to apply deep learning techniques to their own dataset. △ Less

Submitted 24 April, 2019; originally announced April 2019.

arXiv:1806.02583 [pdf, other]

Generative Adversarial Networks for Realistic Synthesis of Hyperspectral Samples

Authors: Nicolas Audebert, Bertrand Le Saux, Sébastien Lefèvre

Abstract: This work addresses the scarcity of annotated hyperspectral data required to train deep neural networks. Especially, we investigate generative adversarial networks and their application to the synthesis of consistent labeled spectra. By training such networks on public datasets, we show that these models are not only able to capture the underlying distribution, but also to generate genuine-looking… ▽ More This work addresses the scarcity of annotated hyperspectral data required to train deep neural networks. Especially, we investigate generative adversarial networks and their application to the synthesis of consistent labeled spectra. By training such networks on public datasets, we show that these models are not only able to capture the underlying distribution, but also to generate genuine-looking and physically plausible spectra. Moreover, we experimentally validate that the synthetic samples can be used as an effective data augmentation strategy. We validate our approach on several public hyper-spectral datasets using a variety of deep classifiers. △ Less

Submitted 7 June, 2018; originally announced June 2018.

Journal ref: International Geoscience and Remote Sensing Symposium (IGARSS 2018), Jul 2018, Valencia, Spain

arXiv:1712.01600 [pdf, other]

Deep learning for semantic segmentation of remote sensing images with rich spectral content

Authors: A Hamida, A. Benoît, P. Lambert, L Klein, C Amar, N. Audebert, S. Lefèvre

Abstract: With the rapid development of Remote Sensing acquisition techniques, there is a need to scale and improve processing tools to cope with the observed increase of both data volume and richness. Among popular techniques in remote sensing, Deep Learning gains increasing interest but depends on the quality of the training data. Therefore, this paper presents recent Deep Learning approaches for fine or… ▽ More With the rapid development of Remote Sensing acquisition techniques, there is a need to scale and improve processing tools to cope with the observed increase of both data volume and richness. Among popular techniques in remote sensing, Deep Learning gains increasing interest but depends on the quality of the training data. Therefore, this paper presents recent Deep Learning approaches for fine or coarse land cover semantic segmentation estimation. Various 2D architectures are tested and a new 3D model is introduced in order to jointly process the spatial and spectral dimensions of the data. Such a set of networks enables the comparison of the different spectral fusion schemes. Besides, we also assess the use of a " noisy ground truth " (i.e. outdated and low spatial resolution labels) for training and testing the networks. △ Less

Submitted 5 December, 2017; originally announced December 2017.

Comments: IEEE International Geoscience and Remote Sensing Symposium, Jul 2017, Fort Worth, United States. 2017

arXiv:1711.08681 [pdf, other]

Beyond RGB: Very High Resolution Urban Remote Sensing With Multimodal Deep Networks

Authors: Nicolas Audebert, Bertrand Le Saux, Sébastien Lefèvre

Abstract: In this work, we investigate various methods to deal with semantic labeling of very high resolution multi-modal remote sensing data. Especially, we study how deep fully convolutional networks can be adapted to deal with multi-modal and multi-scale remote sensing data for semantic labeling. Our contributions are threefold: a) we present an efficient multi-scale approach to leverage both a large spa… ▽ More In this work, we investigate various methods to deal with semantic labeling of very high resolution multi-modal remote sensing data. Especially, we study how deep fully convolutional networks can be adapted to deal with multi-modal and multi-scale remote sensing data for semantic labeling. Our contributions are threefold: a) we present an efficient multi-scale approach to leverage both a large spatial context and the high resolution data, b) we investigate early and late fusion of Lidar and multispectral data, c) we validate our methods on two public datasets with state-of-the-art results. Our results indicate that late fusion make it possible to recover errors steaming from ambiguous data, while early fusion allows for better joint-feature learning but at the cost of higher sensitivity to missing data. △ Less

Submitted 23 November, 2017; originally announced November 2017.

Comments: ISPRS Journal of Photogrammetry and Remote Sensing, Elsevier, A Para{î}tre

arXiv:1705.06057 [pdf, other]

Joint Learning from Earth Observation and OpenStreetMap Data to Get Faster Better Semantic Maps

Authors: Nicolas Audebert, Bertrand Le Saux, Sébastien Lefèvre

Abstract: In this work, we investigate the use of OpenStreetMap data for semantic labeling of Earth Observation images. Deep neural networks have been used in the past for remote sensing data classification from various sensors, including multispectral, hyperspectral, SAR and LiDAR data. While OpenStreetMap has already been used as ground truth data for training such networks, this abundant data source rema… ▽ More In this work, we investigate the use of OpenStreetMap data for semantic labeling of Earth Observation images. Deep neural networks have been used in the past for remote sensing data classification from various sensors, including multispectral, hyperspectral, SAR and LiDAR data. While OpenStreetMap has already been used as ground truth data for training such networks, this abundant data source remains rarely exploited as an input information layer. In this paper, we study different use cases and deep network architectures to leverage OpenStreetMap data for semantic labeling of aerial and satellite images. Especially , we look into fusion based architectures and coarse-to-fine segmentation to include the OpenStreetMap layer into multispectral-based deep fully convolutional networks. We illustrate how these methods can be successfully used on two public datasets: ISPRS Potsdam and DFC2017. We show that OpenStreetMap data can efficiently be integrated into the vision-based deep learning models and that it significantly improves both the accuracy performance and the convergence speed of the networks. △ Less

Submitted 17 May, 2017; originally announced May 2017.

Journal ref: EARTHVISION 2017 IEEE/ISPRS CVPR Workshop. Large Scale Computer Vision for Remote Sensing Imagery, Jul 2017, Honolulu, United States. 2017

arXiv:1701.05818 [pdf, other]

Fusion of Heterogeneous Data in Convolutional Networks for Urban Semantic Labeling (Invited Paper)

Authors: Nicolas Audebert, Bertrand Le Saux, Sébastien Lefèvre

Abstract: In this work, we present a novel module to perform fusion of heterogeneous data using fully convolutional networks for semantic labeling. We introduce residual correction as a way to learn how to fuse predictions coming out of a dual stream architecture. Especially, we perform fusion of DSM and IRRG optical data on the ISPRS Vaihingen dataset over a urban area and obtain new state-of-the-art resul… ▽ More In this work, we present a novel module to perform fusion of heterogeneous data using fully convolutional networks for semantic labeling. We introduce residual correction as a way to learn how to fuse predictions coming out of a dual stream architecture. Especially, we perform fusion of DSM and IRRG optical data on the ISPRS Vaihingen dataset over a urban area and obtain new state-of-the-art results. △ Less

Submitted 20 January, 2017; originally announced January 2017.

Comments: Joint Urban Remote Sensing Event (JURSE), Mar 2017, Dubai, United Arab Emirates. Joint Urban Remote Sensing Event 2017

arXiv:1609.06861 [pdf, other]

How Useful is Region-based Classification of Remote Sensing Images in a Deep Learning Framework?

Authors: Nicolas Audebert, Bertrand Le Saux, Sébastien Lefèvre

Abstract: In this paper, we investigate the impact of segmentation algorithms as a preprocessing step for classification of remote sensing images in a deep learning framework. Especially, we address the issue of segmenting the image into regions to be classified using pre-trained deep neural networks as feature extractors for an SVM-based classifier. An efficient segmentation as a preprocessing step… ▽ More In this paper, we investigate the impact of segmentation algorithms as a preprocessing step for classification of remote sensing images in a deep learning framework. Especially, we address the issue of segmenting the image into regions to be classified using pre-trained deep neural networks as feature extractors for an SVM-based classifier. An efficient segmentation as a preprocessing step helps learning by adding a spatially-coherent structure to the data. Therefore, we compare algorithms producing superpixels with more traditional remote sensing segmentation algorithms and measure the variation in terms of classification accuracy. We establish that superpixel algorithms allow for a better classification accuracy as a homogenous and compact segmentation favors better generalization of the training samples. △ Less

Submitted 22 September, 2016; originally announced September 2016.

Comments: IEEE International Geosciences and Remote Sensing Symposium (IGARSS), Jul 2016, Bei**g, China

arXiv:1609.06846 [pdf, other]

Semantic Segmentation of Earth Observation Data Using Multimodal and Multi-scale Deep Networks

Authors: Nicolas Audebert, Bertrand Le Saux, Sébastien Lefèvre

Abstract: This work investigates the use of deep fully convolutional neural networks (DFCNN) for pixel-wise scene labeling of Earth Observation images. Especially, we train a variant of the SegNet architecture on remote sensing data over an urban area and study different strategies for performing accurate semantic segmentation. Our contributions are the following: 1) we transfer efficiently a DFCNN from gen… ▽ More This work investigates the use of deep fully convolutional neural networks (DFCNN) for pixel-wise scene labeling of Earth Observation images. Especially, we train a variant of the SegNet architecture on remote sensing data over an urban area and study different strategies for performing accurate semantic segmentation. Our contributions are the following: 1) we transfer efficiently a DFCNN from generic everyday images to remote sensing images; 2) we introduce a multi-kernel convolutional layer for fast aggregation of predictions at multiple scales; 3) we perform data fusion from heterogeneous sensors (optical and laser) using residual correction. Our framework improves state-of-the-art accuracy on the ISPRS Vaihingen 2D Semantic Labeling dataset. △ Less

Submitted 22 September, 2016; originally announced September 2016.

Comments: Asian Conference on Computer Vision (ACCV16), Nov 2016, Taipei, Taiwan

arXiv:1609.06845 [pdf, other]

On the usability of deep networks for object-based image analysis

Authors: Nicolas Audebert, Bertrand Le Saux, Sébastien Lefèvre

Abstract: As computer vision before, remote sensing has been radically changed by the introduction of Convolution Neural Networks. Land cover use, object detection and scene understanding in aerial images rely more and more on deep learning to achieve new state-of-the-art results. Recent architectures such as Fully Convolutional Networks (Long et al., 2015) can even produce pixel level annotations for sema… ▽ More As computer vision before, remote sensing has been radically changed by the introduction of Convolution Neural Networks. Land cover use, object detection and scene understanding in aerial images rely more and more on deep learning to achieve new state-of-the-art results. Recent architectures such as Fully Convolutional Networks (Long et al., 2015) can even produce pixel level annotations for semantic map**. In this work, we show how to use such deep networks to detect, segment and classify different varieties of wheeled vehicles in aerial images from the ISPRS Potsdam dataset. This allows us to tackle object detection and classification on a complex dataset made up of visually similar classes, and to demonstrate the relevance of such a subclass modeling approach. Especially, we want to show that deep learning is also suitable for object-oriented analysis of Earth Observation data. First, we train a FCN variant on the ISPRS Potsdam dataset and show how the learnt semantic maps can be used to extract precise segmentation of vehicles, which allow us studying the repartition of vehicles in the city. Second, we train a CNN to perform vehicle classification on the VEDAI (Razakarivony and Jurie, 2016) dataset, and transfer its knowledge to classify candidate segmented vehicles on the Potsdam dataset. △ Less

Submitted 22 September, 2016; originally announced September 2016.

Comments: in International Conference on Geographic Object-Based Image Analysis (GEOBIA), Sep 2016, Enschede, Netherlands

Showing 1–25 of 25 results for author: Audebert, N