
UMBRAE: Unified Multimodal Decoding of Brain Signals

Authors: Weihao Xia, Raoul de Charette, Cengiz Öztireli, Jing-Hao Xue

Abstract: We address prevailing challenges of brain-powered research, departing from the observation that the literature hardly recovers accurate spatial information and requires subject-specific models. To address these challenges, we propose UMBRAE, a unified multimodal decoding of brain signals. First, to extract instance-level conceptual and spatial details from neural signals, we introduce an efficient universal brain encoder for multimodal-brain alignment and recover object descriptions at multiple levels of granularity from a subsequent multimodal large language model (MLLM). Second, we introduce a cross-subject training strategy mapping subject-specific features to a common feature space. This allows a model to be trained on multiple subjects without extra resources, even yielding superior results compared to subject-specific models. Further, we demonstrate this supports weakly-supervised adaptation to new subjects, with only a fraction of the total training data. Experiments demonstrate that UMBRAE not only achieves superior results in the newly introduced tasks but also outperforms methods in well-established tasks. To assess our method, we construct and share with the community a comprehensive brain understanding benchmark BrainHub. Our code and benchmark are available at https://weihaox.github.io/UMBRAE.

Submitted 10 April, 2024; originally announced April 2024.

Comments: Project Page: https://weihaox.github.io/UMBRAE

arXiv:2312.02158

PaSCo: Urban 3D Panoptic Scene Completion with Uncertainty Awareness

Authors: Anh-Quan Cao, Angela Dai, Raoul de Charette

Abstract: We propose the task of Panoptic Scene Completion (PSC) which extends the recently popular Semantic Scene Completion (SSC) task with instance-level information to produce a richer understanding of the 3D scene. Our PSC proposal utilizes a hybrid mask-based technique on the non-empty voxels from sparse multi-scale completions. Whereas the SSC literature overlooks uncertainty which is critical for robotics applications, we instead propose an efficient ensembling to estimate both voxel-wise and instance-wise uncertainties along PSC. This is achieved by building on a multi-input multi-output (MIMO) strategy, while improving performance and yielding better uncertainty for little additional compute. Additionally, we introduce a technique to aggregate permutation-invariant mask predictions. Our experiments demonstrate that our method surpasses all baselines in both Panoptic Scene Completion and uncertainty estimation on three large-scale autonomous driving datasets. Our code and data are available at https://astra-vision.github.io/PaSCo.

Submitted 25 May, 2024; v1 submitted 4 December, 2023; originally announced December 2023.

Comments: CVPR 2024 Oral - Best paper award candidate. Project page: https://astra-vision.github.io/PaSCo

arXiv:2311.17922

A Simple Recipe for Language-guided Domain Generalized Segmentation

Authors: Mohammad Fahes, Tuan-Hung Vu, Andrei Bursuc, Patrick Pérez, Raoul de Charette

Abstract: Generalization to new domains not seen during training is one of the long-standing challenges in deploying neural networks in real-world applications. Existing generalization techniques either necessitate external images for augmentation, and/or aim at learning invariant representations by imposing various alignment constraints. Large-scale pretraining has recently shown promising generalization capabilities, along with the potential of binding different modalities. For instance, the advent of vision-language models like CLIP has opened the doorway for vision models to exploit the textual modality. In this paper, we introduce a simple framework for generalizing semantic segmentation networks by employing language as the source of randomization. Our recipe comprises three key ingredients: (i) the preservation of the intrinsic CLIP robustness through minimal fine-tuning, (ii) language-driven local style augmentation, and (iii) randomization by locally mixing the source and augmented styles during training. Extensive experiments report state-of-the-art results on various generalization benchmarks. Code is accessible at https://github.com/astra-vision/FAMix.

Submitted 2 April, 2024; v1 submitted 29 November, 2023; originally announced November 2023.

Comments: CVPR 2024

arXiv:2311.17060

Material Palette: Extraction of Materials from a Single Image

Authors: Ivan Lopes, Fabio Pizzati, Raoul de Charette

Abstract: In this paper, we propose a method to extract physically-based rendering (PBR) materials from a single real-world image. We do so in two steps: first, we map regions of the image to material concepts using a diffusion model, which allows the sampling of texture images resembling each material in the scene. Second, we benefit from a separate network to decompose the generated textures into Spatially Varying BRDFs (SVBRDFs), providing us with materials ready to be used in rendering applications. Our approach builds on existing synthetic material libraries with SVBRDF ground truth, but also exploits a diffusion-generated RGB texture dataset to allow generalization to new samples using unsupervised domain adaptation (UDA). Our contributions are thoroughly evaluated on synthetic and real-world datasets. We further demonstrate the applicability of our method for editing 3D scenes with materials estimated from real photographs. The code and models will be made open-source. Project page: https://astra-vision.github.io/MaterialPalette/

Submitted 28 November, 2023; originally announced November 2023.

Comments: 8 pages, 11 figures, 2 tables. Webpage https://astra-vision.github.io/MaterialPalette/

arXiv:2310.02265

DREAM: Visual Decoding from Reversing Human Visual System

Authors: Weihao Xia, Raoul de Charette, Cengiz Öztireli, Jing-Hao Xue

Abstract: In this work we present DREAM, an fMRI-to-image method for reconstructing viewed images from brain activities, grounded on fundamental knowledge of the human visual system. We craft reverse pathways that emulate the hierarchical and parallel nature of how humans perceive the visual world. These tailored pathways are specialized to decipher semantics, color, and depth cues from fMRI data, mirroring the forward pathways from visual stimuli to fMRI recordings. To do so, two components mimic the inverse processes within the human visual system: the Reverse Visual Association Cortex (R-VAC) which reverses pathways of this brain region, extracting semantics from fMRI data; the Reverse Parallel PKM (R-PKM) component simultaneously predicting color and depth from fMRI signals. The experiments indicate that our method outperforms the current state-of-the-art models in terms of the consistency of appearance, structure, and semantics. Code will be made publicly available to facilitate further research in this field.

Submitted 10 April, 2024; v1 submitted 3 October, 2023; originally announced October 2023.

Comments: Project Page: https://weihaox.github.io/DREAM

arXiv:2212.03241

PØDA: Prompt-driven Zero-shot Domain Adaptation

Authors: Mohammad Fahes, Tuan-Hung Vu, Andrei Bursuc, Patrick Pérez, Raoul de Charette

Abstract: Domain adaptation has been vastly investigated in computer vision but still requires access to target images at train time, which might be intractable in some uncommon conditions. In this paper, we propose the task of 'Prompt-driven Zero-shot Domain Adaptation', where we adapt a model trained on a source domain using only a general description in natural language of the target domain, i.e., a prompt. First, we leverage a pretrained contrastive vision-language model (CLIP) to optimize affine transformations of source features, steering them towards the target text embedding while preserving their content and semantics. To achieve this, we propose Prompt-driven Instance Normalization (PIN). Second, we show that these prompt-driven augmentations can be used to perform zero-shot domain adaptation for semantic segmentation. Experiments demonstrate that our method significantly outperforms CLIP-based style transfer baselines on several datasets for the downstream task at hand, even surpassing one-shot unsupervised domain adaptation. A similar boost is observed on object detection and image classification. The code is available at https://github.com/astra-vision/PODA.

Submitted 19 August, 2023; v1 submitted 6 December, 2022; originally announced December 2022.

Comments: Accepted to ICCV 2023, Project Page: https://astra-vision.github.io/PODA/

arXiv:2212.02501

SceneRF: Self-Supervised Monocular 3D Scene Reconstruction with Radiance Fields

Authors: Anh-Quan Cao, Raoul de Charette

Abstract: 3D reconstruction from a single 2D image was extensively covered in the literature but relies on depth supervision at training time, which limits its applicability. To relax the dependence on depth we propose SceneRF, a self-supervised monocular scene reconstruction method using only posed image sequences for training. Fueled by the recent progress in neural radiance fields (NeRF) we optimize a radiance field though with explicit depth optimization and a novel probabilistic sampling strategy to efficiently handle large scenes. At inference, a single input image suffices to hallucinate novel depth views which are fused together to obtain 3D scene reconstruction. Thorough experiments demonstrate that we outperform all baselines for novel depth view synthesis and scene reconstruction, on indoor BundleFusion and outdoor SemanticKITTI. Code is available at https://astra-vision.github.io/SceneRF.

Submitted 24 August, 2023; v1 submitted 5 December, 2022; originally announced December 2022.

Comments: ICCV 2023. Project page: https://astra-vision.github.io/SceneRF

arXiv:2210.01784

COARSE3D: Class-Prototypes for Contrastive Learning in Weakly-Supervised 3D Point Cloud Segmentation

Authors: Rong Li, Anh-Quan Cao, Raoul de Charette

Abstract: Annotation of large-scale 3D data is notoriously cumbersome and costly. As an alternative, weakly-supervised learning alleviates such a need by reducing the annotation by several orders of magnitude. We propose COARSE3D, a novel architecture-agnostic contrastive learning strategy for 3D segmentation. Since contrastive learning requires rich and diverse examples as keys and anchors, we leverage a prototype memory bank capturing class-wise global dataset information efficiently into a small number of prototypes acting as keys. An entropy-driven sampling technique then allows us to select good pixels from predictions as anchors. Experiments on three projection-based backbones show we outperform baselines on three challenging real-world outdoor datasets, working with as low as 0.001% annotations.

Submitted 7 October, 2022; v1 submitted 4 October, 2022; originally announced October 2022.

arXiv:2206.08927

Cross-task Attention Mechanism for Dense Multi-task Learning

Authors: Ivan Lopes, Tuan-Hung Vu, Raoul de Charette

Abstract: Multi-task learning has recently become a promising solution for a comprehensive understanding of complex scenes. Not only being memory-efficient, multi-task models with an appropriate design can favor exchange of complementary signals across tasks. In this work, we jointly address 2D semantic segmentation, and two geometry-related tasks, namely dense depth, surface normal estimation as well as edge estimation showing their benefit on indoor and outdoor datasets. We propose a novel multi-task learning architecture that exploits pair-wise cross-task exchange through correlation-guided attention and self-attention to enhance the average representation learning for all tasks. We conduct extensive experiments considering three multi-task setups, showing the benefit of our proposal in comparison to competitive baselines in both synthetic and real benchmarks. We also extend our method to the novel multi-task unsupervised domain adaptation setting. Our code is available at https://github.com/cv-rits/DenseMTL.

Submitted 17 June, 2022; originally announced June 2022.

Comments: 10 figures, 6 tables, 23 pages

arXiv:2112.00726

MonoScene: Monocular 3D Semantic Scene Completion

Authors: Anh-Quan Cao, Raoul de Charette

Abstract: MonoScene proposes a 3D Semantic Scene Completion (SSC) framework, where the dense geometry and semantics of a scene are inferred from a single monocular RGB image. Different from the SSC literature, relying on 2.5 or 3D input, we solve the complex problem of 2D to 3D scene reconstruction while jointly inferring its semantics. Our framework relies on successive 2D and 3D UNets bridged by a novel 2D-3D features projection inspired by optics and introduces a 3D context relation prior to enforce spatio-semantic consistency. Along with architectural contributions, we introduce novel global scene and local frustums losses. Experiments show we outperform the literature on all metrics and datasets while hallucinating plausible scenery even beyond the camera field of view. Our code and trained models are available at https://github.com/cv-rits/MonoScene.

Submitted 29 March, 2022; v1 submitted 1 December, 2021; originally announced December 2021.

Comments: Accepted at CVPR 2022. Project page: https://cv-rits.github.io/MonoScene/

arXiv:2111.13681

ManiFest: Manifold Deformation for Few-shot Image Translation

Authors: Fabio Pizzati, Jean-François Lalonde, Raoul de Charette

Abstract: Most image-to-image translation methods require a large number of training images, which restricts their applicability. We instead propose ManiFest: a framework for few-shot image translation that learns a context-aware representation of a target domain from a few images only. To enforce feature consistency, our framework learns a style manifold between source and proxy anchor domains (assumed to be composed of large numbers of images). The learned manifold is interpolated and deformed towards the few-shot target domain via patch-based adversarial and feature statistics alignment losses. All of these components are trained simultaneously during a single end-to-end loop. In addition to the general few-shot translation task, our approach can alternatively be conditioned on a single exemplar image to reproduce its specific style. Extensive experiments demonstrate the efficacy of ManiFest on multiple tasks, outperforming the state-of-the-art on all metrics and in both the general- and exemplar-based scenarios. Our code is available at https://github.com/cv-rits/Manifest.

Submitted 20 July, 2022; v1 submitted 26 November, 2021; originally announced November 2021.

Comments: ECCV 2022

arXiv:2109.04468

Leveraging Local Domains for Image-to-Image Translation

Authors: Anthony Dell'Eva, Fabio Pizzati, Massimo Bertozzi, Raoul de Charette

Abstract: Image-to-image (i2i) networks struggle to capture local changes because they do not affect the global scene structure. For example, translating from highway scenes to offroad, i2i networks easily focus on global color features but ignore obvious traits for humans like the absence of lane markings. In this paper, we leverage human knowledge about spatial domain characteristics which we refer to as 'local domains' and demonstrate its benefit for image-to-image translation. Relying on a simple geometrical guidance, we train a patch-based GAN on few source data and hallucinate a new unseen domain which subsequently eases transfer learning to target. We experiment on three tasks ranging from unstructured environments to adverse weather. Our comprehensive evaluation setting shows we are able to generate realistic translations, with minimal priors, and training only on a few images. Furthermore, when trained on our translated images we show that all tested proxy tasks are significantly improved, without ever seeing target domain at training.

Submitted 14 February, 2022; v1 submitted 9 September, 2021; originally announced September 2021.

Comments: VISAPP 2022 Best Paper Award

arXiv:2107.14229

Physics-informed Guided Disentanglement in Generative Networks

Authors: Fabio Pizzati, Pietro Cerri, Raoul de Charette

Abstract: Image-to-image translation (i2i) networks suffer from entanglement effects in presence of physics-related phenomena in target domain (such as occlusions, fog, etc.), lowering altogether the translation quality, controllability and variability. In this paper, we propose a general framework to disentangle visual traits in target images. Primarily, we build upon a collection of simple physics models, guiding the disentanglement with a physical model that renders some of the target traits, and learning the remaining ones. Because physics allows explicit and interpretable outputs, our physical models (optimally regressed on target) allow generating unseen scenarios in a controllable manner. Secondarily, we show the versatility of our framework to neural-guided disentanglement where a generative network is used in place of a physical model in case the latter is not directly accessible. Altogether, we introduce three strategies of disentanglement being guided from either a fully differentiable physics model, a (partially) non-differentiable physics model, or a neural network. The results show our disentanglement strategies dramatically increase performances qualitatively and quantitatively in several challenging scenarios for image translation.

Submitted 27 April, 2023; v1 submitted 29 July, 2021; originally announced July 2021.

Comments: TPAMI 2023. Code: https://github.com/astra-vision/GuidedDisent

arXiv:2103.09189

Goal-constrained Sparse Reinforcement Learning for End-to-End Driving

Authors: Pranav Agarwal, Pierre de Beaucorps, Raoul de Charette

Abstract: Deep reinforcement learning for end-to-end driving is limited by the need of complex reward engineering. Sparse rewards can circumvent this challenge but suffer from long training time and lead to sub-optimal policy. In this work, we explore full-control driving with only goal-constrained sparse reward and propose a curriculum learning approach for end-to-end driving using only navigation view maps that benefit from small virtual-to-real domain gap. To address the complexity of multiple driving policies, we learn concurrent individual policies selected at inference by a navigation system. We demonstrate the ability of our proposal to generalize on unseen road layout, and to drive significantly longer than in the training.

Submitted 31 July, 2021; v1 submitted 16 March, 2021; originally announced March 2021.

Comments: Conference submission 6 pages, 8 figures

arXiv:2103.07466

3D Semantic Scene Completion: a Survey

Authors: Luis Roldao, Raoul de Charette, Anne Verroust-Blondet

Abstract: Semantic Scene Completion (SSC) aims to jointly estimate the complete geometry and semantics of a scene, assuming partial sparse input. In the last years following the multiplication of large-scale 3D datasets, SSC has gained significant momentum in the research community because it holds unresolved challenges. Specifically, SSC lies in the ambiguous completion of large unobserved areas and the weak supervision signal of the ground truth. This led to a substantially increasing number of papers on the matter. This survey aims to identify, compare and analyze the techniques providing a critical analysis of the SSC literature on both methods and datasets. Throughout the paper, we provide an in-depth analysis of the existing works covering all choices made by the authors while highlighting the remaining avenues of research. SSC performance of the SoA on the most popular datasets is also evaluated and analyzed.

Submitted 12 July, 2021; v1 submitted 12 March, 2021; originally announced March 2021.

Comments: Accepted in IJCV

arXiv:2103.06879

CoMoGAN: continuous model-guided image-to-image translation

Authors: Fabio Pizzati, Pietro Cerri, Raoul de Charette

Abstract: CoMoGAN is a continuous GAN relying on the unsupervised reorganization of the target data on a functional manifold. To that matter, we introduce a new Functional Instance Normalization layer and residual mechanism, which together disentangle image content from position on target manifold. We rely on naive physics-inspired models to guide the training while allowing private model/translations features. CoMoGAN can be used with any GAN backbone and allows new types of image translation, such as cyclic image translation like timelapse generation, or detached linear translation. On all datasets, it outperforms the literature. Our code is available at http://github.com/cv-rits/CoMoGAN.

Submitted 29 June, 2022; v1 submitted 11 March, 2021; originally announced March 2021.

Comments: CVPR 2021 oral

arXiv:2101.07253

Cross-modal Learning for Domain Adaptation in 3D Semantic Segmentation

Authors: Maximilian Jaritz, Tuan-Hung Vu, Raoul de Charette, Émilie Wirbel, Patrick Pérez

Abstract: Domain adaptation is an important task to enable learning when labels are scarce. While most works focus only on the image modality, there are many important multi-modal datasets. In order to leverage multi-modality for domain adaptation, we propose cross-modal learning, where we enforce consistency between the predictions of two modalities via mutual mimicking. We constrain our network to make correct predictions on labeled data and consistent predictions across modalities on unlabeled target-domain data. Experiments in unsupervised and semi-supervised domain adaptation settings prove the effectiveness of this novel domain adaptation strategy. Specifically, we evaluate on the task of 3D semantic segmentation from either the 2D image, the 3D point cloud or from both. We leverage recent driving datasets to produce a wide variety of domain adaptation scenarios including changes in scene layout, lighting, sensor setup and weather, as well as the synthetic-to-real setup. Our method significantly improves over previous uni-modal adaptation baselines on all adaptation scenarios. Our code is publicly available at https://github.com/valeoai/xmuda_journal

Submitted 22 June, 2022; v1 submitted 18 January, 2021; originally announced January 2021.

Comments: TPAMI 2022

arXiv:2009.03683

doi 10.1007/s11263-020-01366-3

Rain rendering for evaluating and improving robustness to bad weather

Authors: Maxime Tremblay, Shirsendu Sukanta Halder, Raoul de Charette, Jean-François Lalonde

Abstract: Rain fills the atmosphere with water particles, which breaks the common assumption that light travels unaltered from the scene to the camera. While it is well-known that rain affects computer vision algorithms, quantifying its impact is difficult. In this context, we present a rain rendering pipeline that enables the systematic evaluation of common computer vision algorithms to controlled amounts of rain. We present three different ways to add synthetic rain to existing image datasets: completely physics-based; completely data-driven; and a combination of both. The physics-based rain augmentation combines a physical particle simulator and accurate rain photometric modeling. We validate our rendering methods with a user study, demonstrating our rain is judged as much as 73% more realistic than the state-of-the-art. Using our generated rain-augmented KITTI, Cityscapes, and nuScenes datasets, we conduct a thorough evaluation of object detection, semantic segmentation, and depth estimation algorithms and show that their performance decreases in degraded weather, on the order of 15% for object detection, 60% for semantic segmentation, and 6-fold increase in depth estimation error. Finetuning on our augmented synthetic data results in improvements of 21% on object detection, 37% on semantic segmentation, and 8% on depth estimation.

Submitted 6 September, 2020; originally announced September 2020.

Comments: 19 pages, 19 figures, IJCV 2020 preprint. arXiv admin note: text overlap with arXiv:1908.10335

arXiv:2008.10559

LMSCNet: Lightweight Multiscale 3D Semantic Completion

Authors: Luis Roldão, Raoul de Charette, Anne Verroust-Blondet

Abstract: We introduce a new approach for multiscale 3D semantic scene completion from voxelized sparse 3D LiDAR scans. As opposed to the literature, we use a 2D UNet backbone with comprehensive multiscale skip connections to enhance feature flow, along with 3D segmentation heads. On the SemanticKITTI benchmark, our method performs on par on semantic completion and better on occupancy completion than all other published methods -- while being significantly lighter and faster. As such it provides a great performance/speed trade-off for mobile-robotics applications. The ablation studies demonstrate our method is robust to lower density inputs, and that it enables very high speed semantic completion at the coarsest level. Our code is available at https://github.com/cv-rits/LMSCNet.

Submitted 25 October, 2020; v1 submitted 24 August, 2020; originally announced August 2020.

Comments: Accepted at 3DV 2020 (Oral). For a demo video, see http://tiny.cc/lmscnet. Code is available at https://github.com/cv-rits/LMSCNet

arXiv:2006.05011 [pdf, other]

RGB-D-E: Event Camera Calibration for Fast 6-DOF Object Tracking

Authors: Etienne Dubeau, Mathieu Garon, Benoit Debaque, Raoul de Charette, Jean-François Lalonde

Abstract: Augmented reality devices require multiple sensors to perform various tasks such as localization and tracking. Currently, popular cameras are mostly frame-based (e.g. RGB and Depth), which imposes a high data bandwidth and power usage. With the necessity for low power and more responsive augmented reality systems, using solely frame-based sensors imposes limits on the various algorithms that need high frequency data from the environment. As such, event-based sensors have become increasingly popular due to their low power, bandwidth and latency, as well as their very high frequency data acquisition capabilities. In this paper, we propose, for the first time, to use an event-based camera to increase the speed of 3D object tracking in 6 degrees of freedom. This application requires handling very high object speed to convey compelling AR experiences. To this end, we propose a new system which combines a recent RGB-D sensor (Kinect Azure) with an event camera (DAVIS346). We develop a deep learning approach, which combines an existing RGB-D network along with a novel event-based network in a cascade fashion, and demonstrate that our approach significantly improves the robustness of a state-of-the-art frame-based 6-DOF object tracker using our RGB-D-E pipeline.

Submitted 5 August, 2020; v1 submitted 8 June, 2020; originally announced June 2020.

Comments: 9 pages, 9 figures

arXiv:2004.01071 [pdf, other]

Model-based occlusion disentanglement for image-to-image translation

Authors: Fabio Pizzati, Pietro Cerri, Raoul de Charette

Abstract: Image-to-image translation is affected by entanglement phenomena, which may occur in case of target data encompassing occlusions such as raindrops, dirt, etc. Our unsupervised model-based learning disentangles scene and occlusions, while benefiting from an adversarial pipeline to regress physical parameters of the occlusion model. The experiments demonstrate our method is able to handle varying types of occlusions and generate highly realistic translations, qualitatively and quantitatively outperforming the state-of-the-art on multiple datasets.

Submitted 20 July, 2020; v1 submitted 2 April, 2020; originally announced April 2020.

Comments: ECCV 2020

arXiv:1911.12676 [pdf, other]

xMUDA: Cross-Modal Unsupervised Domain Adaptation for 3D Semantic Segmentation

Authors: Maximilian Jaritz, Tuan-Hung Vu, Raoul de Charette, Émilie Wirbel, Patrick Pérez

Abstract: Unsupervised Domain Adaptation (UDA) is crucial to tackle the lack of annotations in a new domain. There are many multi-modal datasets, but most UDA approaches are uni-modal. In this work, we explore how to learn from multi-modality and propose cross-modal UDA (xMUDA) where we assume the presence of 2D images and 3D point clouds for 3D semantic segmentation. This is challenging as the two input spaces are heterogeneous and can be impacted differently by domain shift. In xMUDA, modalities learn from each other through mutual mimicking, disentangled from the segmentation objective, to prevent the stronger modality from adopting false predictions from the weaker one. We evaluate on new UDA scenarios including day-to-night, country-to-country and dataset-to-dataset, leveraging recent autonomous driving datasets. xMUDA brings large improvements over uni-modal UDA on all tested scenarios, and is complementary to state-of-the-art UDA techniques. Code is available at https://github.com/valeoai/xmuda.

Submitted 30 March, 2020; v1 submitted 28 November, 2019; originally announced November 2019.

Comments: Accepted at CVPR 2020. For a demo video, see http://tiny.cc/xmuda

arXiv:1910.10563 [pdf, other]

Domain Bridge for Unpaired Image-to-Image Translation and Unsupervised Domain Adaptation

Authors: Fabio Pizzati, Raoul de Charette, Michela Zaccaria, Pietro Cerri

Abstract: Image-to-image translation architectures may have limited effectiveness in some circumstances. For example, while generating rainy scenarios, they may fail to model typical traits of rain such as water drops, and this ultimately impacts the realism of the synthetic images. With our method, called domain bridge, web-crawled data are exploited to reduce the domain gap, leading to the inclusion of previously ignored elements in the generated images. We make use of a network for clear to rain translation trained with the domain bridge to extend our work to Unsupervised Domain Adaptation (UDA). In that context, we introduce an online multimodal style-sampling strategy, where image translation multimodality is exploited at training time to improve performance. Finally, a novel approach for self-supervised learning is presented, and used to further align the domains. With our contributions, we simultaneously increase the realism of the generated images, while reaching performance on par with the UDA state-of-the-art, with a simpler approach.

Submitted 14 March, 2020; v1 submitted 23 October, 2019; originally announced October 2019.

Comments: WACV 20 camera ready

arXiv:1908.10335 [pdf, other]

Physics-Based Rendering for Improving Robustness to Rain

Authors: Shirsendu Sukanta Halder, Jean-François Lalonde, Raoul de Charette

Abstract: To improve the robustness to rain, we present a physically-based rain rendering pipeline for realistically inserting rain into clear weather images. Our rendering relies on a physical particle simulator, an estimation of the scene lighting and an accurate rain photometric modeling to augment images with arbitrary amounts of realistic rain or fog. We validate our rendering with a user study, proving our rain is judged 40% more realistic than the state-of-the-art. Using our generated weather-augmented Kitti and Cityscapes datasets, we conduct a thorough evaluation of deep object detection and semantic segmentation algorithms and show that their performance decreases in degraded weather, on the order of 15% for object detection and 60% for semantic segmentation. Furthermore, we show that refining existing networks with our augmented images improves the robustness of both object detection and semantic segmentation algorithms. We experiment on nuScenes and measure an improvement of 15% for object detection and 35% for semantic segmentation compared to original rainy performance. Augmented databases and code are available on the project page.

Submitted 27 August, 2019; originally announced August 2019.

Comments: ICCV 2019. Supplementary pdf / videos available on project page

arXiv:1908.01523 [pdf, other]

3D Reconstruction of Deformable Revolving Object under Heavy Hand Interaction

Authors: Raoul de Charette, Sotiris Manitsaris

Abstract: We reconstruct a 3D deformable object through time, in the context of a live pottery making process where the crafter molds the object. Because the object suffers from heavy hand interaction, and is being deformed, classical techniques cannot be applied. We use particle energy optimization to estimate the object profile and benefit from the object's radial symmetry to increase the robustness of the reconstruction to both occlusion and noise. Our method works with an unconstrained scalable setup with one or more depth sensors. We evaluate on our database (released upon publication) on a per-frame and temporal basis and show it significantly outperforms the state-of-the-art, achieving 7.60mm average object reconstruction error. Further ablation studies demonstrate the effectiveness of our method.

Submitted 5 August, 2019; originally announced August 2019.

Comments: 7 pages, 10 figures. Submitted to journal

arXiv:1906.10515 [pdf, other]

3D Surface Reconstruction from Voxel-based Lidar Data

Authors: Luis Roldão, Raoul de Charette, Anne Verroust-Blondet

Abstract: To achieve fully autonomous navigation, vehicles need to compute an accurate model of their direct surroundings. In this paper, a 3D surface reconstruction algorithm from heterogeneous density 3D data is presented. The proposed method is based on a TSDF voxel-based representation, where an adaptive neighborhood kernel sourced on a Gaussian confidence evaluation is introduced. This makes it possible to keep a good trade-off between the density of the reconstructed mesh and its accuracy. Experimental evaluations carried out on both synthetic (CARLA) and real (KITTI) 3D data show good performance compared to a state-of-the-art method used for surface reconstruction.

Submitted 25 June, 2019; originally announced June 2019.

Comments: IEEE Intelligent Transportation Systems Conference (ITSC) 2019

arXiv:1808.00769 [pdf, other]

Sparse and Dense Data with CNNs: Depth Completion and Semantic Segmentation

Authors: Maximilian Jaritz, Raoul de Charette, Emilie Wirbel, Xavier Perrotton, Fawzi Nashashibi

Abstract: Convolutional neural networks are designed for dense data, but vision data is often sparse (stereo depth, point clouds, pen stroke, etc.). We present a method to handle sparse depth data with optional dense RGB, and accomplish depth completion and semantic segmentation changing only the last layer. Our proposal efficiently learns sparse features without the need of an additional validity mask. We show how to ensure network robustness to varying input sparsities. Our method even works with densities as low as 0.8% (8 layer lidar), and outperforms all published state-of-the-art on the Kitti depth completion benchmark.

Submitted 31 August, 2018; v1 submitted 2 August, 2018; originally announced August 2018.

Comments: 3DV 2018

arXiv:1807.08483 [pdf, other]

A Statistical Update of Grid Representations from Range Sensors

Authors: Luis Roldão, Raoul De Charette, Anne Verroust-Blondet

Abstract: In a wide range of robotic applications, being able to create a 3D model of the surrounding environment is a key feature for autonomous tasks. In this research report, we present a statistical model to perform 3D reconstructions of the environment from range sensors using an occupancy grid. To do so, we take into account all the available information obtained from the sensor, considering the distances traversed by the rays in each cell and seeking to reduce reconstruction errors caused by discretization. The approach has been validated qualitatively using the KITTI dataset.

Submitted 4 July, 2019; v1 submitted 23 July, 2018; originally announced July 2018.

Comments: Metadata change. Typo in author's name

arXiv:1807.02371 [pdf, other]

End-to-End Race Driving with Deep Reinforcement Learning

Authors: Maximilian Jaritz, Raoul de Charette, Marin Toromanoff, Etienne Perot, Fawzi Nashashibi

Abstract: We present research using the latest reinforcement learning algorithm for end-to-end driving without any mediated perception (object recognition, scene understanding). The newly proposed reward and learning strategies together lead to faster convergence and more robust driving using only RGB images from a forward facing camera. An Asynchronous Actor Critic (A3C) framework is used to learn the car control in a physically and graphically realistic rally game, with the agents evolving simultaneously on tracks with a variety of road structures (turns, hills), graphics (seasons, location) and physics (road adherence). A thorough evaluation is conducted and generalization is proven on unseen tracks and using legal speed limits. Open loop tests on real sequences of images show some domain adaptation capability of our method.

Submitted 31 August, 2018; v1 submitted 6 July, 2018; originally announced July 2018.

Comments: ICRA 2018

Showing 1–29 of 29 results for author: De Charette, R