Search | arXiv e-print repository

Out of the Room: Generalizing Event-Based Dynamic Motion Segmentation for Complex Scenes

Authors: Stamatios Georgoulis, Weining Ren, Alfredo Bochicchio, Daniel Eckert, Yuanyou Li, Abel Gawel

Abstract: Rapid and reliable identification of dynamic scene parts, also known as motion segmentation, is a key challenge for mobile sensors. Contemporary RGB camera-based methods rely on modeling camera and scene properties however, are often under-constrained and fall short in unknown categories. Event cameras have the potential to overcome these limitations, but corresponding methods have only been demon… ▽ More Rapid and reliable identification of dynamic scene parts, also known as motion segmentation, is a key challenge for mobile sensors. Contemporary RGB camera-based methods rely on modeling camera and scene properties however, are often under-constrained and fall short in unknown categories. Event cameras have the potential to overcome these limitations, but corresponding methods have only been demonstrated in smaller-scale indoor environments with simplified dynamic objects. This work presents an event-based method for class-agnostic motion segmentation that can successfully be deployed across complex large-scale outdoor environments too. To this end, we introduce a novel divide-and-conquer pipeline that combines: (a) ego-motion compensated events, computed via a scene understanding module that predicts monocular depth and camera pose as auxiliary tasks, and (b) optical flow from a dedicated optical flow module. These intermediate representations are then fed into a segmentation module that predicts motion segmentation masks. A novel transformer-based temporal attention module in the segmentation module builds correlations across adjacent 'frames' to get temporally consistent segmentation masks. Our method sets the new state-of-the-art on the classic EV-IMO benchmark (indoors), where we achieve improvements of 2.19 moving object IoU (2.22 mIoU) and 4.52 point IoU respectively, as well as on a newly-generated motion segmentation and tracking benchmark (outdoors) based on the DSEC event dataset, termed DSEC-MOTS, where we show improvement of 12.91 moving object IoU. △ Less

Submitted 7 March, 2024; originally announced March 2024.

Comments: 3DV 2024, the first two authors contributed equally

arXiv:2208.11398 [pdf, other]

Event-based Image Deblurring with Dynamic Motion Awareness

Authors: Patricia Vitoria, Stamatios Georgoulis, Stepan Tulyakov, Alfredo Bochicchio, Julius Erbach, Yuanyou Li

Abstract: Non-uniform image deblurring is a challenging task due to the lack of temporal and textural information in the blurry image itself. Complementary information from auxiliary sensors such event sensors are being explored to address these limitations. The latter can record changes in a logarithmic intensity asynchronously, called events, with high temporal resolution and high dynamic range. Current e… ▽ More Non-uniform image deblurring is a challenging task due to the lack of temporal and textural information in the blurry image itself. Complementary information from auxiliary sensors such event sensors are being explored to address these limitations. The latter can record changes in a logarithmic intensity asynchronously, called events, with high temporal resolution and high dynamic range. Current event-based deblurring methods combine the blurry image with events to jointly estimate per-pixel motion and the deblur operator. In this paper, we argue that a divide-and-conquer approach is more suitable for this task. To this end, we propose to use modulated deformable convolutions, whose kernel offsets and modulation masks are dynamically estimated from events to encode the motion in the scene, while the deblur operator is learned from the combination of blurry image and corresponding events. Furthermore, we employ a coarse-to-fine multi-scale reconstruction approach to cope with the inherent sparsity of events in low contrast regions. Importantly, we introduce the first dataset containing pairs of real RGB blur images and related events during the exposure time. Our results show better overall robustness when using events, with improvements in PSNR by up to 1.57dB on synthetic data and 1.08 dB on real event data. △ Less

Submitted 24 August, 2022; originally announced August 2022.

Journal ref: ECCVW 2022

arXiv:2204.01267 [pdf, other]

FoV-Net: Field-of-View Extrapolation Using Self-Attention and Uncertainty

Authors: Liqian Ma, Stamatios Georgoulis, Xu Jia, Luc Van Gool

Abstract: The ability to make educated predictions about their surroundings, and associate them with certain confidence, is important for intelligent systems, like autonomous vehicles and robots. It allows them to plan early and decide accordingly. Motivated by this observation, in this paper we utilize information from a video sequence with a narrow field-of-view to infer the scene at a wider field-of-view… ▽ More The ability to make educated predictions about their surroundings, and associate them with certain confidence, is important for intelligent systems, like autonomous vehicles and robots. It allows them to plan early and decide accordingly. Motivated by this observation, in this paper we utilize information from a video sequence with a narrow field-of-view to infer the scene at a wider field-of-view. To this end, we propose a temporally consistent field-of-view extrapolation framework, namely FoV-Net, that: (1) leverages 3D information to propagate the observed scene parts from past frames; (2) aggregates the propagated multi-frame information using an attention-based feature aggregation module and a gated self-attention module, simultaneously hallucinating any unobserved scene parts; and (3) assigns an interpretable uncertainty value at each pixel. Extensive experiments show that FoV-Net does not only extrapolate the temporally consistent wide field-of-view scene better than existing alternatives, but also provides the associated uncertainty which may benefit critical decision-making downstream applications. Project page is at http://charliememory.github.io/RAL21_FoV. △ Less

Submitted 4 April, 2022; originally announced April 2022.

Comments: Accepted to IEEE Robotics and Automation Letters and ICRA2021. Project page http://charliememory.github.io/RAL21_FoV

arXiv:2203.17191 [pdf, other]

Time Lens++: Event-based Frame Interpolation with Parametric Non-linear Flow and Multi-scale Fusion

Authors: Stepan Tulyakov, Alfredo Bochicchio, Daniel Gehrig, Stamatios Georgoulis, Yuanyou Li, Davide Scaramuzza

Abstract: Recently, video frame interpolation using a combination of frame- and event-based cameras has surpassed traditional image-based methods both in terms of performance and memory efficiency. However, current methods still suffer from (i) brittle image-level fusion of complementary interpolation results, that fails in the presence of artifacts in the fused image, (ii) potentially temporally inconsiste… ▽ More Recently, video frame interpolation using a combination of frame- and event-based cameras has surpassed traditional image-based methods both in terms of performance and memory efficiency. However, current methods still suffer from (i) brittle image-level fusion of complementary interpolation results, that fails in the presence of artifacts in the fused image, (ii) potentially temporally inconsistent and inefficient motion estimation procedures, that run for every inserted frame and (iii) low contrast regions that do not trigger events, and thus cause events-only motion estimation to generate artifacts. Moreover, previous methods were only tested on datasets consisting of planar and faraway scenes, which do not capture the full complexity of the real world. In this work, we address the above problems by introducing multi-scale feature-level fusion and computing one-shot non-linear inter-frame motion from events and images, which can be efficiently sampled for image war**. We also collect the first large-scale events and frames dataset consisting of more than 100 challenging scenes with depth variations, captured with a new experimental setup based on a beamsplitter. We show that our method improves the reconstruction quality by up to 0.2 dB in terms of PSNR and up to 15% in LPIPS score. △ Less

Submitted 25 April, 2022; v1 submitted 31 March, 2022; originally announced March 2022.

Comments: IEEE Conference on Computer Vision and Pattern Recognition (CVPR), New Orleans, 2022

arXiv:2203.06622 [pdf, other]

Multi-Bracket High Dynamic Range Imaging with Event Cameras

Authors: Nico Messikommer, Stamatios Georgoulis, Daniel Gehrig, Stepan Tulyakov, Julius Erbach, Alfredo Bochicchio, Yuanyou Li, Davide Scaramuzza

Abstract: Modern high dynamic range (HDR) imaging pipelines align and fuse multiple low dynamic range (LDR) images captured at different exposure times. While these methods work well in static scenes, dynamic scenes remain a challenge since the LDR images still suffer from saturation and noise. In such scenarios, event cameras would be a valid complement, thanks to their higher temporal resolution and dynam… ▽ More Modern high dynamic range (HDR) imaging pipelines align and fuse multiple low dynamic range (LDR) images captured at different exposure times. While these methods work well in static scenes, dynamic scenes remain a challenge since the LDR images still suffer from saturation and noise. In such scenarios, event cameras would be a valid complement, thanks to their higher temporal resolution and dynamic range. In this paper, we propose the first multi-bracket HDR pipeline combining a standard camera with an event camera. Our results show better overall robustness when using events, with improvements in PSNR by up to 5dB on synthetic data and up to 0.7dB on real-world data. We also introduce a new dataset containing bracketed LDR images with aligned events and HDR ground truth. △ Less

Submitted 28 April, 2022; v1 submitted 13 March, 2022; originally announced March 2022.

Journal ref: IEEE Conference on Computer Vision and Pattern Recognition Workshop (CVPRW), New Orleans, 2022

arXiv:2106.07286 [pdf, other]

TimeLens: Event-based Video Frame Interpolation

Authors: Stepan Tulyakov, Daniel Gehrig, Stamatios Georgoulis, Julius Erbach, Mathias Gehrig, Yuanyou Li, Davide Scaramuzza

Abstract: State-of-the-art frame interpolation methods generate intermediate frames by inferring object motions in the image from consecutive key-frames. In the absence of additional information, first-order approximations, i.e. optical flow, must be used, but this choice restricts the types of motions that can be modeled, leading to errors in highly dynamic scenarios. Event cameras are novel sensors that a… ▽ More State-of-the-art frame interpolation methods generate intermediate frames by inferring object motions in the image from consecutive key-frames. In the absence of additional information, first-order approximations, i.e. optical flow, must be used, but this choice restricts the types of motions that can be modeled, leading to errors in highly dynamic scenarios. Event cameras are novel sensors that address this limitation by providing auxiliary visual information in the blind-time between frames. They asynchronously measure per-pixel brightness changes and do this with high temporal resolution and low latency. Event-based frame interpolation methods typically adopt a synthesis-based approach, where predicted frame residuals are directly applied to the key-frames. However, while these approaches can capture non-linear motions they suffer from ghosting and perform poorly in low-texture regions with few events. Thus, synthesis-based and flow-based approaches are complementary. In this work, we introduce Time Lens, a novel indicates equal contribution method that leverages the advantages of both. We extensively evaluate our method on three synthetic and two real benchmarks where we show an up to 5.21 dB improvement in terms of PSNR over state-of-the-art frame-based and event-based methods. Finally, we release a new large-scale dataset in highly dynamic scenarios, aimed at pushing the limits of existing methods. △ Less

Submitted 14 June, 2021; originally announced June 2021.

Journal ref: IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2021

arXiv:2106.05967 [pdf, other]

Revisiting Contrastive Methods for Unsupervised Learning of Visual Representations

Authors: Wouter Van Gansbeke, Simon Vandenhende, Stamatios Georgoulis, Luc Van Gool

Abstract: Contrastive self-supervised learning has outperformed supervised pretraining on many downstream tasks like segmentation and object detection. However, current methods are still primarily applied to curated datasets like ImageNet. In this paper, we first study how biases in the dataset affect existing methods. Our results show that current contrastive approaches work surprisingly well across: (i) o… ▽ More Contrastive self-supervised learning has outperformed supervised pretraining on many downstream tasks like segmentation and object detection. However, current methods are still primarily applied to curated datasets like ImageNet. In this paper, we first study how biases in the dataset affect existing methods. Our results show that current contrastive approaches work surprisingly well across: (i) object- versus scene-centric, (ii) uniform versus long-tailed and (iii) general versus domain-specific datasets. Second, given the generality of the approach, we try to realize further gains with minor modifications. We show that learning additional invariances -- through the use of multi-scale crop**, stronger augmentations and nearest neighbors -- improves the representations. Finally, we observe that MoCo learns spatially structured representations when trained with a multi-crop strategy. The representations can be used for semantic segment retrieval and video instance segmentation without finetuning. Moreover, the results are on par with specialized models. We hope this work will serve as a useful study for other researchers. The code and models are available at https://github.com/wvangansbeke/Revisiting-Contrastive-SSL. △ Less

Submitted 14 December, 2021; v1 submitted 10 June, 2021; originally announced June 2021.

Comments: NeurIPS 2021. Code: https://github.com/wvangansbeke/Revisiting-Contrastive-SSL

arXiv:2105.07830 [pdf, other]

Learning to Relate Depth and Semantics for Unsupervised Domain Adaptation

Authors: Suman Saha, Anton Obukhov, Danda Pani Paudel, Menelaos Kanakis, Yuhua Chen, Stamatios Georgoulis, Luc Van Gool

Abstract: We present an approach for encoding visual task relationships to improve model performance in an Unsupervised Domain Adaptation (UDA) setting. Semantic segmentation and monocular depth estimation are shown to be complementary tasks; in a multi-task learning setting, a proper encoding of their relationships can further improve performance on both tasks. Motivated by this observation, we propose a n… ▽ More We present an approach for encoding visual task relationships to improve model performance in an Unsupervised Domain Adaptation (UDA) setting. Semantic segmentation and monocular depth estimation are shown to be complementary tasks; in a multi-task learning setting, a proper encoding of their relationships can further improve performance on both tasks. Motivated by this observation, we propose a novel Cross-Task Relation Layer (CTRL), which encodes task dependencies between the semantic and depth predictions. To capture the cross-task relationships, we propose a neural network architecture that contains task-specific and cross-task refinement heads. Furthermore, we propose an Iterative Self-Learning (ISL) training scheme, which exploits semantic pseudo-labels to provide extra supervision on the target domain. We experimentally observe improvements in both tasks' performance because the complementary information present in these tasks is better captured. Specifically, we show that: (1) our approach improves performance on all tasks when they are complementary and mutually dependent; (2) the CTRL helps to improve both semantic segmentation and depth estimation tasks performance in the challenging UDA setting; (3) the proposed ISL training scheme further improves the semantic segmentation performance. The implementation is available at https://github.com/susaha/ctrl-uda. △ Less

Submitted 3 July, 2021; v1 submitted 17 May, 2021; originally announced May 2021.

Comments: Accepted at CVPR 2021; updated results according to the released source code

arXiv:2104.13874 [pdf, other]

Exploring Relational Context for Multi-Task Dense Prediction

Authors: David Bruggemann, Menelaos Kanakis, Anton Obukhov, Stamatios Georgoulis, Luc Van Gool

Abstract: The timeline of computer vision research is marked with advances in learning and utilizing efficient contextual representations. Most of them, however, are targeted at improving model performance on a single downstream task. We consider a multi-task environment for dense prediction tasks, represented by a common backbone and independent task-specific heads. Our goal is to find the most efficient w… ▽ More The timeline of computer vision research is marked with advances in learning and utilizing efficient contextual representations. Most of them, however, are targeted at improving model performance on a single downstream task. We consider a multi-task environment for dense prediction tasks, represented by a common backbone and independent task-specific heads. Our goal is to find the most efficient way to refine each task prediction by capturing cross-task contexts dependent on tasks' relations. We explore various attention-based contexts, such as global and local, in the multi-task setting and analyze their behavior when applied to refine each task independently. Empirical findings confirm that different source-target task pairs benefit from different context types. To automate the selection process, we propose an Adaptive Task-Relational Context (ATRC) module, which samples the pool of all available contexts for each task pair using neural architecture search and outputs the optimal configuration for deployment. Our method achieves state-of-the-art performance on two important multi-task benchmarks, namely NYUD-v2 and PASCAL-Context. The proposed ATRC has a low computational toll and can be used as a drop-in refinement module for any supervised multi-task architecture. △ Less

Submitted 23 August, 2021; v1 submitted 28 April, 2021; originally announced April 2021.

Comments: International Conference on Computer Vision (ICCV) 2021

arXiv:2103.04217 [pdf, other]

Spectral Tensor Train Parameterization of Deep Learning Layers

Authors: Anton Obukhov, Maxim Rakhuba, Alexander Liniger, Zhiwu Huang, Stamatios Georgoulis, Dengxin Dai, Luc Van Gool

Abstract: We study low-rank parameterizations of weight matrices with embedded spectral properties in the Deep Learning context. The low-rank property leads to parameter efficiency and permits taking computational shortcuts when computing map**s. Spectral properties are often subject to constraints in optimization problems, leading to better models and stability of optimization. We start by looking at the… ▽ More We study low-rank parameterizations of weight matrices with embedded spectral properties in the Deep Learning context. The low-rank property leads to parameter efficiency and permits taking computational shortcuts when computing map**s. Spectral properties are often subject to constraints in optimization problems, leading to better models and stability of optimization. We start by looking at the compact SVD parameterization of weight matrices and identifying redundancy sources in the parameterization. We further apply the Tensor Train (TT) decomposition to the compact SVD components, and propose a non-redundant differentiable parameterization of fixed TT-rank tensor manifolds, termed the Spectral Tensor Train Parameterization (STTP). We demonstrate the effects of neural network compression in the image classification setting and both compression and improved training stability in the generative adversarial training setting. △ Less

Submitted 13 July, 2021; v1 submitted 6 March, 2021; originally announced March 2021.

Comments: Accepted at AISTATS 2021

arXiv:2102.06191 [pdf, other]

Unsupervised Semantic Segmentation by Contrasting Object Mask Proposals

Authors: Wouter Van Gansbeke, Simon Vandenhende, Stamatios Georgoulis, Luc Van Gool

Abstract: Being able to learn dense semantic representations of images without supervision is an important problem in computer vision. However, despite its significance, this problem remains rather unexplored, with a few exceptions that considered unsupervised semantic segmentation on small-scale datasets with a narrow visual domain. In this paper, we make a first attempt to tackle the problem on datasets t… ▽ More Being able to learn dense semantic representations of images without supervision is an important problem in computer vision. However, despite its significance, this problem remains rather unexplored, with a few exceptions that considered unsupervised semantic segmentation on small-scale datasets with a narrow visual domain. In this paper, we make a first attempt to tackle the problem on datasets that have been traditionally utilized for the supervised case. To achieve this, we introduce a two-step framework that adopts a predetermined mid-level prior in a contrastive optimization objective to learn pixel embeddings. This marks a large deviation from existing works that relied on proxy tasks or end-to-end clustering. Additionally, we argue about the importance of having a prior that contains information about objects, or their parts, and discuss several possibilities to obtain such a prior in an unsupervised manner. Experimental evaluation shows that our method comes with key advantages over existing works. First, the learned pixel embeddings can be directly clustered in semantic groups using K-Means on PASCAL. Under the fully unsupervised setting, there is no precedent in solving the semantic segmentation task on such a challenging benchmark. Second, our representations can improve over strong baselines when transferred to new datasets, e.g. COCO and DAVIS. The code is available. △ Less

Submitted 3 August, 2021; v1 submitted 11 February, 2021; originally announced February 2021.

Comments: ICCV 2021 - Code: https://github.com/wvangansbeke/Unsupervised-Semantic-Segmentation

arXiv:2008.10292 [pdf, other]

Automated Search for Resource-Efficient Branched Multi-Task Networks

Authors: David Bruggemann, Menelaos Kanakis, Stamatios Georgoulis, Luc Van Gool

Abstract: The multi-modal nature of many vision problems calls for neural network architectures that can perform multiple tasks concurrently. Typically, such architectures have been handcrafted in the literature. However, given the size and complexity of the problem, this manual architecture exploration likely exceeds human design abilities. In this paper, we propose a principled approach, rooted in differe… ▽ More The multi-modal nature of many vision problems calls for neural network architectures that can perform multiple tasks concurrently. Typically, such architectures have been handcrafted in the literature. However, given the size and complexity of the problem, this manual architecture exploration likely exceeds human design abilities. In this paper, we propose a principled approach, rooted in differentiable neural architecture search, to automatically define branching (tree-like) structures in the encoding stage of a multi-task neural network. To allow flexibility within resource-constrained environments, we introduce a proxyless, resource-aware loss that dynamically controls the model size. Evaluations across a variety of dense prediction tasks show that our approach consistently finds high-performing branching structures within limited resource budgets. △ Less

Submitted 11 May, 2021; v1 submitted 24 August, 2020; originally announced August 2020.

Comments: British Machine Vision Conference (BMVC) 2020

arXiv:2007.12540 [pdf, other]

Reparameterizing Convolutions for Incremental Multi-Task Learning without Task Interference

Authors: Menelaos Kanakis, David Bruggemann, Suman Saha, Stamatios Georgoulis, Anton Obukhov, Luc Van Gool

Abstract: Multi-task networks are commonly utilized to alleviate the need for a large number of highly specialized single-task networks. However, two common challenges in develo** multi-task models are often overlooked in literature. First, enabling the model to be inherently incremental, continuously incorporating information from new tasks without forgetting the previously learned ones (incremental lear… ▽ More Multi-task networks are commonly utilized to alleviate the need for a large number of highly specialized single-task networks. However, two common challenges in develo** multi-task models are often overlooked in literature. First, enabling the model to be inherently incremental, continuously incorporating information from new tasks without forgetting the previously learned ones (incremental learning). Second, eliminating adverse interactions amongst tasks, which has been shown to significantly degrade the single-task performance in a multi-task setup (task interference). In this paper, we show that both can be achieved simply by reparameterizing the convolutions of standard neural network architectures into a non-trainable shared part (filter bank) and task-specific parts (modulators), where each modulator has a fraction of the filter bank parameters. Thus, our reparameterization enables the model to learn new tasks without adversely affecting the performance of existing ones. The results of our ablation study attest the efficacy of the proposed reparameterization. Moreover, our method achieves state-of-the-art on two challenging multi-task learning benchmarks, PASCAL-Context and NYUD, and also demonstrates superior incremental learning capability as compared to its close competitors. △ Less

Submitted 24 July, 2020; originally announced July 2020.

Comments: European Conference on Computer Vision (ECCV), 2020

arXiv:2007.06631 [pdf, other]

T-Basis: a Compact Representation for Neural Networks

Authors: Anton Obukhov, Maxim Rakhuba, Stamatios Georgoulis, Menelaos Kanakis, Dengxin Dai, Luc Van Gool

Abstract: We introduce T-Basis, a novel concept for a compact representation of a set of tensors, each of an arbitrary shape, which is often seen in Neural Networks. Each of the tensors in the set is modeled using Tensor Rings, though the concept applies to other Tensor Networks. Owing its name to the T-shape of nodes in diagram notation of Tensor Rings, T-Basis is simply a list of equally shaped three-dime… ▽ More We introduce T-Basis, a novel concept for a compact representation of a set of tensors, each of an arbitrary shape, which is often seen in Neural Networks. Each of the tensors in the set is modeled using Tensor Rings, though the concept applies to other Tensor Networks. Owing its name to the T-shape of nodes in diagram notation of Tensor Rings, T-Basis is simply a list of equally shaped three-dimensional tensors, used to represent Tensor Ring nodes. Such representation allows us to parameterize the tensor set with a small number of parameters (coefficients of the T-Basis tensors), scaling logarithmically with each tensor's size in the set and linearly with the dimensionality of T-Basis. We evaluate the proposed approach on the task of neural network compression and demonstrate that it reaches high compression rates at acceptable performance drops. Finally, we analyze memory and operation requirements of the compressed networks and conclude that T-Basis networks are equally well suited for training and inference in resource-constrained environments and usage on the edge devices. △ Less

Submitted 13 July, 2021; v1 submitted 13 July, 2020; originally announced July 2020.

Comments: Accepted at ICML 2020

arXiv:2005.12320 [pdf, other]

SCAN: Learning to Classify Images without Labels

Authors: Wouter Van Gansbeke, Simon Vandenhende, Stamatios Georgoulis, Marc Proesmans, Luc Van Gool

Abstract: Can we automatically group images into semantically meaningful clusters when ground-truth annotations are absent? The task of unsupervised image classification remains an important, and open challenge in computer vision. Several recent approaches have tried to tackle this problem in an end-to-end fashion. In this paper, we deviate from recent works, and advocate a two-step approach where feature l… ▽ More Can we automatically group images into semantically meaningful clusters when ground-truth annotations are absent? The task of unsupervised image classification remains an important, and open challenge in computer vision. Several recent approaches have tried to tackle this problem in an end-to-end fashion. In this paper, we deviate from recent works, and advocate a two-step approach where feature learning and clustering are decoupled. First, a self-supervised task from representation learning is employed to obtain semantically meaningful features. Second, we use the obtained features as a prior in a learnable clustering approach. In doing so, we remove the ability for cluster learning to depend on low-level features, which is present in current end-to-end learning approaches. Experimental evaluation shows that we outperform state-of-the-art methods by large margins, in particular +26.6% on CIFAR10, +25.0% on CIFAR100-20 and +21.3% on STL10 in terms of classification accuracy. Furthermore, our method is the first to perform well on a large-scale dataset for image classification. In particular, we obtain promising results on ImageNet, and outperform several semi-supervised learning methods in the low-data regime without the use of any ground-truth annotations. The code is made publicly available at https://github.com/wvangansbeke/Unsupervised-Classification. △ Less

Submitted 3 July, 2020; v1 submitted 25 May, 2020; originally announced May 2020.

Comments: Accepted at ECCV 2020. Includes supplementary. Code and pretrained models at https://github.com/wvangansbeke/Unsupervised-Classification

arXiv:2004.13379 [pdf, other]

doi 10.1109/TPAMI.2021.3054719

Multi-Task Learning for Dense Prediction Tasks: A Survey

Authors: Simon Vandenhende, Stamatios Georgoulis, Wouter Van Gansbeke, Marc Proesmans, Dengxin Dai, Luc Van Gool

Abstract: With the advent of deep learning, many dense prediction tasks, i.e. tasks that produce pixel-level predictions, have seen significant performance improvements. The typical approach is to learn these tasks in isolation, that is, a separate neural network is trained for each individual task. Yet, recent multi-task learning (MTL) techniques have shown promising results w.r.t. performance, computation… ▽ More With the advent of deep learning, many dense prediction tasks, i.e. tasks that produce pixel-level predictions, have seen significant performance improvements. The typical approach is to learn these tasks in isolation, that is, a separate neural network is trained for each individual task. Yet, recent multi-task learning (MTL) techniques have shown promising results w.r.t. performance, computations and/or memory footprint, by jointly tackling multiple tasks through a learned shared representation. In this survey, we provide a well-rounded view on state-of-the-art deep learning approaches for MTL in computer vision, explicitly emphasizing on dense prediction tasks. Our contributions concern the following. First, we consider MTL from a network architecture point-of-view. We include an extensive overview and discuss the advantages/disadvantages of recent popular MTL models. Second, we examine various optimization methods to tackle the joint learning of multiple tasks. We summarize the qualitative elements of these works and explore their commonalities and differences. Finally, we provide an extensive experimental evaluation across a variety of dense prediction benchmarks to examine the pros and cons of the different methods, including both architectural and optimization based strategies. △ Less

Submitted 24 January, 2021; v1 submitted 28 April, 2020; originally announced April 2020.

Comments: Accepted to T-PAMI. Code + Suppl. Mat. can be found here: https://github.com/SimonVandenhende/Multi-Task-Learning-PyTorch IEEE Copyright Notice

arXiv:2001.06902 [pdf, other]

MTI-Net: Multi-Scale Task Interaction Networks for Multi-Task Learning

Authors: Simon Vandenhende, Stamatios Georgoulis, Luc Van Gool

Abstract: In this paper, we argue about the importance of considering task interactions at multiple scales when distilling task information in a multi-task learning setup. In contrast to common belief, we show that tasks with high affinity at a certain scale are not guaranteed to retain this behaviour at other scales, and vice versa. We propose a novel architecture, namely MTI-Net, that builds upon this fin… ▽ More In this paper, we argue about the importance of considering task interactions at multiple scales when distilling task information in a multi-task learning setup. In contrast to common belief, we show that tasks with high affinity at a certain scale are not guaranteed to retain this behaviour at other scales, and vice versa. We propose a novel architecture, namely MTI-Net, that builds upon this finding in three ways. First, it explicitly models task interactions at every scale via a multi-scale multi-modal distillation unit. Second, it propagates distilled task information from lower to higher scales via a feature propagation module. Third, it aggregates the refined task features from all scales via a feature aggregation unit to produce the final per-task predictions. Extensive experiments on two multi-task dense labeling datasets show that, unlike prior work, our multi-task model delivers on the full potential of multi-task learning, that is, smaller memory footprint, reduced number of calculations, and better performance w.r.t. single-task learning. The code is made publicly available: https://github.com/SimonVandenhende/Multi-Task-Learning-PyTorch. △ Less

Submitted 8 July, 2020; v1 submitted 19 January, 2020; originally announced January 2020.

Comments: Accepted at ECCV2020 (spotlight) - Code: https://github.com/SimonVandenhende/Multi-Task-Learning-PyTorch

arXiv:1912.07124 [pdf, other]

Domain Agnostic Feature Learning for Image and Video Based Face Anti-spoofing

Authors: Suman Saha, Wenhao Xu, Menelaos Kanakis, Stamatios Georgoulis, Yuhua Chen, Danda Pani Paudel, Luc Van Gool

Abstract: Nowadays, the increasingly growing number of mobile and computing devices has led to a demand for safer user authentication systems. Face anti-spoofing is a measure towards this direction for bio-metric user authentication, and in particular face recognition, that tries to prevent spoof attacks. The state-of-the-art anti-spoofing techniques leverage the ability of deep neural networks to learn dis… ▽ More Nowadays, the increasingly growing number of mobile and computing devices has led to a demand for safer user authentication systems. Face anti-spoofing is a measure towards this direction for bio-metric user authentication, and in particular face recognition, that tries to prevent spoof attacks. The state-of-the-art anti-spoofing techniques leverage the ability of deep neural networks to learn discriminative features, based on cues from the training set images or video samples, in an effort to detect spoof attacks. However, due to the particular nature of the problem, i.e. large variability due to factors like different backgrounds, lighting conditions, camera resolutions, spoof materials, etc., these techniques typically fail to generalize to new samples. In this paper, we explicitly tackle this problem and propose a class-conditional domain discriminator module, that, coupled with a gradient reversal layer, tries to generate live and spoof features that are discriminative, but at the same time robust against the aforementioned variability factors. Extensive experimental analysis shows the effectiveness of the proposed method over existing image- and video-based anti-spoofing techniques, both in terms of numerical improvement as well as when visualizing the learned features. △ Less

Submitted 13 April, 2020; v1 submitted 15 December, 2019; originally announced December 2019.

Comments: CVPR 2020 Biometrics Workshop

arXiv:1906.04651

Gated CRF Loss for Weakly Supervised Semantic Image Segmentation

Authors: Anton Obukhov, Stamatios Georgoulis, Dengxin Dai, Luc Van Gool

Abstract: State-of-the-art approaches for semantic segmentation rely on deep convolutional neural networks trained on fully annotated datasets, that have been shown to be notoriously expensive to collect, both in terms of time and money. To remedy this situation, weakly supervised methods leverage other forms of supervision that require substantially less annotation effort, but they typically present an ina… ▽ More State-of-the-art approaches for semantic segmentation rely on deep convolutional neural networks trained on fully annotated datasets, that have been shown to be notoriously expensive to collect, both in terms of time and money. To remedy this situation, weakly supervised methods leverage other forms of supervision that require substantially less annotation effort, but they typically present an inability to predict precise object boundaries due to approximate nature of the supervisory signals in those regions. While great progress has been made in improving the performance, many of these weakly supervised methods are highly tailored to their own specific settings. This raises challenges in reusing algorithms and making steady progress. In this paper, we intentionally avoid such practices when tackling weakly supervised semantic segmentation. In particular, we train standard neural networks with partial cross-entropy loss function for the labeled pixels and our proposed Gated CRF loss for the unlabeled pixels. The Gated CRF loss is designed to deliver several important assets: 1) it enables flexibility in the kernel construction to mask out influence from undesired pixel positions; 2) it offloads learning contextual relations to CNN and concentrates on semantic boundaries; 3) it does not rely on high-dimensional filtering and thus has a simple implementation. Throughout the paper we present the advantages of the loss function, analyze several aspects of weakly supervised training, and show that our `purist' approach achieves state-of-the-art performance for both click-based and scribble-based annotations. △ Less

Submitted 30 October, 2019; v1 submitted 11 June, 2019; originally announced June 2019.

Comments: A portion of the reported numbers are incorrect, along with a few statements

arXiv:1904.02920 [pdf, other]

Branched Multi-Task Networks: Deciding What Layers To Share

Authors: Simon Vandenhende, Stamatios Georgoulis, Bert De Brabandere, Luc Van Gool

Abstract: In the context of multi-task learning, neural networks with branched architectures have often been employed to jointly tackle the tasks at hand. Such ramified networks typically start with a number of shared layers, after which different tasks branch out into their own sequence of layers. Understandably, as the number of possible network configurations is combinatorially large, deciding what layer… ▽ More In the context of multi-task learning, neural networks with branched architectures have often been employed to jointly tackle the tasks at hand. Such ramified networks typically start with a number of shared layers, after which different tasks branch out into their own sequence of layers. Understandably, as the number of possible network configurations is combinatorially large, deciding what layers to share and where to branch out becomes cumbersome. Prior works have either relied on ad hoc methods to determine the level of layer sharing, which is suboptimal, or utilized neural architecture search techniques to establish the network design, which is considerably expensive. In this paper, we go beyond these limitations and propose an approach to automatically construct branched multi-task networks, by leveraging the employed tasks' affinities. Given a specific budget, i.e. number of learnable parameters, the proposed approach generates architectures, in which shallow layers are task-agnostic, whereas deeper ones gradually grow more task-specific. Extensive experimental analysis across numerous, diverse multi-tasking datasets shows that, for a given budget, our method consistently yields networks with the highest performance, while for a certain performance threshold it requires the least amount of learnable parameters. △ Less

Submitted 13 August, 2020; v1 submitted 5 April, 2019; originally announced April 2019.

Comments: Accepted at BMVC 2020

arXiv:1805.11145 [pdf, other]

Exemplar Guided Unsupervised Image-to-Image Translation with Semantic Consistency

Authors: Liqian Ma, Xu Jia, Stamatios Georgoulis, Tinne Tuytelaars, Luc Van Gool

Abstract: Image-to-image translation has recently received significant attention due to advances in deep learning. Most works focus on learning either a one-to-one map** in an unsupervised way or a many-to-many map** in a supervised way. However, a more practical setting is many-to-many map** in an unsupervised way, which is harder due to the lack of supervision and the complex inner- and cross-domain… ▽ More Image-to-image translation has recently received significant attention due to advances in deep learning. Most works focus on learning either a one-to-one map** in an unsupervised way or a many-to-many map** in a supervised way. However, a more practical setting is many-to-many map** in an unsupervised way, which is harder due to the lack of supervision and the complex inner- and cross-domain variations. To alleviate these issues, we propose the Exemplar Guided & Semantically Consistent Image-to-image Translation (EGSC-IT) network which conditions the translation process on an exemplar image in the target domain. We assume that an image comprises of a content component which is shared across domains, and a style component specific to each domain. Under the guidance of an exemplar from the target domain we apply Adaptive Instance Normalization to the shared content component, which allows us to transfer the style information of the target domain to the source domain. To avoid semantic inconsistencies during translation that naturally appear due to the large inner- and cross-domain variations, we introduce the concept of feature masks that provide coarse semantic guidance without requiring the use of any semantic labels. Experimental results on various datasets show that EGSC-IT does not only translate the source image to diverse instances in the target domain, but also preserves the semantic consistency during the process. △ Less

Submitted 13 March, 2019; v1 submitted 28 May, 2018; originally announced May 2018.

Comments: To appear in ICLR 2019

arXiv:1802.05591 [pdf, other]

Towards End-to-End Lane Detection: an Instance Segmentation Approach

Authors: Davy Neven, Bert De Brabandere, Stamatios Georgoulis, Marc Proesmans, Luc Van Gool

Abstract: Modern cars are incorporating an increasing number of driver assist features, among which automatic lane kee**. The latter allows the car to properly position itself within the road lanes, which is also crucial for any subsequent lane departure or trajectory planning decision in fully autonomous cars. Traditional lane detection methods rely on a combination of highly-specialized, hand-crafted fe… ▽ More Modern cars are incorporating an increasing number of driver assist features, among which automatic lane kee**. The latter allows the car to properly position itself within the road lanes, which is also crucial for any subsequent lane departure or trajectory planning decision in fully autonomous cars. Traditional lane detection methods rely on a combination of highly-specialized, hand-crafted features and heuristics, usually followed by post-processing techniques, that are computationally expensive and prone to scalability due to road scene variations. More recent approaches leverage deep learning models, trained for pixel-wise lane segmentation, even when no markings are present in the image due to their big receptive field. Despite their advantages, these methods are limited to detecting a pre-defined, fixed number of lanes, e.g. ego-lanes, and can not cope with lane changes. In this paper, we go beyond the aforementioned limitations and propose to cast the lane detection problem as an instance segmentation problem - in which each lane forms its own instance - that can be trained end-to-end. To parametrize the segmented lane instances before fitting the lane, we further propose to apply a learned perspective transformation, conditioned on the image, in contrast to a fixed "bird's-eye view" transformation. By doing so, we ensure a lane fitting which is robust against road plane changes, unlike existing approaches that rely on a fixed, pre-defined transformation. In summary, we propose a fast lane detection algorithm, running at 50 fps, which can handle a variable number of lanes and cope with lane changes. We verify our method on the tuSimple dataset and achieve competitive results. △ Less

Submitted 15 February, 2018; originally announced February 2018.

arXiv:1712.03812 [pdf, other]

Error Correction for Dense Semantic Image Labeling

Authors: Yu-Hui Huang, Xu Jia, Stamatios Georgoulis, Tinne Tuytelaars, Luc Van Gool

Abstract: Pixelwise semantic image labeling is an important, yet challenging, task with many applications. Typical approaches to tackle this problem involve either the training of deep networks on vast amounts of images to directly infer the labels or the use of probabilistic graphical models to jointly model the dependencies of the input (i.e. images) and output (i.e. labels). Yet, the former approaches do… ▽ More Pixelwise semantic image labeling is an important, yet challenging, task with many applications. Typical approaches to tackle this problem involve either the training of deep networks on vast amounts of images to directly infer the labels or the use of probabilistic graphical models to jointly model the dependencies of the input (i.e. images) and output (i.e. labels). Yet, the former approaches do not capture the structure of the output labels, which is crucial for the performance of dense labeling, and the latter rely on carefully hand-designed priors that require costly parameter tuning via optimization techniques, which in turn leads to long inference times. To alleviate these restrictions, we explore how to arrive at dense semantic pixel labels given both the input image and an initial estimate of the output labels. We propose a parallel architecture that: 1) exploits the context information through a LabelPropagation network to propagate correct labels from nearby pixels to improve the object boundaries, 2) uses a LabelReplacement network to directly replace possibly erroneous, initial labels with new ones, and 3) combines the different intermediate results via a Fusion network to obtain the final per-pixel label. We experimentally validate our approach on two different datasets for the semantic segmentation and face parsing tasks respectively, where we show improvements over the state-of-the-art. We also provide both a quantitative and qualitative analysis of the generated results. △ Less

Submitted 11 December, 2017; originally announced December 2017.

arXiv:1712.02621 [pdf, other]

Disentangled Person Image Generation

Authors: Liqian Ma, Qianru Sun, Stamatios Georgoulis, Luc Van Gool, Bernt Schiele, Mario Fritz

Abstract: Generating novel, yet realistic, images of persons is a challenging task due to the complex interplay between the different image factors, such as the foreground, background and pose information. In this work, we aim at generating such images based on a novel, two-stage reconstruction pipeline that learns a disentangled representation of the aforementioned image factors and generates novel person… ▽ More Generating novel, yet realistic, images of persons is a challenging task due to the complex interplay between the different image factors, such as the foreground, background and pose information. In this work, we aim at generating such images based on a novel, two-stage reconstruction pipeline that learns a disentangled representation of the aforementioned image factors and generates novel person images at the same time. First, a multi-branched reconstruction network is proposed to disentangle and encode the three factors into embedding features, which are then combined to re-compose the input image itself. Second, three corresponding map** functions are learned in an adversarial manner in order to map Gaussian noise to the learned embedding feature space, for each factor respectively. Using the proposed framework, we can manipulate the foreground, background and pose of the input image, and also sample new embedding features to generate such targeted manipulations, that provide more control over the generation process. Experiments on Market-1501 and Deepfashion datasets show that our model does not only generate realistic person images with new foregrounds, backgrounds and poses, but also manipulates the generated factors and interpolates the in-between states. Another set of experiments on Market-1501 shows that our model can also be beneficial for the person re-identification task. △ Less

Submitted 15 June, 2018; v1 submitted 7 December, 2017; originally announced December 2017.

Comments: Published at CVPR'18 (Spotlight). Corresponding author is Qianru Sun

arXiv:1708.02550 [pdf, other]

Fast Scene Understanding for Autonomous Driving

Authors: Davy Neven, Bert De Brabandere, Stamatios Georgoulis, Marc Proesmans, Luc Van Gool

Abstract: Most approaches for instance-aware semantic labeling traditionally focus on accuracy. Other aspects like runtime and memory footprint are arguably as important for real-time applications such as autonomous driving. Motivated by this observation and inspired by recent works that tackle multiple tasks with a single integrated architecture, in this paper we present a real-time efficient implementatio… ▽ More Most approaches for instance-aware semantic labeling traditionally focus on accuracy. Other aspects like runtime and memory footprint are arguably as important for real-time applications such as autonomous driving. Motivated by this observation and inspired by recent works that tackle multiple tasks with a single integrated architecture, in this paper we present a real-time efficient implementation based on ENet that solves three autonomous driving related tasks at once: semantic scene segmentation, instance segmentation and monocular depth estimation. Our approach builds upon a branched ENet architecture with a shared encoder but different decoder branches for each of the three tasks. The presented method can run at 21 fps at a resolution of 1024x512 on the Cityscapes dataset without sacrificing accuracy compared to running each task separately. △ Less

Submitted 8 August, 2017; originally announced August 2017.

Comments: Published at "Deep Learning for Vehicle Perception", workshop at the IEEE Symposium on Intelligent Vehicles 2017

arXiv:1611.09325 [pdf, other]

What Is Around The Camera?

Authors: Stamatios Georgoulis, Konstantinos Rematas, Tobias Ritschel, Mario Fritz, Tinne Tuytelaars, Luc Van Gool

Abstract: How much does a single image reveal about the environment it was taken in? In this paper, we investigate how much of that information can be retrieved from a foreground object, combined with the background (i.e. the visible part of the environment). Assuming it is not perfectly diffuse, the foreground object acts as a complexly shaped and far-from-perfect mirror. An additional challenge is that it… ▽ More How much does a single image reveal about the environment it was taken in? In this paper, we investigate how much of that information can be retrieved from a foreground object, combined with the background (i.e. the visible part of the environment). Assuming it is not perfectly diffuse, the foreground object acts as a complexly shaped and far-from-perfect mirror. An additional challenge is that its appearance confounds the light coming from the environment with the unknown materials it is made of. We propose a learning-based approach to predict the environment from multiple reflectance maps that are computed from approximate surface normals. The proposed method allows us to jointly model the statistics of environments and material properties. We train our system from synthesized training data, but demonstrate its applicability to real-world data. Interestingly, our analysis shows that the information obtained from objects made out of multiple materials often is complementary and leads to better performance. △ Less

Submitted 31 July, 2017; v1 submitted 28 November, 2016; originally announced November 2016.

Comments: Accepted to ICCV. Project: http://homes.esat.kuleuven.be/~sgeorgou/multinatillum/

arXiv:1603.08240 [pdf, other]

DeLight-Net: Decomposing Reflectance Maps into Specular Materials and Natural Illumination

Authors: Stamatios Georgoulis, Konstantinos Rematas, Tobias Ritschel, Mario Fritz, Luc Van Gool, Tinne Tuytelaars

Abstract: In this paper we are extracting surface reflectance and natural environmental illumination from a reflectance map, i.e. from a single 2D image of a sphere of one material under one illumination. This is a notoriously difficult problem, yet key to various re-rendering applications. With the recent advances in estimating reflectance maps from 2D images their further decomposition has become increasi… ▽ More In this paper we are extracting surface reflectance and natural environmental illumination from a reflectance map, i.e. from a single 2D image of a sphere of one material under one illumination. This is a notoriously difficult problem, yet key to various re-rendering applications. With the recent advances in estimating reflectance maps from 2D images their further decomposition has become increasingly relevant. To this end, we propose a Convolutional Neural Network (CNN) architecture to reconstruct both material parameters (i.e. Phong) as well as illumination (i.e. high-resolution spherical illumination maps), that is solely trained on synthetic data. We demonstrate that decomposition of synthetic as well as real photographs of reflectance maps, both in High Dynamic Range (HDR), and, for the first time, on Low Dynamic Range (LDR) as well. Results are compared to previous approaches quantitatively as well as qualitatively in terms of re-renderings where illumination, material, view or shape are changed. △ Less

Submitted 27 March, 2016; originally announced March 2016.

Comments: Stamatios Georgoulis and Konstantinos Rematas contributed equally to this work

Showing 1–27 of 27 results for author: Georgoulis, S