Search | arXiv e-print repository

AdaNCA: Neural Cellular Automata As Adaptors For More Robust Vision Transformer

Authors: Yitao Xu, Tong Zhang, Sabine Süsstrunk

Abstract: Vision Transformers (ViTs) have demonstrated remarkable performance in image classification tasks, particularly when equipped with local information via region attention or convolutions. While such architectures improve the feature aggregation from different granularities, they often fail to contribute to the robustness of the networks. Neural Cellular Automata (NCA) enables the modeling of global cell representations through local interactions, with its training strategies and architecture design conferring strong generalization ability and robustness against noisy inputs. In this paper, we propose Adaptor Neural Cellular Automata (AdaNCA) for Vision Transformer that uses NCA as plug-and-play adaptors between ViT layers, enhancing ViT's performance and robustness against adversarial samples as well as out-of-distribution inputs. To overcome the large computational overhead of standard NCAs, we propose Dynamic Interaction for more efficient interaction learning. Furthermore, we develop an algorithm for identifying the most effective insertion points for AdaNCA based on our analysis of AdaNCA placement and robustness improvement. With less than a 3% increase in parameters, AdaNCA contributes to more than 10% absolute improvement in accuracy under adversarial attacks on the ImageNet1K benchmark. Moreover, we demonstrate with extensive evaluations across 8 robustness benchmarks and 4 ViT architectures that AdaNCA, as a plug-and-play module, consistently improves the robustness of ViTs.

Submitted 20 June, 2024; v1 submitted 12 June, 2024; originally announced June 2024.

Comments: 26 pages, 11 figures

arXiv:2404.07504 [pdf, other]

Mitigating Object Dependencies: Improving Point Cloud Self-Supervised Learning through Object Exchange

Authors: Yanhao Wu, Tong Zhang, Wei Ke, Congpei Qiu, Sabine Süsstrunk, Mathieu Salzmann

Abstract: In the realm of point cloud scene understanding, particularly in indoor scenes, objects are arranged following human habits, resulting in objects of certain semantics being closely positioned and displaying notable inter-object correlations. This can create a tendency for neural networks to exploit these strong dependencies, bypassing the individual object patterns. To address this challenge, we introduce a novel self-supervised learning (SSL) strategy. Our approach leverages both object patterns and contextual cues to produce robust features. It begins with the formulation of an object-exchanging strategy, where pairs of objects with comparable sizes are exchanged across different scenes, effectively disentangling the strong contextual dependencies. Subsequently, we introduce a context-aware feature learning strategy, which encodes object patterns without relying on their specific context by aggregating object features across various scenes. Our extensive experiments demonstrate the superiority of our method over existing SSL techniques, further showing its better robustness to environmental changes. Moreover, we showcase the applicability of our approach by transferring pre-trained models to diverse point cloud datasets.

Submitted 11 April, 2024; originally announced April 2024.

arXiv:2404.06406 [pdf, other]

Emergent Dynamics in Neural Cellular Automata

Authors: Yitao Xu, Ehsan Pajouheshgar, Sabine Süsstrunk

Abstract: Neural Cellular Automata (NCA) models are trainable variations of traditional Cellular Automata (CA). Emergent motion in the patterns created by NCA has been successfully applied to synthesize dynamic textures. However, the conditions required for an NCA to display dynamic patterns remain unexplored. Here, we investigate the relationship between the NCA architecture and the emergent dynamics of the trained models. Specifically, we vary the number of channels in the cell state and the number of hidden neurons in the MultiLayer Perceptron (MLP), and draw a relationship between the combination of these two variables and the motion strength between successive frames. Our analysis reveals that the disparity and proportionality between these two variables have a strong correlation with the emergent dynamics in the NCA output. We thus propose a design principle for creating dynamic NCA.

Submitted 20 June, 2024; v1 submitted 9 April, 2024; originally announced April 2024.

Comments: 2 pages

arXiv:2404.06279 [pdf, other]

NoiseNCA: Noisy Seed Improves Spatio-Temporal Continuity of Neural Cellular Automata

Authors: Ehsan Pajouheshgar, Yitao Xu, Sabine Süsstrunk

Abstract: Neural Cellular Automata (NCA) is a class of Cellular Automata where the update rule is parameterized by a neural network that can be trained using gradient descent. In this paper, we focus on NCA models used for texture synthesis, where the update rule is inspired by partial differential equations (PDEs) describing reaction-diffusion systems. To train the NCA model, the spatio-temporal domain is discretized, and Euler integration is used to numerically simulate the PDE. However, whether a trained NCA truly learns the continuous dynamic described by the corresponding PDE or merely overfits the discretization used in training remains an open question. We study NCA models at the limit where space-time discretization approaches continuity. We find that existing NCA models tend to overfit the training discretization, especially in the proximity of the initial condition, also called "seed". To address this, we propose a solution that utilizes uniform noise as the initial condition. We demonstrate the effectiveness of our approach in preserving the consistency of NCA dynamics across a wide range of spatio-temporal granularities. Our improved NCA model enables two new test-time interactions by allowing continuous control over the speed of pattern formation and the scale of the synthesized patterns. We demonstrate this new NCA feature in our interactive online demo. Our work reveals that NCA models can learn continuous dynamics and opens new avenues for NCA research from a dynamical systems perspective.

Submitted 14 June, 2024; v1 submitted 9 April, 2024; originally announced April 2024.

Comments: 9 pages, 12 figures

Journal ref: Artificial Life (ALife) 2024

arXiv:2403.06546 [pdf, other]

OMH: Structured Sparsity via Optimally Matched Hierarchy for Unsupervised Semantic Segmentation

Authors: Baran Ozaydin, Tong Zhang, Deblina Bhattacharjee, Sabine Süsstrunk, Mathieu Salzmann

Abstract: Unsupervised Semantic Segmentation (USS) involves segmenting images without relying on predefined labels, aiming to alleviate the burden of extensive human labeling. Existing methods utilize features generated by self-supervised models and specific priors for clustering. However, their clustering objectives are not involved in the optimization of the features during training. Additionally, due to the lack of clear class definitions in USS, the resulting segments may not align well with the clustering objective. In this paper, we introduce a novel approach called Optimally Matched Hierarchy (OMH) to simultaneously address the above issues. The core of our method lies in imposing structured sparsity on the feature space, which allows the features to encode information with different levels of granularity. The structure of this sparsity stems from our hierarchy (OMH). To achieve this, we learn a soft but sparse hierarchy among parallel clusters through Optimal Transport. Our OMH yields better unsupervised segmentation performance compared to existing USS methods. Our extensive experiments demonstrate the benefits of OMH when utilizing our differentiable paradigm. We will make our code publicly available.

Submitted 5 April, 2024; v1 submitted 11 March, 2024; originally announced March 2024.

Comments: 11 pages

arXiv:2312.03053 [pdf, other]

DiffusionPCR: Diffusion Models for Robust Multi-Step Point Cloud Registration

Authors: Zhi Chen, Yufan Ren, Tong Zhang, Zheng Dang, Wenbing Tao, Sabine Süsstrunk, Mathieu Salzmann

Abstract: Point Cloud Registration (PCR) estimates the relative rigid transformation between two point clouds. We propose formulating PCR as a denoising diffusion probabilistic process, mapping noisy transformations to the ground truth. However, using diffusion models for PCR has nontrivial challenges, such as adapting a generative model to a discriminative task and leveraging the estimated nonlinear transformation from the previous step. Instead of training a diffusion model to directly map pure noise to ground truth, we map the predictions of an off-the-shelf PCR model to ground truth. The predictions of off-the-shelf models are often imperfect, especially in challenging cases where the two point clouds have low overlap, and thus could be seen as noisy versions of the real rigid transformation. In addition, we transform the rotation matrix into a spherical linear space for interpolation between samples in the forward process, and convert rigid transformations into auxiliary information to implicitly exploit last-step estimations in the reverse process. As a result, conditioned on time step, the denoising model adapts to the increasing accuracy across steps and refines registrations. Our extensive experiments showcase the effectiveness of our DiffusionPCR, yielding state-of-the-art registration recall rates (95.3%/81.6%) on 3DMatch and 3DLoMatch. The code will be made public upon publication.

Submitted 5 December, 2023; originally announced December 2023.

arXiv:2311.10788 [pdf, other]

Efficient Temporally-Aware DeepFake Detection using H.264 Motion Vectors

Authors: Peter Grönquist, Yufan Ren, Qingyi He, Alessio Verardo, Sabine Süsstrunk

Abstract: Video DeepFakes are fake media created with Deep Learning (DL) that manipulate a person's expression or identity. Most current DeepFake detection methods analyze each frame independently, ignoring inconsistencies and unnatural movements between frames. Some newer methods employ optical flow models to capture this temporal aspect, but they are computationally expensive. In contrast, we propose using the related but often ignored Motion Vectors (MVs) and Information Masks (IMs) from the H.264 video codec, to detect temporal inconsistencies in DeepFakes. Our experiments show that this approach is effective and has minimal computational costs, compared with per-frame RGB-only methods. This could lead to new, real-time temporally-aware DeepFake detection methods for video calls and streaming.

Submitted 22 February, 2024; v1 submitted 16 November, 2023; originally announced November 2023.

ACM Class: I.5.4; I.4.8; I.2.10; I.4.2

arXiv:2311.02820 [pdf, other]

Mesh Neural Cellular Automata

Authors: Ehsan Pajouheshgar, Yitao Xu, Alexander Mordvintsev, Eyvind Niklasson, Tong Zhang, Sabine Süsstrunk

Abstract: Texture modeling and synthesis are essential for enhancing the realism of virtual environments. Methods that directly synthesize textures in 3D offer distinct advantages over the UV-mapping-based methods as they can create seamless textures and align more closely with the ways textures form in nature. We propose Mesh Neural Cellular Automata (MeshNCA), a method that directly synthesizes dynamic textures on 3D meshes without requiring any UV maps. MeshNCA is a generalized type of cellular automata that can operate on a set of cells arranged on non-grid structures such as the vertices of a 3D mesh. MeshNCA accommodates multi-modal supervision and can be trained using different targets such as images, text prompts, and motion vector fields. Only trained on an Icosphere mesh, MeshNCA shows remarkable test-time generalization and can synthesize textures on unseen meshes in real time. We conduct qualitative and quantitative comparisons to demonstrate that MeshNCA outperforms other 3D texture synthesis methods in terms of generalization and producing high-quality textures. Moreover, we introduce a way of grafting trained MeshNCA instances, enabling interpolation between textures. MeshNCA allows several user interactions including texture density/orientation controls, grafting/regenerate brushes, and motion speed/direction controls. Finally, we implement the forward pass of our MeshNCA model using the WebGL shading language and showcase our trained models in an online interactive demo, which is accessible on personal computers and smartphones and is available at https://meshnca.github.io.

Submitted 16 May, 2024; v1 submitted 5 November, 2023; originally announced November 2023.

Comments: ACM Transactions on Graphics (TOG) - SIGGRAPH 2024

arXiv:2309.15842 [pdf, other]

Exploiting the Signal-Leak Bias in Diffusion Models

Authors: Martin Nicolas Everaert, Athanasios Fitsios, Marco Bocchio, Sami Arpa, Sabine Süsstrunk, Radhakrishna Achanta

Abstract: There is a bias in the inference pipeline of most diffusion models. This bias arises from a signal leak whose distribution deviates from the noise distribution, creating a discrepancy between training and inference processes. We demonstrate that this signal-leak bias is particularly significant when models are tuned to a specific style, causing sub-optimal style matching. Recent research tries to avoid the signal leakage during training. We instead show how we can exploit this signal-leak bias in existing diffusion models to allow more control over the generated images. This enables us to generate images with more varied brightness, and images that better match a desired style or color. By modeling the distribution of the signal leak in the spatial frequency and pixel domains, and including a signal leak in the initial latent, we generate images that better match expected results without any additional training.

Submitted 24 October, 2023; v1 submitted 27 September, 2023; originally announced September 2023.

Comments: corrected the author names in reference [24]

arXiv:2308.12372 [pdf, other]

Vision Transformer Adapters for Generalizable Multitask Learning

Authors: Deblina Bhattacharjee, Sabine Süsstrunk, Mathieu Salzmann

Abstract: We introduce the first multitasking vision transformer adapters that learn generalizable task affinities which can be applied to novel tasks and domains. Integrated into an off-the-shelf vision transformer backbone, our adapters can simultaneously solve multiple dense vision tasks in a parameter-efficient manner, unlike existing multitasking transformers that are parametrically expensive. In contrast to concurrent methods, we do not require retraining or fine-tuning whenever a new task or domain is added. We introduce a task-adapted attention mechanism within our adapter framework that combines gradient-based task similarities with attention-based ones. The learned task affinities generalize to the following settings: zero-shot task transfer, unsupervised domain adaptation, and generalization without fine-tuning to novel domains. We demonstrate that our approach outperforms not only the existing convolutional neural network-based multitasking methods but also the vision transformer-based ones. Our project page is at https://ivrl.github.io/VTAGML.

Submitted 23 August, 2023; originally announced August 2023.

Comments: Accepted to ICCV 2023

arXiv:2307.08071 [pdf, other]

Dense Multitask Learning to Reconfigure Comics

Authors: Deblina Bhattacharjee, Sabine Süsstrunk, Mathieu Salzmann

Abstract: In this paper, we develop a MultiTask Learning (MTL) model to achieve dense predictions for comics panels to, in turn, facilitate the transfer of comics from one publication channel to another by assisting authors in the task of reconfiguring their narratives. Our MTL method can successfully identify the semantic units as well as the embedded notion of 3D in comic panels. This is a significantly challenging problem because comics comprise disparate artistic styles, illustrations, layouts, and object scales that depend on the authors' creative process. Typically, dense image-based prediction techniques require a large corpus of data. Finding an automated solution for dense prediction in the comics domain, therefore, becomes more difficult with the lack of ground-truth dense annotations for the comics images. To address these challenges, we develop the following solutions: 1) we leverage a commonly-used strategy known as unsupervised image-to-image translation, which allows us to utilize a large corpus of real-world annotations; 2) we utilize the results of the translations to develop our multitasking approach that is based on a vision transformer backbone and a domain transferable attention module; 3) we study the feasibility of integrating our MTL dense-prediction method with an existing retargeting method, thereby reconfiguring comics.

Submitted 16 July, 2023; originally announced July 2023.

Comments: CVPR 2023 Workshop. arXiv admin note: text overlap with arXiv:2205.08303

arXiv:2305.15094 [pdf, other]

InNeRF360: Text-Guided 3D-Consistent Object Inpainting on 360-degree Neural Radiance Fields

Authors: Dongqing Wang, Tong Zhang, Alaa Abboud, Sabine Süsstrunk

Abstract: We propose InNeRF360, an automatic system that accurately removes text-specified objects from 360-degree Neural Radiance Fields (NeRF). The challenge is to effectively remove objects while inpainting perceptually consistent content for the missing regions, which is particularly demanding for existing NeRF models due to their implicit volumetric representation. Moreover, unbounded scenes are more prone to floater artifacts in the inpainted region than frontal-facing scenes, as the change of object appearance and background across views is more sensitive to inaccurate segmentations and inconsistent inpainting. With a trained NeRF and a text description, our method efficiently removes specified objects and inpaints visually consistent content without artifacts. We apply depth-space warping to enforce consistency across multiview text-encoded segmentations, and then refine the inpainted NeRF model using perceptual priors and 3D diffusion-based geometric priors to ensure visual plausibility. Through extensive experiments in segmentation and inpainting on 360-degree and frontal-facing NeRFs, we show that our approach is effective and enhances NeRF's editability. Project page: https://ivrl.github.io/InNeRF360.

Submitted 26 March, 2024; v1 submitted 24 May, 2023; originally announced May 2023.

Comments: CVPR 2024

arXiv:2303.16947 [pdf, other]

De-coupling and De-positioning Dense Self-supervised Learning

Authors: Congpei Qiu, Tong Zhang, Wei Ke, Mathieu Salzmann, Sabine Süsstrunk

Abstract: Dense Self-Supervised Learning (SSL) methods address the limitations of using image-level feature representations when handling images with multiple objects. Although the dense features extracted by employing segmentation maps and bounding boxes allow networks to perform SSL for each object, we show that they suffer from coupling and positional bias, which arise from the receptive field increasing with layer depth and zero-padding. We address this by introducing three data augmentation strategies, and leveraging them in (i) a decoupling module that aims to robustify the network to variations in the object's surroundings, and (ii) a de-positioning module that encourages the network to discard positional object information. We demonstrate the benefits of our method on COCO and on a new challenging benchmark, OpenImage-MINI, for object classification, semantic segmentation, and object detection. Our extensive experiments evidence the better generalization of our method compared to the SOTA dense SSL methods.

Submitted 29 March, 2023; originally announced March 2023.

arXiv:2303.16235 [pdf, other]

Spatiotemporal Self-supervised Learning for Point Clouds in the Wild

Authors: Yanhao Wu, Tong Zhang, Wei Ke, Sabine Süsstrunk, Mathieu Salzmann

Abstract: Self-supervised learning (SSL) has the potential to benefit many applications, particularly those where manually annotating data is cumbersome. One such situation is the semantic segmentation of point clouds. In this context, existing methods employ contrastive learning strategies and define positive pairs by performing various augmentation of point clusters in a single frame. As such, these methods do not exploit the temporal nature of LiDAR data. In this paper, we introduce an SSL strategy that leverages positive pairs in both the spatial and temporal domain. To this end, we design (i) a point-to-cluster learning strategy that aggregates spatial information to distinguish objects; and (ii) a cluster-to-cluster learning strategy based on unsupervised object tracking that exploits temporal correspondences. We demonstrate the benefits of our approach via extensive experiments performed by self-supervised training on two large-scale LiDAR datasets and transferring the resulting models to other point cloud segmentation benchmarks. Our results evidence that our method outperforms the state-of-the-art point cloud SSL methods.

Submitted 28 March, 2023; originally announced March 2023.

Comments: CVPR accepted

arXiv:2303.11963 [pdf, other]

NEMTO: Neural Environment Matting for Novel View and Relighting Synthesis of Transparent Objects

Authors: Dongqing Wang, Tong Zhang, Sabine Süsstrunk

Abstract: We propose NEMTO, the first end-to-end neural rendering pipeline to model 3D transparent objects with complex geometry and unknown indices of refraction. Commonly used appearance modeling such as the Disney BSDF model cannot accurately address this challenging problem due to the complex light paths bending through refractions and the strong dependency of surface appearance on illumination. With 2D images of the transparent object as input, our method is capable of high-quality novel view and relighting synthesis. We leverage implicit Signed Distance Functions (SDF) to model the object geometry and propose a refraction-aware ray bending network to model the effects of light refraction within the object. Our ray bending network is more tolerant to geometric inaccuracies than traditional physically-based methods for rendering transparent objects. We provide extensive evaluations on both synthetic and real-world datasets to demonstrate our high-quality synthesis and the applicability of our method.

Submitted 4 April, 2024; v1 submitted 21 March, 2023; originally announced March 2023.

Comments: ICCV 2023

arXiv:2301.02315 [pdf, other]

TempSAL -- Uncovering Temporal Information for Deep Saliency Prediction

Authors: Bahar Aydemir, Ludo Hoffstetter, Tong Zhang, Mathieu Salzmann, Sabine Süsstrunk

Abstract: Deep saliency prediction algorithms complement object recognition features; they typically rely on additional information, such as scene context, semantic relationships, gaze direction, and object dissimilarity. However, none of these models consider the temporal nature of gaze shifts during image observation. We introduce a novel saliency prediction model that learns to output saliency maps in sequential time intervals by exploiting human temporal attention patterns. Our approach locally modulates the saliency predictions by combining the learned temporal maps. Our experiments show that our method outperforms the state-of-the-art models, including a multi-duration saliency model, on the SALICON benchmark. Our code will be publicly available on GitHub.

Submitted 5 January, 2023; originally announced January 2023.

Comments: 10 pages, 7 figures

arXiv:2212.13253 [pdf, other]

DSI2I: Dense Style for Unpaired Image-to-Image Translation

Authors: Baran Ozaydin, Tong Zhang, Sabine Süsstrunk, Mathieu Salzmann

Abstract: Unpaired exemplar-based image-to-image (UEI2I) translation aims to translate a source image to a target image domain with the style of a target image exemplar, without ground-truth input-translation pairs. Existing UEI2I methods represent style using one vector per image or rely on semantic supervision to define one style vector per object. Here, in contrast, we propose to represent style as a dense feature map, allowing for a finer-grained transfer to the source image without requiring any external semantic information. We then rely on perceptual and adversarial losses to disentangle our dense style and content representations. To stylize the source content with the exemplar style, we extract unsupervised cross-domain semantic correspondences and warp the exemplar style to the source content. We demonstrate the effectiveness of our method on four datasets using standard metrics together with a localized style metric we propose, which measures style similarity in a class-wise manner. Our results show that the translations produced by our approach are more diverse, preserve the source content better, and are closer to the exemplars when compared to the state-of-the-art methods. Project page: https://github.com/IVRL/dsi2i

Submitted 1 May, 2024; v1 submitted 26 December, 2022; originally announced December 2022.

Comments: To appear on TMLR '24, Reviewed on OpenReview: https://openreview.net/forum?id=mrJi5kdKA4

arXiv:2212.08067 [pdf, other]

VolRecon: Volume Rendering of Signed Ray Distance Functions for Generalizable Multi-View Reconstruction

Authors: Yufan Ren, Fangjinhua Wang, Tong Zhang, Marc Pollefeys, Sabine Süsstrunk

Abstract: The success of the Neural Radiance Fields (NeRF) in novel view synthesis has inspired researchers to propose neural implicit scene reconstruction. However, most existing neural implicit reconstruction methods optimize per-scene parameters and therefore lack generalizability to new scenes. We introduce VolRecon, a novel generalizable implicit reconstruction method with Signed Ray Distance Function (SRDF). To reconstruct the scene with fine details and little noise, VolRecon combines projection features aggregated from multi-view features, and volume features interpolated from a coarse global feature volume. Using a ray transformer, we compute SRDF values of sampled points on a ray and then render color and depth. On the DTU dataset, VolRecon outperforms SparseNeuS by about 30% in sparse view reconstruction and achieves accuracy comparable to MVSNet in full view reconstruction. Furthermore, our approach exhibits good generalization performance on the large-scale ETH3D benchmark.

Submitted 3 April, 2023; v1 submitted 15 December, 2022; originally announced December 2022.

arXiv:2211.11417 [pdf, other]

DyNCA: Real-time Dynamic Texture Synthesis Using Neural Cellular Automata

Authors: Ehsan Pajouheshgar, Yitao Xu, Tong Zhang, Sabine Süsstrunk

Abstract: Current Dynamic Texture Synthesis (DyTS) models can synthesize realistic videos. However, they require a slow iterative optimization process to synthesize a single fixed-size short video, and they do not offer any post-training control over the synthesis process. We propose Dynamic Neural Cellular Automata (DyNCA), a framework for real-time and controllable dynamic texture synthesis. Our method is built upon the recently introduced NCA models and can synthesize infinitely long and arbitrary-sized realistic video textures in real time. We quantitatively and qualitatively evaluate our model and show that our synthesized videos appear more realistic than the existing results. We improve the SOTA DyTS performance by $2\sim 4$ orders of magnitude. Moreover, our model offers several real-time video controls including motion speed, motion direction, and an editing brush tool. We exhibit our trained models in an online interactive demo that runs on local hardware and is accessible on personal computers and smartphones.

Submitted 30 March, 2023; v1 submitted 21 November, 2022; originally announced November 2022.

Comments: Link to the demo: https://dynca.github.io/

arXiv:2210.04866 [pdf, other]

doi 10.1109/LSP.2022.3227522

PoGaIN: Poisson-Gaussian Image Noise Modeling from Paired Samples

Authors: Nicolas Bähler, Majed El Helou, Étienne Objois, Kaan Okumuş, Sabine Süsstrunk

Abstract: Image noise can often be accurately fitted to a Poisson-Gaussian distribution. However, estimating the distribution parameters from a noisy image only is a challenging task. Here, we study the case when paired noisy and noise-free samples are accessible. No method is currently available to exploit the noise-free information, which may help to achieve more accurate estimations. To fill this gap, we derive a novel, cumulant-based, approach for Poisson-Gaussian noise modeling from paired image samples. We show its improved performance over different baselines, with special emphasis on MSE, effect of outliers, image dependence, and bias. We additionally derive the log-likelihood function for further insights and discuss real-world applicability.

Submitted 19 December, 2022; v1 submitted 10 October, 2022; originally announced October 2022.

Comments: 5 pages, 4 figures, and 3 tables. Code is available at https://github.com/IVRL/PoGaIN

arXiv:2208.12327 [pdf, other]

DSR: Towards Drone Image Super-Resolution

Authors: Xiaoyu Lin, Baran Ozaydin, Vidit Vidit, Majed El Helou, Sabine Süsstrunk

Abstract: Despite achieving remarkable progress in recent years, single-image super-resolution methods are developed with several limitations. Specifically, they are trained on fixed content domains with certain degradations (whether synthetic or real). The priors they learn are prone to overfitting the training configuration. Therefore, the generalization to novel domains such as drone top view data, and across altitudes, is currently unknown. Nonetheless, pairing drones with proper image super-resolution is of great value. It would enable drones to fly higher covering larger fields of view, while maintaining a high image quality. To answer these questions and pave the way towards drone image super-resolution, we explore this application with particular focus on the single-image case. We propose a novel drone image dataset, with scenes captured at low and high resolutions, and across a span of altitudes. Our results show that off-the-shelf state-of-the-art networks witness a significant drop in performance on this different domain. We additionally show that simple fine-tuning, and incorporating altitude awareness into the network's architecture, both improve the reconstruction performance.

Submitted 25 August, 2022; originally announced August 2022.

Comments: Accepted at ECCVW 2022

arXiv:2206.02417 [pdf, other]

Fast Adversarial Training with Adaptive Step Size

Authors: Zhichao Huang, Yanbo Fan, Chen Liu, Weizhong Zhang, Yong Zhang, Mathieu Salzmann, Sabine Süsstrunk, Jue Wang

Abstract: While adversarial training and its variants have been shown to be the most effective algorithms to defend against adversarial attacks, their extremely slow training process makes it hard to scale to large datasets like ImageNet. The key idea of recent works to accelerate adversarial training is to substitute multi-step attacks (e.g., PGD) with single-step attacks (e.g., FGSM). However, these single-step methods suffer from catastrophic overfitting, where the accuracy against PGD attack suddenly drops to nearly 0% during training, destroying the robustness of the networks. In this work, we study the phenomenon from the perspective of training instances. We show that catastrophic overfitting is instance-dependent and fitting instances with larger gradient norm is more likely to cause catastrophic overfitting. Based on our findings, we propose a simple but effective method, Adversarial Training with Adaptive Step size (ATAS). ATAS learns an instancewise adaptive step size that is inversely proportional to its gradient norm. The theoretical analysis shows that ATAS converges faster than the commonly adopted non-adaptive counterparts. Empirically, ATAS consistently mitigates catastrophic overfitting and achieves higher robust accuracy on CIFAR10, CIFAR100 and ImageNet when evaluated on various adversarial budgets.

Submitted 6 June, 2022; originally announced June 2022.

arXiv:2205.08303 [pdf, other]

MulT: An End-to-End Multitask Learning Transformer

Authors: Deblina Bhattacharjee, Tong Zhang, Sabine Süsstrunk, Mathieu Salzmann

Abstract: We propose an end-to-end Multitask Learning Transformer framework, named MulT, to simultaneously learn multiple high-level vision tasks, including depth estimation, semantic segmentation, reshading, surface normal estimation, 2D keypoint detection, and edge detection. Based on the Swin transformer model, our framework encodes the input image into a shared representation and makes predictions for each vision task using task-specific transformer-based decoder heads. At the heart of our approach is a shared attention mechanism modeling the dependencies across the tasks. We evaluate our model on several multitask benchmarks, showing that our MulT framework outperforms both the state-of-the-art multitask convolutional neural network models and all the respective single task transformer models. Our experiments further highlight the benefits of sharing attention across all the tasks, and demonstrate that our MulT model is robust and generalizes well to new domains. Our project website is at https://ivrl.github.io/MulT/.

Submitted 17 May, 2022; originally announced May 2022.

Comments: Accepted to CVPR 2022

arXiv:2203.17205 [pdf, other]

Leverage Your Local and Global Representations: A New Self-Supervised Learning Strategy

Authors: Tong Zhang, Congpei Qiu, Wei Ke, Sabine Süsstrunk, Mathieu Salzmann

Abstract: Self-supervised learning (SSL) methods aim to learn view-invariant representations by maximizing the similarity between the features extracted from different crops of the same image regardless of cropping size and content. In essence, this strategy ignores the fact that two crops may truly contain different image information, e.g., background and small objects, and thus tends to restrain the diversity of the learned representations. In this work, we address this issue by introducing a new self-supervised learning strategy, LoGo, that explicitly reasons about Local and Global crops. To achieve view invariance, LoGo encourages similarity between global crops from the same image, as well as between a global and a local crop. However, to correctly encode the fact that the content of smaller crops may differ entirely, LoGo promotes two local crops to have dissimilar representations, while being close to global crops. Our LoGo strategy can easily be applied to existing SSL methods. Our extensive experiments on a variety of datasets and using different self-supervised learning frameworks validate its superiority over existing approaches. Noticeably, we achieve better results than supervised models on transfer learning when using only 1/10 of the data.

Submitted 13 April, 2022; v1 submitted 31 March, 2022; originally announced March 2022.

Comments: accepted in CVPR 2022

arXiv:2203.03949 [pdf, other]

RC-MVSNet: Unsupervised Multi-View Stereo with Neural Rendering

Authors: Di Chang, Aljaž Božič, Tong Zhang, Qingsong Yan, Yingcong Chen, Sabine Süsstrunk, Matthias Nießner

Abstract: Finding accurate correspondences among different views is the Achilles' heel of unsupervised Multi-View Stereo (MVS). Existing methods are built upon the assumption that corresponding pixels share similar photometric features. However, multi-view images in real scenarios observe non-Lambertian surfaces and experience occlusions. In this work, we propose a novel approach with neural rendering (RC-MVSNet) to solve such ambiguity issues of correspondences among views. Specifically, we impose a depth rendering consistency loss to constrain the geometry features close to the object surface to alleviate occlusions. Concurrently, we introduce a reference view synthesis loss to generate consistent supervision, even for non-Lambertian surfaces. Extensive experiments on DTU and Tanks&Temples benchmarks demonstrate that our RC-MVSNet approach achieves state-of-the-art performance over unsupervised MVS frameworks and competitive performance to many supervised methods. The code is released at https://github.com/Boese0601/RC-MVSNet

Submitted 21 August, 2022; v1 submitted 8 March, 2022; originally announced March 2022.

Comments: Accepted by ECCV 2022, Project Page: https://boese0601.github.io/rc-mvsnet/

arXiv:2202.01341 [pdf, other]

Robust Binary Models by Pruning Randomly-initialized Networks

Authors: Chen Liu, Ziqi Zhao, Sabine Süsstrunk, Mathieu Salzmann

Abstract: Robustness to adversarial attacks was shown to require a larger model capacity, and thus a larger memory footprint. In this paper, we introduce an approach to obtain robust yet compact models by pruning randomly-initialized binary networks. Unlike adversarial training, which learns the model parameters, we initialize the model parameters as either +1 or -1, keep them fixed, and find a subnetwork structure that is robust to attacks. Our method confirms the Strong Lottery Ticket Hypothesis in the presence of adversarial attacks, and extends this to binary networks. Furthermore, it yields more compact networks with competitive performance than existing works by 1) adaptively pruning different network layers; 2) exploiting an effective binary initialization scheme; 3) incorporating a last batch normalization layer to improve training stability. Our experiments demonstrate that our approach not only always outperforms the state-of-the-art robust binary networks, but also can achieve accuracy better than full-precision ones on some datasets. Finally, we show the structured patterns of our pruned binary networks.

Submitted 15 October, 2022; v1 submitted 2 February, 2022; originally announced February 2022.

Comments: Accepted as NeurIPS 2022 paper

arXiv:2201.00429 [pdf, other]

Image Denoising with Control over Deep Network Hallucination

Authors: Qiyuan Liang, Florian Cassayre, Haley Owsianko, Majed El Helou, Sabine Süsstrunk

Abstract: Deep image denoisers achieve state-of-the-art results but with a hidden cost. As witnessed in recent literature, these deep networks are capable of overfitting their training distributions, causing inaccurate hallucinations to be added to the output and generalizing poorly to varying data. For better control and interpretability over a deep denoiser, we propose a novel framework exploiting a denoising network. We call it controllable confidence-based image denoising (CCID). In this framework, we exploit the outputs of a deep denoising network alongside an image convolved with a reliable filter. Such a filter can be a simple convolution kernel which does not risk adding hallucinated information. We propose to fuse the two components with a frequency-domain approach that takes into account the reliability of the deep network outputs. With our framework, the user can control the fusion of the two components in the frequency domain. We also provide a user-friendly map estimating spatially the confidence in the output that potentially contains network hallucination. Results show that our CCID not only provides more interpretability and control, but can even outperform both the quantitative performance of the deep denoiser and that of the reliable filter, especially when the test data diverge from the training data.

Submitted 2 January, 2022; originally announced January 2022.

Comments: Published in Electronic Imaging 2022, code available at https://github.com/IVRL/CCID

arXiv:2112.07324 [pdf, other]

On the Impact of Hard Adversarial Instances on Overfitting in Adversarial Training

Authors: Chen Liu, Zhichao Huang, Mathieu Salzmann, Tong Zhang, Sabine Süsstrunk

Abstract: Adversarial training is a popular method to robustify models against adversarial attacks. However, it exhibits much more severe overfitting than training on clean inputs. In this work, we investigate this phenomenon from the perspective of training instances, i.e., training input-target pairs. Based on a quantitative metric measuring instances' difficulty, we analyze the model's behavior on training instances of different difficulty levels. This lets us show that the decay in generalization performance of adversarial training is a result of the model's attempt to fit hard adversarial instances. We theoretically verify our observations for both linear and general nonlinear models, proving that models trained on hard instances have worse generalization performance than ones trained on easy instances. Furthermore, we prove that the difference in the generalization gap between models trained by instances of different difficulty levels increases with the size of the adversarial budget. Finally, we conduct case studies on methods mitigating adversarial overfitting in several scenarios. Our analysis shows that methods successfully mitigating adversarial overfitting all avoid fitting hard adversarial instances, while ones fitting hard adversarial instances do not achieve true robustness.

Submitted 14 December, 2021; originally announced December 2021.

arXiv:2111.12583 [pdf, other]

Optimizing Latent Space Directions For GAN-based Local Image Editing

Authors: Ehsan Pajouheshgar, Tong Zhang, Sabine Süsstrunk

Abstract: Generative Adversarial Network (GAN) based localized image editing can suffer from ambiguity between semantic attributes. We thus present a novel objective function to evaluate the locality of an image edit. By introducing the supervision from a pre-trained segmentation network and optimizing the objective function, our framework, called Locally Effective Latent Space Direction (LELSD), is applicable to any dataset and GAN architecture. Our method is also computationally fast and exhibits a high extent of disentanglement, which allows users to interactively perform a sequence of edits on an image. Our experiments on both GAN-generated and real images qualitatively demonstrate the high quality and advantages of our method.

Submitted 17 February, 2022; v1 submitted 24 November, 2021; originally announced November 2021.

Comments: 4 pages, 5 figures, 1 table

arXiv:2110.03575 [pdf, other]

Estimating Image Depth in the Comics Domain

Authors: Deblina Bhattacharjee, Martin Everaert, Mathieu Salzmann, Sabine Süsstrunk

Abstract: Estimating the depth of comics images is challenging as such images a) are monocular; b) lack ground-truth depth annotations; c) differ across different artistic styles; d) are sparse and noisy. We thus use an off-the-shelf unsupervised image-to-image translation method to translate the comics images to natural ones and then use an attention-guided monocular depth estimator to predict their depth. This lets us leverage the depth annotations of existing natural images to train the depth estimator. Furthermore, our model learns to distinguish between text and images in the comics panels to reduce text-based artefacts in the depth estimates. Our method consistently outperforms the existing state-of-the-art approaches across all metrics on both the DCM and eBDtheque images. Finally, we introduce a dataset to evaluate depth prediction on comics. Our project website can be accessed at https://github.com/IVRL/ComicsDepth.

Submitted 15 August, 2022; v1 submitted 7 October, 2021; originally announced October 2021.

Comments: Accepted to WACV 2022 : Winter Conference on Applications of Computer Vision

arXiv:2106.00673 [pdf, other]

doi 10.1109/LSP.2021.3104769

Fidelity Estimation Improves Noisy-Image Classification With Pretrained Networks

Authors: Xiaoyu Lin, Deblina Bhattacharjee, Majed El Helou, Sabine Süsstrunk

Abstract: Image classification has significantly improved using deep learning. This is mainly due to convolutional neural networks (CNNs) that are capable of learning rich feature extractors from large datasets. However, most deep learning classification methods are trained on clean images and are not robust when handling noisy ones, even if a restoration preprocessing step is applied. While novel methods address this problem, they rely on modified feature extractors and thus necessitate retraining. We instead propose a method that can be applied on a pretrained classifier. Our method exploits a fidelity map estimate that is fused into the internal representations of the feature extractor, thereby guiding the attention of the network and making it more robust to noisy data. We improve the noisy-image classification (NIC) results by significantly large margins, especially at high noise levels, and come close to the fully retrained approaches. Furthermore, as proof of concept, we show that when using our oracle fidelity map we even outperform the fully retrained methods, whether trained on noisy or restored images.

Submitted 4 October, 2021; v1 submitted 1 June, 2021; originally announced June 2021.

Comments: Published in IEEE Signal Processing Letters

Journal ref: IEEE Signal Processing Letters 28 (2021) 1719 - 1723

arXiv:2104.13365 [pdf, other]

NTIRE 2021 Depth Guided Image Relighting Challenge

Authors: Majed El Helou, Ruofan Zhou, Sabine Susstrunk, Radu Timofte

Abstract: Image relighting is attracting increasing interest due to its various applications. From a research perspective, image relighting can be exploited to conduct both image normalization for domain adaptation, and also for data augmentation. It also has multiple direct uses for photo montage and aesthetic enhancement. In this paper, we review the NTIRE 2021 depth guided image relighting challenge. We rely on the VIDIT dataset for each of our two challenge tracks, including depth information. The first track is on one-to-one relighting where the goal is to transform the illumination setup of an input image (color temperature and light source position) to the target illumination setup. In the second track, the any-to-any relighting challenge, the objective is to transform the illumination settings of the input image to match those of another guide image, similar to style transfer. In both tracks, participants were given depth information about the captured scenes. We had nearly 250 registered participants, leading to 18 confirmed team submissions in the final competition stage. The competitions, methods, and final results are presented in this paper.

Submitted 27 April, 2021; originally announced April 2021.

Comments: Code and data available on https://github.com/majedelhelou/VIDIT

Journal ref: IEEE Conference on Computer Vision and Pattern Recognition Workshops 2021

arXiv:2104.03864 [pdf, other]

Modeling Object Dissimilarity for Deep Saliency Prediction

Authors: Bahar Aydemir, Deblina Bhattacharjee, Tong Zhang, Seungryong Kim, Mathieu Salzmann, Sabine Süsstrunk

Abstract: Saliency prediction has made great strides over the past two decades, with current techniques modeling low-level information, such as color, intensity and size contrasts, and high-level ones, such as attention and gaze direction for entire objects. Despite this, these methods fail to account for the dissimilarity between objects, which affects human visual attention. In this paper, we introduce a detection-guided saliency prediction network that explicitly models the differences between multiple objects, such as their appearance and size dissimilarities. Our approach allows us to fuse our object dissimilarities with features extracted by any deep saliency prediction network. As evidenced by our experiments, this consistently boosts the accuracy of the baseline networks, enabling us to outperform the state-of-the-art models on three saliency benchmarks, namely SALICON, MIT300 and CAT2000. Our project page is at https://github.com/IVRL/DisSal.

Submitted 24 November, 2022; v1 submitted 8 April, 2021; originally announced April 2021.

Comments: Transactions on Machine Learning Research (TMLR 2022) https://openreview.net/forum?id=NmTMc3uD1G

arXiv:2101.04631 [pdf, other]

Deep Gaussian Denoiser Epistemic Uncertainty and Decoupled Dual-Attention Fusion

Authors: Xiaoqi Ma, Xiaoyu Lin, Majed El Helou, Sabine Süsstrunk

Abstract: Following the performance breakthrough of denoising networks, improvements have come chiefly through novel architecture designs and increased depth. While novel denoising networks were designed for real images coming from different distributions, or for specific applications, comparatively small improvement was achieved on Gaussian denoising. The denoising solutions suffer from epistemic uncertainty that can limit further advancements. This uncertainty is traditionally mitigated through different ensemble approaches. However, such ensembles are prohibitively costly with deep networks, which are already large in size. Our work focuses on pushing the performance limits of state-of-the-art methods on Gaussian denoising. We propose a model-agnostic approach for reducing epistemic uncertainty while using only a single pretrained network. We achieve this by tapping into the epistemic uncertainty through augmented and frequency-manipulated images to obtain denoised images with varying error. We propose an ensemble method with two decoupled attention paths, over the pixel domain and over that of our different manipulations, to learn the final fusion. Our results significantly improve over the state-of-the-art baselines and across varying noise levels.

Submitted 31 May, 2021; v1 submitted 12 January, 2021; originally announced January 2021.

Comments: Code and models are publicly available on https://github.com/IVRL/DEU

arXiv:2011.01406 [pdf, other]

BIGPrior: Towards Decoupling Learned Prior Hallucination and Data Fidelity in Image Restoration

Authors: Majed El Helou, Sabine Süsstrunk

Abstract: Classic image-restoration algorithms use a variety of priors, either implicitly or explicitly. Their priors are hand-designed and their corresponding weights are heuristically assigned. Hence, deep learning methods often produce superior image restoration quality. Deep networks are, however, capable of inducing strong and hardly predictable hallucinations. Networks implicitly learn to be jointly faithful to the observed data while learning an image prior; and the separation of original data and hallucinated data downstream is then not possible. This limits their wide-spread adoption in image restoration. Furthermore, it is often the hallucinated part that is victim to degradation-model overfitting. We present an approach with decoupled network-prior based hallucination and data fidelity terms. We refer to our framework as the Bayesian Integration of a Generative Prior (BIGPrior). Our method is rooted in a Bayesian framework and tightly connected to classic restoration methods. In fact, it can be viewed as a generalization of a large family of classic restoration algorithms. We use network inversion to extract image prior information from a generative network. We show that, on image colorization, inpainting and denoising, our framework consistently improves the inversion results. Our method, though partly reliant on the quality of the generative network inversion, is competitive with state-of-the-art supervised and task-specific restoration methods. It also provides an additional metric that sets forth the degree of prior reliance per pixel relative to data fidelity.

Submitted 8 January, 2022; v1 submitted 2 November, 2020; originally announced November 2020.

Comments: IEEE TIP 2022. Code available on https://github.com/majedelhelou/BIGPrior. Main change relative to v1: added Table VI and computation times

arXiv:2009.12798 [pdf, other]

AIM 2020: Scene Relighting and Illumination Estimation Challenge

Authors: Majed El Helou, Ruofan Zhou, Sabine Süsstrunk, Radu Timofte, Mahmoud Afifi, Michael S. Brown, Kele Xu, Hengxing Cai, Yuzhong Liu, Li-Wen Wang, Zhi-Song Liu, Chu-Tak Li, Sourya Dipta Das, Nisarg A. Shah, Akashdeep Jassal, Tongtong Zhao, Shanshan Zhao, Sabari Nathan, M. Parisa Beham, R. Suganya, Qing Wang, Zhongyun Hu, Xin Huang, Yaning Li, Maitreya Suin, et al. (12 additional authors not shown)

Abstract: We review the AIM 2020 challenge on virtual image relighting and illumination estimation. This paper presents the novel VIDIT dataset used in the challenge and the different proposed solutions and final evaluation results over the 3 challenge tracks. The first track considered one-to-one relighting; the objective was to relight an input photo of a scene with a different color temperature and illuminant orientation (i.e., light source position). The goal of the second track was to estimate illumination settings, namely the color temperature and orientation, from a given image. Lastly, the third track dealt with any-to-any relighting, thus a generalization of the first track. The target color temperature and orientation, rather than being pre-determined, are instead given by a guide image. Participants were allowed to make use of their track 1 and 2 solutions for track 3. The tracks had 94, 52, and 56 registered participants, respectively, leading to 20 confirmed submissions in the final competition stage.

Submitted 27 September, 2020; originally announced September 2020.

Comments: ECCVW 2020. Data and more information on https://github.com/majedelhelou/VIDIT

arXiv:2007.09433 [pdf, other]

Volumetric Transformer Networks

Authors: Seungryong Kim, Sabine Süsstrunk, Mathieu Salzmann

Abstract: Existing techniques to encode spatial invariance within deep convolutional neural networks (CNNs) apply the same warping field to all the feature channels. This does not account for the fact that the individual feature channels can represent different semantic parts, which can undergo different spatial transformations w.r.t. a canonical configuration. To overcome this limitation, we introduce a learnable module, the volumetric transformer network (VTN), that predicts channel-wise warping fields so as to reconfigure intermediate CNN features spatially and channel-wisely. We design our VTN as an encoder-decoder network, with modules dedicated to letting the information flow across the feature channels, to account for the dependencies between the semantic parts. We further propose a loss function defined between the warped features of pairs of instances, which improves the localization ability of VTN. Our experiments show that VTN consistently boosts the features' representation power and consequently the networks' accuracy on fine-grained image recognition and instance-level image retrieval.

Submitted 18 July, 2020; originally announced July 2020.

Comments: ECCV 2020

arXiv:2006.08403 [pdf, other]

On the Loss Landscape of Adversarial Training: Identifying Challenges and How to Overcome Them

Authors: Chen Liu, Mathieu Salzmann, Tao Lin, Ryota Tomioka, Sabine Süsstrunk

Abstract: We analyze the influence of adversarial training on the loss landscape of machine learning models. To this end, we first provide analytical studies of the properties of adversarial loss functions under different adversarial budgets. We then demonstrate that the adversarial loss landscape is less favorable to optimization, due to increased curvature and more scattered gradients. Our conclusions are validated by numerical analyses, which show that training under large adversarial budgets impedes the escape from suboptimal random initialization, causes non-vanishing gradients and makes the model find sharper minima. Based on these observations, we show that a periodic adversarial scheduling (PAS) strategy can effectively overcome these challenges, yielding better results than vanilla adversarial training while being much less sensitive to the choice of learning rate.

Submitted 2 November, 2020; v1 submitted 15 June, 2020; originally announced June 2020.

arXiv:2005.05460 [pdf, other]

VIDIT: Virtual Image Dataset for Illumination Transfer

Authors: Majed El Helou, Ruofan Zhou, Johan Barthas, Sabine Süsstrunk

Abstract: Deep image relighting is gaining more interest lately, as it allows photo enhancement through illumination-specific retouching without human effort. Aside from aesthetic enhancement and photo montage, image relighting is valuable for domain adaptation, whether to augment datasets for training or to normalize input test data. Accurate relighting is, however, very challenging for various reasons, such as the difficulty in removing and recasting shadows and the modeling of different surfaces. We present a novel dataset, the Virtual Image Dataset for Illumination Transfer (VIDIT), in an effort to create a reference evaluation benchmark and to push forward the development of illumination manipulation methods. Virtual datasets are not only an important step towards achieving real-image performance but have also proven capable of improving training even when real datasets are possible to acquire and available. VIDIT contains 300 virtual scenes used for training, where every scene is captured 40 times in total: from 8 equally-spaced azimuthal angles, each lit with 5 different illuminants.

Submitted 13 May, 2020; v1 submitted 11 May, 2020; originally announced May 2020.

Comments: For further information and data, see https://github.com/majedelhelou/VIDIT

arXiv:2004.14367 [pdf, other]

Editing in Style: Uncovering the Local Semantics of GANs

Authors: Edo Collins, Raja Bala, Bob Price, Sabine Süsstrunk

Abstract: While the quality of GAN image synthesis has improved tremendously in recent years, our ability to control and condition the output is still limited. Focusing on StyleGAN, we introduce a simple and effective method for making local, semantically-aware edits to a target output image. This is accomplished by borrowing elements from a source image, also a GAN output, via a novel manipulation of style vectors. Our method requires neither supervision from an external model, nor involves complex spatial morphing operations. Instead, it relies on the emergent disentanglement of semantic objects that is learned by StyleGAN during its training. Semantic editing is demonstrated on GANs producing human faces, indoor scenes, cats, and cars. We measure the locality and photorealism of the edits produced by our method, and find that it accomplishes both.

Submitted 21 May, 2020; v1 submitted 29 April, 2020; originally announced April 2020.

Comments: IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2020. Code: https://github.com/IVRL/GANLocalEditing

arXiv:2004.06409 [pdf, other]

Divergence-Based Adaptive Extreme Video Completion

Authors: Majed El Helou, Ruofan Zhou, Frank Schmutz, Fabrice Guibert, Sabine Süsstrunk

Abstract: Extreme image or video completion, where, for instance, we only retain 1% of pixels in random locations, allows for very cheap sampling in terms of the required pre-processing. The consequence is, however, a reconstruction that is challenging for humans and inpainting algorithms alike. We propose an extension of a state-of-the-art extreme image completion algorithm to extreme video completion. We analyze a color-motion estimation approach based on color KL-divergence that is suitable for extremely sparse scenarios. Our algorithm leverages the estimate to adapt between its spatial and temporal filtering when reconstructing the sparse randomly-sampled video. We validate our results on 50 publicly-available videos using reconstruction PSNR and mean opinion scores.

Submitted 14 April, 2020; originally announced April 2020.

Journal ref: IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP 2020)

arXiv:2003.08514 [pdf, other]

doi 10.1049/iet-ipr.2019.0787

Evaluating Salient Object Detection in Natural Images with Multiple Objects having Multi-level Saliency

Authors: Gökhan Yildirim, Debashis Sen, Mohan Kankanhalli, Sabine Süsstrunk

Abstract: Salient object detection is evaluated using binary ground truth with the labels being salient object class and background. In this paper, we corroborate based on three subjective experiments on a novel image dataset that objects in natural images are inherently perceived to have varying levels of importance. Our dataset, named SalMoN (saliency in multi-object natural images), has 588 images containing multiple objects. The subjective experiments performed record spontaneous attention and perception through eye fixation duration, point clicking and rectangle drawing. As object saliency in a multi-object image is inherently multi-level, we propose that salient object detection must be evaluated for the capability to detect all multi-level salient objects apart from the salient object class detection capability. For this purpose, we generate multi-level maps as ground truth corresponding to all the dataset images using the results of the subjective experiments, with the labels being multi-level salient objects and background. We then propose the use of mean absolute error, Kendall's rank correlation and average area under precision-recall curve to evaluate existing salient object detection methods on our multi-level saliency ground truth dataset. Approaches that represent saliency detection on images as local-global hierarchical processing of a graph perform well in our dataset.

Submitted 18 March, 2020; originally announced March 2020.

Comments: Accepted Article

Journal ref: IET Image Processing, 2019

arXiv:2003.07119 [pdf, other]

Stochastic Frequency Masking to Improve Super-Resolution and Denoising Networks

Authors: Majed El Helou, Ruofan Zhou, Sabine Süsstrunk

Abstract: Super-resolution and denoising are ill-posed yet fundamental image restoration tasks. In blind settings, the degradation kernel or the noise level are unknown. This makes restoration even more challenging, notably for learning-based methods, as they tend to overfit to the degradation seen during training. We present an analysis, in the frequency domain, of degradation-kernel overfitting in super-resolution and introduce a conditional learning perspective that extends to both super-resolution and denoising. Building on our formulation, we propose a stochastic frequency masking of images used in training to regularize the networks and address the overfitting problem. Our technique improves state-of-the-art methods on blind super-resolution with different synthetic kernels, real super-resolution, blind Gaussian denoising, and real-image denoising.

Submitted 23 July, 2020; v1 submitted 16 March, 2020; originally announced March 2020.

Comments: ECCV 2020. Project page: https://github.com/majedelhelou/SFM

arXiv:2003.05961 [pdf, other]

W2S: Microscopy Data with Joint Denoising and Super-Resolution for Widefield to SIM Mapping

Authors: Ruofan Zhou, Majed El Helou, Daniel Sage, Thierry Laroche, Arne Seitz, Sabine Süsstrunk

Abstract: In fluorescence microscopy live-cell imaging, there is a critical trade-off between the signal-to-noise ratio and spatial resolution on one side, and the integrity of the biological sample on the other side. To obtain clean high-resolution (HR) images, one can either use microscopy techniques, such as structured-illumination microscopy (SIM), or apply denoising and super-resolution (SR) algorithms. However, the former option requires multiple shots that can damage the samples, and although efficient deep learning based algorithms exist for the latter option, no benchmark exists to evaluate these algorithms on the joint denoising and SR (JDSR) tasks. To study JDSR on microscopy data, we propose such a novel JDSR dataset, Widefield2SIM (W2S), acquired using a conventional fluorescence widefield and SIM imaging. W2S includes 144,000 real fluorescence microscopy images, resulting in a total of 360 sets of images. A set is comprised of noisy low-resolution (LR) widefield images with different noise levels, a noise-free LR image, and a corresponding high-quality HR SIM image. W2S allows us to benchmark the combinations of 6 denoising methods and 6 SR methods. We show that state-of-the-art SR networks perform very poorly on noisy inputs. Our evaluation also reveals that applying the best denoiser in terms of reconstruction error followed by the best SR method does not necessarily yield the best final result. Both quantitative and qualitative results show that SR networks are sensitive to noise and the sequential application of denoising and SR algorithms is sub-optimal. Lastly, we demonstrate that SR networks retrained end-to-end for JDSR outperform any combination of state-of-the-art deep denoising and SR networks.

Submitted 24 August, 2020; v1 submitted 12 March, 2020; originally announced March 2020.

Comments: ECCVW 2020. Project page: https://github.com/ivrl/w2s

arXiv:2003.03633 [pdf, other]

AL2: Progressive Activation Loss for Learning General Representations in Classification Neural Networks

Authors: Majed El Helou, Frederike Dümbgen, Sabine Süsstrunk

Abstract: The large capacity of neural networks enables them to learn complex functions. To avoid overfitting, networks however require a lot of training data that can be expensive and time-consuming to collect. A common practical approach to attenuate overfitting is the use of network regularization techniques. We propose a novel regularization method that progressively penalizes the magnitude of activations during training. The combined activation signals produced by all neurons in a given layer form the representation of the input image in that feature space. We propose to regularize this representation in the last feature layer before classification layers. Our method's effect on generalization is analyzed with label randomization tests and cumulative ablations. Experimental results show the advantages of our approach in comparison with commonly-used regularizers on standard benchmark datasets.

Submitted 7 March, 2020; originally announced March 2020.

Journal ref: IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP 2020)

arXiv:1912.09299 [pdf, other]

Image Restoration using Plug-and-Play CNN MAP Denoisers

Authors: Siavash Bigdeli, David Honzátko, Sabine Süsstrunk, L. Andrea Dunbar

Abstract: Plug-and-play denoisers can be used to perform generic image restoration tasks independent of the degradation type. These methods build on the fact that the Maximum a Posteriori (MAP) optimization can be solved using smaller sub-problems, including a MAP denoising optimization. We present the first end-to-end approach to MAP estimation for image denoising using deep neural networks. We show that our method is guaranteed to minimize the MAP denoising objective, which is then used in an optimization algorithm for generic image restoration. We provide theoretical analysis of our approach and show the quantitative performance of our method in several experiments. Our experimental results show that the proposed method can achieve 70x faster performance compared to the state-of-the-art, while maintaining the theoretical perspective of MAP.

Submitted 20 December, 2019; v1 submitted 18 December, 2019; originally announced December 2019.

Comments: Code and models available at https://github.com/DawyD/cnn-map-denoiser . Accepted for publication in VISAPP 2020

arXiv:1912.04792 [pdf, other]

Training Provably Robust Models by Polyhedral Envelope Regularization

Authors: Chen Liu, Mathieu Salzmann, Sabine Süsstrunk

Abstract: Training certifiable neural networks enables one to obtain models with robustness guarantees against adversarial attacks. In this work, we introduce a framework to bound the adversary-free region in the neighborhood of the input data by a polyhedral envelope, which yields finer-grained certified robustness. We further introduce polyhedral envelope regularization (PER) to encourage larger polyhedral envelopes and thus improve the provable robustness of the models. We demonstrate the flexibility and effectiveness of our framework on standard benchmarks; it applies to networks of different architectures and general activation functions. Compared with the state-of-the-art methods, PER has very little computational overhead and better robustness guarantees without over-regularizing the model.

Submitted 20 September, 2021; v1 submitted 10 December, 2019; originally announced December 2019.

Journal ref: IEEE Transactions on Neural Networks and Learning Systems 2021

arXiv:1907.03029 [pdf]

doi 10.1109/TIP.2020.2976814

Blind Universal Bayesian Image Denoising with Gaussian Noise Level Learning

Authors: Majed El Helou, Sabine Süsstrunk

Abstract: Blind and universal image denoising consists of using a unique model that denoises images with any level of noise. It is especially practical as noise levels do not need to be known when the model is developed or at test time. We propose a theoretically-grounded blind and universal deep learning image denoiser for additive Gaussian noise removal. Our network is based on an optimal denoising solution, which we call fusion denoising. It is derived theoretically with a Gaussian image prior assumption. Synthetic experiments show our network's generalization strength to unseen additive noise levels. We also adapt the fusion denoising network architecture for image denoising on real images. Our approach improves real-world grayscale additive image denoising PSNR results for training noise levels and further on noise levels not seen during training. It also improves state-of-the-art color image denoising performance on every single noise level, by an average of 0.1dB, whether trained on or not.

Submitted 7 March, 2020; v1 submitted 5 July, 2019; originally announced July 2019.

Comments: Final uncompressed TIP version available online in open access (DOI attached)

ACM Class: I.4.4

Journal ref: IEEE Transactions on Image Processing, vol. 29, pp. 4885-4897, 2020

arXiv:1905.08214 [pdf, other]

Drone Shadow Tracking

Authors: Xiaoyan Zou, Ruofan Zhou, Majed El Helou, Sabine Süsstrunk

Abstract: Aerial videos taken by a drone not too far above the surface may contain the drone's shadow projected on the scene. This deteriorates the aesthetic quality of videos. With the presence of other shadows, shadow removal cannot be directly applied, and the shadow of the drone must be tracked. Tracking a drone's shadow in a video is, however, challenging. The varying size, shape, change of orientation and drone altitude pose difficulties. The shadow can also easily disappear over dark areas. However, a shadow has specific properties that can be leveraged, besides its geometric shape. In this paper, we incorporate knowledge of the shadow's physical properties, in the form of shadow detection masks, into a correlation-based tracking algorithm. We capture a test set of aerial videos taken with different settings and compare our results to those of a state-of-the-art tracking algorithm.

Submitted 20 May, 2019; originally announced May 2019.

Comments: 5 pages, 4 figures

arXiv:1905.03590 [pdf, other]

Fast and Efficient Zero-Learning Image Fusion

Authors: Fayez Lahoud, Sabine Süsstrunk

Abstract: We propose a real-time image fusion method using pre-trained neural networks. Our method generates a single image containing features from multiple sources. We first decompose images into a base layer representing large scale intensity variations, and a detail layer containing small scale changes. We use visual saliency to fuse the base layers, and deep feature maps extracted from a pre-trained neural network to fuse the detail layers. We conduct ablation studies to analyze our method's parameters such as decomposition filters, weight construction methods, and network depth and architecture. Then, we validate its effectiveness and speed on thermal, medical, and multi-focus fusion. We also apply it to multiple image inputs such as multi-exposure sequences. The experimental results demonstrate that our technique achieves state-of-the-art performance in visual quality, objective assessment, and runtime efficiency.

Submitted 9 May, 2019; originally announced May 2019.

Comments: 13 pages, 10 figures
