Search | arXiv e-print repository

arXiv:2405.20469 [pdf, other]

Is Synthetic Data all We Need? Benchmarking the Robustness of Models Trained with Synthetic Images

Authors: Krishnakant Singh, Thanush Navaratnam, Jannik Holmer, Simone Schaub-Meyer, Stefan Roth

Abstract: A long-standing challenge in develo** machine learning approaches has been the lack of high-quality labeled data. Recently, models trained with purely synthetic data, here termed synthetic clones, generated using large-scale pre-trained diffusion models have shown promising results in overcoming this annotation bottleneck. As these synthetic clone models progress, they are likely to be deployed… ▽ More A long-standing challenge in develo** machine learning approaches has been the lack of high-quality labeled data. Recently, models trained with purely synthetic data, here termed synthetic clones, generated using large-scale pre-trained diffusion models have shown promising results in overcoming this annotation bottleneck. As these synthetic clone models progress, they are likely to be deployed in challenging real-world settings, yet their suitability remains understudied. Our work addresses this gap by providing the first benchmark for three classes of synthetic clone models, namely supervised, self-supervised, and multi-modal ones, across a range of robustness measures. We show that existing synthetic self-supervised and multi-modal clones are comparable to or outperform state-of-the-art real-image baselines for a range of robustness metrics - shape bias, background bias, calibration, etc. However, we also find that synthetic clones are much more susceptible to adversarial and real-world noise than models trained with real data. To address this, we find that combining both real and synthetic data further increases the robustness, and that the choice of prompt used for generating synthetic images plays an important part in the robustness of synthetic clones. △ Less

Submitted 30 June, 2024; v1 submitted 30 May, 2024; originally announced May 2024.

Comments: Accepted at CVPR 2024 Workshop: SyntaGen-Harnessing Generative Models for Synthetic Visual Datasets. Project page at https://synbenchmark.github.io/SynCloneBenchmark Comments: Fix typo in Fig. 1

arXiv:2404.16818 [pdf, other]

Boosting Unsupervised Semantic Segmentation with Principal Mask Proposals

Authors: Oliver Hahn, Nikita Araslanov, Simone Schaub-Meyer, Stefan Roth

Abstract: Unsupervised semantic segmentation aims to automatically partition images into semantically meaningful regions by identifying global categories within an image corpus without any form of annotation. Building upon recent advances in self-supervised representation learning, we focus on how to leverage these large pre-trained models for the downstream task of unsupervised segmentation. We present Pri… ▽ More Unsupervised semantic segmentation aims to automatically partition images into semantically meaningful regions by identifying global categories within an image corpus without any form of annotation. Building upon recent advances in self-supervised representation learning, we focus on how to leverage these large pre-trained models for the downstream task of unsupervised segmentation. We present PriMaPs - Principal Mask Proposals - decomposing images into semantically meaningful masks based on their feature representation. This allows us to realize unsupervised semantic segmentation by fitting class prototypes to PriMaPs with a stochastic expectation-maximization algorithm, PriMaPs-EM. Despite its conceptual simplicity, PriMaPs-EM leads to competitive results across various pre-trained backbone models, including DINO and DINOv2, and across datasets, such as Cityscapes, COCO-Stuff, and Potsdam-3. Importantly, PriMaPs-EM is able to boost results when applied orthogonally to current state-of-the-art unsupervised semantic segmentation pipelines. △ Less

Submitted 25 April, 2024; originally announced April 2024.

Comments: Code: https://github.com/visinf/primaps

arXiv:2403.17128 [pdf, other]

Benchmarking Video Frame Interpolation

Authors: Simon Kiefhaber, Simon Niklaus, Feng Liu, Simone Schaub-Meyer

Abstract: Video frame interpolation, the task of synthesizing new frames in between two or more given ones, is becoming an increasingly popular research target. However, the current evaluation of frame interpolation techniques is not ideal. Due to the plethora of test datasets available and inconsistent computation of error metrics, a coherent and fair comparison across papers is very challenging. Furthermo… ▽ More Video frame interpolation, the task of synthesizing new frames in between two or more given ones, is becoming an increasingly popular research target. However, the current evaluation of frame interpolation techniques is not ideal. Due to the plethora of test datasets available and inconsistent computation of error metrics, a coherent and fair comparison across papers is very challenging. Furthermore, new test sets have been proposed as part of method papers so they are unable to provide the in-depth evaluation of a dedicated benchmarking paper. Another severe downside is that these test sets violate the assumption of linearity when given two input frames, making it impossible to solve without an oracle. We hence strongly believe that the community would greatly benefit from a benchmarking paper, which is what we propose. Specifically, we present a benchmark which establishes consistent error metrics by utilizing a submission website that computes them, provides insights by analyzing the interpolation quality with respect to various per-pixel attributes such as the motion magnitude, contains a carefully designed test set adhering to the assumption of linearity by utilizing synthetic data, and evaluates the computational efficiency in a coherent manner. △ Less

Submitted 25 March, 2024; originally announced March 2024.

Comments: http://sniklaus.com/vfibench

arXiv:2308.06248 [pdf, other]

FunnyBirds: A Synthetic Vision Dataset for a Part-Based Analysis of Explainable AI Methods

Authors: Robin Hesse, Simone Schaub-Meyer, Stefan Roth

Abstract: The field of explainable artificial intelligence (XAI) aims to uncover the inner workings of complex deep neural models. While being crucial for safety-critical domains, XAI inherently lacks ground-truth explanations, making its automatic evaluation an unsolved problem. We address this challenge by proposing a novel synthetic vision dataset, named FunnyBirds, and accompanying automatic evaluation… ▽ More The field of explainable artificial intelligence (XAI) aims to uncover the inner workings of complex deep neural models. While being crucial for safety-critical domains, XAI inherently lacks ground-truth explanations, making its automatic evaluation an unsolved problem. We address this challenge by proposing a novel synthetic vision dataset, named FunnyBirds, and accompanying automatic evaluation protocols. Our dataset allows performing semantically meaningful image interventions, e.g., removing individual object parts, which has three important implications. First, it enables analyzing explanations on a part level, which is closer to human comprehension than existing methods that evaluate on a pixel level. Second, by comparing the model output for inputs with removed parts, we can estimate ground-truth part importances that should be reflected in the explanations. Third, by map** individual explanations into a common space of part importances, we can analyze a variety of different explanation types in a single common framework. Using our tools, we report results for 24 different combinations of neural models and XAI methods, demonstrating the strengths and weaknesses of the assessed methods in a fully automatic and systematic manner. △ Less

Submitted 11 August, 2023; originally announced August 2023.

Comments: Accepted at ICCV 2023. Code: https://github.com/visinf/funnybirds

arXiv:2305.09504 [pdf, other]

Content-Adaptive Downsampling in Convolutional Neural Networks

Authors: Robin Hesse, Simone Schaub-Meyer, Stefan Roth

Abstract: Many convolutional neural networks (CNNs) rely on progressive downsampling of their feature maps to increase the network's receptive field and decrease computational cost. However, this comes at the price of losing granularity in the feature maps, limiting the ability to correctly understand images or recover fine detail in dense prediction tasks. To address this, common practice is to replace the… ▽ More Many convolutional neural networks (CNNs) rely on progressive downsampling of their feature maps to increase the network's receptive field and decrease computational cost. However, this comes at the price of losing granularity in the feature maps, limiting the ability to correctly understand images or recover fine detail in dense prediction tasks. To address this, common practice is to replace the last few downsampling operations in a CNN with dilated convolutions, allowing to retain the feature map resolution without reducing the receptive field, albeit increasing the computational cost. This allows to trade off predictive performance against cost, depending on the output feature resolution. By either regularly downsampling or not downsampling the entire feature map, existing work implicitly treats all regions of the input image and subsequent feature maps as equally important, which generally does not hold. We propose an adaptive downsampling scheme that generalizes the above idea by allowing to process informative regions at a higher resolution than less informative ones. In a variety of experiments, we demonstrate the versatility of our adaptive downsampling strategy and empirically show that it improves the cost-accuracy trade-off of various established CNNs. △ Less

Submitted 16 May, 2023; originally announced May 2023.

Comments: Accepted at CVPR 2023 Workshop on Efficient Deep Learning for Computer Vision (ECV). Code: https://github.com/visinf/cad

arXiv:2211.14005 [pdf, other]

Efficient Feature Extraction for High-resolution Video Frame Interpolation

Authors: Moritz Nottebaum, Stefan Roth, Simone Schaub-Meyer

Abstract: Most deep learning methods for video frame interpolation consist of three main components: feature extraction, motion estimation, and image synthesis. Existing approaches are mainly distinguishable in terms of how these modules are designed. However, when interpolating high-resolution images, e.g. at 4K, the design choices for achieving high accuracy within reasonable memory requirements are limit… ▽ More Most deep learning methods for video frame interpolation consist of three main components: feature extraction, motion estimation, and image synthesis. Existing approaches are mainly distinguishable in terms of how these modules are designed. However, when interpolating high-resolution images, e.g. at 4K, the design choices for achieving high accuracy within reasonable memory requirements are limited. The feature extraction layers help to compress the input and extract relevant information for the latter stages, such as motion estimation. However, these layers are often costly in parameters, computation time, and memory. We show how ideas from dimensionality reduction combined with a lightweight optimization can be used to compress the input representation while kee** the extracted information suitable for frame interpolation. Further, we require neither a pretrained flow network nor a synthesis network, additionally reducing the number of trainable parameters and required memory. When evaluating on three 4K benchmarks, we achieve state-of-the-art image quality among the methods without pretrained flow while having the lowest network complexity and memory requirements overall. △ Less

Submitted 25 November, 2022; originally announced November 2022.

Comments: Accepted to BMVC 2022. Code: https://github.com/visinf/fldr-vfi

arXiv:2211.12209 [pdf, other]

$S^2$-Flow: Joint Semantic and Style Editing of Facial Images

Authors: Krishnakant Singh, Simone Schaub-Meyer, Stefan Roth

Abstract: The high-quality images yielded by generative adversarial networks (GANs) have motivated investigations into their application for image editing. However, GANs are often limited in the control they provide for performing specific edits. One of the principal challenges is the entangled latent space of GANs, which is not directly suitable for performing independent and detailed edits. Recent editing… ▽ More The high-quality images yielded by generative adversarial networks (GANs) have motivated investigations into their application for image editing. However, GANs are often limited in the control they provide for performing specific edits. One of the principal challenges is the entangled latent space of GANs, which is not directly suitable for performing independent and detailed edits. Recent editing methods allow for either controlled style edits or controlled semantic edits. In addition, methods that use semantic masks to edit images have difficulty preserving the identity and are unable to perform controlled style edits. We propose a method to disentangle a GAN$\text{'}$s latent space into semantic and style spaces, enabling controlled semantic and style edits for face images independently within the same framework. To achieve this, we design an encoder-decoder based network architecture ($S^2$-Flow), which incorporates two proposed inductive biases. We show the suitability of $S^2$-Flow quantitatively and qualitatively by performing various semantic and style edits. △ Less

Submitted 22 November, 2022; originally announced November 2022.

Comments: Accepted to BMVC 2022

arXiv:2209.15404 [pdf, other]

Entropy-driven Unsupervised Keypoint Representation Learning in Videos

Authors: Ali Younes, Simone Schaub-Meyer, Georgia Chalvatzaki

Abstract: Extracting informative representations from videos is fundamental for effectively learning various downstream tasks. We present a novel approach for unsupervised learning of meaningful representations from videos, leveraging the concept of image spatial entropy (ISE) that quantifies the per-pixel information in an image. We argue that \textit{local entropy} of pixel neighborhoods and their tempora… ▽ More Extracting informative representations from videos is fundamental for effectively learning various downstream tasks. We present a novel approach for unsupervised learning of meaningful representations from videos, leveraging the concept of image spatial entropy (ISE) that quantifies the per-pixel information in an image. We argue that \textit{local entropy} of pixel neighborhoods and their temporal evolution create valuable intrinsic supervisory signals for learning prominent features. Building on this idea, we abstract visual features into a concise representation of keypoints that act as dynamic information transmitters, and design a deep learning model that learns, purely unsupervised, spatially and temporally consistent representations \textit{directly} from video frames. Two original information-theoretic losses, computed from local entropy, guide our model to discover consistent keypoint representations; a loss that maximizes the spatial information covered by the keypoints and a loss that optimizes the keypoints' information transportation over time. We compare our keypoint representation to strong baselines for various downstream tasks, \eg, learning object dynamics. Our empirical results show superior performance for our information-driven keypoints that resolve challenges like attendance to static and dynamic objects or objects abruptly entering and leaving the scene. △ Less

Submitted 6 June, 2023; v1 submitted 30 September, 2022; originally announced September 2022.

Comments: 29 pages, 14 figures, Accepted at ICML 2023

arXiv:2111.07668 [pdf, other]

Fast Axiomatic Attribution for Neural Networks

Authors: Robin Hesse, Simone Schaub-Meyer, Stefan Roth

Abstract: Mitigating the dependence on spurious correlations present in the training dataset is a quickly emerging and important topic of deep learning. Recent approaches include priors on the feature attribution of a deep neural network (DNN) into the training process to reduce the dependence on unwanted features. However, until now one needed to trade off high-quality attributions, satisfying desirable ax… ▽ More Mitigating the dependence on spurious correlations present in the training dataset is a quickly emerging and important topic of deep learning. Recent approaches include priors on the feature attribution of a deep neural network (DNN) into the training process to reduce the dependence on unwanted features. However, until now one needed to trade off high-quality attributions, satisfying desirable axioms, against the time required to compute them. This in turn either led to long training times or ineffective attribution priors. In this work, we break this trade-off by considering a special class of efficiently axiomatically attributable DNNs for which an axiomatic feature attribution can be computed with only a single forward/backward pass. We formally prove that nonnegatively homogeneous DNNs, here termed $\mathcal{X}$-DNNs, are efficiently axiomatically attributable and show that they can be effortlessly constructed from a wide range of regular DNNs by simply removing the bias term of each layer. Various experiments demonstrate the advantages of $\mathcal{X}$-DNNs, beating state-of-the-art generic attribution methods on regular DNNs for training with attribution priors. △ Less

Submitted 15 November, 2021; originally announced November 2021.

Comments: To appear at NeurIPS*2021. Project page and code: https://visinf.github.io/fast-axiomatic-attribution

arXiv:2111.06265 [pdf, other]

Dense Unsupervised Learning for Video Segmentation

Authors: Nikita Araslanov, Simone Schaub-Meyer, Stefan Roth

Abstract: We present a novel approach to unsupervised learning for video object segmentation (VOS). Unlike previous work, our formulation allows to learn dense feature representations directly in a fully convolutional regime. We rely on uniform grid sampling to extract a set of anchors and train our model to disambiguate between them on both inter- and intra-video levels. However, a naive scheme to train su… ▽ More We present a novel approach to unsupervised learning for video object segmentation (VOS). Unlike previous work, our formulation allows to learn dense feature representations directly in a fully convolutional regime. We rely on uniform grid sampling to extract a set of anchors and train our model to disambiguate between them on both inter- and intra-video levels. However, a naive scheme to train such a model results in a degenerate solution. We propose to prevent this with a simple regularisation scheme, accommodating the equivariance property of the segmentation task to similarity transformations. Our training objective admits efficient implementation and exhibits fast training convergence. On established VOS benchmarks, our approach exceeds the segmentation accuracy of previous work despite using significantly less training data and compute power. △ Less

Submitted 11 November, 2021; originally announced November 2021.

Comments: To appear at NeurIPS*2021. Code: https://github.com/visinf/dense-ulearn-vos

Showing 1–10 of 10 results for author: Schaub-Meyer, S