-
Overcoming Common Flaws in the Evaluation of Selective Classification Systems
Authors:
Jeremias Traub,
Till J. Bungert,
Carsten T. Lüth,
Michael Baumgartner,
Klaus H. Maier-Hein,
Lena Maier-Hein,
Paul F Jaeger
Abstract:
Selective Classification, wherein models can reject low-confidence predictions, promises reliable translation of machine-learning based classification systems to real-world scenarios such as clinical diagnostics. While current evaluation of these systems typically assumes fixed working points based on pre-defined rejection thresholds, methodological progress requires benchmarking the general perfo…
▽ More
Selective Classification, wherein models can reject low-confidence predictions, promises reliable translation of machine-learning based classification systems to real-world scenarios such as clinical diagnostics. While current evaluation of these systems typically assumes fixed working points based on pre-defined rejection thresholds, methodological progress requires benchmarking the general performance of systems akin to the $\mathrm{AUROC}$ in standard classification. In this work, we define 5 requirements for multi-threshold metrics in selective classification regarding task alignment, interpretability, and flexibility, and show how current approaches fail to meet them. We propose the Area under the Generalized Risk Coverage curve ($\mathrm{AUGRC}$), which meets all requirements and can be directly interpreted as the average risk of undetected failures. We empirically demonstrate the relevance of $\mathrm{AUGRC}$ on a comprehensive benchmark spanning 6 data sets and 13 confidence scoring functions. We find that the proposed metric substantially changes metric rankings on 5 out of the 6 data sets.
△ Less
Submitted 1 July, 2024;
originally announced July 2024.
-
How Deduction Systems Can Help You To Verify Stability Properties
Authors:
Mario Gleirscher,
Rehab Massoud,
Dieter Hutter,
Christoph Lüth
Abstract:
Mathematical proofs are a cornerstone of control theory, and it is important to get them right. Deduction systems can help with this by mechanically checking the proofs. However, the structure and level of detail at which a proof is represented in a deduction system differ significantly from a proof read and written by mathematicians and engineers, hampering understanding and adoption of these sys…
▽ More
Mathematical proofs are a cornerstone of control theory, and it is important to get them right. Deduction systems can help with this by mechanically checking the proofs. However, the structure and level of detail at which a proof is represented in a deduction system differ significantly from a proof read and written by mathematicians and engineers, hampering understanding and adoption of these systems.
This paper aims at hel** to bridge the gap between machine-checked proofs and proofs in engineering and mathematics by presenting a machine-checked proof for stability using Lyapunov's theorem in a human-readable way. The structure of the proof is analyzed in detail, and potential benefits of such a proof are discussed, such as generalizability, reusability and increased trust in correctness.
△ Less
Submitted 16 April, 2024;
originally announced April 2024.
-
BinSym: Binary-Level Symbolic Execution using Formal Descriptions of Instruction Semantics
Authors:
Sören Tempel,
Tobias Brandt,
Christoph Lüth,
Rolf Drechsler
Abstract:
BinSym is a framework for symbolic program analysis of software in binary form. Contrary to prior work, it operates directly on binary code instructions and does not require lifting them to an intermediate representation (IR). This is achieved by formulating the symbolic semantics on top of a formal description of binary code instruction semantics. By building on existing formal descriptions, BinS…
▽ More
BinSym is a framework for symbolic program analysis of software in binary form. Contrary to prior work, it operates directly on binary code instructions and does not require lifting them to an intermediate representation (IR). This is achieved by formulating the symbolic semantics on top of a formal description of binary code instruction semantics. By building on existing formal descriptions, BinSym eliminates the manual effort required by prior work to implement transformations to an IR, thereby reducing the margin for errors. Furthermore, BinSym's symbolic semantics can be directly related to the binary code, which improves symbolic execution speed by reducing solver query complexity.
△ Less
Submitted 5 April, 2024;
originally announced April 2024.
-
Embarrassingly Simple Scribble Supervision for 3D Medical Segmentation
Authors:
Karol Gotkowski,
Carsten Lüth,
Paul F. Jäger,
Sebastian Ziegler,
Lars Krämer,
Stefan Denner,
Shuhan Xiao,
Nico Disch,
Klaus H. Maier-Hein,
Fabian Isensee
Abstract:
Traditionally, segmentation algorithms require dense annotations for training, demanding significant annotation efforts, particularly within the 3D medical imaging field. Scribble-supervised learning emerges as a possible solution to this challenge, promising a reduction in annotation efforts when creating large-scale datasets. Recently, a plethora of methods for optimized learning from scribbles…
▽ More
Traditionally, segmentation algorithms require dense annotations for training, demanding significant annotation efforts, particularly within the 3D medical imaging field. Scribble-supervised learning emerges as a possible solution to this challenge, promising a reduction in annotation efforts when creating large-scale datasets. Recently, a plethora of methods for optimized learning from scribbles have been proposed, but have so far failed to position scribble annotation as a beneficial alternative. We relate this shortcoming to two major issues: 1) the complex nature of many methods which deeply ties them to the underlying segmentation model, thus preventing a migration to more powerful state-of-the-art models as the field progresses and 2) the lack of a systematic evaluation to validate consistent performance across the broader medical domain, resulting in a lack of trust when applying these methods to new segmentation problems. To address these issues, we propose a comprehensive scribble supervision benchmark consisting of seven datasets covering a diverse set of anatomies and pathologies imaged with varying modalities. We furthermore propose the systematic use of partial losses, i.e. losses that are only computed on annotated voxels. Contrary to most existing methods, these losses can be seamlessly integrated into state-of-the-art segmentation methods, enabling them to learn from scribble annotations while preserving their original loss formulations. Our evaluation using nnU-Net reveals that while most existing methods suffer from a lack of generalization, the proposed approach consistently delivers state-of-the-art performance. Thanks to its simplicity, our approach presents an embarrassingly simple yet effective solution to the challenges of scribble supervision. Source code as well as our extensive scribble benchmarking suite will be made publicly available upon publication.
△ Less
Submitted 19 March, 2024;
originally announced March 2024.
-
ValUES: A Framework for Systematic Validation of Uncertainty Estimation in Semantic Segmentation
Authors:
Kim-Celine Kahl,
Carsten T. Lüth,
Maximilian Zenk,
Klaus Maier-Hein,
Paul F. Jaeger
Abstract:
Uncertainty estimation is an essential and heavily-studied component for the reliable application of semantic segmentation methods. While various studies exist claiming methodological advances on the one hand, and successful application on the other hand, the field is currently hampered by a gap between theory and practice leaving fundamental questions unanswered: Can data-related and model-relate…
▽ More
Uncertainty estimation is an essential and heavily-studied component for the reliable application of semantic segmentation methods. While various studies exist claiming methodological advances on the one hand, and successful application on the other hand, the field is currently hampered by a gap between theory and practice leaving fundamental questions unanswered: Can data-related and model-related uncertainty really be separated in practice? Which components of an uncertainty method are essential for real-world performance? Which uncertainty method works well for which application? In this work, we link this research gap to a lack of systematic and comprehensive evaluation of uncertainty methods. Specifically, we identify three key pitfalls in current literature and present an evaluation framework that bridges the research gap by providing 1) a controlled environment for studying data ambiguities as well as distribution shifts, 2) systematic ablations of relevant method components, and 3) test-beds for the five predominant uncertainty applications: OoD-detection, active learning, failure detection, calibration, and ambiguity modeling. Empirical results on simulated as well as real-world data demonstrate how the proposed framework is able to answer the predominant questions in the field revealing for instance that 1) separation of uncertainty types works on simulated data but does not necessarily translate to real-world data, 2) aggregation of scores is a crucial but currently neglected component of uncertainty methods, 3) While ensembles are performing most robustly across the different downstream tasks and settings, test-time augmentation often constitutes a light-weight alternative. Code is at: https://github.com/IML-DKFZ/values
△ Less
Submitted 3 May, 2024; v1 submitted 16 January, 2024;
originally announced January 2024.
-
Deriving Rewards for Reinforcement Learning from Symbolic Behaviour Descriptions of Bipedal Walking
Authors:
Daniel Harnack,
Christoph Lüth,
Lukas Gross,
Shivesh Kumar,
Frank Kirchner
Abstract:
Generating physical movement behaviours from their symbolic description is a long-standing challenge in artificial intelligence (AI) and robotics, requiring insights into numerical optimization methods as well as into formalizations from symbolic AI and reasoning. In this paper, a novel approach to finding a reward function from a symbolic description is proposed. The intended system behaviour is…
▽ More
Generating physical movement behaviours from their symbolic description is a long-standing challenge in artificial intelligence (AI) and robotics, requiring insights into numerical optimization methods as well as into formalizations from symbolic AI and reasoning. In this paper, a novel approach to finding a reward function from a symbolic description is proposed. The intended system behaviour is modelled as a hybrid automaton, which reduces the system state space to allow more efficient reinforcement learning. The approach is applied to bipedal walking, by modelling the walking robot as a hybrid automaton over state space orthants, and used with the compass walker to derive a reward that incentivizes following the hybrid automaton cycle. As a result, training times of reinforcement learning controllers are reduced while final walking speed is increased. The approach can serve as a blueprint how to generate reward functions from symbolic AI and reasoning.
△ Less
Submitted 16 December, 2023;
originally announced December 2023.
-
cOOpD: Reformulating COPD classification on chest CT scans as anomaly detection using contrastive representations
Authors:
Silvia D. Almeida,
Carsten T. Lüth,
Tobias Norajitra,
Tassilo Wald,
Marco Nolden,
Paul F. Jaeger,
Claus P. Heussel,
Jürgen Biederer,
Oliver Weinheimer,
Klaus Maier-Hein
Abstract:
Classification of heterogeneous diseases is challenging due to their complexity, variability of symptoms and imaging findings. Chronic Obstructive Pulmonary Disease (COPD) is a prime example, being underdiagnosed despite being the third leading cause of death. Its sparse, diffuse and heterogeneous appearance on computed tomography challenges supervised binary classification. We reformulate COPD bi…
▽ More
Classification of heterogeneous diseases is challenging due to their complexity, variability of symptoms and imaging findings. Chronic Obstructive Pulmonary Disease (COPD) is a prime example, being underdiagnosed despite being the third leading cause of death. Its sparse, diffuse and heterogeneous appearance on computed tomography challenges supervised binary classification. We reformulate COPD binary classification as an anomaly detection task, proposing cOOpD: heterogeneous pathological regions are detected as Out-of-Distribution (OOD) from normal homogeneous lung regions. To this end, we learn representations of unlabeled lung regions employing a self-supervised contrastive pretext model, potentially capturing specific characteristics of diseased and healthy unlabeled regions. A generative model then learns the distribution of healthy representations and identifies abnormalities (stemming from COPD) as deviations. Patient-level scores are obtained by aggregating region OOD scores. We show that cOOpD achieves the best performance on two public datasets, with an increase of 8.2% and 7.7% in terms of AUROC compared to the previous supervised state-of-the-art. Additionally, cOOpD yields well-interpretable spatial anomaly maps and patient-level scores which we show to be of additional value in identifying individuals in the early stage of progression. Experiments in artificially designed real-world prevalence settings further support that anomaly detection is a powerful way of tackling COPD classification.
△ Less
Submitted 14 July, 2023;
originally announced July 2023.
-
Navigating the Pitfalls of Active Learning Evaluation: A Systematic Framework for Meaningful Performance Assessment
Authors:
Carsten T. Lüth,
Till J. Bungert,
Lukas Klein,
Paul F. Jaeger
Abstract:
Active Learning (AL) aims to reduce the labeling burden by interactively selecting the most informative samples from a pool of unlabeled data. While there has been extensive research on improving AL query methods in recent years, some studies have questioned the effectiveness of AL compared to emerging paradigms such as semi-supervised (Semi-SL) and self-supervised learning (Self-SL), or a simple…
▽ More
Active Learning (AL) aims to reduce the labeling burden by interactively selecting the most informative samples from a pool of unlabeled data. While there has been extensive research on improving AL query methods in recent years, some studies have questioned the effectiveness of AL compared to emerging paradigms such as semi-supervised (Semi-SL) and self-supervised learning (Self-SL), or a simple optimization of classifier configurations. Thus, today's AL literature presents an inconsistent and contradictory landscape, leaving practitioners uncertain about whether and how to use AL in their tasks. In this work, we make the case that this inconsistency arises from a lack of systematic and realistic evaluation of AL methods. Specifically, we identify five key pitfalls in the current literature that reflect the delicate considerations required for AL evaluation. Further, we present an evaluation framework that overcomes these pitfalls and thus enables meaningful statements about the performance of AL methods. To demonstrate the relevance of our protocol, we present a large-scale empirical study and benchmark for image classification spanning various data sets, query methods, AL settings, and training paradigms. Our findings clarify the inconsistent picture in the literature and enable us to give hands-on recommendations for practitioners. The benchmark is hosted at https://github.com/IML-DKFZ/realistic-al .
△ Less
Submitted 3 November, 2023; v1 submitted 25 January, 2023;
originally announced January 2023.
-
CRADL: Contrastive Representations for Unsupervised Anomaly Detection and Localization
Authors:
Carsten T. Lüth,
David Zimmerer,
Gregor Koehler,
Paul F. Jaeger,
Fabian Isensee,
Jens Petersen,
Klaus H. Maier-Hein
Abstract:
Unsupervised anomaly detection in medical imaging aims to detect and localize arbitrary anomalies without requiring annotated anomalous data during training. Often, this is achieved by learning a data distribution of normal samples and detecting anomalies as regions in the image which deviate from this distribution. Most current state-of-the-art methods use latent variable generative models operat…
▽ More
Unsupervised anomaly detection in medical imaging aims to detect and localize arbitrary anomalies without requiring annotated anomalous data during training. Often, this is achieved by learning a data distribution of normal samples and detecting anomalies as regions in the image which deviate from this distribution. Most current state-of-the-art methods use latent variable generative models operating directly on the images. However, generative models have been shown to mostly capture low-level features, s.a. pixel-intensities, instead of rich semantic features, which also applies to their representations. We circumvent this problem by proposing CRADL whose core idea is to model the distribution of normal samples directly in the low-dimensional representation space of an encoder trained with a contrastive pretext-task. By utilizing the representations of contrastive learning, we aim to fix the over-fixation on low-level features and learn more semantic-rich representations. Our experiments on anomaly detection and localization tasks using three distinct evaluation datasets show that 1) contrastive representations are superior to representations of generative latent variable models and 2) the CRADL framework shows competitive or superior performance to state-of-the-art.
△ Less
Submitted 5 January, 2023;
originally announced January 2023.
-
Analytic Estimation of Region of Attraction of an LQR Controller for Torque Limited Simple Pendulum
Authors:
Lukas Gross,
Lasse Maywald,
Shivesh Kumar,
Frank Kirchner,
Christoph Lüth
Abstract:
Linear-quadratic regulators (LQR) are a well known and widely used tool in control theory for both linear and nonlinear dynamics. For nonlinear problems, an LQR-based controller is usually only locally viable, thus, raising the problem of estimating the region of attraction (ROA). The need for good ROA estimations becomes especially pressing for underactuated systems, as a failure of controls migh…
▽ More
Linear-quadratic regulators (LQR) are a well known and widely used tool in control theory for both linear and nonlinear dynamics. For nonlinear problems, an LQR-based controller is usually only locally viable, thus, raising the problem of estimating the region of attraction (ROA). The need for good ROA estimations becomes especially pressing for underactuated systems, as a failure of controls might lead to unsafe and unrecoverable system states. Known approaches based on optimization or sampling, while working well, might be too slow in time critical applications and are hard to verify formally. In this work, we propose a novel approach to estimate the ROA based on the analytic solutions to linear ODEs for the torque limited simple pendulum. In simulation and physical experiments, we compared our approach to a Lyapunov-sampling baseline approach and found that our approach was faster to compute, while yielding ROA estimations of similar phase space area.
△ Less
Submitted 28 November, 2022;
originally announced November 2022.
-
A Call to Reflect on Evaluation Practices for Failure Detection in Image Classification
Authors:
Paul F. Jaeger,
Carsten T. Lüth,
Lukas Klein,
Till J. Bungert
Abstract:
Reliable application of machine learning-based decision systems in the wild is one of the major challenges currently investigated by the field. A large portion of established approaches aims to detect erroneous predictions by means of assigning confidence scores. This confidence may be obtained by either quantifying the model's predictive uncertainty, learning explicit scoring functions, or assess…
▽ More
Reliable application of machine learning-based decision systems in the wild is one of the major challenges currently investigated by the field. A large portion of established approaches aims to detect erroneous predictions by means of assigning confidence scores. This confidence may be obtained by either quantifying the model's predictive uncertainty, learning explicit scoring functions, or assessing whether the input is in line with the training distribution. Curiously, while these approaches all state to address the same eventual goal of detecting failures of a classifier upon real-life application, they currently constitute largely separated research fields with individual evaluation protocols, which either exclude a substantial part of relevant methods or ignore large parts of relevant failure sources. In this work, we systematically reveal current pitfalls caused by these inconsistencies and derive requirements for a holistic and realistic evaluation of failure detection. To demonstrate the relevance of this unified perspective, we present a large-scale empirical study for the first time enabling benchmarking confidence scoring functions w.r.t all relevant methods and failure sources. The revelation of a simple softmax response baseline as the overall best performing method underlines the drastic shortcomings of current evaluation in the abundance of publicized research on confidence scoring. Code and trained models are at https://github.com/IML-DKFZ/fd-shifts.
△ Less
Submitted 5 April, 2023; v1 submitted 28 November, 2022;
originally announced November 2022.
-
Motion-Aware Robotic 3D Ultrasound
Authors:
Zhongliang Jiang,
Hanyu Wang,
Zhenyu Li,
Matthias Grimm,
Mingchuan Zhou,
Ulrich Eck,
Sandra V. Brecht,
Tim C. Lueth,
Thomas Wendler,
Nassir Navab
Abstract:
Robotic three-dimensional (3D) ultrasound (US) imaging has been employed to overcome the drawbacks of traditional US examinations, such as high inter-operator variability and lack of repeatability. However, object movement remains a challenge as unexpected motion decreases the quality of the 3D compounding. Furthermore, attempted adjustment of objects, e.g., adjusting limbs to display the entire l…
▽ More
Robotic three-dimensional (3D) ultrasound (US) imaging has been employed to overcome the drawbacks of traditional US examinations, such as high inter-operator variability and lack of repeatability. However, object movement remains a challenge as unexpected motion decreases the quality of the 3D compounding. Furthermore, attempted adjustment of objects, e.g., adjusting limbs to display the entire limb artery tree, is not allowed for conventional robotic US systems. To address this challenge, we propose a vision-based robotic US system that can monitor the object's motion and automatically update the sweep trajectory to provide 3D compounded images of the target anatomy seamlessly. To achieve these functions, a depth camera is employed to extract the manually planned sweep trajectory after which the normal direction of the object is estimated using the extracted 3D trajectory. Subsequently, to monitor the movement and further compensate for this motion to accurately follow the trajectory, the position of firmly attached passive markers is tracked in real-time. Finally, a step-wise compounding was performed. The experiments on a gel phantom demonstrate that the system can resume a sweep when the object is not stationary during scanning.
△ Less
Submitted 13 July, 2021;
originally announced July 2021.
-
Conditional Invertible Neural Networks for Diverse Image-to-Image Translation
Authors:
Lynton Ardizzone,
Jakob Kruse,
Carsten Lüth,
Niels Bracher,
Carsten Rother,
Ullrich Köthe
Abstract:
We introduce a new architecture called a conditional invertible neural network (cINN), and use it to address the task of diverse image-to-image translation for natural images. This is not easily possible with existing INN models due to some fundamental limitations. The cINN combines the purely generative INN model with an unconstrained feed-forward network, which efficiently preprocesses the condi…
▽ More
We introduce a new architecture called a conditional invertible neural network (cINN), and use it to address the task of diverse image-to-image translation for natural images. This is not easily possible with existing INN models due to some fundamental limitations. The cINN combines the purely generative INN model with an unconstrained feed-forward network, which efficiently preprocesses the conditioning image into maximally informative features. All parameters of a cINN are jointly optimized with a stable, maximum likelihood-based training procedure. Even though INN-based models have received far less attention in the literature than GANs, they have been shown to have some remarkable properties absent in GANs, e.g. apparent immunity to mode collapse. We find that our cINNs leverage these properties for image-to-image translation, demonstrated on day to night translation and image colorization. Furthermore, we take advantage of our bidirectional cINN architecture to explore and manipulate emergent properties of the latent space, such as changing the image style in an intuitive way.
△ Less
Submitted 5 May, 2021;
originally announced May 2021.
-
Guided Image Generation with Conditional Invertible Neural Networks
Authors:
Lynton Ardizzone,
Carsten Lüth,
Jakob Kruse,
Carsten Rother,
Ullrich Köthe
Abstract:
In this work, we address the task of natural image generation guided by a conditioning input. We introduce a new architecture called conditional invertible neural network (cINN). The cINN combines the purely generative INN model with an unconstrained feed-forward network, which efficiently preprocesses the conditioning input into useful features. All parameters of the cINN are jointly optimized wi…
▽ More
In this work, we address the task of natural image generation guided by a conditioning input. We introduce a new architecture called conditional invertible neural network (cINN). The cINN combines the purely generative INN model with an unconstrained feed-forward network, which efficiently preprocesses the conditioning input into useful features. All parameters of the cINN are jointly optimized with a stable, maximum likelihood-based training procedure. By construction, the cINN does not experience mode collapse and generates diverse samples, in contrast to e.g. cGANs. At the same time our model produces sharp images since no reconstruction loss is required, in contrast to e.g. VAEs. We demonstrate these properties for the tasks of MNIST digit generation and image colorization. Furthermore, we take advantage of our bi-directional cINN architecture to explore and manipulate emergent properties of the latent space, such as changing the image style in an intuitive way.
△ Less
Submitted 10 July, 2019; v1 submitted 4 July, 2019;
originally announced July 2019.
-
Interactive Proof Presentations with Cobra
Authors:
Martin Ring,
Christoph Lüth
Abstract:
We present Cobra, a modern proof presentation framework, leveraging cutting-edge presentation technology together with a state of the art interactive theorem prover to present formalized mathematics as active documents. Cobra provides both an easy way to present proofs and a novel approach to auditorium interaction. The presentation is checked live by the theorem prover, and moreover allows for li…
▽ More
We present Cobra, a modern proof presentation framework, leveraging cutting-edge presentation technology together with a state of the art interactive theorem prover to present formalized mathematics as active documents. Cobra provides both an easy way to present proofs and a novel approach to auditorium interaction. The presentation is checked live by the theorem prover, and moreover allows for live changes both by the presenter and the audience.
△ Less
Submitted 24 January, 2017;
originally announced January 2017.
-
Proceedings 10th International Workshop On User Interfaces for Theorem Provers
Authors:
Cezary Kaliszyk,
Christoph Lüth
Abstract:
This EPTCS volume collects the post-proceedings of the 10th International Workshop On User Interfaces for Theorem Provers (UITP 2012), held as part of the Conferences on Intelligent Computer Mathematics (CICM 2012) in Bremen on July 11th 2012. The UITP workshop series aims at bringing together reasearchers interested in designing, develo** and evaluating interfaces for interactive proof systems,…
▽ More
This EPTCS volume collects the post-proceedings of the 10th International Workshop On User Interfaces for Theorem Provers (UITP 2012), held as part of the Conferences on Intelligent Computer Mathematics (CICM 2012) in Bremen on July 11th 2012. The UITP workshop series aims at bringing together reasearchers interested in designing, develo** and evaluating interfaces for interactive proof systems, such as theorem provers, formal method tools, and other tools manipulating and presenting mathematical formulae. Started in 1995, it can look back on seventeen years of history by now.
The papers in the present volume give a good indication of the range of questions currently addressed in the UITP community; this ranges from interface design (Windsteiger; Dunchev et al) to using technologies such as machine learning to assist the user (Komendantskaya et al). The web features prominently (Tankink), and new technology necessitates changes right down to the very basic modes of interaction (Wenzel) - the old REPL (read, evaluate, print, loop) mode of interaction can not take advantage of modern technology, such as the web and multi-core machines.
△ Less
Submitted 5 July, 2013;
originally announced July 2013.