-
Unicorns Do Not Exist: Employing and Appreciating Community Managers in Open Source
Authors:
Raphael Sonabend,
Anna Carnegie,
Anne Lee Steele,
Marie Nugent,
Malvika Sharan
Abstract:
Open-source software is released under an open-source licence, which means the software can be shared, adapted, and reshared without prejudice. In the context of open-source software, community managers manage the communities that contribute to the development and upkeep of open-source tools. Despite playing a crucial role in maintaining open-source software, community managers are often overlooke…
▽ More
Open-source software is released under an open-source licence, which means the software can be shared, adapted, and reshared without prejudice. In the context of open-source software, community managers manage the communities that contribute to the development and upkeep of open-source tools. Despite playing a crucial role in maintaining open-source software, community managers are often overlooked. In this paper we look at why this happens and the troubling future we are heading towards if this trend continues. Namely if community managers are driven to focus on corporate needs and become conflicted with the communities they are meant to be managing. We suggest methods to overcome this by stressing the need for the specialisation of roles and by advocating for transparent metrics that highlight the real work of the community manager. Following these guidelines can allow this vital role to be treated with the transparency and respect that it deserves, alongside more traditional roles including software developers and engineers.
△ Less
Submitted 29 June, 2024;
originally announced July 2024.
-
A Large-Scale Neutral Comparison Study of Survival Models on Low-Dimensional Data
Authors:
Lukas Burk,
John Zobolas,
Bernd Bischl,
Andreas Bender,
Marvin N. Wright,
Raphael Sonabend
Abstract:
This work presents the first large-scale neutral benchmark experiment focused on single-event, right-censored, low-dimensional survival data. Benchmark experiments are essential in methodological research to scientifically compare new and existing model classes through proper empirical evaluation. Existing benchmarks in the survival literature are often narrow in scope, focusing, for example, on h…
▽ More
This work presents the first large-scale neutral benchmark experiment focused on single-event, right-censored, low-dimensional survival data. Benchmark experiments are essential in methodological research to scientifically compare new and existing model classes through proper empirical evaluation. Existing benchmarks in the survival literature are often narrow in scope, focusing, for example, on high-dimensional data. Additionally, they may lack appropriate tuning or evaluation procedures, or are qualitative reviews, rather than quantitative comparisons. This comprehensive study aims to fill the gap by neutrally evaluating a broad range of methods and providing generalizable conclusions. We benchmark 18 models, ranging from classical statistical approaches to many common machine learning methods, on 32 publicly available datasets. The benchmark tunes for both a discrimination measure and a proper scoring rule to assess performance in different settings. Evaluating on 8 survival metrics, we assess discrimination, calibration, and overall predictive performance of the tested models. Using discrimination measures, we find that no method significantly outperforms the Cox model. However, (tuned) Accelerated Failure Time models were able to achieve significantly better results with respect to overall predictive performance as measured by the right-censored log-likelihood. Machine learning methods that performed comparably well include Oblique Random Survival Forests under discrimination, and Cox-based likelihood-boosting under overall predictive performance. We conclude that for predictive purposes in the standard survival analysis setting of low-dimensional, right-censored data, the Cox Proportional Hazards model remains a simple and robust method, sufficient for practitioners.
△ Less
Submitted 6 June, 2024;
originally announced June 2024.
-
Training Survival Models using Scoring Rules
Authors:
Philipp Kopper,
David Rügamer,
Raphael Sonabend,
Bernd Bischl,
Andreas Bender
Abstract:
Survival Analysis provides critical insights for partially incomplete time-to-event data in various domains. It is also an important example of probabilistic machine learning. The probabilistic nature of the predictions can be exploited by using (proper) scoring rules in the model fitting process instead of likelihood-based optimization. Our proposal does so in a generic manner and can be used for…
▽ More
Survival Analysis provides critical insights for partially incomplete time-to-event data in various domains. It is also an important example of probabilistic machine learning. The probabilistic nature of the predictions can be exploited by using (proper) scoring rules in the model fitting process instead of likelihood-based optimization. Our proposal does so in a generic manner and can be used for a variety of model classes. We establish different parametric and non-parametric sub-frameworks that allow different degrees of flexibility. Incorporated into neural networks, it leads to a computationally efficient and scalable optimization routine, yielding state-of-the-art predictive performance. Finally, we show that using our framework, we can recover various parametric models and demonstrate that optimization works equally well when compared to likelihood-based methods.
△ Less
Submitted 19 March, 2024;
originally announced March 2024.
-
FAIR-USE4OS: Guidelines for Creating Impactful Open-Source Software
Authors:
Raphael Sonabend,
Hugo Gruson,
Leo Wolansky,
Agnes Kiragga,
Daniel S. Katz
Abstract:
This paper extends the FAIR (Findable, Accessible, Interoperable, Reusable) guidelines to provide criteria for assessing if software conforms to best practices in open source. By adding 'USE' (User-Centered, Sustainable, Equitable), software development can adhere to open source best practice by incorporating user-input early on, ensuring front-end designs are accessible to all possible stakeholde…
▽ More
This paper extends the FAIR (Findable, Accessible, Interoperable, Reusable) guidelines to provide criteria for assessing if software conforms to best practices in open source. By adding 'USE' (User-Centered, Sustainable, Equitable), software development can adhere to open source best practice by incorporating user-input early on, ensuring front-end designs are accessible to all possible stakeholders, and planning long-term sustainability alongside software design. The FAIR-USE4OS guidelines will allow funders and researchers to more effectively evaluate and plan open source software projects. There is good evidence of funders increasingly mandating that all funded research software is open source; however, even under the FAIR guidelines, this could simply mean software released on public repositories with a Zenodo DOI. By creating FAIR-USE software, best practice can be demonstrated from the very beginning of the design process and the software has the greatest chance of success by being impactful.
△ Less
Submitted 3 April, 2024; v1 submitted 5 February, 2024;
originally announced February 2024.
-
Deep Learning for Survival Analysis: A Review
Authors:
Simon Wiegrebe,
Philipp Kopper,
Raphael Sonabend,
Bernd Bischl,
Andreas Bender
Abstract:
The influx of deep learning (DL) techniques into the field of survival analysis in recent years has led to substantial methodological progress; for instance, learning from unstructured or high-dimensional data such as images, text or omics data. In this work, we conduct a comprehensive systematic review of DL-based methods for time-to-event analysis, characterizing them according to both survival-…
▽ More
The influx of deep learning (DL) techniques into the field of survival analysis in recent years has led to substantial methodological progress; for instance, learning from unstructured or high-dimensional data such as images, text or omics data. In this work, we conduct a comprehensive systematic review of DL-based methods for time-to-event analysis, characterizing them according to both survival- and DL-related attributes. In summary, the reviewed methods often address only a small subset of tasks relevant to time-to-event data - e.g., single-risk right-censored data - and neglect to incorporate more complex settings. Our findings are summarized in an editable, open-source, interactive table: https://survival-org.github.io/DL4Survival. As this research area is advancing rapidly, we encourage community contribution in order to keep this database up to date.
△ Less
Submitted 22 February, 2024; v1 submitted 24 May, 2023;
originally announced May 2023.
-
Examining properness in the external validation of survival models with squared and logarithmic losses
Authors:
Raphael Sonabend,
John Zobolas,
Philipp Kopper,
Lukas Burk,
Andreas Bender
Abstract:
Scoring rules promote rational and honest decision-making, which is becoming increasingly important for automated procedures in `auto-ML'. In this paper we survey common squared and logarithmic scoring rules for survival analysis and determine which losses are proper and improper. We prove that commonly utilised squared and logarithmic scoring rules that are claimed to be proper are in fact improp…
▽ More
Scoring rules promote rational and honest decision-making, which is becoming increasingly important for automated procedures in `auto-ML'. In this paper we survey common squared and logarithmic scoring rules for survival analysis and determine which losses are proper and improper. We prove that commonly utilised squared and logarithmic scoring rules that are claimed to be proper are in fact improper, such as the Integrated Survival Brier Score (ISBS). We further prove that under a strict set of assumptions a class of scoring rules is strictly proper for, what we term, `approximate' survival losses. Despite the difference in properness, experiments in simulated and real-world datasets show there is no major difference between improper and proper versions of the widely-used ISBS, ensuring that we can reasonably trust previous experiments utilizing the original score for evaluation purposes. We still advocate for the use of proper scoring rules, as even minor differences between losses can have important implications in automated processes such as model tuning. We hope our findings encourage further research into the properties of survival measures so that robust and honest evaluation of survival models can be achieved.
△ Less
Submitted 3 June, 2024; v1 submitted 10 December, 2022;
originally announced December 2022.
-
Flexible Group Fairness Metrics for Survival Analysis
Authors:
Raphael Sonabend,
Florian Pfisterer,
Alan Mishler,
Moritz Schauer,
Lukas Burk,
Sumantrak Mukherjee,
Sebastian Vollmer
Abstract:
Algorithmic fairness is an increasingly important field concerned with detecting and mitigating biases in machine learning models. There has been a wealth of literature for algorithmic fairness in regression and classification however there has been little exploration of the field for survival analysis. Survival analysis is the prediction task in which one attempts to predict the probability of an…
▽ More
Algorithmic fairness is an increasingly important field concerned with detecting and mitigating biases in machine learning models. There has been a wealth of literature for algorithmic fairness in regression and classification however there has been little exploration of the field for survival analysis. Survival analysis is the prediction task in which one attempts to predict the probability of an event occurring over time. Survival predictions are particularly important in sensitive settings such as when utilising machine learning for diagnosis and prognosis of patients. In this paper we explore how to utilise existing survival metrics to measure bias with group fairness metrics. We explore this in an empirical experiment with 29 survival datasets and 8 measures. We find that measures of discrimination are able to capture bias well whereas there is less clarity with measures of calibration and scoring rules. We suggest further areas for research including prediction-based fairness metrics for distribution predictions.
△ Less
Submitted 22 July, 2022; v1 submitted 26 May, 2022;
originally announced June 2022.
-
Avoiding C-hacking when evaluating survival distribution predictions with discrimination measures
Authors:
Raphael Sonabend,
Andreas Bender,
Sebastian Vollmer
Abstract:
In this paper we consider how to evaluate survival distribution predictions with measures of discrimination. This is a non-trivial problem as discrimination measures are the most commonly used in survival analysis and yet there is no clear method to derive a risk prediction from a distribution prediction. We survey methods proposed in literature and software and consider their respective advantage…
▽ More
In this paper we consider how to evaluate survival distribution predictions with measures of discrimination. This is a non-trivial problem as discrimination measures are the most commonly used in survival analysis and yet there is no clear method to derive a risk prediction from a distribution prediction. We survey methods proposed in literature and software and consider their respective advantages and disadvantages. Whilst distributions are frequently evaluated by discrimination measures, we find that the method for doing so is rarely described in the literature and often leads to unfair comparisons. We find that the most robust method of reducing a distribution to a risk is to sum over the predicted cumulative hazard. We recommend that machine learning survival analysis software implements clear transformations between distribution and risk predictions in order to allow more transparent and accessible model evaluation. The code used in the final experiment is available at https://github.com/RaphaelS1/distribution_discrimination.
△ Less
Submitted 9 March, 2022; v1 submitted 9 December, 2021;
originally announced December 2021.
-
Designing Machine Learning Toolboxes: Concepts, Principles and Patterns
Authors:
Franz J. Király,
Markus Löning,
Anthony Blaom,
Ahmed Guecioueur,
Raphael Sonabend
Abstract:
Machine learning (ML) and AI toolboxes such as scikit-learn or Weka are workhorses of contemporary data scientific practice -- their central role being enabled by usable yet powerful designs that allow to easily specify, train and validate complex modeling pipelines. However, despite their universal success, the key design principles in their construction have never been fully analyzed. In this pa…
▽ More
Machine learning (ML) and AI toolboxes such as scikit-learn or Weka are workhorses of contemporary data scientific practice -- their central role being enabled by usable yet powerful designs that allow to easily specify, train and validate complex modeling pipelines. However, despite their universal success, the key design principles in their construction have never been fully analyzed. In this paper, we attempt to provide an overview of key patterns in the design of AI modeling toolboxes, taking inspiration, in equal parts, from the field of software engineering, implementation patterns found in contemporary toolboxes, and our own experience from develo** ML toolboxes. In particular, we develop a conceptual model for the AI/ML domain, with a new type system, called scientific types, at its core. Scientific types capture the scientific meaning of common elements in ML workflows based on the set of operations that we usually perform with them (i.e. their interface) and their statistical properties. From our conceptual analysis, we derive a set of design principles and patterns. We illustrate that our analysis can not only explain the design of existing toolboxes, but also guide the development of new ones. We intend our contribution to be a state-of-art reference for future toolbox engineers, a summary of best practices, a collection of ML design patterns which may become useful for future research, and, potentially, the first steps towards a higher-level programming paradigm for constructing AI.
△ Less
Submitted 13 January, 2021;
originally announced January 2021.
-
distr6: R6 Object-Oriented Probability Distributions Interface in R
Authors:
Raphael Sonabend,
Franz Kiraly
Abstract:
distr6 is an object-oriented (OO) probability distributions interface leveraging the extensibility and scalability of R6, and the speed and efficiency of Rcpp. Over 50 probability distributions are currently implemented in the package with `core' methods including density, distribution, and generating functions, and more `exotic' ones including hazards and distribution function anti-derivatives. I…
▽ More
distr6 is an object-oriented (OO) probability distributions interface leveraging the extensibility and scalability of R6, and the speed and efficiency of Rcpp. Over 50 probability distributions are currently implemented in the package with `core' methods including density, distribution, and generating functions, and more `exotic' ones including hazards and distribution function anti-derivatives. In addition to simple distributions, distr6 supports compositions such as truncation, mixtures, and product distributions. This paper presents the core functionality of the package and demonstrates examples for key use-cases. In addition this paper provides a critical review of the object-oriented programming paradigms in R and describes some novel implementations for design patterns and core object-oriented features introduced by the package for supporting distr6 components.
△ Less
Submitted 20 March, 2021; v1 submitted 7 September, 2020;
originally announced September 2020.
-
mlr3proba: An R Package for Machine Learning in Survival Analysis
Authors:
Raphael Sonabend,
Franz J. Király,
Andreas Bender,
Bernd Bischl,
Michel Lang
Abstract:
As machine learning has become increasingly popular over the last few decades, so too has the number of machine learning interfaces for implementing these models. Whilst many R libraries exist for machine learning, very few offer extended support for survival analysis. This is problematic considering its importance in fields like medicine, bioinformatics, economics, engineering, and more. mlr3prob…
▽ More
As machine learning has become increasingly popular over the last few decades, so too has the number of machine learning interfaces for implementing these models. Whilst many R libraries exist for machine learning, very few offer extended support for survival analysis. This is problematic considering its importance in fields like medicine, bioinformatics, economics, engineering, and more. mlr3proba provides a comprehensive machine learning interface for survival analysis and connects with mlr3's general model tuning and benchmarking facilities to provide a systematic infrastructure for survival modeling and evaluation.
△ Less
Submitted 14 December, 2020; v1 submitted 18 August, 2020;
originally announced August 2020.
-
NIPS - Not Even Wrong? A Systematic Review of Empirically Complete Demonstrations of Algorithmic Effectiveness in the Machine Learning and Artificial Intelligence Literature
Authors:
Franz J Király,
Bilal Mateen,
Raphael Sonabend
Abstract:
Objective: To determine the completeness of argumentative steps necessary to conclude effectiveness of an algorithm in a sample of current ML/AI supervised learning literature.
Data Sources: Papers published in the Neural Information Processing Systems (NeurIPS, née NIPS) journal where the official record showed a 2017 year of publication.
Eligibility Criteria: Studies reporting a (semi-)super…
▽ More
Objective: To determine the completeness of argumentative steps necessary to conclude effectiveness of an algorithm in a sample of current ML/AI supervised learning literature.
Data Sources: Papers published in the Neural Information Processing Systems (NeurIPS, née NIPS) journal where the official record showed a 2017 year of publication.
Eligibility Criteria: Studies reporting a (semi-)supervised model, or pre-processing fused with (semi-)supervised models for tabular data.
Study Appraisal: Three reviewers applied the assessment criteria to determine argumentative completeness. The criteria were split into three groups, including: experiments (e.g real and/or synthetic data), baselines (e.g uninformed and/or state-of-art) and quantitative comparison (e.g. performance quantifiers with confidence intervals and formal comparison of the algorithm against baselines).
Results: Of the 121 eligible manuscripts (from the sample of 679 abstracts), 99\% used real-world data and 29\% used synthetic data. 91\% of manuscripts did not report an uninformed baseline and 55\% reported a state-of-art baseline. 32\% reported confidence intervals for performance but none provided references or exposition for how these were calculated. 3\% reported formal comparisons.
Limitations: The use of one journal as the primary information source may not be representative of all ML/AI literature. However, the NeurIPS conference is recognised to be amongst the top tier concerning ML/AI studies, so it is reasonable to consider its corpus to be representative of high-quality research.
Conclusion: Using the 2017 sample of the NeurIPS supervised learning corpus as an indicator for the quality and trustworthiness of current ML/AI research, it appears that complete argumentative chains in demonstrations of algorithmic effectiveness are rare.
△ Less
Submitted 18 December, 2018;
originally announced December 2018.