
Conformal Predictive Systems Under Covariate Shift

Authors: Jef Jonkers, Glenn Van Wallendael, Luc Duchateau, Sofie Van Hoecke

Abstract: Conformal Predictive Systems (CPS) offer a versatile framework for constructing predictive distributions, allowing for calibrated inference and informative decision-making. However, their applicability has been limited to scenarios adhering to the Independent and Identically Distributed (IID) model assumption. This paper extends CPS to accommodate scenarios characterized by covariate shifts. We therefore propose Weighted CPS (WCPS), akin to Weighted Conformal Prediction (WCP), leveraging likelihood ratios between training and testing covariate distributions. This extension enables the construction of nonparametric predictive distributions capable of handling covariate shifts. We present theoretical underpinnings and conjectures regarding the validity and efficacy of WCPS and demonstrate its utility through empirical evaluations on both synthetic and real-world datasets. Our simulation experiments indicate that WCPS are probabilistically calibrated under covariate shift.

Submitted 23 April, 2024; originally announced April 2024.

Comments: 13 pages, 4 figures

arXiv:2402.04906 [pdf, other]

Conformal Convolution and Monte Carlo Meta-learners for Predictive Inference of Individual Treatment Effects

Authors: Jef Jonkers, Jarne Verhaeghe, Glenn Van Wallendael, Luc Duchateau, Sofie Van Hoecke

Abstract: Knowledge of the effect of interventions, known as the treatment effect, is paramount for decision-making. Approaches to estimating this treatment effect using conditional average treatment effect (CATE) meta-learners often provide only a point estimate of this treatment effect, while additional uncertainty quantification is frequently desired to enhance decision-making confidence. To address this, we introduce two novel approaches: the conformal convolution T-learner (CCT-learner) and conformal Monte Carlo (CMC) meta-learners. The approaches leverage weighted conformal predictive systems (WCPS), Monte Carlo sampling, and CATE meta-learners to generate predictive distributions of individual treatment effect (ITE) that could enhance individualized decision-making. Although we show how assumptions about the noise distribution of the outcome influence the uncertainty predictions, our experiments demonstrate that the CCT- and CMC meta-learners achieve strong coverage while maintaining narrow interval widths. They also generate probabilistically calibrated predictive distributions, providing reliable ranges of ITEs across various synthetic and semi-synthetic datasets. Code: https://github.com/predict-idlab/cct-cmc

Submitted 12 June, 2024; v1 submitted 7 February, 2024; originally announced February 2024.

Comments: 25 pages, 14 figures

arXiv:2401.13518 [pdf, other]

Addressing Data Quality Challenges in Observational Ambulatory Studies: Analysis, Methodologies and Practical Solutions for Wrist-worn Wearable Monitoring

Authors: Jonas Van Der Donckt, Nicolas Vandenbussche, Jeroen Van Der Donckt, Stephanie Chen, Marija Stojchevska, Mathias De Brouwer, Bram Steenwinckel, Koen Paemeleire, Femke Ongenae, Sofie Van Hoecke

Abstract: Chronic disease management and follow-up are vital for realizing sustained patient well-being and optimal health outcomes. Recent advancements in wearable sensing technologies, particularly wrist-worn devices, offer promising solutions for longitudinal patient follow-up by shifting from subjective, intermittent self-reporting to objective, continuous monitoring. However, collecting and analyzing wearable data presents unique challenges, such as data entry errors, non-wear periods, missing wearable data, and wearable artifacts. We therefore present an in-depth exploration of data analysis challenges tied to wrist-worn wearables and ambulatory label acquisition, using two real-world datasets (i.e., mBrain21 and ETRI lifelog2020). We introduce novel practical countermeasures, including participant compliance visualizations, interaction-triggered questionnaires to assess personal bias, and an optimized wearable non-wear detection pipeline. Further, we propose a visual analytics approach to validate processing pipelines using scalable tools such as tsflex and Plotly-Resampler. Lastly, we investigate the impact of missing wearable data on "window-of-interest" analysis methodologies. Prioritizing transparency and reproducibility, we offer open access to our detailed code examples, facilitating adaptation in future wearable research. In conclusion, our contributions provide actionable approaches for wearable data collection and analysis in chronic disease management.

Submitted 24 January, 2024; originally announced January 2024.

Comments: 29 pages, 16 figures

arXiv:2307.05389 [pdf, other]

tsdownsample: high-performance time series downsampling for scalable visualization

Authors: Jeroen Van Der Donckt, Jonas Van Der Donckt, Sofie Van Hoecke

Abstract: Interactive line chart visualizations greatly enhance the effective exploration of large time series. Although downsampling has emerged as a well-established approach to enable efficient interactive visualization of large datasets, it is not an inherent feature in most visualization tools. Furthermore, there is no library offering a convenient interface for high-performance implementations of prominent downsampling algorithms. To address these shortcomings, we present tsdownsample, an open-source Python package specifically designed for CPU-based, in-memory time series downsampling. Our library focuses on performance and convenient integration, offering optimized implementations of leading downsampling algorithms. We achieve this optimization by leveraging low-level SIMD instructions and multithreading capabilities in Rust. In particular, SIMD instructions were employed to optimize the argmin and argmax operations. This SIMD optimization, along with some algorithmic tricks, proved crucial in enhancing the performance of various downsampling algorithms. We evaluate the performance of tsdownsample and demonstrate its interoperability with an established visualization framework. Our performance benchmarks indicate that the algorithmic runtime of tsdownsample approximates the CPU's memory bandwidth. This work marks a significant advancement in bringing high-performance time series downsampling to the Python ecosystem, enabling scalable visualization. The open-source code can be found at https://github.com/predict-idlab/tsdownsample

Submitted 5 July, 2023; originally announced July 2023.

Comments: Submitted to SoftwareX

arXiv:2305.00332 [pdf, other]

MinMaxLTTB: Leveraging MinMax-Preselection to Scale LTTB

Authors: Jeroen Van Der Donckt, Jonas Van Der Donckt, Michael Rademaker, Sofie Van Hoecke

Abstract: Visualization plays an important role in analyzing and exploring time series data. To facilitate efficient visualization of large datasets, downsampling has emerged as a well-established approach. This work concentrates on LTTB (Largest-Triangle-Three-Buckets), a widely adopted downsampling algorithm for time series data point selection. Specifically, we propose MinMaxLTTB, a two-step algorithm that marks a significant enhancement in the scalability of LTTB. MinMaxLTTB entails the following two steps: (i) the MinMax algorithm preselects a certain ratio of minimum and maximum data points, followed by (ii) applying the LTTB algorithm on only these preselected data points, effectively reducing LTTB's time complexity. The low computational cost of the MinMax algorithm, along with its parallelization capabilities, facilitates efficient preselection of data points. Additionally, the competitive performance of MinMax in terms of visual representativeness also makes it an effective reduction method. Experiments show that MinMaxLTTB outperforms LTTB by more than an order of magnitude in terms of computation time. Furthermore, preselecting a small multiple of the desired output size already provides similar visual representativeness compared to LTTB. In summary, MinMaxLTTB leverages the computational efficiency of MinMax to scale LTTB, without compromising on LTTB's favored visualization properties. The accompanying code and experiments of this paper can be found at https://github.com/predict-idlab/MinMaxLTTB.

Submitted 29 April, 2023; originally announced May 2023.

Comments: The first two authors contributed equally. Submitted to IEEE VIS 2023
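
The two steps named in the MinMaxLTTB abstract above (MinMax preselection, then LTTB on the preselected points) can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function names, the equal-width bucketing, and the default preselection ratio of 4 are assumptions made for illustration only.

```python
import numpy as np


def minmax_indices(y, n_out):
    """Pick the argmin and argmax of each of n_out // 2 equal-width buckets."""
    n_buckets = max(n_out // 2, 1)
    edges = np.linspace(0, len(y), n_buckets + 1).astype(int)
    picked = []
    for start, stop in zip(edges[:-1], edges[1:]):
        if stop > start:
            chunk = y[start:stop]
            picked.extend([start + int(chunk.argmin()), start + int(chunk.argmax())])
    return np.unique(picked)  # sorted, duplicates removed


def lttb_indices(x, y, n_out):
    """Plain Largest-Triangle-Three-Buckets; returns indices into x and y."""
    n = len(y)
    if n_out < 3:
        raise ValueError("n_out must be at least 3")
    if n_out >= n:
        return np.arange(n)
    # Bucket edges over the interior points; first and last points are always kept.
    edges = np.linspace(1, n - 1, n_out - 1).astype(int)
    selected = [0]
    for i in range(n_out - 2):
        start, stop = edges[i], edges[i + 1]
        if i < n_out - 3:
            # Average point of the next bucket.
            nxt = slice(edges[i + 1], edges[i + 2])
            ax, ay = x[nxt].mean(), y[nxt].mean()
        else:
            # The final bucket looks ahead to the last point.
            ax, ay = x[-1], y[-1]
        px, py = x[selected[-1]], y[selected[-1]]
        # Twice the triangle area for every candidate in the current bucket.
        area = np.abs((px - ax) * (y[start:stop] - py) - (px - x[start:stop]) * (ay - py))
        selected.append(start + int(area.argmax()))
    selected.append(n - 1)
    return np.array(selected)


def minmaxlttb_indices(x, y, n_out, ratio=4):
    """Step (i): MinMax preselects about n_out * ratio points; step (ii): LTTB on those.

    The default ratio is an illustrative assumption, not a value from the abstract.
    """
    if len(y) <= n_out * ratio:
        return lttb_indices(x, y, n_out)
    pre = minmax_indices(y[1:-1], n_out * ratio) + 1  # interior points only
    pre = np.concatenate(([0], pre, [len(y) - 1]))    # always keep both endpoints
    return pre[lttb_indices(x[pre], y[pre], n_out)]


x = np.arange(10_000, dtype=float)
y = np.sin(x / 50.0) + 0.1 * np.cos(x * 1.7)
idx = minmaxlttb_indices(x, y, n_out=100)
print(len(idx), idx[:5])
```

In this sketch LTTB only ever sees the preselected points, so its cost depends on `n_out * ratio` rather than on the full series length; the remaining full pass over the data is the argmin/argmax scan.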

arXiv:2304.00900 [pdf, other]

Data Point Selection for Line Chart Visualization: Methodological Assessment and Evidence-Based Guidelines

Authors: Jonas Van Der Donckt, Jeroen Van Der Donckt, Michael Rademaker, Sofie Van Hoecke

Abstract: Time series visualization plays a crucial role in identifying patterns and extracting insights across various domains. However, as datasets continue to grow in size, visualizing them effectively becomes challenging. Downsampling, which involves data aggregation or selection, is a well-established approach to overcome this challenge. This work focuses on data selection algorithms, which accomplish downsampling by selecting values from the original time series. Despite their widespread adoption in visualization platforms and time series databases, there is limited literature on the evaluation of these techniques. To address this, we propose an extensive metrics-based evaluation methodology. Our methodology analyzes visual representativeness by assessing how well a downsampled time series line chart visually approximates the original data. Moreover, our methodology includes a novel concept called "visual stability", which captures visual changes when updating (streaming) or interacting with the visualization (panning and zooming). We evaluated four data point selection algorithms across three open-source visualization toolkits using our proposed methodology, considering various figure-drawing properties. Following the analysis of our findings, we formulated a set of evidence-based guidelines for line chart visualization at scale with downsampling. To promote reproducibility and enable the qualitative evaluation of new advancements in time series data point selection, we have made our methodology and results openly accessible. The proposed evaluation methodology, along with the obtained insights from this study, establishes a foundation for future research in this domain.

Submitted 3 April, 2023; originally announced April 2023.

Comments: The first two authors contributed equally. Submitted to IEEE VIS 2023

arXiv:2211.05597 [pdf, other]

Perfectly predicting ICU length of stay: too good to be true

Authors: Sandeep Ramachandra, Gilles Vandewiele, David Vander Mijnsbrugge, Femke Ongenae, Sofie Van Hoecke

Abstract: A paper by Alsinglawi et al. was recently accepted and published in Scientific Reports. In this paper, the authors aim to predict length of stay (LOS), discretized into either long (> 7 days) or short stays (< 7 days), of lung cancer patients in an ICU department using various machine learning techniques. The authors claim to achieve perfect results, with an Area Under the Receiver Operating Characteristic curve (AUROC) of 100%, using a Random Forest (RF) classifier with the ADASYN class-balancing oversampling technique, which, if accurate, could have significant implications for hospital management. However, we have identified several methodological flaws within the manuscript which cause the results to be overly optimistic and would have serious consequences if used in clinical practice. Moreover, the reporting of the methodology is unclear and many important details are missing from the manuscript, which makes reproduction extremely difficult. We highlight the effect these oversights have had on the result and provide a more believable result of 88.91% AUROC when these oversights are corrected.

Submitted 10 November, 2022; originally announced November 2022.

Comments: 3 pages, 1 figure, 2 tables

arXiv:2207.07753 [pdf, other]

doi 10.1016/j.bspc.2022.104429

Do Not Sleep on Traditional Machine Learning: Simple and Interpretable Techniques Are Competitive to Deep Learning for Sleep Scoring

Authors: Jeroen Van Der Donckt, Jonas Van Der Donckt, Emiel Deprost, Nicolas Vandenbussche, Michael Rademaker, Gilles Vandewiele, Sofie Van Hoecke

Abstract: Over the last few years, research in automatic sleep scoring has mainly focused on developing increasingly complex deep learning architectures. However, recently these approaches achieved only marginal improvements, often at the expense of requiring more data and more expensive training procedures. Despite all these efforts and their satisfactory performance, automatic sleep staging solutions are not widely adopted in a clinical context yet. We argue that most deep learning solutions for sleep scoring are limited in their real-world applicability as they are hard to train, deploy, and reproduce. Moreover, these solutions lack interpretability and transparency, which are often key to increasing adoption rates. In this work, we revisit the problem of sleep stage classification using classical machine learning. Results show that competitive performance can be achieved with a conventional machine learning pipeline consisting of preprocessing, feature extraction, and a simple machine learning model. In particular, we analyze the performance of a linear model and a non-linear (gradient boosting) model. Our approach surpasses state-of-the-art (that uses the same data) on two public datasets: Sleep-EDF SC-20 (MF1 0.810) and Sleep-EDF ST (MF1 0.795), while achieving competitive results on Sleep-EDF SC-78 (MF1 0.775) and MASS SS3 (MF1 0.817). We show that, for the sleep stage scoring task, the expressiveness of an engineered feature vector is on par with the internally learned representations of deep learning models. This observation opens the door to clinical adoption, as a representative feature vector makes it possible to leverage both the interpretability and successful track record of traditional machine learning models.

Submitted 14 December, 2022; v1 submitted 15 July, 2022; originally announced July 2022.

Comments: The first two authors contributed equally. Accepted to Biomedical Signal Processing and Control

arXiv:2206.08703 [pdf, other]

Plotly-Resampler: Effective Visual Analytics for Large Time Series

Authors: Jonas Van Der Donckt, Jeroen Van Der Donckt, Emiel Deprost, Sofie Van Hoecke

Abstract: Visual analytics is arguably the most important step in getting acquainted with your data. This is especially the case for time series, as this data type is hard to describe and cannot be fully understood when using, for example, summary statistics. To realize effective time series visualization, four requirements have to be met: a tool should be (1) interactive, (2) scalable to millions of data points, (3) integrable in conventional data science environments, and (4) highly configurable. We observe that open source Python visualization toolkits empower data scientists in most visual analytics tasks, but lack the combination of scalability and interactivity to realize effective time series visualization. As a means to facilitate these requirements, we created Plotly-Resampler, an open source Python library. Plotly-Resampler is an add-on for Plotly's Python bindings, enhancing line chart scalability on top of an interactive toolkit by aggregating the underlying data depending on the current graph view. Plotly-Resampler is built to be snappy, as the reactivity of a tool qualitatively affects how analysts visually explore and analyze data. A benchmark task highlights how our toolkit scales better than alternatives in terms of number of samples and time series. Additionally, Plotly-Resampler's flexible data aggregation functionality paves the path towards researching novel aggregation techniques. Plotly-Resampler's integrability, together with its configurability, convenience, and high scalability, allows effective analysis of high-frequency data in your day-to-day Python environment.

Submitted 17 July, 2022; v1 submitted 17 June, 2022; originally announced June 2022.

Comments: The first two authors contributed equally. Accepted at IEEE VIS 2022

arXiv:2206.08394 [pdf, other]

Powershap: A Power-full Shapley Feature Selection Method

Authors: Jarne Verhaeghe, Jeroen Van Der Donckt, Femke Ongenae, Sofie Van Hoecke

Abstract: Feature selection is a crucial step in developing robust and powerful machine learning models. Feature selection techniques can be divided into two categories: filter and wrapper methods. While wrapper methods commonly result in strong predictive performances, they suffer from a large computational complexity and therefore take a significant amount of time to complete, especially when dealing with high-dimensional feature sets. Alternatively, filter methods are considerably faster, but suffer from several other disadvantages, such as (i) requiring a threshold value, (ii) not taking into account intercorrelation between features, and (iii) ignoring feature interactions with the model. To this end, we present powershap, a novel wrapper feature selection method, which leverages statistical hypothesis testing and power calculations in combination with Shapley values for quick and intuitive feature selection. Powershap is built on the core assumption that an informative feature will have a larger impact on the prediction compared to a known random feature. Benchmarks and simulations show that powershap outperforms other filter methods with predictive performances on par with wrapper methods while being significantly faster, often even reaching half or a third of the execution time. As such, powershap provides a competitive and quick algorithm that can be used by various models in different domains. Furthermore, powershap is implemented as a plug-and-play and open-source sklearn component, enabling easy integration in conventional data science pipelines. User experience is further enhanced by an automatic mode that tunes the hyper-parameters of the powershap algorithm, so that the algorithm can be used without any configuration.

Submitted 6 July, 2022; v1 submitted 16 June, 2022; originally announced June 2022.

Comments: Accepted at ECML PKDD 2022

arXiv:2203.02424 [pdf, other]

R-GCN: The R Could Stand for Random

Authors: Vic Degraeve, Gilles Vandewiele, Femke Ongenae, Sofie Van Hoecke

Abstract: The inception of the Relational Graph Convolutional Network (R-GCN) marked a milestone in the Semantic Web domain as a widely cited method that generalises end-to-end hierarchical representation learning to Knowledge Graphs (KGs). R-GCNs generate representations for nodes of interest by repeatedly aggregating parameterised, relation-specific transformations of their neighbours. However, in this paper, we argue that the R-GCN's main contribution lies in this "message passing" paradigm, rather than the learned weights. To this end, we introduce the "Random Relational Graph Convolutional Network" (RR-GCN), which leaves all parameters untrained and thus constructs node embeddings by aggregating randomly transformed random representations from neighbours, i.e., with no learned parameters. We empirically show that RR-GCNs can compete with fully trained R-GCNs in both node classification and link prediction settings.

Submitted 6 May, 2022; v1 submitted 4 March, 2022; originally announced March 2022.

arXiv:2111.12429 [pdf, other]

tsflex: flexible time series processing & feature extraction

Authors: Jonas Van Der Donckt, Jeroen Van Der Donckt, Emiel Deprost, Sofie Van Hoecke

Abstract: Time series processing and feature extraction are crucial and time-intensive steps in conventional machine learning pipelines. Existing packages are limited in their applicability, as they cannot cope with irregularly-sampled or asynchronous data and make strong assumptions about the data format. Moreover, these packages do not focus on execution speed and memory efficiency, resulting in considerable overhead. We present $\texttt{tsflex}$, a Python toolkit for time series processing and feature extraction, that focuses on performance and flexibility, enabling broad applicability. This toolkit leverages window-stride arguments of the same data type as the sequence-index, and maintains the sequence-index through all operations. $\texttt{tsflex}$ is flexible as it supports (1) multivariate time series, (2) multiple window-stride configurations, and (3) integrates with processing and feature functions from other packages, while (4) making no assumptions about the data sampling regularity, series alignment, and data type. Other functionalities include multiprocessing, detailed execution logging, chunking sequences, and serialization. Benchmarks show that $\texttt{tsflex}$ is faster and more memory-efficient compared to similar packages, while being more permissive and flexible in its utilization.

Submitted 22 December, 2021; v1 submitted 24 November, 2021; originally announced November 2021.

Comments: The first two authors contributed equally. Submitted to SoftwareX

arXiv:2001.06296 [pdf, other]

doi 10.1016/j.artmed.2020.101987

Overly Optimistic Prediction Results on Imbalanced Data: a Case Study of Flaws and Benefits when Applying Over-sampling

Authors: Gilles Vandewiele, Isabelle Dehaene, György Kovács, Lucas Sterckx, Olivier Janssens, Femke Ongenae, Femke De Backere, Filip De Turck, Kristien Roelens, Johan Decruyenaere, Sofie Van Hoecke, Thomas Demeester

Abstract: Information extracted from electrohysterography recordings could potentially prove to be an interesting additional source of information to estimate the risk of preterm birth. Recently, a large number of studies have reported near-perfect results in distinguishing between recordings of patients that will deliver term or preterm using a public resource, called the Term/Preterm Electrohysterogram database. However, we argue that these results are overly optimistic due to a methodological flaw. In this work, we focus on one specific type of methodological flaw: applying over-sampling before partitioning the data into mutually exclusive training and testing sets. We show how this causes the results to be biased using two artificial datasets and reproduce results of studies in which this flaw was identified. Moreover, we evaluate the actual impact of over-sampling on predictive performance, when applied prior to data partitioning, using the same methodologies of related studies, to provide a realistic view of these methodologies' generalization capabilities. We make our research reproducible by providing all the code under an open license.

Submitted 28 November, 2020; v1 submitted 15 January, 2020; originally announced January 2020.

Journal ref: Artificial Intelligence in Medicine. 111 (2021). 101987
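
The flaw named in the abstract above, over-sampling before partitioning into training and testing sets, can be reproduced on a toy problem. The sketch below is not the paper's experiment: the pure-noise features, the random-duplication over-sampler, and the 1-nearest-neighbour classifier are all assumptions chosen to keep the example small. Since the features carry no signal, an honest evaluation should score near chance.

```python
import numpy as np


def oversample(X, y, rng):
    """Duplicate random minority rows (label 1) until both classes are the same size."""
    minority = np.flatnonzero(y == 1)
    n_extra = int((y == 0).sum() - minority.size)
    extra = rng.choice(minority, size=n_extra, replace=True)
    return np.vstack([X, X[extra]]), np.concatenate([y, y[extra]])


def split(X, y, rng):
    """Random 50/50 partition into training and testing sets."""
    order = rng.permutation(len(y))
    half = len(y) // 2
    return X[order[:half]], y[order[:half]], X[order[half:]], y[order[half:]]


def balanced_accuracy_1nn(X_tr, y_tr, X_te, y_te):
    """Mean per-class recall of a 1-nearest-neighbour classifier."""
    dist = ((X_te[:, None, :] - X_tr[None, :, :]) ** 2).sum(axis=2)
    pred = y_tr[dist.argmin(axis=1)]
    recalls = [(pred[y_te == c] == c).mean() for c in (0, 1)]
    return float(np.mean(recalls))


rng = np.random.default_rng(0)
X = rng.normal(size=(400, 5))   # features are pure noise: there is nothing to learn
y = np.zeros(400, dtype=int)
y[:40] = 1                      # 10% minority class

# Flawed protocol: over-sample first, then split. Copies of the same minority
# row land on both sides of the split, so the test set leaks into training.
X_os, y_os = oversample(X, y, rng)
leaky = balanced_accuracy_1nn(*split(X_os, y_os, rng))

# Correct protocol: split first, then over-sample the training part only.
X_tr, y_tr, X_te, y_te = split(X, y, rng)
X_tr, y_tr = oversample(X_tr, y_tr, rng)
honest = balanced_accuracy_1nn(X_tr, y_tr, X_te, y_te)

print(f"over-sample before split: {leaky:.2f}")
print(f"over-sample after split:  {honest:.2f}")
```

The first score is inflated only because duplicated minority rows appear on both sides of the split; interpolating over-samplers can leak in a similar, less obvious way because synthetic points are built from rows that later end up in the test set.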

arXiv:1812.02405 [pdf, other]

Web Applicable Computer-aided Diagnosis of Glaucoma Using Deep Learning

Authors: Mijung Kim, Olivier Janssens, Ho-min Park, Jasper Zuallaert, Sofie Van Hoecke, Wesley De Neve

Abstract: Glaucoma is a major eye disease, leading to vision loss in the absence of proper medical treatment. Current diagnosis of glaucoma is performed by ophthalmologists who are often analyzing several types of medical images generated by different types of medical equipment. Capturing and analyzing these medical images is labor-intensive and expensive. In this paper, we present a novel computational approach towards glaucoma diagnosis and localization, only making use of eye fundus images that are analyzed by state-of-the-art deep learning techniques. Specifically, our approach leverages Convolutional Neural Networks (CNNs) and Gradient-weighted Class Activation Mapping (Grad-CAM) for glaucoma diagnosis and localization, respectively. Quantitative and qualitative results, as obtained for a small-sized dataset with no segmentation ground truth, demonstrate that the proposed approach is promising, for instance achieving an accuracy of 0.91$\pm0.02$ and an ROC-AUC score of 0.94 for the diagnosis task. Furthermore, we present a publicly available prototype web application that integrates our predictive model, with the goal of making effective glaucoma diagnosis available to a wide audience.

Submitted 3 April, 2019; v1 submitted 6 December, 2018; originally announced December 2018.

Comments: Machine Learning for Health (ML4H) Workshop at NeurIPS 2018 arXiv:cs/0101200

Report number: ML4H/2018/191

arXiv:1611.05722 [pdf, other]

GENESIM: genetic extraction of a single, interpretable model

Authors: Gilles Vandewiele, Olivier Janssens, Femke Ongenae, Filip De Turck, Sofie Van Hoecke

Abstract: Models obtained by decision tree induction techniques excel in being interpretable. However, they can be prone to overfitting, which results in a low predictive performance. Ensemble techniques are able to achieve a higher accuracy. However, this comes at a cost of losing interpretability of the resulting model. This makes ensemble techniques impractical in applications where decision support, instead of decision making, is crucial. To bridge this gap, we present the GENESIM algorithm that transforms an ensemble of decision trees to a single decision tree with an enhanced predictive performance by using a genetic algorithm. We compared GENESIM to prevalent decision tree induction and ensemble techniques using twelve publicly available data sets. The results show that GENESIM achieves a better predictive performance on most of these data sets than decision tree induction techniques and a predictive performance in the same order of magnitude as the ensemble techniques. Moreover, the resulting model of GENESIM has a very low complexity, making it very interpretable, in contrast to ensemble techniques.

Submitted 17 November, 2016; originally announced November 2016.

Comments: Presented at NIPS 2016 Workshop on Interpretable Machine Learning in Complex Systems

arXiv:1512.07780 [pdf, other]

doi 10.1017/S1471068416000016

The Pragmatic Proof: Hypermedia API Composition and Execution

Authors: Ruben Verborgh, Dörthe Arndt, Sofie Van Hoecke, Jos De Roo, Giovanni Mels, Thomas Steiner, Joaquim Gabarro

Abstract: Machine clients are increasingly making use of the Web to perform tasks. While Web services traditionally mimic remote procedure calling interfaces, a new generation of so-called hypermedia APIs works through hyperlinks and forms, in a way similar to how people browse the Web. This means that existing composition techniques, which determine a procedural plan upfront, are not sufficient to consume hypermedia APIs, which need to be navigated at runtime. Clients instead need a more dynamic plan that allows them to follow hyperlinks and use forms with a preset goal. Therefore, in this article, we show how compositions of hypermedia APIs can be created by generic Semantic Web reasoners. This is achieved through the generation of a proof based on semantic descriptions of the APIs' functionality. To pragmatically verify the applicability of compositions, we introduce the notion of pre-execution and post-execution proofs. The runtime interaction between a client and a server is guided by proofs but driven by hypermedia, allowing the client to react to the application's actual state indicated by the server's response. We describe how to generate compositions from descriptions, discuss a computer-assisted process to generate descriptions, and verify reasoner performance on various composition tasks using a benchmark suite. The experimental results lead to the conclusion that proof-based consumption of hypermedia APIs is a feasible strategy at Web scale.

Submitted 24 December, 2015; originally announced December 2015.

Comments: Under consideration in Theory and Practice of Logic Programming (TPLP)

arXiv:1410.1130 [pdf, other]

Real-time animation of human characters with fuzzy controllers

Authors: Koen Samyn, Sofie Van Hoecke, Bart Pieters, Charles Hollemeersch, Aljosha Demeulemeester, Rik van de Walle

Abstract: The production of animation is a resource-intensive process in game companies. Therefore, techniques to synthesize animations have been developed. However, these procedural techniques offer limited adaptability by animation artists. In order to solve this, a fuzzy neural network model of the animation is proposed, where the parameters can be tuned either by machine learning techniques that use motion capture data as training data or by the animation artist himself. This paper illustrates how this real-time procedural animation system can be developed, taking the human gait on flat terrain and inclined surfaces as example. Currently, the parametric model is capable of synthesizing animations for various limb sizes and step sizes.

Submitted 5 October, 2014; originally announced October 2014.

Showing 1–17 of 17 results for author: Van Hoecke, S