-
Cross-Validation for Training and Testing Co-occurrence Network Inference Algorithms
Authors:
Daniel Agyapong,
Jeffrey Ryan Propster,
Jane Marks,
Toby Dylan Hocking
Abstract:
Microorganisms are found in almost every environment, including soil, water, air, and inside other organisms, such as animals and plants. While some microorganisms cause diseases, most of them help in biological processes such as decomposition, fermentation, and nutrient cycling. Much research has gone into studying microbial communities in various environments and how their interactions and relationships can provide insights into various diseases. Co-occurrence network inference algorithms help us understand the complex associations of microorganisms, especially bacteria. Existing network inference algorithms employ techniques such as correlation, regularized linear regression, and conditional dependence, which have different hyper-parameters that determine the sparsity of the network. Previous methods for evaluating the quality of the inferred network include using external data and measuring network consistency across sub-samples, both of which have several drawbacks that limit their applicability to real microbiome composition data sets. We propose a novel cross-validation method to evaluate co-occurrence network inference algorithms, and new methods for applying existing algorithms to predict on test data. Our empirical study shows that the proposed method is useful for hyper-parameter selection (training) and for comparing the quality of the inferred networks between different algorithms (testing).
Submitted 26 September, 2023;
originally announced September 2023.
-
A Log-linear Gradient Descent Algorithm for Unbalanced Binary Classification using the All Pairs Squared Hinge Loss
Authors:
Kyle R. Rust,
Toby D. Hocking
Abstract:
Receiver Operating Characteristic (ROC) curves are plots of true positive rate versus false positive rate which are used to evaluate binary classification algorithms. Because the Area Under the Curve (AUC) is a piecewise constant function of the predicted values, learning algorithms instead optimize convex relaxations which involve a sum over all pairs of labeled positive and negative examples. Naive learning algorithms compute the gradient in quadratic time, which is too slow for learning using large batch sizes. We propose a new functional representation of the square loss and squared hinge loss, which results in algorithms that compute the gradient in either linear or log-linear time, and makes it possible to use gradient descent learning with large batch sizes. In our empirical study of supervised binary classification problems, we show that our new algorithm can achieve higher test AUC values on imbalanced data sets than previous algorithms, and make use of larger batch sizes than were previously feasible.
Submitted 21 February, 2023;
originally announced February 2023.
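The log-linear idea in the abstract above can be illustrated on the loss value alone. The sketch below is not the authors' implementation: the function names, the 0/1 label encoding, and the default margin of 1 are assumptions made for illustration, and the gradient is not computed. It sorts the margin-shifted negative predictions once, then uses suffix sums of the quadratic coefficients so each positive example costs one binary search instead of a pass over every negative.

```python
from bisect import bisect_right
from itertools import accumulate


def naive_all_pairs_squared_hinge(pred, label, margin=1.0):
    """Quadratic-time reference: sum over every (positive, negative) pair."""
    pos = [p for p, y in zip(pred, label) if y == 1]
    neg = [p for p, y in zip(pred, label) if y == 0]
    return sum(max(0.0, margin - (p - n)) ** 2 for p in pos for n in neg)


def sorted_all_pairs_squared_hinge(pred, label, margin=1.0):
    """Log-linear-time version: one sort plus one binary search per positive."""
    pos = [p for p, y in zip(pred, label) if y == 1]
    # Shift negatives by the margin; a pair contributes (z - p)^2 only when z > p.
    z = sorted(n + margin for n, y in zip(pred, label) if y == 0)
    # suffix_sum[k] = sum(z[k:]) and suffix_sq[k] = sum(v * v for v in z[k:]).
    suffix_sum = list(accumulate(reversed(z)))[::-1] + [0.0]
    suffix_sq = list(accumulate(v * v for v in reversed(z)))[::-1] + [0.0]
    total = 0.0
    for p in pos:
        k = bisect_right(z, p)  # z[k:] are exactly the shifted negatives above p
        count = len(z) - k
        # sum((z_j - p)^2) = sum(z_j^2) - 2 * p * sum(z_j) + count * p^2
        total += suffix_sq[k] - 2.0 * p * suffix_sum[k] + count * p * p
    return total
```

Both functions should agree up to floating-point rounding; the second avoids materializing the full set of pairs, which is what makes large batch sizes practical in this kind of approach.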
-
Functional Labeled Optimal Partitioning
Authors:
Toby D. Hocking,
Jacob M. Kaufman,
Alyssa J. Stenberg
Abstract:
Peak detection is a problem in sequential data analysis that involves differentiating regions with higher counts (peaks) from regions with lower counts (background noise).
It is crucial to correctly predict areas that deviate from the background noise, in both the train and test sets of labels.
Dynamic programming changepoint algorithms have been proposed to solve the peak detection problem by constraining the mean to alternately increase and then decrease.
The current constrained changepoint algorithms only create predictions on the test set, while completely ignoring the train set.
Changepoint algorithms that are both accurate when fitting the train set, and make predictions on the test set, have been proposed but not in the context of peak detection models.
We propose to resolve these issues by creating a new dynamic programming algorithm, FLOPART, that has zero train label errors, and is able to provide highly accurate predictions on the test set.
We provide an empirical analysis that shows FLOPART has a similar time complexity while being more accurate than the existing algorithms in terms of train and test label errors.
Submitted 5 October, 2022;
originally announced October 2022.
-
Optimizing ROC Curves with a Sort-Based Surrogate Loss Function for Binary Classification and Changepoint Detection
Authors:
Jonathan Hillman,
Toby Dylan Hocking
Abstract:
Receiver Operating Characteristic (ROC) curves are plots of true positive rate versus false positive rate which are useful for evaluating binary classification models, but difficult to use for learning since the Area Under the Curve (AUC) is non-convex. ROC curves can also be used in other problems that have false positive and true positive rates such as changepoint detection. We show that in this more general context, the ROC curve can have loops, points with highly sub-optimal error rates, and AUC greater than one. This observation motivates a new optimization objective: rather than maximizing the AUC, we would like a monotonic ROC curve with AUC=1 that avoids points with large values for Min(FP,FN). We propose a convex relaxation of this objective that results in a new surrogate loss function called the AUM, short for Area Under Min(FP, FN). Whereas previous loss functions are based on summing over all labeled examples or pairs, the AUM requires a sort and a sum over the sequence of points on the ROC curve. We show that AUM directional derivatives can be efficiently computed and used in a gradient descent learning algorithm. In our empirical study of supervised binary classification and changepoint detection problems, we show that our new AUM minimization learning algorithm results in improved AUC and comparable speed relative to previous baselines.
Submitted 2 July, 2021;
originally announced July 2021.
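For the binary classification case, the AUM described in the abstract above can be sketched as a sort followed by a single pass. This is an illustrative reading of "Area Under Min(FP, FN)", not the authors' code: the function name, the 0/1 label encoding, and the convention that an example is predicted positive once a constant added to its score makes it positive are all assumptions.

```python
def aum_binary(pred, label):
    """Area under min(FP, FN) as a constant c is added to every prediction."""
    # Example i becomes "predicted positive" once c exceeds the threshold -pred[i].
    items = sorted((-p, y) for p, y in zip(pred, label))
    total_pos = sum(1 for _, y in items if y == 1)
    fp = 0          # negatives whose threshold has been passed (false positives)
    fn = total_pos  # positives whose threshold has not been passed (false negatives)
    area = 0.0
    # Walk consecutive thresholds; FP and FN are constant between them.
    for (t, y), (t_next, _) in zip(items, items[1:]):
        if y == 1:
            fn -= 1
        else:
            fp += 1
        area += min(fp, fn) * (t_next - t)
    return area
```

Under these assumptions a perfectly ranked data set gives an AUM of zero, and each misranked pair adds area proportional to the gap between the two scores, which is what makes the quantity usable as a surrogate loss.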
-
A Greedy Graph Search Algorithm Based on Changepoint Analysis for Automatic QRS Complex Detection
Authors:
Atiyeh Fotoohinasab,
Toby Hocking,
Fatemeh Afghah
Abstract:
The electrocardiogram (ECG) signal is the most widely used non-invasive tool for the investigation of cardiovascular diseases. Automatic delineation of ECG fiducial points, in particular the R-peak, serves as the basis for ECG processing and analysis. This study proposes a new method of ECG signal analysis by introducing a new class of graphical models based on optimal changepoint detection models, named the graph-constrained changepoint detection (GCCD) model. The GCCD model treats fiducial point delineation in the non-stationary ECG signal as a changepoint detection problem. The proposed model exploits the sparsity of changepoints to detect abrupt changes within the ECG signal, so the R-peak detection task no longer requires any preprocessing step. In this novel approach, prior biological knowledge about the expected sequence of changes is incorporated into the model using the constraint graph, which can be defined manually or automatically. First, we define the constraint graph manually; then, we present a graph learning algorithm that can search for an optimal graph in a greedy scheme. Finally, we compare the manually defined graphs and learned graphs in terms of graph structure and detection accuracy. We evaluate the performance of the algorithm using the MIT-BIH Arrhythmia Database. The proposed model achieves an overall sensitivity of 99.64%, positive predictivity of 99.71%, and detection error rate of 0.19 for the manually defined constraint graph, and an overall sensitivity of 99.76%, positive predictivity of 99.68%, and detection error rate of 0.55 for the automatically learned constraint graph.
Submitted 6 February, 2021;
originally announced February 2021.
-
A Graph-Constrained Changepoint Learning Approach for Automatic QRS-Complex Detection
Authors:
Atiyeh Fotoohinasab,
Toby Hocking,
Fatemeh Afghah
Abstract:
This study presents a new viewpoint on ECG signal analysis by applying a graph-based changepoint detection model to locate R-peak positions. This model is based on a new graph learning algorithm that learns the constraint graph from labeled ECG data. The proposed learning algorithm starts with a simple initial graph and iteratively edits the graph so that the final graph has the maximum accuracy in R-peak detection. We evaluate the performance of the algorithm on the MIT-BIH Arrhythmia Database. The evaluation results demonstrate that the proposed method can obtain results comparable to other state-of-the-art approaches. The proposed method achieves an overall sensitivity of Sen = 99.64%, positive predictivity of PPR = 99.71%, and detection error rate of DER = 0.19.
Submitted 6 February, 2021; v1 submitted 2 February, 2021;
originally announced February 2021.
-
Chatbots language design: the influence of language variation on user experience
Authors:
Ana Paula Chaves,
Jesse Egbert,
Toby Hocking,
Eck Doerry,
Marco Aurelio Gerosa
Abstract:
Chatbots are often designed to mimic social roles attributed to humans. However, little is known about the impact on users' perceptions of using language that fails to conform to the associated social role. Our research draws on sociolinguistic theory to investigate how a chatbot's language choices can adhere to the expected social role the agent performs within a given context. In doing so, we seek to understand whether chatbot design should account for linguistic register. This research analyzes how register differences play a role in shaping the user's perception of the human-chatbot interaction. Ultimately, we want to determine whether register-specific language influences users' perceptions and experiences with chatbots. We produced parallel corpora of conversations in the tourism domain with similar content and varying register characteristics and evaluated users' preferences for the chatbot's linguistic choices in terms of appropriateness, credibility, and user experience. Our results show that register characteristics are strong predictors of users' preferences, which points to the need to design chatbots with register-appropriate language to improve acceptance and users' perceptions of chatbot interactions.
Submitted 26 January, 2021;
originally announced January 2021.
-
Labeled Optimal Partitioning
Authors:
Toby Dylan Hocking,
Anuraag Srivastava
Abstract:
In data sequences measured over space or time, an important problem is accurate detection of abrupt changes. In partially labeled data, it is important to correctly predict presence/absence of changes in positive/negative labeled regions, in both the train and test sets. One existing dynamic programming algorithm is designed for prediction in unlabeled test regions (and ignores the labels in the train set); another is for accurate fitting of train labels (but does not predict changepoints in unlabeled test regions). We resolve these issues by proposing a new optimal changepoint detection model that is guaranteed to fit the labels in the train data, and can also provide predictions of unlabeled changepoints in test data. We propose a new dynamic programming algorithm, Labeled Optimal Partitioning (LOPART), and we provide a formal proof that it solves the resulting non-convex optimization problem. We provide theoretical and empirical analysis of the time complexity of our algorithm, in terms of the number of labels and the size of the data sequence to segment. Finally, we provide empirical evidence that our algorithm is more accurate than the existing baselines, in terms of train and test label error.
Submitted 24 June, 2020;
originally announced June 2020.
-
Survival regression with accelerated failure time model in XGBoost
Authors:
Avinash Barnwal,
Hyunsu Cho,
Toby Dylan Hocking
Abstract:
Survival regression is used to estimate the relation between time-to-event and feature variables, and is important in application domains such as medicine, marketing, risk management, and sales management. Nonlinear tree-based machine learning algorithms as implemented in libraries such as XGBoost, scikit-learn, LightGBM, and CatBoost are often more accurate in practice than linear models. However, existing state-of-the-art implementations of tree-based models have offered limited support for survival regression. In this work, we implement loss functions for learning accelerated failure time (AFT) models in XGBoost, to increase the support for survival modeling for different kinds of label censoring. We demonstrate with real and simulated experiments the effectiveness of AFT in XGBoost with respect to a number of baselines, in two respects: generalization performance and training speed. Furthermore, we take advantage of the support for NVIDIA GPUs in XGBoost to achieve substantial speedup over multi-core CPUs. To our knowledge, our work is the first implementation of AFT that utilizes the processing power of NVIDIA GPUs. Starting from the 1.2.0 release, the XGBoost package natively supports the AFT model. The addition of AFT in XGBoost has had significant impact in the open source community, and a few statistics packages now utilize the XGBoost AFT model.
Submitted 21 August, 2021; v1 submitted 8 June, 2020;
originally announced June 2020.
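To make "different kinds of label censoring" in the abstract above concrete, the sketch below evaluates an AFT negative log-likelihood for a single censored label given as an interval [lower, upper]. It is a simplified illustration, not the XGBoost source: the function names are hypothetical, only the normal noise distribution on log-time is covered, and the uncensored case (lower equal to upper) is deliberately left out. Right censoring corresponds to an infinite upper bound and left censoring to a lower bound of zero.

```python
import math


def normal_cdf(z):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def aft_interval_nll(pred_log_time, lower, upper, sigma=1.0):
    """Negative log-likelihood of an event time known only to lie in [lower, upper].

    Assumes log(T) = pred_log_time + sigma * Z with Z standard normal.
    """
    def cdf_at(t):
        if t <= 0.0:
            return 0.0  # no probability mass at or below time zero
        if math.isinf(t):
            return 1.0  # all probability mass lies below infinity
        return normal_cdf((math.log(t) - pred_log_time) / sigma)

    prob = cdf_at(upper) - cdf_at(lower)
    # Floor the probability so the logarithm stays finite for extreme predictions.
    return -math.log(max(prob, 1e-12))
```

For a right-censored label, raising the predicted log-time lowers this loss, which matches the intuition that the model is only penalized for predicting an event earlier than the censoring time. In XGBoost itself the corresponding built-in objective is named `survival:aft`.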
-
A Graph-constrained Changepoint Detection Approach for ECG Segmentation
Authors:
Atiyeh Fotoohinasab,
Toby Hocking,
Fatemeh Afghah
Abstract:
The electrocardiogram (ECG) signal is the most commonly used non-invasive tool in the assessment of cardiovascular diseases. Segmentation of the ECG signal to locate its constitutive waves, in particular the R-peaks, is a key step in ECG processing and analysis. Over the years, several segmentation and QRS complex detection algorithms have been proposed with different features; however, their performance depends heavily on preprocessing steps, which makes them unreliable for real-time data analysis in ambulatory care settings and remote monitoring systems, where the collected data are highly noisy. Moreover, some issues remain with the current algorithms with regard to the diverse morphological categories of the ECG signal and their high computational cost. In this paper, we introduce a novel graph-based optimal changepoint detection (GCCD) method for reliable detection of R-peak positions without employing any preprocessing step. The proposed model is guaranteed to compute the globally optimal changepoint detection solution. It is also generic in nature and can be applied to other time-series biomedical signals. Based on the MIT-BIH arrhythmia (MIT-BIH-AR) database, the proposed method achieves overall sensitivity Sen = 99.76%, positive predictivity PPR = 99.68%, and detection error rate DER = 0.55, which are comparable to other state-of-the-art approaches.
Submitted 24 April, 2020;
originally announced April 2020.
-
Linear time dynamic programming for the exact path of optimal models selected from a finite set
Authors:
Toby Hocking,
Joseph Vargovich
Abstract:
Many learning algorithms are formulated in terms of finding model parameters which minimize a data-fitting loss function plus a regularizer. When the regularizer involves the l0 pseudo-norm, the resulting regularization path consists of a finite set of models. The fastest existing algorithm for computing the breakpoints in the regularization path is quadratic in the number of models, so it scales poorly to high dimensional problems. We provide new formal proofs that a dynamic programming algorithm can be used to compute the breakpoints in linear time. Empirical results on changepoint detection problems demonstrate the improved accuracy and speed relative to grid search and the previous quadratic time algorithm.
Submitted 5 March, 2020;
originally announced March 2020.
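The regularization path in the abstract above can be pictured as the lower envelope of one line per model, with cost equal to loss plus penalty times model size. The sketch below uses a standard stack-based lower-envelope scan and is offered only as an illustration: whether it matches the authors' algorithm step by step is an assumption, and the function name, argument names, and return format are hypothetical. Each model is pushed and popped at most once, so the scan is linear in the number of models when sizes are given in increasing order.

```python
import math


def l0_path_breakpoints(loss, size):
    """Return [(model_index, upper_penalty)] for models selected at some penalty > 0.

    Model k costs loss[k] + penalty * size[k]; size must be strictly increasing.
    An entry is selected for penalties below its upper_penalty and above the
    upper_penalty of the next entry (or above 0 for the last entry).
    """
    path = [(0, math.inf)]  # the smallest model wins for very large penalties
    for k in range(1, len(loss)):
        while True:
            j, upper_j = path[-1]
            # Penalty at which model k and model j have equal cost.
            cross = (loss[j] - loss[k]) / (size[k] - size[j])
            if cross < upper_j:
                break
            path.pop()  # model j is never strictly cheapest, so discard it
        if cross > 0:
            path.append((k, cross))  # model k is selected just below cross
    return path
```

For example, with losses 10, 4, 3, 0 and sizes 0, 1, 2, 3, the model of size 2 is never selected: the scan returns breakpoints at penalties 6 and 2, separating the models of size 0, 1, and 3.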
-
Maximum Margin Interval Trees
Authors:
Alexandre Drouin,
Toby Dylan Hocking,
François Laviolette
Abstract:
Learning a regression function using censored or interval-valued output data is an important problem in fields such as genomics and medicine. The goal is to learn a real-valued prediction function, and the training output labels indicate an interval of possible values. Whereas most existing algorithms for this task are linear models, in this paper we investigate learning nonlinear tree models. We propose to learn a tree by minimizing a margin-based discriminative objective function, and we provide a dynamic programming algorithm for computing the optimal solution in log-linear time. We show empirically that this algorithm achieves state-of-the-art speed and prediction accuracy in a benchmark of several data sets.
Submitted 27 October, 2017; v1 submitted 11 October, 2017;
originally announced October 2017.
-
Support vector comparison machines
Authors:
David Venuto,
Toby Dylan Hocking,
Lakjaree Sphanurattana,
Masashi Sugiyama
Abstract:
In ranking problems, the goal is to learn a ranking function from labeled pairs of input points. In this paper, we consider the related comparison problem, where the label indicates which element of the pair is better, or if there is no significant difference. We cast the learning problem as a margin maximization, and show that it can be solved by converting it to a standard SVM. We use simulated nonlinear patterns, a real learning to rank sushi data set, and a chess data set to show that our proposed SVMcompare algorithm outperforms SVMrank when there are equality pairs.
Submitted 23 July, 2020; v1 submitted 30 January, 2014;
originally announced January 2014.