Search | arXiv e-print repository

Cross-Validation for Training and Testing Co-occurrence Network Inference Algorithms

Authors: Daniel Agyapong, Jeffrey Ryan Propster, Jane Marks, Toby Dylan Hocking

Abstract: Microorganisms are found in almost every environment, including the soil, water, air, and inside other organisms, like animals and plants. While some microorganisms cause diseases, most of them help in biological processes such as decomposition, fermentation and nutrient cycling. A lot of research has gone into studying microbial communities in various environments and how their interactions and r… ▽ More Microorganisms are found in almost every environment, including the soil, water, air, and inside other organisms, like animals and plants. While some microorganisms cause diseases, most of them help in biological processes such as decomposition, fermentation and nutrient cycling. A lot of research has gone into studying microbial communities in various environments and how their interactions and relationships can provide insights into various diseases. Co-occurrence network inference algorithms help us understand the complex associations of micro-organisms, especially bacteria. Existing network inference algorithms employ techniques such as correlation, regularized linear regression, and conditional dependence, which have different hyper-parameters that determine the sparsity of the network. Previous methods for evaluating the quality of the inferred network include using external data, and network consistency across sub-samples, both which have several drawbacks that limit their applicability in real microbiome composition data sets. We propose a novel cross-validation method to evaluate co-occurrence network inference algorithms, and new methods for applying existing algorithms to predict on test data. Our empirical study shows that the proposed method is useful for hyper-parameter selection (training) and comparing the quality of the inferred networks between different algorithms (testing). △ Less

Submitted 26 September, 2023; originally announced September 2023.

arXiv:2302.11062 [pdf, other]

A Log-linear Gradient Descent Algorithm for Unbalanced Binary Classification using the All Pairs Squared Hinge Loss

Authors: Kyle R. Rust, Toby D. Hocking

Abstract: Receiver Operating Characteristic (ROC) curves are plots of true positive rate versus false positive rate which are used to evaluate binary classification algorithms. Because the Area Under the Curve (AUC) is a constant function of the predicted values, learning algorithms instead optimize convex relaxations which involve a sum over all pairs of labeled positive and negative examples. Naive learni… ▽ More Receiver Operating Characteristic (ROC) curves are plots of true positive rate versus false positive rate which are used to evaluate binary classification algorithms. Because the Area Under the Curve (AUC) is a constant function of the predicted values, learning algorithms instead optimize convex relaxations which involve a sum over all pairs of labeled positive and negative examples. Naive learning algorithms compute the gradient in quadratic time, which is too slow for learning using large batch sizes. We propose a new functional representation of the square loss and squared hinge loss, which results in algorithms that compute the gradient in either linear or log-linear time, and makes it possible to use gradient descent learning with large batch sizes. In our empirical study of supervised binary classification problems, we show that our new algorithm can achieve higher test AUC values on imbalanced data sets than previous algorithms, and make use of larger batch sizes than were previously feasible. △ Less

Submitted 21 February, 2023; originally announced February 2023.

arXiv:2210.02580 [pdf, other]

Functional Labeled Optimal Partitioning

Authors: Toby D. Hocking, Jacob M. Kaufman, Alyssa J. Stenberg

Abstract: Peak detection is a problem in sequential data analysis that involves differentiating regions with higher counts (peaks) from regions with lower counts (background noise). It is crucial to correctly predict areas that deviate from the background noise, in both the train and test sets of labels. Dynamic programming changepoint algorithms have been proposed to solve the peak detection problem by… ▽ More Peak detection is a problem in sequential data analysis that involves differentiating regions with higher counts (peaks) from regions with lower counts (background noise). It is crucial to correctly predict areas that deviate from the background noise, in both the train and test sets of labels. Dynamic programming changepoint algorithms have been proposed to solve the peak detection problem by constraining the mean to alternatively increase and then decrease. The current constrained changepoint algorithms only create predictions on the test set, while completely ignoring the train set. Changepoint algorithms that are both accurate when fitting the train set, and make predictions on the test set, have been proposed but not in the context of peak detection models. We propose to resolve these issues by creating a new dynamic programming algorithm, FLOPART, that has zero train label errors, and is able to provide highly accurate predictions on the test set. We provide an empirical analysis that shows FLOPART has a similar time complexity while being more accurate than the existing algorithms in terms of train and test label errors. △ Less

Submitted 5 October, 2022; originally announced October 2022.

arXiv:2107.01285 [pdf, other]

Optimizing ROC Curves with a Sort-Based Surrogate Loss Function for Binary Classification and Changepoint Detection

Authors: Jonathan Hillman, Toby Dylan Hocking

Abstract: Receiver Operating Characteristic (ROC) curves are plots of true positive rate versus false positive rate which are useful for evaluating binary classification models, but difficult to use for learning since the Area Under the Curve (AUC) is non-convex. ROC curves can also be used in other problems that have false positive and true positive rates such as changepoint detection. We show that in this… ▽ More Receiver Operating Characteristic (ROC) curves are plots of true positive rate versus false positive rate which are useful for evaluating binary classification models, but difficult to use for learning since the Area Under the Curve (AUC) is non-convex. ROC curves can also be used in other problems that have false positive and true positive rates such as changepoint detection. We show that in this more general context, the ROC curve can have loops, points with highly sub-optimal error rates, and AUC greater than one. This observation motivates a new optimization objective: rather than maximizing the AUC, we would like a monotonic ROC curve with AUC=1 that avoids points with large values for Min(FP,FN). We propose a convex relaxation of this objective that results in a new surrogate loss function called the AUM, short for Area Under Min(FP, FN). Whereas previous loss functions are based on summing over all labeled examples or pairs, the AUM requires a sort and a sum over the sequence of points on the ROC curve. We show that AUM directional derivatives can be efficiently computed and used in a gradient descent learning algorithm. In our empirical study of supervised binary classification and changepoint detection problems, we show that our new AUM minimization learning algorithm results in improved AUC and comparable speed relative to previous baselines. △ Less

Submitted 2 July, 2021; originally announced July 2021.

arXiv:2102.03538 [pdf, other]

doi 10.1016/j.compbiomed.2021.104208

A Greedy Graph Search Algorithm Based on Changepoint Analysis for Automatic QRS Complex Detection

Authors: Atiyeh Fotoohinasab, Toby Hocking, Fatemeh Afghah

Abstract: The electrocardiogram (ECG) signal is the most widely used non-invasive tool for the investigation of cardiovascular diseases. Automatic delineation of ECG fiducial points, in particular the R-peak, serves as the basis for ECG processing and analysis. This study proposes a new method of ECG signal analysis by introducing a new class of graphical models based on optimal changepoint detection models… ▽ More The electrocardiogram (ECG) signal is the most widely used non-invasive tool for the investigation of cardiovascular diseases. Automatic delineation of ECG fiducial points, in particular the R-peak, serves as the basis for ECG processing and analysis. This study proposes a new method of ECG signal analysis by introducing a new class of graphical models based on optimal changepoint detection models, named the graph-constrained changepoint detection (GCCD) model. The GCCD model treats fiducial points delineation in the non-stationary ECG signal as a changepoint detection problem. The proposed model exploits the sparsity of changepoints to detect abrupt changes within the ECG signal; thereby, the R-peak detection task can be relaxed from any preprocessing step. In this novel approach, prior biological knowledge about the expected sequence of changes is incorporated into the model using the constraint graph, which can be defined manually or automatically. First, we define the constraint graph manually; then, we present a graph learning algorithm that can search for an optimal graph in a greedy scheme. Finally, we compare the manually defined graphs and learned graphs in terms of graph structure and detection accuracy. We evaluate the performance of the algorithm using the MIT-BIH Arrhythmia Database. The proposed model achieves an overall sensitivity of 99.64%, positive predictivity of 99.71%, and detection error rate of 0.19 for the manually defined constraint graph and overall sensitivity of 99.76%, positive predictivity of 99.68%, and detection error rate of 0.55 for the automatic learning constraint graph. △ Less

Submitted 6 February, 2021; originally announced February 2021.

arXiv:2102.01319 [pdf, other]

A Graph-Constrained Changepoint Learning Approach for Automatic QRS-Complex Detection

Authors: Atiyeh Fotoohinasab, Toby Hocking, Fatemeh Afghah

Abstract: This study presents a new viewpoint on ECG signal analysis by applying a graph-based changepoint detection model to locate R-peak positions. This model is based on a new graph learning algorithm to learn the constraint graph given the labeled ECG data. The proposed learning algorithm starts with a simple initial graph and iteratively edits the graph so that the final graph has the maximum accuracy… ▽ More This study presents a new viewpoint on ECG signal analysis by applying a graph-based changepoint detection model to locate R-peak positions. This model is based on a new graph learning algorithm to learn the constraint graph given the labeled ECG data. The proposed learning algorithm starts with a simple initial graph and iteratively edits the graph so that the final graph has the maximum accuracy in R-peak detection. We evaluate the performance of the algorithm on the MIT-BIH Arrhythmia Database. The evaluation results demonstrate that the proposed method can obtain comparable results to other state-of-the-art approaches. The proposed method achieves the overall sensitivity of Sen = 99.64%, positive predictivity of PPR = 99.71%, and detection error rate of DER = 0.19. △ Less

Submitted 6 February, 2021; v1 submitted 2 February, 2021; originally announced February 2021.

Comments: accepted in Asilomar 2020 conference

arXiv:2101.11089 [pdf, other]

doi 10.1145/3349537.3351901

Chatbots language design: the influence of language variation on user experience

Authors: Ana Paula Chaves, Jesse Egbert, Toby Hocking, Eck Doerry, Marco Aurelio Gerosa

Abstract: Chatbots are often designed to mimic social roles attributed to humans. However, little is known about the impact on user's perceptions of using language that fails to conform to the associated social role. Our research draws on sociolinguistic theory to investigate how a chatbot's language choices can adhere to the expected social role the agent performs within a given context. In doing so, we se… ▽ More Chatbots are often designed to mimic social roles attributed to humans. However, little is known about the impact on user's perceptions of using language that fails to conform to the associated social role. Our research draws on sociolinguistic theory to investigate how a chatbot's language choices can adhere to the expected social role the agent performs within a given context. In doing so, we seek to understand whether chatbots design should account for linguistic register. This research analyzes how register differences play a role in sha** the user's perception of the human-chatbot interaction. Ultimately, we want to determine whether register-specific language influences users' perceptions and experiences with chatbots. We produced parallel corpora of conversations in the tourism domain with similar content and varying register characteristics and evaluated users' preferences of chatbot's linguistic choices in terms of appropriateness, credibility, and user experience. Our results show that register characteristics are strong predictors of user's preferences, which points to the needs of designing chatbots with register-appropriate language to improve acceptance and users' perceptions of chatbot interactions. △ Less

Submitted 26 January, 2021; originally announced January 2021.

Journal ref: 7th International Conference on Human-Agent Interaction (HAI 2019), Kyoto, Japan

arXiv:2006.13967 [pdf, other]

Labeled Optimal Partitioning

Authors: Toby Dylan Hocking, Anuraag Srivastava

Abstract: In data sequences measured over space or time, an important problem is accurate detection of abrupt changes. In partially labeled data, it is important to correctly predict presence/absence of changes in positive/negative labeled regions, in both the train and test sets. One existing dynamic programming algorithm is designed for prediction in unlabeled test regions (and ignores the labels in the t… ▽ More In data sequences measured over space or time, an important problem is accurate detection of abrupt changes. In partially labeled data, it is important to correctly predict presence/absence of changes in positive/negative labeled regions, in both the train and test sets. One existing dynamic programming algorithm is designed for prediction in unlabeled test regions (and ignores the labels in the train set); another is for accurate fitting of train labels (but does not predict changepoints in unlabeled test regions). We resolve these issues by proposing a new optimal changepoint detection model that is guaranteed to fit the labels in the train data, and can also provide predictions of unlabeled changepoints in test data. We propose a new dynamic programming algorithm, Labeled Optimal Partitioning (LOPART), and we provide a formal proof that it solves the resulting non-convex optimization problem. We provide theoretical and empirical analysis of the time complexity of our algorithm, in terms of the number of labels and the size of the data sequence to segment. Finally, we provide empirical evidence that our algorithm is more accurate than the existing baselines, in terms of train and test label error. △ Less

Submitted 24 June, 2020; originally announced June 2020.

arXiv:2006.04920 [pdf, other]

Survival regression with accelerated failure time model in XGBoost

Authors: Avinash Barnwal, Hyunsu Cho, Toby Dylan Hocking

Abstract: Survival regression is used to estimate the relation between time-to-event and feature variables, and is important in application domains such as medicine, marketing, risk management and sales management. Nonlinear tree based machine learning algorithms as implemented in libraries such as XGBoost, scikit-learn, LightGBM, and CatBoost are often more accurate in practice than linear models. However,… ▽ More Survival regression is used to estimate the relation between time-to-event and feature variables, and is important in application domains such as medicine, marketing, risk management and sales management. Nonlinear tree based machine learning algorithms as implemented in libraries such as XGBoost, scikit-learn, LightGBM, and CatBoost are often more accurate in practice than linear models. However, existing state-of-the-art implementations of tree-based models have offered limited support for survival regression. In this work, we implement loss functions for learning accelerated failure time (AFT) models in XGBoost, to increase the support for survival modeling for different kinds of label censoring. We demonstrate with real and simulated experiments the effectiveness of AFT in XGBoost with respect to a number of baselines, in two respects: generalization performance and training speed. Furthermore, we take advantage of the support for NVIDIA GPUs in XGBoost to achieve substantial speedup over multi-core CPUs. To our knowledge, our work is the first implementation of AFT that utilizes the processing power of NVIDIA GPUs. Starting from the 1.2.0 release, the XGBoost package natively supports the AFT model. The addition of AFT in XGBoost has had significant impact in the open source community, and a few statistics packages now utilize the XGBoost AFT model. △ Less

Submitted 21 August, 2021; v1 submitted 8 June, 2020; originally announced June 2020.

arXiv:2004.13558 [pdf, other]

A Graph-constrained Changepoint Detection Approach for ECG Segmentation

Authors: Atiyeh Fotoohinasab, Toby Hocking, Fatemeh Afghah

Abstract: Electrocardiogram (ECG) signal is the most commonly used non-invasive tool in the assessment of cardiovascular diseases. Segmentation of the ECG signal to locate its constitutive waves, in particular the R-peaks, is a key step in ECG processing and analysis. Over the years, several segmentation and QRS complex detection algorithms have been proposed with different features; however, their performa… ▽ More Electrocardiogram (ECG) signal is the most commonly used non-invasive tool in the assessment of cardiovascular diseases. Segmentation of the ECG signal to locate its constitutive waves, in particular the R-peaks, is a key step in ECG processing and analysis. Over the years, several segmentation and QRS complex detection algorithms have been proposed with different features; however, their performance highly depends on applying preprocessing steps which makes them unreliable in real-time data analysis of ambulatory care settings and remote monitoring systems, where the collected data is highly noisy. Moreover, some issues still remain with the current algorithms in regard to the diverse morphological categories for the ECG signal and their high computation cost. In this paper, we introduce a novel graph-based optimal changepoint detection (GCCD) method for reliable detection of R-peak positions without employing any preprocessing step. The proposed model guarantees to compute the globally optimal changepoint detection solution. It is also generic in nature and can be applied to other time-series biomedical signals. Based on the MIT-BIH arrhythmia (MIT-BIH-AR) database, the proposed method achieves overall sensitivity Sen = 99.76, positive predictivity PPR = 99.68, and detection error rate DER = 0.55 which are comparable to other state-of-the-art approaches. △ Less

Submitted 24 April, 2020; originally announced April 2020.

arXiv:2003.02808 [pdf, other]

Linear time dynamic programming for the exact path of optimal models selected from a finite set

Authors: Toby Hocking, Joseph Vargovich

Abstract: Many learning algorithms are formulated in terms of finding model parameters which minimize a data-fitting loss function plus a regularizer. When the regularizer involves the l0 pseudo-norm, the resulting regularization path consists of a finite set of models. The fastest existing algorithm for computing the breakpoints in the regularization path is quadratic in the number of models, so it scales… ▽ More Many learning algorithms are formulated in terms of finding model parameters which minimize a data-fitting loss function plus a regularizer. When the regularizer involves the l0 pseudo-norm, the resulting regularization path consists of a finite set of models. The fastest existing algorithm for computing the breakpoints in the regularization path is quadratic in the number of models, so it scales poorly to high dimensional problems. We provide new formal proofs that a dynamic programming algorithm can be used to compute the breakpoints in linear time. Empirical results on changepoint detection problems demonstrate the improved accuracy and speed relative to grid search and the previous quadratic time algorithm. △ Less

Submitted 5 March, 2020; originally announced March 2020.

Comments: 14 pages

arXiv:1710.04234 [pdf, other]

Maximum Margin Interval Trees

Authors: Alexandre Drouin, Toby Dylan Hocking, François Laviolette

Abstract: Learning a regression function using censored or interval-valued output data is an important problem in fields such as genomics and medicine. The goal is to learn a real-valued prediction function, and the training output labels indicate an interval of possible values. Whereas most existing algorithms for this task are linear models, in this paper we investigate learning nonlinear tree models. We… ▽ More Learning a regression function using censored or interval-valued output data is an important problem in fields such as genomics and medicine. The goal is to learn a real-valued prediction function, and the training output labels indicate an interval of possible values. Whereas most existing algorithms for this task are linear models, in this paper we investigate learning nonlinear tree models. We propose to learn a tree by minimizing a margin-based discriminative objective function, and we provide a dynamic programming algorithm for computing the optimal solution in log-linear time. We show empirically that this algorithm achieves state-of-the-art speed and prediction accuracy in a benchmark of several data sets. △ Less

Submitted 27 October, 2017; v1 submitted 11 October, 2017; originally announced October 2017.

Comments: Accepted for presentation at the 31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA

arXiv:1401.8008 [pdf, ps, other]

Support vector comparison machines

Authors: David Venuto, Toby Dylan Hocking, Lakjaree Sphanurattana, Masashi Sugiyama

Abstract: In ranking problems, the goal is to learn a ranking function from labeled pairs of input points. In this paper, we consider the related comparison problem, where the label indicates which element of the pair is better, or if there is no significant difference. We cast the learning problem as a margin maximization, and show that it can be solved by converting it to a standard SVM. We use simulated… ▽ More In ranking problems, the goal is to learn a ranking function from labeled pairs of input points. In this paper, we consider the related comparison problem, where the label indicates which element of the pair is better, or if there is no significant difference. We cast the learning problem as a margin maximization, and show that it can be solved by converting it to a standard SVM. We use simulated nonlinear patterns, a real learning to rank sushi data set, and a chess data set to show that our proposed SVMcompare algorithm outperforms SVMrank when there are equality pairs. △ Less

Submitted 23 July, 2020; v1 submitted 30 January, 2014; originally announced January 2014.

Showing 1–13 of 13 results for author: Hocking, T