-
Transforming Wearable Data into Health Insights using Large Language Model Agents
Authors:
Mike A. Merrill,
Akshay Paruchuri,
Naghmeh Rezaei,
Geza Kovacs,
Javier Perez,
Yun Liu,
Erik Schenck,
Nova Hammerquist,
Jake Sunshine,
Shyam Tailor,
Kumar Ayush,
Hao-Wei Su,
Qian He,
Cory Y. McLean,
Mark Malhotra,
Shwetak Patel,
Jiening Zhan,
Tim Althoff,
Daniel McDuff,
Xin Liu
Abstract:
Despite the proliferation of wearable health trackers and the importance of sleep and exercise to health, deriving actionable personalized insights from wearable data remains a challenge because doing so requires non-trivial open-ended analysis of these data. The recent rise of large language model (LLM) agents, which can use tools to reason about and interact with the world, presents a promising opportunity to enable such personalized analysis at scale. Yet, the application of LLM agents in analyzing personal health is still largely untapped. In this paper, we introduce the Personal Health Insights Agent (PHIA), an agent system that leverages state-of-the-art code generation and information retrieval tools to analyze and interpret behavioral health data from wearables. We curate two benchmark question-answering datasets of over 4000 health insights questions. Based on 650 hours of human and expert evaluation we find that PHIA can accurately address over 84% of factual numerical questions and more than 83% of crowd-sourced open-ended questions. This work has implications for advancing behavioral health across the population, potentially enabling individuals to interpret their own wearable data, and paving the way for a new era of accessible, personalized wellness regimens that are informed by data-driven insights.
Submitted 11 June, 2024; v1 submitted 10 June, 2024;
originally announced June 2024.
-
Enumerating the k-fold configurations in multi-class classification problems
Authors:
Attila Fazekas,
Gyorgy Kovacs
Abstract:
K-fold cross-validation is a widely used tool for assessing classifier performance. The reproducibility crisis faced by artificial intelligence partly results from the irreproducibility of reported k-fold cross-validation-based performance scores. Recently, we introduced numerical techniques to test the consistency of claimed performance scores and experimental setups. In a crucial use case, the method relies on the combinatorial enumeration of all k-fold configurations, for which we proposed an algorithm in the binary classification case.
Submitted 24 January, 2024;
originally announced January 2024.
-
The Conditioning Bias in Binary Decision Trees and Random Forests and Its Elimination
Authors:
Gábor Timár,
György Kovács
Abstract:
Decision tree and random forest classification and regression are some of the most widely used approaches in machine learning. Binary decision tree implementations commonly use conditioning in the form 'feature $\leq$ (or $<$) threshold', with the threshold being the midpoint between two observed feature values. In this paper, we investigate the bias introduced by the choice of conditioning operator (an intrinsic property of implementations) in the presence of features with lattice characteristics. We propose techniques to eliminate this bias, requiring an additional prediction with decision trees and incurring no cost for random forests. Using 20 classification and 20 regression datasets, we demonstrate that the bias can lead to statistically significant differences in terms of AUC and $r^2$ scores. The proposed techniques successfully mitigate the bias: compared to the worst-case scenario, statistically significant improvements of up to 0.1-0.2 percentage points of AUC and $r^2$ scores were achieved, and an improvement of 1.5 percentage points of $r^2$ score was measured in the most sensitive case of random forest regression. The implementation of the study is available on GitHub at the following repository: https://github.com/gykovacs/conditioning_bias.
Submitted 17 December, 2023;
originally announced December 2023.
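The source of the bias described in the abstract above can be illustrated with a minimal Python sketch. It is not taken from the paper or its repository, and the function names `midpoint_threshold` and `goes_left` are hypothetical. The sketch assumes an integer-valued (lattice) feature whose observed training values are 1 and 3, so that the midpoint threshold coincides with a value the feature can take at prediction time.

```python
def midpoint_threshold(a, b):
    """Midpoint between two observed feature values (hypothetical helper)."""
    return (a + b) / 2.0


def goes_left(x, threshold, operator):
    """Return True if sample value x is routed to the left child of a split."""
    if operator == "<=":
        return x <= threshold
    if operator == "<":
        return x < threshold
    raise ValueError("operator must be '<=' or '<'")


# Observed training values 1 and 3 give the threshold 2.0, which is itself
# a valid value of an integer-valued feature.
threshold = midpoint_threshold(1, 3)

for x in (1, 2, 3):
    # The two operators agree everywhere except exactly at the threshold.
    print(x, goes_left(x, threshold, "<="), goes_left(x, threshold, "<"))
```

Only the sample with value 2 is routed differently by the two operators, which is why the choice matters for lattice-valued features and is irrelevant when test values never fall exactly on a threshold.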
-
Creating and Leveraging a Synthetic Dataset of Cloud Optical Thickness Measures for Cloud Detection in MSI
Authors:
Aleksis Pirinen,
Nosheen Abid,
Nuria Agues Paszkowsky,
Thomas Ohlson Timoudas,
Ronald Scheirer,
Chiara Ceccobello,
György Kovács,
Anders Persson
Abstract:
Cloud formations often obscure optical satellite-based monitoring of the Earth's surface, thus limiting Earth observation (EO) activities such as land cover mapping, ocean color analysis, and cropland monitoring. The integration of machine learning (ML) methods within the remote sensing domain has significantly improved performance on a wide range of EO tasks, including cloud detection and filtering, but there is still much room for improvement. A key bottleneck is that ML methods typically depend on large amounts of annotated data for training, which is often difficult to come by in EO contexts. This is especially true when it comes to cloud optical thickness (COT) estimation. A reliable estimation of COT enables more fine-grained and application-dependent control compared to using pre-specified cloud categories, as is commonly done in practice. To alleviate the COT data scarcity problem, in this work we propose a novel synthetic dataset for COT estimation, that we subsequently leverage for obtaining reliable and versatile cloud masks on real data. In our dataset, top-of-atmosphere radiances have been simulated for 12 of the spectral bands of the Multispectral Imagery (MSI) sensor onboard Sentinel-2 platforms. These data points have been simulated under consideration of different cloud types, COTs, and ground surface and atmospheric profiles. Extensive experimentation of training several ML models to predict COT from the measured reflectivity of the spectral bands demonstrates the usefulness of our proposed dataset. In particular, by thresholding COT estimates from our ML models, we show on two satellite image datasets (one that is publicly available, and one which we have collected and annotated) that reliable cloud masks can be obtained. The synthetic data, the collected real dataset, code and models have been made publicly available at https://github.com/aleksispi/ml-cloud-opt-thick.
Submitted 15 March, 2024; v1 submitted 23 November, 2023;
originally announced November 2023.
-
mlscorecheck: Testing the consistency of reported performance scores and experiments in machine learning
Authors:
György Kovács,
Attila Fazekas
Abstract:
Addressing the reproducibility crisis in artificial intelligence through the validation of reported experimental results is a challenging task. It necessitates either the reimplementation of techniques or a meticulous assessment of papers for deviations from the scientific method and best statistical practices. To facilitate the validation of reported results, we have developed numerical techniques capable of identifying inconsistencies between reported performance scores and various experimental setups in machine learning problems, including binary/multiclass classification and regression. These consistency tests are integrated into the open-source package mlscorecheck, which also provides specific test bundles designed to detect systematically recurring flaws in various fields, such as retina image processing and synthetic minority oversampling.
Submitted 13 November, 2023;
originally announced November 2023.
-
Testing the Consistency of Performance Scores Reported for Binary Classification Problems
Authors:
Attila Fazekas,
György Kovács
Abstract:
Binary classification is a fundamental task in machine learning, with applications spanning various scientific domains. Whether scientists are conducting fundamental research or refining practical applications, they typically assess and rank classification techniques based on performance metrics such as accuracy, sensitivity, and specificity. However, reported performance scores may not always serve as a reliable basis for research ranking. This can be attributed to undisclosed or unconventional practices related to cross-validation, typographical errors, and other factors. In a given experimental setup, with a specific number of positive and negative test items, most performance scores can assume specific, interrelated values. In this paper, we introduce numerical techniques to assess the consistency of reported performance scores and the assumed experimental setup. Importantly, the proposed approach does not rely on statistical inference but uses numerical methods to identify inconsistencies with certainty. Through three different applications related to medicine, we demonstrate how the proposed techniques can effectively detect inconsistencies, thereby safeguarding the integrity of research fields. To benefit the scientific community, we have made the consistency tests available in an open-source Python package.
Submitted 19 October, 2023;
originally announced October 2023.
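The idea that reported scores must be attainable by some confusion matrix can be sketched with a brute-force check. This is an illustrative assumption about how such a test could work, not the authors' algorithm and not the API of their Python package; the function name `is_consistent` and the score keys `acc`, `sens`, and `spec` are hypothetical. Given the number of positive and negative test items, the sketch enumerates every possible pair of true positive and true negative counts and asks whether any pair reproduces all reported scores within rounding precision.

```python
from itertools import product


def is_consistent(p, n, reported, decimals=4):
    """Check whether reported scores can arise from p positives and n negatives.

    reported maps 'acc', 'sens' and/or 'spec' to values rounded to `decimals`
    places. Returns True if some (tp, tn) pair matches every reported score.
    """
    # Half a unit in the last reported digit, plus slack for float arithmetic.
    eps = 0.5 * 10 ** (-decimals) + 1e-12
    for tp, tn in product(range(p + 1), range(n + 1)):
        scores = {
            "acc": (tp + tn) / (p + n),
            "sens": tp / p,
            "spec": tn / n,
        }
        if all(abs(scores[key] - value) <= eps for key, value in reported.items()):
            return True
    return False


# tp=8, tn=15 reproduces these scores for 10 positives and 20 negatives.
ok = is_consistent(10, 20, {"acc": 0.7667, "sens": 0.8, "spec": 0.75})

# The same sensitivity and specificity cannot coexist with accuracy 0.9.
bad = is_consistent(10, 20, {"acc": 0.9, "sens": 0.8, "spec": 0.75})

print(ok, bad)
```

Because the search is exhaustive, a negative result is certain rather than probabilistic, which matches the abstract's point that no statistical inference is involved. The enumeration costs on the order of p times n score evaluations, so it is only practical for modest test set sizes.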
-
Large Language Models are Few-Shot Health Learners
Authors:
Xin Liu,
Daniel McDuff,
Geza Kovacs,
Isaac Galatzer-Levy,
Jacob Sunshine,
Jiening Zhan,
Ming-Zher Poh,
Shun Liao,
Paolo Di Achille,
Shwetak Patel
Abstract:
Large language models (LLMs) can capture rich representations of concepts that are useful for real-world tasks. However, language alone is limited. While existing LLMs excel at text-based inferences, health applications require that models be grounded in numerical data (e.g., vital signs, laboratory values in clinical domains; steps, movement in the wellness domain) that is not easily or readily expressed as text in existing training corpora. We demonstrate that with only few-shot tuning, a large language model is capable of grounding various physiological and behavioral time-series data and making meaningful inferences on numerous health tasks for both clinical and wellness contexts. Using data from wearable and medical sensor recordings, we evaluate these capabilities on the tasks of cardiac signal analysis, physical activity recognition, metabolic calculation (e.g., calories burned), and estimation of stress reports and mental health screeners.
Submitted 24 May, 2023;
originally announced May 2023.
-
NLP-LTU at SemEval-2023 Task 10: The Impact of Data Augmentation and Semi-Supervised Learning Techniques on Text Classification Performance on an Imbalanced Dataset
Authors:
Sana Sabah Al-Azzawi,
György Kovács,
Filip Nilsson,
Tosin Adewumi,
Marcus Liwicki
Abstract:
In this paper, we propose a methodology for task 10 of SemEval23, focusing on detecting and classifying online sexism in social media posts. The task is tackling a serious issue, as detecting harmful content on social media platforms is crucial for mitigating the harm of these posts on users. Our solution for this task is based on an ensemble of fine-tuned transformer-based models (BERTweet, RoBERTa, and DeBERTa). To alleviate problems related to class imbalance, and to improve the generalization capability of our model, we also experiment with data augmentation and semi-supervised learning. In particular, for data augmentation, we use back-translation, either on all classes, or on the underrepresented classes only. We analyze the impact of these strategies on the overall performance of the pipeline through extensive experiments. For semi-supervised learning, we found that with a substantial amount of unlabelled, in-domain data available, semi-supervised learning can enhance the performance of certain models. Our proposed method (for which the source code is available on GitHub) attains an F1-score of 0.8613 for sub-task A, which ranked us 10th in the competition.
Submitted 25 April, 2023;
originally announced April 2023.
-
Automatic Correction of Human Translations
Authors:
Jessy Lin,
Geza Kovacs,
Aditya Shastry,
Joern Wuebker,
John DeNero
Abstract:
We introduce translation error correction (TEC), the task of automatically correcting human-generated translations. Imperfections in machine translations (MT) have long motivated systems for improving translations post-hoc with automatic post-editing. In contrast, little attention has been devoted to the problem of automatically correcting human translations, despite the intuition that humans make distinct errors that machines would be well-suited to assist with, from typos to inconsistencies in translation conventions. To investigate this, we build and release the Aced corpus with three TEC datasets. We show that human errors in TEC exhibit a more diverse range of errors and far fewer translation fluency errors than the MT errors in automatic post-editing datasets, suggesting the need for dedicated TEC models that are specialized to correct human errors. We show that pre-training instead on synthetic errors based on human errors improves TEC F-score by as much as 5.1 points. We conducted a human-in-the-loop user study with nine professional translation editors and found that the assistance of our TEC system led them to produce significantly higher quality revised translations.
Submitted 17 June, 2022;
originally announced June 2022.
-
HaT5: Hate Language Identification using Text-to-Text Transfer Transformer
Authors:
Sana Sabah Sabry,
Tosin Adewumi,
Nosheen Abid,
György Kovacs,
Foteini Liwicki,
Marcus Liwicki
Abstract:
We investigate the performance of a state-of-the-art (SoTA) architecture T5 (available on the SuperGLUE) and compare it with 3 other previous SoTA architectures across 5 different tasks from 2 relatively diverse datasets. The datasets are diverse in terms of the number and types of tasks they have. To improve performance, we augment the training data by using an autoregressive model. We achieve near-SoTA results on a couple of the tasks - macro F1 scores of 81.66% for task A of the OLID 2019 dataset and 82.54% for task A of the hate speech and offensive content (HASOC) 2021 dataset, where SoTA are 82.9% and 83.05%, respectively. We perform error analysis and explain why one of the models (Bi-LSTM) makes the predictions it does by using a publicly available algorithm: Integrated Gradient (IG). This is because explainable artificial intelligence (XAI) is essential for earning the trust of users. The main contributions of this work are the implementation method of T5, which is discussed; the data augmentation using a new conversational AI model checkpoint, which brought performance improvements; and the revelation of the shortcomings of the HASOC 2021 dataset. It reveals the difficulties of poor data annotation by using a small set of examples where the T5 model made the correct predictions, even when the ground truth of the test set was incorrect (in our opinion). We also provide our model checkpoints on the HuggingFace hub to foster transparency.
Submitted 11 February, 2022;
originally announced February 2022.
-
A general technique for the estimation of farm animal body part weights from CT scans and its applications in a rabbit breeding program
Authors:
Ádám Csóka,
György Kovács,
Virág Ács,
Zsolt Matics,
Zsolt Gerencsér,
Zsolt Szendrő,
István Nagy,
Örs Petneházy,
Imre Repa,
Mariann Moizs,
Tamás Donkó
Abstract:
Various applications of farm animal imaging are based on the estimation of weights of certain body parts and cuts from the CT images of animals. In many cases, the complexity of the problem is increased by the enormous variability of postures in CT images due to the scanning of non-sedated, living animals. In this paper, we propose a general and robust approach for the estimation of the weights of cuts and body parts from the CT images of (possibly) living animals. We adapt multi-atlas based segmentation driven by elastic registration and joint feature and model selection for the regression component to cope with the large number of features and low number of samples. The proposed technique is evaluated and illustrated through real applications in rabbit breeding programs, showing r^2 scores 12% higher than the previous techniques and methods used to drive the selection so far. The proposed technique is easily adaptable to similar problems; consequently, it is shared in an open source software package for the benefit of the community.
Submitted 30 December, 2021;
originally announced December 2021.
-
A new baseline for retinal vessel segmentation: Numerical identification and correction of methodological inconsistencies affecting 100+ papers
Authors:
György Kovács,
Attila Fazekas
Abstract:
In the last 15 years, the segmentation of vessels in retinal images has become an intensively researched problem in medical imaging, with hundreds of algorithms published. One of the de facto benchmarking data sets of vessel segmentation techniques is the DRIVE data set. Since DRIVE contains a predefined split of training and test images, the published performance results of the various segmentation techniques should provide a reliable ranking of the algorithms. Including more than 100 papers in the study, we performed a detailed numerical analysis of the coherence of the published performance scores. We found inconsistencies in the reported scores related to the use of the field of view (FoV), which has a significant impact on the performance scores. We attempted to eliminate the biases using numerical techniques to provide a more realistic picture of the state of the art. Based on the results, we have formulated several findings, most notably: despite the well-defined test set of DRIVE, most rankings in published papers are based on non-comparable figures; in contrast to the near-perfect accuracy scores reported in the literature, the highest accuracy score achieved to date is 0.9582 in the FoV region, which is 1% higher than that of human annotators. The methods we have developed for identifying and eliminating the evaluation biases can be easily applied to other domains where similar problems may arise.
Submitted 6 November, 2021;
originally announced November 2021.
-
Reconstructing Detailed Browsing Activities from Browser History
Authors:
Geza Kovacs
Abstract:
Users' detailed browsing activity - such as what sites they are spending time on and for how long, and what tabs they have open and which one is focused at any given time - is useful for a number of research and practical applications. Gathering such data, however, requires that users install and use a monitoring tool over long periods of time. In contrast, browser extensions can gain instantaneous access to months of browser history data. However, the browser history is incomplete: it records only navigation events, missing important information such as time spent or tab focused. In this work, we aim to reconstruct time spent on sites with only users' browsing histories. We gathered three months of browsing history and two weeks of ground-truth detailed browsing activity from 185 participants. We developed a machine learning algorithm that predicts whether the browser window is focused and active at one-second granularity with an F1-score of 0.84. During periods when the browser is active, the algorithm can predict which domain the user was looking at with 76.2% accuracy. We can use these results to reconstruct the total time spent online for each user with an R^2 value of 0.96, and the total time each user spent on each domain with an R^2 value of 0.92.
Submitted 7 February, 2021;
originally announced February 2021.
-
Edvertisements: Adding Microlearning to Social News Feeds and Websites
Authors:
Geza Kovacs
Abstract:
Many long-term goals, such as learning a language, require people to regularly practice every day to achieve mastery. At the same time, people regularly surf the web and read social news feeds in their spare time. We have built a browser extension that teaches vocabulary to users in the context of Facebook feeds and arbitrary websites, by showing users interactive quizzes they can answer without leaving the website. On Facebook, the quizzes show up as part of the news feed, while on other sites, the quizzes appear where advertisements normally would. In our user study, we examined the effectiveness of inserting microlearning tasks into social news feeds. We compared vocabulary learning rates when we inserted interactive quizzes into feeds, versus inserting links that lead them to a website where they could do the quizzes. Our results suggest that users engage with and learn from our embedded quizzes, and engagement increases when the quizzes can be done directly within their feeds.
Submitted 2 February, 2021;
originally announced February 2021.
-
QuizCram: A Quiz-Driven Lecture Viewing Interface
Authors:
Geza Kovacs,
Darren Edge
Abstract:
QuizCram is an interface for navigating lecture videos that uses quizzes to help users determine what they should view. We developed it in response to observing peaks in video seeking behaviors centered around Coursera's in-video quizzes. QuizCram shows users a question to answer, with an associated video segment. Users can use these questions to navigate through video segments, and find video segments they need to review. We also allow users to review using a timeline of previously answered questions and videos. To encourage users to review the material, QuizCram keeps track of their question-answering and video-watching history and schedules sections they likely have not mastered for review. QuizCram-format materials can be generated from existing lectures with in-video quizzes. Our user study comparing QuizCram to in-video quizzes found that users practice answering and reviewing questions more when using QuizCram, and are better able to remember answers to questions they encountered.
Submitted 2 February, 2021;
originally announced February 2021.
-
Not Now, Ask Later: Users Weaken Their Behavior Change Regimen Over Time, But Expect To Re-Strengthen It Imminently
Authors:
Geza Kovacs,
Zhengxuan Wu,
Michael S. Bernstein
Abstract:
How effectively do we adhere to nudges and interventions that help us control our online browsing habits? If we have a temporary lapse and disable the behavior change system, do we later resume our adherence, or has the dam broken? In this paper, we investigate these questions through log analyses of 8,000+ users on HabitLab, a behavior change platform that helps users reduce their time online. We find that, while users typically begin with high-challenge interventions, over time they allow themselves to slip into easier and easier interventions. Despite this, many still expect to return to the harder interventions imminently: they repeatedly choose to be asked to change difficulty again on the next visit, declining to have the system save their preference for easy interventions.
Submitted 27 January, 2021;
originally announced January 2021.
-
The Impact of Text Presentation on Translator Performance
Authors:
Samuel Läubli,
Patrick Simianer,
Joern Wuebker,
Geza Kovacs,
Rico Sennrich,
Spence Green
Abstract:
Widely used computer-aided translation (CAT) tools divide documents into segments such as sentences and arrange them in a side-by-side, spreadsheet-like view. We present the first controlled evaluation of these design choices on translator performance, measuring speed and accuracy in three experimental text processing tasks. We find significant evidence that sentence-by-sentence presentation enables faster text reproduction and within-sentence error identification compared to unsegmented text, and that a top-and-bottom arrangement of source and target sentences enables faster text reproduction compared to a side-by-side arrangement. For revision, on the other hand, our results suggest that presenting unsegmented text results in the highest accuracy and time efficiency. Our findings have direct implications for best practices in designing CAT tools.
Submitted 11 November, 2020;
originally announced November 2020.
-
Approximately Optimal Binning for the Piecewise Constant Approximation of the Normalized Unexplained Variance (nUV) Dissimilarity Measure
Authors:
Attila Fazekas,
György Kovács
Abstract:
The recently introduced Matching by Tone Mapping (MTM) dissimilarity measure enables template matching under smooth non-linear distortions and also has a well-established mathematical background. MTM operates by binning the template, but the ideal binning for a particular problem is an open question. By pointing out an important analogy between the well known mutual information (MI) and MTM, we introduce the term "normalized unexplained variance" (nUV) for MTM to emphasize its relevance and applicability beyond image processing. Then, we provide theoretical results on the optimal binning technique for the nUV measure and propose algorithms to find approximate solutions. The theoretical findings are supported by numerical experiments. Using the proposed techniques for binning shows a 4-13% increase in terms of AUC scores with statistical significance, enabling us to conclude that the proposed binning techniques have the potential to improve the performance of the nUV measure in real applications.
Submitted 24 July, 2020;
originally announced July 2020.
-
Overly Optimistic Prediction Results on Imbalanced Data: a Case Study of Flaws and Benefits when Applying Over-sampling
Authors:
Gilles Vandewiele,
Isabelle Dehaene,
György Kovács,
Lucas Sterckx,
Olivier Janssens,
Femke Ongenae,
Femke De Backere,
Filip De Turck,
Kristien Roelens,
Johan Decruyenaere,
Sofie Van Hoecke,
Thomas Demeester
Abstract:
Information extracted from electrohysterography recordings could potentially prove to be an interesting additional source of information to estimate the risk of preterm birth. Recently, a large number of studies have reported near-perfect results in distinguishing between recordings of patients who will deliver term or preterm using a public resource called the Term/Preterm Electrohysterogram database. However, we argue that these results are overly optimistic due to a methodological flaw. In this work, we focus on one specific type of methodological flaw: applying over-sampling before partitioning the data into mutually exclusive training and testing sets. We show how this causes the results to be biased using two artificial datasets and reproduce results of studies in which this flaw was identified. Moreover, we evaluate the actual impact of over-sampling on predictive performance, when applied prior to data partitioning, using the same methodologies of related studies, to provide a realistic view of these methodologies' generalization capabilities. We make our research reproducible by providing all the code under an open license.
Submitted 28 November, 2020; v1 submitted 15 January, 2020;
originally announced January 2020.
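The flaw described in this abstract, over-sampling before the train/test split, can be illustrated with a minimal sketch. This is not the authors' code: it assumes plain random duplication of minority rows instead of the over-sampling methods used in the studies under review, and the helper names (`random_oversample`, `split`, `leaked_ids`) and the toy data are hypothetical. With duplication the leak is an exact copy of a record on both sides of the split; with interpolating over-samplers the leak would be synthetic points derived from test records.

```python
import random


def random_oversample(rows, rng):
    """Duplicate minority-class rows at random until the classes are balanced.

    Assumption: simple duplication stands in for any over-sampling method.
    """
    minority = [r for r in rows if r[1] == 1]
    majority = [r for r in rows if r[1] == 0]
    if not minority:
        return list(rows)
    # range() of a negative number is empty, so already-balanced data is unchanged.
    extra = [rng.choice(minority) for _ in range(len(majority) - len(minority))]
    return list(rows) + extra


def split(rows, rng, test_fraction=0.3):
    """Shuffle a copy of the rows and cut it into (train, test)."""
    shuffled = list(rows)
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * (1 - test_fraction))
    return shuffled[:cut], shuffled[cut:]


def leaked_ids(train, test):
    """Record ids that appear on both sides of the split."""
    return {r[0] for r in train} & {r[0] for r in test}


rng = random.Random(0)
# Toy data: rows are (record_id, label), 90 majority and 10 minority records.
data = [(i, 0) for i in range(90)] + [(i, 1) for i in range(90, 100)]

# Flawed order: over-sample first, then split. Copies of one record can land
# in both the training and the test set.
flawed_train, flawed_test = split(random_oversample(data, rng), rng)
flawed_leak = leaked_ids(flawed_train, flawed_test)

# Sound order: split first, then over-sample the training set only.
train, test = split(data, rng)
train = random_oversample(train, rng)
sound_leak = leaked_ids(train, test)

print("ids shared after over-sample-then-split:", len(flawed_leak))
print("ids shared after split-then-over-sample:", len(sound_leak))
```

In the sound order the test set contains only original records that the training set has never seen, so a score measured on it is not inflated by shared records.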
-
Subword Semantic Hashing for Intent Classification on Small Datasets
Authors:
Kumar Shridhar,
Ayushman Dash,
Amit Sahu,
Gustav Grund Pihlgren,
Pedro Alonso,
Vinaychandran Pondenkandath,
Gyorgy Kovacs,
Foteini Simistira,
Marcus Liwicki
Abstract:
In this paper, we introduce the use of Semantic Hashing as an embedding for the task of Intent Classification and achieve state-of-the-art performance on three frequently used benchmarks. Intent Classification on a small dataset is a challenging task for data-hungry state-of-the-art Deep Learning based systems. Semantic Hashing is an attempt to overcome such a challenge and learn robust text classification. Current word-embedding-based methods depend on vocabularies. One of the major drawbacks of such methods is out-of-vocabulary terms, especially when the training dataset is small and the vocabulary in use is wide. This is the case in Intent Classification for chatbots, where typically small datasets are extracted from internet communication. Two problems arise from the use of internet communication. First, such datasets miss too many vocabulary terms for word embeddings to be used efficiently. Second, users frequently make spelling errors. Typically, the models for intent classification are not trained with spelling errors, and it is difficult to anticipate the ways in which users will make mistakes. Models depending on a word vocabulary will always face such issues. An ideal classifier should handle spelling errors inherently. With Semantic Hashing, we overcome these challenges and achieve state-of-the-art results on three datasets: AskUbuntu, Chatbot, and Web Application. Our benchmarks are available online: https://github.com/kumar-shridhar/Know-Your-Intent
Submitted 14 September, 2019; v1 submitted 16 October, 2018;
originally announced October 2018.