-
DMLR: Data-centric Machine Learning Research -- Past, Present and Future
Authors:
Luis Oala,
Manil Maskey,
Lilith Bat-Leah,
Alicia Parrish,
Nezihe Merve Gürel,
Tzu-Sheng Kuo,
Yang Liu,
Rotem Dror,
Danilo Brajovic,
Xiaozhe Yao,
Max Bartolo,
William A Gaviria Rojas,
Ryan Hileman,
Rainier Aliment,
Michael W. Mahoney,
Meg Risdal,
Matthew Lease,
Wojciech Samek,
Debojyoti Dutta,
Curtis G Northcutt,
Cody Coleman,
Braden Hancock,
Bernard Koch,
Girmaw Abebe Tadesse,
Bojan Karlaš
, et al. (13 additional authors not shown)
Abstract:
Drawing from discussions at the inaugural DMLR workshop at ICML 2023 and meetings prior, in this report we outline the relevance of community engagement and infrastructure development for the creation of next-generation public datasets that will advance machine learning science. We chart a path forward as a collective effort to sustain the creation and maintenance of these datasets and methods towards positive scientific, societal and business impact.
Submitted 1 June, 2024; v1 submitted 21 November, 2023;
originally announced November 2023.
-
BON: An extended public domain dataset for human activity recognition
Authors:
Girmaw Abebe Tadesse,
Oliver Bent,
Komminist Weldemariam,
Md. Abrar Istiak,
Taufiq Hasan,
Andrea Cavallaro
Abstract:
A body-worn first-person vision (FPV) camera makes it possible to extract a rich source of information on the environment from the subject's viewpoint. However, research progress in wearable camera-based egocentric office activity understanding is slow compared to other activity environments (e.g., kitchen and outdoor ambulatory), mainly due to the lack of adequate datasets to train more sophisticated (e.g., deep learning) models for human activity recognition in office environments. This paper provides details of a large and publicly available office activity dataset (BON) collected in different office settings across three geographical locations: Barcelona (Spain), Oxford (UK) and Nairobi (Kenya), using a chest-mounted GoPro Hero camera. The BON dataset contains eighteen common office activities that can be categorised into person-to-person interactions (e.g., Chat with colleagues), person-to-object (e.g., Writing on a whiteboard), and proprioceptive (e.g., Walking). Annotation is provided for each 5-second video segment. In total, BON contains 25 subjects and 2639 segments. To facilitate further research in the sub-domain, we also provide results that could be used as baselines for future studies.
Submitted 12 September, 2022;
originally announced September 2022.
-
Model-free feature selection to facilitate automatic discovery of divergent subgroups in tabular data
Authors:
Girmaw Abebe Tadesse,
William Ogallo,
Celia Cintas,
Skyler Speakman
Abstract:
Data-centric AI emphasises the need to clean and understand data in order to achieve trustworthy AI. Existing technologies, such as AutoML, make it easier to design and train models automatically, but there is a lack of a similar level of capabilities to extract data-centric insights. Manual stratification of tabular data per feature (e.g., gender) does not scale to higher feature dimensions, which could be addressed using automatic discovery of divergent subgroups. Nonetheless, these automatic discovery techniques often search across potentially exponential combinations of features, a search that could be simplified using a preceding feature selection step. Existing feature selection techniques for tabular data often involve fitting a particular model in order to select important features. However, such model-based selection is prone to model bias and spurious correlations, in addition to requiring extra resources to design, fine-tune and train a model. In this paper, we propose a model-free and sparsity-based automatic feature selection (SAFS) framework to facilitate automatic discovery of divergent subgroups. Different from filter-based selection techniques, we exploit the sparsity of objective measures among feature values to rank and select features. We validated SAFS across two publicly available datasets (MIMIC-III and Allstate Claims) and compared it with six existing feature selection methods. SAFS achieves a reduction of feature selection time by a factor of 81x and 104x, averaged across the existing methods in the MIMIC-III and Claims datasets respectively. SAFS-selected features are also shown to achieve competitive detection performance, e.g., 18.3% of features selected by SAFS in the Claims dataset detected divergent samples similar to those detected by using the full feature set, with a Jaccard similarity of 0.95 but with a 16x reduction in detection time.
Submitted 8 March, 2022;
originally announced March 2022.
-
Sparsity-based Feature Selection for Anomalous Subgroup Discovery
Authors:
Girmaw Abebe Tadesse,
William Ogallo,
Catherine Wanjiru,
Charles Wachira,
Isaiah Onando Mulang',
Vibha Anand,
Aisha Walcott-Bryant,
Skyler Speakman
Abstract:
Anomalous pattern detection aims to identify instances where deviation from normalcy is evident, and is widely applicable across domains. Multiple anomalous pattern detection techniques have been proposed in the state of the art. However, there is a common lack of a principled and scalable feature selection method for efficient discovery. Existing feature selection techniques are often conducted by optimizing the performance of prediction outcomes rather than their systemic deviations from the expected. In this paper, we propose a sparsity-based automated feature selection (SAFS) framework, which encodes systemic outcome deviations via the sparsity of feature-driven odds ratios. SAFS is a model-agnostic approach with usability across different discovery techniques. SAFS achieves more than $3\times$ reduction in computation time while maintaining detection performance when validated on a publicly available critical care dataset. SAFS also results in superior performance when compared against multiple baselines for feature selection.
Submitted 6 January, 2022;
originally announced January 2022.
-
Automated Supervised Feature Selection for Differentiated Patterns of Care
Authors:
Catherine Wanjiru,
William Ogallo,
Girmaw Abebe Tadesse,
Charles Wachira,
Isaiah Onando Mulang',
Aisha Walcott-Bryant
Abstract:
An automated feature selection pipeline was developed using several state-of-the-art feature selection techniques to select optimal features for Differentiating Patterns of Care (DPOC). The pipeline included three types of feature selection techniques (filters, wrappers and embedded methods) to select the top K features. Five different datasets with binary dependent variables were used and their different top K optimal features selected. The selected features were tested in the existing multi-dimensional subset scanning (MDSS) where the most anomalous subpopulations, most anomalous subsets, propensity scores, and effect of measures were recorded to test their performance. This performance was compared with four similar metrics gained after using all covariates in the dataset in the MDSS pipeline. We found that, regardless of the feature selection technique used, the data distribution is a key consideration when determining which technique to use.
Submitted 5 November, 2021;
originally announced November 2021.
-
DeepMI: Deep Multi-lead ECG Fusion for Identifying Myocardial Infarction and its Occurrence-time
Authors:
Girmaw Abebe Tadesse,
Hamza Javed,
Yong Liu,
** Liu,
Jiyan Chen,
Komminist Weldemariam,
Tingting Zhu
Abstract:
Myocardial Infarction (MI) has the highest mortality of all cardiovascular diseases (CVDs). Detection of MI, and in particular information regarding its occurrence-time, would enable timely interventions that may improve patient outcomes, thereby reducing the global rise in CVD deaths. Electrocardiogram (ECG) recordings are currently used to screen MI patients. However, manual inspection of ECGs is time-consuming and prone to subjective bias. Machine learning methods have been adopted for automated ECG diagnosis, but most approaches require extraction of ECG beats or consider leads independently of one another. We propose an end-to-end deep learning approach, DeepMI, to classify MI from normal cases and to identify the occurrence-time of MI (defined as acute, recent and old), using a collection of fusion strategies on 12 ECG leads at data-, feature-, and decision-level. In order to minimise computational overhead, we employ transfer learning using existing computer vision networks. Moreover, we use recurrent neural networks to encode the longitudinal information inherent in ECGs. We validated DeepMI on a dataset collected from 17,381 patients, in which over 323,000 samples were extracted per ECG lead. We were able to classify normal cases as well as acute, recent and old onset cases of MI, with AUROCs of 96.7%, 82.9%, 68.6% and 73.8%, respectively. We have demonstrated a multi-lead fusion approach to detect the presence and occurrence-time of MI. Our end-to-end framework provides flexibility for different levels of multi-lead ECG fusion and performs feature extraction via transfer learning.
Submitted 31 March, 2021;
originally announced April 2021.
-
Severity Detection Tool for Patients with Infectious Disease
Authors:
Girmaw Abebe Tadesse,
Tingting Zhu,
Nhan Le Nguyen Thanh,
Nguyen Thanh Hung,
Ha Thi Hai Duong,
Truong Huu Khanh,
Pham Van Quang,
Duc Duong Tran,
Lam Minh Yen,
H Rogier Van Doorn,
Nguyen Van Hao,
John Prince,
Hamza Javed,
Dani Kiyasseh,
Le Van Tan,
Louise Thwaites,
David A. Clifton
Abstract:
Hand, foot and mouth disease (HFMD) and tetanus are serious infectious diseases in low and middle income countries. Tetanus in particular has a high mortality rate and its treatment is resource-demanding. Furthermore, HFMD often affects a large number of infants and young children. As a result, its treatment consumes enormous healthcare resources, especially when outbreaks occur. Autonomic nervous system dysfunction (ANSD) is the main cause of death for both HFMD and tetanus patients. However, early detection of ANSD is a challenging problem. In this paper, we aim to provide a proof-of-principle to detect the ANSD level automatically by applying machine learning techniques to physiological patient data, such as electrocardiogram (ECG) and photoplethysmogram (PPG) waveforms, which can be collected using low-cost wearable sensors. Efficient features are extracted that encode variations in the waveforms in the time and frequency domains. A support vector machine is employed to classify the ANSD levels. The proposed approach is validated on multiple datasets of HFMD and tetanus patients in Vietnam. Results show that encouraging performance is achieved in classifying ANSD levels. Moreover, the proposed features are simple and more generalisable, and they outperform the standard heart rate variability (HRV) analysis. The proposed approach would facilitate both the diagnosis and treatment of infectious diseases in low and middle income countries, and thereby improve overall patient care.
Submitted 10 December, 2019;
originally announced December 2019.