-
Sequents, barcodes, and homology
Authors:
Saugata Basu,
Negin Karisani,
Laxmi Parida
Abstract:
We consider the problem of generating hypotheses from data based on ideas from logic. We introduce a notion of barcodes, which we call sequent barcodes, that mirrors the barcodes in persistent homology theory in topological data analysis. We prove a theoretical result on the stability of these barcodes in analogy with similar results in persistent homology theory. Additionally, we show that our new notion of barcodes can be interpreted in terms of a persistent homology of a particular filtration of topological spaces induced by the data. Finally, we discuss a concrete application of the sequent barcodes in a discovery problem arising from the area of cancer genomics.
Submitted 2 August, 2022;
originally announced August 2022.
-
Multi-View Active Learning for Short Text Classification in User-Generated Data
Authors:
Payam Karisani,
Negin Karisani,
Li Xiong
Abstract:
Mining user-generated data often suffers from a lack of labeled data, short document lengths, and informal user language. In this paper, we propose a novel active learning model to overcome these obstacles in tasks tailored for query phrases, e.g., detecting positive reports of natural disasters. Our model has three novelties: 1) It is the first approach to employ multi-view active learning in this domain. 2) It uses the Parzen-Rosenblatt window method to integrate the representativeness measure into multi-view active learning. 3) It employs a query-by-committee strategy, based on the agreement between predictors, to address the usually noisy language of the documents in this domain. We evaluate our model on four publicly available Twitter datasets with distinctly different applications. We also compare our model with a wide range of baselines, including those with multiple classifiers. The experiments show that our model is highly consistent and outperforms existing models.
Submitted 20 December, 2022; v1 submitted 5 December, 2021;
originally announced December 2021.
-
Semi-Supervised Text Classification via Self-Pretraining
Authors:
Payam Karisani,
Negin Karisani
Abstract:
We present a neural semi-supervised learning model termed Self-Pretraining. Our model is inspired by the classic self-training algorithm. However, as opposed to self-training, Self-Pretraining is threshold-free, can potentially update its belief about previously labeled documents, and can cope with the semantic drift problem. Self-Pretraining is iterative and consists of two classifiers. In each iteration, one classifier draws a random set of unlabeled documents and labels them. This set is used to initialize the second classifier, which is then further trained on the set of labeled documents. The algorithm proceeds to the next iteration and the classifiers' roles are reversed. To improve the flow of information across the iterations and also to cope with the semantic drift problem, Self-Pretraining employs an iterative distillation process, transfers hypotheses across the iterations, utilizes a two-stage training model, uses an efficient learning rate schedule, and employs a pseudo-label transformation heuristic. We have evaluated our model on three publicly available social media datasets. Our experiments show that Self-Pretraining outperforms the existing state-of-the-art semi-supervised classifiers across multiple settings. Our code is available at https://github.com/p-karisani/self_pretraining.
Submitted 30 September, 2021;
originally announced September 2021.
-
Inferring COVID-19 Biological Pathways from Clinical Phenotypes via Topological Analysis
Authors:
Negin Karisani,
Daniel E. Platt,
Saugata Basu,
Laxmi Parida
Abstract:
COVID-19 has caused thousands of deaths around the world and also resulted in a large international economic disruption. Identifying the pathways associated with this illness can help medical researchers to better understand the properties of the condition. This process can be carried out by analyzing medical records. It is crucial to develop tools and models that can aid researchers with this process in a timely manner. However, medical records are often unstructured clinical notes, and this poses significant challenges to developing automated systems. In this article, we propose a pipeline to aid practitioners in analyzing clinical notes and revealing the pathways associated with this disease. Our pipeline relies on topological properties and consists of three steps: 1) pre-processing the clinical notes to extract the salient concepts, 2) constructing a feature space of the patients to characterize the extracted concepts, and finally, 3) leveraging the topological properties to distill the available knowledge and visualize the result. Our experiments on a publicly available dataset of COVID-19 clinical notes show that our pipeline can indeed extract meaningful pathways.
Submitted 1 May, 2022; v1 submitted 18 January, 2021;
originally announced January 2021.
-
Mining Coronavirus (COVID-19) Posts in Social Media
Authors:
Negin Karisani,
Payam Karisani
Abstract:
The World Health Organization (WHO) characterized the novel coronavirus (COVID-19) as a global pandemic on March 11th, 2020. Before this, in late January, more specifically on January 27th, while the majority of the infection cases were still reported in China and on a few cruise ships, we began crawling social media user postings using the Twitter search API. Our goal was to leverage machine learning and linguistic tools to better understand the impact of the outbreak in China. Contrary to our initial expectation of monitoring a local outbreak, COVID-19 rapidly spread across the globe. In this short article, we report the preliminary results of our study on automatically detecting positive reports of COVID-19 from social media user postings using state-of-the-art machine learning models.
Submitted 28 March, 2020;
originally announced April 2020.