Search | arXiv e-print repository

Explainable Multi-Label Classification of MBTI Types

Abstract: In this study, we aim to identify the most effective machine learning model for accurately classifying Myers-Briggs Type Indicator (MBTI) types from Reddit posts and a Kaggle data set. We apply multi-label classification using the Binary Relevance method. We use an Explainable Artificial Intelligence (XAI) approach to highlight the transparency and understandability of the process and result. To achieve this, we experiment with glass-box learning models, i.e. models designed for simplicity, transparency, and interpretability. We selected k-Nearest Neighbour, Multinomial Naive Bayes, and Logistic Regression as the glass-box models. We show that Multinomial Naive Bayes and k-Nearest Neighbour perform better if classes with Observer (S) traits are excluded, whereas Logistic Regression obtains its best results when all classes have > 550 entries.

Submitted 7 May, 2024; v1 submitted 2 May, 2024; originally announced May 2024.

Comments: 22 pages, 12 tables, 2 figures

ACM Class: I.2.6
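
The Binary Relevance method named in the abstract above turns one multi-label problem into several independent binary problems, one per label. The sketch below illustrates only the label-decomposition step. It assumes that each of the four MBTI letter pairs is treated as one binary label; the abstract does not state the encoding used in the paper, and the function names are hypothetical.

```python
# Assumption: each MBTI dichotomy is one binary label (not confirmed by the abstract).
AXES = [("I", "E"), ("N", "S"), ("T", "F"), ("J", "P")]


def to_binary_labels(mbti_type):
    """Map a 4-letter type such as 'INTJ' to four 0/1 labels.

    A label is 1 when the letter is the first member of its axis pair.
    """
    mbti_type = mbti_type.upper()
    if len(mbti_type) != 4:
        raise ValueError("expected a 4-letter MBTI type")
    labels = []
    for letter, (first, second) in zip(mbti_type, AXES):
        if letter not in (first, second):
            raise ValueError(f"unexpected letter {letter!r}")
        labels.append(1 if letter == first else 0)
    return labels


def from_binary_labels(labels):
    """Recombine four 0/1 predictions into a 4-letter type."""
    return "".join(
        first if bit == 1 else second
        for bit, (first, second) in zip(labels, AXES)
    )


def binary_relevance_targets(types):
    """Return one target column per axis.

    Under Binary Relevance, each column would be the training target
    of its own independent binary classifier.
    """
    rows = [to_binary_labels(t) for t in types]
    return [list(column) for column in zip(*rows)]
```

Each column returned by `binary_relevance_targets` could then be paired with the same feature matrix to train a separate classifier, and the four predictions recombined with `from_binary_labels`.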

arXiv:2401.13805 [pdf]

Longitudinal Sentiment Topic Modelling of Reddit Posts

Authors: Fabian Nwaoha, Ziyad Gaffar, Ho Joon Chun, Marina Sokolova

Abstract: In this study, we analyze texts of Reddit posts written by students of four major Canadian universities. We gauge the emotional tone and uncover prevailing themes and discussions through longitudinal topic modeling of the posts' textual data. Our study focuses on four years, 2020-2023, covering the COVID-19 pandemic and post-pandemic years. Our results highlight a gradual uptick in discussions related to mental health.

Submitted 24 January, 2024; originally announced January 2024.

Comments: 21 pages, 4 figures, 13 tables. arXiv admin note: text overlap with arXiv:2401.12382

ACM Class: I.2.7

arXiv:2401.12382 [pdf]

Longitudinal Sentiment Classification of Reddit Posts

Authors: Fabian Nwaoha, Ziyad Gaffar, Ho Joon Chun, Marina Sokolova

Abstract: We report results of a longitudinal sentiment classification of Reddit posts written by students of four major Canadian universities. We work with the texts of the posts, concentrating on the years 2020-2023. By fine-tuning the sentiment threshold to the range [-0.075, 0.075], we successfully built classifiers proficient in categorizing post sentiments into positive and negative categories. Notably, our sentiment classification results are consistent across the four university data sets.

Submitted 22 January, 2024; originally announced January 2024.

Comments: 11 pages, 10 figures, 4 tables

ACM Class: I.2.6
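
The sentiment threshold in the abstract of the entry above can be illustrated with a small labelling function. This is a sketch under an assumption: scores strictly above the upper bound are labelled positive, scores strictly below the lower bound are labelled negative, and scores inside the band are left unlabelled. The abstract gives only the range [-0.075, 0.075] and does not say how in-band scores are handled; the function name is hypothetical.

```python
# Band taken from the abstract; the treatment of in-band scores is an assumption.
LOW, HIGH = -0.075, 0.075


def label_sentiment(score, low=LOW, high=HIGH):
    """Return 'positive', 'negative', or None for a polarity score.

    None marks a score inside the closed band [low, high], which this
    sketch treats as too weak to assign to either class.
    """
    if score > high:
        return "positive"
    if score < low:
        return "negative"
    return None


scores = [0.5, -0.2, 0.0, 0.075]
labels = [label_sentiment(s) for s in scores]
```

Widening or narrowing the band trades the number of labelled posts against the strength of the sentiment signal in each class.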

arXiv:2205.06863 [pdf]

Sentiment Analysis of Covid-related Reddits

Authors: Yilin Yang, Tomas Fieg, Marina Sokolova

Abstract: This paper focuses on Sentiment Analysis of Covid-19 related messages from the r/Canada and r/Unitedkingdom subreddits of Reddit. We apply manual annotation and three Machine Learning algorithms to analyze sentiments conveyed in those messages. We use VADER and TextBlob to label messages for Machine Learning experiments. Our results show that removal of the shortest and longest messages improves VADER and TextBlob agreement on positive sentiments and the F-score of sentiment classification by all three algorithms.

Submitted 13 May, 2022; originally announced May 2022.

Comments: 10 pages, 1 figure, 5 tables

ACM Class: I.2.7; I.2.6

arXiv:2108.06215 [pdf]

Sentiment Analysis of the COVID-related r/Depression Posts

Authors: Zihan Chen, Marina Sokolova

Abstract: Reddit.com is a popular social media platform among young people. Reddit users share their stories to seek support from other users, especially during the Covid-19 pandemic. Messages posted on Reddit and their content have provided researchers with an opportunity to analyze public concerns. In this study, we analyzed sentiments of COVID-related messages posted on r/Depression. Our study poses the following questions: a) What are the common topics that the Reddit users discuss? b) Can we use these topics to classify sentiments of the posts? c) What matters concern people more during the pandemic? Key Words: Sentiment Classification, Depression, COVID-19, Reddit, LDA, BERT

Submitted 28 July, 2021; originally announced August 2021.

Comments: 16 pages, 7 figures, 5 tables, 1 appendix

ACM Class: I.2; I.2.7

arXiv:2105.13430 [pdf]

Explainable Multi-class Classification of the CAMH COVID-19 Mental Health Data

Authors: YuanZheng Hu, Marina Sokolova

Abstract: Application of Machine Learning algorithms to the medical domain is an emerging trend that helps to advance medical knowledge. At the same time, there is a significant lack of explainable studies that promote informed, transparent, and interpretable use of Machine Learning algorithms. In this paper, we present explainable multi-class classification of the Covid-19 mental health data. In this Machine Learning study, we aim to find the potential factors that influence personal mental health during the Covid-19 pandemic. We found that Random Forest (RF) and Gradient Boosting (GB) scored the highest accuracies of 68.08% and 68.19% respectively, with LIME prediction accuracy 65.5% for RF and 61.8% for GB. We then compare a Post-hoc system (Local Interpretable Model-Agnostic Explanations, or LIME) and an Ante-hoc system (Gini Importance) in their ability to explain the obtained Machine Learning results. To the best of the authors' knowledge, our study is the first explainable Machine Learning study of the mental health data collected during the Covid-19 pandemic.

Submitted 27 May, 2021; originally announced May 2021.

Comments: 22 pages, including Appendixes; 7 tables and 5 figures in the main text

ACM Class: I.2

arXiv:2012.14059 [pdf]

Convolutional Neural Networks in Multi-Class Classification of Medical Data

Authors: YuanZheng Hu, Marina Sokolova

Abstract: We report applications of Convolutional Neural Networks (CNN) to multi-class classification of a large medical data set. We discuss in detail how changes in the CNN model and the data pre-processing impact the classification results. In the end, we introduce an ensemble model that consists of both deep learning (CNN) and shallow learning models (Gradient Boosting). The method achieves an Accuracy of 64.93, the highest three-class classification accuracy we achieved in this study. Our results also show that CNN and the ensemble consistently obtain a higher Recall than Precision. The highest Recall is 68.87, whereas the highest Precision is 65.04.

Submitted 27 December, 2020; originally announced December 2020.

Comments: 13 pages; 14 tables

ACM Class: I.2; J.3

arXiv:2012.13796 [pdf]

Explainable Multi-class Classification of Medical Data

Authors: YuanZheng Hu, Marina Sokolova

Abstract: Machine Learning applications have brought new insights into a secondary analysis of medical data. Machine Learning helps to develop new drugs, define populations susceptible to certain illnesses, and identify predictors of many common diseases. At the same time, Machine Learning results depend on convolution of many factors, including feature selection, class (im)balance, algorithm preference, and performance metrics. In this paper, we present explainable multi-class classification of a large medical data set. We discuss in detail knowledge-based feature engineering, data set balancing, best model selection, and parameter tuning. Six algorithms are used in this study: Support Vector Machine (SVM), Naïve Bayes, Gradient Boosting, Decision Trees, Random Forest, and Logistic Regression. Our empirical evaluation is done on the UCI Diabetes 130-US hospitals for years 1999-2008 dataset, with the task to classify patient hospital re-admission stay into three classes: 0 days, < 30 days, or > 30 days. Our results show that using 23 medication features in learning experiments improves Recall of five out of the six applied learning algorithms. This is a new result that expands the previous studies conducted on the same data. Gradient Boosting and Random Forest outperformed other algorithms in terms of the three-class classification Accuracy.

Submitted 26 December, 2020; originally announced December 2020.

Comments: 21 pages; 23 tables; 2 appendixes

ACM Class: I.2; J.3

arXiv:2010.09574 [pdf]

Machine Learning Evaluation of the Echo-Chamber Effect in Medical Forums

Authors: Marina Sokolova, Victoria Bobicev

Abstract: We propose the Echo-Chamber Effect assessment of an online forum. Sentiments perceived by the forum readers are at the core of the analysis; a complete message is the unit of the study. We build 14 models and apply those to represent discussions gathered from an online medical forum. We use four multi-class sentiment classification applications and two Machine Learning algorithms to evaluate the prowess of the assessment models.

Submitted 19 October, 2020; originally announced October 2020.

Comments: 17 pages, including Appendix; 6 figures in the main text; 5 tables in the main text and 7 tables in Appendix

ACM Class: I.2.6; I.2.m

arXiv:1805.00352 [pdf]

Word2Vec and Doc2Vec in Unsupervised Sentiment Analysis of Clinical Discharge Summaries

Authors: Qufei Chen, Marina Sokolova

Abstract: In this study, we explored the application of Word2Vec and Doc2Vec for sentiment analysis of clinical discharge summaries. We applied unsupervised learning since the data sets did not have sentiment annotations. Note that unsupervised learning is a more realistic scenario than supervised learning, which requires access to a training set of sentiment-annotated data. We aim to detect if there exists any underlying bias towards or against a certain disease. We used SentiWordNet to establish a gold sentiment standard for the data sets and evaluate performance of the Word2Vec and Doc2Vec methods. We have shown that the Word2Vec and Doc2Vec methods complement each other's results in sentiment analysis of the data sets.

Submitted 1 May, 2018; originally announced May 2018.

Comments: 23 pages, 3 figures, 16 tables

MSC Class: 68T05; 68T50

arXiv:1803.06390 [pdf]

Corpus Statistics in Text Classification of Online Data

Authors: Marina Sokolova, Victoria Bobicev

Abstract: Transformation of Machine Learning (ML) from a boutique science to a generally accepted technology has increased the importance of reproduction and transportability of ML studies. In the current work, we investigate how corpus characteristics of textual data sets correspond to text classification results. We work with two data sets gathered from sub-forums of an online health-related forum. Our empirical results are obtained for a multi-class sentiment analysis application.

Submitted 16 March, 2018; originally announced March 2018.

Comments: 12 pages, 6 tables, 1 figure

MSC Class: 68T05; 68T50

arXiv:1802.09059 [pdf, other]

One Single Deep Bidirectional LSTM Network for Word Sense Disambiguation of Text Data

Authors: Ahmad Pesaranghader, Ali Pesaranghader, Stan Matwin, Marina Sokolova

Abstract: Due to recent technical and scientific advances, we have a wealth of information hidden in unstructured text data such as offline/online narratives, research articles, and clinical reports. To mine these data properly, given their innate ambiguity, a Word Sense Disambiguation (WSD) algorithm can avoid a number of difficulties in a Natural Language Processing (NLP) pipeline. However, considering a large number of ambiguous words in one language or technical domain, we may encounter limiting constraints for proper deployment of existing WSD models. This paper attempts to address the problem of one-classifier-per-one-word WSD algorithms by proposing a single Bidirectional Long Short-Term Memory (BLSTM) network which, by considering senses and context sequences, works on all ambiguous words collectively. Evaluating on the SensEval-3 benchmark, we show that the result of our model is comparable with top-performing WSD algorithms. We also discuss how applying additional modifications alleviates the model fault and the need for more training data.

Submitted 25 February, 2018; originally announced February 2018.

Comments: 12 pages, 1 figure, to appear in the Proceedings of the 31st Canadian Conference on Artificial Intelligence, 8-11 May, 2018, Toronto, Canada

arXiv:1702.08866 [pdf]

Studying Positive Speech on Twitter

Authors: Marina Sokolova, Vera Sazonova, Kanyi Huang, Rudraneel Chakraboty, Stan Matwin

Abstract: We present results of empirical studies on positive speech on Twitter. By positive speech we understand speech that works for the betterment of a given situation, in this case relations between different communities in a conflict-prone country. We worked with four Twitter data sets. Through semi-manual opinion mining, we found that positive speech accounted for < 1% of the data. In fully automated studies, we tested two approaches: unsupervised statistical analysis, and supervised text classification based on distributed word representation. We discuss benefits and challenges of those approaches and report empirical evidence obtained in the study.

Submitted 24 February, 2017; originally announced February 2017.

Comments: 13 pages, 6 tables

ACM Class: I.2.6; I.2.7

arXiv:1608.02519 [pdf]

Topic Modelling and Event Identification from Twitter Textual Data

Authors: Marina Sokolova, Kanyi Huang, Stan Matwin, Joshua Ramisch, Vera Sazonova, Renee Black, Chris Orwa, Sidney Ochieng, Nanjira Sambuli

Abstract: The tremendous growth of social media content on the Internet has inspired the development of text analytics to understand and solve real-life problems. Leveraging statistical topic modelling helps researchers and practitioners better comprehend textual content and provides useful information for further analysis. Statistical topic modelling becomes especially important when we work with large volumes of dynamic text, e.g., Facebook or Twitter datasets. In this study, we summarize the message content of four data sets of Twitter messages relating to challenging social events in Kenya. We use Latent Dirichlet Allocation (LDA) topic modelling to analyze the content. Our study uses two evaluation measures, Normalized Mutual Information (NMI) and topic coherence analysis, to select the best LDA models. The obtained LDA results show that the tool can be effectively used to extract discussion topics and summarize them for further manual analysis.

Submitted 8 August, 2016; originally announced August 2016.

Comments: 17 pages, 2 figures, 5 tables

ACM Class: D.4.8; H.1.2; H.2.8; I.2.7

arXiv:1602.01937 [pdf]

doi 10.5121/ijsptm.2013.2402

YOURPRIVACYPROTECTOR, A recommender system for privacy settings in social networks

Authors: Kambiz Ghazinour, Stan Matwin, Marina Sokolova

Abstract: Ensuring privacy of users of social networks is probably an unsolvable conundrum. At the same time, an informed use of the existing privacy options by the social network participants may alleviate - or even prevent - some of the more drastic privacy-averse incidents. Unfortunately, recent surveys show that an average user is either not aware of these options or does not use them, probably due to their perceived complexity. It is therefore reasonable to believe that tools assisting users with two tasks, 1) understanding their social net behavior in terms of their privacy settings and broad privacy categories, and 2) recommending reasonable privacy options, will be valuable for everyday privacy practice in a social network context. This paper presents YourPrivacyProtector, a recommender system that shows how simple machine learning techniques may provide useful assistance in these two tasks to Facebook users. We support our claim with empirical results of application of YourPrivacyProtector to two groups of Facebook users.

Submitted 5 February, 2016; originally announced February 2016.

Comments: 15 pages, International journal of security, privacy and trust management. (IJSPTM) Volume 2, No 4, Aug. 2013

Journal ref: International journal of security, privacy and trust management. (IJSPTM) Volume 2, No 4, Aug. 2013

arXiv:1503.07795 [pdf]

Multi-Labeled Classification of Demographic Attributes of Patients: a case study of diabetics patients

Authors: Naveen Kumar Parachur Cotha, Marina Sokolova

Abstract: Automated learning of patients' demographics can be seen as a multi-label problem where a patient model is based on different race and gender groups. The resulting model can be further integrated into Privacy-Preserving Data Mining, where it can be used to assess risk of identification of different patient groups. Our project considers relations between diabetes and demographics of patients as a multi-label problem. Most research in this area has been done as binary classification, where the target class indicates whether a person has diabetes or not. However, very little, and perhaps no, work has been done on multi-label analysis of the demographics of patients who are likely to be diagnosed with diabetes. To identify such groups, we applied ensembles of several multi-label learning algorithms.

Submitted 26 March, 2015; originally announced March 2015.

Comments: 16 pages, 9 tables

Showing 1–16 of 16 results for author: Sokolova, M