arXiv:2406.07759 [pdf, other]

LT4SG@SMM4H24: Tweets Classification for Digital Epidemiology of Childhood Health Outcomes Using Pre-Trained Language Models

Authors: Dasun Athukoralage, Thushari Atapattu, Menasha Thilakaratne, Katrina Falkner

Abstract: This paper presents our approaches for the SMM4H24 Shared Task 5 on the binary classification of English tweets reporting children's medical disorders. Our first approach involves fine-tuning a single RoBERTa-large model, while the second approach entails ensembling the results of three fine-tuned BERTweet-large models. We demonstrate that although both approaches exhibit identical performance on validation data, the BERTweet-large ensemble excels on test data. Our best-performing system achieves an F1-score of 0.938 on test data, outperforming the benchmark classifier by 1.18%.

Submitted 11 June, 2024; originally announced June 2024.

Comments: Submitted for the 9th Social Media Mining for Health Research and Applications Workshop and Shared Tasks- Large Language Models (LLMs) and Generalizability for Social Media NLP

arXiv:2208.08486 [pdf, other]

EmoMent: An Emotion Annotated Mental Health Corpus from two South Asian Countries

Authors: Thushari Atapattu, Mahen Herath, Charitha Elvitigala, Piyanjali de Zoysa, Kasun Gunawardana, Menasha Thilakaratne, Kasun de Zoysa, Katrina Falkner

Abstract: People often utilise online media (e.g., Facebook, Reddit) as a platform to express their psychological distress and seek support. State-of-the-art NLP techniques demonstrate strong potential to automatically detect mental health issues from text. Research suggests that mental health issues are reflected in emotions (e.g., sadness) indicated in a person's choice of language. Therefore, we developed a novel emotion-annotated mental health corpus (EmoMent), consisting of 2802 Facebook posts (14845 sentences) extracted from two South Asian countries - Sri Lanka and India. Three clinical psychology postgraduates were involved in annotating these posts into eight categories, including 'mental illness' (e.g., depression) and emotions (e.g., 'sadness', 'anger'). EmoMent corpus achieved 'very good' inter-annotator agreement of 98.3% (i.e. % with two or more agreement) and Fleiss' Kappa of 0.82. Our RoBERTa based models achieved an F1 score of 0.76 and a macro-averaged F1 score of 0.77 for the first task (i.e. predicting a mental health condition from a post) and the second task (i.e. extent of association of relevant posts with the categories defined in our taxonomy), respectively.

Submitted 17 August, 2022; originally announced August 2022.

Comments: This work has been accepted to appear at COLING 2022 Conference

arXiv:2012.02565 [pdf, other]

Automated Detection of Cyberbullying Against Women and Immigrants and Cross-domain Adaptability

Authors: Thushari Atapattu, Mahen Herath, Georgia Zhang, Katrina Falkner

Abstract: Cyberbullying is a prevalent and growing social problem due to the surge of social media technology usage. Minorities, women, and adolescents are among the common victims of cyberbullying. Despite the advancement of NLP technologies, the automated cyberbullying detection remains challenging. This paper focuses on advancing the technology using state-of-the-art NLP techniques. We use a Twitter dataset from SemEval 2019 - Task 5(HatEval) on hate speech against women and immigrants. Our best performing ensemble model based on DistilBERT has achieved 0.73 and 0.74 of F1 score in the task of classifying hate speech (Task A) and aggressiveness and target (Task B) respectively. We adapt the ensemble model developed for Task A to classify offensive language in external datasets and achieved ~0.7 of F1 score using three benchmark datasets, enabling promising results for cross-domain adaptability. We conduct a qualitative analysis of misclassified tweets to provide insightful recommendations for future cyberbullying research.

Submitted 4 December, 2020; originally announced December 2020.

arXiv:2010.06640 [pdf, other]

Enhancing the Identification of Cyberbullying through Participant Roles

Authors: Gathika Ratnayaka, Thushari Atapattu, Mahen Herath, Georgia Zhang, Katrina Falkner

Abstract: Cyberbullying is a prevalent social problem that inflicts detrimental consequences to the health and safety of victims such as psychological distress, anti-social behaviour, and suicide. The automation of cyberbullying detection is a recent but widely researched problem, with current research having a strong focus on a binary classification of bullying versus non-bullying. This paper proposes a novel approach to enhancing cyberbullying detection through role modeling. We utilise a dataset from ASKfm to perform multi-class classification to detect participant roles (e.g. victim, harasser). Our preliminary results demonstrate promising performance including 0.83 and 0.76 of F1-score for cyberbullying and role classification respectively, outperforming baselines.

Submitted 22 October, 2020; v1 submitted 13 October, 2020; originally announced October 2020.

arXiv:2007.10744 [pdf, ps, other]

Beyond Accuracy: Assessing Software Documentation Quality

Authors: Christoph Treude, Justin Middleton, Thushari Atapattu

Abstract: Good software documentation encourages good software engineering, but the meaning of "good" documentation is vaguely defined in the software engineering literature. To clarify this ambiguity, we draw on work from the data and information quality community to propose a framework that decomposes documentation quality into ten dimensions of structure, content, and style. To demonstrate its application, we recruited technical editors to apply the framework when evaluating examples from several genres of software documentation. We summarise their assessments -- for example, reference documentation and README files excel in quality whereas blog articles have more problems -- and we describe our vision for reasoning about software documentation quality and for the expansion and potential of a unified quality framework.

Submitted 8 September, 2020; v1 submitted 21 July, 2020; originally announced July 2020.

Comments: to appear in the Visions and Reflections Track of the ACM Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering 2020

arXiv:1903.03286 [pdf]

An Identification of Learners' Confusion through Language and Discourse Analysis

Authors: Thushari Atapattu, Katrina Falkner, Menasha Thilakaratne, Lavendini Sivaneasharajah, Rangana Jayashanka

Abstract: The substantial growth of online learning, in particular, Massively Open Online Courses (MOOCs), supports research into the development of better models for effective learning. Learner 'confusion' is among one of the identified aspects which impacts the overall learning process, and ultimately, course attrition. Confusion for a learner is an individual state of bewilderment and uncertainty of how to move forward. The majority of recent works neglect the 'individual' factor and measure the influence of community-related aspects (e.g. votes, views) for confusion classification. While this is a useful measure, as the popularity of one's post can indicate that many other students have similar confusion regarding course topics, these models neglect the personalised context, such as individual's affect or emotions. Certain physiological aspects (e.g. facial expressions, heart rate) have been utilised to classify confusion in small to medium classrooms. However, these techniques are challenging to adopt to MOOCs. To bridge this gap, we propose an approach solely based on language and discourse aspects of learners, which outperforms the previous models. We contribute through the development of a novel linguistic feature set that is predictive for confusion classification. We train the confusion classifier using one domain, successfully applying it across other domains.

Submitted 8 March, 2019; originally announced March 2019.

arXiv:1802.06997 [pdf, other]

Categorizing the Content of GitHub README Files

Authors: Gede Artha Azriadi Prana, Christoph Treude, Ferdian Thung, Thushari Atapattu, David Lo

Abstract: README files play an essential role in shaping a developer's first impression of a software repository and in documenting the software project that the repository hosts. Yet, we lack a systematic understanding of the content of a typical README file as well as tools that can process these files automatically. To close this gap, we conduct a qualitative study involving the manual annotation of 4,226 README file sections from 393 randomly sampled GitHub repositories and we design and evaluate a classifier and a set of features that can categorize these sections automatically. We find that information discussing the 'What' and 'How' of a repository is very common, while many README files lack information regarding the purpose and status of a repository. Our multi-label classifier which can predict eight different categories achieves an F1 score of 0.746. To evaluate the usefulness of the classification, we used the automatically determined classes to label sections in GitHub README files using badges and showed files with and without these badges to twenty software professionals. The majority of participants perceived the automated labeling of sections based on our classifier to ease information discovery. This work enables the owners of software repositories to improve the quality of their documentation and it has the potential to make it easier for the software development community to discover relevant information in GitHub README files.

Submitted 30 July, 2018; v1 submitted 20 February, 2018; originally announced February 2018.

Showing 1–7 of 7 results for author: Atapattu, T