-
MasonTigers at SemEval-2024 Task 10: Emotion Discovery and Flip Reasoning in Conversation with Ensemble of Transformers and Prompting
Authors:
Al Nahian Bin Emran,
Amrita Ganguly,
Sadiya Sayara Chowdhury Puspo,
Nishat Raihan,
Dhiman Goswami
Abstract:
In this paper, we present MasonTigers' participation in SemEval-2024 Task 10, a shared task aimed at identifying emotions and understanding the rationale behind their flips within monolingual English and Hindi-English code-mixed dialogues. This task comprises three distinct subtasks - emotion recognition in conversation for Hindi-English code-mixed dialogues, emotion flip reasoning for Hindi-Engli…
▽ More
In this paper, we present MasonTigers' participation in SemEval-2024 Task 10, a shared task aimed at identifying emotions and understanding the rationale behind their flips within monolingual English and Hindi-English code-mixed dialogues. This task comprises three distinct subtasks - emotion recognition in conversation for Hindi-English code-mixed dialogues, emotion flip reasoning for Hindi-English code-mixed dialogues, and emotion flip reasoning for English dialogues. Our team, MasonTigers, contributed to each subtask, focusing on develo** methods for accurate emotion recognition and reasoning. By leveraging our approaches, we attained impressive F1-scores of 0.78 for the first task and 0.79 for both the second and third tasks. This performance not only underscores the effectiveness of our methods across different aspects of the task but also secured us the top rank in the first and third subtasks, and the 2nd rank in the second subtask. Through extensive experimentation and analysis, we provide insights into our system's performance and contributions to each subtask.
△ Less
Submitted 29 June, 2024;
originally announced July 2024.
-
EmoMix-3L: A Code-Mixed Dataset for Bangla-English-Hindi Emotion Detection
Authors:
Nishat Raihan,
Dhiman Goswami,
Antara Mahmud,
Antonios Anastasopoulos,
Marcos Zampieri
Abstract:
Code-mixing is a well-studied linguistic phenomenon that occurs when two or more languages are mixed in text or speech. Several studies have been conducted on building datasets and performing downstream NLP tasks on code-mixed data. Although it is not uncommon to observe code-mixing of three or more languages, most available datasets in this domain contain code-mixed data from only two languages.…
▽ More
Code-mixing is a well-studied linguistic phenomenon that occurs when two or more languages are mixed in text or speech. Several studies have been conducted on building datasets and performing downstream NLP tasks on code-mixed data. Although it is not uncommon to observe code-mixing of three or more languages, most available datasets in this domain contain code-mixed data from only two languages. In this paper, we introduce EmoMix-3L, a novel multi-label emotion detection dataset containing code-mixed data from three different languages. We experiment with several models on EmoMix-3L and we report that MuRIL outperforms other models on this dataset.
△ Less
Submitted 11 May, 2024;
originally announced May 2024.
-
CSEPrompts: A Benchmark of Introductory Computer Science Prompts
Authors:
Nishat Raihan,
Dhiman Goswami,
Sadiya Sayara Chowdhury Puspo,
Christian Newman,
Tharindu Ranasinghe,
Marcos Zampieri
Abstract:
Recent advances in AI, machine learning, and NLP have led to the development of a new generation of Large Language Models (LLMs) that are trained on massive amounts of data and often have trillions of parameters. Commercial applications (e.g., ChatGPT) have made this technology available to the general public, thus making it possible to use LLMs to produce high-quality texts for academic and profe…
▽ More
Recent advances in AI, machine learning, and NLP have led to the development of a new generation of Large Language Models (LLMs) that are trained on massive amounts of data and often have trillions of parameters. Commercial applications (e.g., ChatGPT) have made this technology available to the general public, thus making it possible to use LLMs to produce high-quality texts for academic and professional purposes. Schools and universities are aware of the increasing use of AI-generated content by students and they have been researching the impact of this new technology and its potential misuse. Educational programs in Computer Science (CS) and related fields are particularly affected because LLMs are also capable of generating programming code in various programming languages. To help understand the potential impact of publicly available LLMs in CS education, we introduce CSEPrompts, a framework with hundreds of programming exercise prompts and multiple-choice questions retrieved from introductory CS and programming courses. We also provide experimental results on CSEPrompts to evaluate the performance of several LLMs with respect to generating Python code and answering basic computer science and programming questions.
△ Less
Submitted 4 April, 2024; v1 submitted 3 April, 2024;
originally announced April 2024.
-
MasonTigers at SemEval-2024 Task 1: An Ensemble Approach for Semantic Textual Relatedness
Authors:
Dhiman Goswami,
Sadiya Sayara Chowdhury Puspo,
Md Nishat Raihan,
Al Nahian Bin Emran,
Amrita Ganguly,
Marcos Zampieri
Abstract:
This paper presents the MasonTigers entry to the SemEval-2024 Task 1 - Semantic Textual Relatedness. The task encompasses supervised (Track A), unsupervised (Track B), and cross-lingual (Track C) approaches across 14 different languages. MasonTigers stands out as one of the two teams who participated in all languages across the three tracks. Our approaches achieved rankings ranging from 11th to 21…
▽ More
This paper presents the MasonTigers entry to the SemEval-2024 Task 1 - Semantic Textual Relatedness. The task encompasses supervised (Track A), unsupervised (Track B), and cross-lingual (Track C) approaches across 14 different languages. MasonTigers stands out as one of the two teams who participated in all languages across the three tracks. Our approaches achieved rankings ranging from 11th to 21st in Track A, from 1st to 8th in Track B, and from 5th to 12th in Track C. Adhering to the task-specific constraints, our best performing approaches utilize ensemble of statistical machine learning approaches combined with language-specific BERT based models and sentence transformers.
△ Less
Submitted 5 April, 2024; v1 submitted 22 March, 2024;
originally announced March 2024.
-
MasonTigers at SemEval-2024 Task 8: Performance Analysis of Transformer-based Models on Machine-Generated Text Detection
Authors:
Sadiya Sayara Chowdhury Puspo,
Md Nishat Raihan,
Dhiman Goswami,
Al Nahian Bin Emran,
Amrita Ganguly,
Ozlem Uzuner
Abstract:
This paper presents the MasonTigers entry to the SemEval-2024 Task 8 - Multigenerator, Multidomain, and Multilingual Black-Box Machine-Generated Text Detection. The task encompasses Binary Human-Written vs. Machine-Generated Text Classification (Track A), Multi-Way Machine-Generated Text Classification (Track B), and Human-Machine Mixed Text Detection (Track C). Our best performing approaches util…
▽ More
This paper presents the MasonTigers entry to the SemEval-2024 Task 8 - Multigenerator, Multidomain, and Multilingual Black-Box Machine-Generated Text Detection. The task encompasses Binary Human-Written vs. Machine-Generated Text Classification (Track A), Multi-Way Machine-Generated Text Classification (Track B), and Human-Machine Mixed Text Detection (Track C). Our best performing approaches utilize mainly the ensemble of discriminator transformer models along with sentence transformer and statistical machine learning approaches in specific cases. Moreover, zero-shot prompting and fine-tuning of FLAN-T5 are used for Track A and B.
△ Less
Submitted 5 April, 2024; v1 submitted 22 March, 2024;
originally announced March 2024.
-
MasonTigers at SemEval-2024 Task 9: Solving Puzzles with an Ensemble of Chain-of-Thoughts
Authors:
Md Nishat Raihan,
Dhiman Goswami,
Al Nahian Bin Emran,
Sadiya Sayara Chowdhury Puspo,
Amrita Ganguly,
Marcos Zampieri
Abstract:
Our paper presents team MasonTigers submission to the SemEval-2024 Task 9 - which provides a dataset of puzzles for testing natural language understanding. We employ large language models (LLMs) to solve this task through several prompting techniques. Zero-shot and few-shot prompting generate reasonably good results when tested with proprietary LLMs, compared to the open-source models. We obtain f…
▽ More
Our paper presents team MasonTigers submission to the SemEval-2024 Task 9 - which provides a dataset of puzzles for testing natural language understanding. We employ large language models (LLMs) to solve this task through several prompting techniques. Zero-shot and few-shot prompting generate reasonably good results when tested with proprietary LLMs, compared to the open-source models. We obtain further improved results with chain-of-thought prompting, an iterative prompting method that breaks down the reasoning process step-by-step. We obtain our best results by utilizing an ensemble of chain-of-thought prompts, placing 2nd in the word puzzle subtask and 13th in the sentence puzzle subtask. The strong performance of prompted LLMs demonstrates their capability for complex reasoning when provided with a decomposition of the thought process. Our work sheds light on how step-wise explanatory prompts can unlock more of the knowledge encoded in the parameters of large models.
△ Less
Submitted 3 April, 2024; v1 submitted 22 March, 2024;
originally announced March 2024.
-
MasonPerplexity at ClimateActivism 2024: Integrating Advanced Ensemble Techniques and Data Augmentation for Climate Activism Stance and Hate Event Identification
Authors:
Al Nahian Bin Emran,
Amrita Ganguly,
Sadiya Sayara Chowdhury Puspo,
Dhiman Goswami,
Md Nishat Raihan
Abstract:
The task of identifying public opinions on social media, particularly regarding climate activism and the detection of hate events, has emerged as a critical area of research in our rapidly changing world. With a growing number of people voicing either to support or oppose to climate-related issues - understanding these diverse viewpoints has become increasingly vital. Our team, MasonPerplexity, pa…
▽ More
The task of identifying public opinions on social media, particularly regarding climate activism and the detection of hate events, has emerged as a critical area of research in our rapidly changing world. With a growing number of people voicing either to support or oppose to climate-related issues - understanding these diverse viewpoints has become increasingly vital. Our team, MasonPerplexity, participates in a significant research initiative focused on this subject. We extensively test various models and methods, discovering that our most effective results are achieved through ensemble modeling, enhanced by data augmentation techniques like back-translation. In the specific components of this research task, our team achieved notable positions, ranking 5th, 1st, and 6th in the respective sub-tasks, thereby illustrating the effectiveness of our approach in this important field of study.
△ Less
Submitted 2 February, 2024;
originally announced February 2024.
-
MasonPerplexity at Multimodal Hate Speech Event Detection 2024: Hate Speech and Target Detection Using Transformer Ensembles
Authors:
Amrita Ganguly,
Al Nahian Bin Emran,
Sadiya Sayara Chowdhury Puspo,
Md Nishat Raihan,
Dhiman Goswami,
Marcos Zampieri
Abstract:
The automatic identification of offensive language such as hate speech is important to keep discussions civil in online communities. Identifying hate speech in multimodal content is a particularly challenging task because offensiveness can be manifested in either words or images or a juxtaposition of the two. This paper presents the MasonPerplexity submission for the Shared Task on Multimodal Hate…
▽ More
The automatic identification of offensive language such as hate speech is important to keep discussions civil in online communities. Identifying hate speech in multimodal content is a particularly challenging task because offensiveness can be manifested in either words or images or a juxtaposition of the two. This paper presents the MasonPerplexity submission for the Shared Task on Multimodal Hate Speech Event Detection at CASE 2024 at EACL 2024. The task is divided into two sub-tasks: sub-task A focuses on the identification of hate speech and sub-task B focuses on the identification of targets in text-embedded images during political events. We use an XLM-roBERTa-large model for sub-task A and an ensemble approach combining XLM-roBERTa-base, BERTweet-large, and BERT-base for sub-task B. Our approach obtained 0.8347 F1-score in sub-task A and 0.6741 F1-score in sub-task B ranking 3rd on both sub-tasks.
△ Less
Submitted 18 February, 2024; v1 submitted 2 February, 2024;
originally announced February 2024.
-
MasonTigers@LT-EDI-2024: An Ensemble Approach Towards Detecting Homophobia and Transphobia in Social Media Comments
Authors:
Dhiman Goswami,
Sadiya Sayara Chowdhury Puspo,
Md Nishat Raihan,
Al Nahian Bin Emran
Abstract:
In this paper, we describe our approaches and results for Task 2 of the LT-EDI 2024 Workshop, aimed at detecting homophobia and/or transphobia across ten languages. Our methodologies include monolingual transformers and ensemble methods, capitalizing on the strengths of each to enhance the performance of the models. The ensemble models worked well, placing our team, MasonTigers, in the top five fo…
▽ More
In this paper, we describe our approaches and results for Task 2 of the LT-EDI 2024 Workshop, aimed at detecting homophobia and/or transphobia across ten languages. Our methodologies include monolingual transformers and ensemble methods, capitalizing on the strengths of each to enhance the performance of the models. The ensemble models worked well, placing our team, MasonTigers, in the top five for eight of the ten languages, as measured by the macro F1 score. Our work emphasizes the efficacy of ensemble methods in multilingual scenarios, addressing the complexities of language-specific tasks.
△ Less
Submitted 15 February, 2024; v1 submitted 26 January, 2024;
originally announced January 2024.
-
nlpBDpatriots at BLP-2023 Task 2: A Transfer Learning Approach to Bangla Sentiment Analysis
Authors:
Dhiman Goswami,
Md Nishat Raihan,
Sadiya Sayara Chowdhury Puspo,
Marcos Zampieri
Abstract:
In this paper, we discuss the nlpBDpatriots entry to the shared task on Sentiment Analysis of Bangla Social Media Posts organized at the first workshop on Bangla Language Processing (BLP) co-located with EMNLP. The main objective of this task is to identify the polarity of social media content using a Bangla dataset annotated with positive, neutral, and negative labels provided by the shared task…
▽ More
In this paper, we discuss the nlpBDpatriots entry to the shared task on Sentiment Analysis of Bangla Social Media Posts organized at the first workshop on Bangla Language Processing (BLP) co-located with EMNLP. The main objective of this task is to identify the polarity of social media content using a Bangla dataset annotated with positive, neutral, and negative labels provided by the shared task organizers. Our best system for this task is a transfer learning approach with data augmentation which achieved a micro F1 score of 0.71. Our best system ranked 12th among 30 teams that participated in the competition.
△ Less
Submitted 25 November, 2023;
originally announced November 2023.
-
nlpBDpatriots at BLP-2023 Task 1: A Two-Step Classification for Violence Inciting Text Detection in Bangla
Authors:
Md Nishat Raihan,
Dhiman Goswami,
Sadiya Sayara Chowdhury Puspo,
Marcos Zampieri
Abstract:
In this paper, we discuss the nlpBDpatriots entry to the shared task on Violence Inciting Text Detection (VITD) organized as part of the first workshop on Bangla Language Processing (BLP) co-located with EMNLP. The aim of this task is to identify and classify the violent threats, that provoke further unlawful violent acts. Our best-performing approach for the task is two-step classification using…
▽ More
In this paper, we discuss the nlpBDpatriots entry to the shared task on Violence Inciting Text Detection (VITD) organized as part of the first workshop on Bangla Language Processing (BLP) co-located with EMNLP. The aim of this task is to identify and classify the violent threats, that provoke further unlawful violent acts. Our best-performing approach for the task is two-step classification using back translation and multilinguality which ranked 6th out of 27 teams with a macro F1 score of 0.74.
△ Less
Submitted 25 November, 2023;
originally announced November 2023.
-
Offensive Language Identification in Transliterated and Code-Mixed Bangla
Authors:
Md Nishat Raihan,
Umma Hani Tanmoy,
Anika Binte Islam,
Kai North,
Tharindu Ranasinghe,
Antonios Anastasopoulos,
Marcos Zampieri
Abstract:
Identifying offensive content in social media is vital for creating safe online communities. Several recent studies have addressed this problem by creating datasets for various languages. In this paper, we explore offensive language identification in texts with transliterations and code-mixing, linguistic phenomena common in multilingual societies, and a known challenge for NLP systems. We introdu…
▽ More
Identifying offensive content in social media is vital for creating safe online communities. Several recent studies have addressed this problem by creating datasets for various languages. In this paper, we explore offensive language identification in texts with transliterations and code-mixing, linguistic phenomena common in multilingual societies, and a known challenge for NLP systems. We introduce TB-OLID, a transliterated Bangla offensive language dataset containing 5,000 manually annotated comments. We train and fine-tune machine learning models on TB-OLID, and we evaluate their results on this dataset. Our results show that English pre-trained transformer-based models, such as fBERT and HateBERT achieve the best performance on this dataset.
△ Less
Submitted 25 November, 2023;
originally announced November 2023.
-
OffMix-3L: A Novel Code-Mixed Dataset in Bangla-English-Hindi for Offensive Language Identification
Authors:
Dhiman Goswami,
Md Nishat Raihan,
Antara Mahmud,
Antonios Anastasopoulos,
Marcos Zampieri
Abstract:
Code-mixing is a well-studied linguistic phenomenon when two or more languages are mixed in text or speech. Several works have been conducted on building datasets and performing downstream NLP tasks on code-mixed data. Although it is not uncommon to observe code-mixing of three or more languages, most available datasets in this domain contain code-mixed data from only two languages. In this paper,…
▽ More
Code-mixing is a well-studied linguistic phenomenon when two or more languages are mixed in text or speech. Several works have been conducted on building datasets and performing downstream NLP tasks on code-mixed data. Although it is not uncommon to observe code-mixing of three or more languages, most available datasets in this domain contain code-mixed data from only two languages. In this paper, we introduce OffMix-3L, a novel offensive language identification dataset containing code-mixed data from three different languages. We experiment with several models on this dataset and observe that BanglishBERT outperforms other transformer-based models and GPT-3.5.
△ Less
Submitted 25 November, 2023; v1 submitted 27 October, 2023;
originally announced October 2023.
-
SentMix-3L: A Bangla-English-Hindi Code-Mixed Dataset for Sentiment Analysis
Authors:
Md Nishat Raihan,
Dhiman Goswami,
Antara Mahmud,
Antonios Anastasopoulos,
Marcos Zampieri
Abstract:
Code-mixing is a well-studied linguistic phenomenon when two or more languages are mixed in text or speech. Several datasets have been build with the goal of training computational models for code-mixing. Although it is very common to observe code-mixing with multiple languages, most datasets available contain code-mixed between only two languages. In this paper, we introduce SentMix-3L, a novel d…
▽ More
Code-mixing is a well-studied linguistic phenomenon when two or more languages are mixed in text or speech. Several datasets have been build with the goal of training computational models for code-mixing. Although it is very common to observe code-mixing with multiple languages, most datasets available contain code-mixed between only two languages. In this paper, we introduce SentMix-3L, a novel dataset for sentiment analysis containing code-mixed data between three languages Bangla, English, and Hindi. We carry out a comprehensive evaluation using SentMix-3L. We show that zero-shot prompting with GPT-3.5 outperforms all transformer-based models on SentMix-3L.
△ Less
Submitted 29 November, 2023; v1 submitted 27 October, 2023;
originally announced October 2023.
-
Determining the Optimal Number of Clusters for Time Series Datasets with Symbolic Pattern Forest
Authors:
Md Nishat Raihan
Abstract:
Clustering algorithms are among the most widely used data mining methods due to their exploratory power and being an initial preprocessing step that paves the way for other techniques. But the problem of calculating the optimal number of clusters (say k) is one of the significant challenges for such methods. The most widely used clustering algorithms like k-means and k-shape in time series data mi…
▽ More
Clustering algorithms are among the most widely used data mining methods due to their exploratory power and being an initial preprocessing step that paves the way for other techniques. But the problem of calculating the optimal number of clusters (say k) is one of the significant challenges for such methods. The most widely used clustering algorithms like k-means and k-shape in time series data mining also need the ground truth for the number of clusters that need to be generated. In this work, we extended the Symbolic Pattern Forest algorithm, another time series clustering algorithm, to determine the optimal number of clusters for the time series datasets. We used SPF to generate the clusters from the datasets and chose the optimal number of clusters based on the Silhouette Coefficient, a metric used to calculate the goodness of a clustering technique. Silhouette was calculated on both the bag of word vectors and the tf-idf vectors generated from the SAX words of each time series. We tested our approach on the UCR archive datasets, and our experimental results so far showed significant improvement over the baseline.
△ Less
Submitted 1 October, 2023;
originally announced October 2023.
-
Mixed-Distil-BERT: Code-mixed Language Modeling for Bangla, English, and Hindi
Authors:
Md Nishat Raihan,
Dhiman Goswami,
Antara Mahmud
Abstract:
One of the most popular downstream tasks in the field of Natural Language Processing is text classification. Text classification tasks have become more daunting when the texts are code-mixed. Though they are not exposed to such text during pre-training, different BERT models have demonstrated success in tackling Code-Mixed NLP challenges. Again, in order to enhance their performance, Code-Mixed NL…
▽ More
One of the most popular downstream tasks in the field of Natural Language Processing is text classification. Text classification tasks have become more daunting when the texts are code-mixed. Though they are not exposed to such text during pre-training, different BERT models have demonstrated success in tackling Code-Mixed NLP challenges. Again, in order to enhance their performance, Code-Mixed NLP models have depended on combining synthetic data with real-world data. It is crucial to understand how the BERT models' performance is impacted when they are pretrained using corresponding code-mixed languages. In this paper, we introduce Tri-Distil-BERT, a multilingual model pre-trained on Bangla, English, and Hindi, and Mixed-Distil-BERT, a model fine-tuned on code-mixed data. Both models are evaluated across multiple NLP tasks and demonstrate competitive performance against larger models like mBERT and XLM-R. Our two-tiered pre-training approach offers efficient alternatives for multilingual and code-mixed language understanding, contributing to advancements in the field.
△ Less
Submitted 14 March, 2024; v1 submitted 18 September, 2023;
originally announced September 2023.
-
Contrast Enhancement of Medical X-Ray Image Using Morphological Operators with Optimal Structuring Element
Authors:
Rafsanjany Kushol,
Md. Nishat Raihan,
Md Sirajus Salekin,
A. B. M. Ashikur Rahman
Abstract:
To guide surgical and medical treatment X-ray images have been used by physicians in every modern healthcare organization and hospitals. Doctor's evaluation process and disease identification in the area of skeletal system can be performed in a faster and efficient way with the help of X-ray imaging technique as they can depict bone structure painlessly. This paper presents an efficient contrast e…
▽ More
To guide surgical and medical treatment X-ray images have been used by physicians in every modern healthcare organization and hospitals. Doctor's evaluation process and disease identification in the area of skeletal system can be performed in a faster and efficient way with the help of X-ray imaging technique as they can depict bone structure painlessly. This paper presents an efficient contrast enhancement technique using morphological operators which will help to visualize important bone segments and soft tissues more clearly. Top-hat and Bottom-hat transform are utilized to enhance the image where gradient magnitude value is calculated for automatically selecting the structuring element (SE) size. Experimental evaluation on different x-ray imaging databases shows the effectiveness of our method which also produces comparatively better output against some existing image enhancement techniques.
△ Less
Submitted 21 May, 2019;
originally announced May 2019.