A Comparative Analysis of Noise Reduction Methods in Sentiment Analysis on Noisy Bangla Texts

Kazi Toufique Elahi, Tasnuva Binte Rahman, Shakil Shahriar, Samir Sarker,
Md. Tanvir Rouf Shawon, G. M. Shahariar
Department of Computer Science and Engineering
Ahsanullah University of Science and Technology, Dhaka, Bangladesh
{ktoufiquee, tasnuvabinterahmansrishti, shakilshahriararnob, rohitsarker5,
shawontanvir95, sshibli745}@gmail.com

Abstract

While Bangla is considered a language with limited resources, sentiment analysis in Bangla has been a subject of extensive research in the literature. Nevertheless, sentiment analysis of noisy Bangla texts specifically remains largely unexplored. In this paper, we introduce a dataset (NC-SentNoB) that we annotated manually to identify ten different types of noise found in a pre-existing sentiment analysis dataset comprising around 15K noisy Bangla texts. First, given an input noisy text, we identify the noise type, addressing this as a multi-label classification task. Then, we introduce baseline noise reduction methods to alleviate noise prior to conducting sentiment analysis. Finally, we assess the performance of fine-tuned sentiment analysis models with both noisy and noise-reduced texts to make comparisons. The experimental findings indicate that the noise reduction methods utilized are not satisfactory, highlighting the need for more suitable noise reduction methods in future research. The implementation and dataset presented in this paper are publicly available (https://github.com/ktoufiquee/A-Comparative-Analysis-of-Noise-Reduction-Methods-in-Sentiment-Analysis-on-Noisy-Bangla-Texts).

1 Introduction

Sentiment analysis is the process of analyzing and categorizing the emotions or opinions expressed in textual content. This process holds considerable importance in evaluating public sentiments, analyzing social media posts, and assessing customer feedback. It contributes significantly to gaining insights into ongoing social media dynamics. There have been nearly 7000 papers published on this topic and 99% of the papers have appeared after 2004, making sentiment analysis one of the fastest-growing research areas (Mäntylä et al., 2018).

Sentiment	Data	Noise
Neutral	[B] Aenk idn OeyT kraIechn saeym bhaI [E] You kept me waiting for several days brother Sayem.	Mixed Language, Local Word
Positive	[B] Aaim maejh maejh JaI- khub mangt khabar [E] I occasionally visit, and the food is of high quality.	Punctuation Error
Negative	[B] bhaI dyaker khabar nSh/T krebn/na [E] Please don’t waste food brother.	Spacing Error, Spelling Error

Table 1: Few examples from our NC-SentNoB dataset with sentiment on the leftmost column and noise types on the rightmost column. B represents the original text in Bangla and E represents the corresponding English translation.

With the recent emergence of pre-trained language models (PLMs) (Devlin et al., 2018a; Liu et al., 2019; He et al., 2020; Raffel et al., 2020; Xue et al., 2020), performance on the sentiment analysis task has improved notably. However, when confronted with increased textual noise, the performance of PLMs drops drastically (around 50%), primarily due to the inability of the tokenizer to handle misspelled words (Srivastava et al., 2020). This issue is less pronounced in English, where most typing tools and applications offer robust auto-correction systems. However, Bangla, despite being the seventh most spoken language with a minimum of 272.7 million speakers (Wikipedia, 2023), faces significant challenges due to the absence of an effective auto-correction system in digital devices and software. As a result, a considerable amount of text shared on social media platforms exhibits diverse forms of noise, including informal language, regional words, spelling errors, typographic errors, punctuation errors, coined words, embedded metadata, a mixture of two or more languages (code-mixed text), grammatical mistakes and so forth (Srivastava et al., 2020). For example, the sentence "na , muI Oega ikchu kI naI , duhkhu paIeb" (English: No, I did not tell them anything, they will get sad) incorporates regional words like "muI" ("Aaim", I), "Oega" ("Oedr", them), "kI" ("bil", tell), "paIeb" ("paeb", get), alongside a spelling error "duhkhu" ("duhkh", sad).

Recent investigations into Bangla sentiment analysis have primarily focused on Bangla texts, Romanized Bangla texts (Hassan et al., 2016), and social media comments (Chakraborty et al., 2022). However, there is a notable scarcity of research specifically addressing noisy Bangla texts, and the available datasets for such studies are limited. To address this gap, the SentNoB dataset (Islam et al., 2021) has been recently introduced, aiming to tackle challenges associated with sentiment analysis in noisy Bangla texts. Nevertheless, it is worth noting that this dataset lacks annotations for noise types present in the noisy texts and does not incorporate any noise reduction methods. The presence of noise significantly impacts the performance of models compared to their performance on noiseless text, which indicates a potential area for further research. To address these issues, we have made the following contributions:

• We present a dataset named NC-SentNoB (Noise Classification on SentNoB dataset), designed for the identification of ten distinct types of noise found in approximately 15K noisy Bangla texts. A few sample instances are provided in Table 1.
• We employ machine learning and deep learning models and fine-tune pre-trained transformer models to identify noise types in noisy Bangla texts (a multi-label classification task) and to perform sentiment analysis on both noisy and noise-reduced texts (a multi-class classification task).
• We conduct experiments with various techniques to reduce noise in Bangla texts, including spell correction, back-translation, paraphrasing and masking. To assess their effectiveness, we compare the output of these methods against a set of 1000 random, noisy texts that have been manually corrected by annotators.
• We have made our dataset and code openly accessible for further research in this field.

2 Related Works

Haque et al. (2023) integrated 42,036 samples from two publicly available Bangla datasets, achieving the highest accuracy (85.8%) in multi-class sentiment analysis with their proposed C-LSTM. Islam et al. (2020) introduced two manually tagged Bangla datasets, achieving 71% accuracy for binary classification and 60% for multi-class classification using BERT with GRU. Bhowmick and Jana (2021) outperformed the baseline model proposed by Islam et al. (2020), attaining 95% accuracy on binary classification by fine-tuning m-BERT and XLM-RoBERTa. Samia et al. (2022) utilized BERT, BiLSTM, and LSTM for aspect-based sentiment analysis, where BERT performed best, achieving 95% in aspect detection and 77% in sentiment classification. Hasan et al. (2023) fine-tuned transformer models, among which BanglaBERT surpassed the others with 86% accuracy and a macro F1-score of 0.82 in the multi-class setting.

Bangla sentiment analysis has also been extended to address the challenges of noisy social media texts. One of the notable contributions is SentNoB, a dataset of over 15,000 social media comments developed by Islam et al. (2021). It was benchmarked by SVM with lexical features, neural networks, and pre-trained language models. The best micro-averaged F1-Score (0.646) was achieved by SVM with word and character n-grams. Hoq et al. (2021) added Twitter data to SentNoB and got 87.31% accuracy with multi-layer perceptrons. Islam et al. (2023) developed SentiGOLD, which is a balanced Bangla sentiment dataset consisting of 70,000 entries with five classes which utilized SentNoB for cross-dataset evaluation. It was benchmarked by BiLSTM, HAN, BiLSTM, CNN with attention and BanglaBERT. The best macro F1-Score (0.62) was achieved by fine-tuning BanglaBERT, which also got an F1-Score (0.61) on SentNoB during cross-dataset testing.

As for the correction of noisy texts, Koyama et al. (2021) performed a comparative analysis of grammatical error correction using back-translation models. It was observed that the transformer-based model achieved the highest score on the CoNLL-2014 dataset (Ng et al., 2014). Sun and Jiang (2019) employed a BERT-based masked language modeling for contextual noise reduction. This method involves sequentially masking and correcting each word in a sentence, starting from the left. They found that this noise reduction method significantly enhances performance in applications such as neural machine translation, natural language interfaces, and paraphrase detection in noisy texts.

3 Noise Identification

In this section, we first manually annotate all the instances from SentNoB dataset, categorizing them into ten separate noise categories. A single instance may fall into multiple noise categories. Then, we outline the process of noise identification, where the objective is to determine the type of noise present in a given noisy Bangla text. This task is framed as a multi-label classification task.

3.1 Existing Dataset

The SentNoB dataset (Islam et al., 2021) has a total of 15,728 noisy Bangla texts. While the dataset offers a collection of noisy Bangla texts, it lacks information regarding the specific types of noise present in these texts. The dataset is partitioned into three subsets: train (80%), test (10%), and validation (10%). Each text is categorized into one of three labels: positive, neutral, and negative. These labels represent the sentiment or tone expressed in each text.

3.2 Dataset Development

To the best of our knowledge, there is currently no dataset specifically designed for the purpose of identifying noise in Bangla texts. To address this gap, we expanded the SentNoB dataset to create a noise identification dataset named NC-SentNoB (Noise Classification on SentNoB dataset), encompassing a total of 15,176 noisy texts. In the process, we eliminated 552 duplicate values present in the original dataset to enhance data integrity. We maintained the train-validation-test splitting ratio of the original dataset and the distribution of data in each partition is detailed in Table 2.

	Neutral	Positive	Negative
Train	2,767	4,948	4,318
Test	361	650	570
Validation	354	621	587
Total	3,482	6,219	5,475

Table 2: Data distribution in each partition.

3.3 Annotation

The primary idea behind developing the NC-SentNoB dataset was to categorize the noise present in the SentNoB dataset. To do this, the authors thoroughly investigated the SentNoB dataset, determined ten categories, and defined rules for each noise type as the annotation guidelines. The details of each noise category are presented in Appendix C. We first invited seven native Bangla speakers to assist us with the annotation process. Next, we asked each participant to label 50 samples, from which we determined their trustworthiness score (Price et al., 2020). We used 10 of the 50 samples as control samples and found that only four participants reached the 90% trustworthiness score threshold. To maintain the quality of the annotation, the degree of agreement across annotators was calculated using Fleiss’ kappa (Fleiss, 1971). For the four independent annotators, we obtained a score of 0.69, indicating a substantial degree of agreement.
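As an illustration of the agreement statistic mentioned above, the following is a minimal sketch of Fleiss' kappa in plain Python. The paper does not state how the multi-label annotations were arranged for this computation, so the input layout (one row per text, one column per category, each cell counting the annotators who chose that category) and the toy data are assumptions made only for this example.

```python
def fleiss_kappa(counts):
    """Fleiss' kappa for a table where counts[i][j] is the number of
    annotators who assigned item i to category j (same raters per item)."""
    n_items = len(counts)
    n_raters = sum(counts[0])
    if any(sum(row) != n_raters for row in counts):
        raise ValueError("every item must be rated by the same number of annotators")
    n_categories = len(counts[0])

    # Per-item agreement: fraction of annotator pairs that agree on the item.
    per_item = [
        (sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in counts
    ]
    p_bar = sum(per_item) / n_items

    # Chance agreement, from the overall proportion of each category.
    total = n_items * n_raters
    proportions = [
        sum(row[j] for row in counts) / total for j in range(n_categories)
    ]
    p_e = sum(p * p for p in proportions)

    return (p_bar - p_e) / (1 - p_e)


# Hypothetical toy data: 4 annotators, one binary decision per text
# (columns: "noise type present", "noise type absent").
toy_counts = [[4, 0], [0, 4], [3, 1], [1, 3], [4, 0], [0, 4]]
toy_kappa = fleiss_kappa(toy_counts)
```

Values between 0.61 and 0.80 are conventionally read as substantial agreement, which is the band the reported 0.69 falls into.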

Figure 1: Length-Frequency distribution of Texts.

3.4 Dataset Statistics

It is evident from Table 2 that the dataset is imbalanced, with the number of texts in the neutral category significantly lower than those in both the positive and negative categories.

Class	Instances	#Word/Instance
Local Word	2,084 (13.6%)	16.05
Word Misuse	661 (4.3%)	18.55
Context/Word Missing	550 (3.6%)	13.19
Wrong Serial	69 (0.5%)	15.30
Mixed Language	6,267 (41.0%)	17.91
Punctuation Error	5,988 (39.1%)	17.25
Spacing Error	2,456 (16.1%)	18.78
Spelling Error	5,817 (38.0%)	17.30
Coined Word	549 (3.6%)	15.45
Others	1,263 (8.3%)	16.52

Table 3: Statistics of NC-SentNoB per noise class.

In addition to the class imbalance, the dataset also exhibits a wide variation in the length of the texts. On average, the texts have a length of 66 characters. The longest text is 314 characters, while the shortest is only 11 characters long. Figure 1 shows the length-frequency distribution of the texts over the whole dataset. Table 3 shows the statistics of the different types of noise we found, which provides insight into the most common kinds of noise in the dataset. The table shows that Mixed Language is the most common noise type, followed by Punctuation Error and Spelling Error, while Wrong Serial is the least common. Figure 2 indicates low correlation coefficients, suggesting a minimal linear association between noise categories. Notably, Mixed Language and Spelling Error have the most negative correlation at -0.12, implying a slight inverse relationship between these two types. This suggests that a sentence in the dataset containing Mixed Language noise is slightly less likely to contain a Spelling Error, and vice versa.
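The pairwise association between two noise categories can be sketched as a Pearson correlation over their binary indicator columns. This is an assumption about how the coefficients were obtained (the text only speaks of a linear association), and the two label vectors below are invented toy data, not values from NC-SentNoB.

```python
import math


def pearson(xs, ys):
    """Pearson correlation between two equally long numeric sequences.
    For 0/1 indicator columns this equals the phi coefficient."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant column")
    return cov / math.sqrt(var_x * var_y)


# Hypothetical indicator columns for six texts (1 = noise type present).
mixed_language = [1, 1, 0, 0, 1, 0]
spelling_error = [0, 1, 1, 1, 0, 0]
toy_r = pearson(mixed_language, spelling_error)
```

A negative coefficient means the two noise types co-occur less often than independence would predict; a value as close to zero as -0.12 indicates that this tendency is weak.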

3.5 Baselines

For noise identification, we implemented Support Vector Machine (SVM) (Cortes and Vapnik, 1995) (utilizing both character and word n-gram features), Bidirectional Long Short Term Memory (BiLSTM) (Hochreiter and Schmidhuber, 1997) network, and fine-tuned the pre-trained Bangla-BERT-Base (Sarker, 2020) model. The descriptions of the models can be found in Appendix A. The rationale behind the classification is to develop an automatic text pre-processing step that identifies different types of noise present in Bangla texts. We firmly believe that this pre-processing step will play a vital role in addressing challenges associated with noisy Bangla texts by aiding in the development of noise specific reduction methods.

3.6 Experimental Setup

SVM was implemented with a regularization parameter of 1. As for BiLSTM and Bangla-BERT-Base, Binary Cross-Entropy Loss was used. Both models were trained using the AdamW optimizer, with a learning rate of $1e-6$ for BiLSTM and $1e-5$ for Bangla-BERT-Base. The batch sizes were set at 256 for BiLSTM and 128 for Bangla-BERT-Base.
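For readers unfamiliar with the multi-label objective, the sketch below computes Binary Cross-Entropy over independent per-label logits in plain Python, using a numerically stable formulation. It is not the training code used in the paper: the function names are hypothetical, and the 0.5 decision threshold is an assumption, since the threshold is not stated.

```python
import math


def bce_with_logits(logits, targets):
    """Mean binary cross-entropy over independent labels, computed from raw
    logits with the stable form max(x, 0) - x*y + log(1 + exp(-|x|))."""
    losses = [
        max(x, 0.0) - x * y + math.log1p(math.exp(-abs(x)))
        for x, y in zip(logits, targets)
    ]
    return sum(losses) / len(losses)


def sigmoid(x):
    """Logistic function written to avoid overflow for large |x|."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def predict_labels(logits, threshold=0.5):
    """Turn per-label logits into a 0/1 vector (threshold is an assumption)."""
    return [1 if sigmoid(x) >= threshold else 0 for x in logits]


# Ten noise categories: one logit and one 0/1 target per category.
example_logits = [2.0, -1.5, 0.3, -3.0, 1.2, 0.0, -0.7, 2.5, -2.0, -0.1]
example_targets = [1, 0, 0, 0, 1, 1, 0, 1, 0, 0]
example_loss = bce_with_logits(example_logits, example_targets)
example_prediction = predict_labels(example_logits)
```

Because each label gets its own sigmoid, a text can be assigned several noise types at once, which is what distinguishes this setup from the softmax-based multi-class sentiment task.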

3.7 Results & Analysis

Table 4 presents the performance comparison of the implemented models on noise identification. Bangla-BERT-Base achieves the highest micro F1-score at 0.62, while SVM with character-level features secures the second-best score of 0.57. However, BiLSTM has the lowest micro F1-score of 0.24.

Model	Precision	Recall	F1-Score
SVM (C)	0.76	0.45	0.57
SVM (W)	0.64	0.38	0.48
SVM (C + W)	0.75	0.45	0.56
Bi-LSTM	0.36	0.18	0.24
Bangla-BERT-Base	0.73	0.54	0.62

Table 4: Performance comparison of different models on noise identification. C represents character level n-gram and W represents word level n-gram.
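The micro-averaged scores in Table 4 pool true positives, false positives and false negatives over all texts and all noise labels before computing precision and recall. A minimal sketch, with invented toy label matrices and hypothetical function names:

```python
def f1_from(precision, recall):
    """Harmonic mean of precision and recall (0.0 when both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def micro_prf(y_true, y_pred):
    """Micro-averaged precision, recall and F1 for multi-label 0/1 matrices."""
    tp = fp = fn = 0
    for true_row, pred_row in zip(y_true, y_pred):
        for t, p in zip(true_row, pred_row):
            if t and p:
                tp += 1
            elif p:
                fp += 1
            elif t:
                fn += 1
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall, f1_from(precision, recall)


# Toy example: 3 texts, 3 noise labels.
toy_true = [[1, 0, 1], [0, 1, 0], [1, 1, 0]]
toy_pred = [[1, 0, 0], [0, 1, 0], [1, 0, 1]]
toy_scores = micro_prf(toy_true, toy_pred)

# Consistency check against a reported row: precision 0.73 and recall 0.54
# (Bangla-BERT-Base) give an F1 that rounds to the reported 0.62.
bert_f1 = round(f1_from(0.73, 0.54), 2)
```

The same arithmetic explains why every model in Table 4 has a much lower F1 than precision: recall is the limiting factor throughout.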

The comparison between SVM with character-level features and SVM with word-level features shows that the former attains a higher score. This suggests that character-level information is more crucial for noise identification. Implementing a similar character-level approach in neural network models and fine-tuning other pre-trained language models may improve the noise identification performance which we leave open for future work. Table 5 illustrates the performance of Bangla-BERT-Base on each type of noise. It can be seen that the model fails to classify instances of the Wrong Serial type. This is primarily due to the low amount of data available for this specific class in the dataset.

Class	Precision	Recall	F1-Score
Local Word	0.46	0.49	0.47
Word Misuse	0.65	0.16	0.25
Context/Word Missing	0.33	0.06	0.10
Wrong Serial	0.00	0.00	0.00
Mixed Language	0.75	0.85	0.80
Punctuation Error	0.83	0.54	0.65
Spacing Error	0.86	0.21	0.33
Spelling Error	0.64	0.55	0.59
Coined Word	0.82	0.89	0.86
Others	0.76	0.76	0.76

Table 5: Class-wise performance of Bangla-BERT-Base on noise identification task.
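The earlier observation that character-level n-grams outperform word-level ones can be illustrated with the misspelling example from the Introduction. The sketch below compares the romanized forms "duhkhu" and "duhkh" as they appear in this text; the trigram size and the Jaccard overlap are choices made for illustration only, since the n-gram ranges of the SVM features are not specified here.

```python
def char_ngrams(word, n=3):
    """Set of contiguous character n-grams of a word."""
    return {word[i:i + n] for i in range(len(word) - n + 1)}


def jaccard(a, b):
    """Jaccard overlap of two sets (0.0 when both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


misspelled, intended = "duhkhu", "duhkh"

# Word-level view: the two tokens are simply different features.
word_overlap = jaccard({misspelled}, {intended})

# Character-level view: most trigrams are shared despite the spelling error.
char_overlap = jaccard(char_ngrams(misspelled), char_ngrams(intended))
```

At the word level the misspelled token shares nothing with its intended form, whereas at the character level three of the four distinct trigrams coincide, so a character-based model still sees most of the original signal.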

4 Sentiment Analysis

In this section, we outline the methodology employed for conducting sentiment analysis on the NC-SentNoB dataset. We employ a cost-sensitive learning objective to fine-tune seven pre-trained transformer models for the sentiment analysis task. We conduct two distinct experiments: the first involves fine-tuning transformers on the noisy texts, while the second entails fine-tuning transformers after reducing noise from the original noisy texts.

4.1 Baselines

We utilized seven publicly available pre-trained transformer models: Bangla-BERT-Base (Sarker, 2020), BanglaBERT (Bhattacharjee et al., 2022a), BanglaBERT Large (Bhattacharjee et al., 2022a), BanglaBERT Generator, sahajBERT (https://huggingface.co/neuropark/sahajBERT), Bangla-Electra (https://huggingface.co/monsoon-nlp/bangla-electra), and MuRIL (Khanuja et al., 2021). The descriptions of the models can be found in Appendix A.

4.2 Cost-sensitive Learning

Cost-sensitive learning (Elkan, 2001) is a training procedure that makes the model prioritize samples from the minority class over those from the majority class by assigning a manually established weight to every class label in the cost function being minimized. We adopted this method in the sentiment analysis task. Because the NC-SentNoB dataset is imbalanced, as seen in Table 2, we imposed larger costs on the minority classes to obtain a more equitable and balanced model performance. This was accomplished by providing class weights to the Cross-Entropy loss function used to train the models.

4.3 Experimental Setup

Cost-sensitive learning was incorporated by using class weights as a cost matrix into the Cross-Entropy loss function. The class weights were set at 1.4496 for neutral, 0.8106 for positive, and 0.9289 for negative classes. For fine-tuning, the AdamW optimizer was used with a learning rate of $1e-5$ , betas set at (0.9, 0.9999), an epsilon value of $1e-9$ , and a weight decay of 0.08. Due to resource constraints, batch size was set to 48 for sahajBERT, 32 for BanglaBERT Large, and 128 for the rest of the models.
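The text does not state how the class weights were derived. As a hedged observation, the three reported values coincide with the common inverse-frequency heuristic, weight = N / (K × count), applied to the training split in Table 2 (N = 12,033 texts, K = 3 classes). The sketch below reproduces that arithmetic; the function name is hypothetical.

```python
def balanced_class_weights(counts):
    """Inverse-frequency weights: total / (number_of_classes * class_count)."""
    total = sum(counts.values())
    n_classes = len(counts)
    return {label: total / (n_classes * count) for label, count in counts.items()}


# Training-split counts taken from Table 2.
train_counts = {"neutral": 2767, "positive": 4948, "negative": 4318}
class_weights = balanced_class_weights(train_counts)
rounded_weights = {label: round(w, 4) for label, w in class_weights.items()}
```

Under this heuristic the neutral class, being the smallest, receives the largest weight, so errors on neutral texts contribute more to the loss than errors on positive or negative ones.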

Model	Precision	Recall	F1-Score
Bangla-BERT-Base	0.72	0.72	0.72
BanglaBERT	0.75	0.75	0.75
BanglaBERT Large	0.74	0.74	0.74
BanglaBERT Generator	0.72	0.72	0.72
sahajBERT	0.72	0.72	0.72
Bangla-Electra	0.68	0.68	0.68
MuRIL	0.73	0.73	0.73

Table 6: Performance of sentiment analysis models fine-tuned on noisy texts.

4.4 Experiment with noise

Table 6 illustrates the performance comparison of the seven fine-tuned models. BanglaBERT yields the highest scores across all evaluation metrics with a micro F1-score of $0.75$ . This result outperforms the highest micro F1-Score of 0.6461 with SVM previously reported by Islam et al. (2021). It is also noteworthy that all other models except Bangla-Electra have demonstrated results that are somewhat comparable with ranges between 0.72 and 0.75 in terms of micro F1-score.

4.5 Experiment by reducing noise

In this experiment, we first outline the noise reduction strategies utilized prior to sentiment analysis. We then randomly select 1000 noisy texts and manually correct them. We use these 1000 manually corrected texts as ground truth for measuring the performance of the noise reduction methods in terms of semantic similarity. To assess performance, we employ various established evaluation metrics.

Class	Instances
Local Word	132 (13.2%)
Word Misuse	32 (03.2%)
Context/Word Missing	39 (03.9%)
Wrong Serial	4 (00.4%)
Mixed Language	416 (41.6%)
Punctuation Error	323 (32.3%)
Spacing Error	133 (13.3%)
Spelling Error	376 (37.6%)
Coined Word	33 (03.3%)
Others	92 (09.2%)

Table 7: Statistics of noise types on manually corrected 1000 data.

Method	BLEU	ROUGE-L	BERT Score	SBERT Score	BSTS	BERT-iBLEU	Word Coverage (Word2Vec)	Word Coverage (FastText)	Word Coverage (Bangla BERT)	Human Evaluation (%)
Noisy Text	65.77	79.71	93.21	88.32	93.67	51.65	75.54	82.92	71.26	-
Google Translate	21.55	39.46	84.72	81.04	84.28	80.93	87.52	89.01	84.86	37.90
BanglaT5 Translate	16.57	32.09	81.30	75.27	82.15	80.12	89.01	87.52	85.66	21.10
Spell Correction (SC)	61.17	77.35	92.29	87.86	92.94	56.50	82.72	88.51	80.76	35.80
SC + Paraphrase†	20.35	36.44	83.32	74.15	85.60	80.63	86.79	83.89	?	20.80
MLM (OOV)	60.99	76.44	90.72	86.90	91.82	56.60	88.51	82.27	87.18	26.80
MLM (Random)	44.17	70.00	90.76	85.26	93.45	68.93	86.41	88.35	93.20	10.40

† Only two of the three Word Coverage values for this row survive in the source; they are listed in source order and the remaining cell is left open.

Table 8: Performance comparison of different noise reduction methods

4.5.1 Process of Noise Reduction

Complete elimination of noise from the noisy texts is impossible. However, our aim is to minimize noise to the greatest extent possible. This section details four distinct methods for reducing noise in noisy texts: back-translation, spelling correction, paraphrasing and replacing out of vocabulary (OOV) words with predictions generated by a masked language model (MLM). Additional details about the employed methods can be found in Appendix A.
(a) Back-translation. Back-translation serves as a method to correct various errors within a sentence. As pre-trained models have been trained on extensive corpora of noiseless sentences, they can generate a noiseless translated sentence when presented with a noisy sentence as input. Also, translating that sentence back into the original language may result in a corrected version. For this study, all input texts were initially translated into English and then into Bangla using back-translation. Two models were chosen for this purpose: Google Translate, a web service employing an RNN-based model and BanglaT5 models pre-trained on the BanglaNMT English-Bangla and BanglaNMT Bangla-English dataset (Bhattacharjee et al., 2022b).
(b) Spelling Correction. For the noisy texts we are working with, correcting spelling errors can be beneficial, as spelling errors can affect the tokenization process. To address this issue, we implemented a spell correction algorithm based on Soundex and Levenshtein distance. This algorithm replaces misspelled words with the closest matching words found in the Bangla dictionary (https://github.com/MinhasKamal/BanglaDictionary). However, as it is not context-based, there are instances where it fails to correct all spellings and may even introduce out-of-context words into the sentence.
(c) Paraphrasing. Paraphrasing involves changing the words of a sentence without altering its meaning. Similar to translation models, paraphrasing models have the potential to provide a noiseless paraphrased output when given a noisy input. For this study, we used the BanglaT5 model pre-trained on the Bangla Paraphrase dataset (Akil et al., 2022). We observed the performance of the BanglaT5 paraphrase model on some randomly selected noisy texts from our dataset and found that the model performs poorly when the input data contains misspelled words. To address this issue, we applied the spelling correction algorithm before providing input to the model.
(d) Mask Prediction. To improve the quality of noisy texts and address out-of-vocabulary words, we replaced OOV words with <MASK> and used the predictions generated by a Masked Language Model (MLM). We also implemented random masking for replacement with each word having a 20% possibility of getting replaced by the MLM model. For both cases, we used BanglaBERT Generator (Kowsher et al., 2022) model.
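Two of the steps above can be sketched without any model: picking the closest dictionary word by Levenshtein distance, and choosing which tokens to mask. This is a simplified illustration, not the authors' implementation. The Soundex stage of the spell corrector and the MLM prediction that fills each mask are omitted, the tiny word list stands in for the Bangla dictionary, and all function names are hypothetical.

```python
import random


def levenshtein(a, b):
    """Edit distance (insertions, deletions, substitutions) between a and b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                # delete ca
                            curr[j - 1] + 1,            # insert cb
                            prev[j - 1] + (ca != cb)))  # substitute
        prev = curr
    return prev[-1]


def correct_word(word, dictionary):
    """Return the word itself if known, else the closest dictionary entry.
    Context is ignored, so the replacement may not fit the sentence."""
    if word in dictionary:
        return word
    return min(sorted(dictionary), key=lambda cand: levenshtein(word, cand))


def mask_oov(tokens, vocabulary, mask="<MASK>"):
    """Replace every out-of-vocabulary token with the mask symbol."""
    return [tok if tok in vocabulary else mask for tok in tokens]


def mask_random(tokens, rate=0.2, rng=None, mask="<MASK>"):
    """Replace each token with the mask symbol with probability `rate`."""
    rng = rng or random.Random()
    return [mask if rng.random() < rate else tok for tok in tokens]


# Stand-in vocabulary built from romanized words quoted in this paper.
toy_dictionary = {"duhkh", "paeb", "Aaim", "Oedr"}
fixed = correct_word("duhkhu", toy_dictionary)
oov_masked = mask_oov(["Aaim", "duhkhu", "paeb"], toy_dictionary)
random_masked = mask_random(["Aaim", "duhkhu", "paeb"], rate=0.2,
                            rng=random.Random(0))
```

The sketch also makes the stated weakness visible: `correct_word` always returns some dictionary entry, however distant, which is how out-of-context words can enter a sentence.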

4.5.2 Evaluation of Noise Reduction

We first use several well-known metrics to quantify the performance of the noise reduction techniques. The evaluation is performed based on 1000 manually corrected texts. The first four authors individually corrected 250 texts each, while the last two authors verified corrections for 500 texts each. We then compare and analyze the performance of the noise reduction methods.
Evaluation Metrics. To evaluate the noise reduction methods, we employed a range of metrics including BLEU, ROUGE-L, BERTScore, SBERT Score, BSTS, BERT-iBLEU, and Word Coverage (utilizing Word2Vec, FastText, and Bangla-BERT-Base). Additionally, we conducted human evaluations of the noise reduced sentences by native Bangla speakers. The detailed descriptions of the evaluation metrics along with the human evaluation procedure are presented in Appendix B.
Noise Reduction Performance. From the data presented in Table 8, it can be seen that the original noisy texts scored highest on BLEU and ROUGE-L, which is unsurprising since the ground truth sentences contain nearly identical words. This observation is further supported by the spell-corrected sentences, which also achieve a similar score due to having nearly identical words.

Before reduction	[N] Aapin eta Hat dhya bhuel egeln bhaI [C] Aapin eta Hat edhaya bhuel egeln bhaI [E] Brother you forgot to wash your hands.
After reduction	[S] Aapin eta Hat dya bhuel egeln bhaI [SP] tuim etamar Hat-payer dya bhuel egch, bhaI. [TG] bhaI Aapin Hat dhuet bhuel egechn [TM] tuim Hat dhret bhuel egecha [MO] Aapin eta Hat Haraet bhuel egeln bhaI [MR] Aapin eta Hat dhret bhuel egeln bhaI

Table 9: Input and output of a single noisy text by the noise reduction methods. N denotes the original noisy text, C indicates the corrected text, and E represents English translation of the corrected text. S, SP, TG, TM, MO, and MR represent outputs of spelling correction, paraphrasing with spelling correction, back-translation using Google Translate, back-translation with T5 models, masked language modeling for out-of-vocabulary words, and random masked language modeling respectively. For each sentence, noisy words are marked with Red color, and noise reduced words are marked with Green color.

Similarly, for BERTScore, SBERT Score, and BSTS, the scores are highest for the noisy texts. This is primarily because of the nature of textual embeddings and the tokenization method used. BERT uses WordPiece tokenization, under which a noisy word and its corrected form can share most of their subword tokens. Therefore, when comparing noisy texts with their corresponding ground truth sentences, many tokens are likely to match perfectly, leading to higher cosine similarity scores. Although they do not achieve the highest scores, the back-translation, paraphrasing, and mask prediction methods score above 80% in both BERTScore and BSTS, implying that their outputs are semantically similar to the ground truth and that the meaning of the sentences has not changed drastically. The BERT-iBLEU score penalizes textually similar words while emphasizing semantic meaning, which leads to Google Translate achieving the highest score on this metric. Moreover, in the word coverage results, the noise reduction methods rather than the noisy texts score the highest. This is because the words or sentences generated by these models are more likely to be noiseless words from their respective vocabularies. All of these scores are based on the textual similarity between the ground truths and the noise-reduced sentences. We therefore relied on human evaluation to select the best noise reduction method: 4 native Bangla speakers evaluated the sentences and found that back-translation using the Google Translate API was the most reliable in terms of maintaining contextual meaning. The input and output of each noise reduction method for a single noisy text are shown in Table 9. Except for back-translation using Google Translate, all methods fail to rectify the spelling problem in the input. Most approaches change the meaning of the sentence by changing the noisy word.
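The effect of lexical overlap on the surface metrics can be reproduced on the Table 9 example with a simplified ROUGE-L. The sketch below uses whitespace tokens, the longest common subsequence, and a plain F1 of LCS precision and recall; the tokenization and weighting used for the scores in Table 8 are not specified here, so the numbers it produces are illustrative and not comparable to the table.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, 1):
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f1(reference, candidate):
    """Simplified ROUGE-L: F1 of LCS-based precision and recall."""
    ref, cand = reference.split(), candidate.split()
    lcs = lcs_length(ref, cand)
    if lcs == 0:
        return 0.0
    precision = lcs / len(cand)
    recall = lcs / len(ref)
    return 2 * precision * recall / (precision + recall)


# Romanized sentences copied from Table 9.
corrected = "Aapin eta Hat edhaya bhuel egeln bhaI"
noisy = "Aapin eta Hat dhya bhuel egeln bhaI"
google_back_translated = "bhaI Aapin Hat dhuet bhuel egechn"

score_noisy = rouge_l_f1(corrected, noisy)
score_google = rouge_l_f1(corrected, google_back_translated)
```

The uncorrected noisy sentence differs from the ground truth in a single token and therefore scores far higher than the back-translation, even though the latter is the only output that repairs the misspelled word. This is the pattern that motivates relying on human evaluation.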

4.5.3 Results & Analysis

We prioritized the human evaluation score based on the results of Table 8 and used back-translated data obtained from Google Translate to execute the sentiment analysis task by fine-tuning seven pre-trained transformer models. We applied the same noise reduction method on both the test and validation sets. We compared the sentiment analysis performance of the models fine-tuned on noisy and noiseless data presented in Tables 6 and 10.

Model	Precision	Recall	F1-Score
Bangla-BERT-Base	0.69	0.69	0.69
BanglaBERT	0.72	0.72	0.72
BanglaBERT Large	0.73	0.73	0.73
BanglaBERT Generator	0.70	0.70	0.70
sahajBERT	0.70	0.70	0.70
Bangla-Electra	0.66	0.66	0.66
MuRIL	0.71	0.71	0.71

Table 10: Performance of sentiment analysis models fine-tuned on noise reduced texts (back-translation with Google Translate).

From Table 10, it can be seen that models fine-tuned on back-translated data attain an F1-score of at most $0.73$ . Comparing Tables 6 and 10, every model fine-tuned on noisy data outperformed the same model fine-tuned on back-translated data. The reason for this disparity in performance is that, while back-translation can mitigate some sources of noise, it can also introduce changes in the contextual meaning of the sentences (see Appendix D). This is reflected in its score of only 37.90% on human evaluation, where our main scoring priority was the contextual meaning of the sentence. We used the human evaluation score to select the noise reduction strategy, although, as shown in Table 8, other techniques also scored well on several metrics. Nevertheless, it is worthwhile to explore alternative approaches beyond back-translation to determine whether a particular noise reduction method yields superior results in addressing specific types of noisy texts.

Class	Precision	Recall	F1-Score
Neutral	0.53	0.51	0.52
Positive	0.77	0.77	0.77
Negative	0.78	0.80	0.79
Micro	0.73	0.73	0.73
Macro	0.69	0.69	0.69
Weighted	0.72	0.73	0.72

Table 11: Class-wise performance of BanglaBERT Large on noise reduced texts (back-translation with Google Translate).

Table 11 illustrates the class-wise results of BanglaBERT Large, our best-performing model on noise reduced data. The scores are considerably higher for the positive and negative classes than for the neutral class. The comparatively small number of neutral training instances might be the reason for the low performance in that class.

5 Limitations and Future Works

One obvious limitation is that none of the noise reduction methods we employed were able to correctly reduce noise in the noisy texts. As a result, models fine-tuned on noise reduced texts achieved a lower score in sentiment analysis than models fine-tuned on noisy texts. Another limitation is that we have not evaluated sentiment analysis with noise reduction techniques other than back-translation by Google Translate. Although other noise reduction methods performed poorly in human evaluation, it would be interesting to study whether their performance in noise reduction correlates with the performance in sentiment analysis. Furthermore, the NC-SentNoB dataset contains only a very small number of Wrong Serial instances. Other categories such as Context/Word Missing, Word Misuse, and Coined Word are also underrepresented. In the future, we would like to increase the data in these categories to tackle data imbalance, which may enhance the performance of the transformer models. In addition, to combat noise coming from spelling variation and dialectal differences, we plan to incorporate text normalization methods such as character-level spell correction models (Farra et al., 2014; Zaky and Romadhony, 2019) and character-level Neural Machine Translation (NMT) models (Lee et al., 2017; Edman et al., 2023) for back-translation. We hypothesize that text normalization methods might be a viable solution due to their ability to comprehend context at the character level. Finally, we will investigate noise-specific reduction techniques and report on the noise reduction approaches that demonstrate superior results in addressing particular types of noisy texts.

6 Conclusion

This study involves a comparison of various noise reduction techniques to assess their effectiveness in reducing noise within the NC-SentNoB dataset, which includes ten distinct types of noises. The results indicate that none of the noise reduction methods effectively reduce noise in the texts, leading to a lower F1-score compared to the sentiment analysis of noisy texts. This underscores the necessity for the development of noise-specific reduction techniques. We conducted a statistical analysis of our NC-SentNoB dataset and employed baseline models to identify the noises. However, the data imbalance adversely impacts the model performance suggesting potential enhancement upon addressing this imbalance.

References

Akil et al. (2022) Ajwad Akil, Najrin Sultana, Abhik Bhattacharjee, and Rifat Shahriyar. 2022. Banglaparaphrase: A high-quality bangla paraphrase dataset. arXiv preprint arXiv:2210.05109.
Bhattacharjee et al. (2022a) Abhik Bhattacharjee, Tahmid Hasan, Wasi Ahmad, Kazi Samin Mubasshir, Md Saiful Islam, Anindya Iqbal, M. Sohel Rahman, and Rifat Shahriyar. 2022a. BanglaBERT: Language model pretraining and benchmarks for low-resource language understanding evaluation in Bangla. In Findings of the Association for Computational Linguistics: NAACL 2022, pages 1318–1327, Seattle, United States. Association for Computational Linguistics.
Bhattacharjee et al. (2022b) Abhik Bhattacharjee, Tahmid Hasan, Wasi Uddin Ahmad, and Rifat Shahriyar. 2022b. Banglanlg: Benchmarks and resources for evaluating low-resource natural language generation in bangla. CoRR, abs/2205.11081.
Bhowmick and Jana (2021) Anirban Bhowmick and Abhik Jana. 2021. Sentiment analysis for bengali using transformer based models. In Proceedings of the 18th International Conference on Natural Language Processing (ICON), pages 481–486.
Chakraborty et al. (2022) Partha Chakraborty, Farah Nawar, and Humayra Afrin Chowdhury. 2022. Sentiment analysis of bengali facebook data using classical and deep learning approaches. In Innovation in Electrical Power Engineering, Communication, and Computing Technology: Proceedings of Second IEPCCT 2021, pages 209–218. Springer.
Clark et al. (2020) Kevin Clark, Minh-Thang Luong, Quoc V Le, and Christopher D Manning. 2020. Electra: Pre-training text encoders as discriminators rather than generators. arXiv preprint arXiv:2003.10555.
Cortes and Vapnik (1995) Corinna Cortes and Vladimir Vapnik. 1995. Support-vector networks. Machine learning, 20:273–297.
Devlin et al. (2018a) Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018a. Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805.
Devlin et al. (2018b) Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018b. BERT: pre-training of deep bidirectional transformers for language understanding. CoRR, abs/1810.04805.
Edman et al. (2023) Lukas Edman, Antonio Toral, and Gertjan van Noord. 2023. Are character-level translations worth the wait? an extensive comparison of character-and subword-level models for machine translation. arXiv preprint arXiv:2302.14220.
Elkan (2001) Charles Elkan. 2001. The foundations of cost-sensitive learning. In International joint conference on artificial intelligence, volume 17, pages 973–978. Lawrence Erlbaum Associates Ltd.
Farra et al. (2014) Noura Farra, Nadi Tomeh, Alla Rozovskaya, and Nizar Habash. 2014. Generalized character-level spelling error correction. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 161–167, Baltimore, Maryland. Association for Computational Linguistics.
Fleiss (1971) Joseph L Fleiss. 1971. Measuring nominal scale agreement among many raters. Psychological bulletin, 76(5):378.
Haque et al. (2023) Rezaul Haque, Naimul Islam, Mayisha Tasneem, and Amit Kumar Das. 2023. Multi-class sentiment classification on bengali social media comments using machine learning. International Journal of Cognitive Computing in Engineering, 4:21–35.
Hasan et al. (2023) Mahmud Hasan, Labiba Islam, Ismat Jahan, Sabrina Mannan Meem, and Rashedur M Rahman. 2023. Natural language processing and sentiment analysis on bangla social media comments on russia–ukraine war using transformers. Vietnam Journal of Computer Science, pages 1–28.
Hassan et al. (2016) Asif Hassan, Mohammad Rashedul Amin, Abul Kalam Al Azad, and Nabeel Mohammed. 2016. Sentiment analysis on bangla and romanized bangla text using deep recurrent models. In 2016 International Workshop on Computational Intelligence (IWCI), pages 51–56. IEEE.
He et al. (2020) Pengcheng He, Xiaodong Liu, Jianfeng Gao, and Weizhu Chen. 2020. Deberta: Decoding-enhanced bert with disentangled attention. arXiv preprint arXiv:2006.03654.
Hochreiter and Schmidhuber (1997) Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long short-term memory. Neural computation, 9(8):1735–1780.
Hoq et al. (2021) Muntasir Hoq, Promila Haque, and Mohammed Nazim Uddin. 2021. Sentiment analysis of bangla language using deep learning approaches. In International Conference on Computing Science, Communication and Security, pages 140–151. Springer.
Islam et al. (2020) Khondoker Ittehadul Islam, Md Saiful Islam, and Md Ruhul Amin. 2020. Sentiment analysis in bengali via transfer learning using multi-lingual bert. In 2020 23rd International Conference on Computer and Information Technology (ICCIT), pages 1–5. IEEE.
Islam et al. (2021) Khondoker Ittehadul Islam, Sudipta Kar, Md Saiful Islam, and Mohammad Ruhul Amin. 2021. Sentnob: A dataset for analysing sentiment on noisy bangla texts. In Findings of the Association for Computational Linguistics: EMNLP 2021, pages 3265–3271.
Islam et al. (2023) Md Ekramul Islam, Labib Chowdhury, Faisal Ahamed Khan, Shazzad Hossain, Sourave Hossain, Mohammad Mamun Or Rashid, Nabeel Mohammed, and Mohammad Ruhul Amin. 2023. Sentigold: A large bangla gold standard multi-domain sentiment analysis dataset and its evaluation. arXiv preprint arXiv:2306.06147.
Khanuja et al. (2021) Simran Khanuja, Diksha Bansal, Sarvesh Mehtani, Savya Khosla, Atreyee Dey, Balaji Gopalan, Dilip Kumar Margam, Pooja Aggarwal, Rajiv Teja Nagipogu, Shachi Dave, Shruti Gupta, Subhash Chandra Bose Gali, Vish Subramanian, and Partha Talukdar. 2021. Muril: Multilingual representations for indian languages.
Kowsher et al. (2022) Md Kowsher, Abdullah As Sami, Nusrat Jahan Prottasha, Mohammad Shamsul Arefin, Pranab Kumar Dhar, and Takeshi Koshiba. 2022. Bangla-bert: transformer-based efficient model for transfer learning and language understanding. IEEE Access, 10:91855–91870.
Koyama et al. (2021) Aomi Koyama, Kengo Hotate, Masahiro Kaneko, and Mamoru Komachi. 2021. Comparison of grammatical error correction using back-translation models. arXiv preprint arXiv:2104.07848.
Lan et al. (2019) Zhenzhong Lan, Mingda Chen, Sebastian Goodman, Kevin Gimpel, Piyush Sharma, and Radu Soricut. 2019. Albert: A lite bert for self-supervised learning of language representations. arXiv preprint arXiv:1909.11942.
Lee et al. (2017) Jason Lee, Kyunghyun Cho, and Thomas Hofmann. 2017. Fully character-level neural machine translation without explicit segmentation. Transactions of the Association for Computational Linguistics, 5:365–378.
Lin (2004) Chin-Yew Lin. 2004. ROUGE: A package for automatic evaluation of summaries. In Text Summarization Branches Out, pages 74–81, Barcelona, Spain. Association for Computational Linguistics.
Liu et al. (2019) Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. 2019. Roberta: A robustly optimized bert pretraining approach. arXiv preprint arXiv:1907.11692.
Mäntylä et al. (2018) Mika V Mäntylä, Daniel Graziotin, and Miikka Kuutila. 2018. The evolution of sentiment analysis—a review of research topics, venues, and top cited papers. Computer Science Review, 27:16–32.
Ng et al. (2014) Hwee Tou Ng, Siew Mei Wu, Ted Briscoe, Christian Hadiwinoto, Raymond Hendy Susanto, and Christopher Bryant. 2014. The CoNLL-2014 shared task on grammatical error correction. In Proceedings of the Eighteenth Conference on Computational Natural Language Learning: Shared Task, pages 1–14, Baltimore, Maryland. Association for Computational Linguistics.
Niu et al. (2020) Tong Niu, Semih Yavuz, Yingbo Zhou, Nitish Shirish Keskar, Huan Wang, and Caiming Xiong. 2020. Unsupervised paraphrasing with pretrained language models. arXiv preprint arXiv:2010.12885.
Papineni et al. (2002) Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. Bleu: a method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, pages 311–318, Philadelphia, Pennsylvania, USA. Association for Computational Linguistics.
Powers (2020) David MW Powers. 2020. Evaluation: from precision, recall and f-measure to roc, informedness, markedness and correlation. arXiv preprint arXiv:2010.16061.
Price et al. (2020) Ilan Price, Jordan Gifford-Moore, Jory Fleming, Saul Musker, Maayan Roichman, Guillaume Sylvain, Nithum Thain, Lucas Dixon, and Jeffrey Sorensen. 2020. Six attributes of unhealthy conversation. arXiv preprint arXiv:2010.07410.
Raffel et al. (2020) Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J Liu. 2020. Exploring the limits of transfer learning with a unified text-to-text transformer. The Journal of Machine Learning Research, 21(1):5485–5551.
Reimers and Gurevych (2019) Nils Reimers and Iryna Gurevych. 2019. Sentence-bert: Sentence embeddings using siamese bert-networks. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics.
Samia et al. (2022) Moythry Manir Samia, Alimul Rajee, Md Rakib Hasan, Mohammad Omar Faruq, and Pintu Chandra Paul. 2022. Aspect-based sentiment analysis for bengali text using bidirectional encoder representations from transformers (bert). International Journal of Advanced Computer Science and Applications, 13(12).
Sarker (2020) Sagor Sarker. 2020. Banglabert: Bengali mask language model for bengali language understanding.
Sarker (2021) Sagor Sarker. 2021. Bnlp: Natural language processing toolkit for bengali language. arXiv preprint arXiv:2102.00405.
Shajalal and Aono (2018) Md Shajalal and Masaki Aono. 2018. Semantic textual similarity in bengali text. In 2018 International Conference on Bangla Speech and Language Processing (ICBSLP), pages 1–5. IEEE.
Srivastava et al. (2020) Ankit Srivastava, Piyush Makhija, and Anuj Gupta. 2020. Noisy text data: Achilles’ heel of bert. In Proceedings of the Sixth Workshop on Noisy User-generated Text (W-NUT 2020), pages 16–21.
Sun and Jiang (2019) Yifu Sun and Haoming Jiang. 2019. Contextual text denoising with masked language models. arXiv preprint arXiv:1910.14080.
Wikipedia (2023) Wikipedia. 2023. List of languages by total number of speakers — Wikipedia, the free encyclopedia. https://en.wikipedia.org/wiki/List_of_languages_by_total_number_of_speakers. [Online; accessed 13-June-2023].
Xue et al. (2020) Linting Xue, Noah Constant, Adam Roberts, Mihir Kale, Rami Al-Rfou, Aditya Siddhant, Aditya Barua, and Colin Raffel. 2020. mt5: A massively multilingual pre-trained text-to-text transformer. arXiv preprint arXiv:2010.11934.
Zaky and Romadhony (2019) Damar Zaky and Ade Romadhony. 2019. An lstm-based spell checker for indonesian text. 2019 International Conference of Advanced Informatics: Concepts, Theory and Applications (ICAICTA), pages 1–6.
Zhang et al. (2019) Tianyi Zhang, Varsha Kishore, Felix Wu, Kilian Q Weinberger, and Yoav Artzi. 2019. Bertscore: Evaluating text generation with bert. arXiv preprint arXiv:1904.09675.

Appendix

Appendix A Model Descriptions

A.1 Noise Identification

(a) SVM. Support Vector Machine (SVM) is designed to find a hyperplane in a high-dimensional space. This hyperplane separates data points of different classes while maximizing the margin between these classes. For feature extraction, the TF-IDF Vectorizer was employed, utilizing both a character analyzer and a word analyzer. These are represented as SVM (C) for the character analyzer and SVM (W) for the word analyzer, respectively, using n-grams in the range of 1 to 4. Additionally, a combination of both character and word n-gram features was tested, denoted as SVM (C + W).
(b) BiLSTM. BiLSTM captures long-range dependencies and contextual information among items in a sequence. It has two LSTM layers, one that reads the input sequence in a forward direction and the other in a reverse direction. The outputs of these two layers are then concatenated to produce a final output for each item in the sequence. Our BiLSTM implementation features an embedding size of 512, a hidden size of 110, and consists of 2 layers.
(c) Bangla-BERT-Base. A Bangla language model pre-trained with the masked language modeling objective (Sarker, 2020). It has the same architecture as the bert-base-uncased (Devlin et al., 2018b) model, with an embedding size of 768 and a total of 110M parameters.
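The character and word n-gram features described for the SVM baselines can be sketched as follows. This is a minimal, standard-library illustration of n-gram extraction in the range 1 to 4 only; it omits the TF-IDF weighting and the SVM itself, and the function names `ngrams`, `char_ngrams`, and `word_ngrams` are hypothetical, not taken from the paper's implementation.

```python
from typing import List


def ngrams(units: List[str], n_min: int = 1, n_max: int = 4, joiner: str = "") -> List[str]:
    """Return every contiguous n-gram of `units` with n_min <= n <= n_max."""
    grams = []
    for n in range(n_min, n_max + 1):
        # Slide a window of width n over the sequence.
        for i in range(len(units) - n + 1):
            grams.append(joiner.join(units[i:i + n]))
    return grams


def char_ngrams(text: str) -> List[str]:
    """Character-level n-grams, analogous to the SVM (C) features."""
    return ngrams(list(text), joiner="")


def word_ngrams(text: str) -> List[str]:
    """Word-level n-grams over whitespace tokens, analogous to the SVM (W) features."""
    return ngrams(text.split(), joiner=" ")


def combined_ngrams(text: str) -> List[str]:
    """Concatenation of both feature sets, analogous to SVM (C + W)."""
    return char_ngrams(text) + word_ngrams(text)
```

Character n-grams keep sub-word fragments of misspelled or unusually spaced words, which is one plausible reason to try them alongside word n-grams on noisy text.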

A.2 Noise Reduction

(a) BanglaT5. A sequence-to-sequence transformer model pre-trained with the span corruption objective (Bhattacharjee et al., 2022b). It has 247 million parameters and an embedding size of 768. For back-translation, the BanglaT5 model pre-trained on the BanglaNMT Bangla-English dataset (Bhattacharjee et al., 2022b) is used for Bangla-to-English translation, and the BanglaT5 model pre-trained on the BanglaNMT English-Bangla dataset (Bhattacharjee et al., 2022b) is used for English-to-Bangla translation. The paraphrasing model we employ is also a BanglaT5 model, pre-trained on the BanglaParaphrase dataset (Akil et al., 2022).
(b) BanglaBERT Generator. This is an ELECTRA (Clark et al., 2020) generator pre-trained with the Masked Language Modeling (MLM) objective on extensive Bangla corpora (Bhattacharjee et al., 2022a). It has an embedding size of 768 and 110M parameters. We use this model to perform MLM on out-of-vocabulary words and to perform random MLM, in which each word has a 20% probability of being masked.
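The random masking step described above can be sketched as follows. This covers only the choice of which words to mask; filling the masked positions with the BanglaBERT generator is omitted. The function name `random_mask`, the whitespace tokenization, and the `[MASK]` token string are assumptions made for illustration.

```python
import random
from typing import Optional


def random_mask(text: str, mask_prob: float = 0.2,
                mask_token: str = "[MASK]", seed: Optional[int] = None) -> str:
    """Replace each whitespace-separated word with `mask_token` with probability `mask_prob`."""
    rng = random.Random(seed)  # a local generator keeps runs reproducible for a given seed
    words = text.split()
    masked = [mask_token if rng.random() < mask_prob else word for word in words]
    return " ".join(masked)
```

Each word is masked independently, so the number of masked words varies between sentences; a short sentence may end up with no masked word at all.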

A.3 Sentiment Analysis

(a) BanglaBERT. An ELECTRA (Clark et al., 2020) discriminator model pre-trained with the Replaced Token Detection (RTD) objective. It has an embedding size of 768 and a total of 110M parameters (Bhattacharjee et al., 2022a).
(b) BanglaBERT Large. A larger variant of BanglaBERT, with 335M parameters and an embedding size of 1024 (Bhattacharjee et al., 2022a).
(c) sahajBERT (https://huggingface.co/neuropark/sahajBERT). Pre-trained on Bangla using Masked Language Modeling (MLM) and Sentence Order Prediction (SOP) objectives. It follows the A Lite BERT (ALBERT) (Lan et al., 2019) architecture and has a total of 18M parameters and an embedding size of 128.
(d) Bangla-Electra (https://huggingface.co/monsoon-nlp/bangla-electra). Trained with ELECTRA-small (Clark et al., 2020), with an embedding size of 128 and a total of 14M parameters.
(e) MuRIL. A BERT model pre-trained on 17 Indian languages and their transliterated counterparts (Khanuja et al., 2021). It has 110M parameters and an embedding size of 768 for each token. The model is pre-trained on both monolingual and parallel segments.

Appendix B Performance Evaluation Metrics

B.1 Noise Reduction

(a) BLEU. BiLingual Evaluation Understudy (Papineni et al., 2002) is a commonly used scoring method that measures the overlap between reference and candidate sentences, providing a similarity measurement.
(b) ROUGE-L. Recall-Oriented Understudy for Gisting Evaluation - Longest Common Subsequence (Lin, 2004) computes a similarity score from the longest common subsequence shared by the reference and candidate sentences. Like BLEU, this metric gives little insight into semantic similarity; it only measures the overlap of words and subsequences.
(c) BERTScore. BERTScore (Zhang et al., 2019) uses the cosine similarity between contextual token embeddings produced by a BERT-based model. For this, we used the bert-score library (https://pypi.org/project/bert-score/), which uses a multilingual BERT for Bangla sentences.
(d) SBERT Score. For this method, we employed paraphrase-multilingual-MiniLM-L12-v2 (Reimers and Gurevych, 2019), a model that maps sentences and paragraphs to a 384-dimensional dense vector space. It supports more than 50 languages and uses cosine similarity to assess the similarity between the input text and the ground truth.
(e) BSTS. Bangla Semantic Textual Similarity was first introduced by Shajalal and Aono (2018). It uses Word2Vec embeddings to calculate the similarity between two sentences.
(f) BERT-iBLEU. This scoring method, originally proposed by Niu et al. (2020), combines BERTScore and BLEU to measure the semantic similarity of sentences while penalizing the presence of similar words. It is particularly suitable for our needs, as we intend to evaluate each method on its ability to keep the semantic meaning intact while making the changes necessary to reduce noise.
(g) Word Coverage. Pre-trained word embedding models like FastText (Sarker, 2021), and Word2Vec (Sarker, 2021) create a vocabulary on the corpus they are trained on. As they are trained on noiseless sources like Wikipedia articles, their vocabulary contains accurate words. By measuring the percentage of tokens of our data covered in their vocabulary, we can gain insight into what percentage of tokens were noise reduced properly. However, this method may not address all types of noises. Additionally, we also calculated word coverage using the vocabulary of Bangla-BERT-Base (Sarker, 2020).
(h) Human Evaluation. Annotators evaluated the output texts by comparing them to the 1000 established ground truths. A noise-reduced output was considered correct if it retained the same meaning as the ground truth and removed some or all of the noise from the original sentence. In essence, the score is the proportion of accurately noise-reduced texts among the 1000 ground truths. The score is defined as:

\text{Score (Human Evaluation)}=\frac{x}{T}\times 100

Here, x is the number of accurately noise-reduced texts and T is the total number of texts.
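The counting-based metrics above can be sketched in a few lines. This is a minimal illustration assuming whitespace tokenization and a balanced F-measure for ROUGE-L; the function names are hypothetical, and the reported scores may have been computed with different tokenization or weighting.

```python
from typing import List, Set


def lcs_length(a: List[str], b: List[str]) -> int:
    """Length of the longest common subsequence of two token lists (dynamic programming)."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f1(reference: str, candidate: str) -> float:
    """ROUGE-L as the harmonic mean of LCS-based precision and recall (assumed weighting)."""
    ref, cand = reference.split(), candidate.split()
    lcs = lcs_length(ref, cand)
    if lcs == 0:
        return 0.0
    recall = lcs / len(ref)
    precision = lcs / len(cand)
    return 2 * precision * recall / (precision + recall)


def word_coverage(tokens: List[str], vocabulary: Set[str]) -> float:
    """Percentage of tokens that appear in a pre-trained model's vocabulary."""
    if not tokens:
        return 0.0
    covered = sum(1 for token in tokens if token in vocabulary)
    return 100.0 * covered / len(tokens)


def human_eval_score(x: int, T: int) -> float:
    """Score (Human Evaluation) = x / T * 100, with x correct outputs out of T."""
    return x / T * 100
```

For instance, a human evaluation score of 37.90% over T = 1000 ground truths corresponds to x = 379 outputs judged correct.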

B.2 Classification

For both classification tasks (noise and sentiment), we used micro-averaged precision, recall, and F1-score.
(a) Precision. Precision measures the accuracy of positive predictions, specifically how many of them are correct (true positives) (Powers, 2020). Alternatively known as True Positive Accuracy (TPA), it is calculated as:

\text{Precision}=\frac{TP}{TP+FP}

where TP indicates true positive and FP indicates false positive.
(b) Recall. Recall, or True Positive Rate (TPR), gauges the classifier’s ability to accurately predict positive cases by determining how many of them it correctly identified out of all the positive cases in the dataset (Powers, 2020). It is defined as:

\text{Recall}=\frac{TP}{TP+FN}

where TP indicates true positive and FN indicates false negative.
(c) F1-Score. The F1-score is the harmonic mean of precision and recall, providing a balance between the two in cases where one may be more significant than the other. F1-score is defined as:

\text{F1-Score}=2\times\frac{\text{Precision}\times\text{Recall}}{\text{Precision}+\text{Recall}}
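Micro-averaging pools true positives, false positives, and false negatives over all examples and labels before applying the formulas above. The following is a minimal sketch for the multi-label noise identification setting, with each example represented as a set of labels; the function name `micro_prf` and the sample labels are illustrative only. Single-label sentiment predictions fit the same interface as one-element sets.

```python
from typing import List, Set, Tuple


def micro_prf(gold: List[Set[str]], pred: List[Set[str]]) -> Tuple[float, float, float]:
    """Micro-averaged precision, recall, and F1 over paired gold/predicted label sets."""
    tp = fp = fn = 0
    for g, p in zip(gold, pred):
        tp += len(g & p)  # labels predicted and present in the gold set
        fp += len(p - g)  # labels predicted but absent from the gold set
        fn += len(g - p)  # gold labels the model missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


gold = [{"Spelling Error", "Spacing Error"}, {"Local Word"}, set()]
pred = [{"Spelling Error"}, {"Local Word", "Mixed Language"}, set()]
precision, recall, f1 = micro_prf(gold, pred)
```

In this toy example the pooled counts are TP = 2, FP = 1, and FN = 1, so precision, recall, and F1 all equal 2/3.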

Appendix C Types of Noise in NC-SentNoB

The NC-SentNoB dataset contains labeled data for 10 types of noise. Table 12 gives the definition of each noise type that annotators used during annotation. For Punctuation Error, an exception was made for sentences that end without a period "." because of the nature of the data. If such instances were counted as errors, the majority of the data would be labeled as having punctuation errors. Trained models could then focus predominantly on this single error instead of learning a broader range of punctuation errors.

Table 12: Types of noise with the definitions used to annotate the dataset. N represents the original noisy sentence, C represents the corrected sentence, and E represents the corresponding English translation. The types Coined Word and Others do not have any correction, as these types of noise are essential to the meaning of the sentence. For each example, noisy words of that particular type are marked in red, and their corrections are marked in green.

Type	Definition	Example with Correction
Local Word	Any regional words even if there is a spelling error	[N] pResh/nr saeth Ut/terr ekan iml paIlam na [C] pResh/nr saeth Ut/terr ekan iml eplam na [E] I did not find any similarity between the question and the answer.
Word Misuse	Wrong use of words or unnecessary repetitions of words	[N] taek AaIenr AaOtay shain/t edOya eHak [C] taek AaIenr AaOtay shais/t edOya eHak [E] He should be punished under the law.
Context/Word missing	Not enough information or missing words	[N] itin EkmaE paern EI mHaibpd - prRithbiiek rkKa kret [C] itin EkmatR paern EI mHaibpd ethek prRithbiiek rkKa kret [E] He is the only one who can save the world from this catastrophe.
Wrong Serial	Wrong order of the words	[N] saraedesh Apradhii khNNujun , Aaera Hey HenY [C] Aaera HenY Hey saraedesh Apradhii khNNujun [E] Search for the criminal desperately.
Mixed Language	Words in another language. Foreign words that were adopted into the Bangla language over time are excluded from this type.	[N] bhaIer EI inUjTa esra inIj [C] bhaIer, EI khbrTa esra khbr [E] Brother, this news is the best news.
Punctuation Error	Improper placement or missing punctuation. Sentences ending without "." (dNNairh) were excluded from this type.	[N] perr par/Tguela keb Aaseb bhaI 1 [C] perr pr/bguela keb Aaseb bhaI? [E] When will the next episodes air brother?
Spacing Error	Improper use of white space	[N] prhaeshana Ta cailey egel bhaela Heta [C] prhaeshanaTa cailey egel bhaela Heta [E] It would be better to continue studying
Spelling Error	Words not following spelling of Bangla Academy Dictionary	[N] baibek Et jal khaOyaena iThk na [C] bhabiiek Et jhal khaOyaena iThk na [E] It is not right to feed the sister-in-law so much spice.
Coined Word	Emoji, symbolic emoji, link	[N] Aaeg janel Aapnar saeth edkha krtam [emoji] [C] ✗ [E] If I knew I would’ve met you earlier [emoji]
Others	Noises that do not fall into categories mentioned above.	[N] red kut/tar bac/caedr phNNais caI [C] ✗ [E] I want those sons of bitches hanged.

Appendix D Failure Cases of Back-translation

To provide insight into the performance drop, Table 13 illustrates examples where back-translation with Google Translate fails to adequately reduce noise in the input text. Moreover, it often alters or completely removes important contextual words, which possibly affects sentiment analysis performance. Given a human evaluation score of 37.90%, back-translation via Google Translate fails to effectively correct roughly 62% of the 1000 manually corrected texts.

Noisy data and corresponding Back-Translation	Observation
[N] EI juyar Taka papn ept Aamar men Hy [C] EI juyar Taka papn epeta Aamar men Hy [E] I think the gambling money went to Papan. [B] Aaim men kir EI juyar Taka pireshadh kra Heb [BE] I think this gambling money will be repaid.	The input text contained only a spelling mistake, but the back-translation introduced new words, removed a named entity, and altered the sentence’s meaning.
[N] bhaI khabaerr sadh ejmin eHak na ekn Aapnar muekh egel esTa AimRt Hey Jay dhn/nbad [C] bhaI khabaerr sWad eJmin eHak na ekn Aapnar muekh egel esTa AmrRt Hey Jay, dhnYbad [E] Brother, whatever the taste of the food is, it becomes nectar in your mouth, thanks. [B] bhaI, khabaerr sWad JaI eHak na ekn, ETa Aapnar muekh Aaech. [BE] Brother, whatever the taste of the food is, it’s in your mouth.	The input text had multiple spelling mistakes and punctuation errors. The back-translation corrected one of these errors but changed the meaning of part of the sentence.
[N] Eedr ipeTr camrha etala Heb [C] Eedr ipeThr camrha etala Heb [E] Their backs will be skinned. [B] tara camrha Heb [BE] They will become skin.	The input text contained only a spelling mistake. However, the back-translation removed contextually important words, rendering the sentence meaningless.
[N] A amar emechr saeth EI eHaeTl [C] Aamar emesr saeth EI eHaeTl [E] This hotel is with my hostel. [B] Aamar jal idey EI eHaeTl [BE] This hotel with my net.	The back-translation altered a keyword in the sentence, which resulted in a loss of meaning.
[N] ilbur saet ik Aada ekhet Heb 1 [C] elbur saeth ik Aada ekhet Heb? [E] Do I need to eat ginger with lemon? [B] Aaim ik Libur Sate E Aada khaOya Uict? [BE] Should I eat ginger with Libur Sate?	The back-translation failed to correct a spelling mistake and converted the word into English, but it successfully added the missing punctuation.
[N] rana edr mt echelra jaet Hairey na jay [C] ranaedr mt echelra Jaet Hairey na Jay [E] So that boys like Rana don’t get lost. [B] ranar meta echelra eres eHer Jay na [BE] Boys like Rana do not lose in race.	The input sentence had spacing and spelling errors. The back-translation fixed the spacing issue but introduced mixed language, changing the sentence’s meaning.

Table 13: Example scenarios where back-translation with Google Translate fails to reduce noise in the text. N represents the original noisy sentence, C represents the corrected sentence, E represents its English translation, B represents the result of back-translation, and BE represents the direct English translation of the back-translated output. For each example, noisy words are marked in red and noise-reduced words are marked in green.