-
Metadata Integration for Spam Reviews Detection on Vietnamese E-commerce Websites
Authors:
Co Van Dinh,
Son T. Luu
Abstract:
The problem of detecting spam reviews (opinions) has received significant attention in recent years, especially with the rapid development of e-commerce. Spam reviews are often classified based on comment content alone, but in some cases this content is insufficient for models to accurately determine the review label. In this work, we introduce the ViSpamReviews v2 dataset, which includes metadata of reviews with the objective of integrating supplementary attributes for spam review classification. We propose a novel approach to simultaneously integrate both textual and categorical attributes into the classification model. In our experiments, the product category proved effective when combined with deep neural network (DNN) models, while text features performed well on both DNN models and PhoBERT, the model that achieved state-of-the-art performance in detecting spam reviews on Vietnamese e-commerce websites. Specifically, the PhoBERT model achieves the highest accuracy when combined with product description features generated from the SPhoBert model, which is the combination of PhoBERT and SentenceBERT. Using the macro-averaged F1 score, the task of classifying spam reviews achieved 87.22% (an increase of 1.64% compared to the baseline), while the task of identifying the type of spam reviews achieved an accuracy of 73.49% (an increase of 1.93% compared to the baseline).
Submitted 21 May, 2024;
originally announced May 2024.
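Several entries in this listing report macro-averaged F1. As a minimal sketch of how that metric is usually computed (not the authors' evaluation code), the function below takes the unweighted mean of per-class F1 scores. The function name `macro_f1` and the toy labels are hypothetical and chosen only for illustration.

```python
def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 scores (illustrative helper)."""
    labels = sorted(set(y_true) | set(y_pred))
    scores = []
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = sum(1 for t, p in pairs if t == label and p != label)
        # F1 = 2*TP / (2*TP + FP + FN); defined as 0 when the class never occurs.
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Toy labels, made up for this example only.
gold = ["spam", "spam", "no-spam", "no-spam"]
pred = ["spam", "no-spam", "no-spam", "no-spam"]
score = macro_f1(gold, pred)
```

Because every class contributes equally, a rare class with poor F1 pulls the macro average down more than it would pull down plain accuracy, which is one reason the metric is common on imbalanced label sets.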
-
Exploiting Hatred by Targets for Hate Speech Detection on Vietnamese Social Media Texts
Authors:
Cuong Nhat Vo,
Khanh Bao Huynh,
Son T. Luu,
Trong-Hop Do
Abstract:
The growth of social networks makes toxic content spread rapidly. Hate speech detection is a task that helps decrease the number of harmful comments. Given the diversity of hate speech created by users, it is necessary to interpret hate speech in addition to detecting it. Hence, we propose a methodology to construct a system for targeted hate speech detection from online streaming texts from social media. We first introduce ViTHSD, a targeted hate speech detection dataset for Vietnamese social media texts. The dataset contains 10K comments, and each comment is labeled for specific targets at three levels: clean, offensive, and hate. There are 5 targets in the dataset, and each target is manually labeled with the corresponding level by human annotators following strict annotation guidelines. The inter-annotator agreement obtained on the dataset is 0.45 by Cohen's Kappa index, which indicates a moderate level of agreement. Then, we construct a baseline for this task by combining the Bi-GRU-LSTM-CNN with a pre-trained language model to leverage the text representation power of BERTology. Finally, we suggest a methodology to integrate the baseline model for targeted hate speech detection into an online streaming system for practical application in preventing hateful and offensive content on social media.
Submitted 30 April, 2024;
originally announced April 2024.
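The abstract above reports inter-annotator agreement as Cohen's Kappa. The following is a minimal sketch of the standard two-rater formula, kappa = (p_o - p_e) / (1 - p_e), where p_o is the observed agreement and p_e the agreement expected by chance. It is not the authors' annotation tooling; the function name `cohens_kappa` and the toy annotations are hypothetical.

```python
from collections import Counter


def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters labeling the same items (illustrative helper)."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("both raters must label the same non-empty item list")
    n = len(rater_a)
    # Observed agreement: fraction of items given the same label by both raters.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement: product of each rater's marginal label frequencies.
    count_a, count_b = Counter(rater_a), Counter(rater_b)
    expected = sum(count_a[label] * count_b[label] for label in count_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used a single identical label throughout
    return (observed - expected) / (1 - expected)


# Toy annotations, made up for this example only.
annotator_1 = ["clean", "clean", "hate", "offensive"]
annotator_2 = ["clean", "hate", "hate", "offensive"]
kappa = cohens_kappa(annotator_1, annotator_2)
```

Kappa corrects raw agreement for chance, so two annotators who agree on 75% of items can still land well below 0.75 when a few labels dominate.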
-
VLSP 2023 -- LTER: A Summary of the Challenge on Legal Textual Entailment Recognition
Authors:
Vu Tran,
Ha-Thanh Nguyen,
Trung Vo,
Son T. Luu,
Hoang-Anh Dang,
Ngoc-Cam Le,
Thi-Thuy Le,
Minh-Tien Nguyen,
Truong-Son Nguyen,
Le-Minh Nguyen
Abstract:
In this new era of rapid AI development, especially in language processing, the demand for AI in the legal domain is increasingly critical. In the context where research in other languages such as English, Japanese, and Chinese has been well-established, we introduce the first fundamental research for the Vietnamese language in the legal domain: legal textual entailment recognition through the Vietnamese Language and Speech Processing workshop. In analyzing participants' results, we discuss certain linguistic aspects critical in the legal domain that pose challenges that need to be addressed.
Submitted 5 March, 2024;
originally announced March 2024.
-
VlogQA: Task, Dataset, and Baseline Models for Vietnamese Spoken-Based Machine Reading Comprehension
Authors:
Thinh Phuoc Ngo,
Khoa Tran Anh Dang,
Son T. Luu,
Kiet Van Nguyen,
Ngan Luu-Thuy Nguyen
Abstract:
This paper presents the development process of a Vietnamese spoken language corpus for machine reading comprehension (MRC) tasks and provides insights into the challenges and opportunities associated with using real-world data for machine reading comprehension tasks. The existing MRC corpora in Vietnamese mainly focus on formal written documents such as Wikipedia articles, online newspapers, or textbooks. In contrast, the VlogQA consists of 10,076 question-answer pairs based on 1,230 transcript documents sourced from YouTube -- an extensive source of user-uploaded content, covering the topics of food and travel. By capturing the spoken language of native Vietnamese speakers in natural settings, an obscure corner overlooked in Vietnamese research, the corpus provides a valuable resource for future research in reading comprehension tasks for the Vietnamese language. Regarding performance evaluation, our deep-learning models achieved the highest F1 score of 75.34% on the test set, indicating significant progress in machine reading comprehension for Vietnamese spoken language data. In terms of EM, the highest score we accomplished is 53.97%, which reflects the challenge in processing spoken-based content and highlights the need for further improvement.
Submitted 6 April, 2024; v1 submitted 4 February, 2024;
originally announced February 2024.
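The F1 and EM figures above are the usual pair of metrics for extractive reading comprehension. The sketch below assumes a SQuAD-style definition (exact string match, and token-overlap F1 between predicted and gold answers). The abstract does not spell out the exact normalization used, so the whitespace tokenization and lowercasing here are assumptions, and the function names are hypothetical.

```python
from collections import Counter


def exact_match(prediction, gold):
    """1.0 if the answers match after trimming and lowercasing, else 0.0."""
    return float(prediction.strip().lower() == gold.strip().lower())


def token_f1(prediction, gold):
    """Token-overlap F1 between a predicted and a gold answer string."""
    pred_tokens = prediction.lower().split()
    gold_tokens = gold.lower().split()
    # Multiset intersection counts each shared token at most as often as it
    # appears in both answers.
    overlap = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)


# Toy answers, made up for this example only.
em = exact_match("pho bo Ha Noi", "pho bo")
f1 = token_f1("pho bo Ha Noi", "pho bo")
```

A prediction that contains the gold span plus extra words scores zero on EM but keeps partial credit on F1, which is consistent with EM being noticeably lower than F1 on spoken-style transcripts.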
-
A Text-based Approach For Link Prediction on Wikipedia Articles
Authors:
Anh Hoang Tran,
Tam Minh Nguyen,
Son T. Luu
Abstract:
This paper presents our work in the DSAA 2023 Challenge on Link Prediction for Wikipedia Articles. We use traditional machine learning models with part-of-speech (POS) tag features extracted from text to train a classification model for predicting whether two nodes are linked. Then, we use these tags to test various machine learning models. We obtained an F1 score of 0.99999 and took 7th place in the competition. Our source code is publicly available at this link: https://github.com/Tam1032/DSAA2023-Challenge-Link-prediction-DS-UIT_SAT
Submitted 6 November, 2023; v1 submitted 1 September, 2023;
originally announced September 2023.
-
A Multiple Choices Reading Comprehension Corpus for Vietnamese Language Education
Authors:
Son T. Luu,
Khoi Trong Hoang,
Tuong Quang Pham,
Kiet Van Nguyen,
Ngan Luu-Thuy Nguyen
Abstract:
Machine reading comprehension has been an interesting and challenging task in recent years, with the purpose of extracting useful information from texts. To give computers the ability to understand a reading text and answer related questions, we introduce ViMMRC 2.0, an extension of the previous ViMMRC for the task of multiple-choice reading comprehension on Vietnamese textbooks, which contain reading articles for students from Grade 1 to Grade 12. This dataset has 699 reading passages, which are prose and poems, and 5,273 questions. The questions in the new dataset are not fixed at four options as in the previous version. Moreover, the difficulty of the questions is increased, which challenges the models to find the correct choice. The computer must understand the whole context of the reading passage, the question, and the content of each choice to extract the right answers. Hence, we propose a multi-stage approach that combines the multi-step attention network (MAN) with the natural language inference (NLI) task to enhance the performance of the reading comprehension model. Then, we compare the proposed methodology with the baseline BERTology models on the new dataset and on ViMMRC 1.0. Our multi-stage models achieved an accuracy of 58.81% on the test set, which is 5.34% better than the best BERTology model. From the error analysis, we found that the main challenge for reading comprehension models is understanding the implicit context in texts and linking the pieces together in order to find the correct answers. Finally, we hope our new dataset will motivate further research on enhancing the language understanding ability of computers in the Vietnamese language.
Submitted 31 March, 2023;
originally announced March 2023.
-
Integrating Image Features with Convolutional Sequence-to-sequence Network for Multilingual Visual Question Answering
Authors:
Triet Minh Thai,
Son T. Luu
Abstract:
Visual Question Answering (VQA) is a task that requires computers to give correct answers for the input questions based on the images. This task can be solved by humans with ease but is a challenge for computers. The VLSP2022-EVJVQA shared task carries the Visual Question Answering task in the multilingual domain on a newly released dataset: UIT-EVJVQA, in which the questions and answers are written in three different languages: English, Vietnamese and Japanese. We approached the challenge as a sequence-to-sequence learning task, in which we integrated hints from pre-trained state-of-the-art VQA models and image features with Convolutional Sequence-to-Sequence network to generate the desired answers. Our results obtained up to 0.3442 by F1 score on the public test set, 0.4210 on the private test set, and placed 3rd in the competition.
Submitted 3 September, 2023; v1 submitted 22 March, 2023;
originally announced March 2023.
-
Improving Sentiment Analysis By Emotion Lexicon Approach on Vietnamese Texts
Authors:
An Long Doan,
Son T. Luu
Abstract:
The sentiment analysis task has various applications in practice. In the sentiment analysis task, words and phrases that represent positive and negative emotions are important. Finding out the words that represent the emotion from the text can improve the performance of the classification models for the sentiment analysis task. In this paper, we propose a methodology that combines the emotion lexicon with the classification model to enhance the accuracy of the models. Our experimental results show that the emotion lexicon combined with the classification model improves the performance of models.
Submitted 3 December, 2022; v1 submitted 5 October, 2022;
originally announced October 2022.
-
UIT-ViCoV19QA: A Dataset for COVID-19 Community-based Question Answering on Vietnamese Language
Authors:
Triet Minh Thai,
Ngan Ha-Thao Chu,
Anh Tuan Vo,
Son T. Luu
Abstract:
For the last two years, from 2020 to 2021, COVID-19 has broken disease prevention measures in many countries, including Vietnam, and negatively impacted various aspects of human life and the social community. Besides, the misleading information in the community and fake news about the pandemic are also serious situations. Therefore, we present the first Vietnamese community-based question answering dataset for developing question answering systems for COVID-19 called UIT-ViCoV19QA. The dataset comprises 4,500 question-answer pairs collected from trusted medical sources, with at least one answer and at most four unique paraphrased answers per question. Along with the dataset, we set up various deep learning models as baseline to assess the quality of our dataset and initiate the benchmark results for further research through commonly used metrics such as BLEU, METEOR, and ROUGE-L. We also illustrate the positive effects of having multiple paraphrased answers experimented on these models, especially on Transformer - a dominant architecture in the field of study.
Submitted 14 September, 2022;
originally announced September 2022.
-
Detecting Spam Reviews on Vietnamese E-commerce Websites
Authors:
Co Van Dinh,
Son T. Luu,
Anh Gia-Tuan Nguyen
Abstract:
The reviews of customers play an essential role in online shopping. People often refer to reviews or comments of previous customers to decide whether to buy a new product. Catching up with this behavior, some people create untruths and illegitimate reviews to hoax customers about the fake quality of products. These are called spam reviews, confusing consumers on online shopping platforms and negatively affecting online shopping behaviors. We propose the dataset called ViSpamReviews, which has a strict annotation procedure for detecting spam reviews on e-commerce platforms. Our dataset consists of two tasks: the binary classification task for detecting whether a review is spam or not and the multi-class classification task for identifying the type of spam. The PhoBERT obtained the highest results on both tasks, 86.89% and 72.17%, respectively, by macro average F1 score.
Submitted 8 December, 2022; v1 submitted 27 July, 2022;
originally announced July 2022.
-
VLSP 2021 - ViMRC Challenge: Vietnamese Machine Reading Comprehension
Authors:
Kiet Van Nguyen,
Son Quoc Tran,
Luan Thanh Nguyen,
Tin Van Huynh,
Son T. Luu,
Ngan Luu-Thuy Nguyen
Abstract:
One of the emerging research trends in natural language understanding is machine reading comprehension (MRC) which is the task to find answers to human questions based on textual data. Existing Vietnamese datasets for MRC research concentrate solely on answerable questions. However, in reality, questions can be unanswerable for which the correct answer is not stated in the given textual data. To address the weakness, we provide the research community with a benchmark dataset named UIT-ViQuAD 2.0 for evaluating the MRC task and question answering systems for the Vietnamese language. We use UIT-ViQuAD 2.0 as a benchmark dataset for the challenge on Vietnamese MRC at the Eighth Workshop on Vietnamese Language and Speech Processing (VLSP 2021). This task attracted 77 participant teams from 34 universities and other organizations. In this article, we present details of the organization of the challenge, an overview of the methods employed by shared-task participants, and the results. The highest performances are 77.24% in F1-score and 67.43% in Exact Match on the private test set. The Vietnamese MRC systems proposed by the top 3 teams use XLM-RoBERTa, a powerful pre-trained language model based on the transformer architecture. The UIT-ViQuAD 2.0 dataset motivates researchers to further explore the Vietnamese machine reading comprehension task and related tasks such as question answering, question generation, and natural language inference.
Submitted 4 April, 2022; v1 submitted 21 March, 2022;
originally announced March 2022.
-
Predicting Job Titles from Job Descriptions with Multi-label Text Classification
Authors:
Hieu Trung Tran,
Hanh Hong Phuc Vo,
Son T. Luu
Abstract:
Finding a suitable job and hunting for eligible candidates are important to job seeking and human resource agencies. With the vast information about job descriptions, employees and employers need assistance to automatically detect job titles based on job description texts. In this paper, we propose the multi-label classification approach for predicting relevant job titles from job description texts, and implement the Bi-GRU-LSTM-CNN with different pre-trained language models to apply for the job titles prediction problem. The BERT with multilingual pre-trained model obtains the highest result by F1-scores on both development and test sets, which are 62.20% on the development set, and 47.44% on the test set.
Submitted 9 February, 2022; v1 submitted 21 December, 2021;
originally announced December 2021.
-
Automatically Detecting Cyberbullying Comments on Online Game Forums
Authors:
Hanh Hong-Phuc Vo,
Hieu Trung Tran,
Son T. Luu
Abstract:
Online game forums are popular with most game players. Players use them to communicate and discuss game strategy, or even to make friends. However, game forums also contain abusive and harassing speech that disturbs and threatens players. Therefore, it is necessary to automatically detect and remove cyberbullying comments to keep game forums clean and friendly. We use the Cyberbullying dataset collected from World of Warcraft (WoW) and League of Legends (LoL) forums and train classification models to automatically detect whether a player's comment is abusive or not. The Toxic-BERT model obtains a macro F1-score of 82.69% on the LoL forum and 83.86% on the WoW forum on the Cyberbullying dataset.
Submitted 26 December, 2021; v1 submitted 3 June, 2021;
originally announced June 2021.
-
Conversational Machine Reading Comprehension for Vietnamese Healthcare Texts
Authors:
Son T. Luu,
Mao Nguyen Bui,
Loi Duc Nguyen,
Khiem Vinh Tran,
Kiet Van Nguyen,
Ngan Luu-Thuy Nguyen
Abstract:
Machine reading comprehension (MRC) is a sub-field of natural language processing that aims to help computers understand unstructured texts and then answer questions related to them. In practice, conversation is an essential way to communicate and transfer information. To help machines understand conversational texts, we present UIT-ViCoQA, a new corpus for conversational machine reading comprehension in the Vietnamese language. This corpus consists of 10,000 questions with answers over 2,000 conversations about health news articles. Then, we evaluate several baseline approaches for conversational machine comprehension on the UIT-ViCoQA corpus. The best model obtains an F1 score of 45.27%, which is 30.91 points behind human performance (76.18%), indicating that there is ample room for improvement. Our dataset is available on our website: http://nlp.uit.edu.vn/datasets/ for research purposes.
Submitted 30 September, 2021; v1 submitted 4 May, 2021;
originally announced May 2021.
-
UIT-ISE-NLP at SemEval-2021 Task 5: Toxic Spans Detection with BiLSTM-CRF and ToxicBERT Comment Classification
Authors:
Son T. Luu,
Ngan Luu-Thuy Nguyen
Abstract:
We present our work on SemEval-2021 Task 5, Toxic Spans Detection. This task aims to build a model for identifying toxic words in whole posts. We use the BiLSTM-CRF model combined with ToxicBERT Classification to train the detection model for identifying toxic words in posts. Our model achieves an F1-score of 62.23% on the Toxic Spans Detection task.
Submitted 29 July, 2021; v1 submitted 20 April, 2021;
originally announced April 2021.
-
A Large-scale Dataset for Hate Speech Detection on Vietnamese Social Media Texts
Authors:
Son T. Luu,
Kiet Van Nguyen,
Ngan Luu-Thuy Nguyen
Abstract:
In recent years, Vietnam has witnessed massive growth in the number of social network users on different social platforms such as Facebook, YouTube, Instagram, and TikTok. On social media, hate speech has become a critical problem for social network users. To solve this problem, we introduce ViHSD, a human-annotated dataset for automatically detecting hate speech on social networks. This dataset contains over 30,000 comments, and each comment in the dataset has one of three labels: CLEAN, OFFENSIVE, or HATE. In addition, we introduce the data creation process for annotating and evaluating the quality of the dataset. Finally, we evaluate the dataset with deep learning models and transformer models.
Submitted 20 July, 2021; v1 submitted 21 March, 2021;
originally announced March 2021.
-
Empirical Study of Text Augmentation on Social Media Text in Vietnamese
Authors:
Son T. Luu,
Kiet Van Nguyen,
Ngan Luu-Thuy Nguyen
Abstract:
In the text classification problem, the imbalance of labels in datasets affects the performance of text-classification models. In practice, user comments on social networking sites do not all appear publicly: administrators often allow only positive comments and hide negative ones. Thus, when user comments are collected from a social network, the data is usually skewed toward one label, which makes the dataset imbalanced and degrades the model's ability. Data augmentation techniques are applied to address the imbalance between classes of the dataset and to increase the prediction model's accuracy. In this paper, we applied augmentation techniques to the VLSP2019 Hate Speech Detection on Vietnamese social texts and the UIT - VSFC: Vietnamese Students' Feedback Corpus for Sentiment Analysis. Augmentation increases the F1-macro score by about 1.5% on both corpora.
Submitted 9 October, 2020; v1 submitted 25 September, 2020;
originally announced September 2020.
-
BANANA at WNUT-2020 Task 2: Identifying COVID-19 Information on Twitter by Combining Deep Learning and Transfer Learning Models
Authors:
Tin Van Huynh,
Luan Thanh Nguyen,
Son T. Luu
Abstract:
The outbreak of the COVID-19 virus has had a significant impact on the health of people all over the world. Therefore, it is essential to provide everyone with constant and accurate information about the disease. This paper describes our prediction system for WNUT-2020 Task 2: Identification of Informative COVID-19 English Tweets. The dataset for this task contains 10,000 tweets in English labeled by humans. The ensemble model built from our three transformer and deep learning models is used for the final prediction. The experimental results indicate that our system achieved an F1 score of 88.81% for the INFORMATIVE label on the test set.
Submitted 1 April, 2021; v1 submitted 6 September, 2020;
originally announced September 2020.
-
An Experimental Study of Deep Neural Network Models for Vietnamese Multiple-Choice Reading Comprehension
Authors:
Son T. Luu,
Kiet Van Nguyen,
Anh Gia-Tuan Nguyen,
Ngan Luu-Thuy Nguyen
Abstract:
Machine reading comprehension (MRC) is a challenging task in natural language processing that requires computers to understand natural language texts and answer questions based on those texts. There are many techniques for solving this problem, and word representation is a very important technique that most affects the accuracy of machine reading comprehension in popular languages like English and Chinese. However, few studies on MRC have been conducted in low-resource languages such as Vietnamese. In this paper, we conduct several experiments on neural network-based models to understand the impact of word representation on Vietnamese multiple-choice machine reading comprehension. Our experiments include using the Co-match model on six different Vietnamese word embeddings and the BERT model for multiple-choice reading comprehension. On the ViMMRC corpus, the accuracy of the BERT model is 61.28% on the test set.
Submitted 18 February, 2021; v1 submitted 20 August, 2020;
originally announced August 2020.
-
Comparison Between Traditional Machine Learning Models And Neural Network Models For Vietnamese Hate Speech Detection
Authors:
Son T. Luu,
Hung P. Nguyen,
Kiet Van Nguyen,
Ngan Luu-Thuy Nguyen
Abstract:
Hate speech detection on social network language has recently become one of the main research fields due to the spread of social networks like Facebook and Twitter. In Vietnam, offensive content and harassment have a negative impact on online users. The VLSP shared task on Hate Speech Detection on social networks presented many proposed approaches for detecting whether a comment is clean or not. However, this problem still needs further research. Consequently, we compare traditional machine learning and deep learning on a large dataset of user comments on social networks in Vietnamese and identify the advantages and disadvantages of each model by comparing their F1-scores. We then pick the two models with the highest scores among the traditional machine learning models and the deep neural models, respectively. Next, we compare the ability of these two models to predict the right label by examining their confusion matrices and considering the advantages and disadvantages of each model. Finally, based on the comparison results, we propose an ensemble method that combines the abilities of traditional methods and deep learning methods.
Submitted 27 September, 2020; v1 submitted 31 January, 2020;
originally announced February 2020.
-
Enhancing lexical-based approach with external knowledge for Vietnamese multiple-choice machine reading comprehension
Authors:
Kiet Van Nguyen,
Khiem Vinh Tran,
Son T. Luu,
Anh Gia-Tuan Nguyen,
Ngan Luu-Thuy Nguyen
Abstract:
Although Vietnamese is the 17th most popular native-speaker language in the world, there are not many research studies on Vietnamese machine reading comprehension (MRC), the task of understanding a text and answering questions about it. One of the reasons is because of the lack of high-quality benchmark datasets for this task. In this work, we construct a dataset which consists of 2,783 pairs of multiple-choice questions and answers based on 417 Vietnamese texts which are commonly used for teaching reading comprehension for elementary school pupils. In addition, we propose a lexical-based MRC method that utilizes semantic similarity measures and external knowledge sources to analyze questions and extract answers from the given text. We compare the performance of the proposed model with several baseline lexical-based and neural network-based models. Our proposed method achieves 61.81% by accuracy, which is 5.51% higher than the best baseline model. We also measure human performance on our dataset and find that there is a big gap between machine-model and human performances. This indicates that significant progress can be made on this task. The dataset is freely available on our website for research purposes.
Submitted 1 November, 2020; v1 submitted 16 January, 2020;
originally announced January 2020.