-
Detecting Spam Reviews on Vietnamese E-commerce Websites
Authors:
Co Van Dinh,
Son T. Luu,
Anh Gia-Tuan Nguyen
Abstract:
The reviews of customers play an essential role in online shop**. People often refer to reviews or comments of previous customers to decide whether to buy a new product. Catching up with this behavior, some people create untruths and illegitimate reviews to hoax customers about the fake quality of products. These are called spam reviews, confusing consumers on online shop** platforms and negat…
▽ More
The reviews of customers play an essential role in online shop**. People often refer to reviews or comments of previous customers to decide whether to buy a new product. Catching up with this behavior, some people create untruths and illegitimate reviews to hoax customers about the fake quality of products. These are called spam reviews, confusing consumers on online shop** platforms and negatively affecting online shop** behaviors. We propose the dataset called ViSpamReviews, which has a strict annotation procedure for detecting spam reviews on e-commerce platforms. Our dataset consists of two tasks: the binary classification task for detecting whether a review is spam or not and the multi-class classification task for identifying the type of spam. The PhoBERT obtained the highest results on both tasks, 86.89% and 72.17%, respectively, by macro average F1 score.
△ Less
Submitted 8 December, 2022; v1 submitted 27 July, 2022;
originally announced July 2022.
-
XLMRQA: Open-Domain Question Answering on Vietnamese Wikipedia-based Textual Knowledge Source
Authors:
Kiet Van Nguyen,
Phong Nguyen-Thuan Do,
Nhat Duy Nguyen,
Tin Van Huynh,
Anh Gia-Tuan Nguyen,
Ngan Luu-Thuy Nguyen
Abstract:
Question answering (QA) is a natural language understanding task within the fields of information retrieval and information extraction that has attracted much attention from the computational linguistics and artificial intelligence research community in recent years because of the strong development of machine reading comprehension-based models. A reader-based QA system is a high-level search engi…
▽ More
Question answering (QA) is a natural language understanding task within the fields of information retrieval and information extraction that has attracted much attention from the computational linguistics and artificial intelligence research community in recent years because of the strong development of machine reading comprehension-based models. A reader-based QA system is a high-level search engine that can find correct answers to queries or questions in open-domain or domain-specific texts using machine reading comprehension (MRC) techniques. The majority of advancements in data resources and machine-learning approaches in the MRC and QA systems especially are developed significantly in two resource-rich languages such as English and Chinese. A low-resource language like Vietnamese has witnessed a scarcity of research on QA systems. This paper presents XLMRQA, the first Vietnamese QA system using a supervised transformer-based reader on the Wikipedia-based textual knowledge source (using the UIT-ViQuAD corpus), outperforming the two robust QA systems using deep neural network models: DrQA and BERTserini with 24.46% and 6.28%, respectively. From the results obtained on the three systems, we analyze the influence of question types on the performance of the QA systems.
△ Less
Submitted 13 August, 2022; v1 submitted 14 April, 2022;
originally announced April 2022.
-
B-DAC: A Decentralized Access Control Framework on Northbound Interface for Securing SDN Using Blockchain
Authors:
Phan The Duy,
Hien Do Hoang,
Do Thi Thu Hien,
Anh Gia-Tuan Nguyen,
Van-Hau Pham
Abstract:
Software-Defined Network (SDN) is a new arising terminology of network architecture with outstanding features of orchestration by decoupling the control plane and the data plane in each network element. Even though it brings several benefits, SDN is vulnerable to a diversity of attacks. Abusing the single point of failure in the SDN controller component, hackers can shut down all network operation…
▽ More
Software-Defined Network (SDN) is a new arising terminology of network architecture with outstanding features of orchestration by decoupling the control plane and the data plane in each network element. Even though it brings several benefits, SDN is vulnerable to a diversity of attacks. Abusing the single point of failure in the SDN controller component, hackers can shut down all network operations. More specifics, a malicious OpenFlow application can access to SDN controller to carry out harmful actions without any limitation owing to the lack of the access control mechanism as a standard in the Northbound. The sensitive information about the whole network such as network topology, flow information, and statistics can be gathered and leaked out. Even worse, the entire network can be taken over by the compromised controller. Hence, it is vital to build a scheme of access control for SDN's Northbound. Furthermore, it must also protect the data integrity and availability during data exchange between application and controller. To address such limitations, we introduce B-DAC, a blockchain-based framework for decentralized authentication and fine-grained access control for the Northbound interface to assist administrators in managing and protecting critical resources. With strict policy enforcement, B-DAC can perform decentralized access control for each request to keep network applications under surveillance for preventing over-privileged activities or security policy conflicts. To demonstrate the feasibility of our approach, we also implement a prototype of this framework to evaluate the security impact, effectiveness, and performance through typical use cases.
△ Less
Submitted 1 November, 2021;
originally announced November 2021.
-
Monolingual versus Multilingual BERTology for Vietnamese Extractive Multi-Document Summarization
Authors:
Huy Quoc To,
Kiet Van Nguyen,
Ngan Luu-Thuy Nguyen,
Anh Gia-Tuan Nguyen
Abstract:
Recent researches have demonstrated that BERT shows potential in a wide range of natural language processing tasks. It is adopted as an encoder for many state-of-the-art automatic summarizing systems, which achieve excellent performance. However, so far, there is not much work done for Vietnamese. In this paper, we showcase how BERT can be implemented for extractive text summarization in Vietnames…
▽ More
Recent researches have demonstrated that BERT shows potential in a wide range of natural language processing tasks. It is adopted as an encoder for many state-of-the-art automatic summarizing systems, which achieve excellent performance. However, so far, there is not much work done for Vietnamese. In this paper, we showcase how BERT can be implemented for extractive text summarization in Vietnamese on multi-document. We introduce a novel comparison between different multilingual and monolingual BERT models. The experiment results indicate that monolingual models produce promising results compared to other multilingual models and previous text summarizing models for Vietnamese.
△ Less
Submitted 16 October, 2021; v1 submitted 31 August, 2021;
originally announced August 2021.
-
Sentence Extraction-Based Machine Reading Comprehension for Vietnamese
Authors:
Phong Nguyen-Thuan Do,
Nhat Duy Nguyen,
Tin Van Huynh,
Kiet Van Nguyen,
Anh Gia-Tuan Nguyen,
Ngan Luu-Thuy Nguyen
Abstract:
The development of natural language processing (NLP) in general and machine reading comprehension in particular has attracted the great attention of the research community. In recent years, there are a few datasets for machine reading comprehension tasks in Vietnamese with large sizes, such as UIT-ViQuAD and UIT-ViNewsQA. However, the datasets are not diverse in answers to serve the research. In t…
▽ More
The development of natural language processing (NLP) in general and machine reading comprehension in particular has attracted the great attention of the research community. In recent years, there are a few datasets for machine reading comprehension tasks in Vietnamese with large sizes, such as UIT-ViQuAD and UIT-ViNewsQA. However, the datasets are not diverse in answers to serve the research. In this paper, we introduce UIT-ViWikiQA, the first dataset for evaluating sentence extraction-based machine reading comprehension in the Vietnamese language. The UIT-ViWikiQA dataset is converted from the UIT-ViQuAD dataset, consisting of comprises 23.074 question-answers based on 5.109 passages of 174 Wikipedia Vietnamese articles. We propose a conversion algorithm to create the dataset for sentence extraction-based machine reading comprehension and three types of approaches for sentence extraction-based machine reading comprehension in Vietnamese. Our experiments show that the best machine model is XLM-R_Large, which achieves an exact match (EM) of 85.97% and an F1-score of 88.77% on our dataset. Besides, we analyze experimental results in terms of the question type in Vietnamese and the effect of context on the performance of the MRC models, thereby showing the challenges from the UIT-ViWikiQA dataset that we propose to the language processing community.
△ Less
Submitted 11 June, 2021; v1 submitted 19 May, 2021;
originally announced May 2021.
-
Gender Prediction Based on Vietnamese Names with Machine Learning Techniques
Authors:
Huy Quoc To,
Kiet Van Nguyen,
Ngan Luu-Thuy Nguyen,
Anh Gia-Tuan Nguyen
Abstract:
As biological gender is one of the aspects of presenting individual human, much work has been done on gender classification based on people names. The proposals for English and Chinese languages are tremendous; still, there have been few works done for Vietnamese so far. We propose a new dataset for gender prediction based on Vietnamese names. This dataset comprises over 26,000 full names annotate…
▽ More
As biological gender is one of the aspects of presenting individual human, much work has been done on gender classification based on people names. The proposals for English and Chinese languages are tremendous; still, there have been few works done for Vietnamese so far. We propose a new dataset for gender prediction based on Vietnamese names. This dataset comprises over 26,000 full names annotated with genders. This dataset is available on our website for research purposes. In addition, this paper describes six machine learning algorithms (Support Vector Machine, Multinomial Naive Bayes, Bernoulli Naive Bayes, Decision Tree, Random Forrest and Logistic Regression) and a deep learning model (LSTM) with fastText word embedding for gender prediction on Vietnamese names. We create a dataset and investigate the impact of each name component on detecting gender. As a result, the best F1-score that we have achieved is up to 96% on LSTM model and we generate a web API based on our trained model.
△ Less
Submitted 23 March, 2021; v1 submitted 21 October, 2020;
originally announced October 2020.
-
A Vietnamese Dataset for Evaluating Machine Reading Comprehension
Authors:
Kiet Van Nguyen,
Duc-Vu Nguyen,
Anh Gia-Tuan Nguyen,
Ngan Luu-Thuy Nguyen
Abstract:
Over 97 million people speak Vietnamese as their native language in the world. However, there are few research studies on machine reading comprehension (MRC) for Vietnamese, the task of understanding a text and answering questions related to it. Due to the lack of benchmark datasets for Vietnamese, we present the Vietnamese Question Answering Dataset (UIT-ViQuAD), a new dataset for the low-resourc…
▽ More
Over 97 million people speak Vietnamese as their native language in the world. However, there are few research studies on machine reading comprehension (MRC) for Vietnamese, the task of understanding a text and answering questions related to it. Due to the lack of benchmark datasets for Vietnamese, we present the Vietnamese Question Answering Dataset (UIT-ViQuAD), a new dataset for the low-resource language as Vietnamese to evaluate MRC models. This dataset comprises over 23,000 human-generated question-answer pairs based on 5,109 passages of 174 Vietnamese articles from Wikipedia. In particular, we propose a new process of dataset creation for Vietnamese MRC. Our in-depth analyses illustrate that our dataset requires abilities beyond simple reasoning like word matching and demands single-sentence and multiple-sentence inferences. Besides, we conduct experiments on state-of-the-art MRC methods for English and Chinese as the first experimental models on UIT-ViQuAD. We also estimate human performance on the dataset and compare it to the experimental results of powerful machine learning models. As a result, the substantial differences between human performance and the best model performance on the dataset indicate that improvements can be made on UIT-ViQuAD in future research. Our dataset is freely available on our website to encourage the research community to overcome challenges in Vietnamese MRC.
△ Less
Submitted 7 November, 2020; v1 submitted 30 September, 2020;
originally announced September 2020.
-
An Experimental Study of Deep Neural Network Models for Vietnamese Multiple-Choice Reading Comprehension
Authors:
Son T. Luu,
Kiet Van Nguyen,
Anh Gia-Tuan Nguyen,
Ngan Luu-Thuy Nguyen
Abstract:
Machine reading comprehension (MRC) is a challenging task in natural language processing that makes computers understanding natural language texts and answer questions based on those texts. There are many techniques for solving this problems, and word representation is a very important technique that impact most to the accuracy of machine reading comprehension problem in the popular languages like…
▽ More
Machine reading comprehension (MRC) is a challenging task in natural language processing that makes computers understanding natural language texts and answer questions based on those texts. There are many techniques for solving this problems, and word representation is a very important technique that impact most to the accuracy of machine reading comprehension problem in the popular languages like English and Chinese. However, few studies on MRC have been conducted in low-resource languages such as Vietnamese. In this paper, we conduct several experiments on neural network-based model to understand the impact of word representation to the Vietnamese multiple-choice machine reading comprehension. Our experiments include using the Co-match model on six different Vietnamese word embeddings and the BERT model for multiple-choice reading comprehension. On the ViMMRC corpus, the accuracy of BERT model is 61.28% on test set.
△ Less
Submitted 18 February, 2021; v1 submitted 20 August, 2020;
originally announced August 2020.
-
New Vietnamese Corpus for Machine Reading Comprehension of Health News Articles
Authors:
Kiet Van Nguyen,
Tin Van Huynh,
Duc-Vu Nguyen,
Anh Gia-Tuan Nguyen,
Ngan Luu-Thuy Nguyen
Abstract:
Large-scale and high-quality corpora are necessary for evaluating machine reading comprehension models on a low-resource language like Vietnamese. Besides, machine reading comprehension (MRC) for the health domain offers great potential for practical applications; however, there is still very little MRC research in this domain. This paper presents ViNewsQA as a new corpus for the Vietnamese langua…
▽ More
Large-scale and high-quality corpora are necessary for evaluating machine reading comprehension models on a low-resource language like Vietnamese. Besides, machine reading comprehension (MRC) for the health domain offers great potential for practical applications; however, there is still very little MRC research in this domain. This paper presents ViNewsQA as a new corpus for the Vietnamese language to evaluate healthcare reading comprehension models. The corpus comprises 22,057 human-generated question-answer pairs. Crowd-workers create the questions and their answers based on a collection of over 4,416 online Vietnamese healthcare news articles, where the answers comprise spans extracted from the corresponding articles. In particular, we develop a process of creating a corpus for the Vietnamese machine reading comprehension. Comprehensive evaluations demonstrate that our corpus requires abilities beyond simple reasoning, such as word matching and demanding difficult reasoning based on single-or-multiple-sentence information. We conduct experiments using different types of machine reading comprehension methods to achieve the first baseline performances, compared with further models' performances. We also measure human performance on the corpus and compared it with several powerful neural network-based and transfer learning-based models. Our experiments show that the best machine model is ALBERT, which achieves an exact match score of 65.26% and an F1-score of 84.89% on our corpus. The significant differences between humans and the best-performance model (14.53% of EM and 10.90% of F1-score) on the test set of our corpus indicate that improvements in ViNewsQA could be explored in the future study. Our corpus is publicly available on our website for the research purpose to encourage the research community to make these improvements.
△ Less
Submitted 11 February, 2021; v1 submitted 19 June, 2020;
originally announced June 2020.
-
Enhancing lexical-based approach with external knowledge for Vietnamese multiple-choice machine reading comprehension
Authors:
Kiet Van Nguyen,
Khiem Vinh Tran,
Son T. Luu,
Anh Gia-Tuan Nguyen,
Ngan Luu-Thuy Nguyen
Abstract:
Although Vietnamese is the 17th most popular native-speaker language in the world, there are not many research studies on Vietnamese machine reading comprehension (MRC), the task of understanding a text and answering questions about it. One of the reasons is because of the lack of high-quality benchmark datasets for this task. In this work, we construct a dataset which consists of 2,783 pairs of m…
▽ More
Although Vietnamese is the 17th most popular native-speaker language in the world, there are not many research studies on Vietnamese machine reading comprehension (MRC), the task of understanding a text and answering questions about it. One of the reasons is because of the lack of high-quality benchmark datasets for this task. In this work, we construct a dataset which consists of 2,783 pairs of multiple-choice questions and answers based on 417 Vietnamese texts which are commonly used for teaching reading comprehension for elementary school pupils. In addition, we propose a lexical-based MRC method that utilizes semantic similarity measures and external knowledge sources to analyze questions and extract answers from the given text. We compare the performance of the proposed model with several baseline lexical-based and neural network-based models. Our proposed method achieves 61.81% by accuracy, which is 5.51% higher than the best baseline model. We also measure human performance on our dataset and find that there is a big gap between machine-model and human performances. This indicates that significant progress can be made on this task. The dataset is freely available on our website for research purposes.
△ Less
Submitted 1 November, 2020; v1 submitted 16 January, 2020;
originally announced January 2020.
-
Job Prediction: From Deep Neural Network Models to Applications
Authors:
Tin Van Huynh,
Kiet Van Nguyen,
Ngan Luu-Thuy Nguyen,
Anh Gia-Tuan Nguyen
Abstract:
Determining the job is suitable for a student or a person looking for work based on their job's descriptions such as knowledge and skills that are difficult, as well as how employers must find ways to choose the candidates that match the job they require. In this paper, we focus on studying the job prediction using different deep neural network models including TextCNN, Bi-GRU-LSTM-CNN, and Bi-GRU…
▽ More
Determining the job is suitable for a student or a person looking for work based on their job's descriptions such as knowledge and skills that are difficult, as well as how employers must find ways to choose the candidates that match the job they require. In this paper, we focus on studying the job prediction using different deep neural network models including TextCNN, Bi-GRU-LSTM-CNN, and Bi-GRU-CNN with various pre-trained word embeddings on the IT Job dataset. In addition, we also proposed a simple and effective ensemble model combining different deep neural network models. The experimental results illustrated that our proposed ensemble model achieved the highest result with an F1 score of 72.71%. Moreover, we analyze these experimental results to have insights about this problem to find better solutions in the future.
△ Less
Submitted 31 January, 2020; v1 submitted 27 December, 2019;
originally announced December 2019.
-
Hate Speech Detection on Vietnamese Social Media Text using the Bidirectional-LSTM Model
Authors:
Hang Thi-Thuy Do,
Huy Duc Huynh,
Kiet Van Nguyen,
Ngan Luu-Thuy Nguyen,
Anh Gia-Tuan Nguyen
Abstract:
In this paper, we describe our system which participates in the shared task of Hate Speech Detection on Social Networks of VLSP 2019 evaluation campaign. We are provided with the pre-labeled dataset and an unlabeled dataset for social media comments or posts. Our mission is to pre-process and build machine learning models to classify comments/posts. In this report, we use Bidirectional Long Short-…
▽ More
In this paper, we describe our system which participates in the shared task of Hate Speech Detection on Social Networks of VLSP 2019 evaluation campaign. We are provided with the pre-labeled dataset and an unlabeled dataset for social media comments or posts. Our mission is to pre-process and build machine learning models to classify comments/posts. In this report, we use Bidirectional Long Short-Term Memory to build the model that can predict labels for social media text according to Clean, Offensive, Hate. With this system, we achieve comparative results with 71.43% on the public standard test set of VLSP 2019.
△ Less
Submitted 9 November, 2019;
originally announced November 2019.
-
Hate Speech Detection on Vietnamese Social Media Text using the Bi-GRU-LSTM-CNN Model
Authors:
Tin Van Huynh,
Vu Duc Nguyen,
Kiet Van Nguyen,
Ngan Luu-Thuy Nguyen,
Anh Gia-Tuan Nguyen
Abstract:
In recent years, Hate Speech Detection has become one of the interesting fields in natural language processing or computational linguistics. In this paper, we present the description of our system to solve this problem at the VLSP shared task 2019: Hate Speech Detection on Social Networks with the corpus which contains 20,345 human-labeled comments/posts for training and 5,086 for public-testing.…
▽ More
In recent years, Hate Speech Detection has become one of the interesting fields in natural language processing or computational linguistics. In this paper, we present the description of our system to solve this problem at the VLSP shared task 2019: Hate Speech Detection on Social Networks with the corpus which contains 20,345 human-labeled comments/posts for training and 5,086 for public-testing. We implement a deep learning method based on the Bi-GRU-LSTM-CNN classifier into this task. Our result in this task is 70.576% of F1-score, ranking the 5th of performance on public-test set.
△ Less
Submitted 21 December, 2019; v1 submitted 9 November, 2019;
originally announced November 2019.