Search | arXiv e-print repository

doi 10.1007/978-3-031-21743-2_48

Detecting Spam Reviews on Vietnamese E-commerce Websites

Authors: Co Van Dinh, Son T. Luu, Anh Gia-Tuan Nguyen

Abstract: The reviews of customers play an essential role in online shopping. People often refer to reviews or comments of previous customers to decide whether to buy a new product. Catching up with this behavior, some people create untruthful and illegitimate reviews to hoax customers about the fake quality of products. These are called spam reviews; they confuse consumers on online shopping platforms and negatively affect online shopping behaviors. We propose the dataset called ViSpamReviews, which has a strict annotation procedure for detecting spam reviews on e-commerce platforms. Our dataset consists of two tasks: the binary classification task for detecting whether a review is spam or not and the multi-class classification task for identifying the type of spam. The PhoBERT obtained the highest results on both tasks, 86.89% and 72.17%, respectively, by macro average F1 score.

Submitted 8 December, 2022; v1 submitted 27 July, 2022; originally announced July 2022.

Comments: Published at The 14th Asian Conference on Intelligent Information and Database Systems (ACIIDS 2022). The dataset is available at https://github.com/sonlam1102/vispamdetection
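
Note: the macro average F1 score quoted in the abstract above is the unweighted mean of the per-class F1 scores. The following is a minimal sketch of that metric in plain Python, not the authors' evaluation code; the function name `macro_f1` and the toy labels are assumptions made for illustration.

```python
def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 over all labels seen in either list."""
    labels = sorted(set(y_true) | set(y_pred))
    scores = []
    for label in labels:
        # Count true positives, false positives and false negatives for this class.
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != label and p == label)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == label and p != label)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        if precision + recall:
            scores.append(2 * precision * recall / (precision + recall))
        else:
            scores.append(0.0)
    return sum(scores) / len(scores)


# Toy example (hypothetical labels, not taken from ViSpamReviews).
gold = ["spam", "spam", "ham", "ham"]
pred = ["spam", "ham", "ham", "ham"]
print(round(macro_f1(gold, pred), 4))
```

Because every class contributes equally, a rare spam type weighs as much as the majority class, which is one plausible reason to report this metric on an imbalanced multi-class task.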

arXiv:2204.07002 [pdf, other]

XLMRQA: Open-Domain Question Answering on Vietnamese Wikipedia-based Textual Knowledge Source

Authors: Kiet Van Nguyen, Phong Nguyen-Thuan Do, Nhat Duy Nguyen, Tin Van Huynh, Anh Gia-Tuan Nguyen, Ngan Luu-Thuy Nguyen

Abstract: Question answering (QA) is a natural language understanding task within the fields of information retrieval and information extraction that has attracted much attention from the computational linguistics and artificial intelligence research community in recent years because of the strong development of machine reading comprehension-based models. A reader-based QA system is a high-level search engine that can find correct answers to queries or questions in open-domain or domain-specific texts using machine reading comprehension (MRC) techniques. The majority of advancements in data resources and machine-learning approaches in the MRC and QA systems especially are developed significantly in two resource-rich languages such as English and Chinese. A low-resource language like Vietnamese has witnessed a scarcity of research on QA systems. This paper presents XLMRQA, the first Vietnamese QA system using a supervised transformer-based reader on the Wikipedia-based textual knowledge source (using the UIT-ViQuAD corpus), outperforming the two robust QA systems using deep neural network models: DrQA and BERTserini with 24.46% and 6.28%, respectively. From the results obtained on the three systems, we analyze the influence of question types on the performance of the QA systems.

Submitted 13 August, 2022; v1 submitted 14 April, 2022; originally announced April 2022.

Comments: Accepted by ACIIDS 2022

arXiv:2111.00707 [pdf, other]

doi 10.1016/j.jisa.2021.103080

B-DAC: A Decentralized Access Control Framework on Northbound Interface for Securing SDN Using Blockchain

Authors: Phan The Duy, Hien Do Hoang, Do Thi Thu Hien, Anh Gia-Tuan Nguyen, Van-Hau Pham

Abstract: Software-Defined Network (SDN) is a new arising terminology of network architecture with outstanding features of orchestration by decoupling the control plane and the data plane in each network element. Even though it brings several benefits, SDN is vulnerable to a diversity of attacks. Abusing the single point of failure in the SDN controller component, hackers can shut down all network operations. More specifically, a malicious OpenFlow application can access the SDN controller to carry out harmful actions without any limitation owing to the lack of the access control mechanism as a standard in the Northbound. The sensitive information about the whole network such as network topology, flow information, and statistics can be gathered and leaked out. Even worse, the entire network can be taken over by the compromised controller. Hence, it is vital to build a scheme of access control for SDN's Northbound. Furthermore, it must also protect the data integrity and availability during data exchange between application and controller. To address such limitations, we introduce B-DAC, a blockchain-based framework for decentralized authentication and fine-grained access control for the Northbound interface to assist administrators in managing and protecting critical resources. With strict policy enforcement, B-DAC can perform decentralized access control for each request to keep network applications under surveillance for preventing over-privileged activities or security policy conflicts. To demonstrate the feasibility of our approach, we also implement a prototype of this framework to evaluate the security impact, effectiveness, and performance through typical use cases.

Submitted 1 November, 2021; originally announced November 2021.

Comments: 23 pages, 14 figures, 14 tables

Report number: Volume 64, February 2022

Journal ref: Journal of Information Security and Applications, 2022

arXiv:2108.13741 [pdf, other]

Monolingual versus Multilingual BERTology for Vietnamese Extractive Multi-Document Summarization

Authors: Huy Quoc To, Kiet Van Nguyen, Ngan Luu-Thuy Nguyen, Anh Gia-Tuan Nguyen

Abstract: Recent research has demonstrated that BERT shows potential in a wide range of natural language processing tasks. It is adopted as an encoder for many state-of-the-art automatic summarizing systems, which achieve excellent performance. However, so far, there is not much work done for Vietnamese. In this paper, we showcase how BERT can be implemented for extractive multi-document text summarization in Vietnamese. We introduce a novel comparison between different multilingual and monolingual BERT models. The experiment results indicate that monolingual models produce promising results compared to other multilingual models and previous text summarizing models for Vietnamese.

Submitted 16 October, 2021; v1 submitted 31 August, 2021; originally announced August 2021.

arXiv:2105.09043 [pdf, other]

Sentence Extraction-Based Machine Reading Comprehension for Vietnamese

Authors: Phong Nguyen-Thuan Do, Nhat Duy Nguyen, Tin Van Huynh, Kiet Van Nguyen, Anh Gia-Tuan Nguyen, Ngan Luu-Thuy Nguyen

Abstract: The development of natural language processing (NLP) in general and machine reading comprehension in particular has attracted the great attention of the research community. In recent years, there are a few datasets for machine reading comprehension tasks in Vietnamese with large sizes, such as UIT-ViQuAD and UIT-ViNewsQA. However, the datasets are not diverse in answers to serve the research. In this paper, we introduce UIT-ViWikiQA, the first dataset for evaluating sentence extraction-based machine reading comprehension in the Vietnamese language. The UIT-ViWikiQA dataset is converted from the UIT-ViQuAD dataset and comprises 23,074 question-answer pairs based on 5,109 passages of 174 Vietnamese Wikipedia articles. We propose a conversion algorithm to create the dataset for sentence extraction-based machine reading comprehension and three types of approaches for sentence extraction-based machine reading comprehension in Vietnamese. Our experiments show that the best machine model is XLM-R_Large, which achieves an exact match (EM) of 85.97% and an F1-score of 88.77% on our dataset. Besides, we analyze experimental results in terms of the question type in Vietnamese and the effect of context on the performance of the MRC models, thereby showing the challenges from the UIT-ViWikiQA dataset that we propose to the language processing community.

Submitted 11 June, 2021; v1 submitted 19 May, 2021; originally announced May 2021.

Comments: Accepted by KSEM 2021 (International Conference on Knowledge Science, Engineering and Management)
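
Note: exact match (EM) and F1, as reported in the abstract above, are commonly computed per question by comparing the predicted answer string with the reference answer. The sketch below follows the widely used SQuAD-style definitions and is not the authors' evaluation script; the lowercasing and whitespace tokenization are simplifying assumptions, and the function names are hypothetical.

```python
from collections import Counter


def exact_match(prediction, reference):
    """1.0 if the two answers are identical after trimming and lowercasing, else 0.0."""
    return float(prediction.strip().lower() == reference.strip().lower())


def token_f1(prediction, reference):
    """Harmonic mean of token-level precision and recall between two answers."""
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    # Multiset intersection counts each shared token at most as often as it appears in both.
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


# Toy example (hypothetical answers, not taken from UIT-ViWikiQA).
print(exact_match("thành phố Hồ Chí Minh", "Hồ Chí Minh"))
print(token_f1("thành phố Hồ Chí Minh", "Hồ Chí Minh"))
```

Dataset-level scores are then usually the mean of these per-question values, which is why F1 is never lower than EM: a partially overlapping answer earns partial F1 credit but no EM credit.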

arXiv:2010.10852 [pdf]

doi 10.1145/3443279.3443309

Gender Prediction Based on Vietnamese Names with Machine Learning Techniques

Authors: Huy Quoc To, Kiet Van Nguyen, Ngan Luu-Thuy Nguyen, Anh Gia-Tuan Nguyen

Abstract: As biological gender is one of the aspects of presenting an individual human, much work has been done on gender classification based on people's names. The proposals for English and Chinese languages are tremendous; still, there have been few works done for Vietnamese so far. We propose a new dataset for gender prediction based on Vietnamese names. This dataset comprises over 26,000 full names annotated with genders. This dataset is available on our website for research purposes. In addition, this paper describes six machine learning algorithms (Support Vector Machine, Multinomial Naive Bayes, Bernoulli Naive Bayes, Decision Tree, Random Forest and Logistic Regression) and a deep learning model (LSTM) with fastText word embedding for gender prediction on Vietnamese names. We create a dataset and investigate the impact of each name component on detecting gender. As a result, the best F1-score that we have achieved is up to 96% on the LSTM model, and we generate a web API based on our trained model.

Submitted 23 March, 2021; v1 submitted 21 October, 2020; originally announced October 2020.

Comments: 6 pages, 6 figures. NLPIR 2020: 4th International Conference on Natural Language Processing and Information Retrieval

arXiv:2009.14725 [pdf, other]

A Vietnamese Dataset for Evaluating Machine Reading Comprehension

Authors: Kiet Van Nguyen, Duc-Vu Nguyen, Anh Gia-Tuan Nguyen, Ngan Luu-Thuy Nguyen

Abstract: Over 97 million people speak Vietnamese as their native language in the world. However, there are few research studies on machine reading comprehension (MRC) for Vietnamese, the task of understanding a text and answering questions related to it. Due to the lack of benchmark datasets for Vietnamese, we present the Vietnamese Question Answering Dataset (UIT-ViQuAD), a new dataset for a low-resource language such as Vietnamese to evaluate MRC models. This dataset comprises over 23,000 human-generated question-answer pairs based on 5,109 passages of 174 Vietnamese articles from Wikipedia. In particular, we propose a new process of dataset creation for Vietnamese MRC. Our in-depth analyses illustrate that our dataset requires abilities beyond simple reasoning like word matching and demands single-sentence and multiple-sentence inferences. Besides, we conduct experiments on state-of-the-art MRC methods for English and Chinese as the first experimental models on UIT-ViQuAD. We also estimate human performance on the dataset and compare it to the experimental results of powerful machine learning models. As a result, the substantial differences between human performance and the best model performance on the dataset indicate that improvements can be made on UIT-ViQuAD in future research. Our dataset is freely available on our website to encourage the research community to overcome challenges in Vietnamese MRC.

Submitted 7 November, 2020; v1 submitted 30 September, 2020; originally announced September 2020.

Comments: Accepted by The 28th International Conference on Computational Linguistics (COLING 2020)

arXiv:2008.08810 [pdf, ps, other]

doi 10.1109/ICCE48956.2021.9352127

An Experimental Study of Deep Neural Network Models for Vietnamese Multiple-Choice Reading Comprehension

Authors: Son T. Luu, Kiet Van Nguyen, Anh Gia-Tuan Nguyen, Ngan Luu-Thuy Nguyen

Abstract: Machine reading comprehension (MRC) is a challenging task in natural language processing that requires computers to understand natural language texts and answer questions based on those texts. There are many techniques for solving this problem, and word representation is a very important technique that most affects the accuracy of machine reading comprehension in popular languages like English and Chinese. However, few studies on MRC have been conducted in low-resource languages such as Vietnamese. In this paper, we conduct several experiments on neural network-based models to understand the impact of word representation on Vietnamese multiple-choice machine reading comprehension. Our experiments include using the Co-match model on six different Vietnamese word embeddings and the BERT model for multiple-choice reading comprehension. On the ViMMRC corpus, the accuracy of the BERT model is 61.28% on the test set.

Submitted 18 February, 2021; v1 submitted 20 August, 2020; originally announced August 2020.

Comments: Published in the 2020 IEEE Eighth International Conference on Communications and Electronics (ICCE)

arXiv:2006.11138 [pdf, other]

New Vietnamese Corpus for Machine Reading Comprehension of Health News Articles

Authors: Kiet Van Nguyen, Tin Van Huynh, Duc-Vu Nguyen, Anh Gia-Tuan Nguyen, Ngan Luu-Thuy Nguyen

Abstract: Large-scale and high-quality corpora are necessary for evaluating machine reading comprehension models on a low-resource language like Vietnamese. Besides, machine reading comprehension (MRC) for the health domain offers great potential for practical applications; however, there is still very little MRC research in this domain. This paper presents ViNewsQA as a new corpus for the Vietnamese language to evaluate healthcare reading comprehension models. The corpus comprises 22,057 human-generated question-answer pairs. Crowd-workers create the questions and their answers based on a collection of over 4,416 online Vietnamese healthcare news articles, where the answers comprise spans extracted from the corresponding articles. In particular, we develop a process of creating a corpus for the Vietnamese machine reading comprehension. Comprehensive evaluations demonstrate that our corpus requires abilities beyond simple reasoning, such as word matching and demanding difficult reasoning based on single-or-multiple-sentence information. We conduct experiments using different types of machine reading comprehension methods to achieve the first baseline performances, compared with further models' performances. We also measure human performance on the corpus and compare it with several powerful neural network-based and transfer learning-based models. Our experiments show that the best machine model is ALBERT, which achieves an exact match score of 65.26% and an F1-score of 84.89% on our corpus. The significant differences between humans and the best-performance model (14.53% of EM and 10.90% of F1-score) on the test set of our corpus indicate that improvements in ViNewsQA could be explored in future studies. Our corpus is publicly available on our website for research purposes to encourage the research community to make these improvements.

Submitted 11 February, 2021; v1 submitted 19 June, 2020; originally announced June 2020.

arXiv:2001.05687 [pdf, other]

doi 10.1109/ACCESS.2020.3035701

Enhancing lexical-based approach with external knowledge for Vietnamese multiple-choice machine reading comprehension

Authors: Kiet Van Nguyen, Khiem Vinh Tran, Son T. Luu, Anh Gia-Tuan Nguyen, Ngan Luu-Thuy Nguyen

Abstract: Although Vietnamese is the 17th most popular native-speaker language in the world, there are not many research studies on Vietnamese machine reading comprehension (MRC), the task of understanding a text and answering questions about it. One of the reasons is the lack of high-quality benchmark datasets for this task. In this work, we construct a dataset which consists of 2,783 pairs of multiple-choice questions and answers based on 417 Vietnamese texts which are commonly used for teaching reading comprehension for elementary school pupils. In addition, we propose a lexical-based MRC method that utilizes semantic similarity measures and external knowledge sources to analyze questions and extract answers from the given text. We compare the performance of the proposed model with several baseline lexical-based and neural network-based models. Our proposed method achieves 61.81% by accuracy, which is 5.51% higher than the best baseline model. We also measure human performance on our dataset and find that there is a big gap between machine-model and human performances. This indicates that significant progress can be made on this task. The dataset is freely available on our website for research purposes.

Submitted 1 November, 2020; v1 submitted 16 January, 2020; originally announced January 2020.

Journal ref: IEEE Access, 2020

arXiv:1912.12214 [pdf, other]

Job Prediction: From Deep Neural Network Models to Applications

Authors: Tin Van Huynh, Kiet Van Nguyen, Ngan Luu-Thuy Nguyen, Anh Gia-Tuan Nguyen

Abstract: Determining which job is suitable for a student or a person looking for work, based on descriptions such as their knowledge and skills, is difficult, as is the way employers must find to choose the candidates that match the job they require. In this paper, we focus on studying job prediction using different deep neural network models including TextCNN, Bi-GRU-LSTM-CNN, and Bi-GRU-CNN with various pre-trained word embeddings on the IT Job dataset. In addition, we also propose a simple and effective ensemble model combining different deep neural network models. The experimental results illustrate that our proposed ensemble model achieves the highest result with an F1 score of 72.71%. Moreover, we analyze these experimental results to gain insights about this problem and find better solutions in the future.

Submitted 31 January, 2020; v1 submitted 27 December, 2019; originally announced December 2019.

Comments: Accepted by IEEE RIVF 2020 Conference

arXiv:1911.03648 [pdf, other]

Hate Speech Detection on Vietnamese Social Media Text using the Bidirectional-LSTM Model

Authors: Hang Thi-Thuy Do, Huy Duc Huynh, Kiet Van Nguyen, Ngan Luu-Thuy Nguyen, Anh Gia-Tuan Nguyen

Abstract: In this paper, we describe our system which participates in the shared task of Hate Speech Detection on Social Networks of VLSP 2019 evaluation campaign. We are provided with the pre-labeled dataset and an unlabeled dataset for social media comments or posts. Our mission is to pre-process and build machine learning models to classify comments/posts. In this report, we use Bidirectional Long Short-Term Memory to build the model that can predict labels for social media text according to Clean, Offensive, Hate. With this system, we achieve comparative results with 71.43% on the public standard test set of VLSP 2019.

Submitted 9 November, 2019; originally announced November 2019.

Journal ref: VLSP Workshop 2019

arXiv:1911.03644 [pdf, other]

Hate Speech Detection on Vietnamese Social Media Text using the Bi-GRU-LSTM-CNN Model

Authors: Tin Van Huynh, Vu Duc Nguyen, Kiet Van Nguyen, Ngan Luu-Thuy Nguyen, Anh Gia-Tuan Nguyen

Abstract: In recent years, Hate Speech Detection has become one of the interesting fields in natural language processing or computational linguistics. In this paper, we present the description of our system to solve this problem at the VLSP shared task 2019: Hate Speech Detection on Social Networks with the corpus which contains 20,345 human-labeled comments/posts for training and 5,086 for public-testing. We implement a deep learning method based on the Bi-GRU-LSTM-CNN classifier for this task. Our result in this task is an F1-score of 70.576%, ranking 5th on the public test set.

Submitted 21 December, 2019; v1 submitted 9 November, 2019; originally announced November 2019.

Comments: Technical Report, VLSP Workshop 2019

Journal ref: VLSP Workshop 2019

Showing 1–13 of 13 results for author: Nguyen, A G