-
Cross-lingual Extended Named Entity Classification of Wikipedia Articles
Authors:
The Viet Bui,
Phuong Le-Hong
Abstract:
The FPT.AI team participated in the SHINRA2020-ML subtask of the NTCIR-15 SHINRA task. This paper describes our approach to solving the problem and discusses the official results. Our method focuses on learning cross-lingual representations, at both the word level and the document level, for page classification. We propose a three-stage approach comprising multilingual model pre-training, monolingual model fine-tuning and cross-lingual voting. Our system achieves the best scores for 25 out of 30 languages, and its accuracy gaps to the best-performing systems for the other five languages are relatively small.
Submitted 17 October, 2020; v1 submitted 7 October, 2020;
originally announced October 2020.
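The cross-lingual voting stage named in the abstract above can be illustrated with a plain majority vote over the labels that per-language classifiers assign to the same entity. This is a minimal sketch under that assumption, not the authors' implementation: the function name `majority_vote`, the label strings and the tie-breaking rule are all hypothetical.

```python
from collections import Counter


def majority_vote(predictions):
    """Return the most frequent label among per-language predictions.

    Ties are broken in favour of the label that appears first in the input
    (an arbitrary choice made for this sketch).
    """
    if not predictions:
        raise ValueError("need at least one prediction")
    counts = Counter(predictions)
    best = max(counts.values())
    # Scan in input order so that the tie-break is deterministic.
    for label in predictions:
        if counts[label] == best:
            return label


# Hypothetical labels predicted for one entity from three language editions.
print(majority_vote(["Person", "City", "Person"]))
```

A weighted vote (for example, weighting each language by its validation accuracy) would be a natural variation, but the abstract does not say which scheme was used.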
-
Improving Sequence Tagging for Vietnamese Text Using Transformer-based Neural Models
Authors:
Viet Bui The,
Oanh Tran Thi,
Phuong Le-Hong
Abstract:
This paper describes our study on using multilingual BERT embeddings and some new neural models for improving sequence tagging tasks for the Vietnamese language. We propose new model architectures and evaluate them extensively on two named entity recognition datasets, VLSP 2016 and VLSP 2018, and on two part-of-speech tagging datasets, VLSP 2010 and VLSP 2013. Our proposed models outperform existing methods and achieve new state-of-the-art results. In particular, we have pushed the accuracy of part-of-speech tagging to 95.40% on the VLSP 2010 corpus and to 96.77% on the VLSP 2013 corpus, and the F1 score of named entity recognition to 94.07% on the VLSP 2016 corpus and to 90.31% on the VLSP 2018 corpus. Our code and pre-trained models viBERT and vELECTRA are released as open source to facilitate adoption and further research.
Submitted 25 September, 2020; v1 submitted 29 June, 2020;
originally announced June 2020.
-
Towards Task-Oriented Dialogue in Mixed Domains
Authors:
Tho Luong Chi,
Phuong Le-Hong
Abstract:
This work investigates the task-oriented dialogue problem in mixed-domain settings. We study the effect of alternating between different domains in sequences of dialogue turns using two related state-of-the-art dialogue systems. We first show that a specialized state tracking component in multiple domains plays an important role and gives better results than an end-to-end task-oriented dialogue system. We then propose a hybrid system which improves belief tracking accuracy by about 28 absolute percentage points on average on a standard multi-domain dialogue dataset. These experimental results give some useful insights for improving our commercial chatbot platform FPT.AI, which is currently deployed for many practical chatbot applications.
Submitted 5 September, 2019;
originally announced September 2019.
-
A Comparative Study of Neural Network Models for Sentence Classification
Authors:
Phuong Le-Hong,
Anh-Cuong Le
Abstract:
This paper presents an extensive comparative study of four neural network models, including feed-forward networks, convolutional networks, recurrent networks and long short-term memory networks, on two sentence classification datasets of English and Vietnamese text. We show that on the English dataset, the convolutional network models without any feature engineering outperform some competitive sentence classifiers with rich hand-crafted linguistic features. We demonstrate that the GloVe word embeddings are consistently better than both Skip-gram word embeddings and word count vectors. We also show the superiority of convolutional neural network models on a Vietnamese newspaper sentence dataset over strong baseline models. Our experimental results suggest some good practices for applying neural network models in sentence classification.
Submitted 3 October, 2018;
originally announced October 2018.
-
A Factoid Question Answering System for Vietnamese
Authors:
Phuong Le-Hong,
Duc-Thien Bui
Abstract:
In this paper, we describe the development of an end-to-end factoid question answering system for the Vietnamese language. This system combines both statistical models and ontology-based methods in a chain of processing modules to provide high-quality mappings from natural language text to entities. We present the challenges in the development of such an intelligent user interface for an isolating language like Vietnamese and show that techniques developed for inflectional languages cannot be applied "as is". Our question answering system can answer a wide range of general knowledge questions with promising accuracy on a test set.
Submitted 28 March, 2018; v1 submitted 1 March, 2018;
originally announced March 2018.
-
Vietnamese Semantic Role Labelling
Authors:
Phuong Le-Hong,
Thai Hoang Pham,
Xuan Khoai Pham,
Thi Minh Huyen Nguyen,
Thi Luong Nguyen,
Minh Hiep Nguyen
Abstract:
In this paper, we study semantic role labelling (SRL), a subtask of semantic parsing of natural language sentences, and its application to the Vietnamese language. We present our effort in building Vietnamese PropBank, the first Vietnamese SRL corpus, and a software system for labelling semantic roles of Vietnamese texts. In particular, we present a novel constituent extraction algorithm in the argument candidate identification step which is more suitable and more accurate than the common node-mapping method. In the machine learning part, our system integrates distributed word features produced by two recent unsupervised learning models in two learned statistical classifiers and makes use of an integer linear programming inference procedure to improve the accuracy. The system is evaluated in a series of experiments and achieves a good result, an $F_1$ score of 74.77%. Our system, including corpus and software, is freely available as an open source project for research, and we believe that it is a good baseline for the development of future Vietnamese SRL systems.
Submitted 27 November, 2017;
originally announced November 2017.
-
On the Use of Machine Translation-Based Approaches for Vietnamese Diacritic Restoration
Authors:
Thai-Hoang Pham,
Xuan-Khoai Pham,
Phuong Le-Hong
Abstract:
This paper presents an empirical study of two machine translation-based approaches to the Vietnamese diacritic restoration problem: phrase-based and neural-based machine translation models. This is the first work that applies a neural-based machine translation method to this problem and gives a thorough comparison with the phrase-based machine translation method, which is the current state-of-the-art method for this problem. On a large dataset, the phrase-based approach has an accuracy of 97.32% while that of the neural-based approach is 96.15%. While the neural-based method has a slightly lower accuracy, it is about twice as fast as the phrase-based method in terms of inference speed. Moreover, the neural-based machine translation method has much room for future improvement, such as incorporating pre-trained word embeddings and collecting more training data.
Submitted 26 October, 2017; v1 submitted 20 September, 2017;
originally announced September 2017.
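Treating diacritic restoration as translation needs parallel text: an input side without diacritics and an output side with them. One common way to obtain the input side, assumed here and not stated in the abstract, is to strip diacritics from ordinary Vietnamese text. The sketch below does this with Unicode decomposition; the helper name `strip_diacritics` is hypothetical.

```python
import unicodedata


def strip_diacritics(text):
    """Remove Vietnamese tone and vowel marks, keeping the base letters."""
    # NFD splits each accented letter into a base letter plus combining marks.
    decomposed = unicodedata.normalize("NFD", text)
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    # The letters đ/Đ have no canonical decomposition, so map them explicitly.
    return stripped.replace("đ", "d").replace("Đ", "D")


target = "Tiếng Việt"              # output side of a training pair
source = strip_diacritics(target)  # input side, without diacritics
print(source, "->", target)
```

Each pair `(source, target)` produced this way could then be fed to either a phrase-based or a neural translation model.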
-
An Empirical Study of Discriminative Sequence Labeling Models for Vietnamese Text Processing
Authors:
Phuong Le-Hong,
Minh Pham Quang Nhat,
Thai-Hoang Pham,
Tuan-Anh Tran,
Dang-Minh Nguyen
Abstract:
This paper presents an empirical study of two widely-used sequence prediction models, Conditional Random Fields (CRFs) and Long Short-Term Memory Networks (LSTMs), on two fundamental tasks for Vietnamese text processing: part-of-speech tagging and named entity recognition. We show that a strong lower bound for labeling accuracy can be obtained by relying only on simple word-based features with minimal hand-crafted feature engineering, with performance scores of 90.65% and 86.03% on the standard test sets for the two tasks respectively. In particular, we demonstrate empirically the surprising efficiency of word embeddings in both tasks and with both models. We point out that the state-of-the-art LSTM model does not always significantly outperform the traditional CRF model, especially on moderate-sized data sets. Finally, we give some suggestions and discussions for efficient use of sequence labeling models in practical applications.
Submitted 30 August, 2017;
originally announced August 2017.
-
NNVLP: A Neural Network-Based Vietnamese Language Processing Toolkit
Authors:
Thai-Hoang Pham,
Xuan-Khoai Pham,
Tuan-Anh Nguyen,
Phuong Le-Hong
Abstract:
This paper demonstrates NNVLP, a neural network-based toolkit for essential Vietnamese language processing tasks including part-of-speech (POS) tagging, chunking and named entity recognition (NER). Our toolkit is a combination of bidirectional Long Short-Term Memory (Bi-LSTM), Convolutional Neural Network (CNN) and Conditional Random Field (CRF) models, using pre-trained word embeddings as input, which achieves state-of-the-art results on these three tasks. We provide both an API and a web demo for this toolkit.
Submitted 19 October, 2017; v1 submitted 23 August, 2017;
originally announced August 2017.
-
The Importance of Automatic Syntactic Features in Vietnamese Named Entity Recognition
Authors:
Thai-Hoang Pham,
Phuong Le-Hong
Abstract:
This paper presents a state-of-the-art system for Vietnamese Named Entity Recognition (NER). By incorporating automatic syntactic features with word embeddings as input for bidirectional Long Short-Term Memory (Bi-LSTM), our system, although simpler than some deep learning architectures, achieves a much better result for Vietnamese NER. The proposed method achieves an overall F1 score of 92.05% on the test set of an evaluation campaign, organized in late 2016 by the Vietnamese Language and Speech Processing (VLSP) community. Our named entity recognition system outperforms the best previous systems for Vietnamese NER by a large margin.
Submitted 27 August, 2017; v1 submitted 29 May, 2017;
originally announced May 2017.
-
End-to-end Recurrent Neural Network Models for Vietnamese Named Entity Recognition: Word-level vs. Character-level
Authors:
Thai-Hoang Pham,
Phuong Le-Hong
Abstract:
This paper demonstrates end-to-end neural network architectures for Vietnamese named entity recognition. Our best model is a combination of bidirectional Long Short-Term Memory (Bi-LSTM), Convolutional Neural Network (CNN) and Conditional Random Field (CRF) models, using pre-trained word embeddings as input, which achieves an F1 score of 88.59% on a standard test set. Our system achieves performance comparable to that of the first-rank system of the VLSP campaign without using any syntactic or hand-crafted features. We also give an extensive empirical study on using common deep learning models for Vietnamese NER, at both word and character level.
Submitted 20 July, 2017; v1 submitted 11 May, 2017;
originally announced May 2017.
-
Building a Semantic Role Labelling System for Vietnamese
Authors:
Thai-Hoang Pham,
Xuan-Khoai Pham,
Phuong Le-Hong
Abstract:
Semantic role labelling (SRL) is a task in natural language processing which detects and classifies the semantic arguments associated with the predicates of a sentence. It is an important step towards understanding the meaning of a natural language. There exist SRL systems for well-studied languages like English, Chinese or Japanese, but there is no such system for the Vietnamese language. In this paper, we present the first SRL system for Vietnamese with encouraging accuracy. We first demonstrate that a simple application of SRL techniques developed for English does not give good accuracy for Vietnamese. We then introduce a new algorithm for extracting candidate syntactic constituents, which is much more accurate than the common node-mapping algorithm usually used in the identification step. Finally, in the classification step, in addition to the common linguistic features, we propose novel and useful features for use in SRL. Our SRL system achieves an $F_1$ score of 73.53% on the Vietnamese PropBank corpus. This system, including software and corpus, is available as an open source project and we believe that it is a good baseline for the development of future Vietnamese SRL systems.
Submitted 11 May, 2017;
originally announced May 2017.
-
Content-based Approach for Vietnamese Spam SMS Filtering
Authors:
Thai-Hoang Pham,
Phuong Le-Hong
Abstract:
Short Message Service (SMS) spam is a serious problem in Vietnam because of the availability of very cheap pre-paid SMS packages. There are some systems to detect and filter spam messages for English, most of which use machine learning techniques to analyze the content of messages and classify them. For Vietnamese, there is some research on spam email filtering but none focused on SMS. In this work, we propose the first system for filtering Vietnamese spam SMS. We first propose an appropriate preprocessing method since existing tools for Vietnamese preprocessing cannot give good accuracy on our dataset. We then experiment with vector representations and classifiers to find the best model for this problem. Our system achieves an accuracy of 94% when labelling spam messages while the misclassification rate of legitimate messages is relatively small, only about 0.4%. This is an encouraging result compared to that of English and can serve as a strong baseline for future development of Vietnamese SMS spam prevention systems.
Submitted 11 May, 2017;
originally announced May 2017.
-
Vietnamese Named Entity Recognition using Token Regular Expressions and Bidirectional Inference
Authors:
Phuong Le-Hong
Abstract:
This paper describes an efficient approach to improve the accuracy of a named entity recognition system for Vietnamese. The approach combines regular expressions over tokens and a bidirectional inference method in a sequence labelling model. The proposed method achieves an overall $F_1$ score of 89.66% on a test set of an evaluation campaign, organized in late 2016 by the Vietnamese Language and Speech Processing (VLSP) community.
Submitted 19 October, 2016; v1 submitted 18 October, 2016;
originally announced October 2016.
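Several of the abstracts above report $F_1$ scores for named entity recognition. As a reminder of how such a score is typically computed, here is a minimal sketch of entity-level $F_1$ over exact-match spans. It assumes each entity is a hypothetical `(start, end, type)` tuple and is not the official VLSP evaluation script.

```python
def f1_score(gold, predicted):
    """Entity-level F1: a prediction counts only if span and type both match."""
    gold, predicted = set(gold), set(predicted)
    true_positives = len(gold & predicted)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(predicted)
    recall = true_positives / len(gold)
    # F1 is the harmonic mean of precision and recall.
    return 2 * precision * recall / (precision + recall)


# Hypothetical spans: the second prediction has the right span but wrong type.
gold = [(0, 2, "PER"), (5, 6, "LOC")]
predicted = [(0, 2, "PER"), (5, 6, "ORG")]
print(f1_score(gold, predicted))
```

Because matching is exact, a prediction with correct boundaries but the wrong type counts as both a false positive and a false negative.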