Jointly Learning Span Extraction and Sequence Labeling for Information Extraction from Business Documents

Authors: Nguyen Hong Son, Hieu M. Vu, Tuan-Anh D. Nguyen, Minh-Tien Nguyen

Abstract: This paper introduces a new information extraction model for business documents. Unlike prior studies, which rely on either span extraction or sequence labeling alone, the model takes advantage of both. The combination allows the model to deal with long documents with sparse information (only a small amount of information to extract). The model is trained end-to-end to jointly optimize the two tasks in a unified manner. Experimental results on four business datasets in English and Japanese show that the model achieves promising results and is significantly faster than the normal span-based extraction method. The code is also available.

Submitted 26 May, 2022; originally announced May 2022.

Comments: Accepted to IJCNN 2022

arXiv:2204.03958 [pdf, other]

Enhance Incomplete Utterance Restoration by Joint Learning Token Extraction and Text Generation

Authors: Shumpei Inoue, Tsungwei Liu, Nguyen Hong Son, Minh-Tien Nguyen

Abstract: This paper introduces a model for incomplete utterance restoration (IUR) called JET (Joint learning token Extraction and Text generation). Different from prior studies that only work on extraction or abstraction datasets, we design a simple but effective model, working for both scenarios of IUR. Our design simulates the nature of IUR, where omitted tokens from the context contribute to restoration. From this, we construct a Picker that identifies the omitted tokens. To support the picker, we design two label creation methods (soft and hard labels), which can work in cases of no annotation data for the omitted tokens. The restoration is done by using a Generator with the help of the Picker on joint learning. Promising results on four benchmark datasets in extraction and abstraction scenarios show that our model is better than the pretrained T5 and non-generative language model methods in both rich and limited training data settings. The code is available at https://github.com/shumpei19/JET

Submitted 28 July, 2022; v1 submitted 8 April, 2022; originally announced April 2022.

Comments: This paper was accepted by 2022 Annual Conference of the North American Chapter of the Association for Computational Linguistics (NAACL 2022). It includes 10 pages, 2 figures

arXiv:2106.00978 [pdf, other]

A Span Extraction Approach for Information Extraction on Visually-Rich Documents

Authors: Tuan-Anh D. Nguyen, Hieu M. Vu, Nguyen Hong Son, Minh-Tien Nguyen

Abstract: Information extraction (IE) for visually-rich documents (VRDs) has achieved SOTA performance recently thanks to the adaptation of Transformer-based language models, which shows the great potential of pre-training methods. In this paper, we present a new approach to improve the capability of language model pre-training on VRDs. Firstly, we introduce a new query-based IE model that employs span extraction instead of using the common sequence labeling approach. Secondly, to further extend the span extraction formulation, we propose a new training task that focuses on modelling the relationships among semantic entities within a document. This task enables target spans to be extracted recursively and can be used to pre-train the model or as an IE downstream task. Evaluation on three datasets of popular business documents (invoices, receipts) shows that our proposed method achieves significant improvements compared to existing models. The method also provides a mechanism for knowledge accumulation from multiple downstream IE tasks.

Submitted 6 July, 2021; v1 submitted 2 June, 2021; originally announced June 2021.

Comments: Accepted to Document Images and Language Workshop at ICDAR 2021

arXiv:2004.02465 [pdf, ps, other]

Independent sets of closure operations

Authors: Nguyen Hoang Son

Abstract: In this paper independent sets of closure operations are introduced. We characterize minimal keys and antikeys of closure operations in terms of independent sets. We establish an expression on the connection between minimal keys and antikeys of closure operations based on independent sets. We construct two combinatorial algorithms for finding all minimal keys and all antikeys of a given closure operation based on independent sets. We estimate the time complexity of these algorithms. Finally, we give an NP-complete problem concerning nonkeys of closure operations.

Submitted 6 April, 2020; originally announced April 2020.

MSC Class: [2010]:68R99; 68P15

arXiv:2003.03064 [pdf, other]

Transfer Learning for Information Extraction with Limited Data

Authors: Minh-Tien Nguyen, Viet-Anh Phan, Le Thai Linh, Nguyen Hong Son, Le Tien Dung, Miku Hirano, Hajime Hotta

Abstract: This paper presents a practical approach to fine-grained information extraction. From the authors' extensive experience in applying information extraction to business process automation, two fundamental technical challenges emerge: (i) the availability of labeled data is usually limited and (ii) highly detailed classification is required. The main idea of our proposal is to leverage the concept of transfer learning, which is to reuse the pre-trained model of deep neural networks, with a combination of common statistical classifiers to determine the class of each extracted term. To do that, we first exploit BERT to deal with the limitation of training data in real scenarios, then stack BERT with Convolutional Neural Networks to learn hidden representation for classification. To validate our approach, we applied our model to an actual case of document processing, which is a process of competitive bids for government projects in Japan. We used 100 documents for training and testing and confirmed that the model can extract fine-grained named entities at the level of detail required by the targeted business process, such as the department name of application receivers.

Submitted 8 June, 2020; v1 submitted 6 March, 2020; originally announced March 2020.

Comments: 14 pages, 5 figures, PACLING conference

arXiv:1912.02036 [pdf]

doi 10.5121/ijcnc.2019.11605

The method of detecting online password attacks based on high-level protocol analysis and clustering techniques

Authors: Nguyen Hong Son, Ha Thanh Dung

Abstract: Although there have been many solutions applied, the safety challenges related to the password security mechanism are not reduced. The reason for this is that while the means and tools to support password attacks are becoming more and more abundant, the number of transaction systems through the Internet is increasing, and new services systems appear. For example, IoT also uses password-based authentication. In this context, consolidating password-based authentication mechanisms is critical, but monitoring measures for timely detection of attacks also play an important role in this battle. The password attack detection solutions being used need to be supplemented and improved to meet the new situation. In this paper we propose a solution that automatically detects online password attacks in a way that is based solely on the network, using unsupervised learning techniques and protected application orientation. Our solution, therefore, minimizes dependence on the factors encountered by host-based or supervised learning solutions. The certainty of the solution comes from using the results of an in-depth analysis of attack characteristics to build the detection capacity of the mechanism. The solution was implemented experimentally on the real system and gave positive results.

Submitted 4 December, 2019; originally announced December 2019.

Comments: 13 pages, 7 figures

Journal ref: International Journal of Computer Networks & Communications (IJCNC) Vol.11, No.6, November 2019

Showing 1–6 of 6 results for author: Son, N H