-
IndoNLG: Benchmark and Resources for Evaluating Indonesian Natural Language Generation
Authors:
Samuel Cahyawijaya,
Genta Indra Winata,
Bryan Wilie,
Karissa Vincentio,
Xiaohong Li,
Adhiguna Kuncoro,
Sebastian Ruder,
Zhi Yuan Lim,
Syafri Bahar,
Masayu Leylia Khodra,
Ayu Purwarianti,
Pascale Fung
Abstract:
Natural language generation (NLG) benchmarks provide an important avenue to measure progress and develop better NLG systems. Unfortunately, the lack of publicly available NLG benchmarks for low-resource languages poses a challenging barrier for building NLG systems that work well for languages with limited amounts of data. Here we introduce IndoNLG, the first benchmark to measure natural language generation (NLG) progress in three low-resource -- yet widely spoken -- languages of Indonesia: Indonesian, Javanese, and Sundanese. Altogether, these languages are spoken by more than 100 million native speakers, and hence constitute an important use case of NLG systems today. Concretely, IndoNLG covers six tasks: summarization, question answering, chit-chat, and three different pairs of machine translation (MT) tasks. We collate a clean pretraining corpus of Indonesian, Sundanese, and Javanese datasets, Indo4B-Plus, which is used to pretrain our models: IndoBART and IndoGPT. We show that IndoBART and IndoGPT achieve competitive performance on all tasks -- despite using only one-fifth the parameters of a larger multilingual model, mBART-LARGE (Liu et al., 2020). This finding emphasizes the importance of pretraining on closely related, local languages to achieve more efficient learning and faster inference for very low-resource languages like Javanese and Sundanese.
Submitted 9 October, 2021; v1 submitted 16 April, 2021;
originally announced April 2021.
-
Multi-document Summarization using Semantic Role Labeling and Semantic Graph for Indonesian News Article
Authors:
Yuly Haruka Berliana Gunawan,
Masayu Leylia Khodra
Abstract:
In this paper, we propose a multi-document summarization system using semantic role labeling (SRL) and a semantic graph for Indonesian news articles. To improve an existing summarizer, our system modifies a summarizer that employed subject, predicate, object, and adverbial (SVOA) extraction for predicate argument structure (PAS) extraction. SVOA extraction is replaced with an SRL model for Indonesian. We also replace the genetic algorithm used to identify important PAS with a decision tree classifier, since the summarizer without the genetic algorithm gave better performance. The decision tree model with 10 features achieved better performance than the decision tree with 4 sentence features. Experiments and evaluations are conducted to generate 100-word and 200-word summaries. The evaluation shows that the proposed model achieves an average ROUGE-2 recall of 0.313 for 100-word summaries and 0.394 for 200-word summaries.
Submitted 5 March, 2021;
originally announced March 2021.
-
Fine-tuning Pretrained Multilingual BERT Model for Indonesian Aspect-based Sentiment Analysis
Authors:
Annisa Nurul Azhar,
Masayu Leylia Khodra
Abstract:
Although previous research on Aspect-based Sentiment Analysis (ABSA) for Indonesian reviews in the hotel domain has been conducted using CNN and XGBoost, its model did not generalize well to test data, and a high number of OOV words contributed to misclassification cases. Nowadays, most state-of-the-art results for a wide array of NLP tasks are achieved by utilizing pretrained language representations. In this paper, we incorporate one of the foremost language representation models, BERT, to perform ABSA on an Indonesian reviews dataset. By combining multilingual BERT (m-BERT) with a task transformation method, we achieve a significant improvement of 8% in F1-score compared to the result from our previous study.
Submitted 5 March, 2021;
originally announced March 2021.
-
Parsing Indonesian Sentence into Abstract Meaning Representation using Machine Learning Approach
Authors:
Adylan Roaffa Ilmy,
Masayu Leylia Khodra
Abstract:
Abstract Meaning Representation (AMR) captures much information about a sentence, such as semantic relations, coreferences, and named entity relations, in one representation. However, research on AMR parsing for Indonesian sentences is fairly limited. In this paper, we develop a system that parses an Indonesian sentence using a machine learning approach. Based on the work of Zhang et al., our system consists of three steps: pair prediction, label prediction, and graph construction. Pair prediction uses a dependency parsing component to get the edges between the words for the AMR. The result of pair prediction is passed to the label prediction process, which uses a supervised learning algorithm to predict the labels of the edges of the AMR. We used a simple-sentence dataset gathered from articles and news article sentences. Our model achieved a SMATCH score of 0.820 on the simple-sentence test data.
Submitted 5 March, 2021;
originally announced March 2021.
-
Aspect and Opinion Terms Extraction Using Double Embeddings and Attention Mechanism for Indonesian Hotel Reviews
Authors:
Jordhy Fernando,
Masayu Leylia Khodra,
Ali Akbar Septiandri
Abstract:
Aspect and opinion term extraction from review texts is one of the key tasks in aspect-based sentiment analysis. To extract aspect and opinion terms from Indonesian hotel reviews, we adapt the double embeddings feature and attention mechanism that outperform the best system at SemEval 2015 and 2016. We conduct experiments using 4000 reviews to find the best configuration and show the influence of double embeddings and the attention mechanism on model performance. Using 1000 reviews for evaluation, we achieved F1-measures of 0.914 and 0.90 for aspect and opinion term extraction at the token and entity (term) levels, respectively.
Submitted 19 August, 2019; v1 submitted 13 August, 2019;
originally announced August 2019.
-
A Question Answering System Using Graph-Pattern Association Rules (QAGPAR) On YAGO Knowledge Base
Authors:
Wahyudi,
Masayu Leylia Khodra,
Ary Setijadi Prihatmanto,
Carmadi Machbub
Abstract:
A question answering system (QA System) was developed that uses graph-pattern association rules on the YAGO knowledge base. The system takes a user question as input and provides an answer as output. If the answer is missing or unavailable in the database, then graph-pattern association rules are used to get the answer. The architecture of this question answering system is as follows: question classification, graph component generation, query generation, and query processing. The question answering system uses association graph patterns in a waterfall model. In this paper, the architecture of the system is described, specifically discussing its reasoning and performance capabilities. The result of this research is that rules with high confidence and correct logic produce correct answers, and vice versa.
Submitted 1 February, 2019;
originally announced February 2019.
-
Handling Imbalanced Dataset in Multi-label Text Categorization using Bagging and Adaptive Boosting
Authors:
Genta Indra Winata,
Masayu Leylia Khodra
Abstract:
Imbalanced datasets occur due to the uneven distribution of data available in the real world, such as the disposition of complaints at government offices in Bandung. Consequently, multi-label text categorization algorithms may not produce the best performance because classifiers tend to be weighed down by the majority of the data and ignore the minority. In this paper, Bagging and Adaptive Boosting algorithms are employed to handle the issue and improve the performance of text categorization. The results are evaluated with four evaluation metrics: Hamming loss, subset accuracy, example-based accuracy, and micro-averaged F-measure. Bagging ML-LP with SMO weak classifier is the best performer in terms of subset accuracy and example-based accuracy. Bagging ML-BR with SMO weak classifier has the best micro-averaged F-measure among all. On the other hand, AdaBoost MH with J48 weak classifier has the lowest Hamming loss value. Thus, both algorithms have high potential for boosting the performance of text categorization, but only for certain weak classifiers. However, bagging has more potential than adaptive boosting for increasing the accuracy of minority labels.
Submitted 11 June, 2019; v1 submitted 27 October, 2018;
originally announced October 2018.
-
Using Graph-Pattern Association Rules On Yago Knowledge Base
Authors:
Wahyudi,
Masayu Leylia Khodra,
Ary Setijadi Prihatmanto,
Carmadi Machbub
Abstract:
We propose the use of Graph-Pattern Association Rules (GPARs) on the Yago knowledge base. Extending association rules for itemsets, GPARs can help to discover regularities between entities in knowledge bases. A rule-generated graph pattern (RGGP) algorithm was used for extracting rules from the Yago knowledge base, and a graph-pattern association rules algorithm was used for creating association rules. Our research resulted in 1114 association rules, where the value of standard confidence at 50.18% was better than partial completeness assumption (PCA) confidence at 49.82%. In addition, the computation time for standard confidence was better than for PCA confidence.
Submitted 30 September, 2018;
originally announced October 2018.