
CHARP: Conversation History AwaReness Probing for Knowledge-grounded Dialogue Systems

Authors: Abbas Ghaddar, David Alfonso-Hermelo, Philippe Langlais, Mehdi Rezagholizadeh, Boxing Chen, Prasanna Parthasarathi

Abstract: In this work, we dive deep into one of the popular knowledge-grounded dialogue benchmarks that focus on faithfulness, FaithDial. We show that a significant portion of the FaithDial data contains annotation artifacts, which may bias models towards completely ignoring the conversation history. We therefore introduce CHARP, a diagnostic test set, designed for an improved evaluation of hallucinations in conversational models. CHARP not only measures hallucination but also the compliance of the models to the conversation task. Our extensive analysis reveals that models primarily exhibit poor performance on CHARP due to their inability to effectively attend to and reason over the conversation history. Furthermore, the evaluation methods of FaithDial fail to capture these shortcomings, neglecting the conversational history. Our findings indicate that there is substantial room for contribution in both dataset creation and hallucination evaluation for knowledge-grounded dialogue, and that CHARP can serve as a tool for monitoring the progress in this particular research area. CHARP is publicly available at https://huggingface.co/datasets/huawei-noah/CHARP

Submitted 23 May, 2024; originally announced May 2024.

Comments: To appear in Findings ACL 2024

arXiv:2403.00252

EUROPA: A Legal Multilingual Keyphrase Generation Dataset

Authors: Olivier Salaün, Frédéric Piedboeuf, Guillaume Le Berre, David Alfonso Hermelo, Philippe Langlais

Abstract: Keyphrase generation has primarily been explored within the context of academic research articles, with a particular focus on scientific domains and the English language. In this work, we present EUROPA, a dataset for multilingual keyphrase generation in the legal domain. It is derived from legal judgments from the Court of Justice of the European Union (EU), and contains instances in all 24 EU official languages. We run multilingual models on our corpus and analyze the results, showing room for improvement on a domain-specific multilingual corpus such as the one we present.

Submitted 14 June, 2024; v1 submitted 29 February, 2024; originally announced March 2024.

Comments: 19 pages, 2 figures, accepted at ACL 2024

arXiv:2402.14895

Data Augmentation is Dead, Long Live Data Augmentation

Authors: Frédéric Piedboeuf, Philippe Langlais

Abstract: Textual data augmentation (DA) is a prolific field of study where novel techniques to create artificial data are regularly proposed, and that has demonstrated great efficiency on small data settings, at least for text classification tasks. In this paper, we challenge those results, showing that classical data augmentation is simply a way of performing better fine-tuning, and that spending more time fine-tuning before applying data augmentation negates its effect. This is a significant contribution as it answers several questions that were left open in recent years, namely: which DA technique performs best (all of them, as long as they generate data close enough to the training set so as not to impair training) and why DA showed positive results (it facilitates training of the network). We furthermore show that zero- and few-shot data generation via conversational agents such as ChatGPT or LLama2 can improve performance, concluding that this form of data augmentation does still work, even if classical methods do not.

Submitted 22 February, 2024; originally announced February 2024.

Comments: 8 pages

arXiv:2401.07760

On the importance of Data Scale in Pretraining Arabic Language Models

Authors: Abbas Ghaddar, Philippe Langlais, Mehdi Rezagholizadeh, Boxing Chen

Abstract: Pretraining monolingual language models has been proven to be vital for performance in Arabic Natural Language Processing (NLP) tasks. In this paper, we conduct a comprehensive study on the role of data in Arabic Pretrained Language Models (PLMs). More precisely, we reassess the performance of a suite of state-of-the-art Arabic PLMs by retraining them on massive-scale, high-quality Arabic corpora. We have significantly improved the performance of the leading Arabic encoder-only BERT-base and encoder-decoder T5-base models on the ALUE and ORCA leaderboards, thereby reporting state-of-the-art results in their respective model categories. In addition, our analysis strongly suggests that pretraining data is by far the primary contributor to performance, surpassing other factors. Our models and source code are publicly available at https://github.com/huawei-noah/Pretrained-Language-Model/tree/master/JABER-PyTorch.

Submitted 15 January, 2024; originally announced January 2024.

arXiv:2311.11140

The state of OAI-PMH repositories in Canadian Universities

Authors: Frédéric Piedboeuf, Guillaume Le Berre, David Alfonso-Hermelo, Olivier Charbonneau, Philippe Langlais

Abstract: This article presents a study of the current state of University Institutional Repositories (UIRs) in Canada. UIRs are vital to sharing information and documents, mainly Electronic Theses and Dissertations (ETDs), and theoretically allow anyone, anywhere, to access the documents contained within the repository. Despite calls for consistent and shareable metadata in these repositories, our literature review shows inconsistencies in UIRs, including incorrect use of metadata fields and the omission of crucial information, rendering the systematic analysis of UIRs complex. Nonetheless, we collected the data of 57 Canadian UIRs with the aim of analyzing Canadian data and assessing the quality of its UIRs. This was surprisingly difficult due to the lack of information about the UIRs, and we attempt to ease future collection efforts by organizing vital information that is difficult to find, starting from the addresses of UIRs. We furthermore present and analyze the main characteristics of the UIRs we managed to collect, using this dataset to create recommendations for future practitioners.

Submitted 18 November, 2023; originally announced November 2023.

Comments: Published at DCMI, the International Conference on Dublin Core and Metadata Applications, 2023

arXiv:2307.09706

RaTE: a Reproducible automatic Taxonomy Evaluation by Filling the Gap

Authors: Tianjian Gao, Philippe Langlais

Abstract: Taxonomies are an essential knowledge representation, yet most studies on automatic taxonomy construction (ATC) resort to manual evaluation to score proposed algorithms. We argue that automatic taxonomy evaluation (ATE) is just as important as taxonomy construction. We propose RaTE, an automatic label-free taxonomy scoring procedure, which relies on a large pre-trained language model. We apply our evaluation procedure to three state-of-the-art ATC algorithms with which we built seven taxonomies from the Yelp domain, and show that 1) RaTE correlates well with human judgments and 2) artificially degrading a taxonomy leads to a decreasing RaTE score.

Submitted 18 July, 2023; originally announced July 2023.

Comments: 15th International Conference on Computational Semantics (IWCS), Association for Computational Linguistics (ACL)

arXiv:2305.04971

LABO: Towards Learning Optimal Label Regularization via Bi-level Optimization

Authors: Peng Lu, Ahmad Rashid, Ivan Kobyzev, Mehdi Rezagholizadeh, Philippe Langlais

Abstract: Regularization techniques are crucial to improving the generalization performance and training efficiency of deep neural networks. Many deep learning algorithms rely on weight decay, dropout, and batch/layer normalization to converge faster and generalize. Label Smoothing (LS) is another simple, versatile and efficient regularization which can be applied to various supervised classification tasks. Conventional LS, however, assumes that each non-target class is equally likely, regardless of the training instance. In this work, we present a general framework for training with label regularization, which includes conventional LS but can also model instance-specific variants. Based on this formulation, we propose an efficient way of learning LAbel regularization by devising a Bi-level Optimization (LABO) problem. We derive a deterministic and interpretable solution of the inner loop as the optimal label smoothing without the need to store the parameters or the output of a trained model. Finally, we conduct extensive experiments and demonstrate our LABO consistently yields improvement over conventional label regularization in various fields, including seven machine translation and three image classification tasks across various

Submitted 8 May, 2023; originally announced May 2023.

Comments: Accepted at ACL 2023 (Findings)

arXiv:2212.05956

Improving Generalization of Pre-trained Language Models via Stochastic Weight Averaging

Authors: Peng Lu, Ivan Kobyzev, Mehdi Rezagholizadeh, Ahmad Rashid, Ali Ghodsi, Philippe Langlais

Abstract: Knowledge Distillation (KD) is a commonly used technique for improving the generalization of compact Pre-trained Language Models (PLMs) on downstream tasks. However, such methods impose the additional burden of training a separate teacher model for every new dataset. Alternatively, one may directly work on the improvement of the optimization procedure of the compact model toward better generalization. Recent works observe that the flatness of the local minimum correlates well with better generalization. In this work, we adapt Stochastic Weight Averaging (SWA), a method encouraging convergence to a flatter minimum, to fine-tuning PLMs. We conduct extensive experiments on various NLP tasks (text classification, question answering, and generation) and different model architectures and demonstrate that our adaptation improves the generalization without extra computation cost. Moreover, we observe that this simple optimization technique is able to outperform the state-of-the-art KD methods for compact models.

Submitted 16 December, 2022; v1 submitted 12 December, 2022; originally announced December 2022.

Comments: Published at EMNLP 2022 (Findings)

arXiv:2205.10687

Revisiting Pre-trained Language Models and their Evaluation for Arabic Natural Language Understanding

Authors: Abbas Ghaddar, Yimeng Wu, Sunyam Bagga, Ahmad Rashid, Khalil Bibi, Mehdi Rezagholizadeh, Chao Xing, Yasheng Wang, Duan Xinyu, Zhefeng Wang, Baoxing Huai, Xin Jiang, Qun Liu, Philippe Langlais

Abstract: There is a growing body of work in recent years to develop pre-trained language models (PLMs) for the Arabic language. This work addresses two major problems in existing Arabic PLMs which constrain progress of the Arabic NLU and NLG fields. First, existing Arabic PLMs are not well-explored and their pre-training can be improved significantly using a more methodical approach. Second, there is a lack of systematic and reproducible evaluation of these models in the literature. In this work, we revisit both the pre-training and evaluation of Arabic PLMs. In terms of pre-training, we explore improving Arabic LMs from three perspectives: quality of the pre-training data, size of the model, and incorporating character-level information. As a result, we release three new Arabic BERT-style models (JABER, Char-JABER, and SABER), and two T5-style models (AT5S and AT5B). In terms of evaluation, we conduct a comprehensive empirical study to systematically evaluate the performance of existing state-of-the-art models on ALUE, a leaderboard-powered benchmark for Arabic NLU tasks, and on a subset of the ARGEN benchmark for Arabic NLG tasks. We show that our models significantly outperform existing Arabic PLMs and achieve a new state-of-the-art performance on discriminative and generative Arabic NLU and NLG tasks. Our models and source code to reproduce our results will be made available shortly.

Submitted 21 May, 2022; originally announced May 2022.

arXiv:2204.07674

CILDA: Contrastive Data Augmentation using Intermediate Layer Knowledge Distillation

Authors: Md Akmal Haidar, Mehdi Rezagholizadeh, Abbas Ghaddar, Khalil Bibi, Philippe Langlais, Pascal Poupart

Abstract: Knowledge distillation (KD) is an efficient framework for compressing large-scale pre-trained language models. Recent years have seen a surge of research aiming to improve KD by leveraging Contrastive Learning, Intermediate Layer Distillation, Data Augmentation, and Adversarial Training. In this work, we propose a learning-based data augmentation technique tailored for knowledge distillation, called CILDA. To the best of our knowledge, this is the first time that intermediate layer representations of the main task are used in improving the quality of augmented samples. More precisely, we introduce an augmentation technique for KD based on intermediate layer matching using contrastive loss to improve masked adversarial data augmentation. CILDA outperforms existing state-of-the-art KD approaches on the GLUE benchmark, as well as in an out-of-domain evaluation.

Submitted 15 April, 2022; originally announced April 2022.

arXiv:2112.04329

JABER and SABER: Junior and Senior Arabic BERt

Authors: Abbas Ghaddar, Yimeng Wu, Ahmad Rashid, Khalil Bibi, Mehdi Rezagholizadeh, Chao Xing, Yasheng Wang, Duan Xinyu, Zhefeng Wang, Baoxing Huai, Xin Jiang, Qun Liu, Philippe Langlais

Abstract: Language-specific pre-trained models have proven to be more accurate than multilingual ones in a monolingual evaluation setting, and Arabic is no exception. However, we found that previously released Arabic BERT models were significantly under-trained. In this technical report, we present JABER and SABER, Junior and Senior Arabic BERt respectively, our pre-trained language model prototypes dedicated to Arabic. We conduct an empirical study to systematically evaluate the performance of models across a diverse set of existing Arabic NLU tasks. Experimental results show that JABER and SABER achieve state-of-the-art performance on ALUE, a new benchmark for Arabic Language Understanding Evaluation, as well as on a well-established NER benchmark.

Submitted 9 January, 2022; v1 submitted 8 December, 2021; originally announced December 2021.

Comments: Technical Report; v2: add SABER and CAMeLBERT evaluation; v3: fix minor typos and grammatical errors

arXiv:2111.05196

NATURE: Natural Auxiliary Text Utterances for Realistic Spoken Language Evaluation

Authors: David Alfonso-Hermelo, Ahmad Rashid, Abbas Ghaddar, Philippe Langlais, Mehdi Rezagholizadeh

Abstract: Slot-filling and intent detection are the backbone of conversational agents such as voice assistants, and are active areas of research. Even though state-of-the-art techniques on publicly available benchmarks show impressive performance, their ability to generalize to realistic scenarios is yet to be demonstrated. In this work, we present NATURE, a set of simple spoken-language oriented transformations, applied to the evaluation set of datasets, to introduce human spoken language variations while preserving the semantics of an utterance. We apply NATURE to common slot-filling and intent detection benchmarks and demonstrate that simple perturbations from the standard evaluation set by NATURE can deteriorate model performance significantly. Through our experiments we demonstrate that when NATURE operators are applied to the evaluation sets of popular benchmarks, model accuracy can drop by up to 40%.

Submitted 28 January, 2022; v1 submitted 9 November, 2021; originally announced November 2021.

Comments: 20 pages, 4 figures, accepted to NeurIPS 2021 Track Datasets and Benchmarks

arXiv:2109.10164

RAIL-KD: RAndom Intermediate Layer Mapping for Knowledge Distillation

Authors: Md Akmal Haidar, Nithin Anchuri, Mehdi Rezagholizadeh, Abbas Ghaddar, Philippe Langlais, Pascal Poupart

Abstract: Intermediate layer knowledge distillation (KD) can improve the standard KD technique (which only targets the output of teacher and student models), especially over large pre-trained language models. However, intermediate layer distillation suffers from excessive computational burdens and the engineering effort required for setting up a proper layer mapping. To address these problems, we propose a RAndom Intermediate Layer Knowledge Distillation (RAIL-KD) approach in which intermediate layers from the teacher model are selected randomly to be distilled into the intermediate layers of the student model. This randomized selection ensures that all teacher layers are taken into account in the training process, while reducing the computational cost of intermediate layer distillation. Also, we show that it acts as a regularizer for improving the generalizability of the student model. We perform extensive experiments on GLUE tasks as well as on out-of-domain test sets. We show that our proposed RAIL-KD approach outperforms other state-of-the-art intermediate layer KD methods considerably in both performance and training time.

Submitted 1 October, 2021; v1 submitted 21 September, 2021; originally announced September 2021.

arXiv:2109.10147

Knowledge Distillation with Noisy Labels for Natural Language Understanding

Authors: Shivendra Bhardwaj, Abbas Ghaddar, Ahmad Rashid, Khalil Bibi, Chengyang Li, Ali Ghodsi, Philippe Langlais, Mehdi Rezagholizadeh

Abstract: Knowledge Distillation (KD) is extensively used to compress and deploy large pre-trained language models on edge devices for real-world applications. However, one neglected area of research is the impact of noisy (corrupted) labels on KD. We present, to the best of our knowledge, the first study on KD with noisy labels in Natural Language Understanding (NLU). We document the scope of the problem and present two methods to mitigate the impact of label noise. Experiments on the GLUE benchmark show that our methods are effective even under high noise levels. Nevertheless, our results indicate that more research is necessary to cope with label noise under KD.

Submitted 21 September, 2021; originally announced September 2021.

arXiv:2109.02071

doi 10.18653/v1/2021.findings-acl.168

End-to-End Self-Debiasing Framework for Robust NLU Training

Authors: Abbas Ghaddar, Philippe Langlais, Mehdi Rezagholizadeh, Ahmad Rashid

Abstract: Existing Natural Language Understanding (NLU) models have been shown to incorporate dataset biases leading to strong performance on in-distribution (ID) test sets but poor performance on out-of-distribution (OOD) ones. We introduce a simple yet effective debiasing framework whereby the shallow representations of the main model are used to derive a bias model and both models are trained simultaneously. We demonstrate on three well-studied NLU tasks that, despite its simplicity, our method leads to competitive OOD results. It significantly outperforms other debiasing approaches on two tasks, while still delivering high in-distribution performance.

Submitted 5 September, 2021; originally announced September 2021.

Comments: Findings ACL 2021

Journal ref: Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021, August 2021, pages 1923--1929

arXiv:2107.11610

doi 10.1162/tacl_a_00386

Context-aware Adversarial Training for Name Regularity Bias in Named Entity Recognition

Authors: Abbas Ghaddar, Philippe Langlais, Ahmad Rashid, Mehdi Rezagholizadeh

Abstract: In this work, we examine the ability of NER models to use contextual information when predicting the type of an ambiguous entity. We introduce NRB, a new testbed carefully designed to diagnose Name Regularity Bias of NER models. Our results indicate that all state-of-the-art models we tested show such a bias, with BERT fine-tuned models significantly outperforming feature-based (LSTM-CRF) ones on NRB, despite having comparable (sometimes lower) performance on standard benchmarks. To mitigate this bias, we propose a novel model-agnostic training method that adds learnable adversarial noise to some entity mentions, thus forcing models to focus more strongly on the contextual signal, leading to significant gains on NRB. Combining it with two other training strategies, data augmentation and parameter freezing, leads to further gains.

Submitted 24 July, 2021; originally announced July 2021.

Comments: MIT Press; TACL 2021; presented at ACL 2021. The content is identical to the TACL version, except that the figures and tables are better aligned

Journal ref: Transactions of the Association for Computational Linguistics, volume 9, pages 586--604, 2021

arXiv:2006.02679

Digital interfaces of historical newspapers: opportunities, restrictions and recommendations

Authors: Eva Pfanzelter, Sarah Oberbichler, Jani Marjanen, Pierre-Carl Langlais, Stefan Hechl

Abstract: Many libraries offer free access to digitised historical newspapers via user interfaces. After an initial period of search and filter options as the only features, the availability of more advanced tools and the desire for more options among users have ushered in a period of interface development. However, this raises a number of open questions and challenges. For example, how can we provide interfaces for different user groups? What tools should be available on interfaces and how can we avoid too much complexity? What tools are helpful and how can we improve usability? This paper will not provide definite answers to these questions, but it gives an insight into the difficulties, challenges and risks of using interfaces to investigate historical newspapers. More importantly, it provides ideas and recommendations for the improvement of user interfaces and digital tools.

Submitted 4 June, 2020; originally announced June 2020.

arXiv:1809.08962

WiRe57 : A Fine-Grained Benchmark for Open Information Extraction

Authors: William Léchelle, Fabrizio Gotti, Philippe Langlais

Abstract: We build a reference for the task of Open Information Extraction, on five documents. We tentatively resolve a number of issues that arise, including inference and granularity. We seek to better pinpoint the requirements for the task. We produce our annotation guidelines specifying what is correct to extract and what is not. In turn, we use this reference to score existing Open IE systems. We address the non-trivial problem of evaluating the extractions produced by systems against the reference tuples, and share our evaluation script. Among seven compared extractors, we find the MinIE system to perform best.

Submitted 1 August, 2019; v1 submitted 24 September, 2018; originally announced September 2018.

arXiv:1806.05559

Extracting Parallel Sentences with Bidirectional Recurrent Neural Networks to Improve Machine Translation

Authors: Francis Grégoire, Philippe Langlais

Abstract: Parallel sentence extraction is a task addressing the data sparsity problem found in multilingual natural language processing applications. We propose a bidirectional recurrent neural network based approach to extract parallel sentences from collections of multilingual texts. Our experiments with noisy parallel corpora show that we can achieve promising results against a competitive baseline by removing the need for specific feature engineering or additional external resources. To justify the utility of our approach, we extract sentence pairs from Wikipedia articles to train machine translation systems and show significant improvements in translation performance.

Submitted 24 August, 2018; v1 submitted 13 June, 2018; originally announced June 2018.

Comments: 12 pages, 7 figures, COLING 2018. arXiv admin note: text overlap with arXiv:1709.09783

arXiv:1806.03489

Robust Lexical Features for Improved Neural Network Named-Entity Recognition

Authors: Abbas Ghaddar, Philippe Langlais

Abstract: Neural network approaches to Named-Entity Recognition reduce the need for carefully hand-crafted features. While some features do remain in state-of-the-art systems, lexical features have been mostly discarded, with the exception of gazetteers. In this work, we show that this is unfair: lexical features are actually quite useful. We propose to embed words and entity types into a low-dimensional vector space we train from annotated data produced by distant supervision thanks to Wikipedia. From this, we compute, offline, a feature vector representing each word. When used with a vanilla recurrent neural network model, this representation yields substantial improvements. We establish a new state-of-the-art F1 score of 87.95 on ONTONOTES 5.0, while matching state-of-the-art performance with an F1 score of 91.73 on the over-studied CONLL-2003 dataset.

Submitted 9 June, 2018; originally announced June 2018.

Comments: 12 pages, to appear in COLING 2018

arXiv:1709.09783

A Deep Neural Network Approach To Parallel Sentence Extraction

Authors: Francis Grégoire, Philippe Langlais

Abstract: Parallel sentence extraction is a task addressing the data sparsity problem found in multilingual natural language processing applications. We propose an end-to-end deep neural network approach to detect translational equivalence between sentences in two different languages. In contrast to previous approaches, which typically rely on multiple models and various word alignment features, by leveraging continuous vector representation of sentences we remove the need for any domain-specific feature engineering. Using siamese bidirectional recurrent neural networks, our results against a strong baseline based on a state-of-the-art parallel sentence extraction system show a significant improvement in both the quality of the extracted parallel sentences and the translation performance of statistical machine translation systems. We believe this study is the first one to investigate deep learning for the parallel sentence extraction task.

Submitted 27 September, 2017; originally announced September 2017.

Comments: 9 pages, 5 figures

Showing 1–21 of 21 results for author: Langlais, P