Search | arXiv e-print repository

Data Centric Domain Adaptation for Historical Text with OCR Errors

Authors: Luisa März, Stefan Schweter, Nina Poerner, Benjamin Roth, Hinrich Schütze

Abstract: We propose new methods for in-domain and cross-domain Named Entity Recognition (NER) on historical data for Dutch and French. For the cross-domain case, we address domain shift by integrating unsupervised in-domain data via contextualized string embeddings; and OCR errors by injecting synthetic OCR errors into the source domain and address data centric domain adaptation. We propose a general appro… ▽ More We propose new methods for in-domain and cross-domain Named Entity Recognition (NER) on historical data for Dutch and French. For the cross-domain case, we address domain shift by integrating unsupervised in-domain data via contextualized string embeddings; and OCR errors by injecting synthetic OCR errors into the source domain and address data centric domain adaptation. We propose a general approach to imitate OCR errors in arbitrary input data. Our cross-domain as well as our in-domain results outperform several strong baselines and establish state-of-the-art results. We publish preprocessed versions of the French and Dutch Europeana NER corpora. △ Less

Submitted 2 July, 2021; originally announced July 2021.

Comments: 14 pages, 2 figures, 6 tables

arXiv:2004.03354 [pdf, other]

Inexpensive Domain Adaptation of Pretrained Language Models: Case Studies on Biomedical NER and Covid-19 QA

Authors: Nina Poerner, Ulli Waltinger, Hinrich Schütze

Abstract: Domain adaptation of Pretrained Language Models (PTLMs) is typically achieved by unsupervised pretraining on target-domain text. While successful, this approach is expensive in terms of hardware, runtime and CO_2 emissions. Here, we propose a cheaper alternative: We train Word2Vec on target-domain text and align the resulting word vectors with the wordpiece vectors of a general-domain PTLM. We eva… ▽ More Domain adaptation of Pretrained Language Models (PTLMs) is typically achieved by unsupervised pretraining on target-domain text. While successful, this approach is expensive in terms of hardware, runtime and CO_2 emissions. Here, we propose a cheaper alternative: We train Word2Vec on target-domain text and align the resulting word vectors with the wordpiece vectors of a general-domain PTLM. We evaluate on eight biomedical Named Entity Recognition (NER) tasks and compare against the recently proposed BioBERT model. We cover over 60% of the BioBERT-BERT F1 delta, at 5% of BioBERT's CO_2 footprint and 2% of its cloud compute cost. We also show how to quickly adapt an existing general-domain Question Answering (QA) model to an emerging domain: the Covid-19 pandemic. △ Less

Submitted 27 June, 2020; v1 submitted 7 April, 2020; originally announced April 2020.

arXiv:1911.03700 [pdf, other]

Sentence Meta-Embeddings for Unsupervised Semantic Textual Similarity

Authors: Nina Poerner, Ulli Waltinger, Hinrich Schütze

Abstract: We address the task of unsupervised Semantic Textual Similarity (STS) by ensembling diverse pre-trained sentence encoders into sentence meta-embeddings. We apply, extend and evaluate different meta-embedding methods from the word embedding literature at the sentence level, including dimensionality reduction (Yin and Schütze, 2016), generalized Canonical Correlation Analysis (Rastogi et al., 2015)… ▽ More We address the task of unsupervised Semantic Textual Similarity (STS) by ensembling diverse pre-trained sentence encoders into sentence meta-embeddings. We apply, extend and evaluate different meta-embedding methods from the word embedding literature at the sentence level, including dimensionality reduction (Yin and Schütze, 2016), generalized Canonical Correlation Analysis (Rastogi et al., 2015) and cross-view auto-encoders (Bollegala and Bao, 2018). Our sentence meta-embeddings set a new unsupervised State of The Art (SoTA) on the STS Benchmark and on the STS12-STS16 datasets, with gains of between 3.7% and 6.4% Pearson's r over single-source systems. △ Less

Submitted 24 June, 2020; v1 submitted 9 November, 2019; originally announced November 2019.

arXiv:1911.03681 [pdf, other]

E-BERT: Efficient-Yet-Effective Entity Embeddings for BERT

Authors: Nina Poerner, Ulli Waltinger, Hinrich Schütze

Abstract: We present a novel way of injecting factual knowledge about entities into the pretrained BERT model (Devlin et al., 2019): We align Wikipedia2Vec entity vectors (Yamada et al., 2016) with BERT's native wordpiece vector space and use the aligned entity vectors as if they were wordpiece vectors. The resulting entity-enhanced version of BERT (called E-BERT) is similar in spirit to ERNIE (Zhang et al.… ▽ More We present a novel way of injecting factual knowledge about entities into the pretrained BERT model (Devlin et al., 2019): We align Wikipedia2Vec entity vectors (Yamada et al., 2016) with BERT's native wordpiece vector space and use the aligned entity vectors as if they were wordpiece vectors. The resulting entity-enhanced version of BERT (called E-BERT) is similar in spirit to ERNIE (Zhang et al., 2019) and KnowBert (Peters et al., 2019), but it requires no expensive further pretraining of the BERT encoder. We evaluate E-BERT on unsupervised question answering (QA), supervised relation classification (RC) and entity linking (EL). On all three tasks, E-BERT outperforms BERT and other baselines. We also show quantitatively that the original BERT model is overly reliant on the surface form of entity names (e.g., guessing that someone with an Italian-sounding name speaks Italian), and that E-BERT mitigates this problem. △ Less

Submitted 1 May, 2020; v1 submitted 9 November, 2019; originally announced November 2019.

arXiv:1906.10924 [pdf, other]

Interpretable Question Answering on Knowledge Bases and Text

Authors: Alona Sydorova, Nina Poerner, Benjamin Roth

Abstract: Interpretability of machine learning (ML) models becomes more relevant with their increasing adoption. In this work, we address the interpretability of ML based question answering (QA) models on a combination of knowledge bases (KB) and text documents. We adapt post hoc explanation methods such as LIME and input perturbation (IP) and compare them with the self-explanatory attention mechanism of th… ▽ More Interpretability of machine learning (ML) models becomes more relevant with their increasing adoption. In this work, we address the interpretability of ML based question answering (QA) models on a combination of knowledge bases (KB) and text documents. We adapt post hoc explanation methods such as LIME and input perturbation (IP) and compare them with the self-explanatory attention mechanism of the model. For this purpose, we propose an automatic evaluation paradigm for explanation methods in the context of QA. We also conduct a study with human annotators to evaluate whether explanations help them identify better QA models. Our results suggest that IP provides better explanations than LIME or attention, according to both automatic and human evaluation. We obtain the same ranking of methods in both experiments, which supports the validity of our automatic evaluation paradigm. △ Less

Submitted 26 June, 2019; originally announced June 2019.

arXiv:1811.00066 [pdf, other]

Aligning Very Small Parallel Corpora Using Cross-Lingual Word Embeddings and a Monogamy Objective

Authors: Nina Poerner, Masoud Jalili Sabet, Benjamin Roth, Hinrich Schütze

Abstract: Count-based word alignment methods, such as the IBM models or fast-align, struggle on very small parallel corpora. We therefore present an alternative approach based on cross-lingual word embeddings (CLWEs), which are trained on purely monolingual data. Our main contribution is an unsupervised objective to adapt CLWEs to parallel corpora. In experiments on between 25 and 500 sentences, our method… ▽ More Count-based word alignment methods, such as the IBM models or fast-align, struggle on very small parallel corpora. We therefore present an alternative approach based on cross-lingual word embeddings (CLWEs), which are trained on purely monolingual data. Our main contribution is an unsupervised objective to adapt CLWEs to parallel corpora. In experiments on between 25 and 500 sentences, our method outperforms fast-align. We also show that our fine-tuning objective consistently improves a CLWE-only baseline. △ Less

Submitted 31 October, 2018; originally announced November 2018.

arXiv:1809.07291 [pdf, other]

Interpretable Textual Neuron Representations for NLP

Authors: Nina Poerner, Benjamin Roth, Hinrich Schütze

Abstract: Input optimization methods, such as Google Deep Dream, create interpretable representations of neurons for computer vision DNNs. We propose and evaluate ways of transferring this technology to NLP. Our results suggest that gradient ascent with a gumbel softmax layer produces n-gram representations that outperform naive corpus search in terms of target neuron activation. The representations highlig… ▽ More Input optimization methods, such as Google Deep Dream, create interpretable representations of neurons for computer vision DNNs. We propose and evaluate ways of transferring this technology to NLP. Our results suggest that gradient ascent with a gumbel softmax layer produces n-gram representations that outperform naive corpus search in terms of target neuron activation. The representations highlight differences in syntax awareness between the language and visual models of the Imaginet architecture. △ Less

Submitted 19 September, 2018; originally announced September 2018.

Comments: BlackboxNLP Workshop at EMNLP 2018 (Extended Abstract)

arXiv:1803.01707 [pdf, other]

doi 10.1017/S1351324918000451

Neural Architectures for Open-Type Relation Argument Extraction

Authors: Benjamin Roth, Costanza Conforti, Nina Poerner, Sanjeev Karn, Hinrich Schütze

Abstract: In this work, we introduce the task of Open-Type Relation Argument Extraction (ORAE): Given a corpus, a query entity Q and a knowledge base relation (e.g.,"Q authored notable work with title X"), the model has to extract an argument of non-standard entity type (entities that cannot be extracted by a standard named entity tagger, e.g. X: the title of a book or a work of art) from the corpus. A dist… ▽ More In this work, we introduce the task of Open-Type Relation Argument Extraction (ORAE): Given a corpus, a query entity Q and a knowledge base relation (e.g.,"Q authored notable work with title X"), the model has to extract an argument of non-standard entity type (entities that cannot be extracted by a standard named entity tagger, e.g. X: the title of a book or a work of art) from the corpus. A distantly supervised dataset based on WikiData relations is obtained and released to address the task. We develop and compare a wide range of neural models for this task yielding large improvements over a strong baseline obtained with a neural question answering system. The impact of different sentence encoding architectures and answer extraction methods is systematically compared. An encoder based on gated recurrent units combined with a conditional random fields tagger gives the best results. △ Less

Submitted 30 September, 2018; v1 submitted 5 March, 2018; originally announced March 2018.

Journal ref: Nat. Lang. Eng. 25 (2019) 219-238

arXiv:1801.06422 [pdf, other]

Evaluating neural network explanation methods using hybrid documents and morphological agreement

Authors: Nina Poerner, Benjamin Roth, Hinrich Schütze

Abstract: The behavior of deep neural networks (DNNs) is hard to understand. This makes it necessary to explore post hoc explanation methods. We conduct the first comprehensive evaluation of explanation methods for NLP. To this end, we design two novel evaluation paradigms that cover two important classes of NLP problems: small context and large context problems. Both paradigms require no manual annotation… ▽ More The behavior of deep neural networks (DNNs) is hard to understand. This makes it necessary to explore post hoc explanation methods. We conduct the first comprehensive evaluation of explanation methods for NLP. To this end, we design two novel evaluation paradigms that cover two important classes of NLP problems: small context and large context problems. Both paradigms require no manual annotation and are therefore broadly applicable. We also introduce LIMSSE, an explanation method inspired by LIME that is designed for NLP. We show empirically that LIMSSE, LRP and DeepLIFT are the most effective explanation methods and recommend them for explaining DNNs in NLP. △ Less

Submitted 6 May, 2019; v1 submitted 19 January, 2018; originally announced January 2018.

Showing 1–9 of 9 results for author: Poerner, N