Showing 1–5 of 5 results for author: März, L

Searching in archive cs.
  1. arXiv:2206.01444  [pdf, other]

    cs.LG cs.PF

    XPASC: Measuring Generalization in Weak Supervision by Explainability and Association

    Authors: Luisa März, Ehsaneddin Asgari, Fabienne Braune, Franziska Zimmermann, Benjamin Roth

    Abstract: Weak supervision is leveraged in a wide range of domains and tasks due to its ability to create massive amounts of labeled data, requiring only little manual effort. Standard approaches use labeling functions to specify signals that are relevant for the labeling. It has been conjectured that weakly supervised models over-rely on those signals and as a result suffer from overfitting. To verify this…

    Submitted 22 November, 2022; v1 submitted 3 June, 2022; originally announced June 2022.

    Comments: 26 pages, 20 figures, 5 tables

  2. arXiv:2205.15575  [pdf, other]

    cs.CL

    hmBERT: Historical Multilingual Language Models for Named Entity Recognition

    Authors: Stefan Schweter, Luisa März, Katharina Schmid, Erion Çano

    Abstract: Compared to standard Named Entity Recognition (NER), identifying persons, locations, and organizations in historical texts constitutes a big challenge. To obtain machine-readable corpora, the historical text is usually scanned and Optical Character Recognition (OCR) needs to be performed. As a result, the historical corpora contain errors. Also, entities like location or organization can change ov…

    Submitted 1 July, 2022; v1 submitted 31 May, 2022; originally announced May 2022.

    Comments: Camera-ready HIPE-2022 Working Note Paper for CLEF 2022 (Conference and Labs of the Evaluation Forum (CLEF 2022))

  3. arXiv:2109.07994  [pdf, other]

    cs.LG cs.CL

    KnowMAN: Weakly Supervised Multinomial Adversarial Networks

    Authors: Luisa März, Ehsaneddin Asgari, Fabienne Braune, Franziska Zimmermann, Benjamin Roth

    Abstract: The absence of labeled data for training neural models is often addressed by leveraging knowledge about the specific task, resulting in heuristic but noisy labels. The knowledge is captured in labeling functions, which detect certain regularities or patterns in the training samples and annotate corresponding labels for training. This process of weakly supervised training may result in an over-reli…

    Submitted 16 September, 2021; originally announced September 2021.

    Comments: 9 pages, 3 figures, 2 tables, accepted to EMNLP 2021

  4. arXiv:2107.00927  [pdf, other]

    cs.CL cs.LG

    Data Centric Domain Adaptation for Historical Text with OCR Errors

    Authors: Luisa März, Stefan Schweter, Nina Poerner, Benjamin Roth, Hinrich Schütze

    Abstract: We propose new methods for in-domain and cross-domain Named Entity Recognition (NER) on historical data for Dutch and French. For the cross-domain case, we address domain shift by integrating unsupervised in-domain data via contextualized string embeddings; and OCR errors by injecting synthetic OCR errors into the source domain and address data centric domain adaptation. We propose a general appro…

    Submitted 2 July, 2021; originally announced July 2021.

    Comments: 14 pages, 2 figures, 6 tables

  5. arXiv:1905.08920  [pdf, ps, other]

    cs.CL cs.LG stat.ML

    Domain adaptation for part-of-speech tagging of noisy user-generated text

    Authors: Luisa März, Dietrich Trautmann, Benjamin Roth

    Abstract: The performance of a Part-of-speech (POS) tagger is highly dependent on the domain of the processed text, and for many domains there is no or only very little training data available. This work addresses the problem of POS tagging noisy user-generated text using a neural network. We propose an architecture that trains an out-of-domain model on a large newswire corpus, and transfers those weights by…

    Submitted 21 May, 2019; originally announced May 2019.

    Comments: 6 pages, NAACL 2019