Showing 1–2 of 2 results for author: Heldring, L

  1. arXiv:2406.15576  [pdf, other]

    cs.CL econ.GN

    Contrastive Entity Coreference and Disambiguation for Historical Texts

    Authors: Abhishek Arora, Emily Silcock, Leander Heldring, Melissa Dell

    Abstract: Massive-scale historical document collections are crucial for social science research. Despite increasing digitization, these documents typically lack unique cross-document identifiers for individuals mentioned within the texts, as well as individual identifiers from external knowledgebases like Wikipedia/Wikidata. Existing entity disambiguation methods often fall short in accuracy for historical…

    Submitted 21 June, 2024; originally announced June 2024.

  2. arXiv:2308.12477  [pdf, other]

    cs.CL cs.CV econ.GN

    American Stories: A Large-Scale Structured Text Dataset of Historical U.S. Newspapers

    Authors: Melissa Dell, Jacob Carlson, Tom Bryan, Emily Silcock, Abhishek Arora, Zejiang Shen, Luca D'Amico-Wong, Quan Le, Pablo Querubin, Leander Heldring

    Abstract: Existing full text datasets of U.S. public domain newspapers do not recognize the often complex layouts of newspaper scans, and as a result the digitized content scrambles texts from articles, headlines, captions, advertisements, and other layout regions. OCR quality can also be low. This study develops a novel, deep learning pipeline for extracting full article texts from newspaper images and app…

    Submitted 23 August, 2023; originally announced August 2023.