Showing 1–14 of 14 results for author: Dell, M

  1. arXiv:2406.15593  [pdf, other]

    cs.CL econ.GN

    News Deja Vu: Connecting Past and Present with Semantic Search

    Authors: Brevin Franklin, Emily Silcock, Abhishek Arora, Tom Bryan, Melissa Dell

    Abstract: Social scientists and the general public often analyze contemporary events by drawing parallels with the past, a process complicated by the vast, noisy, and unstructured nature of historical texts. For example, hundreds of millions of page scans from historical newspapers have been noisily transcribed. Traditional sparse methods for searching for relevant material in these vast corpora, e.g., with…

    Submitted 21 June, 2024; originally announced June 2024.

  2. arXiv:2406.15576  [pdf, other]

    cs.CL econ.GN

    Contrastive Entity Coreference and Disambiguation for Historical Texts

    Authors: Abhishek Arora, Emily Silcock, Leander Heldring, Melissa Dell

    Abstract: Massive-scale historical document collections are crucial for social science research. Despite increasing digitization, these documents typically lack unique cross-document identifiers for individuals mentioned within the texts, as well as individual identifiers from external knowledgebases like Wikipedia/Wikidata. Existing entity disambiguation methods often fall short in accuracy for historical…

    Submitted 21 June, 2024; originally announced June 2024.

  3. arXiv:2406.09490  [pdf, other]

    cs.CL econ.GN

    Newswire: A Large-Scale Structured Database of a Century of Historical News

    Authors: Emily Silcock, Abhishek Arora, Luca D'Amico-Wong, Melissa Dell

    Abstract: In the U.S. historically, local newspapers drew their content largely from newswires like the Associated Press. Historians argue that newswires played a pivotal role in creating a national identity and shared understanding of the world, but there is no comprehensive archive of the content sent over newswires. We reconstruct such an archive by applying a customized deep learning pipeline to hundred…

    Submitted 13 June, 2024; originally announced June 2024.

    Comments: arXiv admin note: text overlap with arXiv:2306.17810, arXiv:2308.12477

  4. arXiv:2310.10050  [pdf, other]

    cs.CV cs.CL econ.GN

    EfficientOCR: An Extensible, Open-Source Package for Efficiently Digitizing World Knowledge

    Authors: Tom Bryan, Jacob Carlson, Abhishek Arora, Melissa Dell

    Abstract: Billions of public domain documents remain trapped in hard copy or lack an accurate digitization. Modern natural language processing methods cannot be used to index, retrieve, and summarize their texts; conduct computational textual analyses; or extract information for statistical analyses, and these texts cannot be incorporated into language model training. Given the diversity and sheer quantity…

    Submitted 16 October, 2023; originally announced October 2023.

  5. arXiv:2309.00789  [pdf, other]

    cs.CL

    LinkTransformer: A Unified Package for Record Linkage with Transformer Language Models

    Authors: Abhishek Arora, Melissa Dell

    Abstract: Linking information across sources is fundamental to a variety of analyses in social science, business, and government. While large language models (LLMs) offer enormous promise for improving record linkage in noisy datasets, in many domains approximate string matching packages in popular software such as R and Stata remain predominant. These packages have clean, simple interfaces and can be easi…

    Submitted 24 June, 2024; v1 submitted 1 September, 2023; originally announced September 2023.

  6. arXiv:2308.12477  [pdf, other]

    cs.CL cs.CV econ.GN

    American Stories: A Large-Scale Structured Text Dataset of Historical U.S. Newspapers

    Authors: Melissa Dell, Jacob Carlson, Tom Bryan, Emily Silcock, Abhishek Arora, Zejiang Shen, Luca D'Amico-Wong, Quan Le, Pablo Querubin, Leander Heldring

    Abstract: Existing full text datasets of U.S. public domain newspapers do not recognize the often complex layouts of newspaper scans, and as a result the digitized content scrambles texts from articles, headlines, captions, advertisements, and other layout regions. OCR quality can also be low. This study develops a novel, deep learning pipeline for extracting full article texts from newspaper images and app…

    Submitted 23 August, 2023; originally announced August 2023.

  7. arXiv:2306.17810  [pdf, other]

    cs.CL econ.GN

    A Massive Scale Semantic Similarity Dataset of Historical English

    Authors: Emily Silcock, Melissa Dell

    Abstract: A diversity of tasks use language models trained on semantic similarity data. While there are a variety of datasets that capture semantic similarity, they are either constructed from modern web data or are relatively small datasets created in the past decade by human annotators. This study utilizes a novel source, newly digitized articles from off-copyright, local U.S. newspapers, to assemble a ma…

    Submitted 23 August, 2023; v1 submitted 30 June, 2023; originally announced June 2023.

  8. arXiv:2305.14672  [pdf, other]

    cs.CL cs.CV econ.GN

    Quantifying Character Similarity with Vision Transformers

    Authors: Xinmei Yang, Abhishek Arora, Shao-Yu Jheng, Melissa Dell

    Abstract: Record linkage is a bedrock of quantitative social science, as analyses often require linking data from multiple, noisy sources. Off-the-shelf string matching methods are widely used, as they are straightforward and cheap to implement and scale. Not all character substitutions are equally probable, and for some settings there are widely used handcrafted lists denoting which string substitutions ar…

    Submitted 23 May, 2023; originally announced May 2023.

  9. arXiv:2304.03464  [pdf, other]

    cs.CV cs.CL econ.GN

    Linking Representations with Multimodal Contrastive Learning

    Authors: Abhishek Arora, Xinmei Yang, Shao-Yu Jheng, Melissa Dell

    Abstract: Many applications require linking individuals, firms, or locations across datasets. Most widely used methods, especially in social science, do not employ deep learning, with record linkage commonly approached using string matching techniques. Moreover, existing methods do not exploit the inherently multimodal nature of documents. In historical record linkage applications, documents are typically n…

    Submitted 21 June, 2024; v1 submitted 6 April, 2023; originally announced April 2023.

  10. arXiv:2304.02737  [pdf, other]

    cs.CV cs.DL econ.GN

    Efficient OCR for Building a Diverse Digital History

    Authors: Jacob Carlson, Tom Bryan, Melissa Dell

    Abstract: Thousands of users consult digital archives daily, but the information they can access is unrepresentative of the diversity of documentary history. The sequence-to-sequence architecture typically used for optical character recognition (OCR) - which jointly learns a vision and language model - is poorly extensible to low-resource document collections, as learning a language-vision model requires ex…

    Submitted 5 April, 2023; originally announced April 2023.

  11. arXiv:2210.04261  [pdf, other]

    cs.CL

    Noise-Robust De-Duplication at Scale

    Authors: Emily Silcock, Luca D'Amico-Wong, Jinglin Yang, Melissa Dell

    Abstract: Identifying near duplicates within large, noisy text corpora has a myriad of applications that range from de-duplicating training datasets, reducing privacy risk, and evaluating test set leakage, to identifying reproduced news articles and literature within large corpora. Across these diverse applications, the overwhelming majority of work relies on N-grams. Limited efforts have been made to evalu…

    Submitted 24 April, 2024; v1 submitted 9 October, 2022; originally announced October 2022.

  12. arXiv:2103.15348  [pdf, other]

    cs.CV cs.AI

    LayoutParser: A Unified Toolkit for Deep Learning Based Document Image Analysis

    Authors: Zejiang Shen, Ruochen Zhang, Melissa Dell, Benjamin Charles Germain Lee, Jacob Carlson, Weining Li

    Abstract: Recent advances in document image analysis (DIA) have been primarily driven by the application of neural networks. Ideally, research outcomes could be easily deployed in production and extended for further investigation. However, various factors like loosely organized codebases and sophisticated model configurations complicate the easy reuse of important innovations by a wide audience. Though ther…

    Submitted 21 June, 2021; v1 submitted 29 March, 2021; originally announced March 2021.

    Comments: Accepted at ICDAR 2021, 16 pages, 6 figures, 2 tables

  13. arXiv:2010.01762  [pdf, other]

    cs.LG cs.CV stat.ML

    OLALA: Object-Level Active Learning for Efficient Document Layout Annotation

    Authors: Zejiang Shen, Jian Zhao, Melissa Dell, Yaoliang Yu, Weining Li

    Abstract: Document images often have intricate layout structures, with numerous content regions (e.g. texts, figures, tables) densely arranged on each page. This makes the manual annotation of layout datasets expensive and inefficient. These characteristics also challenge existing active learning methods, as image-level scoring and selection suffer from the overexposure of common objects. Inspired by recent…

    Submitted 29 March, 2021; v1 submitted 4 October, 2020; originally announced October 2020.

    Comments: 12 pages, 7 figures, 5 tables

  14. arXiv:2004.08686  [pdf, other]

    cs.CV

    A Large Dataset of Historical Japanese Documents with Complex Layouts

    Authors: Zejiang Shen, Kaixuan Zhang, Melissa Dell

    Abstract: Deep learning-based approaches for automatic document layout analysis and content extraction have the potential to unlock rich information trapped in historical documents on a large scale. One major hurdle is the lack of large datasets for training robust models. In particular, little training data exist for Asian languages. To this end, we present HJDataset, a Large Dataset of Historical Japanese…

    Submitted 18 April, 2020; originally announced April 2020.

    Comments: 8 pages, 8 figures, accepted at CVPR2020 Workshop on Text and Documents in the Deep Learning Era