Showing 1–5 of 5 results for author: Bryan, T

  1. arXiv:2406.15593  [pdf, other]

    cs.CL econ.GN

    News Deja Vu: Connecting Past and Present with Semantic Search

    Authors: Brevin Franklin, Emily Silcock, Abhishek Arora, Tom Bryan, Melissa Dell

    Abstract: Social scientists and the general public often analyze contemporary events by drawing parallels with the past, a process complicated by the vast, noisy, and unstructured nature of historical texts. For example, hundreds of millions of page scans from historical newspapers have been noisily transcribed. Traditional sparse methods for searching for relevant material in these vast corpora, e.g., with…

    Submitted 21 June, 2024; originally announced June 2024.

  2. arXiv:2310.10050  [pdf, other]

    cs.CV cs.CL econ.GN

    EfficientOCR: An Extensible, Open-Source Package for Efficiently Digitizing World Knowledge

    Authors: Tom Bryan, Jacob Carlson, Abhishek Arora, Melissa Dell

    Abstract: Billions of public domain documents remain trapped in hard copy or lack an accurate digitization. Modern natural language processing methods cannot be used to index, retrieve, and summarize their texts; conduct computational textual analyses; or extract information for statistical analyses, and these texts cannot be incorporated into language model training. Given the diversity and sheer quantity…

    Submitted 16 October, 2023; originally announced October 2023.

  3. arXiv:2308.12477  [pdf, other]

    cs.CL cs.CV econ.GN

    American Stories: A Large-Scale Structured Text Dataset of Historical U.S. Newspapers

    Authors: Melissa Dell, Jacob Carlson, Tom Bryan, Emily Silcock, Abhishek Arora, Zejiang Shen, Luca D'Amico-Wong, Quan Le, Pablo Querubin, Leander Heldring

    Abstract: Existing full text datasets of U.S. public domain newspapers do not recognize the often complex layouts of newspaper scans, and as a result the digitized content scrambles texts from articles, headlines, captions, advertisements, and other layout regions. OCR quality can also be low. This study develops a novel, deep learning pipeline for extracting full article texts from newspaper images and app…

    Submitted 23 August, 2023; originally announced August 2023.

  4. arXiv:2304.02737  [pdf, other]

    cs.CV cs.DL econ.GN

    Efficient OCR for Building a Diverse Digital History

    Authors: Jacob Carlson, Tom Bryan, Melissa Dell

    Abstract: Thousands of users consult digital archives daily, but the information they can access is unrepresentative of the diversity of documentary history. The sequence-to-sequence architecture typically used for optical character recognition (OCR) - which jointly learns a vision and language model - is poorly extensible to low-resource document collections, as learning a language-vision model requires ex…

    Submitted 5 April, 2023; originally announced April 2023.

  5. arXiv:1608.01775  [pdf, ps, other]

    math.RT math.CO math.QA

    An iterative formula for the Kostka-Foulkes polynomials

    Authors: Timothee W. Bryan, Naihuan Jing

    Abstract: An iterative formula for the Kostka-Foulkes polynomials is given using the vertex operator realization of the Hall-Littlewood polynomials. The operational formula can handle large Kostka-Foulkes polynomials, and a stability property for the Kostka-Foulkes polynomials is shown. We also use our algorithm to give a formula of $K_{λμ}(t)$ for $μ$ being hook-shaped.

    Submitted 22 January, 2021; v1 submitted 5 August, 2016; originally announced August 2016.

    Comments: 10 pages; final version (revised version with new title. New hook-shaped formula and stability property are added)

    Journal ref: J. Algebr. Combin. 54 (2021), 625-634