Chronicling Germany: An Annotated Historical Newspaper Dataset

Schultze, Christian; Kerkfeld, Niklas; Kuebart, Kara; Weber, Princilia; Wolter, Moritz; Selgert, Felix

arXiv:2401.16845 (cs)

[Submitted on 30 Jan 2024 (v1), last revised 7 Jun 2024 (this version, v2)]

Authors: Christian Schultze (1), Niklas Kerkfeld (1), Kara Kuebart (2), Princilia Weber (2), Moritz Wolter (1), Felix Selgert (2) ((1) High-Performance Computing and Analytics (HPCA)-Lab, Universität Bonn; (2) Institut für Geschichtswissenschaft, Universität Bonn)

Abstract: Correctly detecting article layout in historical newspaper pages remains challenging, yet it is important for Natural Language Processing (NLP) and machine learning applications in digital history. Digital newspaper portals typically provide Optical Character Recognition (OCR) text, albeit of varying quality. Layout information, however, is often missing, which limits the usefulness of this rich source. Our dataset is designed to address this issue for historic German-language newspapers. The Chronicling Germany dataset contains 581 annotated historical newspaper pages from the period between 1852 and 1924. Domain experts in history have spent more than 1,500 hours annotating the dataset. The paper presents a processing pipeline and uses it to establish baseline results on in-domain and out-of-domain test data. Both the dataset and the corresponding baseline code are freely available online. This work provides a starting point for future research in digital history and in the processing of historic German-language newspapers. It also offers an opportunity to study a low-resource task in computer vision.

Comments:	Dataset available at: this https URL . Baseline code: this https URL
Subjects:	Digital Libraries (cs.DL)
Cite as:	arXiv:2401.16845 [cs.DL]
	(or arXiv:2401.16845v2 [cs.DL] for this version)
	https://doi.org/10.48550/arXiv.2401.16845

Submission history

From: Moritz Wolter
[v1] Tue, 30 Jan 2024 09:39:04 UTC (18,859 KB)
[v2] Fri, 7 Jun 2024 15:42:52 UTC (47,344 KB)
