A Checklist to Publish Collections as Data in GLAM Institutions
Authors:
Gustavo Candela,
Nele Gabriëls,
Sally Chambers,
Thuy-An Pham,
Sarah Ames,
Neil Fitzgerald,
Katrine Hofmann,
Victor Harbo,
Abigail Potter,
Meghan Ferriter,
Eileen Manchester,
Alba Irollo,
Ellen Van Keer,
Mahendra Mahey,
Olga Holownia,
Milena Dobreva
Abstract:
Large-scale digitization in Galleries, Libraries, Archives and Museums (GLAM) created the conditions for providing access to collections as data. It opened new opportunities to explore, use and reuse digital collections. Strong proponents of collections as data are the Innovation Labs which provided numerous examples of publishing datasets under open licenses in order to reuse digital content in novel and creative ways. Within the current transition to the emerging data spaces, clouds for cultural heritage and open science, the need to identify practices which support more GLAM institutions to offer datasets becomes a priority, especially within the smaller and medium-sized institutions.
This paper responds to the need to support GLAM institutions in making the transition to publishing their digital content and in introducing collections as data services; this will also help them contribute efficiently to data spaces and cultural heritage clouds in the future. It offers a checklist that can be used for both creating and evaluating digital collections suitable for computational use. The main contributions of this paper are i) a methodology for devising a checklist to create and assess digital collections for computational use; ii) a checklist to create and assess digital collections suitable for use with computational methods; iii) the assessment of the checklist against the practice of institutions innovating in the collections as data field; and iv) the results obtained after the application and recommendations for the use of the checklist in GLAM institutions.
Submitted 13 November, 2023; v1 submitted 5 April, 2023;
originally announced April 2023.
The Newspaper Navigator Dataset: Extracting And Analyzing Visual Content from 16 Million Historic Newspaper Pages in Chronicling America
Authors:
Benjamin Charles Germain Lee,
Jaime Mears,
Eileen Jakeway,
Meghan Ferriter,
Chris Adams,
Nathan Yarasavage,
Deborah Thomas,
Kate Zwaard,
Daniel S. Weld
Abstract:
Chronicling America is a product of the National Digital Newspaper Program, a partnership between the Library of Congress and the National Endowment for the Humanities to digitize historic newspapers. Over 16 million pages of historic American newspapers have been digitized for Chronicling America to date, complete with high-resolution images and machine-readable METS/ALTO OCR. Of considerable interest to Chronicling America users is a semantified corpus, complete with extracted visual content and headlines. To accomplish this, we introduce a visual content recognition model trained on bounding box annotations of photographs, illustrations, maps, comics, and editorial cartoons collected as part of the Library of Congress's Beyond Words crowdsourcing initiative and augmented with additional annotations including those of headlines and advertisements. We describe our pipeline that utilizes this deep learning model to extract 7 classes of visual content: headlines, photographs, illustrations, maps, comics, editorial cartoons, and advertisements, complete with textual content such as captions derived from the METS/ALTO OCR, as well as image embeddings for fast image similarity querying. We report the results of running the pipeline on 16.3 million pages from the Chronicling America corpus and describe the resulting Newspaper Navigator dataset, the largest dataset of extracted visual content from historic newspapers ever produced. The Newspaper Navigator dataset, finetuned visual content recognition model, and all source code are placed in the public domain for unrestricted re-use.
Submitted 4 May, 2020;
originally announced May 2020.