doi 10.1145/3604931

Named Entity Recognition and Classification on Historical Documents: A Survey

Authors: Maud Ehrmann, Ahmed Hamdi, Elvys Linhares Pontes, Matteo Romanello, Antoine Doucet

Abstract: After decades of massive digitisation, an unprecedented amount of historical documents is available in digital format, along with their machine-readable texts. While this represents a major step forward with respect to preservation and accessibility, it also opens up new opportunities in terms of content mining and the next fundamental challenge is to develop appropriate technologies to efficiently search, retrieve and explore information from this 'big data of the past'. Among semantic indexing opportunities, the recognition and classification of named entities are in great demand among humanities scholars. Yet, named entity recognition (NER) systems are heavily challenged with diverse, historical and noisy inputs. In this survey, we present the array of challenges posed by historical documents to NER, inventory existing resources, describe the main approaches deployed so far, and identify key priorities for future developments.

Submitted 23 September, 2021; originally announced September 2021.

Comments: 39 pages

ACM Class: A.1; I.2.7

Journal ref: ACM Computing Surveys 56-2 (2023) 1-47

arXiv:2002.06144 [pdf, other]

doi 10.46298/jdmdh.6107

Combining Visual and Textual Features for Semantic Segmentation of Historical Newspapers

Authors: Raphaël Barman, Maud Ehrmann, Simon Clematide, Sofia Ares Oliveira, Frédéric Kaplan

Abstract: The massive amounts of digitized historical documents acquired over the last decades naturally lend themselves to automatic processing and exploration. Research work seeking to automatically process facsimiles and extract information thereby are multiplying with, as a first essential step, document layout analysis. If the identification and categorization of segments of interest in document images have seen significant progress over the last years thanks to deep learning techniques, many challenges remain with, among others, the use of finer-grained segmentation typologies and the consideration of complex, heterogeneous documents such as historical newspapers. Besides, most approaches consider visual features only, ignoring textual signal. In this context, we introduce a multimodal approach for the semantic segmentation of historical newspapers that combines visual and textual features. Based on a series of experiments on diachronic Swiss and Luxembourgish newspapers, we investigate, among others, the predictive power of visual and textual features and their capacity to generalize across time and sources. Results show consistent improvement of multimodal models in comparison to a strong visual baseline, as well as better robustness to high material variance.

Submitted 14 December, 2020; v1 submitted 14 February, 2020; originally announced February 2020.

Journal ref: Journal of Data Mining & Digital Humanities, HistoInformatics, HistoInformatics (January 19, 2021) jdmdh:6107

arXiv:1309.6185 [pdf]

Acronym recognition and processing in 22 languages

Authors: Maud Ehrmann, Leonida della Rocca, Ralf Steinberger, Hristo Tanev

Abstract: We are presenting work on recognising acronyms of the form Long-Form (Short-Form) such as "International Monetary Fund (IMF)" in millions of news articles in twenty-two languages, as part of our more general effort to recognise entities and their variants in news text and to use them for the automatic analysis of the news, including the linking of related news across languages. We show how the acronym recognition patterns, initially developed for medical terms, needed to be adapted to the more general news domain and we present evaluation results. We describe our effort to automatically merge the numerous long-form variants referring to the same short-form, while keeping non-related long-forms separate. Finally, we provide extensive statistics on the frequency and the distribution of short-form/long-form pairs across languages.

Submitted 24 September, 2013; originally announced September 2013.

ACM Class: H.3.1; H.3.3; I.2.7; I.5.4

Journal ref: Proceedings of the 9th Conference 'Recent Advances in Natural Language Processing' (RANLP), pp. 237-244. Hissar, Bulgaria, 7-13 September 2013

Showing 1–3 of 3 results for author: Ehrmann, M