-
End-to-end information extraction in handwritten documents: Understanding Paris marriage records from 1880 to 1940
Authors:
Thomas Constum,
Lucas Preel,
Théo Larcher,
Pierrick Tranouez,
Thierry Paquet,
Sandra Brée
Abstract:
The EXO-POPP project aims to establish a comprehensive database comprising 300,000 marriage records from Paris and its suburbs, spanning the years 1880 to 1940, which are preserved in over 130,000 scans of double pages. Each marriage record may encompass up to 118 distinct types of information that require extraction from plain text. In this paper, we introduce the M-POPP dataset, a subset of the M-POPP database with annotations for full-page text recognition and information extraction in both handwritten and printed documents, which is now publicly available. We present a fully end-to-end architecture adapted from the DAN, designed to perform both handwritten text recognition and information extraction directly from page images without the need for explicit segmentation. We showcase the information extraction capabilities of this architecture by achieving a new state of the art for full-page information extraction on Esposalles, and we use this architecture as a baseline for the M-POPP dataset. We also assess and compare how different encoding strategies for named entities in the text affect the performance of jointly recognizing handwritten text and extracting information from full pages.
Submitted 30 April, 2024;
originally announced April 2024.
-
The GeoLifeCLEF 2023 Dataset to evaluate plant species distribution models at high spatial resolution across Europe
Authors:
Christophe Botella,
Benjamin Deneu,
Diego Marcos,
Maximilien Servajean,
Joaquim Estopinan,
Théo Larcher,
César Leblanc,
Pierre Bonnet,
Alexis Joly
Abstract:
The difficulty of measuring or predicting species community composition at fine spatio-temporal resolution and over large spatial scales severely hampers our ability to understand species assemblages and take appropriate conservation measures. Despite the progress in species distribution modeling (SDM) over the past decades, SDMs have only just begun to integrate high-resolution remote sensing data, and their predictions are still affected by many biases due to the heterogeneity of the available biodiversity observations, most often opportunistic presence-only data. We designed a European-scale dataset covering around ten thousand plant species to calibrate and evaluate SDM predictions of species composition in space and time at high spatial resolution (~ten meters), and their spatial transferability. For model training, we extracted and harmonized five million heterogeneous presence-only records from selected GBIF datasets and six thousand exhaustive presence-absence surveys, both sampled during 2017-2021. We associated species observations with diverse environmental rasters classically used in SDMs, as well as with 10 m resolution RGB and near-infrared satellite images and 20-year time series of climatic variables and satellite point values. The evaluation dataset is based on 22 thousand standardized presence-absence surveys separated from the training set with a spatial block hold-out procedure. The GeoLifeCLEF 2023 dataset is open access and the first benchmark for researchers aiming to improve the prediction of plant species composition at a very fine spatial grain and at continental scale. It is a space to explore new ways of combining massive and diverse species observations and environmental information at various scales. Innovative AI-based approaches, in particular, should be among the most interesting methods to experiment with on the GeoLifeCLEF 2023 dataset.
Submitted 7 August, 2023;
originally announced August 2023.
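The abstract above mentions that the evaluation surveys are separated from the training set with a spatial block hold-out procedure. The following is a minimal sketch of the general idea only, not the dataset authors' actual procedure: points are binned into a regular longitude/latitude grid and whole grid cells are held out, so that test points are never in the same cell as training points. The function name `spatial_block_split`, the block size, the test fraction, and the synthetic coordinates are all assumptions made for illustration.

```python
import numpy as np

def spatial_block_split(lon, lat, block_size_deg=1.0, test_fraction=0.2, seed=0):
    """Split points into train/test so that each grid block lands wholly in one set.

    block_size_deg and test_fraction are illustrative values, not taken
    from the GeoLifeCLEF 2023 dataset.
    """
    lon = np.asarray(lon, dtype=float)
    lat = np.asarray(lat, dtype=float)
    # Integer grid coordinates of the block containing each point.
    bx = np.floor(lon / block_size_deg).astype(np.int64)
    by = np.floor(lat / block_size_deg).astype(np.int64)
    blocks = np.stack([bx, by], axis=1)
    # Map each distinct (bx, by) pair to a single integer block id.
    unique_blocks, block_id = np.unique(blocks, axis=0, return_inverse=True)
    block_id = block_id.reshape(-1)
    # Randomly choose which blocks are held out for testing.
    rng = np.random.default_rng(seed)
    n_test = max(1, int(round(test_fraction * len(unique_blocks))))
    test_blocks = rng.choice(len(unique_blocks), size=n_test, replace=False)
    is_test = np.isin(block_id, test_blocks)
    return ~is_test, is_test, block_id

# Synthetic coordinates, roughly covering Europe (illustrative only).
rng = np.random.default_rng(42)
lon = rng.uniform(-10.0, 30.0, size=2000)
lat = rng.uniform(35.0, 70.0, size=2000)
train_mask, test_mask, block_id = spatial_block_split(lon, lat)

# No block contributes points to both sets.
shared = set(block_id[train_mask]) & set(block_id[test_mask])
print(len(shared))
```

Holding out whole blocks rather than individual points reduces the chance that a model is rewarded for memorizing nearby, spatially autocorrelated observations, which is the usual motivation for this kind of split.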
-
On identification of self-similar characteristics using the Tensor Train decomposition method with application to channel turbulence flow
Authors:
Thomas von Larcher,
Rupert Klein
Abstract:
A study on the application of the Tensor Train decomposition method to 3D direct numerical simulation data of channel turbulence flow is presented. The approach is validated with respect to compression rate and storage requirement. In tests with synthetic data, it is found that grid-aligned self-similar patterns are well captured, and the application to non-grid-aligned self-similarity also yields satisfactory results. It is observed that the shape of the input tensor significantly affects the compression rate. Applied to data of channel turbulent flow, the Tensor Train format allows for surprisingly high compression rates whilst ensuring low relative errors.
Submitted 25 August, 2017;
originally announced August 2017.
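To make the compression idea in the abstract above concrete, here is a minimal sketch of the standard TT-SVD scheme (sequential truncated SVDs) applied to a small synthetic tensor. It is not the authors' code: the function names `tt_svd` and `tt_to_full`, the tolerance, and the test tensor are assumptions chosen for illustration.

```python
import numpy as np

def tt_svd(tensor, rel_tol=1e-10):
    """Decompose `tensor` into Tensor Train cores via sequential truncated SVDs."""
    shape = tensor.shape
    d = len(shape)
    # Per-step truncation threshold, chosen so the accumulated error
    # stays below rel_tol * ||tensor|| in the Frobenius norm.
    delta = rel_tol * np.linalg.norm(tensor) / np.sqrt(max(d - 1, 1))
    cores = []
    rank = 1
    mat = tensor.reshape(rank * shape[0], -1)
    for k in range(d - 1):
        u, s, vt = np.linalg.svd(mat, full_matrices=False)
        # tail[i] is the norm of the singular values s[i:].
        tail = np.sqrt(np.cumsum(s[::-1] ** 2))[::-1]
        # Keep the smallest rank whose discarded tail is at most delta.
        new_rank = max(1, int(np.sum(tail > delta)))
        cores.append(u[:, :new_rank].reshape(rank, shape[k], new_rank))
        # Carry the remainder forward and expose the next mode.
        mat = (s[:new_rank, None] * vt[:new_rank]).reshape(new_rank * shape[k + 1], -1)
        rank = new_rank
    cores.append(mat.reshape(rank, shape[-1], 1))
    return cores

def tt_to_full(cores):
    """Contract TT cores back into a full tensor."""
    full = cores[0]
    for core in cores[1:]:
        full = np.tensordot(full, core, axes=([-1], [0]))
    return full[0, ..., 0]

# Synthetic 3D field built from two separable terms, so its TT ranks are small.
x = np.linspace(0.0, 1.0, 16)
tensor = (np.sin(x)[:, None, None] * np.cos(x)[None, :, None] * np.exp(x)[None, None, :]
          + x[:, None, None] * (x ** 2)[None, :, None] * np.ones(16)[None, None, :])

cores = tt_svd(tensor, rel_tol=1e-8)
approx = tt_to_full(cores)
rel_err = np.linalg.norm(tensor - approx) / np.linalg.norm(tensor)
stored = sum(core.size for core in cores)
print(rel_err, stored, tensor.size)
```

The ratio of `tensor.size` to `stored` is one simple measure of compression rate. Because the TT ranks depend on how the data are arranged into modes, reshaping the same data into a tensor of a different shape before calling `tt_svd` can change that ratio, which is consistent with the abstract's remark about the shape of the input tensor.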
-
Benchmarking in a rotating annulus: a comparative experimental and numerical study of baroclinic wave dynamics
Authors:
Miklos Vincze,
Sebastian Borchert,
Ulrich Achatz,
Thomas von Larcher,
Martin Baumann,
Claudia Hertel,
Sebastian Remmler,
Teresa Beck,
Kiril Alexandrov,
Christoph Egbers,
Jochen Froehlich,
Vincent Heuveline,
Stefan Hickel,
Uwe Harlander
Abstract:
The differentially heated rotating annulus is a widely studied tabletop-size laboratory model of the general mid-latitude atmospheric circulation. The two most relevant factors of cyclogenesis, namely rotation and meridional temperature gradient, are quite well captured in this simple arrangement. The radial temperature difference in the cylindrical tank and its rotation rate can be set so that the isothermal surfaces in the bulk tilt, leading to the formation of baroclinic waves. The signatures of these waves at the free water surface have been analyzed via infrared thermography in a wide range of rotation rates (keeping the radial temperature difference constant) and under different initial conditions. In parallel to the laboratory experiments, five groups of the MetStröm collaboration have conducted numerical simulations in the same parameter regime using different approaches and solvers, and applying different initial conditions and perturbations. The experimentally and numerically obtained baroclinic wave patterns have been evaluated and compared in terms of their dominant wave modes, spatio-temporal variance properties and drift rates. Thus certain "benchmarks" have been created that can later be used as test cases for atmospheric numerical model validation.
Submitted 12 March, 2014;
originally announced March 2014.
-
An experimental study of regime transitions in a differentially heated baroclinic annulus with flat and sloping bottom topographies
Authors:
Miklos Vincze,
Uwe Harlander,
Thomas von Larcher,
Christoph Egbers
Abstract:
A series of laboratory experiments has been carried out in a thermally driven rotating annulus to study the onset of baroclinic instability, using horizontal and uniformly sloping bottom topographies. Different wave flow regimes have been identified and their phase boundaries -- expressed in terms of appropriate non-dimensional parameters -- have been compared to the recent numerical results of \citet{thomas_slope}. In the flat bottom case, the numerically predicted alignment of the boundary between the axisymmetric and the regular wave flow regime was found to be consistent with the experimental results. However, once the sloping bottom end wall was introduced, the detected behaviour was qualitatively different from that of the simulations. This disagreement is thought to be the consequence of nonlinear wave-wave interactions that could not be resolved in the framework of the numerical study. This argument is supported by the observed development of interference vacillation in the runs with sloping bottom, a mixed flow state in which baroclinic wave modes exhibiting different drift rates and amplitudes can co-exist.
Submitted 2 September, 2013;
originally announced September 2013.