-
DSG: An End-to-End Document Structure Generator
Authors:
Johannes Rausch,
Gentiana Rashiti,
Maxim Gusev,
Ce Zhang,
Stefan Feuerriegel
Abstract:
Information in industry, research, and the public sector is widely stored as rendered documents (e.g., PDF files, scans). Hence, to enable downstream tasks, systems are needed that map rendered documents onto a structured hierarchical format. However, existing systems for this task are limited by heuristics and are not end-to-end trainable. In this work, we introduce the Document Structure Generator (DSG), a novel system for document parsing that is fully end-to-end trainable. DSG combines a deep neural network for parsing (i) entities in documents (e.g., figures, text blocks, headers, etc.) and (ii) relations that capture the sequence and nested structure between entities. Unlike existing systems that rely on heuristics, our DSG is trained end-to-end, making it effective and flexible for real-world applications. We further contribute a new, large-scale dataset called E-Periodica comprising real-world magazines with complex document structures for evaluation. Our results demonstrate that our DSG outperforms commercial OCR tools and, on top of that, achieves state-of-the-art performance. To the best of our knowledge, our DSG system is the first end-to-end trainable system for hierarchical document parsing.
Submitted 13 October, 2023;
originally announced October 2023.
-
Numerical evaluation of the nonlinear Gribov-Levin-Ryskin-Mueller-Qiu evolution equations for nuclear parton distribution functions
Authors:
J. Rausch,
V. Guzey,
M. Klasen
Abstract:
We numerically study for the first time the nonlinear GLR-MQ evolution equations for nuclear parton distribution function (nPDFs) to next-to-leading order accuracy and quantify the impact of gluon recombination at small $x$. Using the nCTEQ15 nPDFs as input, we confirm the importance of the nonlinear corrections for small $x \lesssim 10^{-3}$, whose magnitude increases with a decrease of $x$ and an increase of the atomic number $A$. We find that at $x=10^{-5}$ and for heavy nuclei, after the upward evolution from $Q_0=2$ GeV to $Q=10$ GeV, the quark singlet $Ω(x,Q^2)$ and the gluon $G(x,Q^2)$ distributions become reduced by $9-15$%, respectively. The relative effect is much stronger for the downward evolution from $Q_0=10$ GeV to $Q=2$ GeV, where we find that $Ω(x,Q^2)$ is suppressed by 40%, while $G(x, Q^2)$ is enhanced by 140%. These trends propagate into the $F_2^A(x,Q^2)$ nuclear structure function and the $F_L^A(x,Q^2)$ longitudinal structure function, which after the downward evolution become reduced by 45% and enhanced by 80%, respectively. Our analysis indicates that the nonlinear effects are most pronounced in $F_L^A(x,Q^2)$ and are already quite sizable at $x \sim 10^{-3}$ for heavy nuclei. We have checked that our conclusions very weakly depend on the choice of input nPDFs. In particular, using the EPPS21 nPDFs as input, we obtain quantitatively similar results.
Submitted 3 March, 2023; v1 submitted 25 November, 2022;
originally announced November 2022.
-
Low-$Q^2$ elastic electron-proton scattering using a gas jet target
Authors:
Y. Wang,
J. C. Bernauer,
B. S. Schlimme,
P. Achenbach,
S. Aulenbacher,
M. Ball,
M. Biroth,
D. Bonaventura,
D. Bosnar,
P. Brand,
S. Caiazza,
M. Christmann,
E. Cline,
A. Denig,
M. O. Distler,
L. Doria,
P. Eckert,
A. Esser,
I. Friscic,
S. Gagneur,
J. Geimer,
S. Grieser,
P. Gulker,
P. Herrmann,
M. Hoek
, et al. (32 additional authors not shown)
Abstract:
In this paper, we describe an experiment measuring low-$Q^2$ elastic electron-proton scattering using a newly developed cryogenic supersonic gas jet target in the A1 three-spectrometer facility at the Mainz Microtron. We measured the proton electric form factor within the four-momentum transfer range of $0.01\le Q^2 \le 0.045(\text{GeV/c})^2$. The experiment showed consistent results with the existing measurements. The data we collected demonstrated the feasibility of the gas jet target and the potential of future scattering experiments using high-resolution spectrometers with this gas jet target.
Submitted 29 August, 2022;
originally announced August 2022.
-
TableParser: Automatic Table Parsing with Weak Supervision from Spreadsheets
Authors:
Susie Xi Rao,
Johannes Rausch,
Peter Egger,
Ce Zhang
Abstract:
Tables have been an ever-existing structure to store data. There exist now different approaches to store tabular data physically. PDFs, images, spreadsheets, and CSVs are leading examples. Being able to parse table structures and extract content bounded by these structures is of high importance in many applications. In this paper, we devise TableParser, a system capable of parsing tables in both native PDFs and scanned images with high precision. We have conducted extensive experiments to show the efficacy of domain adaptation in developing such a tool. Moreover, we create TableAnnotator and ExcelAnnotator, which constitute a spreadsheet-based weak supervision mechanism and a pipeline to enable table parsing. We share these resources with the research community to facilitate further research in this interesting direction.
Submitted 5 January, 2022;
originally announced January 2022.
-
Operation and characterization of a windowless gas jet target in high-intensity electron beams
Authors:
B. S. Schlimme,
S. Aulenbacher,
P. Brand,
M. Littich,
Y. Wang,
P. Achenbach,
M. Ball,
J. C. Bernauer,
M. Biroth,
D. Bonaventura,
D. Bosnar,
S. Caiazza,
M. Christmann,
E. Cline,
A. Denig,
M. O. Distler,
L. Doria,
P. Eckert,
A. Esser,
I. Friščić,
S. Gagneur,
J. Geimer,
S. Grieser,
P. Gülker,
P. Herrmann
, et al. (32 additional authors not shown)
Abstract:
A cryogenic supersonic gas jet target was developed for the MAGIX experiment at the high-intensity electron accelerator MESA. It will be operated as an internal, windowless target in the energy-recovering recirculation arc of the accelerator with different target gases, e.g., hydrogen, deuterium, helium, oxygen, argon, or xenon. Detailed studies have been carried out at the existing A1 multi-spectrometer facility at the electron accelerator MAMI. This paper focuses on the developed handling procedures and diagnostic tools, and on the performance of the gas jet target under beam conditions. Considering the special features of this type of target, it proves to be well suited for a new generation of high-precision electron scattering experiments at high-intensity electron accelerators.
Submitted 16 July, 2021; v1 submitted 27 April, 2021;
originally announced April 2021.
-
Online Active Model Selection for Pre-trained Classifiers
Authors:
Mohammad Reza Karimi,
Nezihe Merve Gürel,
Bojan Karlaš,
Johannes Rausch,
Ce Zhang,
Andreas Krause
Abstract:
Given $k$ pre-trained classifiers and a stream of unlabeled data examples, how can we actively decide when to query a label so that we can distinguish the best model from the rest while making a small number of queries? Answering this question has a profound impact on a range of practical scenarios. In this work, we design an online selective sampling approach that actively selects informative examples to label and outputs the best model with high probability at any round. Our algorithm can be used for online prediction tasks for both adversarial and stochastic streams. We establish several theoretical guarantees for our algorithm and extensively demonstrate its effectiveness in our experimental studies.
Submitted 17 April, 2021; v1 submitted 19 October, 2020;
originally announced October 2020.
-
A Principled Approach to Data Valuation for Federated Learning
Authors:
Tianhao Wang,
Johannes Rausch,
Ce Zhang,
Ruoxi Jia,
Dawn Song
Abstract:
Federated learning (FL) is a popular technique to train machine learning (ML) models on decentralized data sources. In order to sustain long-term participation of data owners, it is important to fairly appraise each data source and compensate data owners for their contribution to the training process. The Shapley value (SV) defines a unique payoff scheme that satisfies many desiderata for a data value notion. It has been increasingly used for valuing training data in centralized learning. However, computing the SV requires exhaustively evaluating the model performance on every subset of data sources, which incurs prohibitive communication cost in the federated setting. Besides, the canonical SV ignores the order of data sources during training, which conflicts with the sequential nature of FL. This paper proposes a variant of the SV amenable to FL, which we call the federated Shapley value. The federated SV preserves the desirable properties of the canonical SV while it can be calculated without incurring extra communication cost and is also able to capture the effect of participation order on data value. We conduct a thorough empirical study of the federated SV on a range of tasks, including noisy label detection, adversarial participant detection, and data summarization on different benchmark datasets, and demonstrate that it can reflect the real utility of data sources for FL and has the potential to enhance system robustness, security, and efficiency. We also report and analyze "failure cases" and hope to stimulate future research.
Submitted 14 September, 2020;
originally announced September 2020.
-
DocParser: Hierarchical Structure Parsing of Document Renderings
Authors:
Johannes Rausch,
Octavio Martinez,
Fabian Bissig,
Ce Zhang,
Stefan Feuerriegel
Abstract:
Translating renderings (e. g. PDFs, scans) into hierarchical document structures is extensively demanded in the daily routines of many real-world applications. However, a holistic, principled approach to inferring the complete hierarchical structure of documents is missing. As a remedy, we developed "DocParser": an end-to-end system for parsing the complete document structure - including all text elements, nested figures, tables, and table cell structures. Our second contribution is to provide a dataset for evaluating hierarchical document structure parsing. Our third contribution is to propose a scalable learning framework for settings where domain-specific data are scarce, which we address by a novel approach to weak supervision that significantly improves the document structure parsing performance. Our experiments confirm the effectiveness of our proposed weak supervision: Compared to the baseline without weak supervision, it improves the mean average precision for detecting document entities by 39.1 % and improves the F1 score of classifying hierarchical relations by 35.8 %.
Submitted 25 January, 2021; v1 submitted 5 November, 2019;
originally announced November 2019.
-
Development of large area focal plane detectors for MAGIX
Authors:
P. Gülker,
P. Achenbach,
S. Aulenbacher,
J. Bernauer,
S. Caiazza,
M. Christmann,
A. Denig,
S. Grieser,
A. -K. Hergemöller,
B. Hetz,
A. Khoukaz,
M. Klein,
T. Kolar,
M. Littich,
S. Lunkenheimer,
M. Mauch,
H. Merkel,
M. Mihovilovic,
J. Muller,
J. Rausch,
Y. Schelhaas,
S. Schlimme,
S. Sirca
Abstract:
MAGIX is a planned experiment that will be implemented at the upcoming accelerator MESA in Mainz. Due to its location in the energy-recovering lane of the accelerator beam-currents up to 1mA with a maximum energy of 105 MeV will be available for precision experiments. MAGIX itself consists of a jet-target and two magnetic spectrometers. Inside the spectrometers GEM-based detectors will be used in the focal plane for track reconstruction. The design goals for the detector modules are a spatial resolution of 50 um, a size of 1.20 m x 0.3 m and a minimal material budget. To accomplish these goals we started developing several GEM-prototypes to study different behaviors and techniques to optimize the final detector design. The GEM foils used are provided by CERN and are trained, stretched and framed in our laboratory. The readout is done with an SRS based system. In this contribution the requirements, achievements and the ongoing developments are presented.
Submitted 2 August, 2019; v1 submitted 13 June, 2019;
originally announced June 2019.
-
Living in Parallel Realities -- Co-Existing Schema Versions with a Bidirectional Database Evolution Language
Authors:
Kai Herrmann,
Hannes Voigt,
Andreas Behrend,
Jonas Rausch,
Wolfgang Lehner
Abstract:
We introduce end-to-end support of co-existing schema versions within one database. While it is state of the art to run multiple versions of a continuously developed application concurrently, it is hard to do the same for databases. In order to keep multiple co-existing schema versions alive; which are all accessing the same data set; developers usually employ handwritten delta code (e.g. views and triggers in SQL). This delta code is hard to write and hard to maintain: if a database administrator decides to adapt the physical table schema, all handwritten delta code needs to be adapted as well, which is expensive and error-prone in practice. In this paper, we present InVerDa: developers use the simple bidirectional database evolution language BiDEL, which carries enough information to generate all delta code automatically. Without additional effort, new schema versions become immediately accessible and data changes in any version are visible in all schema versions at the same time. InVerDa also allows for easily changing the physical table design without affecting the availability of co-existing schema versions. This greatly increases robustness (orders of magnitude less lines of code) and allows for significant performance optimization. A main contribution is the formal evaluation that each schema version acts like a common full-fledged database schema independently of the chosen physical table design.
Submitted 19 September, 2017; v1 submitted 19 August, 2016;
originally announced August 2016.