Search | arXiv e-print repository

Self-supervised Pre-training of Text Recognizers

Abstract: In this paper, we investigate self-supervised pre-training methods for document text recognition. Nowadays, large unlabeled datasets can be collected for many research tasks, including text recognition, but it is costly to annotate them. Therefore, methods utilizing unlabeled data are researched. We study self-supervised pre-training methods based on masked label prediction using three different approaches -- Feature Quantization, VQ-VAE, and Post-Quantized AE. We also investigate joint-embedding approaches with VICReg and NT-Xent objectives, for which we propose an image shifting technique to prevent model collapse where it relies solely on positional encoding while completely ignoring the input image. We perform our experiments on historical handwritten (Bentham) and historical printed datasets mainly to investigate the benefits of the self-supervised pre-training techniques with different amounts of annotated target domain data. We use transfer learning as strong baselines. The evaluation shows that the self-supervised pre-training on data from the target domain is very effective, but it struggles to outperform transfer learning from closely related domains. This paper is one of the first researches exploring self-supervised pre-training in document text recognition, and we believe that it will become a cornerstone for future research in this area. We made our implementation of the investigated methods publicly available at https://github.com/DCGM/pero-pretraining.

Submitted 1 May, 2024; originally announced May 2024.

Comments: 18 pages, 6 figures, 4 tables, accepted to ICDAR24

arXiv:2308.11511 [pdf, other]

Mode Combinability: Exploring Convex Combinations of Permutation Aligned Models

Authors: Adrián Csiszárik, Melinda F. Kiss, Péter Kőrösi-Szabó, Márton Muntag, Gergely Papp, Dániel Varga

Abstract: We explore element-wise convex combinations of two permutation-aligned neural network parameter vectors $Θ_A$ and $Θ_B$ of size $d$. We conduct extensive experiments by examining various distributions of such model combinations parametrized by elements of the hypercube $[0,1]^{d}$ and its vicinity. Our findings reveal that broad regions of the hypercube form surfaces of low loss values, indicating that the notion of linear mode connectivity extends to a more general phenomenon which we call mode combinability. We also make several novel observations regarding linear mode connectivity and model re-basin. We demonstrate a transitivity property: two models re-based to a common third model are also linear mode connected, and a robustness property: even with significant perturbations of the neuron matchings the resulting combinations continue to form a working model. Moreover, we analyze the functional and weight similarity of model combinations and show that such combinations are non-vacuous in the sense that there are significant functional differences between the resulting models.

Submitted 22 August, 2023; originally announced August 2023.

arXiv:2306.05907 [pdf, other]

doi 10.1038/s41597-023-02484-6

2DeteCT -- A large 2D expandable, trainable, experimental Computed Tomography dataset for machine learning

Authors: Maximilian B. Kiss, Sophia B. Coban, K. Joost Batenburg, Tristan van Leeuwen, Felix Lucka

Abstract: Recent research in computational imaging largely focuses on developing machine learning (ML) techniques for image reconstruction, which requires large-scale training datasets consisting of measurement data and ground-truth images. However, suitable experimental datasets for X-ray Computed Tomography (CT) are scarce, and methods are often developed and evaluated only on simulated data. We fill this gap by providing the community with a versatile, open 2D fan-beam CT dataset suitable for developing ML techniques for a range of image reconstruction tasks. To acquire it, we designed a sophisticated, semi-automatic scan procedure that utilizes a highly-flexible laboratory X-ray CT setup. A diverse mix of samples with high natural variability in shape and density was scanned slice-by-slice (5000 slices in total) with high angular and spatial resolution and three different beam characteristics: A high-fidelity, a low-dose and a beam-hardening-inflicted mode. In addition, 750 out-of-distribution slices were scanned with sample and beam variations to accommodate robustness and segmentation tasks. We provide raw projection data, reference reconstructions and segmentations based on an open-source data processing pipeline.

Submitted 9 June, 2023; originally announced June 2023.

Journal ref: Scientific Data 10, 576 (2023)

arXiv:2302.06318 [pdf, other]

Towards Writing Style Adaptation in Handwriting Recognition

Authors: Jan Kohút, Michal Hradiš, Martin Kišš

Abstract: One of the challenges of handwriting recognition is to transcribe a large number of vastly different writing styles. State-of-the-art approaches do not explicitly use information about the writer's style, which may be limiting overall accuracy due to various ambiguities. We explore models with writer-dependent parameters which take the writer's identity as an additional input. The proposed models can be trained on datasets with partitions likely written by a single author (e.g. single letter, diary, or chronicle). We propose a Writer Style Block (WSB), an adaptive instance normalization layer conditioned on learned embeddings of the partitions. We experimented with various placements and settings of WSB and contrastively pre-trained embeddings. We show that our approach outperforms a baseline with no WSB in a writer-dependent scenario and that it is possible to estimate embeddings for new writers. However, domain adaptation using simple finetuning in a writer-independent setting provides superior accuracy at a similar computational cost. The proposed approach should be further investigated in terms of training stability and embedding regularization to overcome such a baseline.

Submitted 13 February, 2023; originally announced February 2023.

Comments: Submitted to ICDAR2023 conference

arXiv:2212.02135 [pdf, other]

SoftCTC -- Semi-Supervised Learning for Text Recognition using Soft Pseudo-Labels

Authors: Martin Kišš, Michal Hradiš, Karel Beneš, Petr Buchal, Michal Kula

Abstract: This paper explores semi-supervised training for sequence tasks, such as Optical Character Recognition or Automatic Speech Recognition. We propose a novel loss function – SoftCTC – which is an extension of CTC allowing to consider multiple transcription variants at the same time. This allows to omit the confidence based filtering step which is otherwise a crucial component of pseudo-labeling approaches to semi-supervised learning. We demonstrate the effectiveness of our method on a challenging handwriting recognition task and conclude that SoftCTC matches the performance of a finely-tuned filtering based pipeline. We also evaluated SoftCTC in terms of computational efficiency, concluding that it is significantly more efficient than a naïve CTC-based approach for training on multiple transcription variants, and we make our GPU implementation public.

Submitted 19 September, 2023; v1 submitted 5 December, 2022; originally announced December 2022.

Comments: 21 pages, 8 figures, 6 tables, accepted to International Journal on Document Analysis and Recognition (IJDAR)

MSC Class: 68T07; 68T10

arXiv:2201.09575 [pdf, other]

Importance of Textlines in Historical Document Classification

Authors: Martin Kišš, Jan Kohút, Karel Beneš, Michal Hradiš

Abstract: This paper describes a system prepared at Brno University of Technology for ICDAR 2021 Competition on Historical Document Classification, experiments leading to its design, and the main findings. The solved tasks include script and font classification, document origin localization, and dating. We combined patch-level and line-level approaches, where the line-level system utilizes an existing, publicly available page layout analysis engine. In both systems, neural networks provide local predictions which are combined into page-level decisions, and the results of both systems are fused using linear or log-linear interpolation. We propose loss functions suitable for weakly supervised classification problem where multiple possible labels are provided, and we propose loss functions suitable for interval regression in the dating task. The line-level system significantly improves results in script and font classification and in the dating task. The full system achieved 98.48 %, 88.84 %, and 79.69 % accuracy in the font, script, and location classification tasks respectively. In the dating task, our system achieved a mean absolute error of 21.91 years. Our system achieved the best results in all tasks and became the overall winner of the competition.

Submitted 30 March, 2022; v1 submitted 24 January, 2022; originally announced January 2022.

Comments: 13 pages, 7 figures, 5 tables

MSC Class: 68T07; 68T10

arXiv:2104.13037 [pdf, other]

doi 10.1007/978-3-030-86337-1_31

AT-ST: Self-Training Adaptation Strategy for OCR in Domains with Limited Transcriptions

Authors: Martin Kišš, Karel Beneš, Michal Hradiš

Abstract: This paper addresses text recognition for domains with limited manual annotations by a simple self-training strategy. Our approach should reduce human annotation effort when target domain data is plentiful, such as when transcribing a collection of single person's correspondence or a large manuscript. We propose to train a seed system on large scale data from related domains mixed with available annotated data from the target domain. The seed system transcribes the unannotated data from the target domain which is then used to train a better system. We study several confidence measures and eventually decide to use the posterior probability of a transcription for data selection. Additionally, we propose to augment the data using an aggressive masking scheme. By self-training, we achieve up to 55 % reduction in character error rate for handwritten data and up to 38 % on printed data. The masking augmentation itself reduces the error rate by about 10 % and its effect is better pronounced in case of difficult handwritten data.

Submitted 27 April, 2021; originally announced April 2021.

Comments: 15 pages, 6 figures, 5 tables

arXiv:2009.04399 [pdf, other]

Performance Analysis of FEM Solvers on Practical Electromagnetic Problems

Authors: Gergely Máté Kiss, Jan Kaska, Roberto André Henrique de Oliveira, Olena Rubanenko, Balázs Tóth

Abstract: The paper presents a comparative analysis of different commercial and academic software. The comparison aims to examine how the integrated adaptive grid refinement methodologies can deal with challenging, electromagnetic-field related problems. For this comparison, two benchmark problems were examined in the paper. The first example is a solution of an L-shape domain like test problem, which has a singularity at a certain point in the geometry. The second problem is an induction heated aluminum rod, whose accurate solution requires solving non-linear, coupled physical fields. The accurate solution of this problem requires applying adaptive mesh generation strategies or applying a very fine mesh in the electromagnetic domain, which can significantly increase the computational complexity. The results show that the fully-hp adaptive meshing strategies, which are integrated into Agros-suite, can significantly reduce the task's computational complexity compared to the automatic h-adaptivity, which is part of the examined, popular commercial solvers.

Submitted 4 September, 2020; originally announced September 2020.

MSC Class: G.1.10; ACM Class: G.1.10

arXiv:1907.01307 [pdf, other]

Brno Mobile OCR Dataset

Authors: Martin Kišš, Michal Hradiš, Oldřich Kodym

Abstract: We introduce the Brno Mobile OCR Dataset (B-MOD) for document Optical Character Recognition from low-quality images captured by handheld mobile devices. While OCR of high-quality scanned documents is a mature field where many commercial tools are available, and large datasets of text in the wild exist, no existing datasets can be used to develop and test document OCR methods robust to non-uniform lighting, image blur, strong noise, built-in denoising, sharpening, compression and other artifacts present in many photographs from mobile devices. This dataset contains 2 113 unique pages from random scientific papers, which were photographed by multiple people using 23 different mobile devices. The resulting 19 728 photographs of various visual quality are accompanied by precise positions and text annotations of 500k text lines. We further provide an evaluation methodology, including an evaluation server and a testset with non-public annotations. We provide a state-of-the-art text recognition baseline built on convolutional and recurrent neural networks trained with Connectionist Temporal Classification loss. This baseline achieves 2 %, 22 % and 73 % word error rates on easy, medium and hard parts of the dataset, respectively, confirming that the dataset is challenging. The presented dataset will enable future development and evaluation of document analysis for low-quality images. It is primarily intended for line-level text recognition, and can be further used for line localization, layout analysis, image restoration and text binarization.

Submitted 2 July, 2019; originally announced July 2019.

arXiv:1210.0330 [pdf]

doi 10.1016/j.pharmthera.2013.01.016

Structure and dynamics of molecular networks: A novel paradigm of drug discovery. A comprehensive review

Authors: Peter Csermely, Tamas Korcsmaros, Huba J. M. Kiss, Gabor London, Ruth Nussinov

Abstract: Despite considerable progress in genome- and proteome-based high-throughput screening methods and in rational drug design, the increase in approved drugs in the past decade did not match the increase of drug development costs. Network description and analysis not only give a systems-level understanding of drug action and disease complexity, but can also help to improve the efficiency of drug design. We give a comprehensive assessment of the analytical tools of network topology and dynamics. The state-of-the-art use of chemical similarity, protein structure, protein-protein interaction, signaling, genetic interaction and metabolic networks in the discovery of drug targets is summarized. We propose that network targeting follows two basic strategies. The central hit strategy selectively targets central nodes/edges of the flexible networks of infectious agents or cancer cells to kill them. The network influence strategy works against other diseases, where an efficient reconfiguration of rigid networks needs to be achieved by targeting the neighbors of central nodes or edges. It is shown how network techniques can help in the identification of single-target, edgetic, multi-target and allo-network drug target candidates. We review the recent boom in network methods helping hit identification, lead selection optimizing drug efficacy, as well as minimizing side-effects and drug toxicity. Successful network-based drug development strategies are shown through the examples of infections, cancer, metabolic diseases, neurodegenerative diseases and aging. Summarizing more than 1200 references we suggest an optimized protocol of network-aided drug development, and provide a list of systems-level hallmarks of drug quality. Finally, we highlight network-related drug development trends helping to achieve these hallmarks by a cohesive, global approach.

Submitted 11 May, 2013; v1 submitted 1 October, 2012; originally announced October 2012.

Comments: 76 pages, 23 Figures, 12 Tables and 1270 references

Journal ref: Pharmacology and Therapeutics 138:333-408 (2013)

Showing 1–10 of 10 results for author: Kišš, M