Search | arXiv e-print repository

Self-supervised Pre-training of Text Recognizers

Abstract: In this paper, we investigate self-supervised pre-training methods for document text recognition. Nowadays, large unlabeled datasets can be collected for many research tasks, including text recognition, but it is costly to annotate them. Therefore, methods utilizing unlabeled data are researched. We study self-supervised pre-training methods based on masked label prediction using three different a… ▽ More In this paper, we investigate self-supervised pre-training methods for document text recognition. Nowadays, large unlabeled datasets can be collected for many research tasks, including text recognition, but it is costly to annotate them. Therefore, methods utilizing unlabeled data are researched. We study self-supervised pre-training methods based on masked label prediction using three different approaches -- Feature Quantization, VQ-VAE, and Post-Quantized AE. We also investigate joint-embedding approaches with VICReg and NT-Xent objectives, for which we propose an image shifting technique to prevent model collapse where it relies solely on positional encoding while completely ignoring the input image. We perform our experiments on historical handwritten (Bentham) and historical printed datasets mainly to investigate the benefits of the self-supervised pre-training techniques with different amounts of annotated target domain data. We use transfer learning as strong baselines. The evaluation shows that the self-supervised pre-training on data from the target domain is very effective, but it struggles to outperform transfer learning from closely related domains. This paper is one of the first researches exploring self-supervised pre-training in document text recognition, and we believe that it will become a cornerstone for future research in this area. We made our implementation of the investigated methods publicly available at https://github.com/DCGM/pero-pretraining. △ Less

Submitted 1 May, 2024; originally announced May 2024.

Comments: 18 pages, 6 figures, 4 tables, accepted to ICDAR24

arXiv:2302.06318 [pdf, other]

Towards Writing Style Adaptation in Handwriting Recognition

Authors: Jan Kohút, Michal Hradiš, Martin Kišš

Abstract: One of the challenges of handwriting recognition is to transcribe a large number of vastly different writing styles. State-of-the-art approaches do not explicitly use information about the writer's style, which may be limiting overall accuracy due to various ambiguities. We explore models with writer-dependent parameters which take the writer's identity as an additional input. The proposed models… ▽ More One of the challenges of handwriting recognition is to transcribe a large number of vastly different writing styles. State-of-the-art approaches do not explicitly use information about the writer's style, which may be limiting overall accuracy due to various ambiguities. We explore models with writer-dependent parameters which take the writer's identity as an additional input. The proposed models can be trained on datasets with partitions likely written by a single author (e.g. single letter, diary, or chronicle). We propose a Writer Style Block (WSB), an adaptive instance normalization layer conditioned on learned embeddings of the partitions. We experimented with various placements and settings of WSB and contrastively pre-trained embeddings. We show that our approach outperforms a baseline with no WSB in a writer-dependent scenario and that it is possible to estimate embeddings for new writers. However, domain adaptation using simple finetuning in a writer-independent setting provides superior accuracy at a similar computational cost. The proposed approach should be further investigated in terms of training stability and embedding regularization to overcome such a baseline. △ Less

Submitted 13 February, 2023; originally announced February 2023.

Comments: Submitted to ICDAR2023 conference

arXiv:2302.06308 [pdf, other]

Finetuning Is a Surprisingly Effective Domain Adaptation Baseline in Handwriting Recognition

Authors: Jan Kohút, Michal Hradiš

Abstract: In many machine learning tasks, a large general dataset and a small specialized dataset are available. In such situations, various domain adaptation methods can be used to adapt a general model to the target dataset. We show that in the case of neural networks trained for handwriting recognition using CTC, simple finetuning with data augmentation works surprisingly well in such scenarios and that… ▽ More In many machine learning tasks, a large general dataset and a small specialized dataset are available. In such situations, various domain adaptation methods can be used to adapt a general model to the target dataset. We show that in the case of neural networks trained for handwriting recognition using CTC, simple finetuning with data augmentation works surprisingly well in such scenarios and that it is resistant to overfitting even for very small target domain datasets. We evaluated the behavior of finetuning with respect to augmentation, training data size, and quality of the pre-trained network, both in writer-dependent and writer-independent settings. On a large real-world dataset, finetuning provided an average relative CER improvement of 25 % with 16 text lines for new writers and 50 % for 256 text lines. △ Less

Submitted 13 February, 2023; originally announced February 2023.

Comments: Submitted to ICDAR2023 conference

arXiv:2212.02135 [pdf, other]

SoftCTC -- Semi-Supervised Learning for Text Recognition using Soft Pseudo-Labels

Authors: Martin Kišš, Michal Hradiš, Karel Beneš, Petr Buchal, Michal Kula

Abstract: This paper explores semi-supervised training for sequence tasks, such as Optical Character Recognition or Automatic Speech Recognition. We propose a novel loss function $\unicode{x2013}$ SoftCTC $\unicode{x2013}$ which is an extension of CTC allowing to consider multiple transcription variants at the same time. This allows to omit the confidence based filtering step which is otherwise a crucial co… ▽ More This paper explores semi-supervised training for sequence tasks, such as Optical Character Recognition or Automatic Speech Recognition. We propose a novel loss function $\unicode{x2013}$ SoftCTC $\unicode{x2013}$ which is an extension of CTC allowing to consider multiple transcription variants at the same time. This allows to omit the confidence based filtering step which is otherwise a crucial component of pseudo-labeling approaches to semi-supervised learning. We demonstrate the effectiveness of our method on a challenging handwriting recognition task and conclude that SoftCTC matches the performance of a finely-tuned filtering based pipeline. We also evaluated SoftCTC in terms of computational efficiency, concluding that it is significantly more efficient than a naïve CTC-based approach for training on multiple transcription variants, and we make our GPU implementation public. △ Less

Submitted 19 September, 2023; v1 submitted 5 December, 2022; originally announced December 2022.

Comments: 21 pages, 8 figures, 6 tables, accepted to International Journal on Document Analysis and Recognition (IJDAR)

MSC Class: 68T07; 68T10

arXiv:2201.09575 [pdf, other]

Importance of Textlines in Historical Document Classification

Authors: Martin Kišš, Jan Kohút, Karel Beneš, Michal Hradiš

Abstract: This paper describes a system prepared at Brno University of Technology for ICDAR 2021 Competition on Historical Document Classification, experiments leading to its design, and the main findings. The solved tasks include script and font classification, document origin localization, and dating. We combined patch-level and line-level approaches, where the line-level system utilizes an existing, publ… ▽ More This paper describes a system prepared at Brno University of Technology for ICDAR 2021 Competition on Historical Document Classification, experiments leading to its design, and the main findings. The solved tasks include script and font classification, document origin localization, and dating. We combined patch-level and line-level approaches, where the line-level system utilizes an existing, publicly available page layout analysis engine. In both systems, neural networks provide local predictions which are combined into page-level decisions, and the results of both systems are fused using linear or log-linear interpolation. We propose loss functions suitable for weakly supervised classification problem where multiple possible labels are provided, and we propose loss functions suitable for interval regression in the dating task. The line-level system significantly improves results in script and font classification and in the dating task. The full system achieved 98.48 %, 88.84 %, and 79.69 % accuracy in the font, script, and location classification tasks respectively. In the dating task, our system achieved a mean absolute error of 21.91 years. Our system achieved the best results in all tasks and became the overall winner of the competition. △ Less

Submitted 30 March, 2022; v1 submitted 24 January, 2022; originally announced January 2022.

Comments: 13 pages, 7 figures, 5 tables

MSC Class: 68T07; 68T10

arXiv:2104.13037 [pdf, other]

doi 10.1007/978-3-030-86337-1_31

AT-ST: Self-Training Adaptation Strategy for OCR in Domains with Limited Transcriptions

Authors: Martin Kišš, Karel Beneš, Michal Hradiš

Abstract: This paper addresses text recognition for domains with limited manual annotations by a simple self-training strategy. Our approach should reduce human annotation effort when target domain data is plentiful, such as when transcribing a collection of single person's correspondence or a large manuscript. We propose to train a seed system on large scale data from related domains mixed with available a… ▽ More This paper addresses text recognition for domains with limited manual annotations by a simple self-training strategy. Our approach should reduce human annotation effort when target domain data is plentiful, such as when transcribing a collection of single person's correspondence or a large manuscript. We propose to train a seed system on large scale data from related domains mixed with available annotated data from the target domain. The seed system transcribes the unannotated data from the target domain which is then used to train a better system. We study several confidence measures and eventually decide to use the posterior probability of a transcription for data selection. Additionally, we propose to augment the data using an aggressive masking scheme. By self-training, we achieve up to 55 % reduction in character error rate for handwritten data and up to 38 % on printed data. The masking augmentation itself reduces the error rate by about 10 % and its effect is better pronounced in case of difficult handwritten data. △ Less

Submitted 27 April, 2021; originally announced April 2021.

Comments: 15 pages, 6 figures, 5 tables

arXiv:2103.05489 [pdf, other]

doi 10.1007/978-3-030-86337-1_32

TS-Net: OCR Trained to Switch Between Text Transcription Styles

Authors: Jan Kohút, Michal Hradiš

Abstract: Users of OCR systems, from different institutions and scientific disciplines, prefer and produce different transcription styles. This presents a problem for training of consistent text recognition neural networks on real-world data. We propose to extend existing text recognition networks with a Transcription Style Block (TSB) which can learn from data to switch between multiple transcription style… ▽ More Users of OCR systems, from different institutions and scientific disciplines, prefer and produce different transcription styles. This presents a problem for training of consistent text recognition neural networks on real-world data. We propose to extend existing text recognition networks with a Transcription Style Block (TSB) which can learn from data to switch between multiple transcription styles without any explicit knowledge of transcription rules. TSB is an adaptive instance normalization conditioned by identifiers representing consistently transcribed documents (e.g. single document, documents by a single transcriber, or an institution). We show that TSB is able to learn completely different transcription styles in controlled experiments on artificial data, it improves text recognition accuracy on large-scale real-world data, and it learns semantically meaningful transcription style embedding. We also show how TSB can efficiently adapt to transcription styles of new documents from transcriptions of only a few text lines. △ Less

Submitted 13 February, 2023; v1 submitted 9 March, 2021; originally announced March 2021.

Journal ref: ICDAR 2021: Proceedings, Part IV 16 (pp. 478-493)

arXiv:2102.11838 [pdf, other]

Page Layout Analysis System for Unconstrained Historic Documents

Authors: Oldřich Kodym, Michal Hradiš

Abstract: Extraction of text regions and individual text lines from historic documents is necessary for automatic transcription. We propose extending a CNN-based text baseline detection system by adding line height and text block boundary predictions to the model output, allowing the system to extract more comprehensive layout information. We also show that pixel-wise text orientation prediction can be used… ▽ More Extraction of text regions and individual text lines from historic documents is necessary for automatic transcription. We propose extending a CNN-based text baseline detection system by adding line height and text block boundary predictions to the model output, allowing the system to extract more comprehensive layout information. We also show that pixel-wise text orientation prediction can be used for processing documents with multiple text orientations. We demonstrate that the proposed method performs well on the cBAD baseline detection dataset. Additionally, we benchmark the method on newly introduced PERO layout dataset which we also make public. △ Less

Submitted 23 February, 2021; originally announced February 2021.

Comments: Submitted to ICDAR2021 conference

arXiv:1907.01307 [pdf, other]

Brno Mobile OCR Dataset

Authors: Martin Kišš, Michal Hradiš, Oldřich Kodym

Abstract: We introduce the Brno Mobile OCR Dataset (B-MOD) for document Optical Character Recognition from low-quality images captured by handheld mobile devices. While OCR of high-quality scanned documents is a mature field where many commercial tools are available, and large datasets of text in the wild exist, no existing datasets can be used to develop and test document OCR methods robust to non-uniform… ▽ More We introduce the Brno Mobile OCR Dataset (B-MOD) for document Optical Character Recognition from low-quality images captured by handheld mobile devices. While OCR of high-quality scanned documents is a mature field where many commercial tools are available, and large datasets of text in the wild exist, no existing datasets can be used to develop and test document OCR methods robust to non-uniform lighting, image blur, strong noise, built-in denoising, sharpening, compression and other artifacts present in many photographs from mobile devices. This dataset contains 2 113 unique pages from random scientific papers, which were photographed by multiple people using 23 different mobile devices. The resulting 19 728 photographs of various visual quality are accompanied by precise positions and text annotations of 500k text lines. We further provide an evaluation methodology, including an evaluation server and a testset with non-public annotations. We provide a state-of-the-art text recognition baseline build on convolutional and recurrent neural networks trained with Connectionist Temporal Classification loss. This baseline achieves 2 %, 22 % and 73 % word error rates on easy, medium and hard parts of the dataset, respectively, confirming that the dataset is challenging. The presented dataset will enable future development and evaluation of document analysis for low-quality images. It is primarily intended for line-level text recognition, and can be further used for line localization, layout analysis, image restoration and text binarization. △ Less

Submitted 2 July, 2019; originally announced July 2019.

arXiv:1712.06352 [pdf, other]

CNN for IMU Assisted Odometry Estimation using Velodyne LiDAR

Authors: Martin Velas, Michal Spanel, Michal Hradis, Adam Herout

Abstract: We introduce a novel method for odometry estimation using convolutional neural networks from 3D LiDAR scans. The original sparse data are encoded into 2D matrices for the training of proposed networks and for the prediction. Our networks show significantly better precision in the estimation of translational motion parameters comparing with state of the art method LOAM, while achieving real-time pe… ▽ More We introduce a novel method for odometry estimation using convolutional neural networks from 3D LiDAR scans. The original sparse data are encoded into 2D matrices for the training of proposed networks and for the prediction. Our networks show significantly better precision in the estimation of translational motion parameters comparing with state of the art method LOAM, while achieving real-time performance. Together with IMU support, high quality odometry estimation and LiDAR data registration is realized. Moreover, we propose alternative CNNs trained for the prediction of rotational motion parameters while achieving results also comparable with state of the art. The proposed method can replace wheel encoders in odometry estimation or supplement missing GPS data, when the GNSS signal absents (e.g. during the indoor map**). Our solution brings real-time performance and precision which are useful to provide online preview of the map** results and verification of the map completeness in real time. △ Less

Submitted 18 December, 2017; originally announced December 2017.

arXiv:1709.02128 [pdf, other]

CNN for Very Fast Ground Segmentation in Velodyne LiDAR Data

Authors: Martin Velas, Michal Spanel, Michal Hradis, Adam Herout

Abstract: This paper presents a novel method for ground segmentation in Velodyne point clouds. We propose an encoding of sparse 3D data from the Velodyne sensor suitable for training a convolutional neural network (CNN). This general purpose approach is used for segmentation of the sparse point cloud into ground and non-ground points. The LiDAR data are represented as a multi-channel 2D signal where the hor… ▽ More This paper presents a novel method for ground segmentation in Velodyne point clouds. We propose an encoding of sparse 3D data from the Velodyne sensor suitable for training a convolutional neural network (CNN). This general purpose approach is used for segmentation of the sparse point cloud into ground and non-ground points. The LiDAR data are represented as a multi-channel 2D signal where the horizontal axis corresponds to the rotation angle and the vertical axis the indexes channels (i.e. laser beams). Multiple topologies of relatively shallow CNNs (i.e. 3-5 convolutional layers) are trained and evaluated using a manually annotated dataset we prepared. The results show significant improvement of performance over the state-of-the-art method by Zhang et al. in terms of speed and also minor improvements in terms of accuracy. △ Less

Submitted 7 September, 2017; originally announced September 2017.

Comments: ICRA 2018 submission

arXiv:1607.03305 [pdf, other]

doi 10.5244/C.29.30

Camera Elevation Estimation from a Single Mountain Landscape Photograph

Authors: Martin Cadik, Jan Vasicek, Michal Hradis, Filip Radenovic, Ondrej Chum

Abstract: This work addresses the problem of camera elevation estimation from a single photograph in an outdoor environment. We introduce a new benchmark dataset of one-hundred thousand images with annotated camera elevation called Alps100K. We propose and experimentally evaluate two automatic data-driven approaches to camera elevation estimation: one based on convolutional neural networks, the other on loc… ▽ More This work addresses the problem of camera elevation estimation from a single photograph in an outdoor environment. We introduce a new benchmark dataset of one-hundred thousand images with annotated camera elevation called Alps100K. We propose and experimentally evaluate two automatic data-driven approaches to camera elevation estimation: one based on convolutional neural networks, the other on local features. To compare the proposed methods to human performance, an experiment with 100 subjects is conducted. The experimental results show that both proposed approaches outperform humans and that the best result is achieved by their combination. △ Less

Submitted 12 July, 2016; originally announced July 2016.

Journal ref: In Xianghua Xie, Mark W. Jones, and Gary K. L. Tam, editors, Proceedings of the British Machine Vision Conference (BMVC), pages 30.1-30.12. BMVA Press, September 2015

arXiv:1605.00366 [pdf, other]

Compression Artifacts Removal Using Convolutional Neural Networks

Authors: Pavel Svoboda, Michal Hradis, David Barina, Pavel Zemcik

Abstract: This paper shows that it is possible to train large and deep convolutional neural networks (CNN) for JPEG compression artifacts reduction, and that such networks can provide significantly better reconstruction quality compared to previously used smaller networks as well as to any other state-of-the-art methods. We were able to train networks with 8 layers in a single step and in relatively short t… ▽ More This paper shows that it is possible to train large and deep convolutional neural networks (CNN) for JPEG compression artifacts reduction, and that such networks can provide significantly better reconstruction quality compared to previously used smaller networks as well as to any other state-of-the-art methods. We were able to train networks with 8 layers in a single step and in relatively short time by combining residual learning, skip architecture, and symmetric weight initialization. We provide further insights into convolution networks for JPEG artifact reduction by evaluating three different objectives, generalization with respect to training dataset size, and generalization with respect to JPEG quality level. △ Less

Submitted 2 May, 2016; originally announced May 2016.

Comments: To be published in WSCG 2016

arXiv:1602.07873 [pdf, other]

CNN for License Plate Motion Deblurring

Authors: Pavel Svoboda, Michal Hradis, Lukas Marsik, Pavel Zemcik

Abstract: In this work we explore the previously proposed approach of direct blind deconvolution and denoising with convolutional neural networks in a situation where the blur kernels are partially constrained. We focus on blurred images from a real-life traffic surveillance system, on which we, for the first time, demonstrate that neural networks trained on artificial data provide superior reconstruction q… ▽ More In this work we explore the previously proposed approach of direct blind deconvolution and denoising with convolutional neural networks in a situation where the blur kernels are partially constrained. We focus on blurred images from a real-life traffic surveillance system, on which we, for the first time, demonstrate that neural networks trained on artificial data provide superior reconstruction quality on real images compared to traditional blind deconvolution methods. The training data is easy to obtain by blurring sharp photos from a target system with a very rough approximation of the expected blur kernels, thereby allowing custom CNNs to be trained for a specific application (image content and blur range). Additionally, we evaluate the behavior and limits of the CNNs with respect to blur direction range and length. △ Less

Submitted 25 February, 2016; originally announced February 2016.

arXiv:1506.03995 [pdf, other]

Technical Report: Image Captioning with Semantically Similar Images

Authors: Martin Kolář, Michal Hradiš, Pavel Zemčík

Abstract: This report presents our submission to the MS COCO Captioning Challenge 2015. The method uses Convolutional Neural Network activations as an embedding to find semantically similar images. From these images, the most typical caption is selected based on unigram frequencies. Although the method received low scores with automated evaluation metrics and in human assessed average correctness, it is com… ▽ More This report presents our submission to the MS COCO Captioning Challenge 2015. The method uses Convolutional Neural Network activations as an embedding to find semantically similar images. From these images, the most typical caption is selected based on unigram frequencies. Although the method received low scores with automated evaluation metrics and in human assessed average correctness, it is competitive in the ratio of captions which pass the Turing test and which are assessed as better or equal to human captions. △ Less

Submitted 12 June, 2015; originally announced June 2015.

Comments: 3 pages

Showing 1–15 of 15 results for author: Hradiš, M