11institutetext: Khoury College of Computer Sciences
Northeastern University
Boston, MA 02115, U.S.A
11email: [email protected], 11email: [email protected]

Mind the Gap:
Analyzing Lacunae with
Transformer-Based Transcription

Jaydeep Borkar    David A. Smith 0000-0002-6636-6940
Abstract

Historical documents frequently suffer from damage and inconsistencies, including missing or illegible text resulting from issues such as holes, ink problems, and storage damage. These missing portions or gaps are referred to as lacunae. In this study, we employ transformer-based optical character recognition (OCR) models trained on synthetic data containing lacunae in a supervised manner. We demonstrate their effectiveness in detecting and restoring lacunae, achieving a success rate of 65%, compared to a base model lacking knowledge of lacunae, which achieves only 5% restoration. Additionally, we investigate the mechanistic properties of the model, such as the log probability of transcription, which can identify lacunae and other errors (e.g., mistranscriptions due to complex writing or ink issues) in line images without directly inspecting the image. This capability could be valuable for scholars seeking to distinguish images containing lacunae or errors from clean ones. Although we explore the potential of attention mechanisms in flagging lacunae and transcription errors, our findings suggest it is not a significant factor. Our work highlights a promising direction in utilizing transformer-based OCR models for restoring or analyzing damaged historical documents.

Keywords:
Optical Character Recognition Transformers Lacuna.
footnotetext: Accepted to the ICDAR 2024 Workshop on Computational Paleography, Athens, Greece, 2024.

1 Introduction

Many historical documents have reached us in incomplete form. Besides the larger losses of missing leaves or inscribed surfaces, existing pages and other carriers of writing often exhibit lacunae, gaps at the edge or in the middle of lines, or missing lines in the middle of a text. Moreover, when working with digital surrogates for historical documents, we may encounter gaps introduced by the imaging process, such as overly dark microfilm or poorly cropped photographs. Paleography, papyrology, and epigraphy have developed standards, such as the Leiden conventions, for distinguishing readable text, partial letters, abbreviated forms, and restored lacunae [7].

Building on progress in handwritten text recognition (HTR) [11] and language modeling [5], researchers have employed language technologies to good effect in the traditional philological task of hypothesizing readings for these lacunae when their locations are known [2, 4]. If asked to transcribe a new document, however, these restoration models do not know where any gaps in them might be. Furthermore, as language technologies become more powerful, transcription models may suggest readings that are not supported by input document images (or, similarly, sound recordings [9]).

This paper considers two related inference problems that arise when HTR models are confronted with images of a document that may contain lacunae. First, is the HTR model capable of accurately filling the gap left by the absence of visual evidence of writing? Second, does the model provide information about the presence of lacunae, so that users are aware of which parts of a transcript are inferred from context alone? We address the first question by measuring HTR transcription accuracy on lines with and without lacunae, evaluating models trained with various proportions of lacunae. We address the second question by training models to predict the presence of lacunae and other errors in text lines at inference time.

As in many other areas of natural language processing, transformer architectures are being utilized in OCR. We perform experimental evaluations with the TrOCR model, which combines a pre-trained vision transformer with a pre-trained language model decoder [10]. Besides state-of-the-art performance, transformer architectures employ an attention mechanism, which provides us with a way to probe the relationship between visual evidence and transcription output. To control the prevalence of lacunae experimentally, we identify the presence of individual characters in the training and test data using OCR and randomly assign some proportion of them to be missing.

This study reaches the following main findings:

  • Transformer models like TrOCR that are pre-trained on clean text struggle to transfer their knowledge effectively for inferring (the likely content of) lacunae in line images.

  • Adding examples with lacunae to TrOCR’s supervised training significantly enhances its rate of lacuna restoration from 5.6% to 65.85%.

  • Using TrOCR’s log probability of an HTR transcript as a predictor in a logistic regression model correctly flags lines containing lacunae 53% of the time and images with other errors 84% of the time.

  • Using a sample of the model’s attention weights to predict lacunae in input images and mistranscriptions in the output achieves accuracy not significantly higher than what was already reached using the log probability of the transcription. Nonetheless, we believe further experiments probing HTR models’ representations of their input would be useful.

After reviewing related work (§2), we describe the construction of the training and test data for our study and the inference procedures used in our experiments (§3). We then describe the experimental setup (§4) and discuss the results (§5). We conclude with a discussion of future directions for research on lacunose manuscripts (§6).

2 Related Work

Efforts have been made to restore lacunae through a pretrain-finetune framework [16]. In this approach, authors utilized self-supervised contrastive loss training, followed by supervised fine-tuning. This method showed significant improvement in the restoration of lacunae within printed and handwritten documents, both in English and Arabic languages. There have been numerous works in NLP in the domain of masked pretraining where the objective is to learn the masked representations from unlabeled data [5]. In the speech domain, Baevski et al. proposed wav2vec 2.0 framework that masks the speech input and solves a contrastive task over quantized speech representations by using self-supervised contrastive loss during pre-training and Connectionist Temporal Classification (CTC) loss during fine-tuning [3]. Damage and inconsistencies in historical documents, such as inking issues, often pose challenges when using modern OCR models for transcription [1]. Several OCR post-correction models have been proposed to correct the transcription of OCR models when dealing with historical documents[8, 6], including some for endangered languages[14, 15].

Transformer-based models have been widely used in language modeling tasks [12, 5, 13] and in optical character recognition to transcribe the text present in the image[10]. Transformer-based OCR models, such as TrOCR, use an encoder-decoder framework. In this framework, a pre-trained image transformer serves as the encoder, while a pre-trained text transformer acts as the decoder. TrOCR has demonstrated superior performance over current state-of-the-art models across various domains, including printed, handwritten, and scene text recognition tasks.

3 Approach

3.1 Dataset Creation

In this study, we create synthetic lacunae by utilizing line images from the IAM handwriting databasehttps://fki.tic.heia-fr.ch/databases/iam-handwriting-database, replicating real-world scenarios. We begin by extracting all bounding boxes along with their coordinates from the line image using Pytesseracthttps://github.com/tesseract-ocr/tesseract. Next, we randomly select a bounding box from the list and apply binary thresholding to that region with a threshold value of 10, rendering it completely white to create the lacuna. Gaps generated in this way resemble the lacunae introduced into binarized microfilm images.

Subsequently, we binarize the remaining portions of the line image to improve quality for optical character recognition. We explore two approaches for binarizing line images: static thresholding, which applies the same threshold to all images, and adaptive thresholding, which dynamically adjusts the threshold value for each image based on its pixel distribution and layout. Our findings reveal that static thresholding adversely affects the visual quality of non-lacuna characters. Therefore, we opt for adaptive thresholding, which preserves the visual integrity of non-lacuna characters. Consequently, we utilize adaptively thresholded binarized images for lacuna creation in our experiments.

We also process the ground-truth labels to address certain irregularities. Initially, characters of words that were originally separated by white spaces are recombined (e.g., “B B C” becomes “BBC”). Additionally, contractions that were inconsistently tokenized from the original data, separating them from the words, are reattached to the respective words (e.g., “He ’ll” becomes “He’ll”, “They ’ve” becomes “They’ve”). Figure 1 shows some samples of our line images containing lacunae.

Refer to caption
(a) Ground truth: nominating any more Labour life Peers.
Refer to caption
(b) nominating any more Labour life Peers.
Refer to caption
(c) Ground truth: Lancaster House despite the crisis which had.
Refer to caption
(d) Lancaster House despite the crisis which had
Refer to caption
(e) Ground truth: round a doll’s house.
Refer to caption
(f) round a doll’s house.
Figure 1: Line images (b, d, f) containing lacuna highlighted in bold.

3.2 Inference and Evaluation

To evaluate the performance of models on our datasets, we use the Character Error Rate (CER), a metric that is widely used to evaluate the effectiveness of optical character recognition. It measures the rate of character-level errors in the OCR output compared to the ground truth text. CER is given by

CER=Ninsertions+Ndeletions+NsubstitutionsNCERsubscript𝑁insertionssubscript𝑁deletionssubscript𝑁substitutions𝑁\text{CER}=\frac{N_{\text{insertions}}+N_{\text{deletions}}+N_{\text{% substitutions}}}{N}CER = divide start_ARG italic_N start_POSTSUBSCRIPT insertions end_POSTSUBSCRIPT + italic_N start_POSTSUBSCRIPT deletions end_POSTSUBSCRIPT + italic_N start_POSTSUBSCRIPT substitutions end_POSTSUBSCRIPT end_ARG start_ARG italic_N end_ARG

where N is the total number of characters in the ground truth, Ninsertionssubscript𝑁insertionsN_{\text{insertions}}italic_N start_POSTSUBSCRIPT insertions end_POSTSUBSCRIPT is the number of characters incorrectly inserted by the model, Ndeletionssubscript𝑁deletionsN_{\text{deletions}}italic_N start_POSTSUBSCRIPT deletions end_POSTSUBSCRIPT is the number of characters incorrectly deleted by the model, and Nsubstitutionssubscript𝑁substitutionsN_{\text{substitutions}}italic_N start_POSTSUBSCRIPT substitutions end_POSTSUBSCRIPT is the number of characters incorrectly substituted by the model. Let X𝑋Xitalic_X represent the line image, S𝑆Sitalic_S denote its text sequence, n𝑛nitalic_n signify the total number of words in S𝑆Sitalic_S, L𝐿Litalic_L denote the lacuna token, and L^^𝐿\hat{L}over^ start_ARG italic_L end_ARG symbolize the model’s predicted lacuna token. With S=(w1,w2,,wn)𝑆subscript𝑤1subscript𝑤2subscript𝑤𝑛S=(w_{1},w_{2},...,w_{n})italic_S = ( italic_w start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT , italic_w start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT , … , italic_w start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT ) as the text sequence comprising n𝑛nitalic_n words, we assess whether L^=L^𝐿𝐿\hat{L}=Lover^ start_ARG italic_L end_ARG = italic_L when L𝐿Litalic_L appears at position nisubscript𝑛𝑖n_{i}italic_n start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT in S𝑆Sitalic_S.

3.3 Baseline: Effects of Lacunae and Complex Writing on Log Probabilities

We aim to investigate the impact of lacunae and other factors, such as characters or words written in an unusual manner, on the log probability of transcription tokens. Log probability is a measure used to quantify the likelihood of a word or sequence of words occurring given some context. In the context of OCR, it represents the probability of predicting a new token given previously transcribed tokens. Let w1,w2,,wnsubscript𝑤1subscript𝑤2subscript𝑤𝑛w_{1},w_{2},...,w_{n}italic_w start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT , italic_w start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT , … , italic_w start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT be a sequence of n𝑛nitalic_n tokens in the transcription, w1:(i1)subscript𝑤:1𝑖1w_{1:(i-1)}italic_w start_POSTSUBSCRIPT 1 : ( italic_i - 1 ) end_POSTSUBSCRIPT be the sequence of tokens from the first token up to wisubscript𝑤𝑖w_{i}italic_w start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT, P(wi|w1:(i1))𝑃conditionalsubscript𝑤𝑖subscript𝑤:1𝑖1P(w_{i}|w_{1:(i-1)})italic_P ( italic_w start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT | italic_w start_POSTSUBSCRIPT 1 : ( italic_i - 1 ) end_POSTSUBSCRIPT ) be the conditional probability of observing token wisubscript𝑤𝑖w_{i}italic_w start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT given the context w1:(i1)subscript𝑤:1𝑖1w_{1:(i-1)}italic_w start_POSTSUBSCRIPT 1 : ( italic_i - 1 ) end_POSTSUBSCRIPT. The log probability of a token wisubscript𝑤𝑖w_{i}italic_w start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT given its context w1:(i1)subscript𝑤:1𝑖1w_{1:(i-1)}italic_w start_POSTSUBSCRIPT 1 : ( italic_i - 1 ) end_POSTSUBSCRIPT is:

Log Probability=logP(wi|w1:(i1))Log Probability𝑃conditionalsubscript𝑤𝑖subscript𝑤:1𝑖1\text{Log Probability}=\log P(w_{i}|w_{1:(i-1)})Log Probability = roman_log italic_P ( italic_w start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT | italic_w start_POSTSUBSCRIPT 1 : ( italic_i - 1 ) end_POSTSUBSCRIPT )

Our hypothesis suggests that the log probability would decrease for tokens affected by lacunae, as the model would exhibit uncertainty in its interpretation. Similarly, this hypothesis extends to tokens written in a complex manner, where the model may also struggle to accurately predict them thus resulting in lower log probabilities. To test our hypothesis, we train a logistic regression model using the minimum of the log probability (min(log probability)minlog probability\text{min}(\text{log probability})min ( log probability )) as a feature. The min(log probability)minlog probability\text{min}(\text{log probability})min ( log probability ) represents the lowest log probability observed among all transcription tokens. Let Pminsubscript𝑃minP_{\text{min}}italic_P start_POSTSUBSCRIPT min end_POSTSUBSCRIPT be the minimum of all log probabilities in the transcription tokens, then the logistic regression model can be represented as

Logistic Regression: y=σ(β0+β1Pmin)Logistic Regression: 𝑦𝜎subscript𝛽0subscript𝛽1subscript𝑃min\text{Logistic Regression: }y=\sigma(\beta_{0}+\beta_{1}\cdot P_{\text{min}})Logistic Regression: italic_y = italic_σ ( italic_β start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT + italic_β start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT ⋅ italic_P start_POSTSUBSCRIPT min end_POSTSUBSCRIPT )

where y𝑦yitalic_y represents the predicted outcome (e.g., lacuna or error detection), β0subscript𝛽0\beta_{0}italic_β start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT is the intercept, β1subscript𝛽1\beta_{1}italic_β start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT is the coefficient associated with the feature Pminsubscript𝑃minP_{\text{min}}italic_P start_POSTSUBSCRIPT min end_POSTSUBSCRIPT, and σ𝜎\sigmaitalic_σ is the logistic function. We report our findings in §5.3.

3.4 Hypothesis: Can Attention Detect Lacunae and Complex Writing?

In models such as TrOCR, cross-attention refers to the mechanism by which the models attend to different parts of the input line image while generating the output transcription. Cross-attention helps the model focus on the relevant regions of the input image at each step of the transcription process, enabling it to capture any contextual dependencies for accurate optical character recognition. Given an input image I𝐼Iitalic_I consisting of a sequence of visual features {v1,v2,,vn}subscript𝑣1subscript𝑣2subscript𝑣𝑛\{v_{1},v_{2},...,v_{n}\}{ italic_v start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT , italic_v start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT , … , italic_v start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT } representing different regions in the image, and a target transcription T𝑇Titalic_T consisting of a sequence of tokens {t1,t2,,tm}subscript𝑡1subscript𝑡2subscript𝑡𝑚\{t_{1},t_{2},...,t_{m}\}{ italic_t start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT , italic_t start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT , … , italic_t start_POSTSUBSCRIPT italic_m end_POSTSUBSCRIPT } representing the characters in the text, the cross-attention mechanism computes a set of attention scores {α1,α2,,αn}subscript𝛼1subscript𝛼2subscript𝛼𝑛\{\alpha_{1},\alpha_{2},...,\alpha_{n}\}{ italic_α start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT , italic_α start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT , … , italic_α start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT }, where each attention score αisubscript𝛼𝑖\alpha_{i}italic_α start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT indicates the relevance of the ithsuperscript𝑖𝑡i^{th}italic_i start_POSTSUPERSCRIPT italic_t italic_h end_POSTSUPERSCRIPT visual feature to the generation of the current token tjsubscript𝑡𝑗t_{j}italic_t start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT in the transcription.

We study a possible hypothesis of identifying any attention patterns in the cross-attention matrix to visually identify lacunae and complex writing. Our hypothesis suggests that when the model processes lacuna characters or complex handwriting lacking sufficient visual evidence, the model may allocate greater attention scores to other regions within the line image that provide clearer visual cues. Further, it can utilize this contextual information to infer the identity of the lacuna or other characters written in a complex way. To test this hypothesis, we train a logistic regression model using two features: min(log probability)minlog probability\text{min}(\text{log probability})min ( log probability ) (as in section 3.3) and attention entropy. attention entropy is a measure used to quantify the uncertainty or randomness in the distribution of attention scores. Entropy H𝐻Hitalic_H for an attention distribution p𝑝pitalic_p can be given as

Entropy H(p)=i=1npilog(pi)𝐻𝑝superscriptsubscript𝑖1𝑛subscript𝑝𝑖subscript𝑝𝑖H(p)=-\sum_{i=1}^{n}p_{i}\log(p_{i})italic_H ( italic_p ) = - ∑ start_POSTSUBSCRIPT italic_i = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_n end_POSTSUPERSCRIPT italic_p start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT roman_log ( italic_p start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT )

We investigate whether the attention entropy feature has a greater or lesser influence compared to our baseline log probability feature in identifying lacunas and characters with complex handwriting. We report our results related to attention in section 5.3. The reason we train separate logistic regression models for detecting lacuna and other errors is that sometimes the lacuna characters get higher log probability and lower uniform attention when transcribed correctly. Conversely, we see higher attention and lower log probability in the scenario where non-lacuna tokens get mistranscribed due to other errors. This phenomenon makes it a hard problem to solve and thus we decide to train separate models to disentangle these cases. For instance, while transcribing the line image from Figure 1(f), we find that the model correctly transcribes the lacuna token house and gives higher log probability (Figure 2a) and lower uniform attention (Figure 2b). Whereas for the non-lacuna tokens that are mistranscribed (such as record and ball), we see lower log probability and higher non-uniform attention for them.

Refer to caption
(a) Log Probability
Refer to caption
(b) Cross-attention
Figure 2: Log probabilities and cross-attention for the transcripted tokens from Figure 1(f) with the following ground truth transcription: round a doll’s house.

3.5 Training TrOCR in Leiden Conventions

The Leiden Conventions are adopted in many subfields of paleography to indicate the uncertainty of, and deviation from, the visual evidence for a transcription [7]. Under these conventions, missing or uncertain information, such as illegible characters or words, should be denoted by enclosing them within square brackets. Characters transcribed with partial evidence should have dots placed under them. For instance, using the Leiden conventions, the transcription for the line image in Figure 1(f) can be written as round a doll’s h[o]use.

We experimented with training TrOCR using the Leiden conventions in the ground truth transcriptions of line images. We annotated the transcriptions of 15% of the line images containing a lacuna in our training data by enclosing the lacuna token within square brackets. We used a customized loss function to penalize the model for failing to generate square brackets in positions where it should enclose the lacuna within square brackets. This approach aimed to train the model to recognize and appropriately handle lacunae within the transcriptions using the Leiden conventions. Let L𝐿Litalic_L be the original (cross-entropy) loss function. The custom weighted loss function Lweightedsubscript𝐿weightedL_{\text{weighted}}italic_L start_POSTSUBSCRIPT weighted end_POSTSUBSCRIPT is defined as:

Lweighted=L×(nbrackets+1)subscript𝐿weighted𝐿subscript𝑛brackets1L_{\text{weighted}}=L\times(n_{\text{brackets}}+1)italic_L start_POSTSUBSCRIPT weighted end_POSTSUBSCRIPT = italic_L × ( italic_n start_POSTSUBSCRIPT brackets end_POSTSUBSCRIPT + 1 )

where (nbracketssubscript𝑛bracketsn_{\text{brackets}}italic_n start_POSTSUBSCRIPT brackets end_POSTSUBSCRIPT+1) is the weight and nbracketssubscript𝑛bracketsn_{\text{brackets}}italic_n start_POSTSUBSCRIPT brackets end_POSTSUBSCRIPT is the number of occurrences of square brackets [][][ ] in the ground truth transcriptions. This equation ensures that the loss for each sample is multiplied by a weight that is proportional to the number of square brackets in its corresponding ground truth transcription. Samples with more square brackets will have higher weights, encouraging the model to learn the convention of generating square brackets around lacuna tokens.

4 Experiments

We use the Aachen’s partition of the IAM Handwriting datasethttps://github.com/jpuigcerver/Laia/tree/master/egs/iam where the training data has 6161 lines from 747 documents, validation data has 966 lines from 115 forms, and test data has 2915 lines from 336 forms. We generate lacuna versions corresponding to all line images in our dataset. Next, we train the following variations of models using the TrOCR base model (334M parameters).

  1. 1.

    Model trained solely on all binarized non-lacuna images (total 6161 images).

  2. 2.

    Model trained on all binarized images and their corresponding lacuna variations (total 6161*2 images).

  3. 3.

    Model trained on all binarized images in addition to 30% randomly selected lacuna images (total 8009 images).

  4. 4.

    Model trained on all binarized images supplemented with 15% randomly selected lacuna images (total 7085 images).

The first model is trained to assess whether the Transformer-based model handles lacunae without exposure to any lacuna images. TrOCR is pre-trained without any lacuna variations, so we were interested in evaluating its ability to leverage pre-training knowledge for inferring lacunae. Furthermore, we experiment with different proportions of lacuna images in the training data to observe their impact on model performance. We use the following hyperparameters for training: learning rate of 2e-5, weight decay of 0.0001, batch size of 4, and AdamW optimizer. We train our models for 20 epochs on four A100 40GB GPUs and pick the best model (with the lowest validation character error rate).

We examine the attention patterns and log probabilities specifically from the last layer (layer 12) and the first head. The TrOCR base model comprises 12 layers, with each layer containing 16 attention heads. However, in this study, we concentrate solely on the final layer and a single head. While we acknowledge the potential benefits of conducting a comprehensive analysis across all layers and heads, we opt for this focused approach for the current investigation.

5 Results

5.1 Visual Examples: Lacuna Transcription Improvement

In this section, we present images containing lacunae that were initially mistranscribed by the model trained only on non-lacuna images. However, our trained model, which is specifically trained on images with lacunae, successfully transcribes them accurately. We observe in Figure 3 and Figure 5 that multiple characters from a word form a lacuna, while in Figure 4, a single-character lacuna is evident. Remarkably, the model trained on lacuna images accurately transcribes all these variations.

Refer to caption

Ground truth: she checked back her instructions. It was

Non-lacuna model: the checked back her 1 truth. It was

lacuna model: she checked back her instructions. It was

Figure 3: Line image with a lacuna in instructions. Errors are highlighted in red.
Refer to caption

Ground truth: one of its most classic features to the Turkish rule under

Non-lacuna model: one of its most dc pictures to the Turkish rule under

lacuna model: one of its most classic features to the Turkish rule under

Figure 4: Line image with a lacuna in classic. Errors are highlighted in red.
Refer to caption

Ground truth: And the sequence in the text is different than

Non-lacuna model: And the sequence in the text is f frequent than

lacuna model: And the sequence in the text is different than

Figure 5: Line image with a lacuna in different. Errors are highlighted in red.

5.2 Performance Comparison of our Various Models

Model trained only on binarized non-lacuna images: The model trained only on clean binarized line images gives a validation CER of 7.46%, CER of 11.49% on the test set of non-lacuna images, and CER of 13.26% on a mixture of binarized non-lacuna and lacuna line images. Table 1a shows the model’s performance in predicting both the lacuna and non-lacuna characters in the dataset. We observe that the model accurately transcribes non-lacuna characters 77.82% of the time, while it achieves correct transcriptions for lacuna characters only 5.6% of the time. One potential explanation could be that during pre-training, the TrOCR model may not have encountered any text instances containing missing characters similar to lacunae.

Table 1: Confusion matrix for predicting (non-)lacuna characters
Lacuna Non-Lacuna
Correct 5.6% 77.82%
(157/2800) (17969/23089)
Incorrect 94.4% 22.18%
(2643/2800) (5120/23089)
(a) Only non-lacuna training
Lacuna Non-Lacuna
Correct 65.85% 75.38%
(1844/2800) (17406/23089)
Incorrect 34.15% 24.62%
(956/2800) (5683/23089)
(b) Non-lacuna + 100% lacuna training
Lacuna Non-Lacuna
Correct 62.25% 76.95%
(1743/2800) (17767/23089)
Incorrect 37.75% 23.05%
(1057/2800) (5322/23089)
(c) Non-lacuna + 30% lacuna training
Lacuna Non-Lacuna
Correct 60.14% 80.35%
(1684/2800) (18554/23089)
Incorrect 39.86% 19.65%
(1116/2800) (4535/23089)
(d) Non-lacuna + 15% lacuna training

Model trained on all binarized images and their corresponding lacuna variations: The model trained on clean images and their lacuna variations yields a validation CER of 10% and test CER of 13.75% on a mixture of non-lacuna and lacuna images. Table 1b shows the model’s performance in predicting both the lacuna and non-lacuna characters in the dataset. We observe a notable enhancement in model performance when training on lacuna images, increasing accuracy from 5.6% to 65.85%. Specifically, the trained model accurately identifies 1844 out of 2800 lacuna characters, a substantial improvement compared to the baseline model’s mere 157 correct predictions. Here, we can assert that training the model on labeled lacuna data teaches it to correctly recognize and restore any lacunae in the text. Another noteworthy observation is that training on lacuna images has a slight impact on the model’s accuracy with clean, non-lacuna characters, evidenced by a decrease from 77.82% to 75.38%. This tradeoff implies that while there’s an improvement in recognizing lacuna characters, there’s a minor compromise in accuracy for non-lacuna characters.

Model trained on all binarized images in addition to 30% randomly selected lacuna images : The model trained on 30% lacuna images gives a validation CER of 8.71% and a test CER of 13.04% on a mixture of lacuna and non-lacuna images. Table 1c shows the model’s performance in predicting both the lacuna and non-lacuna characters in the dataset. We observe a decline in the model’s ability to predict lacuna characters compared to when it’s trained on 100% of lacuna images. However, its accuracy in predicting non-lacuna characters increases, leading to a lower overall test CER.

Model trained on all binarized images in addition to 15% randomly selected lacuna images: The model trained on 15% lacuna images gives a validation CER of 7.83% and test CER of 11.11% on a mixture of lacuna and non-lacuna images. Table 1d shows the model’s performance in predicting both the lacuna and non-lacuna characters in the dataset. In line with the model trained on 30% lacuna images, there is a decrease in the accuracy of predicting lacuna characters, while the accuracy for non-lacuna characters shows improvement. This suggests that reducing the proportion of lacuna images in the training data significantly limits the model’s ability to learn about lacunae compared to training with 100% of lacuna images in the dataset. Another intriguing observation is the increase in the model’s accuracy in correctly predicting clean non-lacuna characters from 77.82% (accuracy of a model trained solely on binarized non-lacuna images) to 80.35%. This suggests that the inclusion of additional lacuna images in the training set contributes to enhancing performance on non-lacuna characters in line images. It is possible that initially, the model might incorrectly predict a clean non-lacuna character. However, exposure to multiple copies of the same image (lacuna and non-lacuna versions) could have assisted the model in transcribing that character correctly.

5.3 Results from our Log Probability and Attention Experiments

Our findings indicate that attention does not outperform the log probability baseline in identifying lacunae and other errors in the line images. Table 2 illustrates the efficacy of our logistic regression models trained on both log probability and attention entropy as features. We observe that attention entropy gets smaller feature importance coefficients compared to log probability in the classification of lacunae and transcription errors. The baseline models, using log probability as features, accurately predict whether a line image has any lacuna 53.25% of the time and errors 84.12% of the time. Figure 6 shows confusion matrices for all models trained to flag lacunae and errors in a line image. The model trained on log probability accurately identifies 396 non-lacuna images and 201 lacuna images. Similarly, it correctly flags 907 line images with transcription errors. Overall, log probability appears to be a robust metric to flag any lacunae and transcription errors in the line images.

In our analysis, we focus exclusively on the last layer and the first head of attention. However, it is essential to explore whether these findings remain consistent across all layers and heads, as well as with different attention metrics. Such an investigation could provide valuable insights into the robustness and generalizability of our results.

Table 2: Comparison of log probability and attention entropy features shows no difference in predictive accuracy.
Task Feature Importance Accuracy
Lacuna Log probability: -0.247 53.25%
Lacuna Log probability: -0.163 & Entropy: 0.001 53.25%
Other errors Log probability: -2.683 84.12%
Other errors Log probability: -2.688 & Entropy: 0.064 84.03%
Refer to caption
(a) Detecting lacunae with log probability and entropy features
Refer to caption
(b) Detecting lacunae with log probability feature
Refer to caption
(c) Detecting errors with log probability and entropy features
Refer to caption
(d) Detecting errors with log probability feature
Figure 6: Confusion Matrices for logistic regression models trained to flag lacuna and errors in line images with log probability and attention entropy as features.

5.4 Results from training using Leiden conventions

We observe that the model generates square brackets ([]) in the transcription but consistently fails to enclose lacuna characters within those brackets. For instance, instead of round a doll’s h[o]use it generates round a [] doll’s house. It may be worthwhile to investigate if increasing the proportion of training data containing transcriptions formatted with the Leiden conventions, coupled with using an alternative loss function, could enable the model to learn and generate the appropriate bracketing. It may also be worth exploring the interaction of these conventions with the tokenization of the decoder model. Putting brackets around whole output tokens, for instance, could still indicate the presence of a lacuna in a word without modifying output tokenization.

6 Conclusion

Transformer models like TrOCR can successfully learn to restore lacunae in line images when trained on datasets featuring such gaps. Log probability can be leveraged to flag lacunae and transcription errors (e.g., due to non-standard writing or inking issues). Our controlled experimental setup demonstrates the promising potential of transformer-based models for restoring damaged historical documents. Moreover, by studying mechanistic properties such as log probability, it is possible to infer whether a specific document or image contains damages or inconsistencies like lacunae or errors without explicitly looking at those documents or images. Looking at more fine-grained attention features over a wider range of model representations and model architectures would be an interesting area of future work. Expanding our capacity for model interpretability could prove invaluable to scholars seeking to incorporate computational models into a paleographical workflow. Extending our methodology to real historical documents with lacunae will be the focus of our future work.

References

  • [1] Arlitsch, K., Herbert, J.: Microfilm, paper, and ocr: Issues in newspaper digitization. the utah digital newspapers program 33(2), 59–67 (2004)
  • [2] Assael, Y., Sommerschield, T., Shillingford, B., Bordbar, M., Pavlopoulos, J., Chatzipanagiotou, M., Androutsopoulos, I., Prag, J., de Freitas, N.: Restoring and attributing ancient texts using deep neural networks. Nature 603(7900), 280–283 (Mar 2022), publisher: Nature Publishing Group
  • [3] Baevski, A., Zhou, Y., Mohamed, A., Auli, M.: wav2vec 2.0: A framework for self-supervised learning of speech representations. In: Larochelle, H., Ranzato, M., Hadsell, R., Balcan, M., Lin, H. (eds.) Advances in Neural Information Processing Systems. vol. 33, pp. 12449–12460. Curran Associates, Inc. (2020)
  • [4] Cowen-Breen, C., Brooks, C., Haubold, J., Graziosi, B.: Logion: Machine Learning for Greek Philology (May 2023), arXiv:2305.01099 [cs]
  • [5] Devlin, J., Chang, M.W., Lee, K., Toutanova, K.: BERT: Pre-training of deep bidirectional transformers for language understanding. In: Burstein, J., Doran, C., Solorio, T. (eds.) Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers). pp. 4171–4186. Association for Computational Linguistics, Minneapolis, Minnesota (Jun 2019)
  • [6] Dong, R., Smith, D.: Multi-input attention for unsupervised OCR correction. In: Gurevych, I., Miyao, Y. (eds.) Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). pp. 2363–2372. Association for Computational Linguistics, Melbourne, Australia (Jul 2018)
  • [7] van Groningen, B.A.: De signis criticis in edendo adhibendis. Mnemosyne 59(4), 362–365 (1932), https://www.jstor.org/stable/4426628, publisher: Brill
  • [8] Hämäläinen, M., Hengchen, S.: From the paft to the fiiture: a fully automatic NMT and word embeddings method for OCR post-correction. In: Mitkov, R., Angelova, G. (eds.) Proceedings of the International Conference on Recent Advances in Natural Language Processing (RANLP 2019). pp. 431–436. INCOMA Ltd., Varna, Bulgaria (Sep 2019)
  • [9] Koenecke, A., Choi, A.S.G., Mei, K., Schellmann, H., Sloane, M.: Careless Whisper: Speech-to-Text Hallucination Harms (Feb 2024), arXiv:2402.08021 [cs]
  • [10] Li, M., Lv, T., Chen, J., Cui, L., Lu, Y., Florencio, D., Zhang, C., Li, Z., Wei, F.: Trocr: Transformer-based optical character recognition with pre-trained models. Proceedings of the AAAI Conference on Artificial Intelligence 37(11), 13094–13102 (Jun 2023)
  • [11] Nockels, J., Gooding, P., Terras, M.: The implications of handwritten text recognition for accessing the past at scale. Journal of Documentation 80(7), 148–167 (Jan 2024). https://doi.org/10.1108/JD-09-2023-0183, https://doi.org/10.1108/JD-09-2023-0183
  • [12] Radford, A., Wu, J., Child, R., Luan, D., Amodei, D., Sutskever, I.: Language models are unsupervised multitask learners (2019)
  • [13] Raffel, C., Shazeer, N., Roberts, A., Lee, K., Narang, S., Matena, M., Zhou, Y., Li, W., Liu, P.J.: Exploring the limits of transfer learning with a unified text-to-text transformer. J. Mach. Learn. Res. 21, 140:1–140:67 (2020)
  • [14] Rijhwani, S., Anastasopoulos, A., Neubig, G.: OCR Post Correction for Endangered Language Texts. In: Webber, B., Cohn, T., He, Y., Liu, Y. (eds.) Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP). pp. 5931–5942. Association for Computational Linguistics, Online (Nov 2020)
  • [15] Rijhwani, S., Rosenblum, D., Anastasopoulos, A., Neubig, G.: Lexically aware semi-supervised learning for OCR post-correction. Transactions of the Association for Computational Linguistics 9, 1285–1302 (2021)
  • [16] Vogler, N., Allen, J., Miller, M., Berg-Kirkpatrick, T.: Lacuna reconstruction: Self-supervised pre-training for low-resource historical document transcription. In: Carpuat, M., de Marneffe, M.C., Meza Ruiz, I.V. (eds.) Findings of the Association for Computational Linguistics: NAACL 2022. pp. 206–216. Association for Computational Linguistics, Seattle, United States (Jul 2022)