CVIT, IIIT Hyderabad; email: [email protected]; email: ajoy.mondal, [email protected]

Advancing Question Answering on Handwritten Documents: A State-of-the-Art Recognition-Based Model for HW-SQuAD

Aniket Pal (ORCID 0000-0002-9971-8674), Ajoy Mondal (ORCID 0000-0002-4808-8860), C.V. Jawahar (ORCID 0000-0001-6767-7057)

Abstract

Question answering on handwritten documents is a challenging task with numerous real-world applications. This paper proposes a novel recognition-based approach that improves upon the previous state of the art on the HW-SQuAD and BenthamQA datasets. Our model incorporates transformer-based document retrieval and ensemble methods at the model level, achieving Exact Match scores of 82.02% and 92.55% on the HW-SQuAD and BenthamQA datasets, respectively, surpassing the previous best recognition-based approach by 10.89% and 26%. We also enhance the document retrieval component, boosting the top-5 retrieval accuracy from 90% to 95.30%. Our results demonstrate the significance of our proposed approach in advancing question answering on handwritten documents. The code and trained models will be publicly available to facilitate future research in this critical area of natural language processing.

Keywords:

Handwritten Question-Answering Recognition-based Ensemble BERT DeBERTa

1 Introduction

Question Answering [1, 2] has become an important task in NLP that aims to automatically provide correct answers to questions articulated in natural language. The ability to answer questions effectively with respect to a given context is widely useful, with applications including but not limited to information retrieval, knowledge management, and intelligent personal assistants. The Stanford Question Answering Dataset (SQuAD) [1] has become a widely used benchmark for evaluating the performance of QA systems on text-based documents. It consists of a collection of Wikipedia articles and crowd-sourced questions, where the answers are spans of text extracted from the corresponding articles.

However, answering questions on handwritten document images introduces additional challenges compared to traditional text-based QA. Handwritten documents exhibit complex layouts, varying writing styles, and noise and distortions, which make it difficult to recognize and extract relevant information. Furthermore, handwritten text recognition is itself difficult, requiring robust models that can handle the variability and ambiguity inherent in handwritten content.

To address these challenges, the HW-SQuAD dataset [3] was introduced as an extension of SQuAD to the handwritten domain. It consists of synthetic handwritten document images paired with questions and answers, where the answers are spans of text within the documents. This dataset has spurred research into developing QA systems that can effectively handle the complexities of handwritten documents. The two main approaches for tackling HW-SQuAD are recognition-based methods, which rely on accurate handwritten text recognition to convert the images into machine-readable text and apply traditional text-based QA techniques, and recognition-free methods, which directly process the handwritten images without explicit text recognition, leveraging visual features and spatial layout information to locate the answer spans.

Previous works on HW-SQuAD have explored both recognition-based and recognition-free approaches. Minesh et al. [3] proposed a recognition-based method that combines handwritten text recognition with a pre-trained language model for question answering. Their approach achieved an Exact Match score of 70% on the HW-SQuAD dataset. In the recognition-free domain, the same authors introduced a visual QA model that operates directly on the handwritten document images, achieving an accuracy (snippet extraction accuracy [3]) of 15.9%. Despite these advancements, there remains significant room for improvement in QA performance on handwritten documents.

The previous recognition-based model consists of two main components: the document retriever and the document reader. The document retriever employs the naive TF-IDF algorithm to rank and select relevant documents based on the question. In contrast, the document reader utilizes the BERT QA [4] model to extract the answer span from the retrieved documents.

In this paper, we propose a novel document retriever that combines the strengths of the TF-IDF algorithm and sentence transformers, resulting in significant improvements in document retrieval performance compared to the previous state-of-the-art. Furthermore, we introduce advanced pre-processing techniques that further enhance the accuracy of the retrieval process. We employ an ensemble approach for the document reader component that leverages the BERT QA model and other extractive QA models such as SpanBERT and DeBERTa, enabling more robust and accurate answer extraction from the retrieved documents.

The main contributions of our work are as follows:

•

We propose a novel document retriever that effectively combines the TF-IDF algorithm with sentence transformers, significantly improving the retrieval performance on the HW-SQuAD dataset.
•

We introduce advanced pre-processing techniques that further enhance the accuracy of the document retrieval process, enabling more precise selection of relevant documents for the given question.
•

We employ an ensemble approach for the document reader component, leveraging the strengths of multiple extractive QA models, including BERT and DeBERTa, to achieve more robust and accurate answer extraction.
•

Through extensive experiments, we demonstrate the superiority of our proposed approach, surpassing the previous state-of-the-art performance on the HW-SQuAD and BenthamQA datasets.

The remainder of this paper is organized as follows. Section 2 discusses related work on question answering and handwritten document processing. Section 3 describes our proposed recognition-based approach in detail, including the handwritten text recognition model, question-document matching mechanism, and enhanced retrieval method. Section 4 presents the experimental setup, dataset details, and evaluation metrics. Section 5 reports and analyzes the results of our experiments, comparing our approach with previous state-of-the-art methods. Section 6 provides a discussion on the implications of our findings, the limitations of our approach, and potential future research directions. Finally, Section 7 concludes the paper and summarizes our contributions.

2 Related Works

The natural language processing (NLP) and information retrieval (IR) communities have been actively researching machine reading comprehension (MRC) and open domain question answering (QA). The introduction of large-scale datasets like SQuAD [1], MS MARCO [5], and Natural Questions [6] has spurred the development of deep learning-based QA/MRC systems [4][7]-[8] capable of answering questions about a given text corpus or passage. Our work builds upon these advancements but focuses on answering questions on handwritten document images using recognition-based approaches.

Visual question answering (VQA) has gained significant attention in the computer vision community in recent years [10,31–35]. Early VQA datasets and methods often ignored text in images, treating the problem as multi-class classification with a fixed set of answers and emphasizing visual aspects like objects and attributes. However, Gurari et al. [9] showed that answering many questions asked by visually impaired individuals on their own images necessitates reading and interpreting the image text. This led to the creation of Scene Text VQA [10] and TextVQA [11] datasets, where reading image text is essential for answering questions. Our work differs from these tasks in two main aspects: (1) we focus on handwritten document images instead of "in the wild" images with scattered text tokens, and (2) we aim to answer questions on a collection of document images rather than a single image. Other relevant VQA works include VQA on charts and plots [12][13] and Textbook Question Answering (TQA) [11]. TQA seeks to answer questions given a context of text, diagrams, and images, but the textual information is provided in computer-readable format. For VQA on charts and plots, OCR is required to recognize the text and answer many questions. However, the text in these synthetically generated charts/plots is sparse and rendered in standard fonts, unlike the handwritten sentences and paragraphs in our case.
Our work is inspired by the DocVQA dataset [14], which includes a wide variety of document images containing printed, typewritten, handwritten, and born-digital text in the form of sentences, forms, tables, figures, and photographs. While DocVQA Task 1 adheres to the standard VQA setting with textual answers, we propose an improved recognition-based approach to answer questions on handwritten document collections like HW-SQuAD and BenthamQA. In the field of information retrieval and keyword spotting, there have been numerous efforts on handwritten document indexing and retrieval [15][16]. The ImageCLEF 2016 Handwritten Scanned Document Retrieval challenge [16] aimed to develop retrieval systems for handwritten documents. Although there are similarities between this challenge and our work, such as using multi-token queries and retrieving document segments or snippets, the document retrieval task differs in two aspects: (1) queries are search queries, not natural language questions, and (2) the task requires all query tokens to appear in the same order in the retrieved snippet. Kise et al. [17] tackled document retrieval for building a QA system on a collection of printed document images, which is likely the first work on QA over a document image collection. They used documents with machine-printed English text, which is relatively easier to recognize compared to handwritten text. Our work advances this line of research by proposing an improved recognition-based approach for QA on handwritten document collections like HW-SQuAD and BenthamQA. DocVQA Task 2 [14] is similar to our work in that both deal with QA over a document collection. However, DocVQA Task 2 uses a collection of US candidate registration forms with the same template, while we focus on handwritten documents with diverse content. Moreover, DocVQA Task 2 aims to retrieve all documents required to answer the question correctly, while our approach focuses on returning precise answer snippets.

In this work, we improve the recognition-based approach of [3] for question answering on handwritten document collections. We improve the Document Retriever by adding advanced preprocessing and a sentence transformer. In addition, we improve the Document Reader by using an ensemble method. Our proposed end-to-end recognition-based pipeline achieves state-of-the-art results on the HW-SQuAD and BenthamQA datasets.

3 Method

This section describes the methodology of our proposed end-to-end pipeline for question answering. Our problem is illustrated in Figure 1. The model is provided with the question and all documents, and must predict the answer. Here, the question “What is the role of teachers in education?” is fed to the model along with all the documents, and the model needs to predict the answer “facilitate student learning.” The model may be either recognition-based or recognition-free.

We follow the recognition-based architecture of [3]. This architecture comprises two parts: i) a Document Retriever and ii) a Document Reader. In [3], the Document Retriever comprises TF-IDF and preprocessing, which retrieve the top-k documents from the document collection, and the BERT Large model is used as the Document Reader.

In our proposed architecture, we improve both the Document Retriever and the Document Reader. For document retrieval, we combine TF-IDF with a sentence transformer and additional preprocessing techniques. For the Document Reader, we use an ensemble of two BERT large models and one DeBERTa model. Our proposed architecture significantly improves both document retrieval and reader performance.

This section is divided into two parts: first, we discuss our proposed Document Retriever, and then we discuss the Document Reader.

Figure 1: Overview of our problem statement. The question, along with all the documents, is fed to the model, which needs to predict the answer.

3.1 Document Retriever Module

In the document retriever component of our approach, we propose a novel technique that enhances the traditional TF-IDF algorithm by incorporating sentence transformers and advanced NLP preprocessing steps. TF-IDF is a well-established method in information retrieval that assigns weights to terms based on their frequency within a document and their rarity across the corpus. Let $V$ be the vocabulary of unique terms across all documents in $D_{\text{processed}}$ . The TF-IDF vectorization can be represented as follows:

Vectorization

2. Vectorizer Transformation:

• TF-IDF Vectorization: For a set of contexts $C=\{c_{1},c_{2},\ldots,c_{n}\}$, where each context $c_{i}$ is preprocessed, the TF-IDF (Term Frequency-Inverse Document Frequency) vectorization process converts the collection of text documents into a matrix representation, highlighting the importance of terms in each document.

a) Term Frequency (TF):

The term frequency $\text{TF}(t,d)$ of term $t$ in document $d$ is defined as:

\text{TF}(t,d)=\frac{f_{t,d}}{\sum_{t^{\prime}\in d}f_{t^{\prime},d}}

where $f_{t,d}$ is the raw count of term $t$ in document $d$ , and the denominator is the total count of all terms in the document $d$ .

b) Inverse Document Frequency (IDF):

The inverse document frequency $\text{IDF}(t,D)$ of term $t$ across a corpus $D$ (collection of all documents) is defined as:

\text{IDF}(t,D)=\log\left(\frac{N}{1+|\{d\in D:t\in d\}|}\right)

where $N$ is the total number of documents in the corpus $D$ , and $|\{d\in D:t\in d\}|$ is the number of documents containing the term $t$ . The term $1$ in the denominator is added to prevent division by zero.

c) TF-IDF Calculation:

The TF-IDF score for term $t$ in document $d$ is given by:

\text{TF-IDF}(t,d,D)=\text{TF}(t,d)\times\text{IDF}(t,D)

d) Document-Term Matrix:

For the set of contexts $C=\{c_{1},c_{2},\ldots,c_{n}\}$ , the TF-IDF vectorization can be represented as a document-term matrix $\mathbf{M}_{\text{TF-IDF}}$ , where each entry $\mathbf{M}_{\text{TF-IDF}}[i,j]$ represents the TF-IDF score of term $t_{j}$ in document $c_{i}$ :

\mathbf{M}_{\text{TF-IDF}}[i,j]=\text{TF-IDF}(t_{j},c_{i},C)

Thus, the TF-IDF vectorization process for the set of contexts $C$ is represented as:

\mathbf{M}_{\text{TF-IDF}}=\text{TF-IDF Vectorizer}(C)

where $\mathbf{M}_{\text{TF-IDF}}$ is the resulting document-term matrix.
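The TF, IDF, and document-term matrix definitions above can be checked on a toy corpus with a minimal sketch such as the following. It uses only the Python standard library and implements the formulas exactly as written, including the $1+|\{d\in D:t\in d\}|$ denominator. The function names and the toy corpus are illustrative assumptions and are not taken from the authors' code; library vectorizers may apply a different smoothing.

```python
import math
from collections import Counter

def tf(term, doc_tokens):
    # TF(t, d) = f_{t,d} / (total number of terms in d)
    counts = Counter(doc_tokens)
    return counts[term] / len(doc_tokens) if doc_tokens else 0.0

def idf(term, corpus):
    # IDF(t, D) = log(N / (1 + number of documents containing t))
    n_docs = len(corpus)
    doc_freq = sum(1 for doc in corpus if term in doc)
    return math.log(n_docs / (1 + doc_freq))

def tfidf_matrix(corpus):
    # Rows are documents (contexts), columns are vocabulary terms.
    vocab = sorted({t for doc in corpus for t in doc})
    matrix = [[tf(t, doc) * idf(t, corpus) for t in vocab] for doc in corpus]
    return vocab, matrix

# Hypothetical, already-tokenized toy contexts.
toy_corpus = [
    ["cia", "budget", "budget"],
    ["cia", "war"],
    ["debt", "crisis"],
    ["records", "rivalry"],
]
vocab, M = tfidf_matrix(toy_corpus)
```

One consequence of this particular IDF definition is worth keeping in mind: a term that occurs in $N-1$ of the $N$ documents receives an IDF of exactly zero, and a term that occurs in every document receives a slightly negative IDF.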

3. Transformer Encoding:

The Sentence Transformer model converts a collection of text documents into a matrix of dense vector representations. For a set of contexts $C=\{c_{1},c_{2},\ldots,c_{n}\}$ :

a) Context Embedding:

Each context $c_{i}\in C$ is passed through the Sentence Transformer model to obtain its dense vector representation (embedding). Let $\text{ST}(\cdot)$ denote the transformation function of the Sentence Transformer model. The embedding $\mathbf{e}_{c_{i}}$ for context $c_{i}$ is given by:

\mathbf{e}_{c_{i}}=\text{ST}(c_{i})

where $\mathbf{e}_{c_{i}}\in\mathbb{R}^{d}$ and $d$ is the dimensionality of the embedding space.

b) Embedding Matrix:

For the entire set of contexts $C$ , the Sentence Transformer model produces a matrix of embeddings. This matrix $\mathbf{E}_{C}$ is constructed by stacking the embeddings of all contexts:

\mathbf{E}_{C}=\begin{bmatrix}\mathbf{e}_{c_{1}}\\ \mathbf{e}_{c_{2}}\\ \vdots\\ \mathbf{e}_{c_{n}}\end{bmatrix}

where $\mathbf{E}_{C}\in\mathbb{R}^{n\times d}$ is the matrix of encoded vectors, with each row representing the embedding of a context.

Thus, the encoding process using the Sentence Transformer model for the set of contexts $C$ is represented as:

\mathbf{E}_{C}=\text{SentenceTransformer}(\text{model},C)

where $\mathbf{E}_{C}$ is the resulting matrix of encoded vectors obtained from the Sentence Transformer model.

Document Retrieval

Algorithm 1 Steps for Document Retrieval

1: Query Vectorization and Encoding

• Input: query $q$

• Preprocess the query:

q^{\prime}=\text{preprocess}(q)

• Vectorize using the same vectorizers:

\mathbf{q}_{\text{TF-IDF}}=\text{TF-IDF Vectorizer.transform}(q^{\prime})

• Encode using the transformer model:

\mathbf{e}_{q}=\text{SentenceTransformer.encode}(\text{model},q^{\prime})

2: Cosine Similarity Calculation

Compute cosine similarities between the query vectors and context matrices:

• TF-IDF Cosine Similarity:

\mathbf{s}_{\text{TF-IDF}}=\cos(\mathbf{q}_{\text{TF-IDF}},\mathbf{M}_{\text{TF-IDF}})

• Transformer Cosine Similarity:

\mathbf{s}_{\text{Transformer}}=\cos(\mathbf{e}_{q},\mathbf{E}_{C})

3: Ensemble Similarity Calculation

• Combine the similarity scores using predefined weights:

\mathbf{s}_{\text{ensemble}}=0.6\cdot\mathbf{s}_{\text{TF-IDF}}+0.4\cdot\mathbf{s}_{\text{Transformer}}

4: Top-N Document Retrieval

• Retrieve the indices of the top $n$ documents:

\text{top}_{n}=\text{argsort}(\mathbf{s}_{\text{ensemble}})[-n:]

• Retrieve the corresponding contexts:

\text{top}_{n}\text{ contexts}=[c_{i}\text{ for }i\text{ in top}_{n}]
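Steps 2 to 4 of Algorithm 1 can be sketched in plain Python as follows. This is an illustration under stated assumptions, not the authors' implementation: the query and context vectors are assumed to be precomputed, the toy vectors are invented, and the function names are hypothetical. The sketch returns indices best-first, whereas the `argsort(...)[-n:]` expression in the algorithm yields the same indices in ascending score order.

```python
import math

def cosine(u, v):
    # Cosine similarity of two equal-length vectors; 0.0 if either is all zeros.
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def retrieve_top_n(q_tfidf, m_tfidf, e_q, e_c, n=5, w_tfidf=0.6, w_st=0.4):
    # Weighted sum of the TF-IDF and sentence-transformer cosine similarities.
    scores = [
        w_tfidf * cosine(q_tfidf, row_t) + w_st * cosine(e_q, row_e)
        for row_t, row_e in zip(m_tfidf, e_c)
    ]
    # Indices of the n highest-scoring contexts, best first.
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return ranked[:n], scores

# Hypothetical toy vectors for three contexts.
m_tfidf = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]   # TF-IDF rows
e_c = [[0.0, 1.0], [1.0, 0.0], [1.0, 1.0]]       # embedding rows
top, scores = retrieve_top_n([1.0, 0.0], m_tfidf, [1.0, 0.0], e_c, n=2)
```

In this toy case the third context ranks first because it is moderately similar under both views, while the first context matches only on TF-IDF and the second only on the embedding.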

In addition to the integration of sentence transformers, we employ advanced NLP preprocessing techniques to further enhance the accuracy of the document retrieval process. These techniques include tokenization and stemming. Tokenization involves breaking down the text into individual words or tokens. Stemming reduces words to their base or root form, helping to normalize the text and handle variations of the same word. By applying these preprocessing steps, we aim to focus on the most informative terms and improve the precision of the retrieved documents.
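To make the two preprocessing steps concrete, here is a small sketch of tokenization followed by stemming. The text does not specify which tokenizer or stemmer is used, so the regular-expression tokenizer and the suffix-stripping rule below are deliberately simplistic stand-ins (the rule set is a toy, not the Porter algorithm or any library stemmer), and all names are hypothetical.

```python
import re

# Toy suffix list for illustration only; a real stemmer has many more rules.
SUFFIXES = ("ing", "ed", "es", "s")

def tokenize(text):
    # Lowercase the text and keep runs of letters and digits as tokens.
    return re.findall(r"[a-z0-9]+", text.lower())

def toy_stem(token):
    # Strip the first matching suffix, keeping a stem of at least 3 characters.
    for suffix in SUFFIXES:
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token

def preprocess(text):
    return [toy_stem(t) for t in tokenize(text)]

tokens = preprocess("Teachers facilitated learning.")
```

With this normalization, "Teachers" and "teacher" map to the same term, so a question and a context that use different inflections of a word can still match in the TF-IDF space.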

3.2 Document Reader Module

For the end-to-end recognition model, we propose an ensemble approach that combines the strengths of multiple extractive question answering models. Our ensemble consists of two BERT models with different initializations and one DeBERTa large model. BERT (Bidirectional Encoder Representations from Transformers) and DeBERTa (Decoding-enhanced BERT with Disentangled Attention) are both transformer-based models that have achieved state-of-the-art performance on various natural language processing tasks, including question answering.

Given an input sequence $x=[x_{1},x_{2},\dots,x_{n}]$ , each model generates a sequence of hidden representations $H=[h_{1},h_{2},\dots,h_{n}]$ through multiple layers of self-attention and feed-forward networks:

H_{\text{BERT}_{1}}=\text{BERT}_{1}(x)
H_{\text{BERT}_{2}}=\text{BERT}_{2}(x)
H_{\text{DeBERTa}}=\text{DeBERTa}(x)

(1)

For the question answering task, each model is fine-tuned to predict the start and end positions of the answer span within the context. The probability of a token $x_{i}$ being the start or end of the answer is computed using a softmax function over the hidden representations:

P_{\text{start}}^{(m)}(i)=\frac{\exp(W_{\text{start}}^{(m)}\cdot h_{i}^{(m)})}{\sum_{j=1}^{n}\exp(W_{\text{start}}^{(m)}\cdot h_{j}^{(m)})}
P_{\text{end}}^{(m)}(i)=\frac{\exp(W_{\text{end}}^{(m)}\cdot h_{i}^{(m)})}{\sum_{j=1}^{n}\exp(W_{\text{end}}^{(m)}\cdot h_{j}^{(m)})}

(2)

where $W_{\text{start}}^{(m)}$ and $W_{\text{end}}^{(m)}$ are learnable weight matrices for model $m\in\{\text{BERT}_{1},\text{BERT}_{2},\text{DeBERTa}\}$.
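The start and end distributions of Eq. (2) can be illustrated with a short sketch that takes per-token logits (the values $W\cdot h_{i}$) as plain lists. The softmax follows the equation directly. The span-selection rule, which picks the pair with the highest product of start and end probabilities subject to start ≤ end and a maximum length, is a common convention assumed here for illustration; the text does not state how spans are decoded. The logit values and the `max_len` parameter are hypothetical.

```python
import math

def softmax(logits):
    # Numerically stable softmax over a list of per-token logits.
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def best_span(start_logits, end_logits, max_len=30):
    # Assumed decoding rule: maximize P_start(i) * P_end(j) with i <= j.
    p_start, p_end = softmax(start_logits), softmax(end_logits)
    best, best_score = (0, 0), -1.0
    for i, ps in enumerate(p_start):
        for j in range(i, min(i + max_len, len(p_end))):
            score = ps * p_end[j]
            if score > best_score:
                best, best_score = (i, j), score
    return best, best_score

# Hypothetical logits for a four-token context.
span, score = best_span([0.1, 2.0, 0.3, 0.0], [0.0, 0.2, 3.0, 0.1])
```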

To combine the predictions from the three models, we employ an ensemble approach. The ensemble predictions are obtained by taking the union of the predicted answers from all models for each question. Let $A_{1},A_{2},A_{3}$ be the sets of predicted answers from the two BERT models and the DeBERTa model, respectively. The ensemble prediction $A_{\text{ensemble}}$ for a given question is computed as:

A_{\text{ensemble}}=A_{1}\cup A_{2}\cup A_{3}

(3)

The ensemble approach allows us to leverage the individual strengths of each model and potentially improve the overall performance compared to using a single model.

To evaluate the performance of the ensemble model, we use a custom evaluation function that calculates the number of correct, incorrect, and similar answers based on the ground truth. The function compares the predicted answers with the ground truth answers and categorizes them as follows:

• Correct: If any of the predicted answers match the ground truth exactly.
• Similar: If any of the predicted answers have a significant overlap with the ground truth.
• Incorrect: If none of the predicted answers match or have significant overlap with the ground truth.

The evaluation function also keeps track of the corresponding question and answer texts for each category.
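The union in Eq. (3) and the three-way categorization can be sketched as follows. The text does not define "significant overlap", so this sketch assumes a token-level Jaccard overlap of at least 0.5, and it assumes case-insensitive comparison after trimming whitespace. Both choices, the threshold value, and all function names are illustrative assumptions rather than the authors' evaluation function.

```python
def ensemble_union(*answer_lists):
    # A_ensemble = A_1 ∪ A_2 ∪ A_3, keeping first-seen order and
    # treating answers that differ only in case or outer spaces as equal.
    seen, merged = set(), []
    for answers in answer_lists:
        for answer in answers:
            key = answer.strip().lower()
            if key not in seen:
                seen.add(key)
                merged.append(answer)
    return merged

def jaccard(a, b):
    # Token-set overlap; an assumed stand-in for "significant overlap".
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / len(ta | tb) if ta | tb else 0.0

def categorize(predictions, truth, threshold=0.5):
    target = truth.strip().lower()
    if any(p.strip().lower() == target for p in predictions):
        return "correct"
    if any(jaccard(p, truth) >= threshold for p in predictions):
        return "similar"
    return "incorrect"

# Hypothetical predictions from three readers for one question.
preds = ensemble_union(["warm"], ["humid subtropical"], ["Warm"])
label = categorize(preds, "humid subtropical")
```

Because the ensemble keeps every distinct answer, a question counts as correct as soon as any one reader is right, which is why the union can only increase the number of correct matches relative to a single model under this evaluation.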

4 Experimental setup

We evaluate the performance of our proposed model on the BenthamQA and HW-SQuAD datasets. We create the context for our recognition-based pipeline by combining the OCR texts; the questions and answers are taken from the two datasets. After extracting the contexts, questions, and answers, we apply the basic preprocessing and then our proposed Document Retriever. To implement the recognition-based model, we convert the output to the SQuAD dataset format. The HW-SQuAD dataset contains over 84,000 QA pairs, while the BenthamQA dataset contains 200 pairs. For training, validation, and testing, we use the same split ratio as in [3]. We divide the BenthamQA dataset into training, validation, and test sets in an 80/10/10 ratio.

Our training was conducted on four NVIDIA RTX 2080 Ti graphics cards. AdamW is employed for BERT large, while the Adam optimizer is employed for the DeBERTa model. We employ two evaluation metrics for the Document Reader: the F1 score and Exact Match (EM) [1][3]. The F1 score is the harmonic mean of precision and recall:

F1=2\times\frac{precision\times recall}{precision+recall}

(4)

where:

•

Precision = $\frac{TP}{TP+FP}$
•

Recall = $\frac{TP}{TP+FN}$
•

TP: True Positives
•

FP: False Positives
•

FN: False Negatives
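For extractive QA, precision and recall are usually computed over the tokens shared between the predicted and ground-truth answers, following the SQuAD convention. The sketch below assumes that convention and a simple lowercase, whitespace-based tokenization; it is an illustration of Eq. (4), not the authors' evaluation script, and the function name is hypothetical.

```python
from collections import Counter

def qa_f1(prediction, ground_truth):
    pred_tokens = prediction.lower().split()
    gt_tokens = ground_truth.lower().split()
    # True positives: tokens shared by both answers (multiset intersection).
    tp = sum((Counter(pred_tokens) & Counter(gt_tokens)).values())
    if tp == 0:
        return 0.0
    precision = tp / len(pred_tokens)   # TP / (TP + FP)
    recall = tp / len(gt_tokens)        # TP / (TP + FN)
    return 2 * precision * recall / (precision + recall)

partial = qa_f1("student learning", "facilitate student learning")
```

In the partial-match example, precision is 1 and recall is 2/3, so the F1 score is 0.8 even though the Exact Match for the same prediction would be 0.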

The other metric, Exact Match (EM), is calculated as:

EM=\frac{1}{N}\sum_{i=1}^{N}I(predicted_{i}=ground\_truth_{i})

(5)

where:

•

$N$ : Total number of instances
•

$I$ : Indicator function that equals 1 if the predicted answer matches the ground truth, and 0 otherwise
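Eq. (5) amounts to counting exact string matches and dividing by the number of instances. A minimal sketch is given below. Comparing after lowercasing and trimming whitespace is an assumption made here for illustration, since the normalization applied before comparison is not specified in the text.

```python
def exact_match(predictions, ground_truths):
    # EM = (1/N) * sum of I(predicted_i == ground_truth_i)
    if len(predictions) != len(ground_truths):
        raise ValueError("predictions and ground_truths must have equal length")
    if not predictions:
        return 0.0
    hits = sum(
        p.strip().lower() == g.strip().lower()
        for p, g in zip(predictions, ground_truths)
    )
    return hits / len(predictions)

em = exact_match(["Embezzlement", "warm"], ["embezzlement", "humid subtropical"])
```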

The Document Retriever is assessed using top-5 accuracy, as in [3]. Top-5 accuracy is calculated as:

Top\_5\_Accuracy=\frac{1}{N}\sum_{i=1}^{N}I(ground\_truth_{i}\in top\_5\_predictions_{i})

(6)

where:

•

$N$ : Total number of instances
•

$I$ : Indicator function that equals 1 if the ground truth is present in the top 5 predictions, and 0 otherwise
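Eq. (6) can be computed with a few lines of Python once each question has a ranked list of retrieved document identifiers. The identifiers and the function name below are hypothetical, and the sketch assumes each question has a single ground-truth document.

```python
def top_k_accuracy(ground_truth_ids, ranked_ids, k=5):
    # Fraction of questions whose ground-truth document appears
    # among the first k retrieved documents.
    if not ground_truth_ids:
        return 0.0
    hits = sum(
        gt in ranked[:k] for gt, ranked in zip(ground_truth_ids, ranked_ids)
    )
    return hits / len(ground_truth_ids)

# Two hypothetical questions: the first hit is at rank 3, the second at rank 6.
acc = top_k_accuracy([3, 7], [[1, 2, 3, 4, 5, 6], [9, 8, 6, 5, 4, 7]])
```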

5 Results and Analysis

This section presents a three-part analysis of our experimental results. First, we investigate the effect of TF-IDF and sentence transformers on document retrieval performance. Second, we evaluate the proposed Document Reader’s performance on the HW-SQuAD and BenthamQA datasets. Finally, we conduct ablation studies to examine the impact of critical components on overall QA performance and provide an error analysis to identify common errors and propose mitigation strategies. This section offers insights into the effectiveness of our approach and its performance on benchmark datasets, contributing to a better understanding of question answering on handwritten documents.

5.1 Impact of Sentence Transformers and NLP Preprocessing on Document Retrieval

In this subsection, we discuss the effect of incorporating sentence transformers and NLP preprocessing techniques alongside the TF-IDF algorithm for document retrieval. The previous model’s document retrieval component relied solely on TF-IDF, achieving a top-5 accuracy of approximately 90%. By integrating NLP preprocessing and sentence transformers, we aim to enhance the retrieval performance. Table 1 presents the results obtained on the HW-SQuAD and BenthamQA datasets after applying these modifications.

The results demonstrate that the inclusion of sentence transformers and NLP preprocessing techniques leads to a significant improvement in document retrieval accuracy. On the HW-SQuAD dataset, our approach achieves a top-5 accuracy of 95.3%, surpassing the previous model’s performance by 4.83%. Similarly, on the BenthamQA dataset, we observe an increase in top-5 accuracy from 98.5% to 99.5%. These findings highlight the effectiveness of leveraging sentence transformers and NLP preprocessing in capturing semantic similarities between the question and the document, enabling more precise retrieval of relevant documents.

5.1.1 Comparison of the two retrieval methods on a real-world example

In Fig. 2, we compare the two techniques side by side on an example. We take one ground truth context (depicted at the top of the figure) and the corresponding top-5 contexts retrieved by the two models. The common contexts are highlighted in light orange and the improved retrieved contexts in light green. Although both the combined approach (TF-IDF and sentence transformer) and the TF-IDF-only method successfully retrieve the ground truth context as their top result, the old approach lacks semantic relevance and thematic consistency. We give a detailed analysis below.

Semantic relevance of the retrieved contexts: The combined approach appears to retrieve contexts that are more semantically related to the ground truth context and the general theme of the CIA and intelligence agencies. For example, the second context discusses the CIA’s lack of intelligence-gathering abilities during the Korean War, while the third context mentions the CIA and FBI’s missed opportunities to prevent the 9/11 attacks. These contexts, although not directly related to the CIA’s budget, are still relevant to the overall topic of the CIA and its performance.

Thematic consistency of the retrieved contexts: The combined approach demonstrates better thematic consistency among the retrieved contexts. Apart from the fifth context (which appears to be an outlier), the other contexts are related to the CIA, intelligence agencies, or historical events involving them. This suggests that the combined approach is more effective at capturing the overall theme of the query.

Reduced Reliance on Exact Keyword Matching: The combined approach reduces the reliance on exact keyword matching by leveraging sentence transformers and count vectors. This allows for the retrieval of relevant contexts that may use synonyms, paraphrases, or related terms instead of the exact keywords present in the query. By capturing the semantic similarity between words and phrases, the combined approach can identify relevant contexts that would be missed by a purely keyword-based method like TF-IDF.

In contrast, the TF-IDF-only method has several disadvantages compared to the combined approach. It relies solely on keyword matching, lacking the ability to capture semantic meaning and contextual relationships between words. As a result, the retrieved contexts may be less relevant to the query, as seen in the examples of the commercial rivalry between “RCA Victor” and “Columbia Records” or the “European debt” crisis, which are unrelated to the “CIA” or its budget. The TF-IDF-only method also shows less thematic consistency, retrieving contexts spanning various disconnected topics based on keyword overlap alone. This lack of coherence can lead to a less focused and relevant set of results. Furthermore, the method is sensitive to keyword variations and may struggle to retrieve relevant contexts that use synonyms or related terms, potentially omitting valuable information.

In conclusion, the combined approach, which incorporates TF-IDF, sentence transformers, and count vectors, offers significant advantages over the TF-IDF-only method. By capturing semantic meaning, thematic consistency, and reducing the reliance on exact keyword matching, the combined approach retrieves more relevant and coherent contexts. On the other hand, the TF-IDF-only method’s limitations, such as limited semantic understanding, lack of thematic consistency, and sensitivity to keyword variations, can lead to the retrieval of irrelevant or disconnected contexts. Therefore, the combined approach demonstrates superior performance in identifying relevant and meaningful contexts for a given query, providing a more comprehensive and accurate representation of the information sought.

Table 1: Results of our document retriever on transcriptions of the documents in HW-SQuAD and BenthamQA

Transcriptions	HW-SQuAD (Test)	BenthamQA (Test)
a. TF-IDF [3]	90.2	98.5
b. TF-IDF + ST + Preprocessing (proposed)	95.30	99.5

Table 2: Result of applying Our proposed models on HW-SQuAD and BenthamQA

Model	HW-SQuAD F1	HW-SQuAD Exact Match	BenthamQA F1	BenthamQA Exact Match
a. TF-IDF + BERT [3]	76.82	70.73	78.41	66.00
c. TF-IDF + ST + Ensemble (proposed)	90.10	82.02	96.12	92.55

5.2 Impact of applying Ensemble method in Document Reader

We present a comparative performance analysis between the baseline model referenced in [3] and our proposed model, evaluating their efficacy on both the HW-SQuAD and BenthamQA datasets. The results of this comparison are summarized in Table 2. Performance metrics utilized in this assessment include the F1 score and Exact Match (EM) percentage on the respective test sets.

Our proposed model demonstrates superior performance across both datasets. On the HW-SQuAD dataset, it achieves an F1 score of 90.10% and an Exact Match of 82.02%, representing a significant improvement over the baseline model’s 76.82% F1 and 70.73% EM. Similarly, on the BenthamQA dataset, our model attains a 96.12% F1 score and 92.55% EM (Table 2), substantially outperforming the baseline’s 78.41% F1 and 66% EM.

The enhanced performance of our proposed model can be attributed to the Sentence Transformer (ST) and the ensemble approach, which build upon the standard TF-IDF + BERT baseline for question answering on handwritten document datasets. The ensemble model’s results on HW-SQuAD, exceeding 90% F1 and 80% Exact Match, establish a new benchmark for performance on these challenging datasets. The methodologies introduced in this study have the potential to advance state-of-the-art question-answering systems for handwritten documents.

5.3 Analysis of predicted results of our proposed model

In this section, we analyze the effect of applying the ensemble method to our proposed model. We have taken a few questions from the HW-SQuAD and BenthamQA datasets and compared the results in Figs. 5 and 6, respectively. In the HW-SQuAD examples, the ensemble model corrects predictions where the single model falters. For instance, where the ground truth is “divide to form new pyrenoids or be produced de novo,” the single model inaccurately predicts “some of their offspring,” whereas the ensemble model matches the ground truth. The same holds for questions requiring detailed understanding, such as predicting “along the plant cells cell wall” instead of “under intense light.” Additionally, for broader categorical answers like “humid subtropical,” the ensemble model refines the single model’s vague prediction of “warm” to the precise ground truth. This correction and alignment with the expected answers underline the ensemble model’s enhanced interpretative capabilities.

The effectiveness of the ensemble model is equally prominent in the BenthamQA dataset. The ensemble model not only rectifies broader, less accurate predictions of the single model but also ensures precision in complex queries. For example, it refines the single model’s generic prediction of “Offences against Property Theft” to the specific term “Embezzlement,” demonstrating its ability to grasp nuanced legal terminology. Moreover, for questions requiring multiple valid responses, such as listing various historical manufacturers or recognizing titles like “Lord Pelham,” the ensemble model accurately captures all relevant details, showcasing its comprehensive understanding and reliability.

Overall, the ensemble model reduces the number of both incorrect and similar matches, which improves its reliability and precision. By combining the strengths of multiple reader models, it produces predictions that are more consistent with the ground truth and captures a broader range of linguistic and contextual cues than any single reader.

5.4 Ablation Studies

To evaluate the effectiveness of different components in our proposed model, we conducted ablation studies on the HW-SQuAD dataset, as shown in Table 3.
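For readers unfamiliar with the two metrics reported in Table 3, the following is a minimal sketch of how Exact Match and token-level F1 are commonly computed in SQuAD-style evaluation. It assumes the usual SQuAD answer normalization (lowercasing, stripping punctuation and the articles a/an/the); the function names are illustrative and this is not the authors' evaluation script.

```python
import re
import string
from collections import Counter


def normalize(text):
    """SQuAD-style normalization: lowercase, drop punctuation and articles."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction, truth):
    """1.0 if the normalized strings are identical, else 0.0."""
    return float(normalize(prediction) == normalize(truth))


def f1_score(prediction, truth):
    """Harmonic mean of token-level precision and recall."""
    pred_tokens = normalize(prediction).split()
    true_tokens = normalize(truth).split()
    if not pred_tokens or not true_tokens:
        # If either side is empty, score 1.0 only when both are empty.
        return float(pred_tokens == true_tokens)
    overlap = sum((Counter(pred_tokens) & Counter(true_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(true_tokens)
    return 2 * precision * recall / (precision + recall)


print(exact_match("Humid Subtropical.", "humid subtropical"))
print(f1_score("the humid climate", "humid subtropical"))
```

Dataset-level scores are the averages of these per-question values, expressed as percentages.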

First, we examined the impact of our document retrieval enhancements. The baseline model [3] achieved an F1 score of 76.82 and an Exact Match of 70.2. By adding the Sentence Transformer (ST) module to the retrieval pipeline, our TF-IDF + ST + BERT model improved the F1 score to 83.20 and the Exact Match to 71.33. This demonstrates the effectiveness of incorporating semantic information for retrieving more relevant documents.
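The two-stage idea behind adding a semantic module to TF-IDF retrieval can be sketched as follows: a sparse TF-IDF pass shortlists candidate documents, and a dense similarity pass re-ranks the shortlist. This is a simplified illustration, not the authors' pipeline. The `embed` argument is a placeholder for a sentence-embedding model, and `toy_embed` (character trigram counts) is a hypothetical stand-in used only so the sketch runs without external libraries; `k_sparse` and `k_final` are likewise assumed parameters.

```python
import math
from collections import Counter


def tokenize(text):
    return text.lower().split()


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def tfidf_vectors(docs):
    """Return one sparse TF-IDF dict per document plus the IDF table."""
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(docs)
    doc_freq = Counter(t for toks in tokenized for t in set(toks))
    idf = {t: math.log((1 + n_docs) / (1 + c)) + 1.0 for t, c in doc_freq.items()}
    vectors = [{t: tf * idf[t] for t, tf in Counter(toks).items()}
               for toks in tokenized]
    return vectors, idf


def toy_embed(text):
    """Hypothetical stand-in for a sentence encoder: char-trigram counts."""
    s = text.lower()
    return dict(Counter(s[i:i + 3] for i in range(len(s) - 2)))


def retrieve(question, docs, embed, k_sparse=3, k_final=1):
    vectors, idf = tfidf_vectors(docs)
    q_vec = {t: tf * idf.get(t, 0.0)
             for t, tf in Counter(tokenize(question)).items()}
    # Stage 1: shortlist by sparse TF-IDF similarity.
    order = sorted(range(len(docs)),
                   key=lambda i: cosine(q_vec, vectors[i]), reverse=True)
    shortlist = order[:k_sparse]
    # Stage 2: re-rank the shortlist by dense (embedding) similarity.
    q_emb = embed(question)
    reranked = sorted(shortlist,
                      key=lambda i: cosine(q_emb, embed(docs[i])), reverse=True)
    return reranked[:k_final]


docs = ["the cat sat on the mat",
        "stock markets fell sharply today",
        "a cat chased the mouse"]
print(retrieve("where did the cat sit", docs, toy_embed, k_sparse=2, k_final=2))
```

In practice the placeholder encoder would be replaced by a trained sentence-embedding model, so that the second stage can promote documents that are semantically related to the question even when they share few exact terms with it.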

Next, we analyzed the performance of our complete proposed pipeline, which includes an ensemble of multiple reader models in addition to the retrieval enhancements. The TF-IDF + ST + Ensemble model achieved an impressive F1 score of 90.10 and an Exact Match of 82.02, significantly outperforming the baseline. These results highlight the importance of both the improved document retrieval and the ensemble strategy in our proposed approach.
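One common way to combine several extractive readers is to average the confidence that each reader assigns to each candidate answer and keep the highest-scoring candidate. The sketch below illustrates that idea only; the combination rule, the function name `ensemble_answer`, and the scores in the example are assumptions for illustration and are not taken from the paper.

```python
from collections import defaultdict


def ensemble_answer(predictions):
    """Average per-reader scores for each candidate answer and pick the best.

    predictions: one list per reader, each holding (answer_text, score) pairs.
    """
    if not predictions or not any(predictions):
        raise ValueError("at least one candidate answer is required")
    totals = defaultdict(float)
    for reader_predictions in predictions:
        for text, score in reader_predictions:
            # Merge candidates that differ only in case or outer whitespace.
            totals[text.strip().lower()] += score / len(predictions)
    best = max(totals, key=totals.get)
    return best, totals[best]


# Hypothetical scores for two readers on one question.
reader_a = [("warm", 0.55), ("humid subtropical", 0.40)]
reader_b = [("Humid subtropical", 0.70), ("warm", 0.20)]
print(ensemble_answer([reader_a, reader_b]))
```

With these illustrative scores, the first reader alone would answer “warm,” while the averaged scores favor “humid subtropical,” which mirrors the kind of correction discussed in Section 5.3.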

5.4.1 Analysis of Correct, Similar, and Incorrect Matches

In addition to the overall results, we analyzed the number of correct, similar, and incorrect matches for each configuration listed in Table 3. Figure 3 provides a visual depiction of these results. Our proposed model yields 7,418 correct matches, 1,364 similar matches, and 262 incorrect matches, the lowest number of similar and incorrect matches among all the models evaluated. In comparison, the baseline model [3] produces 676 incorrect matches, roughly 2.6 times the number generated by our proposed approach.
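The counts quoted above can be checked with a few lines of arithmetic. The snippet below uses only the numbers stated in this section; the variable names are illustrative. The share of correct matches reproduces the 82.02 Exact Match reported in Table 3, and the ratio of baseline to proposed incorrect matches comes out at about 2.6.

```python
# Match counts for the proposed model, as stated in the text.
proposed = {"correct": 7418, "similar": 1364, "incorrect": 262}
baseline_incorrect = 676  # incorrect matches of the baseline model [3]

total = sum(proposed.values())
shares = {name: 100 * count / total for name, count in proposed.items()}
ratio = baseline_incorrect / proposed["incorrect"]

print(f"total questions: {total}")
for name, share in shares.items():
    print(f"{name}: {share:.2f}%")
print(f"baseline / proposed incorrect matches: {ratio:.2f}")
```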

Incorporating a Sentence Transformer in the Document Retriever stage increases the number of correct matches to 6,419, an improvement of over 700 compared to the baseline model. Concurrently, the number of similar matches is reduced to 1,952. Applying the ensemble technique further decreases the number of similar matches to 1,364 while raising the number of correct matches above 7,400.

Based on these observations, we conclude that adding a Sentence Transformer in the Document Retriever stage significantly enhances the quality of the retrieved documents. Consequently, the Document Reader performs markedly better even with a single BERT large model. The ensemble method further improves the Document Reader’s semantic understanding of the context, reducing the number of incorrect matches to well under half of the baseline count (676 to 262) and roughly halving the number of similar matches.

Overall, the ablation studies validate the efficacy of the key components in our model. The combination of semantic similarity-enhanced retrieval and an ensemble of strong reader models leads to state-of-the-art performance on the HW-SQuAD benchmark.

6 Conclusion

This paper proposes a novel approach for answering questions on handwritten documents by combining advanced document retrieval techniques with an ensemble of extractive QA models. Our enhanced document retriever, which leverages the strengths of TF-IDF and sentence transformers, significantly improves retrieval performance. The ensemble-based document reader, utilizing BERT and DeBERTa large models, enables robust and accurate answer extraction from the retrieved documents.

Experimental results on the HW-SQuAD and BenthamQA datasets demonstrate the effectiveness of our approach. Our models surpass previous recognition-based methods and set a new benchmark for QA performance on handwritten documents. However, room remains for further improvement, such as exploring recognition-free methods and techniques to handle noise and varying writing styles.

Our work is a significant step forward in question answering on handwritten documents. By presenting a novel approach and achieving superior results, we have advanced the state of the art and opened up new avenues for research in this challenging and vital area of natural language processing.

Table 3: Ablation studies of our model on the HW-SQuAD dataset.

Model	F1	Exact Match
a. TF-IDF + BERT [3]	76.82	70.2
b. TF-IDF + ST + BERT (proposed)	83.20	71.33
c. TF-IDF + ST + Ensemble (proposed)	90.10	82.02

References

[1] P. Rajpurkar, J. Zhang, K. Lopyrev, and P. Liang, “SQuAD: 100,000+ questions for machine comprehension of text,” in Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, J. Su, K. Duh, and X. Carreras, Eds. Austin, Texas: Association for Computational Linguistics, Nov. 2016, pp. 2383–2392. [Online]. Available: https://aclanthology.org/D16-1264
[2] P. Rajpurkar, R. Jia, and P. Liang, “Know what you don’t know: Unanswerable questions for SQuAD,” in Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), I. Gurevych and Y. Miyao, Eds. Melbourne, Australia: Association for Computational Linguistics, Jul. 2018, pp. 784–789. [Online]. Available: https://aclanthology.org/P18-2124
[3] M. Mathew, L. Gomez, D. Karatzas, and C. V. Jawahar, “Asking questions on handwritten document collections,” International Journal on Document Analysis and Recognition (IJDAR), vol. 24, no. 3, pp. 235–249, Sep. 2021. [Online]. Available: https://doi.org/10.1007/s10032-021-00383-3
[4] J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova, “BERT: Pre-training of deep bidirectional transformers for language understanding,” in Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), J. Burstein, C. Doran, and T. Solorio, Eds. Minneapolis, Minnesota: Association for Computational Linguistics, Jun. 2019, pp. 4171–4186. [Online]. Available: https://aclanthology.org/N19-1423
[5] T. Nguyen, M. Rosenberg, X. Song, J. Gao, S. Tiwary, R. Majumder, and L. Deng, “MS MARCO: A human generated machine reading comprehension dataset,” in Proceedings of the Workshop on Cognitive Computation: Integrating neural and symbolic approaches 2016 co-located with the 30th Annual Conference on Neural Information Processing Systems (NIPS 2016), Barcelona, Spain, December 9, 2016, ser. CEUR Workshop Proceedings, T. R. Besold, A. Bordes, A. S. d’Avila Garcez, and G. Wayne, Eds., vol. 1773. CEUR-WS.org, 2016. [Online]. Available: https://ceur-ws.org/Vol-1773/CoCoNIPS_2016_paper9.pdf
[6] T. Kwiatkowski, J. Palomaki, O. Redfield, M. Collins, A. Parikh, C. Alberti, D. Epstein, I. Polosukhin, J. Devlin, K. Lee, K. Toutanova, L. Jones, M. Kelcey, M.-W. Chang, A. M. Dai, J. Uszkoreit, Q. Le, and S. Petrov, “Natural questions: A benchmark for question answering research,” Transactions of the Association for Computational Linguistics, vol. 7, pp. 452–466, 2019. [Online]. Available: https://aclanthology.org/Q19-1026
[7] K. M. Hermann, T. Kocisky, E. Grefenstette, L. Espeholt, W. Kay, M. Suleyman, and P. Blunsom, “Teaching machines to read and comprehend,” in Advances in Neural Information Processing Systems, C. Cortes, N. Lawrence, D. Lee, M. Sugiyama, and R. Garnett, Eds., vol. 28. Curran Associates, Inc., 2015. [Online]. Available: https://proceedings.neurips.cc/paper_files/paper/2015/file/afdec7005cc9f14302cd0474fd0f3c96-Paper.pdf
[8] S. Wang and J. Jiang, “Machine comprehension using match-LSTM and answer pointer,” in International Conference on Learning Representations, 2017. [Online]. Available: https://openreview.net/forum?id=B1-q5Pqxl
[9] D. Gurari, Q. Li, A. J. Stangl, A. Guo, C. Lin, K. Grauman, J. Luo, and J. P. Bigham, “VizWiz grand challenge: Answering visual questions from blind people,” in 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2018, pp. 3608–3617.
[10] A. F. Biten, R. Tito, A. Mafla, L. Gomez, M. Rusiñol, C. Jawahar, E. Valveny, and D. Karatzas, “Scene text visual question answering,” in 2019 IEEE/CVF International Conference on Computer Vision (ICCV), 2019, pp. 4290–4300.
[11] A. Kembhavi, M. Seo, D. Schwenk, J. Choi, A. Farhadi, and H. Hajishirzi, “Are you smarter than a sixth grader? textbook question answering for multimodal machine comprehension,” in 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017, pp. 5376–5384.
[12] K. Kafle, B. Price, S. Cohen, and C. Kanan, “DVQA: Understanding data visualizations via question answering,” in 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2018, pp. 5648–5656.
[13] S. E. Kahou, V. Michalski, A. Atkinson, Á. Kádár, A. Trischler, and Y. Bengio, “FigureQA: An annotated figure dataset for visual reasoning,” in 6th International Conference on Learning Representations, ICLR 2018, Vancouver, BC, Canada, April 30 - May 3, 2018, Workshop Track Proceedings. OpenReview.net, 2018. [Online]. Available: https://openreview.net/forum?id=H1mz0OyDz
[14] M. Mathew, D. Karatzas, and C. V. Jawahar, “DocVQA: A dataset for VQA on document images,” in 2021 IEEE Winter Conference on Applications of Computer Vision (WACV), 2021, pp. 2199–2208.
[15] A. Jain and A. Namboodiri, “Indexing and retrieval of on-line handwritten documents,” in Seventh International Conference on Document Analysis and Recognition, 2003. Proceedings., 2003, pp. 655–659.
[16] M. Villegas, J. Puigcerver, A. H. Toselli, J. Sánchez, and E. Vidal, “Overview of the ImageCLEF 2016 handwritten scanned document retrieval task,” in Working Notes of CLEF 2016 - Conference and Labs of the Evaluation forum, Évora, Portugal, 5-8 September, 2016, ser. CEUR Workshop Proceedings, K. Balog, L. Cappellato, N. Ferro, and C. Macdonald, Eds., vol. 1609. CEUR-WS.org, 2016, pp. 233–253. [Online]. Available: https://ceur-ws.org/Vol-1609/16090233.pdf
[17] K. Kise, S. Fukushima, and K. Matsumoto, “Document image retrieval for QA systems based on the density distributions of successive terms,” IEICE Transactions, vol. 88-D, pp. 1843–1851, Aug. 2005.