OmniParser: A Unified Framework for Text Spotting, Key Information Extraction and Table Recognition
(Supplementary Materials)

1 Implementation Details

1.1 Spatial-Window Prompting

Spatial-window prompting has two sampling modes, fixed and random, in addition to a default window. In the fixed mode, the image is divided evenly into grid blocks, such as 3x3 or 2x2, and one block is chosen as the window. In the random mode, the starting point of the spatial window is randomly determined. So that the random window encompasses more text, its width and height are each at least 1/3 of the image side, which keeps its area no less than 1/9 of the original image. The fixed mode is selected with a probability of 30%, the random mode with a probability of 30%, and with the remaining 40% the window defaults to the entire image. Following [kil2023towards], we set the bin size of the coordinate vocabulary to 1000. The pseudo-code of spatial-window prompting is shown below.

import random

# probability used to select the mode
prob = random.uniform(0, 1)

# coordinates are quantized into n_bins bins
n_bins = 1000

if prob < 0.4:
    # default window: cover the entire image
    start_x, start_y, end_x, end_y = [0, 0, n_bins - 1, n_bins - 1]
elif prob < 0.7:
    # fixed mode: the x-axis and y-axis are partitioned into varying numbers of blocks
    num_xs = [3, 3, 1, 3, 2, 2, 2, 1]
    num_ys = [3, 1, 3, 2, 3, 2, 1, 2]

    total_windows = []
    for num_x, num_y in zip(num_xs, num_ys):
        inter_x = min(int(n_bins / num_x), n_bins - 1)
        inter_y = min(int(n_bins / num_y), n_bins - 1)

        for i in range(num_x):
            for j in range(num_y):
                start_x = i * inter_x
                start_y = j * inter_y
                end_x = min(start_x + inter_x, n_bins - 1)
                end_y = min(start_y + inter_y, n_bins - 1)
                total_windows.append([start_x, start_y, end_x, end_y])

    start_x, start_y, end_x, end_y = random.choice(total_windows)
else:
    # random mode: random start point, each side at least 1/3 of the image
    inter = int(n_bins / 3)
    start_x = random.randint(0, inter * 2)
    start_y = random.randint(0, inter * 2)
    rect_w, rect_h = random.randint(inter, n_bins - 1), random.randint(inter, n_bins - 1)
    end_x, end_y = min(start_x + rect_w, n_bins - 1), min(start_y + rect_h, n_bins - 1)

spatial_window_prompt = [start_x, start_y, end_x, end_y]
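
The pseudo-code above produces a window in quantized coordinates. As a minimal sketch of how such a window might be applied, the snippet below quantizes pixel-space center points into the same 1000-bin vocabulary and keeps those that fall inside the window. The helper names `quantize` and `points_in_window`, and the use of center points as the inclusion test, are assumptions made for illustration and are not taken from this supplement.

```python
from typing import List, Tuple

N_BINS = 1000  # bin size of the coordinate vocabulary, as stated in the text


def quantize(value: float, size: float, n_bins: int = N_BINS) -> int:
    """Map a pixel coordinate in [0, size] to an integer bin in [0, n_bins - 1]."""
    bin_index = int(value / size * n_bins)
    return max(0, min(bin_index, n_bins - 1))


def points_in_window(
    centers: List[Tuple[float, float]],
    image_size: Tuple[int, int],
    window: List[int],
) -> List[Tuple[int, int]]:
    """Quantize center points and keep those inside [start_x, start_y, end_x, end_y].

    Hypothetical helper: the inclusion rule (closed interval on both axes)
    is an assumption.
    """
    width, height = image_size
    start_x, start_y, end_x, end_y = window
    kept = []
    for cx, cy in centers:
        qx, qy = quantize(cx, width), quantize(cy, height)
        if start_x <= qx <= end_x and start_y <= qy <= end_y:
            kept.append((qx, qy))
    return kept


# Two center points in an 800x600 image, tested against the top-left 3x3 block.
print(points_in_window([(100, 100), (700, 500)], (800, 600), [0, 0, 333, 333]))
# [(125, 166)]
```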

1.2 Table Recognition

Given a table image, we resize it to 1,024 × 1,024 pixels. The Structured Points Decoder, utilizing the feature vector from the Image Encoder, generates pure HTML tags and structural cell point sequences together in a single sequence that represents the table’s logical and physical structures. These structural cell point sequences serve as start-prompting input for the Content Decoder, which extracts table cell contents in parallel. The final output combines pure HTML tags with cell contents, forming complete HTML sequences that faithfully represent the table’s structure and content.
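
The final merging step can be sketched roughly as follows. This is an illustrative sketch only: the function `merge_structure_and_content`, the representation of the structure output as `(token, point)` pairs, and the assumption that cell texts arrive in the same order as the non-empty cells are all hypothetical. The token forms follow the merged-label convention described in the GT Generation paragraph.

```python
from typing import List, Optional, Tuple

Point = Tuple[int, int]


def merge_structure_and_content(
    structure: List[Tuple[str, Optional[Point]]],
    contents: List[str],
) -> str:
    """Fill recognized cell texts into structural HTML tokens (hypothetical helper)."""
    texts = iter(contents)
    pending = None  # text waiting for the closing tag of a spanning cell
    html = []
    for token, point in structure:
        if token == "<td>[]</td>":
            # non-empty, non-spanning cell: replace the placeholder with its text
            html.append("<td>" + next(texts) + "</td>")
        elif token == "<td" and point is not None:
            # non-empty spanning cell: hold the text until its closing tag
            pending = next(texts)
            html.append(token)
        elif token == "</td>" and pending is not None:
            html.append(pending + token)
            pending = None
        elif token.startswith(("colspan=", "rowspan=")):
            html.append(" " + token)
        else:
            html.append(token)
    return "".join(html)


structure = [
    ("<tr>", None),
    ("<td", (120, 40)), ('colspan="2"', None), (">", None), ("</td>", None),
    ("</tr>", None),
    ("<tr>", None),
    ("<td>[]</td>", (60, 90)), ("<td></td>", None),
    ("</tr>", None),
]
print(merge_structure_and_content(structure, ["Revenue", "2023"]))
# <tr><td colspan="2">Revenue</td></tr><tr><td>2023</td><td></td></tr>
```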

Datasets. Since our model predicts both the logical structure of tables with cell bounding box central points and cell content, datasets lacking cell content and corresponding bounding box annotations, such as TABLE2LATEX-450K [deng2019challenges], TableBank [li2020tablebank], UNLV [shahab2010open], IC19B2H [gao2019icdar], WTW [long2021parsing] and TUCD [raja2021visual], are not suitable for our approach. Similarly, datasets like ICDAR2013Table [gobel2013icdar], SciTSR [chi2019complicated], and PubTables-1M [smock2022pubtables], which provide cell content and content box annotations, employ metrics based on box representations that are incompatible with our point-based format. Consequently, PubTabNet (PTN) [EDD] and FinTabNet (FTN) [GTE] are selected for our model evaluation.

GT Generation. The ground truth pure HTML tags of tables are tokenized into structural tokens. Following previous works [TableMaster, VAST], we use merged labels to represent non-spanning cells, which reduces the length of the HTML tag sequence. Specifically, we use <td></td> and <td>[]</td> to denote empty cells and non-empty cells, respectively. For a cell spanning multiple rows or columns, the original HTML tags are broken into four tokens: <td, colspan="n" or rowspan="n", >, and </td>. We use the first token <td to represent a spanning cell. In addition, four special symbol categories are added: <S>, </S>, <PAD>, and <UNK>, which represent the beginning of a sequence, the end of a sequence, padding symbols, and unknown characters, respectively. To build the GT of the Structured Points Decoder, we insert the center point of each cell text box at the corresponding HTML tag. To build the GT of the Content Decoder, we combine each cell text with its center point into a single sequence, where the center point can be viewed as a start-prompting input for recognizing the text, and each cell text is tokenized at the character level. An example of building a training sequence GT for the Structured Points Decoder and the Content Decoder in the table recognition task is illustrated in Fig. 1.
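
As a rough illustration of the GT construction described above, the sketch below builds both targets from a small table. The `Cell` dataclass, the function names, the inclusion of <tr> tokens, and the pairing of each structural token with an optional point are assumptions made for this example. The exact sequence layout is the one shown in Fig. 1, which this sketch does not claim to reproduce.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

Point = Tuple[int, int]


@dataclass
class Cell:
    """Hypothetical cell annotation used only for this sketch."""
    text: str = ""
    center: Optional[Point] = None  # center of the cell text box
    colspan: int = 1
    rowspan: int = 1


def structure_targets(rows: List[List[Cell]]) -> List[Tuple[str, Optional[Point]]]:
    """Structural tokens with merged labels, each paired with an optional center point."""
    tokens: List[Tuple[str, Optional[Point]]] = [("<S>", None)]
    for row in rows:
        tokens.append(("<tr>", None))
        for cell in row:
            point = cell.center if cell.text else None  # empty cells carry no point
            if cell.colspan == 1 and cell.rowspan == 1:
                tokens.append(("<td>[]</td>" if cell.text else "<td></td>", point))
            else:
                # spanning cell: the first token "<td" represents the cell
                tokens.append(("<td", point))
                if cell.colspan > 1:
                    tokens.append((f'colspan="{cell.colspan}"', None))
                if cell.rowspan > 1:
                    tokens.append((f'rowspan="{cell.rowspan}"', None))
                tokens.append((">", None))
                tokens.append(("</td>", None))
        tokens.append(("</tr>", None))
    tokens.append(("</S>", None))
    return tokens


def content_targets(rows: List[List[Cell]]) -> List[Tuple[Point, List[str]]]:
    """One (center point, character list) pair per non-empty cell."""
    return [(c.center, list(c.text)) for row in rows for c in row if c.text]


rows = [
    [Cell("Revenue", (120, 40), colspan=2)],
    [Cell("2023", (60, 90)), Cell()],
]
print(structure_targets(rows))
print(content_targets(rows))
```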

Figure 1: An example of building training GTs for the table recognition task. We use the center points of each cell text box to build GTs for the Structured Points Decoder and the Content Decoder. If a cell contains no text, the corresponding points in the GTs are left empty as well.
Figure 2: A comparative analysis of partial results obtained from OmniParser and Donut on CORD. The first column depicts the original image, while columns 2 and 3 illustrate our detection results and the corresponding formatted output, respectively. Column 4 shows Donut’s formatted output. Notably, our model demonstrates superior performance in entity extraction.
Figure 3: Illustrative failure case of Donut in the table recognition task. Red text marks erroneous predictions. For readability, we only highlight two errors in this example. Due to the lack of point location information, Donut suffers from attention drift, resulting in the prediction of repeated tokens and a high probability of error accumulation in long-sequence scenarios. (The figure is best viewed in color.)

2 Comparisons with Donut on KIE Task

As shown in Fig. 2, OmniParser performs entity extraction while also predicting the location of each entity word. Donut, in contrast, only predicts the structured sequence for entity extraction and has no localization ability. The absence of direct region supervision during both training and prediction often leads to inferior results for entities with the same values (Row 1), repeated entities (Row 2), or entities with explicit trigger names (Row 3).

3 Training Donut on Table Recognition Task

We fine-tuned the OCR-free end-to-end model Donut [kim2022donut] for table recognition on the FinTabNet dataset. The ground truth sequences combine HTML tags with table cell text, and we tried several training hyper-parameter settings for adequate verification, as shown in Tab. 1. Due to GPU memory limitations, we constrained the decoder’s max length in Donut to 4,000. Note that the original HTML sequence max lengths for PubTabNet and FinTabNet are 8,722 and 8,035, respectively. For long-sequence prediction tasks such as table recognition, training an end-to-end model like Donut on combined HTML sequences that include cell text is non-trivial. An illustrative failure case of Donut in the table recognition task is shown in Fig. 3. Specifically, due to the lack of region supervision, Donut exhibits an attention drift problem, which results in the prediction of repeated tokens and a high probability of error accumulation in long-sequence scenarios, and hence in inferior table recognition performance. In contrast, OmniParser decouples the generation of the location-aware structured points sequence from cell text recognition, alleviating the issues of attention drift and error accumulation.

Methods                LR     Epoch   S-TEDS   TEDS
Donut [kim2022donut]   3e-5   20      22.2     17.2
Donut [kim2022donut]   3e-5   40      26.2     20.0
Donut [kim2022donut]   1e-4   40      30.7     29.1
Donut [kim2022donut]   1e-3   40      41.7     40.5
Donut [kim2022donut]   1e-3   100     41.9     41.2
OmniParser (ours)      -      -       91.55    89.75

Table 1: Comparison of different training hyper-parameters for Donut on the FinTabNet dataset. LR is short for learning rate.
Figure 4: An example of building training GTs for the hierarchical text detection task.

4 Generalization to Hierarchical Text Detection Task

Thanks to the flexible structured-sequence representation in OmniParser, it can be conveniently extended to other OCR-related tasks, such as hierarchical text detection, which aims to group the text in an image into three levels, namely word, line, and paragraph, based on spatial position and semantic relationship. Previous methods [long2022towards] mainly obtained hierarchical results by clustering based on similarity. In our approach, we distinguish text belonging to different hierarchical levels by simply inserting <LINE> and <PARA> structural tags into the sequence of text center points, as shown in Fig. 4. The experiments are conducted on the HierText dataset [long2022towards], which consists of 8,281 training images, 1,724 validation images, and 1,634 test images. We train the model on the training set and evaluate on the validation set. Partial visualization results are shown in Fig. 5. Without any task-specific architectural design, our model achieves promising results, demonstrating its strong generalization ability.
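
A minimal sketch of this tag-insertion idea is given below. It assumes, purely for illustration, that each <LINE> and <PARA> tag closes the group that precedes it. The actual placement is the one shown in Fig. 4, and the function names are hypothetical.

```python
from typing import List, Tuple, Union

Point = Tuple[int, int]
Token = Union[str, Point]


def build_sequence(paragraphs: List[List[List[Point]]]) -> List[Token]:
    """Flatten paragraphs -> lines -> word center points into one tagged sequence."""
    sequence: List[Token] = []
    for paragraph in paragraphs:
        for line in paragraph:
            sequence.extend(line)      # word-level center points
            sequence.append("<LINE>")  # assumed to mark the end of a line
        sequence.append("<PARA>")      # assumed to mark the end of a paragraph
    return sequence


def parse_sequence(sequence: List[Token]) -> List[List[List[Point]]]:
    """Recover the word/line/paragraph grouping from a tagged sequence."""
    paragraphs, lines, words = [], [], []
    for token in sequence:
        if token == "<LINE>":
            lines.append(words)
            words = []
        elif token == "<PARA>":
            paragraphs.append(lines)
            lines = []
        else:
            words.append(token)
    return paragraphs


# One paragraph with two lines, followed by a paragraph with a single word.
layout = [[[(10, 10), (40, 10)], [(12, 30)]], [[(200, 80)]]]
sequence = build_sequence(layout)
print(sequence)
# [(10, 10), (40, 10), '<LINE>', (12, 30), '<LINE>', '<PARA>', (200, 80), '<LINE>', '<PARA>']
```

Because the grouping is encoded entirely by the tags, the same decoder output can be read at the word, line, or paragraph level by parsing the sequence back with `parse_sequence`.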

Figure 5: Visualization results of hierarchical text detection. Columns 1-3 represent the detection results for word, line, and paragraph levels, respectively. Text instances belonging to the same hierarchical level are enclosed within rectangles of the same color. (The figure is best viewed in color.)
Figure 6: Visualization results of text spotting. Rows 1-2 depict the visual results on the Total-Text dataset, while rows 3 and 4 respectively illustrate the visual results on the ICDAR 2015 and CTW1500 datasets. (The figure is best viewed in color.)
Figure 7: Visualization results of key information extraction. Rows 1-3 and row 4 demonstrate the visual results on the CORD and SROIE datasets respectively. In order to differentiate entities of different categories, we employ rectangles of varying colors. The correspondence between colors and categories can be seen in the legend on the right side. (The figure is best viewed in color.)
Figure 8: Visualization results of table recognition. We present point locations and a rendered table with an additional border for readability based on the prediction sequence in each group. Blue points and red points denote the GT and predicted points respectively. (The figure is best viewed in color.)

5 More Visualizations

Fig. 6, Fig. 7, and Fig. 8 show additional qualitative results for text spotting, key information extraction, and table recognition, respectively.