OmniParser: A Unified Framework for Text Spotting, Key Information Extraction and Table Recognition (Supplementary Materials)

1 Implementation Details

1.1 Spatial-Window Prompting

Spatial-window prompting operates in two modes: fixed and random. In the fixed mode, the image is divided evenly into grid blocks, such as $3 \times 3$ or $2 \times 2$, and one block is chosen as the window. In the random mode, the starting point of the spatial window is sampled at random. So that the random box encompasses more text, its area is constrained to be no less than 1/9 of the original image. The fixed mode is selected with a probability of 30%, the random mode with a probability of 30%, and with the remaining 40% the window defaults to covering the entire image. Following [kil2023towards], we set the bin size of the coordinate vocabulary to 1,000. The pseudo-code of spatial-window prompting is shown in the following.

import random

# Sample a value in [0, 1] that decides which prompting mode is used.
prob = random.uniform(0, 1)

# Coordinates are quantized into n_bins discrete bins.
n_bins = 1000

if prob < 0.4:
    # Default window: cover the entire image.
    start_x, start_y, end_x, end_y = [0, 0, n_bins - 1, n_bins - 1]
elif prob < 0.7:
    # Fixed mode: the x-axis and y-axis are partitioned into varying numbers of blocks.
    num_xs = [3, 3, 1, 3, 2, 2, 2, 1]
    num_ys = [3, 1, 3, 2, 3, 2, 1, 2]

    total_windows = []
    for num_x, num_y in zip(num_xs, num_ys):
        inter_x = min(int(n_bins / num_x), n_bins - 1)
        inter_y = min(int(n_bins / num_y), n_bins - 1)

        for i in range(num_x):
            for j in range(num_y):
                start_x = i * inter_x
                start_y = j * inter_y
                end_x = min(start_x + inter_x, n_bins - 1)
                end_y = min(start_y + inter_y, n_bins - 1)
                total_windows.append([start_x, start_y, end_x, end_y])

    # Pick one grid block among all candidate partitions.
    start_x, start_y, end_x, end_y = random.choice(total_windows)
else:
    # Random mode: each side spans at least one third of the image.
    inter = int(n_bins / 3)
    start_x = random.randint(0, inter * 2)
    start_y = random.randint(0, inter * 2)
    rect_w, rect_h = random.randint(inter, n_bins - 1), random.randint(inter, n_bins - 1)
    end_x, end_y = min(start_x + rect_w, n_bins - 1), min(start_y + rect_h, n_bins - 1)

spatial_window_prompt = [start_x, start_y, end_x, end_y]
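To illustrate how a window expressed in quantized coordinates might be applied, the following is a minimal sketch and not the authors' implementation. It assumes that pixel coordinates are mapped linearly to bins, that the window boundaries are inclusive, and that a text instance belongs to a window when its center point lies inside it. The helper names `quantize_point` and `points_in_window` and the sample points are hypothetical.

```python
def quantize_point(x, y, img_w, img_h, n_bins=1000):
    """Map a pixel coordinate to integer bins in [0, n_bins - 1] (assumed linear mapping)."""
    qx = min(int(x * n_bins / img_w), n_bins - 1)
    qy = min(int(y * n_bins / img_h), n_bins - 1)
    return qx, qy


def points_in_window(points, window):
    """Keep quantized points inside [start_x, start_y, end_x, end_y] (inclusive, assumed)."""
    start_x, start_y, end_x, end_y = window
    return [(x, y) for x, y in points
            if start_x <= x <= end_x and start_y <= y <= end_y]


# Hypothetical text center points (in pixels) on a 2000 x 1000 image.
centers_px = [(100, 50), (1000, 500), (1900, 950)]
quantized = [quantize_point(x, y, img_w=2000, img_h=1000) for x, y in centers_px]

# A window covering the top-left part of the image, in bin coordinates.
selected = points_in_window(quantized, window=[0, 0, 500, 500])
```

Working in bin coordinates keeps the window and the points in the same vocabulary, independent of the input resolution.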

1.2 Table Recognition

Given a table image, we resize it to 1,024 $\times$ 1,024 pixels. Using the feature vector from the Image Encoder, the Structured Points Decoder generates pure HTML tags and structural cell point sequences within a single sequence, which together represent the table’s logical and physical structures. The structural cell points then serve as start-prompting input for the Content Decoder, which extracts the contents of all table cells in parallel. The final output combines the pure HTML tags with the cell contents, forming a complete HTML sequence that faithfully represents the table’s structure and content.
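The final merging step can be sketched as follows. This is an illustrative assumption rather than the released code: it supposes that the recognized cell texts arrive in the same order as the cells in the structure sequence, that a `<td>[]</td>` token marks a non-empty cell, and that the `>` token of a spanning cell is where its content is placed (with an empty string for an empty spanning cell). The function name `merge_structure_and_content` is hypothetical.

```python
def merge_structure_and_content(structure_tokens, cell_texts):
    """Fill recognized cell texts into the structural HTML tokens, in order."""
    texts = iter(cell_texts)
    html = []
    for tok in structure_tokens:
        if tok == "<td>[]</td>":
            # Non-spanning, non-empty cell: expand the merged label.
            html.append("<td>" + next(texts) + "</td>")
        elif tok == ">":
            # End of a spanning cell's opening tag: its content follows.
            html.append(">" + next(texts))
        else:
            html.append(tok)
    return "".join(html)


# Hypothetical decoder outputs for a two-row table.
structure = ["<tr>", "<td", ' colspan="2"', ">", "</td>", "</tr>",
             "<tr>", "<td>[]</td>", "<td></td>", "</tr>"]
texts = ["Total", "42"]
table_html = merge_structure_and_content(structure, texts)
```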

Datasets. Since our model predicts both the logical structure of tables with cell bounding box central points and cell content, datasets lacking cell content and corresponding bounding box annotations, such as TABLE2LATEX-450K [deng2019challenges], TableBank [li2020tablebank], UNLV [shahab2010open], IC19B2H [gao2019icdar], WTW [long2021parsing] and TUCD [raja2021visual], are not suitable for our approach. Similarly, datasets like ICDAR2013Table [gobel2013icdar], SciTSR [chi2019complicated], and PubTables-1M [smock2022pubtables], which provide cell content and content box annotations, employ metrics based on box representations that are incompatible with our point-based format. Consequently, PubTabNet (PTN) [EDD] and FinTabNet (FTN) [GTE] are selected for our model evaluation.

GT Generation. The ground-truth pure HTML tags of tables are tokenized into structural tokens. Following previous works [TableMaster, VAST], we use merged labels to represent non-spanning cells, which reduces the length of the HTML tag sequence. Specifically, we use <td></td> and <td>[]</td> to denote empty cells and non-empty cells, respectively. For a cell spanning multiple rows or columns, the original HTML tags are broken into four tokens: <td, colspan="n" or rowspan="n", >, and </td>. We use the first token <td to represent a spanning cell. In addition, four special symbols are added: <S>, </S>, <PAD>, and <UNK>, which represent the beginning of a sequence, the end of a sequence, padding, and unknown characters, respectively. To build the GT of the Structured Points Decoder, we insert the center point of each cell text box into the corresponding HTML tags. To build the GT of the Content Decoder, we combine each cell text with its center point into a single sequence, where the center point can be viewed as a start-prompting input for recognizing the text, and each cell text is tokenized at the character level. An example of building training sequence GTs for the Structured Points Decoder and the Content Decoder in the table recognition task is illustrated in Fig. 1.

Figure 1: An example of building training GTs for the table recognition task. We use the center point of each cell text box to build GTs for the Structured Points Decoder and the Content Decoder. If a cell contains no text, the corresponding points in the GTs are left empty as well.
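The GT construction described above can be sketched in a few lines. The exact token layout follows Fig. 1 in the paper; here we only assume, for illustration, that the center point is placed right after the tokens of its cell, that center points are already quantized, and that each Content Decoder sequence ends with </S>. The function name `build_table_gt` and the input format are hypothetical.

```python
def build_table_gt(rows):
    """Build structure and content GTs from rows of cells.

    Each cell is a dict with 'text', 'center' (x, y) and optional
    'colspan' / 'rowspan' entries (hypothetical input format).
    """
    structure_gt = ["<S>"]
    content_gt = []
    for row in rows:
        structure_gt.append("<tr>")
        for cell in row:
            spans = [f'{k}="{cell[k]}"' for k in ("colspan", "rowspan")
                     if cell.get(k, 1) > 1]
            if spans:
                tokens = ["<td", *spans, ">", "</td>"]  # spanning cell
            elif cell["text"]:
                tokens = ["<td>[]</td>"]                # non-empty cell
            else:
                tokens = ["<td></td>"]                  # empty cell
            structure_gt.extend(tokens)
            if cell["text"]:
                x, y = cell["center"]
                # Assumed placement: the point follows the cell's tokens.
                structure_gt.extend([x, y])
                # The point prompts character-level text recognition.
                content_gt.append([x, y, *cell["text"], "</S>"])
        structure_gt.append("</tr>")
    structure_gt.append("</S>")
    return structure_gt, content_gt


rows = [
    [{"text": "Year", "center": (120, 40), "colspan": 2}],
    [{"text": "42", "center": (60, 90)}, {"text": "", "center": None}],
]
structure_gt, content_gt = build_table_gt(rows)
```

Empty cells contribute a structural token but no point and no content sequence, which matches the caption of Fig. 1.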

2 Comparisons with Donut on KIE Task

As shown in Fig. 2, OmniParser performs entity extraction while also predicting the location of each entity word. Donut, in contrast, only predicts the structured sequence for entity extraction and has no localization ability. The absence of direct region supervision during both training and prediction often leads to inferior results for entities with the same values (Row 1), repeated entities (Row 2), or entities with explicit trigger names (Row 3).

3 Training Donut on Table Recognition Task

We fine-tuned the OCR-free end-to-end model Donut [kim2022donut] for table recognition on the FinTabNet dataset. The ground-truth sequence combines HTML tags with table cell text, and we tried several training hyper-parameter settings for adequate verification, as shown in Tab. 1. Due to GPU memory limitations, we constrained the decoder’s max length in Donut to 4,000, whereas the maximum lengths of the original HTML sequences in PubTabNet and FinTabNet are 8,722 and 8,035, respectively. For long-sequence prediction tasks such as table recognition, training an end-to-end model like Donut on combined HTML sequences that include cell text is non-trivial. Lacking region supervision, Donut exhibits attention drift, which results in repeated tokens and a high probability of error accumulation in long sequences, and hence in inferior table recognition performance. An illustrative failure case is shown in Fig. 3. In contrast, OmniParser decouples the generation of the location-aware structured points sequence from cell text recognition, alleviating attention drift and error accumulation.

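| Methods | LR | Epoch | S-TEDS | TEDS |
|---|---|---|---|---|
| Donut [kim2022donut] | 3e-5 | 20 | 22.2 | 17.2 |
| | 3e-5 | 40 | 26.2 | 20.0 |
| | 1e-4 | 40 | 30.7 | 29.1 |
| | 1e-3 | 40 | 41.7 | 40.5 |
| | 1e-3 | 100 | 41.9 | 41.2 |
| OmniParser (ours) | - | - | 91.55 | 89.75 |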

Table 1: Comparison of different training hyper-parameters for Donut on the FinTabNet dataset. LR is short for learning rate.

4 Generalization to Hierarchical Text Detection Task

Thanks to the flexible structured-sequence representation in OmniParser, it is straightforward to extend the model to other OCR-related tasks, such as hierarchical text detection, which aims to group the text in an image into three levels, namely word, line, and paragraph, based on spatial position and semantic relationship. Previous methods [long2022towards] mainly obtained hierarchical results by clustering based on similarity. In our approach, we distinguish text belonging to different hierarchical levels by simply inserting <LINE> and <PARA> structural tags into the sequence of text center points, as shown in Fig. 4. The experiments are mainly conducted on the HierText dataset [long2022towards], which consists of 8,281 training images, 1,724 validation images, and 1,634 test images. We train the model on the training set and evaluate it on the validation set. Partial visualization results are shown in Fig. 5. Without any task-specific architectural design, our model achieves promising results, demonstrating its strong generalization ability.
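A minimal sketch of such a tagged point sequence is given below. The exact tag placement follows Fig. 4 in the paper; here we assume, purely for illustration, that <LINE> closes each line and <PARA> closes each paragraph. Both function names and the sample coordinates are hypothetical.

```python
def build_hierarchical_sequence(paragraphs):
    """Flatten paragraphs -> lines -> word centers into one tagged sequence."""
    seq = []
    for para in paragraphs:
        for line in para:
            for x, y in line:
                seq.extend([x, y])
            seq.append("<LINE>")  # assumed: tag closes the current line
        seq.append("<PARA>")      # assumed: tag closes the current paragraph
    return seq


def parse_hierarchical_sequence(seq):
    """Recover the word/line/paragraph grouping from a tagged sequence."""
    paragraphs, para, line, coords = [], [], [], []
    for tok in seq:
        if tok == "<LINE>":
            para.append(line)
            line = []
        elif tok == "<PARA>":
            paragraphs.append(para)
            para = []
        else:
            coords.append(tok)
            if len(coords) == 2:  # a full (x, y) center point
                line.append(tuple(coords))
                coords = []
    return paragraphs


# Two paragraphs: the first has two lines, the second has one.
paragraphs = [[[(10, 10), (30, 10)], [(10, 25)]], [[(10, 60)]]]
seq = build_hierarchical_sequence(paragraphs)
recovered = parse_hierarchical_sequence(seq)
```

Because the grouping is encoded directly in the sequence, no clustering step is needed to recover lines and paragraphs from the predicted points.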

5 More Visualizations

Fig. 6, Fig. 7, and Fig. 8 show additional qualitative results for text spotting, key information extraction, and table recognition, respectively.