License: CC BY 4.0
arXiv:2403.13761v1 [cs.CV] 20 Mar 2024

HierCode: A Lightweight Hierarchical Codebook for Zero-shot Chinese Text Recognition

Yuyi Zhang, Yuanzhi Zhu, Dezhi Peng, Peirong Zhang, Zhenhua Yang, Zhibo Yang, Cong Yao, Lianwen Jin
School of Electronic and Information Engineering, South China University of Technology, Guangzhou, China. Alibaba DAMO Academy, Hangzhou, China.
Abstract

Text recognition, especially for complex scripts like Chinese, faces unique challenges due to its intricate character structures and vast vocabulary. Traditional one-hot encoding methods struggle with the representation of hierarchical radicals, recognition of Out-Of-Vocabulary (OOV) characters, and on-device deployment due to their computational intensity. To address these challenges, we propose HierCode, a novel and lightweight codebook that exploits the innate hierarchical nature of Chinese characters. HierCode employs a multi-hot encoding strategy, leveraging hierarchical binary tree encoding and prototype learning to create distinctive, informative representations for each character. This approach not only facilitates zero-shot recognition of OOV characters by utilizing shared radicals and structures but also excels in line-level recognition tasks by computing similarity with visual features, a notable advantage over existing methods. Extensive experiments across diverse benchmarks, including handwritten, scene, document, web, and ancient text, have showcased HierCode’s superiority for both conventional and zero-shot Chinese character or text recognition, exhibiting state-of-the-art performance with significantly fewer parameters and fast inference speed.

keywords:
Chinese text recognition, Zero-shot learning, Hierarchical information embedding, Optical character recognition
journal: Pattern Recognition

1 Introduction

Text recognition is a fundamental task in computer vision that has been intensively studied for decades shi2016end ; xie2017pami ; LIU2019Curved ; chen2021scene ; peng2022pagenet ; Wang2023MM ; YANG2024PR . Despite the success of prior methods of Chinese Text Recognition (CTR) huang2021zero ; Peng2022segment ; lyu2022maskocr , a majority of them still inherit the one-hot encoding paradigm from English text recognition approaches. This paradigm may not be the most suitable for CTR, primarily due to the following reasons:

First, one-hot encoding falls short in comprehensive feature representation of Chinese characters. Chinese script, known for its high information entropy among prevalent languages wong1976comment , holds significantly more information in its characters compared to Latin counterparts, especially in the intricate hierarchical information such as structures and radicals. Nevertheless, one-hot encoding allocates only one valid bit for each Chinese character, which fails to express the hierarchical richness within Chinese characters, leading to an extensive loss of critical structural and semantic information.

Second, models reliant on one-hot encoding are incapable of zero-shot recognition, which is a crucial capability given the extensive and ever-growing lexicon of Chinese characters. For instance, the latest Chinese standard, GB18030-2022 (https://openstd.samr.gov.cn/bzgk/gb/), contains 87,887 categories, a significant increase from the 27,533 in the GB18030-2000 standard. Consequently, to fully recognize all Chinese characters, models are required to perform Out-Of-Vocabulary (OOV) character recognition, also known as zero-shot recognition, in which characters in testing are not seen in training. One-hot encoding, however, is inherently restricted to representing a limited range of characters, thus preventing models built upon it from performing zero-shot recognition. To address the zero-shot problem, previous methods explored ways to leverage the glyph HUANG2022Hipp ; AO2022gl , radicals liu2013online ; zhang2018radical ; wang2019fewshotran ; ZHANG2020radical ; cao2020hde ; LUO2023cue ; LI2024side or strokes SU2003stroke ; LIU20012stroke ; chen2021stroke information of Chinese characters. However, these approaches primarily focus on character-level zero-shot recognition and do not easily extend to the recognition of text lines, a task associated with a broader spectrum of practical application scenarios.

Moreover, the one-hot encoding approach introduces significant barriers to deployment due to the expansive size of the classification layer required. Generally, with an increasing number of character classes, the classification layer becomes excessively large, accounting for a disproportionate amount of the model’s parameters. For example, in systems such as PP-OCR du2020pp , the classification layer for 20,000 character categories can constitute over 60% of the model’s total parameters. This presents considerable challenges in terms of computational efficiency and deployment on devices with limited resources. To decrease the parameter size, Hamming-OCR li2020hamming and EMU li2022effective propose different multi-hot encoding strategies as alternatives to one-hot encoding. However, these methods have not effectively captured the hierarchical information inherent in Chinese characters.

Figure 1: Comparison between the framework of one-hot encoding methods (a) and that of the proposed method (b).

In this paper, we propose a novel hierarchical codebook, HierCode, to address all these issues. We first leverage binary trees to represent the hierarchical information within Chinese characters, including the structures and radicals, as shown in Fig. 1 (b). Then, we employ a RAN wang2019fewshotran model to derive a full set of radical prototypes, from which all Chinese characters can be composed. Subsequently, we generate a unique and robust multi-hot encoding representation for each Chinese character. By collecting the encodings of all characters, we establish an integrated codebook named HierCode. In the training phase, HierCode is employed to supervise the recognition model, which is a traditional encoder-decoder framework but substitutes the one-hot classification layer with a multi-hot alternative. In the inference phase, HierCode’s multi-hot encoding allows the model to compute the similarity between HierCode and visual features to match the ultimate prediction for each character in a textual image. This mechanism inherently supports character-level zero-shot recognition and can be seamlessly applied to line-level recognition tasks. Simultaneously, the multi-hot approach also ensures that HierCode remains lightweight, using fewer bits to represent characters, thus increasing the model’s inference speed without compromising on performance. Through extensive experiments on a variety of benchmarks, including handwritten, scene, document, web, and ancient texts, HierCode has demonstrated superior performance in both standard and zero-shot CTR tasks. The results reveal that HierCode not only outperforms many existing methods but also offers advantages in terms of model footprint and inference efficiency.

The contributions of this paper are three-fold:

  1. We propose a hierarchical codebook named HierCode, which provides unique and informative representations for each Chinese character through hierarchical encoding and prototype learning.

  2. The hierarchical combination of radical features of Chinese characters enables the model to handle zero-shot Chinese recognition at both the character and line levels. Moreover, the multi-hot encoding employed by HierCode is lightweight and fast at inference, thereby significantly enhancing its practical applicability.

  3. Extensive experiments conducted on diverse datasets demonstrate that HierCode not only achieves state-of-the-art accuracy in zero-shot Chinese character recognition and outperforms the majority of existing approaches in Chinese text recognition, but also exhibits fast inference speed and a small footprint, facilitating the development of lightweight Chinese text recognition networks.

2 Related Work

2.1 Chinese Character Recognition Methods

Chinese Character Recognition (CCR) methods have experienced significant development over several decades. Early CCR methods mainly relied on hand-crafted features **2001study ; su2003novel ; chang2006techniques . With the advancements in deep learning, Convolutional Neural Networks (CNN) are now widely used for feature extraction and achieve exceptional performance cirecsan2015multi ; xiao2017building ; LI2020gl . Although CNN-based techniques generally prove effective, they tend to struggle with zero/few-shot CCR problems. Given that Chinese characters can be hierarchically decomposed into sequences of structures and radicals, numerous zero/few-shot recognition techniques transform the CCR problem into a sequence prediction problem, thereby enabling the recognition of Out-Of-Vocabulary (OOV) characters. For instance, Wang et al. wang2017radical utilized multi-label learning to recognize radicals. Zhang et al. zhang2018radical and Wang et al. wang2018denseran employed spatial attention mechanisms to decode structure and radical sequences from image features. Wang et al. wang2019fewshotran mapped radicals to the feature space and aggregated radical features via prototype learning. Cao et al. cao2020hde considered the hierarchical decomposition information while designing embedding rules for Chinese characters, facilitating zero-shot CCR. Chen et al. chen2021stroke decomposed Chinese characters into finer strokes and utilized printed character image matching to tackle the one-to-many problem between stroke sequences and Chinese characters. Luo et al. LUO2023cue quantified the significance of radicals in CCR from the information theory perspective, thereby improving recognition accuracy in zero-shot CCR experiments. While these methods effectively address zero/few-shot recognition for Chinese characters, their complex networks and decoding strategies render them unsuitable for line-level recognition, which limits their practical applicability.

2.2 Chinese Text Recognition Methods

Initially, Chinese Text Recognition (CTR) methods output entire text line results through character recognition based on sliding-window wang2012end and segmentation techniques bissacco2013photoocr ; jaderberg2014deep . Subsequently, Shi et al. shi2016end introduced CRNN, treating the text line holistically and predicting character sequences directly from the input image via an encoder-decoder framework. Later, various methods hu2020gtc ; yousef2020accurate ; chen2020multrenets enhanced and extended the CRNN framework, applying it to CTR xie2017pami ; liu2021searching ; Z_Wang_Writer . Concurrently, attention-based methods emerged shi2018aster ; luo2019moran ; wang2020decoupled ; LIN2021STAN , achieving breakthroughs in irregular text recognition. Meanwhile, several newer methods fang2021read ; wang2022petr ; Peng2022segment incorporate powerful Transformer-based language models vaswani2017attention to improve performance. However, despite their commendable performance on Latin benchmarks, these methods encounter significant challenges when applied to Chinese benchmarks. Specifically, these methods primarily depend on the one-hot encoding strategy, which falls short of capturing the hierarchical information embedded within Chinese characters. Moreover, these techniques do not excel in zero-shot recognition. To tackle these challenges, ZCTRN huang2021zero introduced a class embedding module inspired by HDE cao2020hde , and Yu et al. yu2023ctrclip devised a pre-trained CLIP-like model for aligning printed character images and ideographic description sequences. These methods generate canonical feature representations for each Chinese character and achieve zero-shot recognition at the line level by matching visual features with character representations. However, their complex architectures lead to a significant increase in parameter size. This inspires us to explore a more lightweight encoding method for Chinese characters.

2.3 Lightweight Text Recognition Methods

In recent times, there has been extensive research on developing lightweight text recognition methods to enable the deployment of text recognition algorithms on mobile devices. For example, PP-OCR du2020pp reduced the model’s weight by decreasing the number of channels in CRNN shi2016end . Hamming-OCR li2020hamming introduced a hash encoding approach for Chinese characters, replacing the traditional one-hot encoding with multi-hot encoding. This approach significantly reduced the parameters of the classification layer in text recognizers. Consequently, it achieved a smaller model size compared to PP-OCR. Based on Hamming-OCR, EMU li2022effective implemented hash encoding for Chinese characters and employed a progressive binarization strategy to improve recognition accuracy compared to Hamming-OCR. While these previous methods effectively reduced the model’s parameters, they often resulted in decreased performance.

3 Methodology

3.1 Hierarchical Representation of Chinese Characters

In contrast to the characters of English and other Latin-script languages, Chinese characters embody a more complex structure, each comprising a set of radicals characterized by unique spatial arrangements zhang2018radical ; wang2018denseran ; wang2019fewshotran . These radicals and their configurations can be systematically categorized into twelve distinct structural types as per Unicode standards, such as above-to-below alignment, left-to-right alignment, etc., as shown in Fig. 2 (a). For practicality and computational efficiency, our methodology simplifies the encoding of certain ternary structures by deconstructing them into binary equivalents, ultimately consolidating them into ten primary structural categories for our encoding purposes.

Figure 2: Illustration of (a) 12 Chinese character structures and (b) the hierarchical decomposition of a Chinese character.

To encode each character, we first decompose it into discrete structures and radicals. Then we unfold these components with a binary tree, where structures reside at non-leaf nodes and radicals reside at leaf nodes, as shown in Fig. 2 (b). This results in unique binary trees for different characters, varying in width, depth, and node arrangements. Given that the set of radicals and structures is universally shared among all Chinese characters, this hierarchical representation is comprehensive, capturing the entirety of the Chinese lexicon.
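The decomposition described above can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' implementation. It assumes that a character's decomposition is available as a prefix-notation string built from the ten binary Unicode ideographic description characters (the exact ten categories used in the paper are not listed in this section, so this set is an assumption), and the names `Node`, `parse_ids`, `depth`, and `radicals` are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional

# Assumption: the ten binary Unicode ideographic description characters.
BINARY_STRUCTURES = set("⿰⿱⿴⿵⿶⿷⿸⿹⿺⿻")


@dataclass
class Node:
    label: str                      # a structure symbol or a radical
    left: Optional["Node"] = None
    right: Optional["Node"] = None

    @property
    def is_leaf(self) -> bool:
        return self.left is None and self.right is None


def parse_ids(ids: str) -> Node:
    """Parse a prefix-notation decomposition string into a binary tree."""
    pos = 0

    def build() -> Node:
        nonlocal pos
        symbol = ids[pos]
        pos += 1
        if symbol in BINARY_STRUCTURES:
            left = build()          # structures sit at non-leaf nodes
            right = build()
            return Node(symbol, left, right)
        return Node(symbol)         # radicals sit at leaf nodes

    root = build()
    if pos != len(ids):
        raise ValueError("trailing symbols in decomposition string")
    return root


def depth(node: Optional[Node]) -> int:
    """Number of levels in the tree (a single radical has depth 1)."""
    if node is None:
        return 0
    return 1 + max(depth(node.left), depth(node.right))


def radicals(node: Node) -> List[str]:
    """Leaf radicals, read from left to right."""
    if node.is_leaf:
        return [node.label]
    return radicals(node.left) + radicals(node.right)


hao = parse_ids("⿰女子")        # 好: left-to-right structure
sen = parse_ids("⿱木⿰木木")    # 森: above-to-below, then left-to-right
```

Here 好 yields a tree of depth 2 with two radicals, while 森 yields a tree of depth 3 with three radicals, which illustrates how trees differ in depth and width across characters.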

3.2 The Construction of HierCode

Fig. 3 gives an overview of the construction of the proposed HierCode, a multi-hot codebook consisting of the hierarchical representations of a given set of Chinese characters.

First, according to Sec. 3.1, each Chinese character has a unique hierarchical binary tree with distinctive width and depth. To accommodate different characters into a codebook, we normalize the character encoding lengths by setting a blank binary tree with a maximum depth of $D$ and a maximum width of $2^{D-1}$, and denote it as the full tree. We then extract the hierarchical representation for each character and populate the full tree accordingly, as depicted in Fig. 3 (a). Second, corresponding to the structures and radicals in the hierarchical features, we extract the structural features and radical features for each Chinese character.

Figure 3: A schematic overview of HierCode. The ‘bl.’ in (a) represents the empty node used to fill the binary tree.

Structural Features. The structural features capture the structural information of the Chinese character. We establish encodings of fixed length $L_{\mathbf{S}}$ for the 10 structures outlined in Sec. 3.1, resulting in the formation of a structural code set $\mathbf{S}$. Given the constraint of a fixed number of character structures (set to 10), we can efficiently encode them using just four bits. Therefore, we set $L_{\mathbf{S}}$ as 4 and manually assign codes to each structure. Subsequently, to obtain the encoding of the structural features, we traverse the nodes of the binary tree with the breadth-first algorithm, as shown in Fig. 3(a). During this traversal, if a structure is encountered on a node, the node is represented by its corresponding structure code from the structural code set $\mathbf{S}$. On the other hand, if a radical or an empty node is encountered, the node is represented by an all-zero code $blank_{\mathbf{S}}$ with length $L_{\mathbf{S}}$. The final result is the encoding of the structural features, depicted in the blue region of Fig. 3(b). Since structures can only appear on non-leaf nodes, given the maximum depth $D$, there are a total of $2^{D-1}-1$ structure codes. Therefore, the structural features $\mathbf{C_S}$ can be formulated as:

$$\mathbf{C_S} = [S_1, S_2, \ldots, S_{2^{D-1}-1}], \quad S_i \in \mathbf{S} \cup blank_{\mathbf{S}}. \tag{1}$$
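The breadth-first encoding of Eq. 1 can be sketched as follows. This is an illustrative sketch under several assumptions: trees are written as nested tuples `(structure, left, right)` with radicals as plain strings, the toy depth is `D = 3`, bits are written as 0/1, and the assignment of 4-bit codes to structures is hypothetical (the paper assigns its codes manually and does not list them here). The helper names `fill_full_tree` and `structural_features` are likewise hypothetical.

```python
D = 3      # toy maximum depth, chosen only to keep the example small
L_S = 4    # bits per structure code, as in the text

# Hypothetical assignment: the k-th structure gets the 4-bit binary form of k+1,
# so that the all-zero code stays reserved for blank_S.
STRUCTURES = "⿰⿱⿴⿵⿶⿷⿸⿹⿺⿻"
STRUCT_CODE = {s: [int(b) for b in format(i + 1, "04b")]
               for i, s in enumerate(STRUCTURES)}
BLANK_S = [0] * L_S


def fill_full_tree(tree, max_depth):
    """Place the nodes of `tree` into a heap-indexed full binary tree.

    Slot i has children 2i+1 and 2i+2; empty slots stay None.
    """
    slots = [None] * (2 ** max_depth - 1)

    def place(node, index):
        if index >= len(slots):
            raise ValueError("tree is deeper than max_depth")
        if isinstance(node, tuple):
            symbol, left, right = node
            slots[index] = symbol
            place(left, 2 * index + 1)
            place(right, 2 * index + 2)
        else:
            slots[index] = node

    place(tree, 0)
    return slots


def structural_features(tree, max_depth=D):
    """Concatenate one L_S-bit code per non-leaf position, in breadth-first order."""
    slots = fill_full_tree(tree, max_depth)
    n_internal = 2 ** (max_depth - 1) - 1   # positions on levels 1 .. D-1
    code = []
    for label in slots[:n_internal]:        # heap order equals breadth-first order
        # Radicals and empty slots are not in STRUCT_CODE and map to blank_S.
        code.extend(STRUCT_CODE.get(label, BLANK_S))
    return code


sen = ("⿱", "木", ("⿰", "木", "木"))   # 森
c_s = structural_features(sen)
```

With `D = 3` there are $2^{D-1}-1 = 3$ non-leaf positions, so `c_s` has 12 bits: one code for the root structure, one blank for the radical on the second level, and one code for the nested structure.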

Radical Features. We leverage prototype learning to acquire unique and informative representations for radicals, in which a RAN wang2019fewshotran is trained to extract radical prototypes. Subsequently, each radical prototype is binarized into -1 or 1, generating the set of radical codes $\mathbf{R}$. We extract the radicals from the binary tree in the hierarchical traversal order, convert them to the corresponding code from set $\mathbf{R}$, and arrange them horizontally from left to right, as shown in Fig. 3(b). Furthermore, considering that various Chinese characters consist of differing numbers of radicals, we pad the all-zero code $blank_{\mathbf{R}}$ with length $L_{\mathbf{R}}$ to the maximum number of radicals $M$. Therefore, the radical features $\mathbf{C_R}$ can be represented as:

$$\mathbf{C_R} = [R_1, R_2, \ldots, R_M], \quad R_i \in \mathbf{R} \cup blank_{\mathbf{R}}. \tag{2}$$
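The binarization and padding behind Eq. 2 can be illustrated with a short sketch. The prototype vectors below are made-up numbers standing in for the output of a trained RAN, the sizes `L_R = 6` and `M = 4` are toy values, and thresholding at zero is an assumed binarization rule; none of these come from the paper.

```python
L_R = 6   # toy radical code length
M = 4     # toy maximum number of radicals per character

# Made-up real-valued prototypes standing in for learned radical prototypes.
PROTOTYPES = {
    "女": [0.7, -1.2, 0.1, -0.3, 2.0, -0.5],
    "子": [-0.4, 0.9, -0.8, 1.1, -0.2, 0.6],
    "木": [1.3, 0.2, -0.6, -1.0, 0.4, -0.1],
}


def binarize(prototype):
    """Map each component to +1 or -1 (assumed rule: threshold at zero)."""
    return [1 if x >= 0 else -1 for x in prototype]


RADICAL_CODE = {radical: binarize(p) for radical, p in PROTOTYPES.items()}


def radical_features(radical_sequence, max_radicals=M, code_len=L_R):
    """Concatenate radical codes left to right, then pad with all-zero blanks."""
    if len(radical_sequence) > max_radicals:
        raise ValueError("more radicals than the codebook allows")
    code = []
    for radical in radical_sequence:
        code.extend(RADICAL_CODE[radical])
    n_blank = max_radicals - len(radical_sequence)
    code.extend([0] * (code_len * n_blank))   # blank_R padding
    return code


c_r = radical_features(["女", "子"])   # radicals of 好
```

Every character ends up with a radical code of the same length $M \times L_{\mathbf{R}}$, regardless of how many radicals it contains.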

The complete multi-hot encoding $\mathbf{C}$ for each character consists of the structural features $\mathbf{C_S}$ and the radical features $\mathbf{C_R}$, and can be described as:

$$\mathbf{C} = (c_1, c_2, \ldots, c_t) = (\mathbf{C_S}, \mathbf{C_R}), \quad c_i \in \{-1, 1\}, \tag{3}$$

where $(\cdot)$ represents the concatenation process for different code bits or sequences, $i$ is the index, and $c_i$ denotes the $i$-th bit in $\mathbf{C}$. $t$ is the length of the multi-hot encoding and can be calculated as:

$$t = (2^{D-1}-1) \times L_{\mathbf{S}} + M \times L_{\mathbf{R}}, \tag{4}$$

where $L_{\mathbf{S}}$ and $L_{\mathbf{R}}$ represent the lengths of the structural and radical codes, respectively, and $M$ is the number of radicals in the Chinese character with the most radicals. By collecting the multi-hot encodings of all characters, we establish the HierCode $\mathbf{H}$, which can be expressed as:

$$\mathbf{H} = \begin{bmatrix} C_1, C_2, \dots, C_N \end{bmatrix}^T \tag{5}$$

where $N$ represents the number of character categories. Refer to Fig. 3 (c) for a visual depiction.

3.3 Text Recognition with HierCode

Fig. 4 illustrates the pipeline for text recognition using HierCode. Given a text image $x$ as input, the recognizer outputs a sequence of visual features $\mathbf{V} = \{v_1, v_2, \ldots, v_W\}$, with $v_i \in \mathbb{R}^C$, where $W$ represents the width of the feature map and $C$ is the output channel size. Subsequently, the multi-hot classification layer $N_{cls}$ performs a binarization operation on $\mathbf{V}$ and outputs a sequence of binarized vectors, which can be expressed as:

$$\mathbf{B} = \{b_1, b_2, \ldots, b_W\} = N_{cls}(\mathbf{V}), \quad b_i \in \{-1, 1\}^C \tag{6}$$

Figure 4: The overall architecture for text recognition using HierCode.

Specifically, the multi-hot classification layer consists of a fully connected layer and an activation function. The activation function can be Sigmoid or Tanh, which activate each bit independently. Our empirical study shows that the two activation functions perform comparably, and in this work, we use Tanh.
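A minimal sketch of such a layer, written in plain Python to stay self-contained, is given below. The weights are random placeholders, the sizes `C_IN = 8` and `T = 12` are toy values, and the final hard thresholding step (`hard_binarize`) is an assumption about how soft Tanh outputs could be mapped to $\{-1, 1\}$; it is not taken from the paper.

```python
import math
import random


def multi_hot_layer(feature, weight, bias):
    """Fully connected layer followed by element-wise Tanh.

    Each output bit is activated independently of the others.
    """
    return [
        math.tanh(sum(w * x for w, x in zip(row, feature)) + b)
        for row, b in zip(weight, bias)
    ]


def hard_binarize(soft_code):
    """Assumed post-processing: map soft outputs in [-1, 1] to {-1, +1}."""
    return [1 if s >= 0 else -1 for s in soft_code]


random.seed(0)
C_IN, T = 8, 12   # toy feature width and code length
weight = [[random.uniform(-1, 1) for _ in range(C_IN)] for _ in range(T)]
bias = [0.0] * T
feature = [random.uniform(-1, 1) for _ in range(C_IN)]

soft = multi_hot_layer(feature, weight, bias)   # one value per code bit
hard = hard_binarize(soft)
```

Compared with a one-hot classifier, the output width of this layer is the code length rather than the number of character classes.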

To solve the alignment problem caused by unequal label lengths of text, we modify the vanilla CTC loss graves2009novel and propose a similarity-based CTC loss $L_{CTC-sim}$ to train the HierCode-based text recognition model, which can be expressed as:

$$d(\mathbf{H}^T, \mathbf{B}) = \mathbf{H}^T \cdot \mathbf{B}, \quad \mathbf{H} \in \mathbb{R}^{C \times N}, \quad \mathbf{B} \in \mathbb{R}^{C \times W} \tag{7}$$

$$L_{CTC-sim} = -\sum \log p\left(l \mid d(\mathbf{H}^T, \mathbf{B})\right), \tag{8}$$

where $l$ is the ground truth label, and $d(\mathbf{H}^T, \mathbf{B})$ is the inner product that measures the similarity between HierCode $\mathbf{H}$ and the predicted multi-hot code $\mathbf{B}$.

When HierCode is employed in character-level, attention-based, and Transformer-based text recognizers, we propose a similarity-based cross-entropy loss $L_{CE-sim}$ as:

$$L_{CE-sim} = -\sum \log p\left(l \mid d(\mathbf{H}^T, \mathbf{B})\right). \tag{9}$$
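One plausible reading of Eq. 9 for a single character is sketched below: the inner products of Eq. 7 are treated as logits, a softmax over the codebook entries gives $p$, and the loss is the negative log-probability of the ground-truth class. The softmax interpretation is an assumption, and the three 4-bit codes are toy values rather than real HierCode entries.

```python
import math


def similarity(codebook, predicted_code):
    """Inner product between the predicted code and every codebook row (Eq. 7)."""
    return [sum(c * b for c, b in zip(code, predicted_code)) for code in codebook]


def ce_sim_loss(codebook, predicted_code, label):
    """Negative log-softmax of the similarity scores at the ground-truth index."""
    scores = similarity(codebook, predicted_code)
    top = max(scores)
    # log-sum-exp with the maximum subtracted for numerical stability
    log_z = top + math.log(sum(math.exp(s - top) for s in scores))
    return log_z - scores[label]


codebook = [
    [1, 1, -1, -1],
    [1, -1, 1, -1],
    [-1, -1, 1, 1],
]
predicted = [0.9, 0.8, -0.7, -0.9]   # soft output close to the first code

loss_correct = ce_sim_loss(codebook, predicted, label=0)
loss_wrong = ce_sim_loss(codebook, predicted, label=1)
```

The loss is small when the predicted code aligns with the ground-truth row of the codebook and grows as the prediction drifts toward another character's code.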

In the inference stage, we obtain the predicted codes and calculate their similarity with each character in the codebook by Eq. 7. The character exhibiting the highest similarity is selected as the final prediction.
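The matching step, and the reason it extends to unseen characters, can be sketched as follows. The 6-bit codes are toy values and `predict` is a hypothetical helper; the point of the sketch is only that recognizing a new character requires appending its code to the codebook, with no change to the network.

```python
def predict(codebook, characters, predicted_code):
    """Return the character whose code has the largest inner product with the prediction."""
    scores = [sum(c * b for c, b in zip(code, predicted_code)) for code in codebook]
    best = max(range(len(scores)), key=scores.__getitem__)
    return characters[best]


# Toy codebook covering the characters seen in training.
characters = ["好", "森"]
codebook = [
    [1, -1, 1, -1, 0, 0],
    [1, 1, -1, 1, 1, 1],
]

predicted = [-0.8, 0.9, 0.7, -0.6, 0.5, -0.9]   # network output for some image
before = predict(codebook, characters, predicted)

# Zero-shot extension: append the code of an unseen character, built from
# the shared structure and radical codes, and match again.
characters.append("林")
codebook.append([-1, 1, 1, -1, 1, -1])
after = predict(codebook, characters, predicted)
```

In a line-level recognizer the same matching would be applied to each position of the output sequence before decoding.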

4 Experiments

4.1 Datasets

ICDAR2013 yin2013icdar is a handwritten Chinese competition database, which contains subsets of text line data (denoted as ICDAR-line) and isolated character data (denoted as ICDAR-char), and we use these two subsets as the test set.

CASIA-HWDB liu2011casia is a large-scale Chinese handwritten database. In this study, we use the text line part (HWDB 2.0-2.2) and the isolated character part (HWDB 1.0-1.2) as the training set for ICDAR2013.

BCTR chen2021bctr is a large-scale Chinese text image benchmark, which consists of four subsets, i.e., scene, web, document (denoted as Doc), and handwriting (denoted as Handw).

MTHv2 ma2020joint contains 105,579 text line images collected from ancient Chinese scriptures.

CTW Yuan2019ctw contains 812,872 Chinese character instances collected from street views.

4.2 Implementation Details

Network Architecture. For character-level recognition, we adopt ResNet18 he2016deep as our backbone. For line-level recognition, we utilize ResNet34 he2016deep as our backbone, followed by a BLSTM Hochreiter1997lstm layer. The proposed hyper-parameters $D$, $L_{\mathbf{S}}$, $L_{\mathbf{R}}$, $M$, and $t$ are set to 5, 4, 36, 9, and 384, respectively, and will be analyzed in Sec. 4.6. In text line recognition experiments, the height of the training images is resized to 128, and the width is calculated with the original aspect ratio (up to 1920). The data pre-processing strategy described in Peng2022segment is adopted for ICDAR-line. Specifically, we rectify the text-line image to make the text horizontal and remove white padding on the top and bottom of the image to highlight the text. For character recognition, each input image is resized to 96$\times$96.
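The reported code length can be checked directly against Eq. 4. The short sketch below also compares the weight count of a one-hot and a multi-hot output layer; the input width of 512 and the class count of 20,000 are hypothetical values chosen for illustration and are not the paper's configuration.

```python
def code_length(D, L_S, L_R, M):
    """Length t of a multi-hot code according to Eq. 4."""
    return (2 ** (D - 1) - 1) * L_S + M * L_R


def linear_params(in_features, out_features):
    """Weights plus biases of a fully connected layer."""
    return in_features * out_features + out_features


t = code_length(D=5, L_S=4, L_R=36, M=9)   # 15 * 4 + 9 * 36

# Hypothetical sizes, for illustration only.
FEATURE_WIDTH = 512
NUM_CLASSES = 20_000

one_hot_params = linear_params(FEATURE_WIDTH, NUM_CLASSES)
multi_hot_params = linear_params(FEATURE_WIDTH, t)
```

With these settings the structural part contributes 60 bits and the radical part 324 bits, giving $t = 384$ as stated above. Under the hypothetical sizes, the output layer shrinks from about 10.3 million parameters to about 0.2 million.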

Optimization. The proposed method is implemented using PyTorch. We apply the Adadelta optimizer with an initial learning rate of 0.1. The batch size is set to 128. All experiments are conducted on an NVIDIA GeForce GTX 1080 Ti GPU with 11 GB of memory.

Table 1: Comparison of recognition performance (%), speed, and model size with previous methods on the ICDAR-line dataset.

| Methods | AR ↑ | CR ↑ | Speed ↑ | Model Size ↓ |
|---|---|---|---|---|
| Messina et al. R_Messina_Segmentation | 83.50 | - | - | - |
| Wu et al. Y_Wu_Handwritten | 86.64 | 87.43 | - | 71 MB |
| Du et al. J_Du_Deep | 83.89 | - | - | - |
| Wang et al. S_Wang_Deep | 88.79 | 90.67 | - | - |
| Wang et al. Z_Wang_A_Comprehensive | 89.66 | - | - | - |
| Xie et al. Z_Xie_Aggregation | 91.25 | 91.68 | - | - |
| Peng et al. D_Peng_Fast | 89.61 | 90.52 | - | - |
| Xiu et al. Y_Xiu_A_Handwritten | 88.74 | - | - | - |
| Xie et al. C_Xie_High | 91.55 | 92.13 | 64 fps | 61 MB |
| Wang et al. Z_Wang_Writer | 91.58 | - | - | - |
| Wang et al. Z_Wang_Weakly | 87.00 | 89.12 | - | - |
| Zhu et al. Z_Zhu_Attention | 90.86 | - | - | - |
| Liu et al. huang2021zero | 93.62 | - | - | 203 MB |
| Peng et al. Peng2022segment | 94.50 | 94.76 | 70 fps | 119 MB |
| HierCode (Ours) | 94.68 | 95.11 | 161 fps | 90 MB |

Evaluation. For experiments on the CASIA-HWDB and ICDAR-line datasets, the evaluation criteria are the accuracy rate (AR) and correct rate (CR) specified by the ICDAR2013 competition yin2013icdar . For experiments on the BCTR dataset, we follow the same process as in lyu2022maskocr and compute the sentence-level accuracy over each subset and the whole dataset. We further evaluate the recall of Chinese characters (ReC) and non-Chinese characters (ReL) in the text line experiments. Non-Chinese characters mainly include Latin letters, digits, and symbols, and their performance is used to verify the generalization of HierCode in text recognition. Furthermore, AR-zero refers to the evaluation criterion of the line-level zero-shot setting in ZCTRN huang2021zero . For character recognition experiments, we use character accuracy as the evaluation criterion.
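For reference, AR and CR are commonly computed from an edit-distance alignment between the recognized line and the ground truth: with $N_t$ ground-truth characters and $S_e$, $D_e$, $I_e$ substitution, deletion, and insertion errors, $CR = (N_t - D_e - S_e)/N_t$ and $AR = (N_t - D_e - S_e - I_e)/N_t$. The sketch below follows these definitions; it is not the official evaluation script, and the helper names `edit_counts` and `ar_cr` are hypothetical. When several minimum-cost alignments exist, the split between error types (and hence CR) depends on tie-breaking, which this sketch resolves by tuple comparison.

```python
def edit_counts(reference, hypothesis):
    """Return (substitutions, deletions, insertions) of a minimum-edit alignment."""
    # Each cell holds (total_cost, substitutions, deletions, insertions).
    prev = [(j, 0, 0, j) for j in range(len(hypothesis) + 1)]
    for i in range(1, len(reference) + 1):
        cur = [(i, 0, i, 0)]
        for j in range(1, len(hypothesis) + 1):
            if reference[i - 1] == hypothesis[j - 1]:
                diagonal = prev[j - 1]                  # match, no cost
            else:
                c, s, d, n = prev[j - 1]
                diagonal = (c + 1, s + 1, d, n)         # substitution
            c, s, d, n = prev[j]
            deletion = (c + 1, s, d + 1, n)             # reference char missing
            c, s, d, n = cur[j - 1]
            insertion = (c + 1, s, d, n + 1)            # extra hypothesis char
            cur.append(min(diagonal, deletion, insertion))
        prev = cur
    _, subs, dels, ins = prev[-1]
    return subs, dels, ins


def ar_cr(references, hypotheses):
    """Accuracy rate and correct rate accumulated over a set of text lines."""
    total = subs = dels = ins = 0
    for ref, hyp in zip(references, hypotheses):
        s, d, n = edit_counts(ref, hyp)
        total += len(ref)
        subs, dels, ins = subs + s, dels + d, ins + n
    ar = (total - dels - subs - ins) / total
    cr = (total - dels - subs) / total
    return ar, cr


counts = edit_counts("今天天气很好", "今天气很好啊")
ar, cr = ar_cr(["今天天气很好"], ["今天气很好啊"])
```

In this example one character is dropped and one is inserted, so CR counts five of six characters as correct while AR additionally penalizes the insertion.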

4.3 Recognition Performance Comparison with State-of-the-Art Methods

Handwritten Chinese text recognition. We commence with experiments on the ICDAR-line dataset, with results reported in Tab. 1. Compared with the previous SOTA method Peng2022segment , our method improves AR by 0.18% and CR by 0.35%, while also running more than twice as fast (161 fps vs. 70 fps) with a smaller model (90 MB vs. 119 MB).

Table 2: Comparison of sentence-level recognition accuracy (%) with previous methods on the BCTR dataset. ‘*’ indicates that the method uses additional data. The ‘$\Delta$’ row denotes the improvements of HierCode over the one-hot baseline on each subset and on average (Avg). The first eight results are derived from lyu2022maskocr .

| Methods | Scene | Web | Doc | Handw | Avg |
|---|---|---|---|---|---|
| CRNN shi2016end | 53.4 | 54.5 | 97.5 | 46.4 | 67.0 |
| ASTER shi2018aster | 54.5 | 52.3 | 93.1 | 38.9 | 64.7 |
| MORAN luo2019moran | 51.8 | 49.9 | 95.8 | 39.7 | 64.3 |
| SAR Li2019show | 62.5 | 54.3 | 93.8 | 31.4 | 67.3 |
| SRN yu2020towards | 60.1 | 52.3 | 96.7 | 18.0 | 65.0 |
| SEED qiao2020seed | 49.6 | 46.3 | 93.7 | 32.1 | 61.2 |
| TransOCR chen2021scene | 63.3 | 62.3 | 96.9 | 53.4 | 72.8 |
| MaskOCR* lyu2022maskocr | 76.2 | 76.8 | 99.4 | 67.9 | 82.6 |
| ABINet fang2021read | 64.4 | 67.4 | 97.2 | 54.8 | 74.1 |
| One-hot (Baseline) | 60.3 | 60.2 | 92.8 | 54.1 | 70.0 |
| HierCode (Ours) | 63.7 | 66.2 | 98.2 | 56.3 | 74.2 |
| $\Delta$ | +3.4 | +6.0 | +5.4 | +2.2 | +4.2 |

Multi-scenario text recognition. We further evaluate the efficacy of HierCode across a broader spectrum of text recognition scenarios, which comprises four distinct text types: scene, web, document, and handwritten. Results are given in Tab. 2. Compared to the one-hot encoding baseline, HierCode demonstrated notable improvements in recognition accuracy across all scenarios: an increase of 3.4% for scene text, 6.0% for web text, 5.4% for document text, and 2.2% for handwritten text. These results underscore the versatility and robustness of HierCode in handling diverse text recognition tasks.

Compared with prevailing methods trained without additional data, HierCode achieves the best results on the document and handwriting subsets. On scene and web text, HierCode is marginally outperformed by ABINet fang2021read . This gap may be attributed to the complex and diverse backgrounds in scene and web data, which make recognition harder for our relatively simple backbone, whereas ABINet is specifically tailored for scene text recognition. It should be noted that although MaskOCR lyu2022maskocr achieves higher accuracy, it relies on an extensive corpus of additional data for pretraining, which makes the comparison unfair, given that HierCode does not involve a pretraining phase.

Table 3: Comparison of the recall rate (%) of Chinese (ReC) and non-Chinese characters (ReL) between one-hot and HierCode. The ‘$\Delta$’ rows denote the improvements over each subset. The Scene, Web, Doc, and Handw columns are the BCTR subsets.

| Metric | Methods | ICDAR-line | BCTR Scene | BCTR Web | BCTR Doc | BCTR Handw |
|---|---|---|---|---|---|---|
| ReC | One-hot | 93.35 | 82.09 | 79.57 | 98.64 | 91.65 |
| ReC | HierCode | 94.53 | 83.41 | 83.39 | 99.71 | 92.35 |
| ReC | $\Delta$ | +1.18 | +1.32 | +3.82 | +1.07 | +0.70 |
| ReL | One-hot | 85.56 | 90.24 | 84.67 | 99.37 | 86.59 |
| ReL | HierCode | 85.65 | 90.27 | 85.19 | 99.54 | 86.61 |
| ReL | $\Delta$ | +0.09 | +0.03 | +0.52 | +0.17 | +0.02 |

Furthermore, we analyze the line-level recognition performance on Chinese and non-Chinese characters, as presented in Tab. 3. On each dataset, the recall of Latin characters, numbers, and symbols with HierCode is on par with that of the one-hot method, whereas HierCode exhibits a clear improvement in the recognition of Chinese characters. This observation supports the premise that HierCode’s performance gains stem primarily from its hierarchical representation of Chinese characters, which is consistent with the design objectives of our method.

4.4 Zero-shot Capability

Zero-shot character-level recognition. We keep the same settings as the previous methods wang2018denseran ; cao2020hde ; chen2021stroke ; LUO2023cue . Specifically, for the handwritten characters, we use HWDB1.0-1.1 and ICDAR-char, which consist of 3,755 classes. We select the first $m$ classes from HWDB1.0-1.1 as the training set, where $m$ ranges in {500, 1000, 1500, 2000, 2755}. The test set is composed of samples from the last 1000 classes of ICDAR-char. For the scene characters, we use the CTW dataset and choose samples of the first $m$ classes as the training set, where $m$ ranges in {500, 1000, 1500, 2000, 3150}, and choose samples of the last 1000 classes as the test set. In the training phase, we calculate $L_{CE-sim}$ only for the characters that appear in the training set. In the inference phase, the final classification results are obtained by matching the model predictions with the full set of characters comprising both the training and test sets. In addition, methods liu2023towards ; liu2022open ; yu2023ctrclip that use additional glyph support samples during the training or pre-training process are outside the scope of this study.
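The class split for the handwritten setting can be sketched as follows. This is a minimal illustration of the protocol described above, not the authors' data pipeline: the helper name `zero_shot_split` and the placeholder class labels are hypothetical, and classes are assumed to be stored in the fixed order used by the benchmark.

```python
from typing import List, Sequence, Tuple


def zero_shot_split(classes: Sequence[str], m: int,
                    n_test: int = 1000) -> Tuple[List[str], List[str], List[str]]:
    """First m classes for training, last n_test for testing.

    The third return value is the candidate set used at inference time,
    i.e. the training classes plus the unseen test classes.
    """
    if m + n_test > len(classes):
        raise ValueError("training and test classes would overlap")
    train = list(classes[:m])
    test = list(classes[-n_test:])
    candidates = train + test
    return train, test, candidates


# Placeholder labels standing in for the 3,755 handwritten classes.
all_classes = [f"class_{i:04d}" for i in range(3755)]
for m in (500, 1000, 1500, 2000, 2755):
    train, test, candidates = zero_shot_split(all_classes, m)
    # No test class is ever seen during training.
    assert not set(train) & set(test)
    print(m, len(train), len(test), len(candidates))
```

With $m$ = 2755 the training and test classes together cover all 3,755 classes, which is why 2755 is the largest training size in the handwritten setting.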

As shown in Tab. 4, on the handwritten dataset, our method shows significant improvement compared to the SOTA method SideNet LI2024side , with absolute accuracy gains of 1.12%, 4.51%, 1.59%, 1.57% and 5.91% at $m$ in {500, 1000, 1500, 2000, 2755}, respectively. On the scene dataset, HierCode achieves SOTA performance. Furthermore, we observe that even in the full-class evaluation on the handwritten dataset, HierCode still improves the performance by 0.54% compared to the advanced method cao2020hde while maintaining the smallest model size.

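Table 4: Comparison with previous methods under the character zero-shot setting on the handwritten dataset ICDAR-char and the scene dataset CTW. Columns prefixed ‘HW’ and ‘Scene’ report accuracy (%) on the handwritten and scene datasets when training on the first $m$ classes, with $m$ given in the column name.

| Methods | HW 500 | HW 1000 | HW 1500 | HW 2000 | HW 2755 | Scene 500 | Scene 1000 | Scene 1500 | Scene 2000 | Scene 3150 | Full Class Accuracy/% | Model Size |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| DenseRAN wang2018denseran | 1.70 | 8.44 | 14.71 | 19.51 | 30.68 | 0.15 | 0.54 | 1.60 | 1.95 | 5.39 | 96.66 | 287.9MB |
| HDE cao2020hde | 4.90 | 12.77 | 19.25 | 25.13 | 33.49 | 0.82 | 2.11 | 3.11 | 6.96 | 7.75 | 97.14 | - |
| SLD chen2021stroke | 5.60 | 13.85 | 22.88 | 25.73 | 37.91 | 1.54 | 2.54 | 4.32 | 6.82 | 8.61 | 96.73 | 287.4MB |
| CUE LUO2023cue | 7.43 | 15.75 | 24.01 | 27.04 | 40.55 | - | - | - | - | - | 96.96 | - |
| SideNet LI2024side | 5.10 | 16.20 | 33.80 | 44.10 | 50.30 | - | - | - | - | - | - | - |
| HierCode (Ours) | 6.22 | 20.71 | 35.39 | 45.67 | 56.21 | 1.67 | 2.59 | 4.54 | 7.02 | 9.13 | 97.68 | 44.2MB |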

Zero-shot line-level recognition. The ancient text dataset contains numerous rare and OOV characters, which can effectively verify the capability of HierCode in learning character structures and radicals. We conduct experiments on MTHv2 and follow the official training protocol. The results presented in Tab. 5 indicate that our method achieves the best recognition performance with the smallest model size. In addition, benefiting from the hierarchical representation of characters, HierCode has zero-shot recognition ability, i.e., it can recognize OOV characters, which is not possible with most existing SOTA methods shi2016end ; ma2020joint . We follow the zero-shot setting in ZCTRN huang2021zero , in which OOV characters are present in the text line data, allowing us to assess the zero-shot text line recognition capability of our method. Compared with ZCTRN, our method improves AR-zero by 2.65% (54.05% vs. 51.40%). These results underscore the superior zero-shot recognition ability of the proposed HierCode.
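To make the idea of recognizing OOV characters by code matching concrete, here is a toy sketch. It assumes that the recognizer outputs a vector of per-bit scores and that the predicted character is the candidate whose multi-hot code is most similar to that vector. Cosine similarity, the 6-bit codes, and the names `cosine` and `match_character` are all assumptions made for illustration and are not the authors' exact matching rule or code layout.

```python
import math
from typing import Dict, Sequence


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity between two equal-length vectors (0.0 if either is all-zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def match_character(pred: Sequence[float],
                    codebook: Dict[str, Sequence[int]]) -> str:
    """Return the candidate whose code is most similar to the predicted scores."""
    return max(codebook, key=lambda ch: cosine(pred, codebook[ch]))


# Toy 6-bit codes. "oov_C" never appears in training, but its code is built
# from bits that the seen characters also use, so its bits can still be predicted.
codebook = {
    "seen_A": [1, 0, 1, 0, 0, 1],
    "seen_B": [1, 0, 0, 1, 1, 0],
    "oov_C": [0, 1, 1, 0, 1, 0],
}
pred = [0.1, 0.8, 0.9, 0.2, 0.7, 0.1]  # hypothetical per-bit scores
print(match_character(pred, codebook))
```

The point of the sketch is that adding a new character only requires adding its code to the candidate set; no output unit has to be trained for it, in contrast to a one-hot classifier.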

Table 5: Comparison of recognition performance (%) and model size with previous methods on the ancient text dataset MTHv2.

| Methods | AR | CR | AR-zero | Model Size |
|---|---|---|---|---|
| CRNN shi2016end | 96.94 | 97.15 | - | 134.7MB |
| RAN zhang2018radical | 91.56 | 91.79 | 37.22 | 107.6MB |
| Ma et al. ma2020joint | 95.52 | 96.07 | - | - |
| ZCTRN huang2021zero | 97.42 | 97.62 | 51.40 | 129.7MB |
| HierCode (Ours) | 97.87 | 98.05 | 54.05 | 56.8MB |

4.5 Lightweight Characteristics

The proposed HierCode has a significant advantage in model size compared to previous methods, as shown in Tab. 1, Tab. 4 and Tab. 5. This is due to the multi-hot encoding employed in HierCode, which effectively reduces the number of parameters in the classification layer of the recognizer. In this subsection, we conduct experiments on the HWDB1.0-1.1 and ICDAR-char datasets to investigate the compression capability of HierCode on lightweight backbones. Concretely, we integrate our method with two lightweight backbones, i.e., MobileNet v3 large and MobileNet v3 small, to evaluate the feasibility of our approach on mobile devices. The results in the ‘P-Cls’ rows of Tab. 6 show that applying HierCode compresses the parameters of the classification layer by approximately 92.6%. Furthermore, given the same backbone, the model that applies HierCode achieves higher recognition accuracy than its one-hot counterpart. Moreover, applying HierCode to smaller models yields a larger whole-model compression ratio: the parameters of the ResNet-18, MobileNet v3 large, and MobileNet v3 small models are compressed by 13.4%, 49.0% and 68.3%, respectively. This suggests that the proposed HierCode facilitates the development of lightweight Chinese text recognition models.

Table 6: Comparison of HierCode and one-hot encoding in terms of accuracy and parameters on different lightweight backbones. ‘P-Cls’, ‘P-Model’, and ‘$\delta$’ denote the parameters of the classification layer, the parameters of the whole model, and the compression ratio brought by HierCode, respectively.

| Metric | Methods | ResNet-18 | MobileNet-L | MobileNet-S |
|---|---|---|---|---|
| Accuracy | One-hot/% | 97.64 | 95.57 | 92.13 |
| Accuracy | HierCode/% | 97.68 | 95.92 | 92.56 |
| P-Cls | One-hot/MB | 7.35 | 18.35 | 18.35 |
| P-Cls | HierCode/MB | 0.54 | 1.35 | 1.35 |
| P-Cls | $\delta$/% | 92.60 | 92.60 | 92.60 |
| P-Model | One-hot/MB | 50.97 | 34.70 | 24.89 |
| P-Model | HierCode/MB | 44.16 | 17.70 | 7.89 |
| P-Model | $\delta$/% | 13.40 | 49.00 | 68.30 |
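The compression ratios in Tab. 6 can be checked with a few lines of arithmetic. The sketch below assumes the ratio is defined as 1 minus the HierCode size divided by the one-hot size, which is consistent with the reported values but is not stated explicitly; `compression_ratio` and `TABLE6` are illustrative names, and the sizes are copied from the table.

```python
def compression_ratio(one_hot_mb: float, hiercode_mb: float) -> float:
    """Relative size reduction in percent (assumed definition of the ratio)."""
    if one_hot_mb <= 0:
        raise ValueError("one-hot size must be positive")
    return (1.0 - hiercode_mb / one_hot_mb) * 100.0


# (one-hot MB, HierCode MB) pairs copied from Tab. 6.
TABLE6 = {
    "ResNet-18": {"P-Cls": (7.35, 0.54), "P-Model": (50.97, 44.16)},
    "MobileNet-L": {"P-Cls": (18.35, 1.35), "P-Model": (34.70, 17.70)},
    "MobileNet-S": {"P-Cls": (18.35, 1.35), "P-Model": (24.89, 7.89)},
}

for backbone, rows in TABLE6.items():
    for metric, (one_hot, hier) in rows.items():
        ratio = compression_ratio(one_hot, hier)
        print(f"{backbone:<12} {metric:<8} {ratio:6.2f}%")
```

The recomputed ratios agree with the reported values up to rounding of the published sizes. For each backbone the absolute saving in P-Model equals the saving in P-Cls (6.81 MB for ResNet-18 and 17.00 MB for both MobileNet variants), so the whole-model reduction comes entirely from the classification layer, and it weighs more heavily the smaller the backbone is.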

4.6 Ablation Studies

In this subsection, we conduct a series of ablation experiments. As illustrated in Tab. 7, we establish the baseline by setting the structural feature code length $L_{\mathbf{S}}$ to 4, the radical feature code length $L_{\mathbf{R}}$ to 36, and the full binary tree depth $D$ to 5. Since the maximum number of radicals $M$ in our experimental dataset does not exceed 9, $t$ can be calculated as 384 by Eq. 4.

Table 7: Ablation experiments on the design of the radical feature code length $L_{\mathbf{R}}$, the structural feature code length $L_{\mathbf{S}}$, and the full binary tree depth $D$.

| Setting | $L_{\mathbf{R}}$ | $L_{\mathbf{S}}$ | $D$ | AR | CR |
|---|---|---|---|---|---|
| Baseline | 36 | 4 | 5 | 91.98 | 92.33 |
| (a) | 12 | 4 | 5 | 89.53 | 89.95 |
| (b) | 24 | 4 | 5 | 91.22 | 91.57 |
| (c) | 48 | 4 | 5 | 91.95 | 92.30 |
| (d) | 36 | 8 | 5 | 91.83 | 92.24 |
| (e) | 36 | 12 | 5 | 91.91 | 92.31 |
| (f) | 36 | 4 | 6 | 91.99 | 92.36 |
| (g) | 36 | 4 | 7 | 91.96 | 92.31 |

The influence of $L_{\mathbf{R}}$. Comparing the baseline with rows (a), (b), and (c), we observe that a shorter $L_{\mathbf{R}}$ constrains recognition performance. For instance, when $L_{\mathbf{R}} = 12$, AR and CR are only 89.53% and 89.95%, respectively, which are 2.45% and 2.38% lower than the baseline. This is because a small code length not only fails to represent the full spectrum of radicals of Chinese characters, which has thousands of categories ($2^{12} = 4096 < 10k$), but also hampers the discrimination of radical features in prototype learning. As $L_{\mathbf{R}}$ increases, the recognition performance gradually improves and approaches saturation. Ultimately, we choose $L_{\mathbf{R}} = 36$ as our standard setting as a trade-off between better performance and fewer model parameters.

The influence of $L_{\mathbf{S}}$. By comparing the baseline with rows (d) and (e), it can be seen that the recognition performance of $L_{\mathbf{S}} = 4$ is comparable to that of $L_{\mathbf{S}} = 8$ and $L_{\mathbf{S}} = 12$. Since the number of character structures is limited and fixed at 10, increasing $L_{\mathbf{S}}$ does not result in a notable improvement in recognition performance but brings additional model parameters. Therefore, we adopt $L_{\mathbf{S}} = 4$ as our standard setting.

The influence of $D$. The depth of the full binary tree is closely related to the hierarchical radical decomposition of the characters. When $D$ is reduced to 4, we observe that numerous characters with many strokes and complex structures cannot be completely decomposed, resulting in a sharp increase in the number of radicals and making it impossible to derive meaningful results. On the other hand, setting $D$ to at least 5 enables the decomposition of most characters. However, as demonstrated by experiments (f) and (g), increasing $D$ beyond 5 brings no significant improvement in performance. We therefore select $D = 5$ as the standard experimental setting.

The influence of HierCode’s generation method. As described in Sec. 3.2, based on the hierarchical decomposition of Chinese characters, the proposed HierCode consists of structural features and radical features. To conduct experiment (b) in Tab. 8, we maintain the overall composition of HierCode and replace the radical codes with randomly generated codes of equal length. Comparing settings (a) and (b) in Tab. 8, applying random radical codes slightly decreases the accuracy on ICDAR-char and the AR and CR on ICDAR-line by 0.22%, 0.36% and 0.37%, respectively, which indicates that the pre-trained radical prototypes can better express the unique and robust features of radicals. Notably, in cases where obtaining radical prototypes is infeasible, e.g., in the zero-shot setting, employing randomly generated radical codes is a viable alternative. Furthermore, to conduct experiment (c), we randomly initialize a unique code for each character, whose length is equal to that of HierCode. Note that the codes in (c) no longer carry the hierarchical features of Chinese characters, which makes them equivalent to ordinary multi-hot coding. From the results of (c) in Tab. 8, it can be seen that randomizing the entire code significantly reduces the accuracy on ICDAR-char and the AR and CR on ICDAR-line by 3.09%, 1.26% and 1.31%, respectively. In summary, the generation method of HierCode is meaningful, and the hierarchical information of Chinese characters in particular is crucial to improving performance.
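The random codes used in ablations (b) and (c) can be produced in many ways; the sketch below shows one simple option, drawing each bit uniformly and rejecting duplicates so that every item receives a unique code. The helper name `random_unique_codes`, the placeholder item names, the seeds, and the uniform sampling scheme are assumptions for illustration only. The length 36 for radical codes matches the baseline $L_{\mathbf{R}}$, while the length passed for whole-character codes is a hypothetical placeholder.

```python
import random
from typing import Dict, Sequence, Tuple


def random_unique_codes(names: Sequence[str], length: int,
                        seed: int = 0) -> Dict[str, Tuple[int, ...]]:
    """Assign each name a distinct random binary code of the given length."""
    if len(names) > 2 ** length:
        raise ValueError("code length too short to give every item a unique code")
    rng = random.Random(seed)  # fixed seed so the codebook is reproducible
    used = set()
    codes = {}
    for name in names:
        # Redraw until the code has not been assigned to an earlier item.
        while True:
            code = tuple(rng.randint(0, 1) for _ in range(length))
            if code not in used:
                break
        used.add(code)
        codes[name] = code
    return codes


# Setting (b): random codes for radicals, same length as the baseline radical code.
radical_codes = random_unique_codes([f"radical_{i}" for i in range(500)], length=36)

# Setting (c): one random code per character, with no shared structure between
# characters. The length 64 is a placeholder, not the real HierCode length.
char_codes = random_unique_codes([f"char_{i}" for i in range(3755)], length=64, seed=1)
print(len(radical_codes), len(char_codes))
```

In setting (b) two characters that share a radical still share the corresponding random radical code, so the hierarchical composition is preserved; in setting (c) nothing is shared, which is consistent with the larger accuracy drop reported for (c).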

Table 8: Ablation studies on the generation method of HierCode.

| Settings | ICDAR-char Accuracy/% | ICDAR-line AR/% | ICDAR-line CR/% |
|---|---|---|---|
| (a) HierCode (Baseline) | 97.43 | 91.98 | 92.33 |
| (b) Random radical code | 97.21 | 91.62 | 91.96 |
| (c) Random entire code | 94.34 | 90.72 | 91.02 |

4.7 Visualization

In this subsection, we present visualizations of various datasets to further analyze the strengths and weaknesses of the proposed HierCode. HierCode outperforms one-hot encoding in recognizing some similar characters. For instance, the two colored characters in the first subfigure in the ‘Strengths’ part of Fig. 5 have similar appearances and the same radicals, but different structures. Since their structural codes in HierCode are significantly different, our method can correctly recognize them, while the one-hot encoding method fails to do so.

Figure 5: Visualization analysis of the strengths and weaknesses of HierCode. Correctly and incorrectly recognized characters are marked in green and red, respectively.

4.8 Limitation & Discussion

HierCode suffers from some limitations. For characters without structures, such as the single-radical Chinese characters, Arabic numerals, and Latin characters, HierCode degenerates into ordinary multi-hot encoding, resulting in the inability to distinguish these characters when they exhibit similar visual appearances. Additionally, in cases where characters share similar radicals, HierCode may not effectively differentiate them. This issue represents the main limitation of radical-based methods and constitutes the primary focus of our future research endeavors.

5 Conclusion and Future Work

This paper presents HierCode, an innovative lightweight hierarchical codebook designed for zero-shot Chinese text recognition. Through hierarchical encoding and prototype learning, HierCode assigns unique and informative representations to each Chinese character, capturing its inherent structural and radical features. Our method not only addresses the challenge of zero-shot character recognition, enabling accurate identification of characters unseen during training, but also proves effective in line-level recognition tasks. HierCode’s use of multi-hot encoding significantly reduces the number of parameters required, resulting in a model that is both compact and capable of fast inference. Extensive experiments across various datasets, including handwritten, scene, document, web, and ancient texts, demonstrate HierCode’s superior performance over traditional one-hot encoding methods and many state-of-the-art approaches. Future work will focus on improving the ability to recognize text with complex backgrounds and exploring more effective ways of generating radical feature codes to distinguish similar radicals.

References

  • (1) B. Shi, X. Bai, C. Yao, An end-to-end trainable neural network for image-based sequence recognition and its application to scene text recognition, IEEE Transactions on Pattern Analysis and Machine Intelligence 39 (11) (2016) 2298–2304.
  • (2) Z. Xie, Z. Sun, L. Jin, H. Ni, T. Lyons, Learning spatial-semantic context with fully convolutional recurrent network for online handwritten Chinese text recognition, IEEE Transactions on Pattern Analysis and Machine Intelligence 40 (8) (2018) 1903–1917.
  • (3) Y. Liu, L. Jin, S. Zhang, C. Luo, S. Zhang, Curved scene text detection via transverse and longitudinal sequence connection, Pattern Recognition 90 (2019) 337–345.
  • (4) J. Chen, B. Li, X. Xue, Scene text telescope: Text-focused scene image super-resolution, in: IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2021, pp. 12026–12035.
  • (5) D. Peng, L. Jin, Y. Liu, C. Luo, S. Lai, PageNet: Towards end-to-end weakly supervised page-level handwritten Chinese text recognition, International Journal of Computer Vision 130 (11) (2022) 2623–2645.
  • (6) Z. Wang, H. Xie, Y. Wang, J. Xu, B. Zhang, Y. Zhang, Symmetrical linguistic feature distillation with CLIP for scene text recognition, in: ACM International Conference on Multimedia (MM), 2023, pp. 509–518.
  • (7) M. Yang, B. Yang, M. Liao, Y. Zhu, X. Bai, Class-aware mask-guided feature refinement for scene text recognition, Pattern Recognition 149 (2024) 110244.
  • (8) Y. Huang, L. Jin, D. Peng, Zero-shot Chinese text recognition via matching class embedding, in: International Conference on Document Analysis and Recognition (ICDAR), 2021, pp. 127–141.
  • (9) D. Peng, L. Jin, W. Ma, C. Xie, H. Zhang, S. Zhu, J. Li, Recognition of handwritten Chinese text by segmentation: A segment-annotation-free approach, IEEE Transactions on Multimedia 25 (2022) 2368–2381.
  • (10) P. Lyu, C. Zhang, S. Liu, M. Qiao, Y. Xu, L. Wu, K. Yao, J. Han, E. Ding, J. Wang, MaskOCR: Text recognition with masked encoder-decoder pretraining (2022). arXiv:2206.00311.
  • (11) K. Wong, R. Poon, A comment on the entropy of the Chinese language, IEEE Transactions on Acoustics, Speech, and Signal Processing 24 (6) (1976) 583–585.
  • (12) G. Huang, X. Luo, S. Wang, T. Gu, K. Su, Hippocampus-heuristic character recognition network for zero-shot learning in Chinese character recognition, Pattern Recognition 130 (2022) 108818.
  • (13) X. Ao, X.-Y. Zhang, C.-L. Liu, Cross-modal prototype learning for zero-shot handwritten character recognition, Pattern Recognition 131 (2022) 108859.
  • (14) C.-L. Liu, F. Yin, D.-H. Wang, Q.-F. Wang, Online and offline handwritten Chinese character recognition: Benchmarking on new databases, Pattern Recognition 46 (1) (2013) 155–162.
  • (15) J. Zhang, Y. Zhu, J. Du, L. Dai, Radical analysis network for zero-shot learning in printed Chinese character recognition, in: IEEE International Conference on Multimedia and Expo (ICME), IEEE, 2018, pp. 1–6.
  • (16) T. Wang, Z. Xie, Z. Li, L. Jin, X. Chen, Radical aggregation network for few-shot offline handwritten Chinese character recognition, Pattern Recognition Letters 125 (2019) 821–827.
  • (17) J. Zhang, J. Du, L. Dai, Radical analysis network for learning hierarchies of Chinese characters, Pattern Recognition 103 (2020) 107305.
  • (18) Z. Cao, J. Lu, S. Cui, C. Zhang, Zero-shot handwritten Chinese character recognition with hierarchical decomposition embedding, Pattern Recognition 107 (2020) 107488.
  • (19) G.-F. Luo, D.-H. Wang, X. Du, H.-Y. Yin, X.-Y. Zhang, S. Zhu, Self-information of radicals: A new clue for zero-shot Chinese character recognition, Pattern Recognition 140 (2023) 109598.
  • (20) Z. Li, Y. Huang, D. Peng, M. He, L. Jin, SideNet: Learning representations from interactive side information for zero-shot Chinese character recognition, Pattern Recognition 148 (2024) 110208.
  • (21) Y.-M. Su, J.-F. Wang, A novel stroke extraction method for Chinese characters using gabor filters, Pattern Recognition 36 (3) (2003) 635–647.
  • (22) C.-L. Liu, I.-J. Kim, J. H. Kim, Model-based stroke extraction and matching for handwritten Chinese character recognition, Pattern Recognition 34 (12) (2001) 2339–2352.
  • (23) J. Chen, B. Li, X. Xue, Zero-shot Chinese character recognition with stroke-level decomposition, arXiv preprint arXiv:2106.11613 (2021).
  • (24) Y. Du, C. Li, R. Guo, X. Yin, W. Liu, J. Zhou, Y. Bai, Z. Yu, Y. Yang, Q. Dang, et al., PP-OCR: A practical ultra lightweight OCR system, arXiv preprint arXiv:2009.09941 (2020).
  • (25) B. Li, X. Tang, X. Qi, Y. Chen, R. Xiao, Hamming OCR: A locality sensitive hashing neural network for scene text recognition, arXiv preprint arXiv:2009.10874 (2020).
  • (26) B. Li, X. Tang, X. Qi, Y. Chen, C.-G. Li, R. Xiao, Effective multi-hot encoding and classifier for lightweight scene text recognition with a large character set, IEEE Transactions on Circuits and Systems for Video Technology (2022) 1–1.
  • (27) L.-W. Jin, J.-X. Yin, X. Gao, J.-C. Huang, Study of several directional feature extraction methods with local elastic meshing technology for HCCR, in: International Conference for Young Computer Scientist, 2001, pp. 232–236.
  • (28) Y.-M. Su, J.-F. Wang, A novel stroke extraction method for Chinese characters using gabor filters, Pattern Recognition 36 (3) (2003) 635–647.
  • (29) F. Chang, Techniques for solving the large-scale classification problem in Chinese handwriting recognition, in: Arabic and Chinese Handwriting Recognition, 2006, pp. 161–169.
  • (30) D. Cireşan, U. Meier, Multi-column deep neural networks for offline handwritten Chinese character classification, in: International Joint Conference on Neural Networks (IJCNN), 2015, pp. 1–6.
  • (31) X. Xiao, L. Jin, Y. Yang, W. Yang, J. Sun, T. Chang, Building fast and compact convolutional neural networks for offline handwritten Chinese character recognition, Pattern Recognition 72 (2017) 72–81.
  • (32) Z. Li, Q. Wu, Y. Xiao, M. Jin, H. Lu, Deep matching network for handwritten Chinese character recognition, Pattern Recognition 107 (2020) 107471.
  • (33) T.-Q. Wang, F. Yin, C.-L. Liu, Radical-based Chinese character recognition via multi-labeled learning of deep residual networks, in: International Conference on Document Analysis and Recognition (ICDAR), 2017, pp. 579–584.
  • (34) W. Wang, J. Zhang, J. Du, Z.-R. Wang, Y. Zhu, DenseRAN for offline handwritten Chinese character recognition, in: International Conference on Frontiers in Handwriting Recognition (ICFHR), IEEE, 2018, pp. 104–109.
  • (35) T. Wang, D. J. Wu, A. Coates, A. Y. Ng, End-to-end text recognition with convolutional neural networks, in: International Conference on Pattern Recognition (ICPR), 2012, pp. 3304–3308.
  • (36) A. Bissacco, M. Cummins, Y. Netzer, H. Neven, PhotoOCR: Reading text in uncontrolled conditions, in: IEEE International Conference on Computer Vision (ICCV), 2013, pp. 785–792.
  • (37) M. Jaderberg, A. Vedaldi, A. Zisserman, Deep features for text spotting, in: European Conference on Computer Vision (ECCV), 2014, pp. 512–528.
  • (38) W. Hu, X. Cai, J. Hou, S. Yi, Z. Lin, GTC: Guided training of CTC towards efficient and accurate scene text recognition, in: AAAI Conference on Artificial Intelligence (AAAI), no. 07, 2020, pp. 11005–11012.
  • (39) M. Yousef, K. F. Hussain, U. S. Mohammed, Accurate, data-efficient, unconstrained text recognition with convolutional neural networks, Pattern Recognition 108 (2020) 107482.
  • (40) Z. Chen, F. Yin, X.-Y. Zhang, Q. Yang, C.-L. Liu, MuLTReNets: Multilingual text recognition networks for simultaneous script identification and handwriting recognition, Pattern Recognition 108 (2020) 107555.
  • (41) B. Liu, W. Sun, W. Kang, X. Xu, Searching from the prediction of visual and language model for handwritten Chinese text recognition, in: International Conference on Document Analysis and Recognition (ICDAR), 2021, pp. 274–288.
  • (42) Z.-R. Wang, J. Du, J.-M. Wang, Writer-aware CNN for parsimonious HMM-based offline handwritten Chinese text recognition, Pattern Recognition 100 (2020) 107102.
  • (43) B. Shi, M. Yang, X. Wang, P. Lyu, C. Yao, X. Bai, ASTER: An attentional scene text recognizer with flexible rectification, IEEE Transactions on Pattern Analysis and Machine Intelligence 41 (9) (2018) 2035–2048.
  • (44) C. Luo, L. Jin, Z. Sun, MORAN: A multi-object rectified attention network for scene text recognition, Pattern Recognition 90 (2019) 109–118.
  • (45) T. Wang, Y. Zhu, L. Jin, C. Luo, X. Chen, Y. Wu, Q. Wang, M. Cai, Decoupled attention network for text recognition, in: AAAI Conference on Artificial Intelligence (AAAI), no. 07, 2020, pp. 12216–12224.
  • (46) Q. Lin, C. Luo, L. Jin, S. Lai, STAN: A sequential transformation attention-based network for scene text recognition, Pattern Recognition 111 (2021) 107692.
  • (47) S. Fang, H. Xie, Y. Wang, Z. Mao, Y. Zhang, Read like humans: Autonomous, bidirectional and iterative language modeling for scene text recognition, in: IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2021, pp. 7098–7107.
  • (48) Y. Wang, H. Xie, S. Fang, M. Xing, J. Wang, S. Zhu, Y. Zhang, PETR: Rethinking the capability of transformer-based language model in scene text recognition, IEEE Transactions on Image Processing 31 (2022) 5585–5598.
  • (49) A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, I. Polosukhin, Attention is all you need, Advances in Neural Information Processing Systems 30 (2017).
  • (50) H. Yu, X. Wang, B. Li, X. Xue, Chinese text recognition with a pre-trained CLIP-like model through Image-IDS aligning, in: IEEE International Conference on Computer Vision (ICCV), 2023, pp. 11943–11952.
  • (51) A. Graves, M. Liwicki, S. Fernández, R. Bertolami, H. Bunke, J. Schmidhuber, A novel connectionist system for unconstrained handwriting recognition, IEEE Transactions on Pattern Analysis and Machine Intelligence 31 (5) (2008) 855–868.
  • (52) F. Yin, Q.-F. Wang, X.-Y. Zhang, C.-L. Liu, ICDAR 2013 Chinese handwriting recognition competition, in: International Conference on Document Analysis and Recognition (ICDAR), 2013, pp. 1464–1470.
  • (53) C.-L. Liu, F. Yin, D.-H. Wang, Q.-F. Wang, CASIA online and offline Chinese handwriting databases, in: International Conference on Document Analysis and Recognition (ICDAR), 2011, pp. 37–41.
  • (54) H. Yu, J. Chen, B. Li, J. Ma, M. Guan, X. Xu, X. Wang, S. Qu, X. Xue, Benchmarking Chinese text recognition: Datasets, baselines, and an empirical study, arXiv preprint arXiv:2112.15093 (2021).
  • (55) W. Ma, H. Zhang, L. Jin, S. Wu, J. Wang, Y. Wang, Joint layout analysis, character detection and recognition for historical document digitization, in: International Conference on Frontiers in Handwriting Recognition (ICFHR), 2020, pp. 31–36.
  • (56) T.-L. Yuan, Z. Zhu, K. Xu, C.-J. Li, T.-J. Mu, S.-M. Hu, A large Chinese text dataset in the wild, Journal of Computer Science and Technology 34 (2019) 509–521.
  • (57) K. He, X. Zhang, S. Ren, J. Sun, Deep residual learning for image recognition, in: IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2016, pp. 770–778.
  • (58) S. Hochreiter, J. Schmidhuber, Long short-term memory, Neural Computation 9 (8) (1997) 1735–1780.
  • (59) R. Messina, J. Louradour, Segmentation-free handwritten Chinese text recognition with LSTM-RNN, in: International Conference on Document Analysis and Recognition (ICDAR), 2015, pp. 171–175.
  • (60) Y.-C. Wu, F. Yin, Z. Chen, C.-L. Liu, Handwritten Chinese text recognition using separable multi-dimensional recurrent neural network, in: International Conference on Document Analysis and Recognition (ICDAR), 2017, pp. 79–84.
  • (61) J. Du, Z.-R. Wang, J.-F. Zhai, J.-S. Hu, Deep neural network based hidden Markov model for offline handwritten Chinese text recognition, in: International Conference on Pattern Recognition (ICPR), 2016, pp. 3428–3433.
  • (62) S. Wang, L. Chen, L. Xu, W. Fan, J. Sun, S. Naoi, Deep knowledge training and heterogeneous CNN for handwritten Chinese text recognition, in: International Conference on Frontiers in Handwriting Recognition (ICFHR), 2016, pp. 84–89.
  • (63) Z.-R. Wang, J. Du, W.-C. Wang, J.-F. Zhai, J.-S. Hu, A comprehensive study of hybrid neural network hidden Markov model for offline handwritten Chinese text recognition, International Journal on Document Analysis and Recognition (IJDAR) (2018) 241–251.
  • (64) Z. Xie, Y. Huang, Y. Zhu, L. Jin, Y. Liu, L. Xie, Aggregation cross-entropy for sequence recognition, in: IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2019, pp. 6538–6547.
  • (65) D. Peng, L. Jin, Y. Wu, Z. Wang, M. Cai, A fast and accurate fully convolutional network for end-to-end handwritten Chinese text segmentation and recognition, in: International Conference on Document Analysis and Recognition (ICDAR), 2019, pp. 25–30.
  • (66) Y. Xiu, Q. Wang, H. Zhan, M. Lan, Y. Lu, A handwritten Chinese text recognizer applying multi-level multimodal fusion network, in: International Conference on Document Analysis and Recognition (ICDAR), 2019, pp. 1464–1469.
  • (67) C. Xie, S. Lai, Q. Liao, L. Jin, High performance offline handwritten Chinese text recognition with a new data preprocessing and augmentation pipeline, in: Document Analysis Systems (DAS), 2020, pp. 45–59.
  • (68) Z.-X. Wang, Q.-F. Wang, F. Yin, C.-L. Liu, Weakly supervised learning for over-segmentation based handwritten Chinese text recognition, in: International Conference on Frontiers in Handwriting Recognition (ICFHR), 2020, pp. 157–162.
  • (69) Z.-Y. Zhu, F. Yin, D.-H. Wang, Attention combination of sequence models for handwritten Chinese text recognition, in: International Conference on Frontiers in Handwriting Recognition (ICFHR), 2020, pp. 288–294.
  • (70) H. Li, P. Wang, C. Shen, G. Zhang, Show, attend and read: A simple and strong baseline for irregular text recognition, in: AAAI Conference on Artificial Intelligence (AAAI), no. 01, 2019, pp. 8610–8617.
  • (71) D. Yu, X. Li, C. Zhang, T. Liu, J. Han, J. Liu, E. Ding, Towards accurate scene text recognition with semantic reasoning networks, in: IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2020, pp. 12113–12122.
  • (72) Z. Qiao, Y. Zhou, D. Yang, Y. Zhou, W. Wang, SEED: Semantics enhanced encoder-decoder framework for scene text recognition, in: IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2020, pp. 13525–13534.
  • (73) C. Liu, C. Yang, H.-B. Qin, X. Zhu, C.-L. Liu, X.-C. Yin, Towards open-set text recognition via label-to-prototype learning, Pattern Recognition 134 (2023) 109109.
  • (74) C. Liu, C. Yang, X.-C. Yin, Open-set text recognition via character-context decoupling, in: IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2022, pp. 4523–4532.