Sequence-to-Label Script Identification for Multilingual OCR

Fujii, Yasuhisa; Driesen, Karel; Baccash, Jonathan; Hurst, Ash; Popat, Ashok C.

arXiv:1708.04671 (cs)

[Submitted on 15 Aug 2017 (v1), last revised 17 Aug 2017 (this version, v2)]

Abstract: We describe a novel line-level script identification method. Previous work repurposed an OCR model to generate per-character script codes, which were then counted to obtain a line-level script identification. This has two shortcomings. First, as a sequence-to-sequence model, it is more complex than necessary for the sequence-to-label problem of line script identification, which makes it harder to train and inefficient to run. Second, the counting heuristic may be suboptimal compared to a learned model. We therefore reframe line script identification as a sequence-to-label problem and solve it using two components, trained end-to-end: an encoder and a summarizer. The encoder converts a line image into a feature sequence. The summarizer aggregates the sequence to classify the line. We test various summarizers with identical inception-style convolutional networks as encoders. Experiments on scanned books and photos containing 232 languages in 30 scripts show a 16% reduction in script identification error rate compared to the baseline. This improved script identification reduces the character error rate attributable to script misidentification by 33%.
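The contrast between the counting heuristic and a sequence-to-label summarizer can be sketched in a few lines. This is only an illustration under stated assumptions, not the authors' implementation: the function names (`count_vote`, `mean_pool_summarizer`), the mean-pooling aggregation, and the linear classifier are all hypothetical choices, and the encoder is replaced by a hand-written feature sequence. The abstract says only that various summarizers were tested, without naming them.

```python
import math
from collections import Counter
from typing import List, Sequence


def count_vote(per_char_scripts: Sequence[str]) -> str:
    """Counting heuristic: the most frequent per-character script code wins."""
    if not per_char_scripts:
        raise ValueError("need at least one per-character script code")
    return Counter(per_char_scripts).most_common(1)[0][0]


def softmax(scores: Sequence[float]) -> List[float]:
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def mean_pool_summarizer(
    features: Sequence[Sequence[float]],  # T x D feature sequence from an encoder
    weights: Sequence[Sequence[float]],   # K x D classifier weights (assumed learned)
    bias: Sequence[float],                # K classifier biases (assumed learned)
    labels: Sequence[str],                # K script labels
) -> str:
    """Hypothetical summarizer: average over time, then classify the line once."""
    steps = len(features)
    dim = len(features[0])
    # Collapse the variable-length sequence into one fixed-size vector.
    pooled = [sum(frame[j] for frame in features) / steps for j in range(dim)]
    # One linear layer produces a score per script for the whole line.
    logits = [
        sum(row[j] * pooled[j] for j in range(dim)) + b
        for row, b in zip(weights, bias)
    ]
    probs = softmax(logits)
    best = max(range(len(labels)), key=probs.__getitem__)
    return labels[best]
```

In the first function the line label is a fixed rule applied to per-character outputs; in the second, the aggregation feeds a classifier whose parameters could be trained jointly with the encoder, which is the property the abstract attributes to the encoder and summarizer pair.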

Comments:	ICDAR2017, The 14th IAPR International Conference on Document Analysis and Recognition, Kyoto, Japan
Subjects:	Computer Vision and Pattern Recognition (cs.CV)
MSC classes:	68T45
ACM classes:	I.7.5
Cite as:	arXiv:1708.04671 [cs.CV]
	(or arXiv:1708.04671v2 [cs.CV] for this version)
	https://doi.org/10.48550/arXiv.1708.04671

Submission history

From: Karel Driesen
[v1] Tue, 15 Aug 2017 20:14:51 UTC (570 KB)
[v2] Thu, 17 Aug 2017 20:20:25 UTC (570 KB)
