An impact of OCR errors on automated classification of OCR Japanese texts with parts-of-speech analysis

Akihiro Kokawa, Lazaro S.P. Busagala, Wataru Oyama, Tetsushi Wakabayashi, Fumitaka Kimura

Research output: Chapter in Book/Report/Conference proceeding > Conference contribution

1 Citation (Scopus)

Abstract

Optical Character Recognition (OCR) is used to generate text when print documents are digitized. These texts usually need to be indexed and organized to simplify access and retrieval. One powerful approach to this task is automated text classification. However, current OCR technology cannot recognize all characters with 100% accuracy. We therefore propose combining linguistic features to form an informative feature set for the automated classification of OCR texts. The proposed method was evaluated experimentally on Japanese OCR texts. The empirical results indicate that combining linguistic features improved classification performance on OCR texts.

Original language: English
Title of host publication: Proceedings - 11th International Conference on Document Analysis and Recognition, ICDAR 2011
Pages: 543-547
Number of pages: 5
Publication status: Published - Dec 2 2011
Event: 11th International Conference on Document Analysis and Recognition, ICDAR 2011 - Beijing, China
Duration: Sep 18 2011 to Sep 21 2011

Publication series

Name: Proceedings of the International Conference on Document Analysis and Recognition, ICDAR
ISSN (Print): 1520-5363

Other

Other: 11th International Conference on Document Analysis and Recognition, ICDAR 2011
Country: China
City: Beijing
Period: 9/18/11 to 9/21/11

All Science Journal Classification (ASJC) codes

  • Computer Vision and Pattern Recognition
