An impact of OCR errors on automated classification of OCR Japanese texts with parts-of-speech analysis

Akihiro Kokawa, Lazaro S.P. Busagala, Wataru Oyama, Tetsushi Wakabayashi, Fumitaka Kimura

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

1 Citation (Scopus)

Abstract

Optical Character Recognition (OCR) is used to generate text when print documents are digitized. These texts usually need to be indexed and organized to simplify access and retrieval, and automated text classification is a powerful approach to this task. However, current OCR technology cannot recognize all characters with 100% accuracy. We therefore propose combining linguistic features to form an informative feature set for the automated classification of OCR texts. The proposed method was evaluated experimentally on Japanese OCR texts. Empirical results indicate that combining linguistic features improved classification performance on OCR texts.
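
The idea of combining linguistic features into one feature set can be illustrated with a minimal sketch. It is not taken from the paper: this record does not describe the actual feature extraction or classifier, so the sample tokens, the tag names, and the functions `pos_features` and `combined_features` are all hypothetical. The sketch assumes each token already carries a part-of-speech tag (in practice a Japanese morphological analyzer would supply these) and simply merges term counts from several part-of-speech groups.

```python
from collections import Counter

# Hypothetical, hand-tagged tokens: (surface form, part-of-speech tag).
# A real pipeline would obtain these from a morphological analyzer
# run over the OCR output.
DOC = [("文字", "noun"), ("認識", "noun"), ("する", "verb"),
       ("新しい", "adjective"), ("文字", "noun")]


def pos_features(tagged_tokens, pos_tags):
    """Count terms whose tag is in pos_tags, keyed as 'tag:term'."""
    counts = Counter()
    for term, tag in tagged_tokens:
        if tag in pos_tags:
            counts[f"{tag}:{term}"] += 1
    return counts


def combined_features(tagged_tokens, groups):
    """Merge the per-group term counts into a single feature set."""
    merged = Counter()
    for pos_tags in groups:
        merged.update(pos_features(tagged_tokens, pos_tags))
    return merged


# Features from one part of speech versus a combination of two.
nouns_only = pos_features(DOC, {"noun"})
nouns_and_verbs = combined_features(DOC, [{"noun"}, {"verb"}])
print(dict(nouns_only))
print(dict(nouns_and_verbs))
```

Prefixing each key with its tag keeps terms from different parts of speech distinct, so the merged counts can be passed to any bag-of-features classifier.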

Original language: English
Title of host publication: Proceedings - 11th International Conference on Document Analysis and Recognition, ICDAR 2011
Pages: 543-547
Number of pages: 5
DOIs: https://doi.org/10.1109/ICDAR.2011.115
Publication status: Published - Dec 2 2011
Event: 11th International Conference on Document Analysis and Recognition, ICDAR 2011 - Beijing, China
Duration: Sep 18 2011 → Sep 21 2011

Publication series

Name: Proceedings of the International Conference on Document Analysis and Recognition, ICDAR
ISSN (Print): 1520-5363

Other

Other: 11th International Conference on Document Analysis and Recognition, ICDAR 2011
Country: China
City: Beijing
Period: 9/18/11 → 9/21/11

Fingerprint

Optical character recognition
Speech analysis
Linguistics

All Science Journal Classification (ASJC) codes

  • Computer Vision and Pattern Recognition

Cite this

Kokawa, A., Busagala, L. S. P., Oyama, W., Wakabayashi, T., & Kimura, F. (2011). An impact of OCR errors on automated classification of OCR Japanese texts with parts-of-speech analysis. In Proceedings - 11th International Conference on Document Analysis and Recognition, ICDAR 2011 (pp. 543-547). [6065370] (Proceedings of the International Conference on Document Analysis and Recognition, ICDAR). https://doi.org/10.1109/ICDAR.2011.115

@inproceedings{2eb2fe89b12e44f2aab934674fc0496a,
title = "An impact of OCR errors on automated classification of OCR Japanese texts with parts-of-speech analysis",
abstract = "The technology of Optical Character Recognition (OCR) is used to generate texts in the process of digitizing print documents. Usually these texts need to be indexed and organized to simplify their access and retrieval. One of the powerful approaches in accomplishing this task is the use of Automated Text Classification. However, it is currently impossible for OCR technology to recognize all characters with an accuracy of 100{\%}. We therefore propose the use of combined linguistic features in automated classification of OCR texts to formulate an informative feature set. The proposed method was experimentally evaluated using Japanese OCR texts. Empirical results indicate that the combination of linguistic features improved classification performance of OCR texts.",
author = "Akihiro Kokawa and Busagala, {Lazaro S.P.} and Wataru Oyama and Tetsushi Wakabayashi and Fumitaka Kimura",
year = "2011",
month = "12",
day = "2",
doi = "10.1109/ICDAR.2011.115",
language = "English",
isbn = "9780769545202",
series = "Proceedings of the International Conference on Document Analysis and Recognition, ICDAR",
pages = "543--547",
booktitle = "Proceedings - 11th International Conference on Document Analysis and Recognition, ICDAR 2011",

}

TY - GEN

T1 - An impact of OCR errors on automated classification of OCR Japanese texts with parts-of-speech analysis

AU - Kokawa, Akihiro

AU - Busagala, Lazaro S.P.

AU - Oyama, Wataru

AU - Wakabayashi, Tetsushi

AU - Kimura, Fumitaka

PY - 2011/12/2

Y1 - 2011/12/2

N2 - The technology of Optical Character Recognition (OCR) is used to generate texts in the process of digitizing print documents. Usually these texts need to be indexed and organized to simplify their access and retrieval. One of the powerful approaches in accomplishing this task is the use of Automated Text Classification. However, it is currently impossible for OCR technology to recognize all characters with an accuracy of 100%. We therefore propose the use of combined linguistic features in automated classification of OCR texts to formulate an informative feature set. The proposed method was experimentally evaluated using Japanese OCR texts. Empirical results indicate that the combination of linguistic features improved classification performance of OCR texts.

UR - http://www.scopus.com/inward/record.url?scp=82355160795&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=82355160795&partnerID=8YFLogxK

U2 - 10.1109/ICDAR.2011.115

DO - 10.1109/ICDAR.2011.115

M3 - Conference contribution

AN - SCOPUS:82355160795

SN - 9780769545202

T3 - Proceedings of the International Conference on Document Analysis and Recognition, ICDAR

SP - 543

EP - 547

BT - Proceedings - 11th International Conference on Document Analysis and Recognition, ICDAR 2011

ER -