TY - GEN
T1 - An impact of OCR errors on automated classification of OCR Japanese texts with parts-of-speech analysis
AU - Kokawa, Akihiro
AU - Busagala, Lazaro S.P.
AU - Ohyama, Wataru
AU - Wakabayashi, Tetsushi
AU - Kimura, Fumitaka
N1 - Copyright 2011 Elsevier B.V., All rights reserved.
PY - 2011
Y1 - 2011
AB - Optical Character Recognition (OCR) is used to generate text when digitizing printed documents. These texts usually need to be indexed and organized to simplify access and retrieval. Automated text classification is a powerful approach to this task. However, current OCR technology cannot recognize all characters with 100% accuracy. We therefore propose combining linguistic features to form an informative feature set for the automated classification of OCR texts. The proposed method was evaluated experimentally on Japanese OCR texts. Empirical results indicate that combining linguistic features improved classification performance on OCR texts.
UR - http://www.scopus.com/inward/record.url?scp=82355160795&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=82355160795&partnerID=8YFLogxK
U2 - 10.1109/ICDAR.2011.115
DO - 10.1109/ICDAR.2011.115
M3 - Conference contribution
AN - SCOPUS:82355160795
SN - 9780769545202
T3 - Proceedings of the International Conference on Document Analysis and Recognition, ICDAR
SP - 543
EP - 547
BT - Proceedings - 11th International Conference on Document Analysis and Recognition, ICDAR 2011
T2 - 11th International Conference on Document Analysis and Recognition, ICDAR 2011
Y2 - 18 September 2011 through 21 September 2011
ER -