TY - GEN
T1 - The impact of OCR accuracy and feature transformation on automatic text classification
AU - Murata, Mayo
AU - Busagala, Lazaro S.P.
AU - Oyama, Wataru
AU - Wakabayashi, Tetsushi
AU - Kimura, Fumitaka
PY - 2006/7/7
Y1 - 2006/7/7
N2 - The digitization of printed documents involves generating text with an OCR system for applications such as full-text retrieval and document organization. However, with present OCR technology, OCR-generated texts contain errors. Moreover, previous studies have shown that classification performance decreases as OCR accuracy decreases. The reason for this is the use of absolute word frequency as the feature vector. Representing OCR texts by absolute word frequency has limitations, such as dependency on text length and word recognition rate, which lead to lower classification performance because of higher within-class variances. We describe feature transformation techniques that do not have these limitations and present improved experimental results for all classifiers used.
AB - The digitization of printed documents involves generating text with an OCR system for applications such as full-text retrieval and document organization. However, with present OCR technology, OCR-generated texts contain errors. Moreover, previous studies have shown that classification performance decreases as OCR accuracy decreases. The reason for this is the use of absolute word frequency as the feature vector. Representing OCR texts by absolute word frequency has limitations, such as dependency on text length and word recognition rate, which lead to lower classification performance because of higher within-class variances. We describe feature transformation techniques that do not have these limitations and present improved experimental results for all classifiers used.
UR - http://www.scopus.com/inward/record.url?scp=33745557026&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=33745557026&partnerID=8YFLogxK
U2 - 10.1007/11669487_45
DO - 10.1007/11669487_45
M3 - Conference contribution
AN - SCOPUS:33745557026
SN - 3540321403
SN - 9783540321408
T3 - Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
SP - 506
EP - 517
BT - Document Analysis Systems VII - 7th International Workshop, DAS 2006, Proceedings
T2 - 7th International Workshop on Document Analysis Systems, DAS 2006
Y2 - 13 February 2006 through 15 February 2006
ER -