The impact of OCR accuracy and feature transformation on automatic text classification

Mayo Murata, Lazaro S.P. Busagala, Wataru Oyama, Tetsushi Wakabayashi, Fumitaka Kimura

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

5 Citations (Scopus)

Abstract

Digitization process of various printed documents involves generating texts by an OCR system for different applications including full-text retrieval and document organizations. However, OCR-generated texts have errors as per present OCR technology. Moreover, previous studies have revealed that as OCR accuracy decreases the classification performance also decreases. The reason for this is the use of absolute word frequency as feature vector. Representing OCR texts using absolute word frequency has limitations such as dependency on text length and word recognition rate consequently lower classification performance due to higher within-class variances. We describe feature transformation techniques which do not have such limitations and present improved experimental results from all used classifiers.
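The abstract does not say which feature transformations the paper uses. As an illustration of the text-length dependency it describes, the following minimal sketch compares absolute word frequency with relative frequency, one simple normalization chosen here as an assumption. The function names, the vocabulary, and the toy documents are all hypothetical and do not come from the paper.

```python
from collections import Counter


def absolute_frequency(tokens, vocabulary):
    """Return the raw count of each vocabulary word in the token list."""
    counts = Counter(tokens)
    return [counts[word] for word in vocabulary]


def relative_frequency(tokens, vocabulary):
    """Return counts divided by the total in-vocabulary count.

    Assumed transformation for illustration only; not necessarily
    one of the techniques used in the paper.
    """
    absolute = absolute_frequency(tokens, vocabulary)
    total = sum(absolute)
    if total == 0:
        return [0.0] * len(vocabulary)
    return [count / total for count in absolute]


# Hypothetical toy data: two documents with identical word proportions.
vocabulary = ["ocr", "text", "error"]
short_doc = ["ocr", "text", "text", "error"]
long_doc = short_doc * 5  # same content, five times the length

print(absolute_frequency(short_doc, vocabulary))
print(absolute_frequency(long_doc, vocabulary))
print(relative_frequency(short_doc, vocabulary))
print(relative_frequency(long_doc, vocabulary))
```

Under these assumptions the absolute vectors of the two documents differ by a factor of five, while the relative vectors coincide. This is the kind of within-class variation that a length-independent representation avoids.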

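Original language: English
Title of host publication: Document Analysis Systems VII - 7th International Workshop, DAS 2006, Proceedings
Pages: 506-517
Number of pages: 12
DOI: https://doi.org/10.1007/11669487_45
Publication status: Published - 7 Jul 2006
Externally published: Yes
Event: 7th International Workshop on Document Analysis Systems, DAS 2006 - Nelson, New Zealand
Duration: 13 Feb 2006 to 15 Feb 2006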

Publication series

Name: Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
Volume: 3872 LNCS
ISSN (Print): 0302-9743
ISSN (Electronic): 1611-3349

Other

Other: 7th International Workshop on Document Analysis Systems, DAS 2006
Country: New Zealand
City: Nelson
Period: 2/13/06 to 2/15/06

Fingerprint

Optical character recognition
Text Classification
Text Retrieval
Decrease
Digitization
Feature Vector
Analog to digital conversion
Classifier
Classifiers
Text
Experimental Results

All Science Journal Classification (ASJC) codes

  • Theoretical Computer Science
  • Computer Science (all)

Cite this

Murata, M., Busagala, L. S. P., Oyama, W., Wakabayashi, T., & Kimura, F. (2006). The impact of OCR accuracy and feature transformation on automatic text classification. In Document Analysis Systems VII - 7th International Workshop, DAS 2006, Proceedings (pp. 506-517). (Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics); Vol. 3872 LNCS). https://doi.org/10.1007/11669487_45

The impact of OCR accuracy and feature transformation on automatic text classification. / Murata, Mayo; Busagala, Lazaro S.P.; Oyama, Wataru; Wakabayashi, Tetsushi; Kimura, Fumitaka.

Document Analysis Systems VII - 7th International Workshop, DAS 2006, Proceedings. 2006. p. 506-517 (Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics); Vol. 3872 LNCS).

Murata, M, Busagala, LSP, Oyama, W, Wakabayashi, T & Kimura, F 2006, The impact of OCR accuracy and feature transformation on automatic text classification. in Document Analysis Systems VII - 7th International Workshop, DAS 2006, Proceedings. Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), vol. 3872 LNCS, pp. 506-517, 7th International Workshop on Document Analysis Systems, DAS 2006, Nelson, New Zealand, 2/13/06. https://doi.org/10.1007/11669487_45
Murata M, Busagala LSP, Oyama W, Wakabayashi T, Kimura F. The impact of OCR accuracy and feature transformation on automatic text classification. In Document Analysis Systems VII - 7th International Workshop, DAS 2006, Proceedings. 2006. p. 506-517. (Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)). https://doi.org/10.1007/11669487_45
Murata, Mayo ; Busagala, Lazaro S.P. ; Oyama, Wataru ; Wakabayashi, Tetsushi ; Kimura, Fumitaka. / The impact of OCR accuracy and feature transformation on automatic text classification. Document Analysis Systems VII - 7th International Workshop, DAS 2006, Proceedings. 2006. pp. 506-517 (Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)).
@inproceedings{8dff69bfb76349f796c79a36daff1709,
title = "The impact of OCR accuracy and feature transformation on automatic text classification",
abstract = "Digitization process of various printed documents involves generating texts by an OCR system for different applications including full-text retrieval and document organizations. However, OCR-generated texts have errors as per present OCR technology. Moreover, previous studies have revealed that as OCR accuracy decreases the classification performance also decreases. The reason for this is the use of absolute word frequency as feature vector. Representing OCR texts using absolute word frequency has limitations such as dependency on text length and word recognition rate consequently lower classification performance due to higher within-class variances. We describe feature transformation techniques which do not have such limitations and present improved experimental results from all used classifiers.",
author = "Mayo Murata and Busagala, {Lazaro S.P.} and Wataru Oyama and Tetsushi Wakabayashi and Fumitaka Kimura",
year = "2006",
month = "7",
day = "7",
doi = "10.1007/11669487_45",
language = "English",
isbn = "3540321403",
series = "Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)",
pages = "506--517",
booktitle = "Document Analysis Systems VII - 7th International Workshop, DAS 2006, Proceedings",

}

TY - GEN

T1 - The impact of OCR accuracy and feature transformation on automatic text classification

AU - Murata, Mayo

AU - Busagala, Lazaro S.P.

AU - Oyama, Wataru

AU - Wakabayashi, Tetsushi

AU - Kimura, Fumitaka

PY - 2006/7/7

Y1 - 2006/7/7

AB - Digitization process of various printed documents involves generating texts by an OCR system for different applications including full-text retrieval and document organizations. However, OCR-generated texts have errors as per present OCR technology. Moreover, previous studies have revealed that as OCR accuracy decreases the classification performance also decreases. The reason for this is the use of absolute word frequency as feature vector. Representing OCR texts using absolute word frequency has limitations such as dependency on text length and word recognition rate consequently lower classification performance due to higher within-class variances. We describe feature transformation techniques which do not have such limitations and present improved experimental results from all used classifiers.

UR - http://www.scopus.com/inward/record.url?scp=33745557026&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=33745557026&partnerID=8YFLogxK

U2 - 10.1007/11669487_45

DO - 10.1007/11669487_45

M3 - Conference contribution

AN - SCOPUS:33745557026

SN - 3540321403

SN - 9783540321408

T3 - Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)

SP - 506

EP - 517

BT - Document Analysis Systems VII - 7th International Workshop, DAS 2006, Proceedings

ER -