An impact of linguistic features on automated classification of OCR texts

Gudila Paul Moshi, Lazaro S.P. Busagala, Wataru Ohyama, Tetsushi Wakabayashi, Fumitaka Kimura

Research output: Chapter in Book/Report/Conference proceeding (Conference contribution)

Abstract

Optical character reader (OCR) systems can be used to digitize print documents, and OCR texts are generated in the process. These texts usually need to be indexed and organized to simplify access and retrieval, which can be done with automatic classification techniques. However, OCR technology currently cannot recognize all characters with 100% accuracy. Furthermore, it is not known whether part-of-speech (POS) analysis contributes to a discriminative representation of OCR texts. Conventionally, the bag-of-words approach is used in OCR text classification. In this paper, we experimentally evaluated POS analysis on OCR texts to formulate an informative feature set. Empirical results indicate that a combination of suitably selected POS improved the classification performance of OCR texts.
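The contrast between a plain bag-of-words representation and one restricted to selected parts of speech can be illustrated with a minimal sketch. This is not the authors' method: the tag names, the sample tokens (including the OCR-style misreading "rnarket"), and the helper functions `bag_of_words` and `pos_filtered_bag_of_words` are all hypothetical, and the tokens are assumed to be POS-tagged already by some external tagger.

```python
from collections import Counter

# Hypothetical OCR output, already tagged as (token, POS) pairs.
# "rnarket" imitates an OCR misreading of "market".
tagged_doc = [
    ("The", "DET"), ("market", "NOUN"), ("rose", "VERB"),
    ("sharply", "ADV"), ("as", "ADP"), ("the", "DET"),
    ("rnarket", "NOUN"), ("opened", "VERB"),
]


def bag_of_words(tagged_tokens):
    """Count every lowercased token, ignoring its POS tag."""
    return Counter(token.lower() for token, _tag in tagged_tokens)


def pos_filtered_bag_of_words(tagged_tokens, keep_tags):
    """Count only the lowercased tokens whose POS tag is in keep_tags."""
    return Counter(
        token.lower() for token, tag in tagged_tokens if tag in keep_tags
    )


baseline_features = bag_of_words(tagged_doc)
selected_features = pos_filtered_bag_of_words(tagged_doc, {"NOUN", "VERB"})

print(len(baseline_features), "features without POS selection")
print(len(selected_features), "features with nouns and verbs only")
```

Which POS categories to keep is exactly the kind of choice the paper evaluates experimentally; the noun-and-verb set here is only a placeholder.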

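Original language: English
Title of host publication: Proceedings of the 9th IAPR International Workshop on Document Analysis Systems, DAS '10
Pages: 287-292
Number of pages: 6
DOI: https://doi.org/10.1145/1815330.1815367
Publication status: Published - Aug 2, 2010
Event: 2010 IAPR Workshop on Document Analysis Systems, DAS 2010 - Boston, MA, United States
Duration: Jun 9, 2010 to Jun 11, 2010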

Publication series

Name: ACM International Conference Proceeding Series

Other

Other: 2010 IAPR Workshop on Document Analysis Systems, DAS 2010
Country: United States
City: Boston, MA
Period: Jun 9, 2010 to Jun 11, 2010

Fingerprint

Linguistics
Speech analysis

All Science Journal Classification (ASJC) codes

  • Software
  • Human-Computer Interaction
  • Computer Vision and Pattern Recognition
  • Computer Networks and Communications

Cite this

Moshi, G. P., Busagala, L. S. P., Ohyama, W., Wakabayashi, T., & Kimura, F. (2010). An impact of linguistic features on automated classification of OCR texts. In Proceedings of the 9th IAPR International Workshop on Document Analysis Systems, DAS '10 (pp. 287-292). (ACM International Conference Proceeding Series). https://doi.org/10.1145/1815330.1815367

@inproceedings{a4296d3a284c48fcbec8700881dbb28d,
title = "An impact of linguistic features on automated classification of OCR texts",
abstract = "Optical Character reader (OCR) systems can be used in digitizing print documents. OCR texts are generated in the process of digitizing print documents. Usually these texts need to be indexed and organized to simplify their access and retrieval. This can be done by the use of automatic classification techniques. However it is currently impossible for OCR technology to recognize all characters with an accuracy of 100{\%}. Furthermore it is not known whether part of speech (POS) analysis contributes to proper OCR texts representation in a discriminative way. Conventionally, the bag-of-words approach is used in OCR text classification. In this paper we experimentally evaluated POS analysis on OCR texts to formulate an informative feature set. Empirical results indicate that the combination of suitably selected POS improved classification performance of OCR texts.",
author = "Moshi, {Gudila Paul} and Busagala, {Lazaro S.P.} and Wataru Ohyama and Tetsushi Wakabayashi and Fumitaka Kimura",
year = "2010",
month = "8",
day = "2",
doi = "10.1145/1815330.1815367",
language = "English",
isbn = "9781605587738",
series = "ACM International Conference Proceeding Series",
pages = "287--292",
booktitle = "Proceedings of the 9th IAPR International Workshop on Document Analysis Systems, DAS '10",

}

TY - GEN

T1 - An impact of linguistic features on automated classification of OCR texts

AU - Moshi, Gudila Paul

AU - Busagala, Lazaro S.P.

AU - Ohyama, Wataru

AU - Wakabayashi, Tetsushi

AU - Kimura, Fumitaka

PY - 2010/8/2

Y1 - 2010/8/2

AB - Optical Character reader (OCR) systems can be used in digitizing print documents. OCR texts are generated in the process of digitizing print documents. Usually these texts need to be indexed and organized to simplify their access and retrieval. This can be done by the use of automatic classification techniques. However it is currently impossible for OCR technology to recognize all characters with an accuracy of 100%. Furthermore it is not known whether part of speech (POS) analysis contributes to proper OCR texts representation in a discriminative way. Conventionally, the bag-of-words approach is used in OCR text classification. In this paper we experimentally evaluated POS analysis on OCR texts to formulate an informative feature set. Empirical results indicate that the combination of suitably selected POS improved classification performance of OCR texts.

UR - http://www.scopus.com/inward/record.url?scp=77954979705&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=77954979705&partnerID=8YFLogxK

U2 - 10.1145/1815330.1815367

DO - 10.1145/1815330.1815367

M3 - Conference contribution

AN - SCOPUS:77954979705

SN - 9781605587738

T3 - ACM International Conference Proceeding Series

SP - 287

EP - 292

BT - Proceedings of the 9th IAPR International Workshop on Document Analysis Systems, DAS '10

ER -