An impact of OCR errors on automated classification of OCR Japanese texts with parts-of-speech analysis

Akihiro Kokawa, Lazaro S.P. Busagala, Wataru Ohyama, Tetsushi Wakabayashi, Fumitaka Kimura

Research output: Chapter in Book/Report/Conference proceeding, Conference contribution

1 citation (Scopus)

Abstract

The technology of Optical Character Recognition (OCR) is used to generate texts in the process of digitizing print documents. Usually these texts need to be indexed and organized to simplify their access and retrieval. One of the powerful approaches in accomplishing this task is the use of Automated Text Classification. However, it is currently impossible for OCR technology to recognize all characters with an accuracy of 100%. We therefore propose the use of combined linguistic features in automated classification of OCR texts to formulate an informative feature set. The proposed method was experimentally evaluated using Japanese OCR texts. Empirical results indicate that the combination of linguistic features improved classification performance of OCR texts.

Original language: English
Title of host publication: Proceedings - 11th International Conference on Document Analysis and Recognition, ICDAR 2011
Pages: 543-547
Number of pages: 5
Publication status: Published - 2011
Event: 11th International Conference on Document Analysis and Recognition, ICDAR 2011 - Beijing, China
Duration: 18 Sep 2011 to 21 Sep 2011

Publication series

Name: Proceedings of the International Conference on Document Analysis and Recognition, ICDAR
ISSN (Print): 1520-5363

Other

Other: 11th International Conference on Document Analysis and Recognition, ICDAR 2011
Country/Territory: China
City: Beijing
Period: 9/18/11 to 9/21/11

All Science Journal Classification (ASJC) codes

  • Computer Vision and Pattern Recognition
