The impact of OCR accuracy and feature transformation on automatic text classification

Mayo Murata, Lazaro S.P. Busagala, Wataru Oyama, Tetsushi Wakabayashi, Fumitaka Kimura

Research output: Chapter in Book/Report/Conference proceeding > Conference contribution

7 Citations (Scopus)

Abstract

Digitizing printed documents involves generating text with an OCR system for applications such as full-text retrieval and document organization. However, text produced by present OCR technology contains errors, and previous studies have shown that classification performance decreases as OCR accuracy decreases. The reason is the use of absolute word frequency as the feature vector. Representing OCR text by absolute word frequency makes the features depend on text length and word recognition rate, which increases within-class variance and consequently lowers classification performance. We describe feature transformation techniques that do not have these limitations and present improved experimental results for all classifiers used.
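
The dependence on text length described above can be illustrated with a minimal sketch. The abstract does not name the transformations used, so the relative-frequency normalization and the power transformation below are assumptions chosen for illustration, and the function names, the toy vocabulary, and the exponent 0.5 are hypothetical rather than taken from the paper.

```python
from collections import Counter
import math


def absolute_frequency(tokens, vocabulary):
    """Count how often each vocabulary word occurs in the token list."""
    counts = Counter(tokens)
    return [float(counts[word]) for word in vocabulary]


def relative_frequency(tokens, vocabulary):
    """Divide each count by the total of in-vocabulary counts (assumed transform)."""
    absolute = absolute_frequency(tokens, vocabulary)
    total = sum(absolute)
    if total == 0:
        return absolute  # no vocabulary word found; leave the zero vector as is
    return [value / total for value in absolute]


def power_transform(vector, exponent=0.5):
    """Raise each non-negative component to a power (assumed variable transform)."""
    return [value ** exponent for value in vector]


def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


# Toy data: the long document repeats the short one three times, so both
# have the same word distribution and differ only in length.
vocabulary = ["ocr", "text", "error"]
short_doc = ["ocr", "text", "text", "error"]
long_doc = short_doc * 3

absolute_distance = euclidean(
    absolute_frequency(short_doc, vocabulary),
    absolute_frequency(long_doc, vocabulary),
)
relative_distance = euclidean(
    relative_frequency(short_doc, vocabulary),
    relative_frequency(long_doc, vocabulary),
)

print(absolute_distance)  # greater than zero: length alone separates the vectors
print(relative_distance)  # zero: normalization removes the length effect
```

In this toy case the two documents are far apart under absolute frequency but coincide after normalization. The same cancellation would apply if OCR errors removed a roughly uniform fraction of every word's occurrences, which is one plausible reading of the recognition-rate dependence mentioned in the abstract.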

Original language: English
Title of host publication: Document Analysis Systems VII - 7th International Workshop, DAS 2006, Proceedings
Pages: 506-517
Number of pages: 12
Publication status: Published - Jul 7 2006
Externally published: Yes
Event: 7th International Workshop on Document Analysis Systems, DAS 2006 - Nelson, New Zealand
Duration: Feb 13 2006 to Feb 15 2006

Publication series

Name: Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
Volume: 3872 LNCS
ISSN (Print): 0302-9743
ISSN (Electronic): 1611-3349

Other

Other: 7th International Workshop on Document Analysis Systems, DAS 2006
Country/Territory: New Zealand
City: Nelson
Period: 2/13/06 to 2/15/06

All Science Journal Classification (ASJC) codes

  • Theoretical Computer Science
  • Computer Science (all)
