Improving OCR for Historical Documents by Modeling Image Distortion

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

1 Citation (Scopus)

Abstract

Archives hold printed historical documents, many of which have deteriorated. It is difficult to extract text from such images without errors using optical character recognition (OCR). This problem reduces the accuracy of information retrieval. Therefore, it is necessary to improve the performance of OCR for images of deteriorated documents. One approach is to convert images of deteriorated documents to clear images, to make it easier for an OCR system to recognize text. To perform this conversion using a neural network, data is needed to train it. It is hard to prepare training data consisting of pairs of a deteriorated image and an image from which deterioration has been removed; however, it is easy to prepare training data consisting of pairs of a clear image and an image created by adding noise to it. In this study, PDFs of historical documents were collected and converted to text and JPEG images. Noise was added to the JPEG images to create a dataset in which the images had noise similar to that of the actual printed documents. U-Net, a type of neural network, was trained using this dataset. The performance of OCR for an image with noise in the test data was compared with the performance of OCR for an image generated from it by the trained U-Net. An improvement in the OCR recognition rate was confirmed.
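The pairing step described in the abstract (a clear image plus a copy with noise added) can be sketched as follows. This is a minimal illustration, not the authors' pipeline: the abstract does not say which noise was used, so the noise model here (Gaussian grain plus dark speckles), the function names `add_synthetic_noise` and `make_training_pair`, and all parameter values are assumptions chosen for the example.

```python
import numpy as np


def add_synthetic_noise(clean, rng, gaussian_sigma=12.0, speckle_fraction=0.02):
    """Return a degraded copy of a grayscale uint8 page image.

    The noise model (Gaussian grain plus dark speckles) is an assumption
    for illustration; the abstract does not specify the noise used.
    """
    noisy = clean.astype(np.float32)
    # Additive Gaussian noise, a rough stand-in for paper grain.
    noisy += rng.normal(0.0, gaussian_sigma, size=clean.shape)
    # Dark speckles on a random subset of pixels, a rough stand-in for ink spots.
    speckle_mask = rng.random(clean.shape) < speckle_fraction
    noisy[speckle_mask] = 0.0
    return np.clip(noisy, 0, 255).astype(np.uint8)


def make_training_pair(clean, rng):
    """Build one (input, target) pair: noisy image as input, clean image as target."""
    return add_synthetic_noise(clean, rng), clean


rng = np.random.default_rng(seed=0)
# A white "page" with a black bar, standing in for a page rendered from a PDF.
page = np.full((64, 64), 255, dtype=np.uint8)
page[28:36, 8:56] = 0
noisy_page, target_page = make_training_pair(page, rng)
```

A denoising network such as U-Net would then be trained to map `noisy_page` back to `target_page`; in practice the noise parameters would need tuning so the synthetic images resemble the actual deteriorated prints.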

Original language: English
Title of host publication: Digital Libraries at the Crossroads of Digital Information for the Future - 21st International Conference on Asia-Pacific Digital Libraries, ICADL 2019, Proceedings
Editors: Adam Jatowt, Akira Maeda, Sue Yeon Syn
Publisher: Springer
Pages: 312-316
Number of pages: 5
ISBN (Print): 9783030340575
Publication status: Published - 2019
Event: 21st International Conference on Asia-Pacific Digital Libraries, ICADL 2019 - Kuala Lumpur, Malaysia
Duration: Nov 4 2019 to Nov 7 2019

Publication series

Name: Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
Volume: 11853 LNCS
ISSN (Print): 0302-9743
ISSN (Electronic): 1611-3349

Conference

Conference: 21st International Conference on Asia-Pacific Digital Libraries, ICADL 2019
Country: Malaysia
City: Kuala Lumpur
Period: 11/4/19 to 11/7/19

All Science Journal Classification (ASJC) codes

  • Theoretical Computer Science
  • Computer Science (all)
