Improving OCR for Historical Documents by Modeling Image Distortion

Keiya Maekawa, Yoichi Tomiura, Satoshi Fukuda, Emi Ishita, Hideaki Uchiyama

Research output: Chapter in Book/Report/Conference proceeding, Conference contribution

1 citation (Scopus)

Abstract

Archives hold printed historical documents, many of which have deteriorated. It is difficult to extract text from such images without errors using optical character recognition (OCR). This problem reduces the accuracy of information retrieval. Therefore, it is necessary to improve the performance of OCR for images of deteriorated documents. One approach is to convert images of deteriorated documents to clear images, to make it easier for an OCR system to recognize text. To perform this conversion using a neural network, data is needed to train it. It is hard to prepare training data consisting of pairs of a deteriorated image and an image from which deterioration has been removed; however, it is easy to prepare training data consisting of pairs of a clear image and an image created by adding noise to it. In this study, PDFs of historical documents were collected and converted to text and JPEG images. Noise was added to the JPEG images to create a dataset in which the images had noise similar to that of the actual printed documents. U-Net, a type of neural network, was trained using this dataset. The performance of OCR for an image with noise in the test data was compared with the performance of OCR for an image generated from it by the trained U-Net. An improvement in the OCR recognition rate was confirmed.
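The data preparation and evaluation steps described in the abstract can be sketched as follows. This is a minimal illustration, not the authors' method: the abstract does not say which noise model or which recognition-rate metric was used, so salt-and-pepper noise and a difflib similarity ratio are stand-in assumptions, and the names add_salt_and_pepper, make_training_pair, and recognition_rate are hypothetical. Images are represented as plain lists of grayscale rows to keep the sketch dependency-free; the U-Net training step itself is omitted.

```python
import random
from difflib import SequenceMatcher


def add_salt_and_pepper(image, amount=0.05, seed=None):
    """Return a noisy copy of a grayscale image (rows of 0-255 ints).

    Assumption: salt-and-pepper noise stands in for the deterioration
    model; the abstract does not specify the noise actually used.
    """
    rng = random.Random(seed)
    noisy = [row[:] for row in image]  # copy so the clean image is untouched
    for row in noisy:
        for x in range(len(row)):
            if rng.random() < amount:
                row[x] = rng.choice((0, 255))  # black or white speck
    return noisy


def make_training_pair(clean, amount=0.05, seed=None):
    """Build one (input, target) pair: noisy image in, clean image out."""
    return add_salt_and_pepper(clean, amount, seed), clean


def recognition_rate(ocr_text, reference_text):
    """Similarity in [0, 1] between OCR output and the reference text.

    Assumption: difflib's ratio is used here as a simple proxy for a
    character-level recognition rate.
    """
    return SequenceMatcher(None, ocr_text, reference_text).ratio()


if __name__ == "__main__":
    clean_page = [[255] * 8 for _ in range(4)]  # toy all-white "page"
    noisy_page, target = make_training_pair(clean_page, amount=0.2, seed=0)
    print(len(noisy_page), len(noisy_page[0]))
    print(recognition_rate("histcrical", "historical"))
```

In a full pipeline, the noisy image would be the network input and the clean image the training target; the recognition rate would then be computed once for OCR on the noisy test image and once for OCR on the network's output, and the two values compared.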

Original language: English
Title of host publication: Digital Libraries at the Crossroads of Digital Information for the Future - 21st International Conference on Asia-Pacific Digital Libraries, ICADL 2019, Proceedings
Editors: Adam Jatowt, Akira Maeda, Sue Yeon Syn
Publisher: Springer
Pages: 312-316
Number of pages: 5
ISBN (Print): 9783030340575
DOI
Publication status: Published - 2019
Event: 21st International Conference on Asia-Pacific Digital Libraries, ICADL 2019 - Kuala Lumpur, Malaysia
Duration: Nov 4, 2019 to Nov 7, 2019

Publication series

Name: Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
Volume: 11853 LNCS
ISSN (Print): 0302-9743
ISSN (Electronic): 1611-3349

Conference

Conference: 21st International Conference on Asia-Pacific Digital Libraries, ICADL 2019
Country: Malaysia
City: Kuala Lumpur
Period: 11/4/19 to 11/7/19

All Science Journal Classification (ASJC) codes

  • Theoretical Computer Science
  • Computer Science (all)
