Optical character recognition in microfilmed newspaper library collections: A feasibility study

Riitta Alkula, Kari Pieskä

Research output: Book/Report › Report

Abstract

The aim of the OCR Index project was to investigate the feasibility of optical character recognition (OCR) in generating full-text indexes for newspaper collections. The project comprised a literature survey and a controlled experiment with 35 mm microfilm frames and original newspaper pages. The test material was scanned with a microfilm scanner and an A4-size page scanner. The resulting image files were processed by OCR software to produce editable text files. The purpose was to determine whether OCR is accurate enough for producing indexes automatically. A major problem with microfilm scanning is the relative newness of the technique. Scanners suitable for volume conversion of roll film have only recently become available and require extra accessories for handling 35 mm roll film, the commonest format used in libraries. Another problem is the incompatibility of image files and OCR software, which requires special conversion programs. In the OCR Index project, text files produced by the OCR software were analysed and recognition errors grouped into six main categories: unrecognised, substituted, split, joined, inserted, and deleted characters. The first two appeared to be the commonest error types. Recognition accuracy was poorer than for ordinary office documents. This is partly due to the problematic nature of newspaper text, which has tight character spacing and contains multiple columns and various typefaces. Information retrieval demands high accuracy not only of characters but also of words; a misspelled word will not match the search term. The proportion of correct words appeared to be much lower than that of correct characters. A character accuracy below 98 per cent gives such poor word recognition that it no longer seems feasible to produce indexes from text files obtained via OCR. To improve recognition results, OCR software dedicated to newspaper text should be produced. Also, automatic spelling correction methods for text produced by OCR should be improved, and text retrieval methods that can cope with incorrect words should be developed.
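The gap between character accuracy and word accuracy noted in the abstract can be illustrated with a simple model. The sketch below is not taken from the report: it assumes that recognition errors are independent from one character to the next and that a word is usable for indexing only if every character in it is correct. The function name `expected_word_accuracy` and the eight-letter word length are hypothetical choices made for illustration.

```python
def expected_word_accuracy(char_accuracy: float, word_length: int) -> float:
    """Estimate the share of words recognised without any error.

    Assumption (not from the report): character errors are independent,
    so a word of n characters is fully correct with probability p ** n.
    """
    if not 0.0 <= char_accuracy <= 1.0:
        raise ValueError("char_accuracy must lie between 0 and 1")
    if word_length < 1:
        raise ValueError("word_length must be at least 1")
    return char_accuracy ** word_length


if __name__ == "__main__":
    # Hypothetical word length of 8 characters, chosen only for illustration.
    for accuracy in (0.99, 0.98, 0.95, 0.90):
        share = expected_word_accuracy(accuracy, 8)
        print(f"character accuracy {accuracy:.2f} -> word accuracy {share:.3f}")
```

Under these assumptions, a character accuracy of 98 per cent leaves only about 85 per cent of eight-letter words fully correct, which suggests why word accuracy falls away so quickly. Real OCR errors tend to cluster, so measured figures may differ from this estimate.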
Original language: English
Place of Publication: Espoo
Publisher: VTT Technical Research Centre of Finland
Number of pages: 54
ISBN (Print): 951-38-4707-1
Publication status: Published - 1994
MoE publication type: Not Eligible

Publication series

Series: VTT Tiedotteita - Meddelanden - Research Notes
Number: 1592
ISSN: 1235-0605

Keywords

  • optical character recognition
  • information systems
  • scanners
  • archiving
  • information retrieval
  • electronic archives
  • optical archiving
  • microfilm
