Optical character recognition in microfilmed newspaper library collections: A feasibility study

Riitta Alkula, Kari Pieskä

Research output: Book/Report › Report

Abstract

The aim of the OCR Index project was to investigate the feasibility of using optical character recognition (OCR) to generate full-text indexes for newspaper collections. The project comprised a literature survey and a controlled experiment with 35 mm microfilm frames and original newspaper pages. The test material was scanned with a microfilm scanner and an A4-size page scanner. The resulting image files were processed by OCR software to produce editable text files. The purpose was to determine whether OCR is accurate enough for producing indexes automatically. A major problem with microfilm scanning is the relative newness of the technique. Scanners suitable for volume conversion of roll film have only recently become available and require extra accessories for handling 35 mm roll film, the commonest format used in libraries. Another problem is the incompatibility of image files and OCR software, which requires special conversion programs. In the OCR Index project, text files produced by the OCR software were analysed and recognition errors were grouped into main categories: unrecognised, substituted, split, joined, inserted, and deleted characters. The first two appeared to be the commonest error types. Recognition accuracy was poorer than for ordinary office documents. This is partly due to the problematic nature of newspaper text, which has tight character spacing and contains multiple columns and various typefaces. Information retrieval demands good accuracy at the level of words as well as characters, because a misspelled word will not match the search term. The proportion of correct words appeared to be much lower than that of correct characters. A character accuracy below 98 per cent gives such poor word recognition that it no longer seems feasible to produce indexes from text files obtained via OCR. To improve recognition results, OCR software dedicated to newspaper text should be produced. Also, automatic spelling correction methods for text produced by OCR should be improved, and text retrieval methods that can cope with incorrect words should be developed.
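The gap between character accuracy and word accuracy noted in the abstract can be illustrated with a rough calculation. The following is a minimal sketch, not the method used in the report: it assumes that recognition errors are independent between characters, so a word of n characters is fully correct with probability p^n, where p is the character accuracy. The function name and the word lengths are illustrative assumptions.

```python
def expected_word_accuracy(char_accuracy: float, word_length: int) -> float:
    """Probability that every character of a word is recognised correctly.

    Assumption (not from the report): errors on individual characters
    are independent, so the per-character probabilities multiply.
    """
    if not 0.0 <= char_accuracy <= 1.0:
        raise ValueError("char_accuracy must be between 0 and 1")
    if word_length < 1:
        raise ValueError("word_length must be at least 1")
    return char_accuracy ** word_length


if __name__ == "__main__":
    # Hypothetical word lengths, chosen only for illustration.
    for acc in (0.99, 0.98, 0.95, 0.90):
        row = ", ".join(
            f"{n} letters: {expected_word_accuracy(acc, n):.1%}"
            for n in (5, 10, 15)
        )
        print(f"character accuracy {acc:.0%} -> {row}")
```

Under this simplifying assumption, 98 per cent character accuracy leaves a 10-character word fully correct only about 82 per cent of the time, which is consistent with the abstract's observation that correct words are much rarer than correct characters.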
Original language: English
Place of Publication: Espoo
Publisher: VTT Technical Research Centre of Finland
Number of pages: 54
ISBN (Print): 951-38-4707-1
Publication status: Published - 1994
MoE publication type: Not Eligible

Publication series

Series: VTT Tiedotteita - Meddelanden - Research Notes
Number: 1592
ISSN: 1235-0605

Fingerprint

Optical character recognition
Microfilm
Accessories
Information retrieval
Scanning

Keywords

  • optical character recognition
  • information systems
  • scanners
  • archiving
  • information retrieval
  • electronic archives
  • optical archiving
  • microfilm

Cite this

Alkula, R., & Pieskä, K. (1994). Optical character recognition in microfilmed newspaper library collections: A feasibility study. Espoo: VTT Technical Research Centre of Finland. VTT Tiedotteita - Meddelanden - Research Notes, No. 1592
Alkula, Riitta ; Pieskä, Kari. / Optical character recognition in microfilmed newspaper library collections : A feasibility study. Espoo : VTT Technical Research Centre of Finland, 1994. 54 p. (VTT Tiedotteita - Meddelanden - Research Notes; No. 1592).
@book{a92c5c3355de4eaa8fc100cf97e8ee14,
title = "Optical character recognition in microfilmed newspaper library collections: A feasibility study",
abstract = "The aim of the OCR Index project was to investigate the feasibility of using optical character recognition (OCR) to generate full-text indexes for newspaper collections. The project comprised a literature survey and a controlled experiment with 35 mm microfilm frames and original newspaper pages. The test material was scanned with a microfilm scanner and an A4-size page scanner. The resulting image files were processed by OCR software to produce editable text files. The purpose was to determine whether OCR is accurate enough for producing indexes automatically. A major problem with microfilm scanning is the relative newness of the technique. Scanners suitable for volume conversion of roll film have only recently become available and require extra accessories for handling 35 mm roll film, the commonest format used in libraries. Another problem is the incompatibility of image files and OCR software, which requires special conversion programs. In the OCR Index project, text files produced by the OCR software were analysed and recognition errors were grouped into main categories: unrecognised, substituted, split, joined, inserted, and deleted characters. The first two appeared to be the commonest error types. Recognition accuracy was poorer than for ordinary office documents. This is partly due to the problematic nature of newspaper text, which has tight character spacing and contains multiple columns and various typefaces. Information retrieval demands good accuracy at the level of words as well as characters, because a misspelled word will not match the search term. The proportion of correct words appeared to be much lower than that of correct characters. A character accuracy below 98 per cent gives such poor word recognition that it no longer seems feasible to produce indexes from text files obtained via OCR. To improve recognition results, OCR software dedicated to newspaper text should be produced. Also, automatic spelling correction methods for text produced by OCR should be improved, and text retrieval methods that can cope with incorrect words should be developed.",
keywords = "optical character recognition, information systems, scanners, archiving, information retrieval, electronic archives, optical archiving, microfilm",
author = "Riitta Alkula and Kari Piesk{\"a}",
year = "1994",
language = "English",
isbn = "951-38-4707-1",
series = "VTT Tiedotteita - Meddelanden - Research Notes",
publisher = "VTT Technical Research Centre of Finland",
number = "1592",
address = "Espoo, Finland",
issn = "1235-0605",
}

Alkula, R & Pieskä, K 1994, Optical character recognition in microfilmed newspaper library collections: A feasibility study. VTT Tiedotteita - Meddelanden - Research Notes, no. 1592, VTT Technical Research Centre of Finland, Espoo.

Optical character recognition in microfilmed newspaper library collections : A feasibility study. / Alkula, Riitta; Pieskä, Kari.

Espoo : VTT Technical Research Centre of Finland, 1994. 54 p. (VTT Tiedotteita - Meddelanden - Research Notes; No. 1592).

TY  - BOOK
T1  - Optical character recognition in microfilmed newspaper library collections
T2  - A feasibility study
AU  - Alkula, Riitta
AU  - Pieskä, Kari
PY  - 1994
Y1  - 1994
AB  - The aim of the OCR Index project was to investigate the feasibility of using optical character recognition (OCR) to generate full-text indexes for newspaper collections. The project comprised a literature survey and a controlled experiment with 35 mm microfilm frames and original newspaper pages. The test material was scanned with a microfilm scanner and an A4-size page scanner. The resulting image files were processed by OCR software to produce editable text files. The purpose was to determine whether OCR is accurate enough for producing indexes automatically. A major problem with microfilm scanning is the relative newness of the technique. Scanners suitable for volume conversion of roll film have only recently become available and require extra accessories for handling 35 mm roll film, the commonest format used in libraries. Another problem is the incompatibility of image files and OCR software, which requires special conversion programs. In the OCR Index project, text files produced by the OCR software were analysed and recognition errors were grouped into main categories: unrecognised, substituted, split, joined, inserted, and deleted characters. The first two appeared to be the commonest error types. Recognition accuracy was poorer than for ordinary office documents. This is partly due to the problematic nature of newspaper text, which has tight character spacing and contains multiple columns and various typefaces. Information retrieval demands good accuracy at the level of words as well as characters, because a misspelled word will not match the search term. The proportion of correct words appeared to be much lower than that of correct characters. A character accuracy below 98 per cent gives such poor word recognition that it no longer seems feasible to produce indexes from text files obtained via OCR. To improve recognition results, OCR software dedicated to newspaper text should be produced. Also, automatic spelling correction methods for text produced by OCR should be improved, and text retrieval methods that can cope with incorrect words should be developed.
KW  - optical character recognition
KW  - information systems
KW  - scanners
KW  - archiving
KW  - information retrieval
KW  - electronic archives
KW  - optical archiving
KW  - microfilm
M3  - Report
SN  - 951-38-4707-1
T3  - VTT Tiedotteita - Meddelanden - Research Notes
BT  - Optical character recognition in microfilmed newspaper library collections
PB  - VTT Technical Research Centre of Finland
CY  - Espoo
ER  -

Alkula R, Pieskä K. Optical character recognition in microfilmed newspaper library collections: A feasibility study. Espoo: VTT Technical Research Centre of Finland, 1994. 54 p. (VTT Tiedotteita - Meddelanden - Research Notes; No. 1592).