TY - BOOK
T1 - Optical character recognition in microfilmed newspaper library collections
T2 - A feasibility study
AU - Alkula, Riitta
AU - Pieskä, Kari
PY - 1994
Y1 - 1994
N2 - The aim of the OCR Index project was to investigate the feasibility of using optical character recognition (OCR) to generate full-text indexes for newspaper collections. The project comprised a literature survey and a controlled experiment with 35 mm microfilm frames and original newspaper pages. The test material was scanned with a microfilm scanner and an A4-size page scanner. The resulting image files were processed by OCR software to produce editable text files. The purpose was to determine whether OCR is accurate enough for producing indexes automatically.
A major problem with microfilm scanning is the relative newness of the technique. Scanners suitable for volume conversion of roll film have only recently become available and require extra accessories for handling 35 mm roll film, the commonest format used in libraries. Another problem is the incompatibility of image files and OCR software, which requires special conversion programs.
In the OCR Index project, the text files produced by the OCR software were analysed and the recognition errors grouped into main categories: unrecognised, substituted, split, joined, inserted, and deleted characters. The first two appeared to be the commonest error types. Recognition accuracy was poorer than that achieved with ordinary office documents. This is partly due to the problematic nature of newspaper text, which has tight character spacing and contains multiple columns and various typefaces.
Information retrieval demands good accuracy, not only of characters but also of words; a misspelled word will not match the search term. The proportion of correct words appeared to be much lower than that of correct characters. A character accuracy below 98 per cent gives such poor word recognition that it no longer seems feasible to produce indexes from text files obtained via OCR.
To improve recognition results, OCR software dedicated to newspaper text should be produced. Also, automatic spelling correction methods for text produced by OCR should be improved, and text retrieval methods that can cope with incorrect words should be developed.
KW - optical character recognition
KW - information systems
KW - scanners
KW - archiving
KW - information retrieval
KW - electronic archives
KW - optical archiving
KW - microfilm
M3 - Report
SN - 951-38-4707-1
T3 - VTT Tiedotteita - Meddelanden - Research Notes
BT - Optical character recognition in microfilmed newspaper library collections
PB - VTT Technical Research Centre of Finland
CY - Espoo
ER -