A Guide to Capturing Text from Historical Documents: Report commissioned by the Oxford University Digital Library.

Research output: Book/Report › Commissioned report › peer-review


Abstract

Text capture is the process of converting textual content that exists in physical
artefacts or in digital images into machine-readable text. This is a significant
challenge for documents from earlier than the 1950s, as automated processes
for text capture work best on modern printed text that is very clear, consistent
and uncomplicated in layout and language. Oxford University has collections
that date from several centuries before the 1950s, and thus the complexity and cost
of text capture from these collections are manifestly greater than for modern
collections.
The current text capture activities at Oxford University (such as Optical
Character Recognition and keying) are distributed across many projects and
stakeholders. Those involved share a sense that text capture could be done
better, with greater accuracy or greater cost efficiency. There is a
high level of activity and much good practice at Oxford, but consolidating and
sharing this effectively has proved difficult.
The Oxford University Digital Library (ODL) has commissioned this Text
Capture Study to develop guidance, workflows and scenarios to enable a more
uniform approach to delivering effective, accurate and cost-efficient text
capture at Oxford University. A core purpose of this report is to raise the level
of knowledge and experience of digitisation projects so that they can identify
the most efficient text capture mechanism from a basic assessment of the
original materials.
A number of images from the Oxford Digital Library have been assessed and
tested in a number of OCR engines. The results are contained in this report.
More important than any single set of test results are the overall methods
and workflows for text capture that such tests reveal as being the most
effective at delivering accurate text representation. These methods and
workflows are reported. Some scenarios are listed and examples given that
demonstrate how to assess textual resources as being more suitable for certain
text capture methods than others.
This document thus illuminates a number of key factors to consider in
designing a text capture method. This document will also provide advice to
digitisation projects on how to approach a text capture project and
Original language: English
Publisher: King's College London
Number of pages: 35
Publication status: Published - 1 Jun 2006

Keywords

  • Optical Character Recognition
  • OCR
  • Digitisation
  • Digital library
