TY - BOOK
T1 - A Guide to Capturing Text from Historical Documents
T2 - Report commissioned by the Oxford University Digital Library.
AU - Tanner, Simon George
PY - 2006/6/1
Y1 - 2006/6/1
N2 - Text capture is the process of converting textual content that exists in physical artefacts or in digital images into machine-readable text. This is a significant challenge for documents from earlier than the 1950s, as automated processes for text capture work best on modern printed text that is very clear, consistent and uncomplicated in layout and language. Oxford University has collections that date several centuries before the 1950s, and thus the complexity and cost of text capture from these collections are manifestly greater than for modern collections. The current text capture activities at Oxford University (such as Optical Character Recognition and keying) are characterised by being distributed across many projects and stakeholders. They currently carry a sense that text capture could be done better, with greater accuracy or more cost efficiency. There is a high level of activity and much good practice at Oxford, but consolidating and sharing this effectively has proved difficult. The Oxford University Digital Library (ODL) has commissioned this Text Capture Study to develop guidance, workflows and scenarios to enable a more uniform approach to delivering effective, accurate and cost-efficient text capture at Oxford University. A core purpose of this report is to raise the level of knowledge and experience of digitisation projects to enable them to assess the most efficient mechanism from a basic assessment of the original materials. A number of images from the Oxford Digital Library have been assessed and tested in a number of OCR engines. The results are contained in this report. More important than any single set of test results are the overall methods and workflows for text capture that such tests reveal as being the most effective at delivering accurate text representation. These methods and workflows are reported. Some scenarios are listed and examples given that demonstrate how to assess textual resources as being more suitable for certain text capture methods than others. This document thus illuminates a number of key factors to consider in designing a text capture method. This document will also provide advice to digitisation projects on how to approach a text capture project and
AB - Text capture is the process of converting textual content that exists in physical artefacts or in digital images into machine-readable text. This is a significant challenge for documents from earlier than the 1950s, as automated processes for text capture work best on modern printed text that is very clear, consistent and uncomplicated in layout and language. Oxford University has collections that date several centuries before the 1950s, and thus the complexity and cost of text capture from these collections are manifestly greater than for modern collections. The current text capture activities at Oxford University (such as Optical Character Recognition and keying) are characterised by being distributed across many projects and stakeholders. They currently carry a sense that text capture could be done better, with greater accuracy or more cost efficiency. There is a high level of activity and much good practice at Oxford, but consolidating and sharing this effectively has proved difficult. The Oxford University Digital Library (ODL) has commissioned this Text Capture Study to develop guidance, workflows and scenarios to enable a more uniform approach to delivering effective, accurate and cost-efficient text capture at Oxford University. A core purpose of this report is to raise the level of knowledge and experience of digitisation projects to enable them to assess the most efficient mechanism from a basic assessment of the original materials. A number of images from the Oxford Digital Library have been assessed and tested in a number of OCR engines. The results are contained in this report. More important than any single set of test results are the overall methods and workflows for text capture that such tests reveal as being the most effective at delivering accurate text representation. These methods and workflows are reported. Some scenarios are listed and examples given that demonstrate how to assess textual resources as being more suitable for certain text capture methods than others. This document thus illuminates a number of key factors to consider in designing a text capture method. This document will also provide advice to digitisation projects on how to approach a text capture project and
KW - Optical Character Recognition
KW - OCR
KW - Digitisation
KW - Digital library
M3 - Commissioned report
BT - A Guide to Capturing Text from Historical Documents
PB - King's College London
ER -