Automated Collation and Digital Editions: From Theory to Practice

Student thesis: Doctoral Thesis, Doctor of Philosophy


The purpose of the dissertation is to investigate from a theoretical and methodological perspective the different tools that allow automated collation, and study the application of such tools to the creation of a digital critical edition in the context of Classical literature. By doing so, the dissertation examines many foundational but often neglected components of the philological method, such as the definition and wider implication of transcription, reading, and variant.
The goal is to provide a reflection on automated collation and the theoretical as well as practical challenges it poses: what is automated collation? How is it performed, and how does it differ from manual collation? What are the benefits of automated collation? Why has it not been widely adopted yet, despite the fact that it was developed to help scholars? How can the results of collation programmes be processed? As a case study, a Classical Latin text has been used to test automated collation and to compare the various existing tools.
The method I follow in this dissertation is to apply automated collation to a selected text, the Declamations of Calpurnius Flaccus. To this purpose, the manuscript tradition, as well as the editio princeps, have been entirely transcribed. Afterwards, the transcriptions have been collated with different collation programmes. The results of the collation with different programmes have been examined and compared, as well as the possibilities offered to scholarly editors for visualising and further processing those results.
The content of the thesis is divided into two distinct parts, from theory to practice: the first part discusses the relevance of collation in the broader context of critical editing, introduces automated collation and its various issues, as well as the implications of automated collation for key concepts such as ‘witnesses’, ‘readings’ and ‘variants’; the second part describes the practical work on the text of Calpurnius Flaccus with automated collation programmes and the visualisation tool that was created to examine the collation results. Each part comprises a short introduction and a conclusion which summarise its content and its outcome.

In the process of editing a text, the collation of manuscripts and previous scholarly editions is a necessary and fundamental step. The work leading to the production of a critical edition is divided into two phases: the recension of witnesses, followed by the constitution of the text (Chiesa 2002). The editor, therefore, starts with the recension by gathering all the witnesses, manuscripts or editions, bearing a version of the edited text. Collation is the next step: comparing those witnesses to find the differences (or variants) between the versions. Finally, the editor analyses the variants in order to determine the genealogical relationships of the witnesses, if possible, and present those relationships in the form of a stemma codicum. The stemma’s purpose is to help the editor produce a text ‘as close as possible to the original’ (Maas 1958, 1), and decide which variants are accepted as authorial and which variants are rejected as errors that got included in the tradition by the copyists of the manuscripts. After the recension, the editor prepares a critical text, selecting variant readings and making emendations when necessary.
Collation is important because it is one of the first stages in the editing process and the data gathered during collation forms the basis upon which the editor will later make critical decisions (Whittaker 1991). Collation is performed because witnesses of a text always contain variant readings, and the editor needs to be in possession of all the alternatives in order to establish the text. Complete collations are not usually published, however, which represents a regrettable loss of information, especially given the amount of time and effort invested in collation by the editors (West 1973, 63).
While collation is an essential part of textual criticism, it is also a long, tedious and error-prone activity and needs to be checked more than once. For this reason, a new method was created: automated collation, which takes advantage of computers in order to compare texts and find variant readings. Scholars have been developing collation tools since the 1960s, but with limited success at first: what was considered a fairly mechanical process turned out to be more sophisticated than expected (Hockey 2000, 125). Since the pioneering work of Dearing (1962) and Froger (1968), automated collation has been studied for decades and has been constantly improved. In the past 50 years, close to thirty tools have been devised in order to obtain a collation with the support of increasingly complex algorithms.
How does automated collation work? To simplify, the tools for automated collation take, as an input, a transcription of each witness that needs to be compared. The transcriptions are then aligned with each other through an alignment algorithm.
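As a rough illustration of the two steps just described (transcriptions as input, followed by alignment), the sketch below aligns two witnesses word by word using the generic sequence matcher from Python's standard library. It is a minimal example under stated assumptions, not the algorithm of any collation tool discussed in the thesis: the function names `tokenize` and `align` are hypothetical, the tokenisation is deliberately naive, and the two sample sentences are invented.

```python
from difflib import SequenceMatcher


def tokenize(transcription):
    """Split a transcription into lowercase word tokens.

    Assumption: whitespace tokenisation and case folding are enough
    for this toy example; real transcriptions need far more care.
    """
    return transcription.lower().split()


def align(witness_a, witness_b):
    """Align two transcriptions and return a list of rows.

    Each row is (tag, reading_a, reading_b), where tag is one of
    'equal', 'replace', 'delete' or 'insert' as reported by difflib.
    """
    tokens_a = tokenize(witness_a)
    tokens_b = tokenize(witness_b)
    matcher = SequenceMatcher(a=tokens_a, b=tokens_b, autojunk=False)
    rows = []
    for tag, a_start, a_end, b_start, b_end in matcher.get_opcodes():
        reading_a = " ".join(tokens_a[a_start:a_end])
        reading_b = " ".join(tokens_b[b_start:b_end])
        rows.append((tag, reading_a, reading_b))
    return rows


def variants(rows):
    """Keep only the rows where the two witnesses disagree."""
    return [row for row in rows if row[0] != "equal"]


if __name__ == "__main__":
    # Invented sample witnesses, not taken from any manuscript.
    table = align("the quick brown fox", "the quick red fox jumps")
    for tag, reading_a, reading_b in table:
        print(f"{tag:8} | {reading_a:10} | {reading_b}")
```

A pairwise comparison of this kind only shows where two witnesses differ. Aligning many witnesses at once, and deciding which differences count as significant variants, remains a matter of method and editorial judgement.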
The task of collation seems to benefit greatly from the application of computing methods. The advantages offered by computing methods are various: consistency in the comparison, the possibility of reusing the material in order to add new manuscripts to the collation, and a common format in which to share collation data with other scholars. The results of automated collation can also be formatted for further processing, such as building a stemma with a different program or creating a digital scholarly edition.
However, automated collation is not completely accepted by the community of scholars, nor is its method fully understood. In 1973, at the beginning of automated collation, scepticism was understandable because of the many restrictions of early tools such as the small number of witnesses collated, or the limitation to comparing lines of poetry. Faced with those limitations, West (1973, 71) stated that ‘the time has not yet come when manuscripts can be collated automatically’. Forty years later, in spite of huge technical improvements, opinions have not really changed, to the point that Reeve declared that he was not convinced by computer methods, especially for large manuscript traditions (Reeve 2011, 393). But the computer is indeed supposed to be better at handling large amounts of data, and with more consistency than a human being. In fact, when dealing with large traditions, it is not always possible to sort out the relationships between manuscripts and draw a stemma by hand, yet editors are not keen on turning to electronic methods.
Some of the obstacles to the wide adoption of automated collation seem connected to a general misunderstanding. There is a fear that somehow the computer will eventually be replacing the editor, and that editors will lose their right to apply their individual judgement to the texts. Greetham (2007, 23) regrets, for instance, that the role of individual, subjective evaluation ‘has not always been recognised, especially by those wishing to emphasise the “scientific” aspects of the field’. The importance of individual judgement, and the role of the editor compared to the role of the computer, can be closely related to the black box issue (Sculley and Pasanek 2008): if scholars do not understand what a piece of software does, how can they trust that the programme did not deprive them of making certain choices or applying their own judgement?
In the course of my PhD I had several exchanges with colleagues in the field of Classics which have highlighted several underlying misunderstandings on either side.
For instance, I was once told in an email that ‘automated collation is impossible, because computers cannot read manuscripts’. This statement overlooks the fact that computers collate from transcriptions made by scholars. From this point of view, transcription and collation are closely connected activities. The statement nevertheless illustrates perfectly the main tension between traditional, manual collation and automated collation: the difference of methodology. The confusion arises here because the full transcription of manuscripts is not a part of the traditional heuristics in textual criticism.
A generalised lack of tools and guidelines for the production of digital scholarly editions may also explain why automated collation is not the solution of choice. As yet, there is still no clear consensus regarding digital critical editions and their definition (‘what they should be’), which results in a lack of widely applicable tools (Andrews 2012). The scarcity of user-friendly tools for Humanities scholars is also noted by Robinson (2005, §13) and Monella (2014b), whereas Cayless (2015) mentions the absence of precise guidelines and orientation among the many options available. In fact, there is no lack of collation tools, but little literature yet offers criticism to help scholars evaluate their growing number. Furthermore, collation is not the final goal but only one step in the course of editing. The results of automated collation are not easily manageable by scholars: they need to be analysed and manipulated to be fully understood. However, there are not many options to do so with the current collation tools.

In light of this discussion, it appears that many aspects of collation and automated collation need to be discussed. It is the purpose of this dissertation to address concerns expressed by Classicists while clarifying the misunderstandings which can hamper the adoption of automated collation. Ultimately, the aim of the dissertation is to show how the flexible results of automated collation can support scholars in their work. The disciplinary background of the thesis is thus primarily in Classics, but it will also explore practices in the wider domain of textual criticism.
Chapter 1 presents traditional collation and its method, which is necessary in order to understand what changes with the adoption of a new automated method. Chapter 2 is dedicated to the history and methodology of automated collation. Since an accurate transcription is the crucial first step to achieve in order to undertake an automated collation, the dissertation devotes two chapters to transcription: Chapter 3 for theoretical aspects, and Chapter 6 to describe the transcription of Calpurnius’ Declamations. Transcription also represents a major change in methodology, and the implications of the differences in heuristics will be explored further on in the first part of the dissertation, especially in Chapter 2 and Chapter 3.
The content of collation is determined by what editors consider a variant worthy of being recorded: therefore the definition of a variant, especially a significant variant, is crucial (Orlandi 2010, 115). Every editor has a different sensitivity about what constitutes a significant variant. Two editors will not produce the same edition of the same text. Moreover, what is significant for a historical linguist is not the same as what is relevant to the medievalist or the stemmatology specialist. This dissertation will therefore provide a reflection on what is a reading, a variant, a significant variant, and how these definitions may change according to the editor’s field of study and perspective on the text (Chapter 4).
Since many collation tools are available and new ones are regularly created, there is a growing need for scholars to compare different tools and assess which one will be best suited to their needs. Chapter 2 provides a theoretical framework of tool criticism for automated collation with several criteria, and Chapter 7 applies those criteria to the comparison of three tools in practice. As a result of this comparison, it was possible to identify the need for a tool that would help scholars to manipulate collation results in a way that supports traditional textual criticism, which led to the development of a tool for this purpose.
In Chapter 8, I will describe the tool that I have created to visualise collation results and show how its application can support editors in applying their own judgement to the text, and also make their conclusions reproducible by others. While the collation tool does not make any judgement regarding the correctness of the text, scholars should nevertheless be aware of how the collation results format can affect the visualisation of variant readings and ultimately of the critical apparatus.
The shift from the traditional manual comparison of manuscripts to a transcription followed by an automated collation represents an important change in heuristics. What does this change imply for the understanding of what scholars do when they collate? This dissertation ultimately argues that the transition to computing methodology for textual scholarship has a profound impact on the understanding of the edited text, its variation and its meaning.
Date of Award: 2018
Original language: English
Awarding Institution: King's College London
Supervisors: Elena Pierazzo & Victoria Moul
