IMC 2019: Sessions

Session 1203: Enhanced Access to Texts: Online Sources, Optical Character Recognition (OCR), and Multispectral Imaging

Wednesday 3 July 2019, 14.15-15.45

Moderator/Chair: Dominique Stutzmann, Institut de Recherche et d'Histoire des Textes (IRHT), Centre National de la Recherche Scientifique (CNRS), Paris
Paper 1203-a: The Internet Medieval Sourcebook: Intentions and Impacts
(Language: English)
Paul Halsall, Internet Medieval Sourcebook, Fordham University, New York
Index terms: Byzantine Studies, Sexuality, Teaching the Middle Ages, Women's Studies
Paper 1203-b: Using Optical Character Recognition (OCR) to Transcribe Medieval Manuscripts
(Language: English)
Gianmarco Saretto, Department of English & Comparative Literature, Columbia University
Jenna Schoen, Department of English & Comparative Literature, Columbia University
Index terms: Computing in Medieval Studies, Language and Literature - Middle English, Manuscripts and Palaeography
Paper 1203-c: 'Miraculen': Multispectral Imaging and the Recovery of a Lost Spieghel Historiael Fragment
(Language: English)
Daan Doesborgh, Regionaal Archief Tilburg
Stef Uijens, Regionaal Archief Tilburg
Index terms: Language and Literature - Dutch, Manuscripts and Palaeography

Paper -a:
Since 1996, the Internet Medieval Sourcebook has become a major resource for educators. From the outset, its central goal was to make electronic primary sources available for teachers of undergraduates. But there were also other goals: to represent the medieval world in a global context, to suggest Byzantine sources and parallels, and to make sure women's history and LGBT history were covered. How well have these goals been achieved? Usage data and anecdotal information will be used to assess this pioneering digital humanities project.

Paper -b:
This paper will discuss an ongoing project to train an OCR system on a corpus of medieval manuscripts. Specifically, our team is training an OCR engine (Kraken) developed by OpenITI to transcribe early 15th-century Middle English manuscripts. Our current model, trained on Scribe D's handwriting, has a 90% training accuracy rate, and we plan to train and test the model on more manuscripts over the next year, improving its accuracy and expanding it to more scribes and scripts. This paper will discuss the hurdles we faced when preparing our training data (such as non-standard abbreviations, multilingual manuscripts, and a lack of digitized images). We will then suggest collaborative measures that would make this technology more robust and practically useful for scholars. Such a tool would have an immense impact on medieval studies. Scholars could more easily compare manuscripts across a single textual tradition, create digital editions of lesser-known texts, perform keyword searches within digital manuscript images, and work on a massive number of untranscribed texts that might be 'lost' on the current academic radar.

Paper -c:
Abstract withheld.