Colin Swaelens, Part-of-Speech Tagging & Lemmatisation in Unedited Greek: Simple Tasks, Complex Challenges?

Abstract

In today’s landscape of language technology, dominated by large language models, tasks like part-of-speech tagging and lemmatisation receive less attention in current NLP research. However, these tasks still pose significant challenges, especially for under-resourced, morphologically rich languages like Ancient Greek. Our project focuses on the verbatim transcriptions of Byzantine marginal poetry stored in the Database of Byzantine Book Epigrams (DBBE). Due to the highly interconnected nature of the poems, we aim to eventually perform similarity detection across the corpus. As a first step, we sought to annotate the DBBE with part-of-speech tags, morphological analyses, and lemmas. Although research on these tasks dates back to more straightforward rule-based systems from the 1970s, current taggers struggle with these unedited texts. The inconsistent orthography — largely due to itacism — adds to this complexity. To mitigate these issues, we trained a transformer-based language model encompassing classical, medieval, and modern Greek. Our experiments, however, revealed that fine-tuning the model for each annotation task was not always fruitful. There is a growing tendency to address such challenges with a multi-task head, allowing the model to process multiple annotations concurrently, drawing inspiration from cognitive psychology. This raises the question: will this more intricate solution outshine the seemingly more transparent methods of the past?
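To make the orthographic problem concrete: itacism is the merger of η, υ, ει, οι and υι with ι in pronunciation, so scribes could spell the same word in several ways. The following is a minimal sketch, under that assumption only, of how such variants could be folded onto one form before comparison. The function names and the rule set are hypothetical illustrations, not the tagger or the model described in the abstract, and the sketch ignores diaeresis and other sound changes.

```python
import re
import unicodedata


def strip_diacritics(text: str) -> str:
    """Remove accents, breathings and iota subscripts (Unicode combining marks)."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def fold_itacism(word: str) -> str:
    """Map spellings that itacism made homophonous onto a single 'ι'.

    Illustrative rule set only (an assumption, not the DBBE pipeline).
    """
    w = strip_diacritics(word).lower()
    w = re.sub(r"ει|οι|υι", "ι", w)     # digraphs pronounced like ι
    w = re.sub(r"(?<![αεο])υ", "ι", w)  # bare υ, but not inside αυ / ευ / ου
    w = w.replace("η", "ι")             # η pronounced like ι
    return w


# Two spellings of the same word collapse to the same folded form.
print(fold_itacism("εἰρήνη") == fold_itacism("ἰρίνι"))
```

A folding step like this loses information by design, so it would only serve as a comparison key next to the verbatim transcription, never as a replacement for it.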

Practical information

This lecture will be given at the Computational Humanities Research Group Seminar Series, organised by the Department of Digital Humanities of King’s College London.

Date & time: Tuesday 10 December 2024, 4:00 pm

Location: Bush House, Strand Campus (30 Aldwych, London) & online

More information about this conference and the full programme can be found here.

Kyriaki Giannikou, Assessing and Reassessing Formulaicity: are editorial practices a blessing or a curse?

Abstract

Formulaicity is a widely discussed concept in the study of historical Greek, primarily due to the influence of the Homeric epics, where it is traditionally understood to arise from oral contexts in which formulaic sequences reduce processing effort during lengthy recitations. Formulaic language also appears in entirely written contexts, such as post-classical Greek administrative and legal documents, where high standardisation meets the need for accuracy and efficiency (see e.g. Nachtergaele 2023; Saradi 2019). The corpus I focus on, Byzantine book epigrams (short, metrical texts found in the margins of Byzantine manuscripts), presents a unique case. These paratexts, embedded in the medieval manuscript tradition, blend literary and documentary functions without any oral performance context, oscillating between practical precision and creative expression. This paper explores a methodological challenge in studying formulaic language within historical Greek corpora, focusing specifically on the Database of Byzantine Book Epigrams.

Even recent comprehensive research on Homer’s formulaic language (Bozzone 2024) relies on modern editions of the Homeric epics that attempt to reconstruct an ‘archetype’ based on medieval manuscript ‘witnesses’. In contrast, the DBBE diverges from strict adherence to traditional editorial practices by presenting epigrams preserving all original scribal choices (‘Occurrences’) while also offering ‘normalised’ versions (‘Types’) that group similar instances of the originals (Ricceri et al. 2023). This raises questions: To what extent can we rely on edited texts to analyse formulaicity? How might editorial choices, driven by the desire for a cohesive text, obscure the original variability of formulaic sequences? Does the interaction between formulaicity and editorial practices facilitate research, or does this create the impression of greater fixedness in formulae, potentially skewing certain aspects of the analysis?

This paper explores the potential impact of editorial intervention on formulaicity research, advocating for a more flexible methodology that balances the use of both edited and original sources. Through a case study on supplications for salvation within a subset of the DBBE corpus, I will demonstrate how formulaic expressions function in this hybrid referential-poetic (cf. Jakobson 1960) context, and how editorial practices may shape our understanding of formulaicity. Ultimately, this study seeks to position this material within the broader framework of formulaicity research and to discuss the implications of editorial practices for linguistic research in historical corpora.

Practical information

This lecture will be given at the conference ‘Formulaic Language in Historical Linguistics: data, methods, tools, and theory’, organised by the Academy of Finland project “The learning of Latin in the 8th to 12th century: a linguistic approach to medieval Latin literacies” in collaboration with the Classical Philological Society of Finland.

Date & time: to be confirmed

Location: Tieteiden talo (Kirkkokatu 6, Helsinki, Finland)

More information about this conference and the full programme can be found here.

Colin Swaelens, Similarity Detection: A Starting Point for Greek

Abstract

Ancient literature survived thanks to scribes painstakingly copying texts from one manuscript to another before the advent of printing. Occasionally, these scribes added metrical paratexts to the manuscripts, i.e. texts standing next to the main text (Genette, 1987) and introduced in Byzantine scholarship by Lauxtermann (2003) as book epigrams. Ghent University’s Database of Byzantine Book Epigrams (Ricceri et al., 2023) stores more than 12,000 such epigrams as verbatim transcriptions, precisely as they are found in the manuscripts. As a result, the Greek of these epigrams is interspersed with orthographic inconsistencies, mainly due to phonetic changes such as itacism. These verbatim transcriptions are called occurrences and are grouped under one or more so-called types, each a readable representation of its occurrences in standardised, classical Greek. Eventually, we aim to develop a dynamic system that groups hemistichs, verses and epigrams based on distinct similarity measures, so that scholars can find all kinds of similar texts instead of only the ones that come to mind. As with any other algorithm, evaluation is an essential part of developing those similarity measures. However, a gold standard for the evaluation of verse similarity measures does not exist. So far, we have conducted a pilot study on pairwise annotation of 2 verses with 10 annotators. Each verse was set alongside six pairs of verses, of which the annotator had to mark the one they considered most similar. The inter-annotator agreement (IAA) was 57.69%, which is regarded as moderate agreement (Landis & Koch, 1977). This agreement score is the arithmetic mean of the agreement between each pair of annotators, as all annotators annotated the exact same set of verses. Despite the rather modest size of this pilot study, it is possible to unravel the distinct lines of reasoning of the annotators.
They did not receive detailed instructions for the annotation process, so every annotator was free to choose their own focal point. The most remarkable of those focal points was the metre. One of the annotators based their judgement on the number of syllables in a verse. The majority, however, seemed to take syntax as the decisive factor in determining the most similar verse; semantics was decisive only if the syntax of both options was identical. While the gold standard is being annotated, we have already started computing similarity between words. These similarities will, in a next stage, be used to compute similarity between (half) verses. The main goal of the experiment is to find out whether transformer embeddings take enough context into account to find identical or similar words with deviant orthography.
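The agreement score described above, the arithmetic mean of the agreement between each pair of annotators, can be sketched as follows. This is a minimal illustration that assumes raw percent agreement (the share of items on which two annotators pick the same option); the abstract does not state whether a chance-corrected statistic was used. The function names and the toy labels are hypothetical and are not the pilot study's data.

```python
from itertools import combinations


def pairwise_agreement(a: list, b: list) -> float:
    """Share of items on which two annotators chose the same option."""
    if len(a) != len(b) or not a:
        raise ValueError("annotators must label the same non-empty item set")
    return sum(x == y for x, y in zip(a, b)) / len(a)


def mean_pairwise_agreement(annotations: list) -> float:
    """Arithmetic mean of the agreement over all pairs of annotators."""
    pairs = list(combinations(annotations, 2))
    if not pairs:
        raise ValueError("at least two annotators are required")
    return sum(pairwise_agreement(a, b) for a, b in pairs) / len(pairs)


# Toy data (invented): three annotators, four items, options numbered 0-5.
toy = [
    [1, 2, 3, 4],
    [1, 2, 3, 5],
    [1, 0, 3, 5],
]
print(f"{mean_pairwise_agreement(toy):.2%}")  # 66.67%
```

Averaging over annotator pairs is only meaningful because every annotator labelled the same items, which is the condition the abstract itself states.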

Practical information

This lecture will be given at the ‘Computational Approaches to Ancient Greek and Latin Workshop’, organised by KU Leuven and the University of Groningen. This workshop series started in 2021 with the aim of further exploring the potential of computational approaches (Natural Language Processing) applied to Ancient Greek and Latin. The 2024 edition will be held in hybrid format on November 28th and 29th, 2024.

Date & time: Friday 29 November 2024, 13:45-14:30

Location: KU Leuven: Mgr. Sencie Instituut (Erasmusplein 2, 3000 Leuven, Belgium) & online

Register via this link. Registration for in-person attendance is no longer possible. The deadline for registration for online attendance is 27 November 2024.

More information about this conference and the full programme can be found here.

Kristoffel Demoen, Kyriaki Giannikou & Colin Swaelens, The Database of Byzantine Book Epigrams. Paratextual Poems from the Margins of Medieval Manuscripts to a Searchable Digital Corpus

This lecture will be given at the 8th International Byzantine Seminar Lecture Series (2024) on “Digital Methods for Byzantine Studies”, organised by the Institute for the History of Ancient Civilizations at the Northeast Normal University in Changchun (China), in collaboration with the Department of Byzantine and Modern Greek Studies at the University of Cologne and the Department of Historical and Classical Studies at the Norwegian University of Science and Technology.

Date & time: Thursday 21 November 2024, 11:00 am (CET)

Location: online via Zoom

Registration is free, but required. The Zoom link will be provided upon registration. To register or for more information, send an email marked “IBSLS Registration” to liq762@hotmail.com.

LW Research Day 2024: poster session

The fourth LW Research Day will take place on Wednesday 27 November 2024, at the Ghent University Museum (GUM). The central theme is ‘From Source to Understanding’.

What is the role of interpretation in our journey from studying source material to scientific understanding? Indeed, that journey can never be devoid of interpretation, which, in many cases, serves as the quintessential bridge between source material and understanding, whether it pertains to a historical study based on ego documents, the archaeological perspective on the material culture of the past or the anthropological view of human behaviour. Not infrequently, interpretation itself becomes the object of research. For instance, translation scholars examine translation choices that result from interpretations. Literary and art scholars investigate works that themselves provide an interpretation of the world in which they originate and the world they create. Similarly, language itself reflects a particular understanding of the world in a historical and sociological sense, which linguists further explore. In times of digital humanities, the interpretation of (big) data by AI becomes not only conceivable but even the norm. What do interpretation and hermeneutics signify for our fields today? What constitutes a successful or legitimate interpretation, and what are the pitfalls of interpretation?

The PhD students of the DBBE team will present a poster on their research projects in the framework of the Database of Byzantine Book Epigrams.

  • Kyriaki Giannikou – Dealing with Building Blocks of Expression: Formulaic Elements & their Creative Variations in Byzantine Book Epigrams
  • Eleonora Lauro – Epigrams in Context: Glimpses into Medieval Southern Italian Book Culture

More information can be found on the LW Research Day website.

Kyriaki Giannikou, Navigating Digital Frontiers: Unveiling Formulaicity in Byzantine Book Epigrams

Abstract

Byzantine book epigrams, featuring as paratexts in manuscript margins, seamlessly intertwine poetic expression with practical details, illuminating aspects such as the manuscripts’ patrons and the identities of the scribes involved in transcription. Although deeply rooted in traditional book production practices and very formulaic in nature, these epigrams present noteworthy linguistic variation. While their formulaicity has been acknowledged, a thorough exploration of the formulaic sequences present in the Database of Byzantine Book Epigrams (DBBE) or similar corpora remains a gap in current research. My research, to be conducted on the well-established DBBE corpus, acts as a bridge between linguistic research on formulas inherent in everyday speech and those studied within the context of oral poetry.

This interdisciplinary project, adopting a corpus-driven approach, seeks to combine close reading with digital methods for navigating a vast corpus of Byzantine book epigrams. This research addresses the challenge of identifying formulaic constructions (i.e. pairings of form and meaning in the context of Construction Grammar) that function as “verse building blocks” and their variation within a historical linguistic corpus that combines poetic expression and practical information. However, the digital journey of pattern identification encounters challenges arising from inherent complexities of Greek – from flexible syntax to extensive morphological variety – compounded by great linguistic variation across registers, ranging from Homeric and classicizing Greek to medieval forms interwoven with vernacular elements. The absence of critical texts for numerous epigrams further complicates matters, preserving the idiosyncrasies of original scribal choices on the one hand, but impeding uniformization for digital analysis on the other.

This presentation illuminates the challenges inherent in working on Byzantine paratextual material in the Digital Humanities context of a project that endeavours to unravel the intricate linguistic nuances within Byzantine book epigrams, and reflects a commitment to a deeper understanding of the complexities at the intersection of Byzantine literature and Digital Humanities.

Practical information

This lecture will be given at the international workshop ‘The Impact of Digital Methods and Approaches on Ancient Studies Research’ (13-14 May 2024, Berlin).

Date & time: Monday 13 May 2024, 4:40 pm

Location: Freie Universität Berlin (Hittorfstraße 18, 14195 Berlin)

More information about this workshop and the full programme can be found here.

Eleonora Lauro, Alongside the Text: Byzantine Metrical Paratexts in Gospel Manuscripts from Medieval Southern Italy

Abstract

This paper aims to investigate the relationship between Byzantine metrical paratexts, also known as book epigrams, and the biblical text found in Gospel-Books and Lectionaries from medieval Southern Italy.

In the field of New Testament textual scholarship, recent years have witnessed an increased interest in aspects of manuscripts that extend beyond their textual content. Scholars now recognize that insights gained from studying scribal corrections and paratextual features help us to understand how texts were transmitted and received in historical contexts (Lanier-Han, 2021).

Despite this considerable shift in New Testament studies, Byzantine book epigrams and their affiliation to the biblical text remain an intriguing and less-explored domain. These paratexts represent an interesting research object for philologists and historians studying the manuscript tradition of the Greek New Testament. Often copied alongside the main text, book epigrams can help to establish genealogies between manuscripts. Moreover, they offer relevant information on the communities writing and reading these books.

Specifically, my research will consider the following questions:

  1. What kind of book epigrams can be found in Gospel-Books and Lectionaries produced in medieval Southern Italy? Are they original compositions or just conventional formulas?
  2. Do the metrical paratexts reveal specific regional and cultural influences? And how do they differ from those in Gospel-Books and Lectionaries from other regions?
  3. Are there thematic correlations between book epigrams and biblical text?
  4. Which reading strategies do the book epigrams prescribe?
  5. What is the relation between the chain of transmission of the metrical paratexts and that of the main texts?

This study will focus on a corpus of Byzantine book epigrams found in a selected group of Gospel-Books and Lectionaries produced in Southern Italy (10th-13th century). The combination of cultural exploration and examination of textual and extratextual features presents a model for integrating various disciplines to enrich our understanding of New Testament manuscript tradition.

Practical information

This lecture will be given at the international CSNTM Text & Manuscript Conference ‘Intersection. Interdisciplinary Approaches to New Testament Text and Manuscript Studies’, organised by the Center for the Study of New Testament Manuscripts in Plano (Texas).

The “Intersection” theme aims to explore how the many disciplines of the study of ancient Christian documents (paleography, art history, exegesis, paratext, linguistics, conservation, etc.) collaborate to help us better understand their content.

Date & time: Thursday 30 May 2024, 10:55 pm

Location: The Marriott at Legacy Town Center (7121 Bishop Rd Plano, TX 75024)

More information about this conference and the full programme can be found here.

Data-driven Approaches to Ancient Languages (DAAL)

On Thursday 27 June 2024, the Database of Byzantine Book Epigrams project (DBBE) is organising a workshop on Data-driven Approaches to Ancient Languages (DAAL) in Ghent, Belgium. This workshop will follow immediately after the conference “Paratexts in Premodern Writing Cultures”.

Premodern or historically attested languages are invaluable resources for the study of both diachronic linguistics and their contemporary cultures. Although these languages might belong to various language families or use different scripts, researchers face common challenges, including illegible or lost text (or parts of it), non-existent gold standards and, very importantly these days, scarcity of data. Luckily, more and more texts are becoming available, but the language of those texts might be so different from its modern counterpart (should that modern counterpart exist) that it considerably impacts the performance of existing tools. This workshop aims to provide a platform for a broad range of researchers engaged in digital approaches to premodern languages.

For all further information, please visit the conference website: https://www.dbbe2024.ugent.be/workshop/.
For any additional questions you may have, please contact the organisers at daal2024@ugent.be.

Maxime Deforche, Ilse De Vos, Antoon Bronselaer & Guy De Tré, An Orthographic Similarity Measure for Graph-Based Text Representations

This presentation will be given at the Dutch-Belgian DataBase Day (DBDBD), a yearly one-day workshop, organized in a Belgian or Dutch university, whose general topic is database research. DBDBD 2023 will be held in Ghent, Belgium.

At DBDBD 2023, junior and senior researchers from the Netherlands and Belgium can present their recent results, and meet fellow researchers in the field of data management. It is an excellent opportunity to meet up with your Belgian/Dutch colleagues, and to get informed about the (recent) database-related research performed in Belgian/Dutch universities. The workshop welcomes non-Belgian/Dutch participants (presentations are in English). DBDBD has a tradition of favouring presentations by junior researchers.

Practical information

Date & time: Thursday 21 December 2023, 10:30am

Location: Technicum (building T2) (Sint-Pietersnieuwstraat 41, 9000 Gent)

More information about this workshop and the full programme can be found here.

Crash Course in Greek Palaeography

The Leiden University Centre for the Arts in Society, Leiden University Library and the Greek department of Ghent University offer a two-day course in Greek palaeography in collaboration with the Research School OIKOS. The course is intended for MA, ResMA and doctoral students in the areas of Classics, Ancient History, Ancient Civilizations and Medieval studies with a good command of Greek. It offers a chronological introduction to Greek palaeography from the Hellenistic period until the end of the Middle Ages and is specifically aimed at acquiring practical skills for research involving literary and documentary papyri and/or manuscripts. The course gives participants a unique opportunity to practise reading original papyri and manuscripts from the collection of the Leiden Papyrological Institute and the special collections of the Leiden University Library.

Programme

The course is set up as an intensive two-day seminar. Five lectures by specialists in the field will give a chronological overview of the development of Greek handwriting, each followed by a practice session reading relevant extracts from papyri and manuscripts in smaller groups under the supervision of young researchers.

Monday, May 27

  • 10:00 Introduction
  • 10:15-11:15 Papyri of the Ptolemaic and Roman period (3rd cent. BCE – 3rd cent. CE) (Dr. Joanne Stolk)
  • 11:15-12:30 Practice with papyri of the Ptolemaic and Roman period
  • 12:30-13:30 Lunch break
  • 13:30-14:30 Papyri of the Byzantine period (4th-8th centuries) (Dr. Yasmine Amory)
  • 14:30-15:45 Practice with papyri of the Byzantine period
  • 15:45-16:15 Coffee break
  • 16:15-17:00 Presentation of Greek manuscripts from the Leiden University Library
  • 17:00-17:45 Presentation of Greek papyri from the Leiden Papyrological Institute
  • 19:00 Dinner

Tuesday, May 28

  • 9:00-10:00 Majuscule and early minuscule bookhands (4th-9th centuries) (Dr. Rachele Ricceri)
  • 10:00-11:15 Practice with majuscule and early minuscule bookhands
  • 11:15-11:45 Coffee break
  • 11:45-12:45 The development of minuscule script (10th-12th centuries) (Prof. dr. Floris Bernard)
  • 12:45-13:45 Lunch break
  • 13:45-15:00 Practice with minuscule script of the 10th-12th centuries
  • 15:00-15:30 Coffee break
  • 15:30-16:30 Manuscripts and scholars of the Palaeologan period (13th-15th centuries) (Prof. dr. Andrea Cuomo)
  • 16:30-17:45 Practice with manuscripts of the Palaeologan period

Practical information

The study load is the equivalent of 2 ECTS (2×28 hours). Participants will be asked to read up on secondary literature in preparation for the seminar (distributed several weeks before the course). Extra material will be handed out during the course so that you can continue to improve your reading skills afterwards.

There are no fees for participation in this course. Lunches on both days and dinner on the first day are provided free of charge. Travel costs and accommodation in Leiden are at your own expense.

Registration

Please register by sending an e-mail with a short motivation (ca. 300 words, including your background, research interests and why you would like to follow this course) to yasmine.amory@ugent.be. Priority is given to OIKOS doctoral students and to those who have not previously had the opportunity to take a course in palaeography. The final deadline for registration is February 15th, 2024. Successful applicants will be notified soon afterwards.