The VELUM Project. Building a Corpus for Medieval Latin Lexicography
• Bruno Bon (Comité Du Cange – IRHT, France)
• Renaud Alexandre (Comité Du Cange – IRHT, France)
• Sébastien Hamel (Comité Du Cange – IRHT, France)
• Nathalie Picque (Comité Du Cange – IRHT, France)
This paper will present the first results of the project « Velum. Visualising, exploring and linking resources for
Medieval Latin » (2018-2023 [https://www.glossaria.eu/velum]). Our aim is to build a
representative textual corpus of Medieval Latin for the compilation of the « Novum Glossarium Mediae
Latinitatis » (= NGML [https://www.glossaria.eu/ngml]), the international dictionary of Medieval Latin
launched in the 1920s to describe the main language of European textual production between
800 and 1200 AD. For a long time the dictionary was based only on quotations collected empirically, and the
digital text collections currently available, which are mostly literary, do not fit our lexicographical needs.
The Velum corpus of 100 million words is drawn from a large selection of the 10,000 sources of the
NGML, based on the « Index Scriptorum Mediae Latinitatis » [https://www.glossaria.eu/scriptores]. The texts were either
extracted from existing digital collections or built from scratch when only PDF files were available. While the
former ‘only’ had to be structured in XML, the 1,500 texts of the latter group had to be mass-processed through image
extraction and optimisation, OCR, correction, reduction and zone selection, in order to
obtain the pure Latin text (without any editorial content).
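The zone selection step can be pictured with a minimal sketch. It assumes, purely for illustration, that each OCR'd page has already been split into labelled zones. The `Zone` class, the zone labels and the sample page are hypothetical and do not describe the project's actual tooling or data.

```python
from dataclasses import dataclass


@dataclass
class Zone:
    """One region of an OCR'd page (hypothetical structure)."""
    kind: str  # assumed labels: "body", "header", "apparatus", "footnote"
    text: str


# Assumption: only zones labelled "body" carry the Latin text itself.
KEPT_KINDS = {"body"}


def select_latin_text(zones):
    """Return the body text of a page, dropping editorial zones."""
    kept = [z.text.strip() for z in zones if z.kind in KEPT_KINDS]
    # Skip zones that are empty after stripping whitespace.
    return "\n".join(t for t in kept if t)


# Invented sample page: a running header, two body zones, one apparatus note.
page = [
    Zone("header", "CHRONICON  47"),
    Zone("body", "Anno dominice incarnationis"),
    Zone("apparatus", "a) incarnacionis B."),
    Zone("body", "rex venit ad urbem."),
]

print(select_latin_text(page))
```

Only the two body zones survive here. In practice the hard part is assigning the zone labels reliably, which this sketch takes as given.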
All the texts will soon be annotated at both text level (genre, location, date) and word level (PoS,
lemma), using the tools developed for Medieval Latin by our previous project « Omnia »
[https://www.glossaria.eu/lemmatisation].
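The two annotation levels could be represented as in the sketch below. It is only an assumed data model: the class names, the tag labels and the toy lexicon are hypothetical, and the dictionary lookup merely stands in for a real lemmatiser. It does not reproduce the Omnia tools.

```python
from dataclasses import dataclass, field


@dataclass
class Token:
    """Word-level annotation: surface form, part of speech, lemma."""
    form: str
    pos: str
    lemma: str


@dataclass
class AnnotatedText:
    """Text-level annotation plus the list of annotated tokens."""
    title: str
    genre: str
    location: str
    date: str
    tokens: list = field(default_factory=list)


# Hypothetical toy lexicon mapping a form to (PoS, lemma).
TOY_LEXICON = {
    "rex": ("NOUN", "rex"),
    "venit": ("VERB", "venio"),
    "ad": ("ADP", "ad"),
    "urbem": ("NOUN", "urbs"),
}


def annotate(text, words):
    """Append one Token per whitespace-separated word to the text."""
    for form in words.lower().split():
        # Unknown forms keep themselves as lemma and get a placeholder tag.
        pos, lemma = TOY_LEXICON.get(form, ("UNKNOWN", form))
        text.tokens.append(Token(form, pos, lemma))
    return text


# Invented metadata and sentence, for illustration only.
doc = annotate(
    AnnotatedText("Example chronicle", "chronicle", "unknown", "c. 1100"),
    "Rex venit ad urbem",
)
lemmas = [t.lemma for t in doc.tokens]
print(lemmas)
```

Keeping the metadata on the text object and the PoS and lemma on each token allows a query to combine both levels, for instance all occurrences of a lemma within one genre.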