Calfa-GREgORI Patrologia Graeca
Presentation and aim
Last updated: October 4th, 2025
Led by the GREgORI project (UCLouvain) and Calfa (Paris) under the academic supervision of Professor Jean-Marie Auwers (UCLouvain), this project aims to provide scholars with a digital version of texts from the Patrologia Graeca (PG) that have not yet been digitised or are not yet available online in open access.
Text transcription (OCR; word accuracy 94,60%) and linguistic analysis (lemmatization and POS tagging; lemma-POS accuracy 94,74%) are performed with specialized AI models developed within the scope of the project. The results receive only minimal manual proofreading.
The OCR software, developed specifically for this purpose, preserves the complex layout of the pages of the PG volumes. The resulting text is mostly, but not entirely, reliable, because the typography of J.-P. Migne’s publications is well known to be unclear in places. Despite this drawback and the imperfectly recognized words that remain, the output is a searchable version of the texts. Users should check, and where necessary complete, the text they need, and are invited to send in their corrections.
In addition, linguistic analysis, based on linguistic resources, software tools, and AI models jointly developed by GREgORI and Calfa, assigns a lemma and a part of speech to each word attested in the processed texts.
An evaluation of the results, giving scholars an accurate assessment of the effectiveness of the AI models, will be presented in a forthcoming paper.
Scholars interested in acquiring Greek texts from the PG (with or without linguistic analysis) are invited to email us (info-gregori@uclouvain.be or contact@calfa.fr) for terms and conditions.
For details on input and output files (results), see below.
Funding
This project has received funding from (in alphabetical order):
■ ASBL Byzantion
■ Calfa (Paris)
■ UCLouvain - FSS - Fondation Sedes Sapientiae
■ UCLouvain GREgORI Project
■ UCLouvain - INCAL - Institut des Civilisations Arts et Lettres
■ UCLouvain - RSCS - Institut de recherche pluridisciplinaire Religions Spiritualités Cultures Sociétés
Additional funding comes from private sources.
Members
- Professor Emeritus Jean-Marie Auwers (UCLouvain, RSCS)
- Professor Sébastien Moureau (UCLouvain/CIOL)
- Doctor Véronique Somers (UCLouvain/CIOL)
- Doctor Bastien Kindt (UCLouvain/CIOL)
- Chahan Vidal-Gorène (Université Paris Sciences & Lettres, École nationale des Chartes, and Calfa)
Related bibliography
Kindt B., Auwers J.-M., La Fondation Sedes Sapientiae soutient le projet de valorisation numérique de la Patrologie grecque, in Bulletin de la Fondation Sedes Sapientiae, 45 (janvier 2024), p. 19-21 (WEB version).
Kindt B., Vidal-Gorène C., Delle Donne S., Analyse automatique du grec ancien par réseau de neurones. Évaluation sur le corpus De Thessalonica Capta, in BABELAO, 10-11 (2022), p. 525-550 (WEB version).
Kindt B., Vidal-Gorène C., From manuscript to tagged corpora. An automated process for Ancient Armenian or other under resourced languages of the Christian East, in Armeniaca. International Journal of Armenian Studies, 1 (2022), p. 73-96 (WEB version).
Vidal-Gorène C., Cafiero F., Kindt B., Under-resourced studies of under-resourced languages: lemmatization and POS-tagging with LLM annotators for historical Armenian, Georgian, Greek and Syriac, 2025, published online on the HAL Science ouverte portal (WEB version).
Vidal-Gorène C., La reconnaissance automatique d'écriture à l'épreuve des langues peu dotées, Programming Historian en français, 5 (2023) (WEB version).
Vidal-Gorène C., Reconhecimento automático de manuscritos para o teste de idiomas não latinos, O Programming Historian em português, 5 (2024) (WEB version) (translated from the French original published in 2023).
Input files
Input files processed by the OCR are PDF files available from the Patristica.net portal or from the Roger Pearse weblog. Most of these files were digitized by Google and are therefore also available from the Google Books portal. See also the Archive.org portal.
Output files and results
- The OCR ground truth is available on the "Patrologia Graeca (OCR ground truth)" Zenodo repository.
- Texts with markup are available on Calfa’s GitHub repository for the project.
- Analyzed texts are available on the "Patrologia Graeca (OCRized and analyzed texts)" Zenodo repository.
- A sample of the CGPG corpus can be used on the Sketch Engine platform (see below).
File formats description
All files are plain text encoded in UTF-8 (this format ensures data interoperability).
- [file_name]_text.txt: texts with markup (volume number, page number of the processed PDF file), with no hyphenation and with empty lines deleted.
- [file_name]_tagged_text.vert: vertical texts enriched with intuitive form, lemma, intuitive lemma, and POS for each wordform (analysis performed by AI with minimal manual proofreading of the results). These *.vert files can be used on the Sketch Engine platform, allowing text analysis and text mining (see screenshot below).
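For readers who prefer to work with the *.vert files outside Sketch Engine, the following Python sketch shows one possible way to read them. It rests on assumptions that should be checked against the actual files: one token per line, five tab-separated columns in the order wordform, intuitive form, lemma, intuitive lemma, POS, and structural markup on lines starting with "<". The names Token and read_vert, the sample lines, and the POS labels are hypothetical and are not taken from the corpus.

```python
import io
from collections import Counter
from typing import Iterator, NamedTuple, TextIO


class Token(NamedTuple):
    # Column order is an assumption; verify it against a real *.vert file.
    form: str
    intuitive_form: str
    lemma: str
    intuitive_lemma: str
    pos: str


def read_vert(stream: TextIO) -> Iterator[Token]:
    """Yield one Token per token line of a vertical file."""
    for line in stream:
        line = line.rstrip("\n")
        # Skip blank lines and structural tags such as <doc ...> or </doc>.
        if not line.strip() or line.startswith("<"):
            continue
        fields = line.split("\t")
        if len(fields) != 5:
            continue  # ignore lines that do not match the assumed layout
        yield Token(*fields)


# Invented sample for illustration only (not corpus data, tags are made up).
SAMPLE = (
    '<doc id="sample">\n'
    "λόγος\tλόγος\tλόγος\tλόγος\tNOUN\n"
    "ἦν\tἦν\tεἰμί\tεἰμί\tVERB\n"
    "</doc>\n"
)

tokens = list(read_vert(io.StringIO(SAMPLE)))
pos_counts = Counter(token.pos for token in tokens)
print(len(tokens), dict(pos_counts))
```

To read a real file instead of the sample, open it with encoding="utf-8" and pass the file object to read_vert.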

List of currently processed texts
Total currently processed in the CGPG corpus: 4.138.834 tokens, 4.107.355 words.
PG 71
author : Cyril of Alexandria
author's date : 4th-5th AD
edition’s PDF file
work : Commentarius in Oseam prophetam, in Joelem prophetam, in Amos prophetam, in Abdiam prophetam, in Jonam prophetam, in Michæam prophetam, in Nahum prophetam, in Habacuc prophetam, in Sophoniam prophetam, in Aggæum prophetam.
word count : 210.957
PG 73
author : Cyril of Alexandria
author's date : 4th-5th AD
edition’s PDF file
work : In Joannis Evangelium
word count : 191.303
PG 87.1
author : Procopius the Christian Sophist
author's date : 5th-6th AD?
edition’s PDF file
work : Commentarii in OT
word count : 151.167
PG 101
author : Photios I of Constantinople
author's date : 9th AD
edition’s PDF file
work : Amphilochiana, Commentarii in NT
word count : 178.850
PG 109
author : Scriptores Post Theophanem
author's date : 9th-10th AD
edition’s PDF file
work : varia
word count : 148.584
PG 112
author : Constantine Porphyrogenitus
author's date : 10th AD
edition’s PDF file
work : De Ceremoniis
word count : 129.556
PG 123
author : Theophylact of Ohrid
author's date : 11th-12th AD
edition’s PDF file
work : Commentarii in NT
word count : 208.024
PG 124
author : Theophylact of Ohrid
author's date : 11th-12th AD
edition’s PDF file
work : Commentarii in NT
word count : 210.302
PG 125
author : Theophylact of Ohrid
author's date : 11th-12th AD
edition’s PDF file
work : Commentarii in NT
word count : 172.696
PG 126
author : Theophylact of Ohrid
author's date : 11th-12th AD
edition’s PDF file
work : Commentarii in NT et alia opera
word count : 164.706
PG 134
author : Joannes Zonaras
author's date : 11th-12th AD
edition’s PDF file
work : Annales
word count : 169.859
PG 146
author : Nikephoros Kallistos Xanthopoulos
author's date : 13th-14th AD
edition’s PDF file
work : Ecclesiastica Historia
word count : 156.848
PG 148
author : Nicephorus Gregoras
author's date : 13th-14th AD
edition's PDF file
work : Roman History
word count : 234.855
PG 151
author : Gregory Palamas (et al.)
author's date : 13th-14th AD
edition's PDF file
work : Opera Omnia (et al.)
word count : 399.518
PG 153
author : John Kantakouzenos
author's date : 13th-14th AD
edition's PDF file
work : Opera Omnia
word count : 230.239
PG 155
author : Simeon of Thessalonica
author's date : 14th-15th AD
edition’s PDF file
work : Dialogus in Christo (et alia opera)
word count : 175.482
PG 157
author : George Kodinos (et al.)
author's date : 15th AD
edition's PDF file
work : Opera Omnia (et al.)
word count : 95.020
PG 158
author : Michael Glykas (et al.)
author's date : 12th AD
edition’s PDF file
work : Annales (et alia)
word count : 163.148