
Calfa-GREgORI Patrologia Graeca


Presentation and aim

Last updated: October 4th 2025

The project, led by the GREgORI project (UCLouvain) and Calfa (Paris) under the academic supervision of Professor Jean-Marie Auwers (UCLouvain), aims to provide scholars with a digital version of texts from the Patrologia Graeca (PG) that have not yet been digitised or are not yet available online in open access.

Text transcription (OCR; word accuracy 94,60%) and linguistic analysis (lemmatization and POS-tagging; lemma-POS accuracy 94,74%) are performed with specialized AI models developed within the scope of the project, with minimal manual proofreading of the results.

The OCR software, specially developed for this purpose, preserves the complex layout of the pages of the PG volumes and produces a mostly, though not fully, reliable text, owing to the well-known and occasionally unclear typography of J.-P. Migne's publications. Despite this drawback and the remaining imperfectly recognized words, the result is a searchable version of the texts. Users should check, and where necessary correct, the text they need, and are invited to send us their corrections.

In addition, linguistic analysis, based on linguistic resources, computer tools, and AI models jointly developed by GREgORI and Calfa, assigns a lemma and a part of speech to each word attested in the processed texts.

An evaluation of the results, giving scholars an accurate assessment of the effectiveness of the AI models, will be presented in a forthcoming paper.

Scholars interested in acquiring Greek texts from the PG (with or without linguistic analysis) are invited to email us (info-gregori@uclouvain.be or contact@calfa.fr) for terms and conditions.

For details on input and output files (results), see below.

Funding

This project has received funding from (in alphabetical order):

  • ASBL Byzantion
  • UCLouvain - FSS - Fondation Sedes Sapientiae
  • UCLouvain GREgORI Project
  • UCLouvain - INCAL - Institut des Civilisations Arts et Lettres
  • UCLouvain - RSCS - Institut de recherche pluridisciplinaire Religions Spiritualités Cultures Sociétés

And other private financing.

Members

  • Professor Emeritus Jean-Marie Auwers (UCLouvain, RSCS)
  • Professor Sébastien Moureau (UCLouvain/CIOL)
  • Doctor Véronique Somers (UCLouvain/CIOL) 
  • Doctor Bastien Kindt (UCLouvain/CIOL)
  • Chahan Vidal-Gorène (Université Paris Sciences & Lettres, École nationale des Chartes, and Calfa)

Related bibliography

Kindt B., Auwers J.-M., La Fondation Sedes Sapientiae soutient le projet de valorisation numérique de la Patrologie grecque, in Bulletin de la Fondation Sedes Sapientiae, 45 (January 2024), p. 19-21 (WEB version).

Kindt B., Vidal-Gorène C., Delle Donne S., Analyse automatique du grec ancien par réseau de neurones. Évaluation sur le corpus De Thessalonica Capta, in BABELAO, 10-11 (2022), p. 525-550 (WEB version).

Kindt B., Vidal-Gorène C., From manuscript to tagged corpora. An automated process for Ancient Armenian or other under resourced languages of the Christian East, in Armeniaca. International Journal of Armenian Studies, 1 (2022), p. 73-96 (WEB version).

Vidal-Gorène C., Cafiero F., Kindt B., Under-resourced studies of under-resourced languages: lemmatization and POS-tagging with LLM annotators for historical Armenian, Georgian, Greek and Syriac, 2025, published online on the HAL Science ouverte portal (WEB version).

Vidal-Gorène C., La reconnaissance automatique d'écriture à l'épreuve des langues peu dotées, Programming Historian en français, 5 (2023) (WEB version).

Vidal-Gorène C., Reconhecimento automático de manuscritos para o teste de idiomas não latinos, O Programming Historian em português, 5 (2024) (WEB version) (translated from the original in French published in 2023).

Input files

The input files processed by the OCR are PDF files available from the Patristica.net portal or from the Roger Pearse weblog. Most of these files were digitized by Google and are therefore also available from the Google Books portal. See also the Archive.org portal.

Output files and results 

File formats description

All files are encoded in UTF-8 plain text format (this format ensures data interoperability).

  • [file_name]_text.txt: text with markup (volume number, page number of the processed PDF file), no hyphenation, empty lines deleted.
  • [file_name]_tagged_text.vert: vertical text enriched with intuitive form, lemma, intuitive lemma, and POS for each wordform (analysis performed by AI with minimal manual proofreading of the results). These *.vert files can be used on the Sketch Engine platform, allowing text analysis and text mining (see screenshot below).

 

Screenshot
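
As a rough illustration of how a *.vert file might be read outside Sketch Engine, the following Python sketch parses vertical text into tokens and counts lemma/POS pairs. It assumes one token per line with five tab-separated columns in the order given in the file formats description (wordform, intuitive form, lemma, intuitive lemma, POS) and structural markup lines starting with "<". The actual column order, separator, and tagset should be checked against the delivered files. The sample lines, the <doc> attribute, and the POS labels are invented for the example.

```python
from collections import Counter
from typing import Iterable, Iterator, NamedTuple


class Token(NamedTuple):
    # Column order is an assumption, not a documented specification.
    form: str
    intuitive_form: str
    lemma: str
    intuitive_lemma: str
    pos: str


def parse_vert(lines: Iterable[str]) -> Iterator[Token]:
    """Yield one Token per token line, skipping markup and blank lines."""
    for raw in lines:
        line = raw.rstrip("\r\n")
        # Structural lines (e.g. <doc>, </doc>) carry no token.
        if not line.strip() or line.startswith("<"):
            continue
        fields = line.split("\t")
        if len(fields) != 5:
            continue  # line does not match the assumed five-column layout
        yield Token(*fields)


def lemma_frequencies(tokens: Iterable[Token]) -> Counter:
    """Count occurrences of each (lemma, POS) pair."""
    return Counter((t.lemma, t.pos) for t in tokens)


# Invented sample data; real files would be opened with encoding="utf-8".
SAMPLE = [
    '<doc volume="PG 73">\n',
    "λόγος\tλόγος\tλόγος\tλόγος\tNOUN\n",
    "καὶ\tκαὶ\tκαί\tκαί\tCONJ\n",
    "λόγον\tλόγον\tλόγος\tλόγος\tNOUN\n",
    "</doc>\n",
]

tokens = list(parse_vert(SAMPLE))
frequencies = lemma_frequencies(tokens)
for (lemma, pos), count in frequencies.most_common():
    print(lemma, pos, count)
```

To process a real file, the same function can be given an open file handle, for example `open(path, encoding="utf-8")`, where `path` points to a [file_name]_tagged_text.vert file.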
 

List of currently processed texts

Total of currently processed words in the CGPG corpus : 4.138.834 tokens, 4.107.355 words.

PG 71

author : Cyril of Alexandria
author's date : 4th-5th AD
edition’s PDF file
work : Commentarius in Oseam prophetam, In Joelem prophetam, In Amos prophetam, In Abdiam prophetam, In Jonam prophetam, In Michæam prophetam, In Nahum prophetam, In Habacuc prophetam, In Sophoniam prophetam, In Aggæum prophetam.
word count : 210.957

PG 73

author : Cyril of Alexandria
author's date : 4th-5th AD
edition’s PDF file
work : In Joannis Evangelium
word count : 191.303

PG 87.1

author : Procopius the Christian Sophist
author's date : 5th-6th AD?
edition’s PDF file
work : Commentarii in OT
word count : 151.167

PG 101

author : Photios I of Constantinople
author's date : 9th AD
edition’s PDF file
work : Amphilochiana, Commentarii in NT
word count : 178.850

PG 109

author : Scriptores Post Theophanem
author's date : 9th-10th AD
edition’s PDF file
work : varia
word count : 148.584

PG 112

author : Constantine Porphyrogenitus
author's date : 10th AD
edition’s PDF file
work : De Ceremoniis
word count : 129.556

PG 123

author : Theophylact of Ohrid
author's date : 11th-12th AD
edition’s PDF file
work : Commentarii in NT
word count : 208.024

PG 124

author : Theophylact of Ohrid
author's date : 11th-12th AD
edition’s PDF file
work : Commentarii in NT
word count : 210.302

PG 125

author : Theophylact of Ohrid
author's date : 11th-12th AD
edition’s PDF file
work : Commentarii in NT
word count : 172.696

PG 126

author : Theophylact of Ohrid
author's date : 11th-12th AD
edition’s PDF file
work : Commentarii in NT et alia opera
word count : 164.706

PG 134

author : Joannes Zonaras
author's date : 11th-12th AD
edition’s PDF file
work : Annales
word count : 169.859

PG 146

author : Nikephoros Kallistos Xanthopoulos
author's date : 13th-14th AD
edition’s PDF file
work : Ecclesiastica Historia
word count : 156.848

PG 148

author : Nicephorus Gregoras
author's date : 13th-14th AD
edition's PDF file
work : Roman History
word count : 234.855

PG 151

author : Gregory Palamas (et al.)
author's date : 13th-14th AD
edition's PDF file
work : Opera Omnia (et al.)
word count : 399.518

PG 153

author : John Kantakouzenos
author's date : 13th-14th AD
edition's PDF file
work : Opera Omnia
word count : 230.239

PG 155

author : Simeon of Thessalonica
author's date : 14th-15th AD
edition’s PDF file
work : Dialogus in Christo (et alia opera)
word count : 175.482

PG 157

author : George Kodinos (et al.)
author's date : 15th AD
edition's PDF file
work : Opera Omnia (et al.)
word count : 95.020

PG 158

author : Michael Glykas (et al.)
author's date : 12th AD
edition’s PDF file
work : Annales (et alia)
word count : 163.148