Abstract
The project Entangled Histories used early modern printed normative
texts. Computers previously had significant problems reading Dutch
Gothic print, which is used in the vast majority of the sources. Using
the Handwritten Text Recognition suite Transkribus (v.1.07-v.1.10), we
reprocessed the original scans that had poor-quality OCR, obtaining a
Character Error Rate (CER) much lower than our initial expectation of
<5%. This result is a significant improvement that enables searching
through 75,000 pages of printed normative texts from the Seventeen
Provinces, also known as the Low Countries.
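For readers unfamiliar with the metric, CER is commonly defined as the character-level edit distance between a recognised text and its ground-truth transcription, divided by the length of the ground truth. The sketch below illustrates that common definition only. It is not the computation Transkribus performs internally, and the function names `levenshtein` and `cer` as well as the sample strings are hypothetical.

```python
def levenshtein(reference: str, hypothesis: str) -> int:
    """Minimum number of character insertions, deletions and substitutions
    needed to turn `hypothesis` into `reference`."""
    # prev[j] holds the distance between the reference prefix processed so
    # far and the first j characters of the hypothesis.
    prev = list(range(len(hypothesis) + 1))
    for i, ref_char in enumerate(reference, start=1):
        curr = [i]
        for j, hyp_char in enumerate(hypothesis, start=1):
            cost = 0 if ref_char == hyp_char else 1
            curr.append(min(
                prev[j] + 1,         # character missing from the hypothesis
                curr[j - 1] + 1,     # extra character in the hypothesis
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1]


def cer(reference: str, hypothesis: str) -> float:
    """Character Error Rate: edit distance divided by reference length."""
    if not reference:
        raise ValueError("reference transcription must not be empty")
    return levenshtein(reference, hypothesis) / len(reference)


# Hypothetical example: one wrong character in a four-character reference.
print(cer("abcd", "abxd"))
```

Under this definition, a CER below 5% means fewer than five character edits per hundred characters of ground truth.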
The books of ordinances are compilations; thus, segmentation is essential
to retrace the individual norms. We have applied – and compared – four
different methods: ABBYY, P2PaLA, NLE Document Recognition and a
custom rule-based tool that combines lexical features with font recognition.
Each text (norm) in the books concerns one or more topics or categories.
A selection of normative texts was manually labelled with internationally
used (hierarchical) categories. Using Annif, a tool for automatic subject
indexing, we trained a model to assign these categories automatically.
Automatically generated metadata makes it easier to search for relevant
texts and allows further analysis.
Text recognition, segmentation and categorisation of norms together
constitute the datafication of the Early Modern Ordinances. Our experiments for automating these steps have resulted in a provisional process
for datafication of this and similar collections.
Original language | English |
---|---|
Journal | DH Benelux Journal |
Volume | 2 |
Status | Published - Aug. 2020 |