Classifying the Quality of Digitized VOC Documents

Carsten Schnober, Kay Pepping, Maartje Hids, Lodewijk Petram

Research output: Contribution to conference › Abstract › Academic


The GLOBALISE project is developing an "online infrastructure that unlocks the key series of VOC documents and reports". These handwritten documents and reports, dating from the 17th and 18th centuries, have been scanned by the Dutch National Archives in The Hague as part of its ongoing effort to make the most frequently used archives in its collection accessible online. The ca. 5 million scans of the GLOBALISE corpus were converted to machine-readable text using the open-source HTR tool Loghi. Despite recent improvements in handwritten text recognition, error rates vary widely per page. This may be due to variation in handwriting, but also, for example, to different document types or pages with a different scan orientation.

For downstream tasks required in the GLOBALISE project, such as event detection or named entity recognition, pages with poor-quality HTR output are useless or even harmful. To improve the results of those tasks, we aim to identify poor-quality pages automatically and reprocess them in a targeted manner. The goal is thus to identify as many poor-quality pages as possible (recall) while minimizing the number of documents sent for reprocessing (precision).

We will present our findings from adapting the method proposed for Luxembourgish by Schneider & Maurer (2022) to our historical Dutch dataset. We have analysed our specific data and present differences, commonalities, and other findings. The result is publicly available, as is the pipeline comprising all steps needed to train a similar classifier, e.g. for other languages or time periods.

We manually annotated 500 documents with a quality class (Good, Medium, Bad); of these, we eventually used the 328 documents that were not empty and on whose quality both annotators agreed.

In summary, we applied a set of features resembling those proposed by Schneider & Maurer (2022):

- Dictionary score: the number of tokens that occur in a dictionary of the language; we used both a modern-day Dutch dictionary and a dictionary generated from VOC documents.

- Tri-gram comparison: a metric comparing the distribution of character tri-grams in a document with the expected distribution for the target language.

- Garbage token detection: a metric based on tokens that do not appear to be real words, hence 'garbage', identified by heuristics such as token length and unusual sequences of vowels or consonants.
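The three features above can be sketched in a few lines of Python. This is a minimal illustration, not our actual implementation: the tokenizer, consonant set, and heuristic thresholds (`max_len`, `max_consonant_run`) are illustrative assumptions.

```python
import re
from collections import Counter
from math import sqrt

def dictionary_score(tokens, dictionary):
    """Share of tokens found in the dictionary (modern-day or VOC-derived)."""
    if not tokens:
        return 0.0
    return sum(t.lower() in dictionary for t in tokens) / len(tokens)

def trigram_distance(text, reference_dist):
    """Cosine distance between the document's character tri-gram
    distribution and a reference distribution for the target language."""
    grams = Counter(text[i:i + 3] for i in range(len(text) - 2))
    total = sum(grams.values()) or 1
    doc_dist = {g: c / total for g, c in grams.items()}
    keys = set(doc_dist) | set(reference_dist)
    dot = sum(doc_dist.get(k, 0) * reference_dist.get(k, 0) for k in keys)
    norm = (sqrt(sum(v * v for v in doc_dist.values()))
            * sqrt(sum(v * v for v in reference_dist.values())))
    return 1 - dot / norm if norm else 1.0

def garbage_ratio(tokens, max_len=20, max_consonant_run=5):
    """Share of tokens flagged as 'garbage' by simple heuristics:
    excessive length, long consonant runs, or long vowel runs."""
    def is_garbage(tok):
        tok = tok.lower()
        return (len(tok) > max_len
                or re.search(rf"[bcdfghjklmnpqrstvwxz]{{{max_consonant_run},}}", tok) is not None
                or re.search(r"[aeiou]{4,}", tok) is not None)
    if not tokens:
        return 1.0
    return sum(is_garbage(t) for t in tokens) / len(tokens)
```

All three functions return values in [0, 1], so they can be fed directly into a classifier as a feature vector.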

We used our annotated dataset to train various classifiers, including k-nearest neighbours and feed-forward neural networks. The latter performed best, but different classifier choices and parametrizations led only to small differences in accuracy, in the range between 0.72 and 0.74. This led to the preliminary conclusion that the choice of classifier algorithm alone cannot yield significant improvements in accuracy. Instead, non-textual features such as layout and text ordering contribute to the misclassifications. We will present our error analysis in more depth and show the possibilities and shortcomings of this text-based approach.
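To illustrate the classification setup, the following sketch implements a toy k-nearest-neighbours classifier over the three feature values. The training points and labels below are made-up stand-ins, not our annotated data, and the feature values are hypothetical.

```python
from collections import Counter
from math import dist  # Euclidean distance, Python 3.8+

def knn_predict(train, query, k=3):
    """Classify a feature vector by majority vote among its k nearest
    labelled neighbours in feature space."""
    neighbours = sorted(train, key=lambda item: dist(item[0], query))[:k]
    votes = Counter(label for _, label in neighbours)
    return votes.most_common(1)[0][0]

# Synthetic examples:
# (dictionary score, tri-gram distance, garbage ratio) -> quality class
train = [
    ((0.92, 0.05, 0.02), "Good"),
    ((0.88, 0.08, 0.05), "Good"),
    ((0.60, 0.25, 0.20), "Medium"),
    ((0.55, 0.30, 0.25), "Medium"),
    ((0.20, 0.60, 0.55), "Bad"),
    ((0.15, 0.70, 0.60), "Bad"),
]
print(knn_predict(train, (0.90, 0.06, 0.03)))  # prints "Good"
```

A feed-forward network replaces the distance-based vote with learned weights, but as noted above, the gain in accuracy over such a simple baseline was small on our data.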

We will also show how well the features defined by Schneider & Maurer (2022) translate to our historical Dutch dataset. Taking a pragmatic approach, we will demonstrate how our implementation follows the principles of open science and open-source software, so that others can apply it to their own data or adapt it to their own needs.
Original language: English
Status: Published - 22 Sep 2023
Event: The 33rd Meeting of Computational Linguistics in The Netherlands - Antwerp, Belgium
Duration: 22 Sep 2023 - 22 Sep 2023
Conference number: 33


Abbreviated title: CLIN

