Abstract
The TRIADO project (2016-2019) is a cooperation between Netwerk Oorlogsbronnen (coordinator), NIOD Institute for War, Holocaust and Genocide Studies, Huygens ING/KNAW Humanities Cluster and the National Archives of the Netherlands (Nationaal Archief). TRIADO explores technological strategies to transform analogue text-based archival collections into digital data that can be used for research. The first part of the project tests new techniques to open up collections; the second part is a 'reality check' that explores the research potential of the data created.
Archives, libraries and museums (ALMs) increasingly digitize their analogue historical collections. Yet in 2017 it was estimated that only approximately one tenth of all heritage collections in Europe had been digitized. There is still a large gap between the specific needs of the digital humanities community and the digital 'raw materials' supplied by the ALMs. Text-based historical collections are potentially interesting to a wide range of scientific disciplines, but so far, in the case of the Netherlands, only a few digitized archives are equipped to be used for digital research.
The main aim of TRIADO is to bridge this gap by performing a 'laboratory to reality' check with the most frequently consulted WWII archive in the Netherlands: the Central Archive of Special Jurisdiction (CABR). The CABR, held by the Nationaal Archief (National Archives of the Netherlands), consists of the legal case files of some 300,000 persons accused of collaborating with the German occupier. It contains approximately 4 kilometers of analogue documents (shelf space), ranging from minutes and verdicts to membership cards, forms and summonses. Most documents are typed or hybrid (typed/handwritten).
The experimental pilot project TRIADO focuses on two complementary research questions:
1. Which digital methods are best suited (in terms of quality, efficiency, etc.) to make large corpora of unstructured, imperfect data, based on analogue collections, usable as a research facility?
2. Is it possible to answer specific, mainly quantitative statistical research questions on the basis of the digital data created under 1?
A sample of 13.8 meters from the CABR was digitized to test technologies and perform experiments. Also, a workflow for mass digitization was devised and a demonstrator was built to showcase the results of the experiments. In this paper we discuss the main findings of the research done in part 1. This paper reports on processes for mass digitization, OCR quality and improvement, auto-classification of document types, named entity recognition, date extraction and matching of existing name lists to OCR'd data.
Original language | English |
---|---|
Title of host publication | DATeCH2019 |
Subtitle of host publication | Proceedings of the 3rd International Conference on Digital Access to Textual Cultural Heritage |
Publisher | Association for Computing Machinery (ACM) |
Pages | 105-110 |
Publication status | Published - 08 May 2019 |
Keywords
- digital humanities
- machine learning
- named entity recognition
- automated text recognition
- auto-classification