Page Embeddings: Extracting and Classifying Historical Documents with Generic Vector Representations

Carsten Schnober*, Renate Smit, Manjusha Kuruppath, Kay Pepping, Leon van Wissen, Lodewijk Petram

*Corresponding author for this work

Research output: Chapter in book/volume, Contribution to conference proceedings, Scientific, peer-review

Abstract

We propose a neural network architecture designed to generate region and page embeddings for boundary detection and classification of documents within a large and heterogeneous historical archive. Our approach is versatile and can be applied to other tasks and datasets. This method enhances the accessibility of historical archives and promotes a more inclusive utilization of historical materials.
Original language: English
Title of host publication: Proceedings of the Computational Humanities Research Conference 2024
Subtitle of host publication: Aarhus, Denmark, December 4-6, 2024
Pages: 999-1011
Number of pages: 13
Volume: 3834
Publication status: Published - 18 Nov 2024

Publication series

Name: CEUR Workshop Proceedings
Publisher: CEUR Workshop Proceedings
ISSN (Print): 1613-0073

Keywords

  • Natural Language Processing
  • Sequence Tagging
  • Document Metadata Enhancement
  • Machine Learning
