Tutorial: Introduction to Annif and automated subject indexing

Annemieke Romein (Speaker), Sara Floor Veldhoen (Speaker), Osma Suominen (Speaker), Koraljka Golub (Speaker)

Activity: Talk or presentation, Societal

Description

Manually indexing documents for subject-based access is a very labour-intensive intellectual process. A machine could perform similar subject indexing much faster. In this series of presentations and demonstrations, we will show practical examples of automated subject indexing and discuss how such systems can be evaluated.

In the first part of this presentation, Osma Suominen will introduce the general idea of automated subject indexing using a controlled vocabulary, such as a thesaurus or a classification system. He will also introduce the open source automated subject indexing tool Annif, which integrates several different machine learning algorithms for text classification. By combining multiple approaches, Annif can be adapted to different settings. The tool can be used with any vocabulary and, with suitable training data, can analyse documents in many different languages. Annif is both a command line tool and a microservice-style API service which can be integrated with other systems. We will demonstrate how to use Annif to train a model using metadata from an existing bibliographic database and how it can then provide subject suggestions for new, unseen documents.
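
As a rough illustration of the API integration mentioned above, the following Python sketch builds a subject-suggestion request and reads back the ranked suggestions. The base URL, the project identifier `my-project`, the endpoint path and the response fields are assumptions based on the Annif REST API as it is commonly documented, not details given in this description; they should be checked against the API documentation of the Annif version in use.

```python
import json
from urllib import parse, request

# Assumed values: a locally running Annif service and a hypothetical project id.
BASE_URL = "http://localhost:5000/v1"
PROJECT_ID = "my-project"


def build_suggest_request(text, limit=5, base_url=BASE_URL, project_id=PROJECT_ID):
    """Build (but do not send) a POST request asking for subject suggestions."""
    body = parse.urlencode({"text": text, "limit": limit}).encode("utf-8")
    url = f"{base_url}/projects/{project_id}/suggest"
    return request.Request(url, data=body, method="POST")


def parse_suggestions(payload):
    """Return (label, score) pairs from a JSON response, best score first."""
    results = json.loads(payload).get("results", [])
    ranked = sorted(results, key=lambda r: r["score"], reverse=True)
    return [(r["label"], r["score"]) for r in ranked]


def suggest(text, limit=5):
    """Send the request to a running service and return the parsed suggestions."""
    with request.urlopen(build_suggest_request(text, limit)) as resp:
        return parse_suggestions(resp.read().decode("utf-8"))
```

Only `suggest` touches the network; the two helper functions can be exercised without a running service, which keeps the request format and the response handling easy to inspect separately.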

In the second part of the presentation, Koraljka Golub will discuss the topic of evaluating automated subject indexing systems. There are many challenges in evaluation, for example the lack of gold standards to compare against, the inherently subjective nature of subject indexing, relatively low inter-indexer consistency in typical settings, and dominating out-of-context, laboratory-like evaluation approaches.
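
One common way to compare automated suggestions with a human-indexed gold standard is set-based precision, recall and F1. The Python sketch below is only an illustration with made-up subject labels; it is not presented as the speakers' evaluation method, and it inherits the limitations noted above, since it treats one indexer's choices as the only correct answer.

```python
def precision_recall_f1(suggested, gold):
    """Compare suggested subjects with gold-standard subjects for one document."""
    suggested, gold = set(suggested), set(gold)
    hits = len(suggested & gold)  # subjects both the system and the indexer chose
    precision = hits / len(suggested) if suggested else 0.0
    recall = hits / len(gold) if gold else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Hypothetical labels, for illustration only.
system_subjects = ["law", "taxation", "trade"]
indexer_subjects = ["law", "trade", "police", "markets"]
p, r, f = precision_recall_f1(system_subjects, indexer_subjects)
print(f"precision={p:.2f} recall={r:.2f} f1={f:.2f}")
```

Running the same comparison between two human indexers instead of a system and a gold standard gives a simple view of inter-indexer consistency, which helps put a system's scores in context.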

In the third part of the presentation, Annemieke Romein and Sara Veldhoen will present a case study of how they have applied Annif in a Digital Humanities research project to categorize early modern legislative texts using a hierarchical subject vocabulary and a pre-trained set.

For practitioners who would like to learn how to use the Annif tool on their own, there is also a follow-up hands-on tutorial. It consists of short prerecorded video presentations, written instructions and practical exercises that introduce and explain various aspects of Annif and its use.
Period: 21 Sep 2020
Degree of Recognition: International

Keywords

  • DCMI
  • Annif
  • Automatic subject indexing
  • Entangled Histories