EXCITE - Extraction of Citations from PDF Documents

  • Körner, Martin (PI)
  • Staab, Steffen (PI)
  • Mayr, Philipp (Collaborator)
  • Scharnhorst, Andrea (Advisor)

Project Details

Description

The shortage of citation data for the international and especially the German social sciences is well known to researchers in the field and has itself often been the subject of academic studies. Citation data is the basis of effective information retrieval, recommendation systems and knowledge discovery processes. The accessibility of information in the social sciences lags behind other fields (e.g. the natural sciences) where more citation data is available. The EXCITE project aims to close this gap by developing a tool chain of software components for reference extraction, which will be applied to existing scientific databases (especially full texts in the social sciences). The tools will be made available to other researchers.

The project will develop a number of algorithms for extracting references and citations from PDF full texts. It will also improve the matching of reference strings to bibliographic databases. The extraction of citations will be implemented as a five-step process: 1) extraction of text from the source documents, 2) identification of reference sections in the text, 3) segmentation of individual references into fields such as author, title, etc., 4) matching of reference strings against bibliographic databases, 5) export of the matched references in usable formats and services. Special attention will be paid to optimizing the individual components of the citation extraction. This will be done with the help of machine learning methods, which control the quality of the data extracted by each component.

The extracted citation data will be integrated into the services maintained by the proposers (sowiport and RelatedWork.net) and published as linked open data under permissive licenses to enable reuse. The software resulting from this project will be published under open source licenses and made accessible via a web service API.
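
To make the five-step process described above more concrete, the following is a minimal Python sketch of steps 2 to 4 only. It is not the EXCITE implementation: the project relies on machine learning methods, whereas this sketch swaps in simple regular expressions and string similarity from the standard library. The function names, the heading pattern, the assumed "Author (Year). Title." layout and the in-memory `database` dictionary are all hypothetical and chosen for illustration. Step 1 (PDF to text) is assumed to have been done already, and step 5 (export) is left out.

```python
import re
from difflib import SequenceMatcher

# Hypothetical heading pattern; real documents use many more variants.
HEADING = re.compile(
    r"^[ \t]*(references|bibliography|literatur)[ \t]*$",
    re.IGNORECASE | re.MULTILINE,
)
# Naive "Author (Year). Title." layout, assumed for illustration only.
REFERENCE = re.compile(
    r"^(?P<author>.+?)\s*\((?P<year>\d{4})\)\.\s*(?P<title>.+?)\.?$"
)


def find_reference_section(text):
    """Step 2: return the text after the first reference heading, or ''."""
    match = HEADING.search(text)
    return text[match.end():] if match else ""


def split_references(section):
    """Treat every non-empty line as one reference string (a simplification)."""
    return [line.strip() for line in section.splitlines() if line.strip()]


def segment_reference(ref):
    """Step 3: split one reference string into author, year and title."""
    match = REFERENCE.match(ref)
    return match.groupdict() if match else None


def match_reference(fields, database, threshold=0.8):
    """Step 4: return the id of the most similar title in `database`, or None."""
    best_id, best_score = None, 0.0
    for record_id, title in database.items():
        score = SequenceMatcher(None, fields["title"].lower(), title.lower()).ratio()
        if score > best_score:
            best_id, best_score = record_id, score
    return best_id if best_score >= threshold else None


def extract_citations(text, database):
    """Chain steps 2 to 4 over already extracted full text."""
    results = []
    for ref in split_references(find_reference_section(text)):
        fields = segment_reference(ref)
        if fields is not None:
            fields["match"] = match_reference(fields, database)
            results.append(fields)
    return results
```

Keeping each step as a separate function mirrors the component structure described above, so that a single component (for example the segmentation) could be replaced by a trained model and evaluated without touching the others.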
Short title: EXCITE
Status: Finished
Effective start/end date: 01/09/2016 to 31/08/2018