Refining Statistical Data on the Web

A. Meroño-Peñuela

Research output: PhD thesis

Abstract

The Web has grown into a heterogeneous open data space of interlinked documents,
tables, and databases. Open datasets on the Web are often used as input
for many knowledge discovery processes, which aim at finding patterns within
those data. However, open datasets on the Web are hardly ever ready for analysis,
and require careful data preparation. Even though current efforts focus
on making analysis more efficient, empirical studies show that data preparation
takes 60% of the total time spent.
Statistical data are data subject to analysis by statistical methods and tools.
A number of problems in these statistical datasets severely hamper their preparation.
First, support for non-standard legacy formats decays over time,
which negatively affects the accessibility of these data. Second, data errors, typos and
other flaws are hard to detect and correct, and affect how meaningful results are
in analysis. Third, data curation procedures are often hard-coded in implementations
or hidden in closed-source systems, obstructing their reusability. Moreover,
if these datasets also contain a historical dimension, two additional problems occur.
First, operational sources of historical statistics have often been lost over
time, leaving partial analytical views as the only representation preserved in
archives. Second, time series are usually poorly harmonized, due to the incompatibility
of changing classification systems. Data scientists try to resolve all
these data preparation issues by resorting to painful data munging, which accounts
for the aforementioned time spent.
This thesis proposes solutions to these problems that take advantage of Semantic
Web technologies. Multiple statistical datasets in the domain of
Social and Economic History, where this kind of data is prototypical, are used as
a case study. Therefore, the main research question addressed in the thesis is:
How can Semantic Web technologies contribute to solving integration
problems of legacy statistical collections, lower their access costs, measure
the quality of their diachronic schemas and their constrained instances,
and facilitate their transformation in a standards-compliant
and implementation-independent way?
Original language: English
Qualification: Doctor of Philosophy
Awarding Institution
  • VU University Amsterdam
Supervisors/Advisors
  • van Harmelen, Frank, Promotor
  • Schlobach, Stefan, Co-promotor
  • Scharnhorst, Andrea, Co-promotor
Award date: 09 May 2016
Place of Publication: Amsterdam
Publisher
Publication status: Published - May 2016
