Abstract
The Web has grown into a heterogeneous open data space of interlinked documents,
tables, and databases. Open datasets on the Web are often used as input
for many knowledge discovery processes, which aim at finding patterns within
those data. However, open datasets on the Web are hardly ever ready for analysis,
and require careful data preparation. Even though current efforts focus
on making analysis more efficient, empirical studies show that data preparation
takes 60% of the total time spent.
Statistical data are data subject to analysis by statistical methods and tools.
A number of problems in these statistical datasets severely hamper their preparation.
First, support for non-standard legacy formats decays over time,
which reduces the accessibility of these data. Second, data errors, typos and
other flaws are hard to detect and correct, and affect how meaningful results are
in analysis. Third, data curation procedures are often hard-coded in implementations
or hidden in closed-source systems, obstructing their reusability. Moreover,
if these datasets also have a historical dimension, two additional problems arise.
First, operational sources of historical statistics have often been lost over
time, leaving partial analytical views as the only representation preserved in
archives. Second, time series are usually poorly harmonized, due to the incompatibility
of changing classification systems. Data scientists try to resolve all
these data preparation issues by resorting to painful data munging, which accounts
for the aforementioned time spent.
This thesis proposes solutions to these problems that take advantage of Semantic
Web technologies. Multiple statistical datasets in the domain of
Social and Economic History, where this kind of data is prototypical, are used as
a case study. Therefore, the main research question addressed in the thesis is:
How can Semantic Web technologies contribute to solving integration
problems of legacy statistical collections, lower their access costs, measure
the quality of their diachronic schemas and their constrained instances,
and facilitate their transformation in a standards-compliant
and implementation-independent way?
Original language | English |
---|---|
Qualification | Doctor of Philosophy |
Awarding Institution | |
Supervisors/Advisors | |
Award date | 09 May 2016 |
Place of Publication | Amsterdam |
Publisher | |
Publication status | Published - May 2016 |
Fingerprint
Dive into the research topics of 'Refining Statistical Data on the Web'. Together they form a unique fingerprint.
Projects
- 1 Finished
- Census data open linked – CEDA_R: From fragment to fabric – Dutch census data in a web of global cultural and historic information
Scharnhorst, A., Mandemakers, K., van Harmelen, F., Doorn, P., Guéret, C., Ashkpour, A., Meroño-Peñuela, A. & Schlobach, S.
01/10/2011 → 31/03/2016
Project: Research