Pipeline: Hebrew data from ETCBC to Github

Dirk Roorda (Developer)

Research output: Non-textual form, Software, Scientific

Abstract

This pipeline delivers, among other good things, a file bhsa_xx.mql.bz2 that contains all ETCBC data and the research additions to it. The file is a compressed MQL dump of less than 30 MB. Wherever you have Emdros installed, you can query this data.

If you take this file from the continuous version, the data is also up to date: less than a week old, provided the pipeline is run at least weekly.
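As a minimal sketch of the first step before querying, the compressed file can be unpacked with the Python standard library. The file name bhsa_xx.mql.bz2 is taken literally from the text above (the xx part is left as-is), and the helper name decompress_mql is hypothetical, not part of the pipeline.

```python
import bz2
import shutil
from pathlib import Path


def decompress_mql(archive: Path) -> Path:
    """Decompress a .bz2 archive next to itself and return the new path.

    Hypothetical helper for illustration; not part of the pipeline itself.
    """
    if archive.suffix != ".bz2":
        raise ValueError(f"expected a .bz2 file, got {archive.name}")
    # Dropping the last suffix turns bhsa_xx.mql.bz2 into bhsa_xx.mql.
    target = archive.with_suffix("")
    # Copy in chunks so the uncompressed dump never has to fit in memory.
    with bz2.open(archive, "rb") as src, open(target, "wb") as dst:
        shutil.copyfileobj(src, dst)
    return target
```

The resulting .mql file is what Emdros imports. How the import and the queries are invoked depends on the local Emdros installation and its backend, so that step is left to the Emdros documentation.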

Two pipes
This repo contains a pipeline in software by which the ETCBC can update its public data sources. The pipeline has two main pipes:

ETCBC to TF
TF to SHEBANQ.
Between the two pipes there is a set of open GitHub repositories that contain the data in a compact, text-based format, Text-Fabric (TF), which is uniquely suited to frictionless data processing.

Only the first pipe has been fully developed so far; the second is only partly developed.

Purpose
The public data of the ETCBC is live data, in the sense that it is actively developed at the ETCBC. Mistakes are corrected, new insights are carried through, and the fruits of research are added as enrichments.

The ETCBC wants to expose its current data to researchers and to the public.

All public incarnations of the ETCBC data at a given point in time should be in sync.

The refresh rate should be at least weekly, preferably more frequent.
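The two requirements above (all incarnations in sync, refreshed at least weekly) can be expressed as simple checks. This is a sketch under assumptions: the function names, the idea of one snapshot timestamp per incarnation, and the incarnation labels used in the usage note are all hypothetical and do not come from the pipeline.

```python
from datetime import datetime, timedelta, timezone
from typing import Mapping, Optional

# "At least weekly", per the requirement stated above.
MAX_AGE = timedelta(days=7)


def is_fresh(last_refresh: datetime, now: Optional[datetime] = None) -> bool:
    """Return True if the last refresh is no older than MAX_AGE."""
    if now is None:
        now = datetime.now(timezone.utc)
    return now - last_refresh <= MAX_AGE


def in_sync(snapshots: Mapping[str, datetime]) -> bool:
    """Return True if every incarnation carries the same snapshot time.

    `snapshots` maps a (hypothetical) incarnation label to the timestamp
    of the data it was built from.
    """
    return len(set(snapshots.values())) <= 1
```

For example, `in_sync({"tf": t, "mql": t})` holds when both incarnations were produced from the same snapshot `t`, and `is_fresh(t)` tells whether that snapshot still meets the weekly target.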

Buffer function
The ETCBC does not yet produce a data export that satisfies all the requirements posed by users further down the line. SHEBANQ in particular is fussy about the details of the text-carrying features, whose contents and organization have changed from version 4 to 4b to 2016. Sometimes features are missing from the export and have to be reconstructed from other data; sometimes values seem to have been mangled somewhere in the creation workflow.

This pipeline is a useful tool to work around those issues temporarily and to provide feedback to the ETCBC, which will hopefully lead to a more consistent data interface over time.

Versioning
The pipeline produces versions of the whole spectrum of interconnected ETCBC data. There will be fixed versions (2017, 2019, ...) and a continuous version (c). Version c is the one to receive the weekly updates.

The name of the version is the most important parameter of the pipeline.
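To illustrate how a single version name could drive the rest of the pipeline, here is a minimal sketch. It assumes that the xx in bhsa_xx.mql.bz2 stands for the version name; that mapping is an assumption, not something the text confirms, and both function names are hypothetical.

```python
CONTINUOUS = "c"  # the continuous version named in the text


def is_continuous(version: str) -> bool:
    """Return True for the version that receives the weekly updates."""
    return version == CONTINUOUS


def mql_archive_name(version: str) -> str:
    """Build the MQL archive name for a version.

    Assumption: the 'xx' placeholder in bhsa_xx.mql.bz2 is the version name.
    """
    if not version:
        raise ValueError("version name must not be empty")
    return f"bhsa_{version}.mql.bz2"
```

With this convention, `mql_archive_name("2017")` gives the archive for a fixed version and `mql_archive_name("c")` the one for the continuous version, so every later step only needs the version name as input.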

Original language: English
Place of Publication: Geneva
Publisher: Zenodo
Media of output: source code/data file (online)
Size: Gigabytes
Publication status: Published - 31 Oct 2017

Keywords

  • Hebrew Text Database, queries, annotations

  • SHEBANQ: SHEBANQ

    van Peursen, W. T. & Roorda, D.

    01/05/2013 to 31/07/2014

    Project: Research
