Summary
# Fusus
This is a workflow that transforms scanned pages into readable text.
The pages come from several printed Arabic books from the past few centuries.
The workflow takes care of cleaning, OCR and postprocessing.
To guide the cleaning, a user can copy and paste image fragments of specks and symbols; marks like these are removed before OCR.
The workflow detects column layout and line boundaries.
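
As an illustration of what detecting line boundaries can involve, here is a minimal sketch of one common technique, a horizontal projection profile: count the ink pixels in each pixel row and treat each run of inked rows as one text line. This is not necessarily how Fusus does it. The function name `find_line_bands`, the `min_ink` parameter, and the toy page are hypothetical.

```python
def find_line_bands(binary, min_ink=1):
    """Return (top, bottom) pixel-row ranges that contain ink.

    binary: list of pixel rows, each a list with nonzero values for ink.
    bottom is exclusive. Hypothetical helper, not part of Fusus.
    """
    # Projection profile: number of ink pixels in each pixel row.
    ink_per_row = [sum(1 for px in row if px) for row in binary]

    bands = []
    start = None
    for row_index, ink in enumerate(ink_per_row):
        inked = ink >= min_ink
        if inked and start is None:
            start = row_index                  # a text line begins here
        elif not inked and start is not None:
            bands.append((start, row_index))   # the text line just ended
            start = None
    if start is not None:                      # ink runs to the page bottom
        bands.append((start, len(ink_per_row)))
    return bands


# Toy 10x8 "page" with two bands of ink, standing in for two text lines.
page = [[0] * 8 for _ in range(10)]
for r in (1, 2):
    page[r][2:6] = [1, 1, 1, 1]
for r in (5, 6, 7):
    page[r][1:7] = [1, 1, 1, 1, 1, 1]

bands = find_line_bands(page)
print(bands)  # [(1, 3), (5, 8)]
```

Real scans are noisier than this toy page, so a practical version would normally smooth the profile or raise `min_ink` to ignore stray specks.
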
Individual lines are passed to the OCR engine, Kraken, which uses a model trained
on many printed Arabic books.
See [model](https://among.github.io/fusus/about/model.html).
The results are stored in tab-separated files that hold the transcription computed by the OCR step,
together with the position and confidence information from that same step.
The workflow can generate proofing pages that support manually checking the OCR results.
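
The tab-separated output described above lends itself to simple downstream checks. The sketch below reads such a file and lists the words whose confidence falls below a threshold, which is the kind of filtering that could support proofing. The column names (`page`, `line`, `left`, `top`, `right`, `bottom`, `conf`, `text`), the sample rows, and the threshold are all assumptions made for illustration; the actual layout of the Fusus files is described in the project documentation.

```python
import csv
import io

# Made-up sample in an assumed column layout; real Fusus files may differ.
SAMPLE = (
    "page\tline\tleft\ttop\tright\tbottom\tconf\ttext\n"
    "1\t1\t120\t80\t310\t130\t97\tكتاب\n"
    "1\t1\t330\t80\t520\t130\t62\tفصوص\n"
)


def low_confidence_words(tsv_file, threshold=80):
    """Return the rows whose 'conf' value is below the threshold."""
    reader = csv.DictReader(tsv_file, delimiter="\t")
    return [row for row in reader if int(row["conf"]) < threshold]


flagged = low_confidence_words(io.StringIO(SAMPLE))
for row in flagged:
    # Print where the doubtful word sits and what the OCR step read.
    print(row["page"], row["line"], row["conf"], row["text"])
```

To run this on a file on disk, pass `open(path, encoding="utf-8", newline="")` instead of the `io.StringIO` object.
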
# Next steps
Once we have scanned a significant number of pages, we'll construct a dataset in
[Text-Fabric]()
format from them, with features that preserve the positions of the words on the page and their confidence.
From there we can implement steps to correct OCR mistakes and to carry out intertextuality research between
the base text (the "Fusus" by Ibn Arabi) and its commentaries.
# Authors
This is work done by Cornelis van Lit and Dirk Roorda.
There is more documentation about sources, the research project, and how to use
this software in the
[docs](https://among.github.io/fusus/).
| Original language | English |
|---|---|
| Publisher | Zenodo |
| Output media | source code/data file (online) |
| DOIs | |
| Status | Published - 7 Dec 2020 |