33,000 transcribed text lines from historical newspapers (before 1878), along with the cropped images of the original scans
Text-line-based OCR
19,000 text lines in Antiqua
14,000 text lines in Fraktur
Transcribed using double-keying (99.95% accuracy)
Public Domain, CC0 (See copyright notice)
Best for training an OCR engine
The newspapers used are:
Le Gratis luxembourgeois (1857-1858)
Luxemburger Volks-Freund (1869-1876)
L'Arlequin (1848)
Courrier du Grand-Duché de Luxembourg (1844-1868)
L'Avenir (1868-1871)
Der Wächter an der Sauer (1849-1869)
Luxemburger Zeitung (1844-1845)
Luxemburger Zeitung = Journal de Luxembourg (1858-1859)
Der Volksfreund (1848-1849)
Cäcilia (1862-1871)
Kirchlicher Anzeiger für die Diözese Luxemburg (1871-1878)
L'Indépendance luxembourgeoise (1871-1878)
Luxemburger Anzeiger (1856)
L'Union (1860-1871)
Diekircher Wochenblatt (1837-1848)
Das Vaterland (1869-1870)
D'Wäschfra (1868-1878)
Luxemburger Bauernzeitung (1857)
Luxemburger Wort (1848-1878)
Dataset modality
Mixed
Dataset licence
Creative Commons Public Domain Dedication and Certification
Other licence
No response
How can you access this data
As a download from a repository/website
Size of dataset
500MB-2GB
Confirm the dataset has an open licence
To the best of my knowledge, this dataset is accessible via an open licence
I think this was created as a model, so I've moved it to a dataset. It could also be good to write a loading script so the data is easier to load with the datasets library. I'll hopefully have some time to help with that later this week.
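As a starting point for such a loading script, here is a minimal sketch of how image/transcription pairs might be collected. It assumes a hypothetical layout in which each cropped line image `<name>.png` sits next to a UTF-8 file `<name>.txt` containing its transcription; the actual archive may be organised differently, and the function name `iter_text_lines` is made up for illustration.

```python
from pathlib import Path
from typing import Iterator


def iter_text_lines(
    root: str, image_ext: str = ".png", text_ext: str = ".txt"
) -> Iterator[dict]:
    """Yield one record per text line: the image path and its transcription.

    ASSUMPTION: every line image has a sibling text file with the same stem.
    This layout is hypothetical and should be checked against the download.
    """
    for image_path in sorted(Path(root).rglob(f"*{image_ext}")):
        text_path = image_path.with_suffix(text_ext)
        if not text_path.is_file():
            continue  # skip images that have no transcription file
        yield {
            "image": str(image_path),
            "text": text_path.read_text(encoding="utf-8").strip(),
        }
```

If the layout matches, the generator could probably be handed to `datasets.Dataset.from_generator` (passing the root directory through `gen_kwargs`), with the `image` column then cast to the library's `Image` feature. That part is untested here.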
A URL for this dataset
https://data.bnl.lu/data/historical-newspapers/
Contact details for data custodian
[email protected]