bnl_ground_truth_newspapers_before_1878 #79

ymaurer · 2022-08-05T13:23:23Z

A URL for this dataset

https://data.bnl.lu/data/historical-newspapers/

Dataset description

33,000 transcribed text lines from historical newspapers (before 1878), along with the cropped images of the original scans

Text line based OCR
19,000 text lines in Antiqua
14,000 text lines in Fraktur
Transcribed using double-keying (99.95% accuracy)
Public Domain, CC0 (See copyright notice)
Best for training an OCR engine

The newspapers used are:

Le Gratis luxembourgeois (1857-1858)
Luxemburger Volks-Freund (1869-1876)
L'Arlequin (1848)
Courrier du Grand-Duché de Luxembourg (1844-1868)
L'Avenir (1868-1871)
Der Wächter an der Sauer (1849-1869)
Luxemburger Zeitung (1844-1845)
Luxemburger Zeitung = Journal de Luxembourg (1858-1859)
Der Volksfreund (1848-1849)
Cäcilia (1862-1871)
Kirchlicher Anzeiger für die Diözese Luxemburg (1871-1878)
L'Indépendance luxembourgeoise (1871-1878)
Luxemburger Anzeiger (1856)
L'Union (1860-1871)
Diekircher Wochenblatt (1837-1848)
Das Vaterland (1869-1870)
D'Wäschfra (1868-1878)
Luxemburger Bauernzeitung (1857)
Luxemburger Wort (1848-1878)
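
Double-keying, mentioned in the description above, generally means that two people transcribe the same line independently and the two versions are compared so that disagreements can be reviewed. The sketch below shows that comparison step in a minimal form using Python's standard `difflib`. The function name `keying_disagreements` and the sample strings are illustrative assumptions; this is not the process or tooling actually used to produce this dataset.

```python
import difflib


def keying_disagreements(first, second):
    """Return (operation, text_in_first, text_in_second) for every span
    where two independent transcriptions of the same line differ."""
    matcher = difflib.SequenceMatcher(None, first, second, autojunk=False)
    return [
        (tag, first[i1:i2], second[j1:j2])
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
        if tag != "equal"  # keep only the spans that need human review
    ]


# Hypothetical pair of keyings for one text line.
print(keying_disagreements("Luxemburger Wort", "Luxemburger Wert"))
```

An empty result means the two keyings agree character for character; any non-empty result would go to a reviewer.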

Dataset modality

Mixed

Dataset licence

Creative Commons Public Domain Dedication and Certification

Other licence

No response

How can you access this data

As a download from a repository/website

Size of dataset

500MB-2GB

Confirm the dataset has an open licence

To the best of my knowledge, this dataset is accessible via an open licence

Contact details for data custodian

[email protected]

ymaurer · 2022-08-05T14:28:11Z

I lightly transformed the original dataset into JSONL and zipped the images:

https://huggingface.co/ymaurer/bnl_ground_truth_newspapers_before_1878
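
For anyone who wants to inspect a JSONL-plus-zip layout like this by hand, here is a minimal sketch using only the Python standard library. The helper names `read_jsonl` and `read_image_bytes` are made up for illustration, and the file names and record fields inside the repository are not stated in this thread, so check the actual files before relying on any particular key or path.

```python
import json
import zipfile


def read_jsonl(path):
    """Yield one dict per non-empty line of a JSON Lines file."""
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:  # tolerate blank lines between records
                yield json.loads(line)


def read_image_bytes(zip_path, member_name):
    """Return the raw bytes of one file stored inside a zip archive."""
    with zipfile.ZipFile(zip_path) as archive:
        return archive.read(member_name)
```

Each record from `read_jsonl` is a plain dict, so printing the first one is a quick way to discover which key holds the transcription and which one names the matching image in the archive.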

ymaurer · 2022-08-05T19:39:17Z

Moved the dataset to the biglam organisation
biglam/bnl_ground_truth_newspapers_before_1878

davanstrien · 2022-08-07T14:02:51Z

> Moved the dataset to the biglam organisation biglam/bnl_ground_truth_newspapers_before_1878

I think this was created as a model repository, so I've just moved it to a dataset repository. It would also be good to write a loading script so the data is easier to load with the datasets library. I'll hopefully have some time to help with that later this week.
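
Once the repository can be read by the datasets library, loading might look roughly like the sketch below. The repository ID comes from this thread; the wrapper name `load_bnl_ground_truth` is hypothetical, and the split name `"train"` is an assumption, since nothing here says which splits exist or whether a loading script has been written yet.

```python
REPO_ID = "biglam/bnl_ground_truth_newspapers_before_1878"


def load_bnl_ground_truth(split="train"):
    """Load one split of the dataset from the Hugging Face Hub.

    Assumes the `datasets` package is installed and that the repository
    is in a state the library can load; the default split is a guess.
    """
    # Imported lazily so this module can be imported without `datasets`.
    from datasets import load_dataset

    return load_dataset(REPO_ID, split=split)
```

Calling `load_bnl_ground_truth()` downloads the data on first use, so expect it to need network access and some disk space given the 500MB-2GB size stated above.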

ymaurer added the candidate-dataset label (Proposed dataset to be added) on Aug 5, 2022

davanstrien added the dataset label (Dataset to be added) and removed the candidate-dataset label (Proposed dataset to be added) on Aug 5, 2022

bigscience-workshop-projects bot moved this to Todo in BigLAM: BigScience Libraries, Archives and Museums Aug 5, 2022

bigscience-workshop-projects bot added this to BigLAM: BigScience Libraries, Archives and Museums Aug 5, 2022
