Add dataset: europeana_newspapers #70

Open · 1 task done
davanstrien opened this issue Jul 25, 2022 · 40 comments
Labels: dataset (Dataset to be added)

@davanstrien
Collaborator

A URL for this dataset

https://pro.europeana.eu/page/iiif#download

Dataset description

This is a dataset of historic newspapers digitised by various national libraries and made available via the Europeana platform.

Dataset modality

Text

Dataset licence

Other license

Other licence

Public Domain Mark for full text and http://creativecommons.org/publicdomain/zero/1.0/ for the metadata

How can you access this data

As a download from a repository/website

Confirm the dataset has an open licence

  • To the best of my knowledge, this dataset is accessible via an open licence

Contact details for data custodian

No response

davanstrien added the candidate-dataset (Proposed dataset to be added) label on Jul 25, 2022
@davanstrien
Collaborator Author

I suggest leaving this as a candidate dataset until we have worked out the best approach. Tagging others who have been discussing this: @bmschmidt @stefan-it

@davanstrien
Collaborator Author

Data access:

Currently, we have a few options for accessing the data:

  • use the data from https://pro.europeana.eu/page/iiif#download
  • use the API for access (do this once and save output)
  • use the API inside the loading script (I think this is only a good idea if we're sure more titles will be added regularly)

Sharing the data:

Since this is a large corpus, and there is also some concern about content being removed from the Europeana website, I think it makes sense to upload a version of the data that has been made more amenable to computational research. @bmschmidt has some code for doing this that we could use as a starting point.

There are then a few options we need to decide on:

  • do we create one 'top-level' dataset and use the loading script to allow people to control what they download?
  • which filters are likely to be helpful when downloading data? The ones I can immediately think of:
    • date of publication (min, max)
    • OCR quality
    • language
    • possibly title
    • possibly source country
  • What fields do we want to include in the data? We want to balance not losing too much fidelity against including surplus data that isn't going to be useful for most users (for very bespoke use cases, people can always go back to the XML).

@bmschmidt

Could you explain the notion of a "loading script"? I don't think I understand how the huggingface model--which seems to be basically organized hierarchically--works with something like this.

Especially around what seems like the fundamental question which is file ordering. Like I think it makes sense to have files be individual newspapers (or chronological subsets of newspapers), but that means there's waste if you try to subset by date of publication; and vice versa.

@davanstrien
Collaborator Author

Could you explain the notion of a "loading script"? I don't think I understand how the huggingface model--which seems to be basically organized hierarchically--works with something like this.

This depends a bit on how we decide to distribute the files. But one option is to have a dataset_script, which allows some control over which parts of the data are loaded. For example, if we have a bunch of files with a naming structure like TITLE_ID_YEAR.arrow, a script could be used to load only the requested parts. In practice, this means that when someone downloads the dataset, they only need to download the files they actually plan to use. It is also possible to do this filtering after the files are downloaded, but that only saves processing time/space, since the data still had to be downloaded first.

Especially around what seems like the fundamental question which is file ordering. Like I think it makes sense to have files be individual newspapers (or chronological subsets of newspapers), but that means there's waste if you try to subset by date of publication; and vice versa.

Perhaps a compromise between granularity and keeping the total number of files reasonable would be to organize each title into decade (or some other time span) buckets. Something like:

TITLE_A_1850_1859.arrow
TITLE_A_1860_1869.arrow
TITLE_A_1870_1879.arrow
TITLE_B_1850_1859.arrow

Then the dataset script can filter which files to load/download. I'll try and dig out some example scripts that have this kind of functionality and link them here. Happy to hear other suggestions for structuring things too.
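To make that concrete, here is a rough sketch of the selection logic such a script could use; the file names and parameter names are placeholders, not a committed API:

import re

ALL_FILES = [
    "TITLE_A_1850_1859.arrow",
    "TITLE_A_1860_1869.arrow",
    "TITLE_A_1870_1879.arrow",
    "TITLE_B_1850_1859.arrow",
]
FNAME_RE = re.compile(r"(?P<title>.+)_(?P<start>\d{4})_(?P<end>\d{4})\.arrow$")

def select_files(titles=None, min_year=None, max_year=None):
    """Return only the files a user actually needs, so only those get downloaded."""
    selected = []
    for fname in ALL_FILES:
        m = FNAME_RE.match(fname)
        if not m:
            continue
        if titles and m["title"] not in titles:
            continue
        if min_year and int(m["end"]) < min_year:
            continue
        if max_year and int(m["start"]) > max_year:
            continue
        selected.append(fname)
    return selected

# e.g. select_files(titles={"TITLE_A"}, min_year=1865) keeps only the 1860s and 1870s chunks of TITLE_A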

@bmschmidt

It seems like it would be possible to create the dataset according to reasonable chunkings and then afterwards write any post-hoc loading scripts that seemed like they'd be especially important? ("Austrian Papers," "German-language papers", "Communist papers", "Papers that published in the 1870s," etc.?)

FWIW, my solution for this was to break up newspapers into multiple files only when they got above a certain size. There are a lot of weekly or monthly publications of only a few pages which run for 20-30 years, for which it might be overkill to break up by decade. I think that I chunked by day but no more--it would certainly make sense to round to the nearest year to trim corpus items.

In terms of metadata--I think that the smallest unit of text should be the page, and that as much non-redundant metadata as possible should be supplied about each page. Columnar compression means that this won't be especially wasteful.

@davanstrien
Collaborator Author

FWIW, my solution for this was to break up newspapers into multiple files only when they got above a certain size. There are a lot of weekly or monthly publications of only a few pages which run for 20-30 years, for which it might be overkill to break up by decade. I think that I chunked by day but no more--it would certainly make sense to round to the nearest year to trim corpus items.

I guess we could also use bigger chunks if we're going to end up with some very small slices. This is probably a little unavoidable for titles with very few issues, but perhaps using 20-year chunks or larger would make sense for these titles?

In terms of metadata--I think that the smallest unit of text should be the page, and that as much non-redundant metadata as possible should be supplied about each page. Columnar compression means that this won't be especially wasteful.

Agree with this. I also feel uneasy about using the suggested article segmentation information, since the quality of that can be so variable and will vary between titles/dates of publication. Do you start from the ALTO XML in your current script?

@stefan-it
Contributor

Hi guys!

I worked with the Europeana dumps last week/weekend. Here are some observations:

The language information is stored in dc:language on the edm:ProvidedCHO element. Normally, you would expect either a string or an array of languages. But... it is mixed: for some issues it is a string and for some it is an array. So in our final metadata representation we should use an array. Also, the language codes are e.g. "de" rather than "deu".
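For the final representation, a small normalisation step along these lines (a sketch, not code from the dump tooling) would cover both cases:

def normalize_language(value):
    """Return dc:language as a list, whether the source gives a single string or an array."""
    if value is None:
        return []
    if isinstance(value, str):
        return [value]
    return list(value)

# normalize_language("de") -> ["de"]; normalize_language(["de", "fr"]) -> ["de", "fr"]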

Regarding the OCR confidence: it is not stored in the metadata dump. It is stored in ALTO at word level (!), so page-level or issue-level confidence needs to be calculated manually, and you definitely need to download the ALTO dump for that!

For German and French, I created some plots that show the number of issues per year, based on the language information in the metadata and on the dcterms:issued information.

For German it is:

[Plot: number of German-language issues per year]

For French:

[Plot: number of French-language issues per year]

So it seems that French data is very limited. I talked to @cneud, and a license change for the Gallica content (away from the public domain) could explain that.

I've also extracted plain text data from ALTO files for German. The resulting plain text file has a size of 63GB. For pretraining the German Europeana BERT models I've used an older dump and the resulting plain text data had a size of 51GB, so this newer dump is larger.

@davanstrien
Collaborator Author

Thanks so much for that @stefan-it. @bmschmidt @stefan-it, my suggested next step is to start with the smallest dataset from that dump to get to a format we're happy with. This will likely involve starting from the ALTO XML.

I think between us we probably all have some code for doing the ALTO XML parsing. As a starting point, I suggest we share that code (either by linking it here or adding a pull request to this repository) so we're not starting from scratch.

Does that sound okay to you both?

@cneud

cneud commented Jul 28, 2022

Hi, just to briefly chime in (I hope I can devote more time to this tomorrow) - I have a lot of background info, provenance and documentation about these datasets. While I am not passionate about the data formatting, I would appreciate it a lot if this information could somehow be integrated with the dataset (e.g. as a simple README.txt), as I often get questions about this and I believe there is a lot more relevant information available than what is shared on Europeana. Any thoughts on how best to include this are very welcome! Otherwise I can offer to write something down as Markdown or plain text when we have a shared repo.

@davanstrien
Collaborator Author

Hi, just to briefly chime in (I hope I can devote more time to this tomorrow) - I have a lot of background info, provenance and documentation about these datasets. While I am not passionate about the data formatting, I would appreciate it a lot if this information could somehow be integrated with the dataset (e.g. as a simple README.txt), as I often get questions about this and I believe there is a lot more relevant information available than what is shared on Europeana. Any thoughts on how best to include this are very welcome! Otherwise I can offer to write something down as Markdown or plain text when we have a shared repo.

That would be great — one option would be to include this in the datacard? We could also include it as part of the dataset too.

It would also be great to have any context for this data. If you think there is anyone in particular at Europeana who would be good to keep in the loop about this work, let me know.

@cneud

cneud commented Jul 28, 2022

one option would be to include this in the datacard

Good suggestion, but indeed I wonder if the datacard will always be distributed with the data? If not, a simple README.txt might be more suitable perhaps?

anyone in particular at Europeana

Well, that would mainly be me as I was coordinator of the project where the data was produced :) I have also been working with/been in contact with ~20-30 researchers/initiatives that used this dataset, created subsets and derivatives etc which may also be worthwhile sharing. And I can also name a colleague employed by Europeana whom we should loop in once any concrete steps are taken.

@davanstrien
Collaborator Author

Well, that would mainly be me as I was coordinator of the project where the data was produced :) I have also been working with/been in contact with ~20-30 researchers/initiatives that used this dataset, created subsets and derivatives etc which may also be worthwhile sharing. And I can also name a colleague employed by Europeana whom we should loop in once any concrete steps are taken.

Perfect! If you have time, I'm happy to set up a meeting to discuss an approach that also works from the Europeana side?
It would also be good to hear about any similar efforts; I definitely don't want to duplicate existing work.

@cneud

cneud commented Jul 28, 2022

Great! I don't want to overload this with things from the past, but I think this would present a great opportunity to capture and document some of the background and context that have been sitting in my head/inbox/fragmented over multiple project websites for a while. Should we try to find a suitable date/time via email?

@davanstrien
Collaborator Author

Great! I don't want to overload this with things from the past, but I think this would present a great opportunity to capture and document some of the background and context that have been sitting in my head/inbox/fragmented over multiple project websites for a while. Should we try to find a suitable date/time via email?

That sounds good, I'll drop you an email.

@cneud

cneud commented Jul 29, 2022

code for doing the ALTO XML parsing

Perhaps some of this code could be useful/repurposed:

@cneud

cneud commented Jul 29, 2022

Some initial input for the dataset card/README:

@stefan-it
Contributor

Hi @cneud, many thanks for that list!

I have one question left: was there any re-OCR done in the past years?

@cneud

cneud commented Jul 29, 2022

was there any re-ocr done in the past years

Unfortunately no. We are currently finalizing a report in which we compare the old OCR quality with the performance that can be achieved with state-of-the-art neural OCR/layout analysis methods (such as our eynollah), and I can already say that the quality improvements from re-OCRing would be considerable. Europeana currently has no capacity to re-OCR though, and the computational and organisational effort for doing this in a distributed setting would likely require another project with funding :(

@cneud

cneud commented Jul 29, 2022

Here is a quick mapping from Europeana dataset IDs to content providers:

Europeana ID  Library
9200359 National Library of the Netherlands
9200356 National Library of Estonia
9200301 National Library of Finland
9200408 National Library of France (unpublished due to license)
9200333 Tessmann Library South-Tyrol
9200303 National Library of Latvia
9200357 National Library of Poland
9200300 Austrian National Library
9200338 Hamburg State and University Library
9200355 Berlin State Library
9200339 Belgrade University Library
9200396 National Library of Luxembourg

@davanstrien
Collaborator Author

davanstrien commented Aug 1, 2022

Notes for discussion:

Background

  • Background documentation
  • Are ALTO formats consistent across collections?

Documentation

  • What to document

source

target format

  • Target output format: jsonl.gz, arrow?
  • What to include in target output (metadata + content)
  • how to split between files and/or directories

Info to include for each page:

{'OCRProcessing': {'processingDateTime': '2014-09-08',
                   'softwareCreator': 'ABBYY',
                   'softwareName': 'ABBYY FineReader Engine',
                   'softwareVersion': '11'},
 'language': 'FR',
 'mean_ocr': 0.8,
 'std_ocr': 0.1,
 'text': 'Text for page'}
@id: large_string
nc:text: large_string
newspaper_id: large_string
page: int32
dc:identifier: large_string
dc:language: large_string
dc:source: large_string
dc:subject: large_string
dc:title: large_string
dc:type: large_string
dc:extent: large_string
dc:isPartOf: large_string
dc:spatial: large_string
dc:relation: large_string
dc:hasPart: large_string
newspaper: large_string
dc:issued: date32[day]
  • IIIF image URLs for each page
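As a rough pyarrow sketch of the draft fields above (names and types mirror the list, only a subset is shown, and nothing here is final):

import pyarrow as pa

# Draft page-level schema; dc:issued as date32 matches the date32[day] entry above.
page_schema = pa.schema([
    ("@id", pa.large_string()),
    ("nc:text", pa.large_string()),
    ("newspaper_id", pa.large_string()),
    ("page", pa.int32()),
    ("dc:language", pa.large_string()),
    ("dc:title", pa.large_string()),
    ("dc:issued", pa.date32()),
])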

Info to store in the path:

  • the title of the newspaper?
  • year(s) of publication (or range)?
  • language?

configuration

  • sample pack
  • text mining pack
  • XML co-ordinates

@stefan-it
Contributor

stefan-it commented Aug 1, 2022

@davanstrien I will investigate some of the issues that have multiple languages in the dc:language field (resulting in an array data type) for both the dump and the API.

@cneud

cneud commented Aug 1, 2022

Are ALTO formats consistent across collections?

Within Europeana Newspapers, all OCR xml files are consistent, in that they are all using ALTO schema version 2.0.

Info to include for each page:

{'OCRProcessing': {'processingDateTime': '2014-09-08',
                   'softwareCreator': 'ABBYY',
                   'softwareName': 'ABBYY FineReader Engine',
                   'softwareVersion': '11'},

This part should be identical for most files and could also be documented at the global dataset level. If there are any different entries though, this would allow identifying pages that were also processed with article separation by CCS software (docWORKS) merely from the ALTO (i.e. without EDM or METS).

* sample pack

* text mining pack

* XML co-ordinates

I personally like the different "packs" example from the National Library of Luxembourg (see https://data.bnl.lu/data/historical-newspapers/ and scroll down a bit) - they offer different sizes and different flavours of the data. I wonder how much of the creation of such "packs" could be done dynamically by the loading script?

The plain text should be straightforward to extract (beware of hyphenation and reading order), but I suppose @stefan-it has already done that.

A simplistic way to calculate OCR confidence per page is here, but there are certainly better ways that consider string length, compute mean/avg etc.
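For reference, a minimal version of that per-page calculation from the ALTO word confidences could look like this (a sketch; the ALTO 2.x namespace URI is an assumption and should match what the files actually declare):

import statistics
import xml.etree.ElementTree as ET

ALTO_NS = {"alto": "http://www.loc.gov/standards/alto/ns-v2#"}

def page_ocr_confidence(alto_path):
    """Return (mean, std) of the word-level WC values on one ALTO page."""
    root = ET.parse(alto_path).getroot()
    scores = [
        float(el.attrib["WC"])
        for el in root.findall(".//alto:String", ALTO_NS)
        if "WC" in el.attrib
    ]
    if not scores:
        return None, None
    return statistics.mean(scores), statistics.pstdev(scores)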

For those interested in image content in newspapers, it may be sufficient to extract bounding box coordinates of illustrations (and possibly also check <GraphicalElement>) and keep that with the IIIF image URLs for each page, so that snippets of any image content detected on the pages can be automatically collected.

@bmschmidt

As mentioned in the call, I'm slapping my parsing code online. As mentioned in the blog post, this is all throwaway notebooks I wrote primarily just to get the Neue Freie Presse out for a grad student in my class, but I suspect it wouldn't be crazy hard to get it working on the other ALTO-XML dumps, if desirable.

Repo at https://github.com/bmschmidt/europapers

@davanstrien
Collaborator Author

@bmschmidt @cneud @stefan-it

Just to let you know, I am currently putting some processing code together for this. I'm essentially Frankensteining the code you all shared already. I'll hopefully have something to share tomorrow.

@stefan-it
Contributor

stefan-it commented Aug 5, 2022

Hi @davanstrien, I prepared a gist that shows how to parse the metadata information.

You basically just need to download the zip archives; there's no need to unpack them (it is all done in-memory):

https://gist.github.com/stefan-it/2b9b04caad3fd1d3ec94e5f1456cbd63
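(Not the gist itself, but the in-memory idea boils down to something like this; the .edm.xml file name pattern is an assumption about the dump layout:)

import zipfile
import xml.etree.ElementTree as ET

def iter_metadata_records(zip_path):
    """Yield (member name, parsed XML root) for each EDM record in a metadata zip, without unpacking it."""
    with zipfile.ZipFile(zip_path) as archive:
        for name in archive.namelist():
            if not name.endswith(".edm.xml"):
                continue
            with archive.open(name) as fh:
                yield name, ET.parse(fh).getroot()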

There are two examples:

  • issue-per-year distribution for German and French issues
  • extracting all issue IDs and metadata for issues with more than one detected language

Here's the list of issues with more than one detected language:

issues_with_more_languages.txt

It is pretty interesting, because we will need to discuss them, e.g. these kinds of entries:

3000051869684: MetaData(title='Österreichische Buchhändler-Correspondenz - 1870-02-20', year='1870', pages=8, languages=['==', 'de'])

Where == is mistakenly used as a language identifier?!

The ALTO parsing stuff is coming in another gist, soon :)

@davanstrien
Collaborator Author

Thanks for this, @stefan-it. I have the ALTO parsing done (adapting code from @cneud), but feel free to share yours if it's ready anyway :) For the metadata, I'm currently getting this via the API (adapting @bmschmidt's code).

I will check to see how different these two sets of metadata are for an item. I assume the API should hold fresher metadata in theory, but I don't know how much of a difference this makes in practice. If it's possible to use the dumps and there isn't much of a difference in the metadata, then we'll probably prefer to use the dump files instead.

I'm also adding the IIIF manifest URLs for each item, plus links to the IIIF image of the page (and for those items where the ALTO XML predicts illustrations, I'm including a list of IIIF URLs with those regions cropped).

@bmschmidt @cneud I don't feel it makes sense to include the full IIIF manifest in the records (just the URL) but let me know if you disagree. I would suggest including some demo code of how to grab the full manifest in the dataset card.
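For the dataset card, the demo snippet could be roughly this (a sketch; the manifest URL pattern is an assumption based on Europeana's IIIF Presentation API and should be double-checked):

import requests

def get_iiif_manifest(item_id):
    """item_id like '9200396/BibliographicResource_3000118436002'."""
    url = f"https://iiif.europeana.eu/presentation/{item_id}/manifest"
    response = requests.get(url, timeout=30)
    response.raise_for_status()
    return response.json()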

@davanstrien
Collaborator Author

Where == is mistakenly used as language identifier?!

I'm shocked you don't speak == 😜

@davanstrien
Collaborator Author

Example of metadata from dump:

{'rdf:RDF': {'@xmlns:cc': 'http://creativecommons.org/ns#',
  '@xmlns:dc': 'http://purl.org/dc/elements/1.1/',
  '@xmlns:dcterms': 'http://purl.org/dc/terms/',
  '@xmlns:doap': 'http://usefulinc.com/ns/doap#',
  '@xmlns:edm': 'http://www.europeana.eu/schemas/edm/',
  '@xmlns:foaf': 'http://xmlns.com/foaf/0.1/',
  '@xmlns:ore': 'http://www.openarchives.org/ore/terms/',
  '@xmlns:owl': 'http://www.w3.org/2002/07/owl#',
  '@xmlns:rdf': 'http://www.w3.org/1999/02/22-rdf-syntax-ns#',
  '@xmlns:rdfs': 'http://www.w3.org/2000/01/rdf-schema#',
  '@xmlns:skos': 'http://www.w3.org/2004/02/skos/core#',
  '@xmlns:svcs': 'http://rdfs.org/sioc/services#',
  '@xmlns:wgs84_pos': 'http://www.w3.org/2003/01/geo/wgs84_pos#',
  'edm:Place': {'@rdf:about': 'http://d-nb.info/gnd/4016680-6',
   'skos:prefLabel': 'Feldkirch'},
  'edm:ProvidedCHO': {'@rdf:about': 'http://data.theeuropeanlibrary.org/BibliographicResource/3000073475663',
   'dc:identifier': 'oai:fue.onb.at:EuropeanaNewspapers_Delivery_2:ONB_00268/1850/ONB_00268_18500115.zip',
   'dc:language': 'de',
   'dc:source': {'@rdf:resource': 'http://anno.onb.ac.at/cgi-content/anno?apm=0&aid=voz&datum=18500115'},
   'dc:subject': {'@rdf:resource': 'http://d-nb.info/gnd/4067510-5'},
   'dc:title': 'Vorarlberger Zeitung - 1850-01-15',
   'dc:type': [{'@rdf:resource': 'http://schema.org/PublicationIssue'},
    {'#text': 'Analytic serial', '@xml:lang': 'en'},
    {'#text': 'Newspaper', '@xml:lang': 'en'},
    {'#text': 'Newspaper Issue', '@xml:lang': 'en'}],
   'dcterms:extent': {'#text': 'Pages: 4', '@xml:lang': 'en'},
   'dcterms:isPartOf': [{'@rdf:resource': 'http://data.theeuropeanlibrary.org/BibliographicResource/3000073527530'},
    {'@rdf:resource': 'http://data.theeuropeanlibrary.org/Collection/a0600'},
    {'#text': 'Europeana Newspapers', '@xml:lang': 'en'}],
   'dcterms:issued': '1850-01-15',
   'dcterms:spatial': {'@rdf:resource': 'http://d-nb.info/gnd/4016680-6'},
   'edm:isNextInSequence': {'@rdf:resource': 'http://data.theeuropeanlibrary.org/BibliographicResource/3000073479497'},
   'edm:type': 'TEXT'},
  'edm:WebResource': [{'@rdf:about': 'http://iiif.onb.ac.at/images/ANNO/voz18500115/00000001/full/full/0/default.jpg',
    'svcs:has_service': {'@rdf:resource': 'http://iiif.onb.ac.at/images/ANNO/voz18500115/00000001'}},
   {'@rdf:about': 'http://iiif.onb.ac.at/images/ANNO/voz18500115/00000002/full/full/0/default.jpg',
    'edm:isNextInSequence': {'@rdf:resource': 'http://iiif.onb.ac.at/images/ANNO/voz18500115/00000001/full/full/0/default.jpg'},
    'svcs:has_service': {'@rdf:resource': 'http://iiif.onb.ac.at/images/ANNO/voz18500115/00000002'}},
   {'@rdf:about': 'http://iiif.onb.ac.at/images/ANNO/voz18500115/00000003/full/full/0/default.jpg',
    'edm:isNextInSequence': {'@rdf:resource': 'http://iiif.onb.ac.at/images/ANNO/voz18500115/00000002/full/full/0/default.jpg'},
    'svcs:has_service': {'@rdf:resource': 'http://iiif.onb.ac.at/images/ANNO/voz18500115/00000003'}},
   {'@rdf:about': 'http://iiif.onb.ac.at/images/ANNO/voz18500115/00000004/full/full/0/default.jpg',
    'edm:isNextInSequence': {'@rdf:resource': 'http://iiif.onb.ac.at/images/ANNO/voz18500115/00000003/full/full/0/default.jpg'},
    'svcs:has_service': {'@rdf:resource': 'http://iiif.onb.ac.at/images/ANNO/voz18500115/00000004'}}],
  'ore:Aggregation': {'@rdf:about': 'http://data.theeuropeanlibrary.org/BibliographicResource/3000073475663#aggregation',
   'edm:aggregatedCHO': {'@rdf:resource': 'http://data.theeuropeanlibrary.org/BibliographicResource/3000073475663'},
   'edm:dataProvider': 'Österreichische Nationalbibliothek - Austrian National Library',
   'edm:hasView': [{'@rdf:resource': 'http://iiif.onb.ac.at/images/ANNO/voz18500115/00000002/full/full/0/default.jpg'},
    {'@rdf:resource': 'http://iiif.onb.ac.at/images/ANNO/voz18500115/00000003/full/full/0/default.jpg'},
    {'@rdf:resource': 'http://iiif.onb.ac.at/images/ANNO/voz18500115/00000004/full/full/0/default.jpg'}],
   'edm:isShownAt': {'@rdf:resource': 'http://anno.onb.ac.at/cgi-content/anno?apm=0&aid=voz&datum=18500115'},
   'edm:isShownBy': {'@rdf:resource': 'http://iiif.onb.ac.at/images/ANNO/voz18500115/00000001/full/full/0/default.jpg'},
   'edm:object': {'@rdf:resource': 'http://iiif.onb.ac.at/images/ANNO/voz18500115/00000001/full/full/0/default.jpg'},
   'edm:provider': {'#text': 'The European Library', '@xml:lang': 'en'},
   'edm:rights': {'@rdf:resource': 'http://creativecommons.org/publicdomain/mark/1.0/'}},
  'svcs:Service': [{'@rdf:about': 'http://iiif.onb.ac.at/images/ANNO/voz18500115/00000003',
    'dcterms:conformsTo': {'@rdf:resource': 'http://iiif.io/api/image'},
    'doap:implements': {'@rdf:resource': 'http://iiif.io/api/image/2/level2.json'}},
   {'@rdf:about': 'http://iiif.onb.ac.at/images/ANNO/voz18500115/00000002',
    'dcterms:conformsTo': {'@rdf:resource': 'http://iiif.io/api/image'},
    'doap:implements': {'@rdf:resource': 'http://iiif.io/api/image/2/level2.json'}},
   {'@rdf:about': 'http://iiif.onb.ac.at/images/ANNO/voz18500115/00000004',
    'dcterms:conformsTo': {'@rdf:resource': 'http://iiif.io/api/image'},
    'doap:implements': {'@rdf:resource': 'http://iiif.io/api/image/2/level2.json'}},
   {'@rdf:about': 'http://iiif.onb.ac.at/images/ANNO/voz18500115/00000001',
    'dcterms:conformsTo': {'@rdf:resource': 'http://iiif.io/api/image'},
    'doap:implements': {'@rdf:resource': 'http://iiif.io/api/image/2/level2.json'}}]}}

@bmschmidt

@albertvillanova does the Hugging Face datasets API support any standards for rich descriptions like this in the Arrow metadata, at either the file or record-batch level? It seems like a shame to throw it away. I've had on the back burner for a while a scheme to get ML people using the column description format of the W3C's CSV on the Web spec, which is a bit too much to bite off here; but as a stopgap I often try to put some of this stuff into Arrow metadata where it won't get in anyone's way. But sometimes loading scripts won't copy the metadata parts of the Arrow schema.

(Sorry if I'm just making this over-complicated--I'm asking b/c I think this is an interesting test case of some places where these fields don't speak each other's language.)

@bmschmidt

@davanstrien Thanks for tackling all this. One small note--all the metadata I could find was of the form 'dc:title': 'Vorarlberger Zeitung - 1850-01-15', but I think for typical use cases it's important to drop the date information (which is captured in dcterms:issued) from the end of the title to allow more regular filtering.
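Something along these lines would do it (a sketch; it assumes the suffix is always ' - YYYY-MM-DD'):

import re

# Strip a trailing " - YYYY-MM-DD" from dc:title, since the date already lives in dcterms:issued.
TRAILING_DATE = re.compile(r"\s*-\s*\d{4}-\d{2}-\d{2}\s*$")

def clean_title(title):
    return TRAILING_DATE.sub("", title)

# clean_title("Vorarlberger Zeitung - 1850-01-15") -> "Vorarlberger Zeitung"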

@davanstrien
Collaborator Author

@davanstrien Thanks for tackling all this. One small note--all the metadata I could find was of the form 'dc:title': 'Vorarlberger Zeitung - 1850-01-15', but I think for typical use cases it's important to drop the date information (which is captured in dcterms:issued) from the end of the title to allow more regular filtering.

My current plan was to parse some fields that we consider particularly useful, with a bit of additional validation, and then to shove some other metadata (the full extent is still up for discussion) either into flattened columns or into a somewhat generic metadata dump column. I'm hoping to have a proper draft of this ready on Monday. I'll ping you, @cneud and @stefan-it, to discuss the output.

@davanstrien
Collaborator Author

A quick update on this:

I have some semi-working code with some rough edges here (https://github.com/davanstrien/altoxml2dataset).

Currently, the code:

  • parses the text from the ALTO XML
  • gets the word confidence and calculates a mean and standard deviation for each page
  • gets the bounding boxes for any illustrations contained on the page.

I'm currently parsing the metadata from the XML metadata dumps on the website. I went back and forth on this, but I think it might be the best option for now, with the possibility of giving guidance on accessing the metadata from the API in the documentation for the dataset (more on this below).

For the metadata I currently get:

  • the title of the newspaper
  • the languages: I'm filtering out ==. If someone finds out that this is significant in some way, we can add it back in.
  • the date of publication
  • the IIIF URL for the item: for this, I didn't add the URLs for all the bounding boxes; again, I thought these would be better to include in the documentation.

Questions

  • what (other) metadata do we want to include in a 'flattened' format, i.e. include a specific column for that field with a nice name?
    I plan to also use the dates and the languages (where there is only a single one) in the filenames for the parquet files. We can return to this once we've agreed on the other fields and decide what level of granularity makes sense.

  • I have tried dumping most of the metadata inside a dictionary to a 'complete/additional metadata' column, but at the moment, Arrow gets upset at this. I think there are three options here:

    • flattening the metadata so Arrow stops complaining
    • being a bit more selective about which subset of metadata to put into the 'additional metadata' column
    • putting the link to the full metadata with some example code for retrieving this in the dataset documentation

I'm leaning toward the third option but happy to hear arguments in favour of the others.

An example instance (the filepaths will be tidied to ensure we don't have the parent directories):

{'fname': 'test_data/9200396/BibliographicResource_3000118436002/75.xml',
 'text': "15 * Décembre qu’ils avoient fous leurs yeux qu’il falloit s’attacher, puifque le falut de tous en rélul- toit ; auffi-tôt il les guide , prend une échelle, & monte avec eux fur le toit de cette, grange grange , où ils paffent la nuit à y étouffer les étincelles étincelles & à faire tomber les charbons que l’activité l’activité des flammes lançoit continuellement fur cette couverture de chaume. Ainû votant leur patrimoine fe confumer près d’eux, ils fe dévouèrent généreufement à un travail qui devoir conferver au moins deux cents mai- fons, & qui en effet arrêta les progrès de l’incendie. On écrit de St. Euftache qu’un ouragan plus terrible que celui de 1738 a caufé des dommages confidérables à la Guadeloupe & à Grandiere : tous les vaiffeaux ont été jettés fort avant dans les terres, & 011 défef- pere de pouvoir les remettre en mer. Indépendamment Indépendamment de quelques maifons de pierre, toutes celles qui étoient bâties en bois, ont été abattues par la violence du vent. Cet ouragan étoit accompagné d’une pluie conli- dérable, qui en peu de tems forma une efpece de déluge. On vient d’établir, fous la protection du gouvernement, une manufacture de Aparté- rie au fauxbourg St. Antoine; c’eft une fabrication fabrication de cordages avec la plante que les naturalises appellent gramen fpartcnm. On connoit dans la marine le fparion , cordage de genêt d’Efpagne, d’Afrique & de Murcie; Murcie; d’un bon ufage, l'oit à l’eau de mer, foie à l’eau douce. Hé fleur Bcithc, qui dt 651",
 'mean_ocr': 0.5280378429774902,
 'std_ocr': 0.18456327090617758,
 'bounding_boxes': [],
 'item_id': '9200396/BibliographicResource_3000118436002',
 'metadata_xml_fname': 'test_data/metadata/http%3A%2F%2Fdata.theeuropeanlibrary.org%2FBibliographicResource%2F3000118436002.edm.xml',
 'title': 'Journal historique et littéraire',
 'date': '1776-12-15',
 'languages': ['fr'],
 'item_iiif_url': 'https://iiif.europeana.eu/image/2TS6TSUK5ULAT2TQMYN7UGBDCPKQBLHQPTPDD6GGYB4QOZXR72EQ/presentation_images/bc994340-0232-11e6-a696-fa163e2dd531/node-3/image/BNL/Journal_historique_et_littéraire/1776/12/15/00547/full/full/0/default.jpg',
 'multi_language': False}

There are quite a few rough edges/things to finish with the code, but I wanted to get input on these things before proceeding too far down one path.

I have made minimal effort to make any of this code performant (the only considerations on this front are using slotted classes and multiprocessing). Since this code will only be run occasionally, I don't think much optimization is worthwhile here, but I'll take a quick look for any easy improvements later this week.

@bmschmidt

bmschmidt commented Aug 9, 2022

Thanks for tackling this.

I don't know what you've got in additional metadata, but the simplest route to shoving 'etc' into a column is to encode it as JSON before stuffing it in there. If it's relatively short, that might be worth it. One thing to avoid at all costs is a plan that works for Arrow-encoding each individual file but ends up with different schemas for different files.
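E.g. something like this (a sketch; pack_extra_metadata and core_fields are placeholder names):

import json

def pack_extra_metadata(record, core_fields):
    """Serialise everything outside the core columns into one JSON string column, so every file keeps the same schema."""
    extra = {k: v for k, v in record.items() if k not in core_fields}
    return json.dumps(extra, ensure_ascii=False, sort_keys=True)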

Shorter notes, which are extremely pedantic and I'm sorry for that but I feel like that's the name of the game here.

  1. Rather than item_id, which is obscure in both what an item is and what the namespace is, I'd prefer a key called issue_uri of the form https://www.europeana.eu/item/9200396/BibliographicResource_3000118436002. Reason being that it's a sin to replace a universal ID with a local one, and that it's easy to misread 'item_id' as referring to 'this page' rather than 'the issue containing this page.'
  2. I don't think 'fname' is a meaningful field. 'fname': 'test_data/9200396/BibliographicResource_3000118436002/75.xml' and metadata_xml_fname capture paths that are very dump-dependent.
  3. There is a need for an ID for each individual page within the set, probably just called id or @id: I would suggest 9200396/BibliographicResource_3000118436002/75 or 9200396/BibliographicResource_3000118436002$75.
  4. Rather than a local path to metadata_xml_fname, just use the URL buried in there.
  5. I prefer language to languages because it's a dc term in the original items. I don't think 'multi_language' is a necessary column.
  6. This one's harder, but I prefer 'issued' to 'date' because it's a dc term in the original items and using 'date' on a column that consists of dates is redundant.
  7. There is an argument for encoding 'date' as an Arrow timestamp rather than a string that is implicitly ISO 8601. Since Arrow cannot automatically import all valid ISO 8601 date strings (e.g., 1945-12 is a valid date), it's probably worth checking whether automatic conversion works (see the sketch after this list).
  8. It would be nice to include the city of publication (probably by its present-day country and present-day name, maybe the LOD form) in the metadata.
  9. Perhaps someone at Europeana can confirm whether these hash-laden URLs are the best we can do? Good lord, they're terrible. 'item_iiif_url': 'https://iiif.europeana.eu/image/2TS6TSUK5ULAT2TQMYN7UGBDCPKQBLHQPTPDD6GGYB4QOZXR72EQ/presentation_images/bc994340-0232-11e6-a696-fa163e2dd531/node-3/image/BNL/Journal_historique_et_littéraire/1776/12/15/00547/full/full/0/default.jpg'
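On point 7, a quick check along these lines would settle it (a sketch; it relies on pyarrow's strptime, which fails on partial dates such as "1945-12"):

import pyarrow as pa
import pyarrow.compute as pc

issued = pa.array(["1850-01-15", "1945-12"])
try:
    # All-complete dates convert cleanly to timestamps (and could then be cast to a date type).
    print(pc.strptime(issued, format="%Y-%m-%d", unit="s"))
except pa.ArrowInvalid as err:
    # Partial ISO 8601 dates mean the column has to stay a string (or be handled separately).
    print(f"keep dcterms:issued as strings: {err}")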

@cneud

cneud commented Aug 9, 2022

Perhaps someone at Europeana can confirm whether these hash-laden URLs are the best we can do? Good lord, they're terrible. 'item_iiif_url': 'https://iiif.europeana.eu/image/2TS6TSUK5ULAT2TQMYN7UGBDCPKQBLHQPTPDD6GGYB4QOZXR72EQ/presentation_images/bc994340-0232-11e6-a696-fa163e2dd531/node-3/image/BNL/Journal_historique_et_littéraire/1776/12/15/00547/full/full/0/default.jpg'

...calling Europeana's @hugomanguinhas - any ideas or suggestions?

@hugomanguinhas

Even though the images are being served via a Europeana domain (via our gateway)... they are actually hosted by the technical partner (PSNC) in the Europeana Newspapers project, and the URLs are based on the eCloud infrastructure, which has a very complex naming and versioning system... This is something we don't control, and I do agree that the URLs seem unnecessarily long and also make our IIIF output bigger than it could be.

@davanstrien
Collaborator Author

@bmschmidt @cneud @hugomanguinhas thanks all; I will try and find time to work on this a bit more next week. I hope to have an initial version of the full output to review by then.

@davanstrien
Collaborator Author

@bmschmidt @cneud @hugomanguinhas thanks all; I will try and find time to work on this a bit more next week. I hope to have an initial version of the full output to review by then.

Apologies for the radio silence on this, I got busy with other things. I have blocked out some time to work on this later this week.

@davanstrien
Collaborator Author

I finally have the start of a suggested approach for this dataset. The dataset can be found here: https://huggingface.co/datasets/biglam/europeana_newspapers.

This repo includes a sample of parquet files with text and some metadata, organised by language and decade, i.e. language-decade.parquet.

There is also a loading script which allows you to load a subset of this data, either using an existing language configuration, which you can list using:

from datasets import get_dataset_config_names
get_dataset_config_names("biglam/europeana_newspapers")
>>> ['sv', 'fi']

or you can pass in a list of languages you want to load:

from datasets import load_dataset

dataset = load_dataset("biglam/europeana_newspapers", languages=["sv","fi"])

There is also an option to filter by min/max decade:

from datasets import load_dataset

dataset = load_dataset("biglam/europeana_newspapers", languages=["sv","fi"], min_decade=1910)

All of these options will only download the data required. If you only ask for fr, you won't download any other languages, i.e. you only download what is needed. Although the data size is much reduced compared to the original XML files, I think it's still desirable to make it easy to filter before downloading as far as practical.

The loading script still has some rough edges, but hopefully, this gives a sense of how it will work.

What about the additional metadata?

I felt that shoving all the extra metadata into a column was getting a bit clunky. Instead, I'm going to outline in the README/Datacard how to do this using the Europeana API. This will give people a sense of how they can grab the metadata they need. I will also create another version of this dataset that includes this extra metadata (also grabbed using the API). We can then see which ends up being downloaded more often, which may give us some sense of how much the potential users of this data value the additional metadata.

What about the rest of the data?

I will add all of the data for this shortly, but I thought this already gave a sense of how things would work.

cc @cneud @bmschmidt @stefan-it @cakiki

I would be happy to have feedback if you think anything is missing or could be improved.

@cneud

cneud commented Oct 25, 2022

Thank you @davanstrien! It would be awesome if one could also load subsets e.g. based on their OCR confidence, but since this information is not included in the Europeana metadata but only in the ALTO files I don't think it can be done so easily.

Generally I agree that due to different interests and also size constraints, offloading the fetching of additional metadata to the use of the Europeana API is a reasonable way forward. I will also try to contribute to the README/Datacard.

@davanstrien
Collaborator Author

Thank you @davanstrien! It would be awesome if one could also load subsets e.g. based on their OCR confidence, but since this information is not included in the Europeana metadata but only in the ALTO files I don't think it can be done so easily.

There would be a way to filter by OCR confidence as part of the loading script, but this would still involve downloading all of the data first, so it probably isn't worth it. It might also end up being more efficient to do the filtering once the dataset is loaded, since you can then also use multiprocessing, i.e. something like:

good_ocr_ds = ds.filter(lambda x: x['mean_ocr']>=0.9, num_proc=8)

Generally I agree that due to different interests and also size constraints, offloading the fetching of additional metadata to the use of the Europeana API is a reasonable way forward. I will also try to contribute to the README/Datacard.

That would be great, I'll make a start on that soon!
