Add dataset: chronicling_america #85

Open
1 task done
davanstrien opened this issue Sep 27, 2022 · 0 comments
Labels
dataset Dataset to be added

Comments

@davanstrien
Collaborator

A URL for this dataset

https://chroniclingamerica.loc.gov/about/api/#bulk-data

Dataset description

Chronicling America is a Library of Congress project to digitise historic newspapers. The collection is mostly in English but also includes other languages. Breakdown by language: https://public.tableau.com/app/profile/chronicling.america#!/vizhome/ChroniclingAmericaLanguageCoverageBubble/All_Lang

This data can be accessed in several ways, including bulk downloads and an API. The API may be the most helpful route (via a dataset loading script) because the dataset is not static: more titles are digitised and added on a rolling basis.

The 'newspapers' API (https://chroniclingamerica.loc.gov/newspapers.json) is probably the best starting point. It starts from a list of newspaper titles for which digital content is held. The record for a title, e.g. https://chroniclingamerica.loc.gov/lccn/sn86072192.json, contains a range of metadata.

The title record also lists all the issues for that title. For each issue, you get a set of pages. Each page has the plain text generated from the OCR for that page, e.g. https://chroniclingamerica.loc.gov/lccn/sn82014726/1888-04-07/ed-1/seq-1/ocr.txt, and a link to the image of that page, e.g. https://chroniclingamerica.loc.gov/lccn/sn82014726/1888-04-07/ed-1/seq-1.jp2.
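
To make the URL pattern concrete, here is a minimal sketch that builds the OCR text and image URLs for one page from the two example URLs above. The function name `page_urls` and its parameters are my own, not part of the API, and I'm assuming the pattern holds for other titles, editions and page sequences.

```python
BASE_URL = "https://chroniclingamerica.loc.gov"


def page_urls(lccn: str, issue_date: str, edition: int, sequence: int) -> dict:
    """Build the OCR text and JP2 image URLs for a single newspaper page.

    Hypothetical helper: the URL layout is inferred from the two example
    URLs in this issue, not taken from official API documentation.
    """
    # e.g. .../lccn/sn82014726/1888-04-07/ed-1/seq-1
    page = f"{BASE_URL}/lccn/{lccn}/{issue_date}/ed-{edition}/seq-{sequence}"
    return {"ocr": f"{page}/ocr.txt", "image": f"{page}.jp2"}


urls = page_urls("sn82014726", "1888-04-07", edition=1, sequence=1)
print(urls["ocr"])
print(urls["image"])
```

In practice it is probably safer to follow the links returned by the API than to construct URLs by hand, but the helper shows what a loading script would end up requesting.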

My suggested approach to loading this dataset would be to call https://chroniclingamerica.loc.gov/newspapers.json at the start of the script and, depending on some filters defined in the loading script (e.g. start/end date of interest), build up a list of relevant URLs for the text/images for each page.
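
A rough sketch of that approach, using only the standard library, might look like the following. The JSON keys (`newspapers`, `url`, `issues`, `date_issued`, `pages`) are my assumptions about the response shape and should be checked against the live API. The names `fetch_json` and `iter_page_urls` are hypothetical, and a real loading script would also want rate limiting and error handling.

```python
import json
from datetime import date
from urllib.request import urlopen

NEWSPAPERS_URL = "https://chroniclingamerica.loc.gov/newspapers.json"


def fetch_json(url):
    """Download a URL and decode the body as JSON."""
    with urlopen(url) as response:
        return json.load(response)


def iter_page_urls(start, end, fetch=fetch_json):
    """Yield (ocr_url, image_url) for pages issued between start and end.

    Assumed response shapes (not confirmed in this issue):
      newspapers.json -> {"newspapers": [{"url": <title json url>, ...}]}
      title json      -> {"issues": [{"url": ..., "date_issued": "YYYY-MM-DD"}]}
      issue json      -> {"pages": [{"url": <page json url>, ...}]}
    """
    for paper in fetch(NEWSPAPERS_URL)["newspapers"]:
        title = fetch(paper["url"])
        for issue in title["issues"]:
            issued = date.fromisoformat(issue["date_issued"])
            if not (start <= issued <= end):
                continue  # outside the date range of interest
            for page in fetch(issue["url"])["pages"]:
                # Assumes page URLs end in ".json"; strip it to get the page base.
                base = page["url"].removesuffix(".json")
                yield base + "/ocr.txt", base + ".jp2"
```

Passing `fetch` as a parameter keeps the filtering logic testable without hitting the network. Usage would be something like `list(iter_page_urls(date(1888, 1, 1), date(1888, 12, 31)))`, which makes one request per title and per matching issue, so it will be slow over the full collection.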

If you want to work on this dataset, please cc @davanstrien and @albertvillanova!

Dataset modality

Mixed

Dataset licence

Other license

Other licence

https://chroniclingamerica.loc.gov/about/#rights

How can you access this data

Via an open API

Size of dataset

10GB

Confirm the dataset has an open licence

  • To the best of my knowledge, this dataset is accessible via an open licence

Contact details for data custodian

No response
