A URL for this dataset
https://chroniclingamerica.loc.gov/about/api/#bulk-data
Dataset description
Chronicling America is a Library of Congress project to digitise historic newspapers. The collection is mostly in English but also includes other languages. Breakdown by language: https://public.tableau.com/app/profile/chronicling.america#!/vizhome/ChroniclingAmericaLanguageCoverageBubble/All_Lang
Ways of accessing this data include bulk downloads and an API. The API may be the most helpful way of accessing this dataset (via a dataset loading script) because the dataset is not static: more titles are digitised and added on a rolling basis.
The 'newspapers' API (https://chroniclingamerica.loc.gov/newspapers.json) is probably the best starting point. It returns a list of the newspaper titles for which digital content is held. The record for a title, e.g. https://chroniclingamerica.loc.gov/lccn/sn86072192.json, contains metadata about that title.
This API also contains all the issues for that title. For each issue, you get a set of pages. Each page contains the plain text generated from the OCR for that page, e.g. https://chroniclingamerica.loc.gov/lccn/sn82014726/1888-04-07/ed-1/seq-1/ocr.txt and a link to the image of that page, e.g. https://chroniclingamerica.loc.gov/lccn/sn82014726/1888-04-07/ed-1/seq-1.jp2.
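The two example links above share a common stem, so the page-level URLs can be built from a title's LCCN, the issue date, the edition and the page sequence number. Here is a minimal sketch, assuming every page follows the same pattern as those examples; the function name `page_urls` and its parameter names are made up for illustration.

```python
from datetime import date

BASE_URL = "https://chroniclingamerica.loc.gov"


def page_urls(lccn: str, issued: date, edition: int = 1, sequence: int = 1) -> dict:
    """Build the OCR text and image URLs for a single page.

    Assumes the pattern seen in the example links:
    /lccn/<lccn>/<YYYY-MM-DD>/ed-<edition>/seq-<sequence>
    """
    stem = f"{BASE_URL}/lccn/{lccn}/{issued.isoformat()}/ed-{edition}/seq-{sequence}"
    return {
        "ocr_text": f"{stem}/ocr.txt",  # plain text produced by OCR
        "image": f"{stem}.jp2",         # JPEG 2000 scan of the page
    }


urls = page_urls("sn82014726", date(1888, 4, 7))
print(urls["ocr_text"])
print(urls["image"])
```

With the arguments shown, this reproduces the two example URLs given above.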
My suggested approach to loading this dataset would be to call https://chroniclingamerica.loc.gov/newspapers.json at the start of the script and, depending on some filters defined in the loading script, e.g. start/end date of interest, build up a list of relevant URLs for the text/images for each page.
If you want to work on this dataset, please cc @davanstrien and @albertvillanova!
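A rough sketch of that approach follows. It is not a finished loading script, and the JSON field names it relies on (`newspapers`, `url`, `issues`, `date_issued`, `pages`) are assumptions about the API responses that would need checking against the live endpoints. The helper names `fetch_json`, `strip_json` and `iter_page_urls` are hypothetical.

```python
import json
from datetime import date
from urllib.request import urlopen

NEWSPAPERS_URL = "https://chroniclingamerica.loc.gov/newspapers.json"


def fetch_json(url):
    """Download a URL and parse the response body as JSON."""
    with urlopen(url) as response:
        return json.load(response)


def strip_json(url):
    """Drop a trailing '.json' so the stem can be reused for other formats."""
    return url[: -len(".json")] if url.endswith(".json") else url


def iter_page_urls(start, end, fetch=fetch_json):
    """Yield text/image URLs for every page issued between start and end.

    Field names below are assumed, not confirmed by the issue text.
    """
    for paper in fetch(NEWSPAPERS_URL)["newspapers"]:
        title = fetch(paper["url"])
        for issue in title["issues"]:
            issued = date.fromisoformat(issue["date_issued"])
            if not (start <= issued <= end):
                continue  # skip issues outside the date filter
            for page in fetch(issue["url"])["pages"]:
                stem = strip_json(page["url"])
                yield {
                    "date": issued,
                    "ocr_text": f"{stem}/ocr.txt",
                    "image": f"{stem}.jp2",
                }
```

Passing `fetch` as a parameter keeps the filtering logic testable without network access. A real loading script would probably also want rate limiting and caching, since this makes one request per title and one per matching issue.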
Dataset modality
Mixed
Dataset licence
Other licence
https://chroniclingamerica.loc.gov/about/#rights
How can you access this data
Via an open API
size of dataset
Confirm the dataset has an open licence
Contact details for data custodian
No response