Add dataset: chronicling_america #85

davanstrien · 2022-09-27T15:37:49Z

A URL for this dataset

https://chroniclingamerica.loc.gov/about/api/#bulk-data

Dataset description

Chronicling America is a Library of Congress project to digitise historic newspapers. The collection is mostly in English but also includes other languages. Breakdown by language: https://public.tableau.com/app/profile/chronicling.america#!/vizhome/ChroniclingAmericaLanguageCoverageBubble/All_Lang

The data can be accessed through bulk downloads or an API. The API is probably the most useful route (via a dataset loading script) because the dataset is not static: more titles are digitised and added on a rolling basis.

The 'newspapers' API (https://chroniclingamerica.loc.gov/newspapers.json) is probably the best starting point. It returns a list of newspaper titles for which digital content is held. A title record, e.g. https://chroniclingamerica.loc.gov/lccn/sn86072192.json, contains metadata about that title.
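As a rough sketch of that starting point, the snippet below pulls the title listing and extracts the per-title JSON URLs. The helper names (`fetch_json`, `title_urls`) are hypothetical, and the payload shape (a top-level `newspapers` list whose entries carry `lccn` and `url` keys) is an assumption that should be checked against a live response.

```python
import json
from urllib.request import urlopen

NEWSPAPERS_URL = "https://chroniclingamerica.loc.gov/newspapers.json"


def fetch_json(url, timeout=30):
    """Download and parse a JSON document (makes a network request)."""
    with urlopen(url, timeout=timeout) as response:
        return json.load(response)


def title_urls(listing):
    """Return (lccn, url) pairs from a parsed newspapers.json payload.

    Assumed shape: {"newspapers": [{"lccn": ..., "url": ...}, ...]}.
    Entries missing either key are skipped.
    """
    pairs = []
    for entry in listing.get("newspapers", []):
        lccn, url = entry.get("lccn"), entry.get("url")
        if lccn and url:
            pairs.append((lccn, url))
    return pairs
```

In a loading script this might be used as `title_urls(fetch_json(NEWSPAPERS_URL))`, with each returned URL fetched in turn to get the title metadata.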

The title record also lists all the issues for that title. For each issue, you get a set of pages. Each page has the plain text generated by OCR for that page, e.g. https://chroniclingamerica.loc.gov/lccn/sn82014726/1888-04-07/ed-1/seq-1/ocr.txt, and a link to the image of that page, e.g. https://chroniclingamerica.loc.gov/lccn/sn82014726/1888-04-07/ed-1/seq-1.jp2.
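The two example URLs above share one pattern: LCCN, issue date, edition number, and page sequence. A minimal sketch that rebuilds both URLs from those parts follows. The function name `page_urls` is hypothetical, and the pattern is inferred only from the two examples, so it may not hold for every title.

```python
BASE_URL = "https://chroniclingamerica.loc.gov"


def page_urls(lccn, date_issued, edition=1, sequence=1):
    """Build the OCR text and JP2 image URLs for a single page.

    The pattern is inferred from the example URLs in this issue.
    date_issued is an ISO date string such as "1888-04-07".
    """
    page = f"{BASE_URL}/lccn/{lccn}/{date_issued}/ed-{edition}/seq-{sequence}"
    return {"text": f"{page}/ocr.txt", "image": f"{page}.jp2"}
```

Where the page JSON already supplies these links, using them directly is likely safer than reconstructing them.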

My suggested approach to loading this dataset would be to call https://chroniclingamerica.loc.gov/newspapers.json at the start of the script and, depending on filters defined in the loading script (e.g. a start/end date of interest), build up a list of relevant URLs for the text/images for each page.
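The date filter could look roughly like the sketch below, which selects the issue URLs of one title that fall inside a date range. The function name `issues_in_range` is hypothetical, and the record shape (an `issues` list whose entries carry `date_issued` as an ISO date string and `url`) is an assumption about the title JSON rather than something confirmed here.

```python
from datetime import date


def issues_in_range(title_record, start, end):
    """Return URLs of issues published between start and end (inclusive).

    Assumed shape:
    {"issues": [{"date_issued": "YYYY-MM-DD", "url": ...}, ...]}
    """
    selected = []
    for issue in title_record.get("issues", []):
        issued = date.fromisoformat(issue["date_issued"])
        if start <= issued <= end:
            selected.append(issue["url"])
    return selected
```

Each selected issue URL would then be fetched to enumerate its pages, giving the final list of text and image URLs to download.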

If you want to work on this dataset, please cc @davanstrien and @albertvillanova!

Dataset modality

Mixed

Dataset licence

Other license

Other licence

https://chroniclingamerica.loc.gov/about/#rights

How can you access this data

Via an open API

size of dataset

10GB

Confirm the dataset has an open licence

To the best of my knowledge, this dataset is accessible via an open licence

Contact details for data custodian

No response


davanstrien added the candidate-dataset (Proposed dataset to be added) and dataset (Dataset to be added) labels and removed the candidate-dataset label, Sep 27, 2022

bigscience-workshop-projects bot moved this to Todo in BigLAM: BigScience Libraries, Archives and Museums Sep 27, 2022

bigscience-workshop-projects bot added this to BigLAM: BigScience Libraries, Archives and Museums Sep 27, 2022
