
Add dataset: distantreader #80

Open

davanstrien opened this issue Aug 5, 2022 · 2 comments
Labels
candidate-dataset Proposed dataset to be added


@davanstrien (Collaborator)

A URL for this dataset

http://library.distantreader.org/

Dataset description

This is a fledgling collection of data sets created by the Distant Reader -- a library.

The result of the Distant Reader process is the creation of data sets called "study carrels". The student, researcher, or scholar can then use a study carrel to read the content of a carrel both closely and at a distance. The purpose of the library is to illustrate and demonstrate the breadth & depth of what can be created with the Reader. Presently, there are about 2,000 items in the collection.

Dataset modality

Text

Dataset licence

No response

Other licence

No response

How can you access this data

Other

Size of dataset

No response

Confirm the dataset has an open licence

  • To the best of my knowledge, this dataset is accessible via an open licence

Contact details for data custodian

No response

@davanstrien davanstrien added the candidate-dataset Proposed dataset to be added label Aug 5, 2022
@davanstrien (Collaborator, Author)

@ericleasemorgan I thought we could use this issue to discuss further the best approach for this dataset :)

@davanstrien (Collaborator, Author)

I'll try to take a closer look at this again next week, but some initial thoughts below:

Just to reiterate, the next step is for me to write a little Python script with a specific name and in a specific location. The script is really a set of classes and objects denoting features of my data sets. I then run another script that looks for the script I just wrote, and it will output data files of some sort, and those data files will go to 🌸? Finally, somebody will then be able to run something like the above (ds = load_dataset('load_distant_reader', name='homer')) to actually load my data sets. Correct?

This would be one way of managing this. I think it depends a little bit on how you think a user might be likely to interact with this dataset. Since you already support quite a few ways of working with the existing collections using your library, it probably doesn't make sense to recreate this behaviour.
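
For concreteness, here's a minimal sketch of what such a loading script could look like. The script name, class name, catalogue URL, and feature schema below are all assumptions for illustration, not the actual Distant Reader layout:

```python
# load_distant_reader.py -- illustrative skeleton of a `datasets` loading script.
# The catalogue URL and feature names are placeholders, not confirmed endpoints.
import datasets

_CATALOGUE_URL = "http://library.distantreader.org/catalog.json"  # assumed location


class DistantReader(datasets.GeneratorBasedBuilder):
    def _info(self):
        return datasets.DatasetInfo(
            description="Study carrels from the Distant Reader library.",
            features=datasets.Features(
                {
                    "id": datasets.Value("string"),
                    "text": datasets.Value("string"),
                }
            ),
            homepage="http://library.distantreader.org/",
        )

    def _split_generators(self, dl_manager):
        catalogue_path = dl_manager.download(_CATALOGUE_URL)
        # Everything lands in one split; see the note on splits further down.
        return [
            datasets.SplitGenerator(
                name=datasets.Split.TRAIN,
                gen_kwargs={"catalogue_path": catalogue_path},
            )
        ]

    def _generate_examples(self, catalogue_path):
        # Real logic would walk the catalogue and yield one example per text.
        raise NotImplementedError
```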

It may be interesting to create a loading script that loads all of the texts in the catalogue (possibly allowing the user to specify some filters defining what is collected from the catalogue).

This could work by starting from the JSON catalogue and then parsing each item's manifest for the things that might be useful to include for ML research.
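
As a rough sketch of that catalogue-first approach (the catalogue URL, its JSON shape, and the field names are guesses on my part, not the confirmed API):

```python
# Illustrative only: assumes the catalogue is a JSON list of items, each of
# which points at a per-carrel manifest listing its plain-text files.
import json
import urllib.request

CATALOGUE_URL = "http://library.distantreader.org/catalog.json"  # assumed


def iter_carrels(filter_fn=None):
    """Yield catalogue entries, optionally filtered (e.g. by subject or date)."""
    with urllib.request.urlopen(CATALOGUE_URL) as response:
        catalogue = json.load(response)
    for item in catalogue:
        if filter_fn is None or filter_fn(item):
            yield item


# e.g. keep only carrels whose (assumed) "title" field mentions Homer:
homer_carrels = list(iter_carrels(lambda i: "homer" in i.get("title", "").lower()))
```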

If so, then many of the examples in [1] allude to training sets, splits, and testing. To what degree is this required for me to denote? All of my data (plain text) files are saved in a single directory, and they are not divided into training and testing sets.

Since the data would most likely not be used for supervised training, it is okay to put everything into a single training split. This is often how other datasets for language modelling end up looking.
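
In practice, the user-facing side would then be as simple as this (the script name is carried over from the hypothetical sketch above):

```python
from datasets import load_dataset

# Hypothetical: everything lands in one "train" split, which is fine for
# language-modelling-style use where no held-out test set exists upstream.
ds = load_dataset("load_distant_reader")
print(ds)                    # DatasetDict with a single "train" split
texts = ds["train"]["text"]  # all carrel texts, no train/test division
```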
